Best tips for data scientists: the most common option settings in Python pandas

data science

Publish Date: 2022-08-06

pandas is the table tool that data scientists use most often; its DataFrames make it easy to display and analyze tabular data. When a table is poorly formatted, however, analysis becomes less efficient.

To do good work, you must first sharpen your tools. Here are some commonly used settings:

Show more rows
Show more columns
Change the column width
Set the precision of the float column
Format the display of numbers
Change the plotting backend
Configure the output of info()
Print out current settings and reset all options

1. Show more rows

By default, pandas limits its output so that it fits on the screen. If a table has more rows than display.max_rows (60 by default), pandas truncates the middle rows and displays only part of the table. When the data you want to view is not too long, you can raise the maximum number of displayed rows by setting display.max_rows, for example to 200, so that up to 200 rows are shown at once:

import pandas as pd
pd.set_option('display.max_rows', 200)
# or set like this
# pd.options.display.max_rows = 200

When the number of rows exceeds display.max_rows, display.min_rows determines how many rows appear in the truncated view. Because display.min_rows defaults to 10, a truncated table is usually shown as its first 5 rows and its last 5 rows.

In the same way, you can set the number of rows shown in the truncated view to suit your own habits, for example 20:

pd.set_option('display.min_rows', 20)
# pd.options.display.min_rows = 20
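To see how the two options interact, the following sketch builds a hypothetical 100-row frame and compares how many text lines the truncated view occupies under each display.min_rows value. The frame and the helper name count_lines are illustrative, not part of pandas; pd.option_context applies the options only inside the with block.

```python
import pandas as pd

# Hypothetical 100-row frame, longer than the max_rows limit used below.
df = pd.DataFrame({"value": range(100)})

def count_lines(min_rows):
    """Return the number of text lines in repr(df) for a given display.min_rows."""
    with pd.option_context("display.max_rows", 60, "display.min_rows", min_rows):
        return len(repr(df).splitlines())

lines_default = count_lines(10)  # truncated view: 5 head rows + 5 tail rows
lines_larger = count_lines(20)   # truncated view: 10 head rows + 10 tail rows

print(lines_default, lines_larger)
```

The second count should be larger by 10 lines, since 20 data rows are shown instead of 10.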

If you need to restore the default setting, you can reset the option like this:

# reset
pd.reset_option('display.max_rows')
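When a setting is needed for only a single display, pd.option_context is an alternative to setting the option and resetting it by hand: the option is changed inside the with block and restored on exit. A minimal sketch:

```python
import pandas as pd

before = pd.get_option("display.max_rows")

# The option is changed only inside the with block.
with pd.option_context("display.max_rows", 200):
    inside = pd.get_option("display.max_rows")

# On exit, the previous value is restored automatically.
after = pd.get_option("display.max_rows")

print(before, inside, after)
```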

2. Show more columns

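Columns can be configured in the same way as rows: display.max_columns controls the number of columns that can be displayed. The default value is 20 (when Python runs in a terminal, the default can instead be 0, which makes pandas detect the terminal width automatically).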

pd.get_option('display.max_columns')
# pd.options.display.max_columns
# 20
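The snippet above only reads the option. As a sketch of actually raising the limit, the example below uses a hypothetical 30-column frame (the column names c0 to c29 are made up for illustration) and compares a limit of 20 with None, which removes the limit on the number of columns.

```python
import pandas as pd

# Hypothetical wide frame with 30 columns named c0 ... c29.
wide = pd.DataFrame([list(range(30))], columns=[f"c{i}" for i in range(30)])

# With a limit of 20, the middle columns are replaced by "...".
with pd.option_context("display.max_columns", 20):
    truncated = repr(wide)

# With None, pandas places no limit on the number of columns shown.
with pd.option_context("display.max_columns", None):
    full = repr(wide)

print("c15" in truncated, "c15" in full)
```

To make the change permanent for the session, the same value can be passed to pd.set_option('display.max_columns', None).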

3. Change column width

pandas limits the number of characters displayed in a column; the default is 50 characters. A value longer than that is cut off and shown with an ellipsis. If you want to display the whole value, you can raise display.max_colwidth, for example to 500.

pd.set_option('display.max_colwidth', 500)
# pd.options.display.max_colwidth = 500
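The effect can be checked with a small sketch. The frame below, with a single 70-character string, is a made-up example; the two reprs are produced under the default width of 50 and under the wider setting of 500.

```python
import pandas as pd

# Hypothetical column holding one 70-character string.
text = "x" * 70
df = pd.DataFrame({"note": [text]})

with pd.option_context("display.max_colwidth", 50):
    short = repr(df)  # the value is cut off and ends with "..."

with pd.option_context("display.max_colwidth", 500):
    wide = repr(df)   # the whole value is shown

print(text in short, text in wide)
```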

4. Set the precision of the float column

For float data, pandas displays 6 decimal places by default. If we only ever need 2, we can set display.precision once in advance instead of rounding repeatedly later.

pd.set_option('display.precision', 2)
# pd.options.display.precision = 2

This setting does not affect the underlying data; it only changes how float columns are displayed.
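A short sketch makes the distinction between display and data concrete. The one-value frame is a made-up example: its repr is rendered with 2 decimal places, while the stored value is read back unchanged.

```python
import pandas as pd

df = pd.DataFrame({"x": [3.14159265]})

with pd.option_context("display.precision", 2):
    shown = repr(df)      # the frame is rendered with 2 decimal places

stored = df["x"].iloc[0]  # the stored value keeps its full precision

print(shown)
print(stored)
```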

5. Format the numbers

pandas has an option, display.float_format, that can be used to format any float column. It applies only to floating-point columns; columns of other data types must first be converted to float for the format to take effect.

Format large numbers with commas

Large numbers like 1200000 are hard to read, so we separate the thousands with commas.

pd.set_option('display.float_format', '{:,}'.format)
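As an illustration, the sketch below applies the comma format to a hypothetical frame with one float column and one integer column. Only the float column should pick up the separators, which matches the note above that the option applies to floats only.

```python
import pandas as pd

# Hypothetical frame: "amount" is float, "count" is integer.
df = pd.DataFrame({"amount": [1200000.0, 2500.5], "count": [1200000, 2500]})

with pd.option_context("display.float_format", "{:,}".format):
    formatted = repr(df)

print(formatted)
```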


Set the number precision

As with display.precision above, if we only care about 2 digits after the decimal point, we can format floats like this:

pd.set_option('display.float_format', '{:,.2f}'.format)

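Percent sign formatting

If a float column already holds percentage values, we can append a percent sign like this. Note that this format only appends the sign; it does not multiply the values by 100.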

pd.set_option('display.float_format', '{:.2f}%'.format)
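If the column holds fractions such as 0.125 rather than ready-made percentages, Python's own percent format type, '{:.2%}', may be a better fit, because it multiplies by 100 before appending the sign. The sketch below contrasts the two formats; the frame and its column name are made up for illustration.

```python
import pandas as pd

# '{:.2f}%' only appends a percent sign: the value is assumed to be a percentage already.
already_percent = "{:.2f}%".format(12.5)

# '{:.2%}' multiplies by 100 first, which suits fractions such as 0.125.
from_fraction = "{:.2%}".format(0.125)

# Hypothetical column of fractions displayed as percentages.
df = pd.DataFrame({"share": [0.125, 0.5]})
with pd.option_context("display.float_format", "{:.2%}".format):
    shown = repr(df)

print(already_percent, from_fraction)
print(shown)
```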

6. Change the drawing method

By default, pandas uses matplotlib as its plotting backend. Starting with version 0.25, pandas lets you choose a different backend provided by a third-party library such as plotly or bokeh, as long as that library is installed first.

The setup is simple: once the third-party library is installed, switching the backend takes a single line.

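import pandas as pd
import numpy as np
# Switch the plotting backend; this assumes the package that provides the
# 'altair' backend (altair_pandas, to our knowledge) is installed.
pd.set_option('plotting.backend', 'altair')
# Cumulative sum of 100 random values, i.e. a random walk
data = pd.Series(np.random.randn(100).cumsum())
data.plot()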

7. Configure the output of info()

In pandas, we often use info() to get a quick overview of a DataFrame. However, info() has a default limit on the number of columns it analyzes, and on large datasets counting the non-null values in every column can be very slow.

pandas provides two options:

display.max_info_columns: the maximum number of columns that info() analyzes one by one; the default is 100.
display.max_info_rows: the row threshold for null counting; info() counts non-null values only when the DataFrame has no more rows than this. The default is 1690785.

For example, when analyzing a dataset with 150 features, we can set display.max_info_columns to a value that covers all columns, such as setting it to 200:

pd.set_option('display.max_info_columns', 200)
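The following sketch shows the difference on a hypothetical frame with 150 columns. The column names f0 to f149 and the helper info_text are made up for illustration; df.info() accepts a buf argument, which is used here to capture its text.

```python
import io
import pandas as pd

# Hypothetical frame with 150 feature columns and a single row.
df = pd.DataFrame([[0] * 150], columns=[f"f{i}" for i in range(150)])

def info_text(max_cols):
    """Capture the text of df.info() under a given display.max_info_columns."""
    buf = io.StringIO()
    with pd.option_context("display.max_info_columns", max_cols):
        df.info(buf=buf)
    return buf.getvalue()

summary = info_text(100)   # 150 columns exceed the limit: only a short summary
detailed = info_text(200)  # the limit covers all columns: one line per column

print(len(summary.splitlines()), len(detailed.splitlines()))
```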

On large datasets, df.info() is slow because it counts the non-null values in every column. We can set display.max_info_rows to a small value to skip the counting, for example so that nulls are counted only when the DataFrame has no more than 5 rows:

pd.set_option('display.max_info_rows', 5)
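A minimal sketch of the effect, using a hypothetical 10-row frame with one missing value (the helper info_text is again illustrative): with a threshold of 100 the non-null counts appear in the output, and with a threshold of 5 they are left out.

```python
import io
import pandas as pd

# Hypothetical 10-row frame with one missing value.
df = pd.DataFrame({"a": [1.0, None] + [2.0] * 8})

def info_text(max_rows):
    """Capture the text of df.info() under a given display.max_info_rows."""
    buf = io.StringIO()
    with pd.option_context("display.max_info_rows", max_rows):
        df.info(buf=buf)
    return buf.getvalue()

with_counts = info_text(100)   # 10 rows <= 100: non-null counts are computed
without_counts = info_text(5)  # 10 rows > 5: the null check is skipped

print(with_counts)
print(without_counts)
```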

8. Print out current settings and reset all options

pd.describe_option() prints the description of every setting, along with its default and current value.

pd.describe_option()

You can also print only certain options, such as those related to rows.

# show only the options whose names match 'rows'
pd.describe_option('rows')

Finally, we can reset all options to their defaults at once.

pd.reset_option('all')

Summary

The options above are the most commonly used set_option settings, collected here for reference (xxx stands for a value of your choice). Note that display.float_format holds a single formatter, so of the three float_format lines only the last one set takes effect, and the final reset_option('all') undoes every change:

pd.set_option('display.max_rows', xxx) # Maximum number of display rows
pd.set_option('display.min_rows', xxx) # Number of rows shown in a truncated view
pd.set_option('display.max_columns', xxx) # Maximum number of display columns
pd.set_option('display.max_colwidth', xxx) # Maximum number of characters per column
pd.set_option('display.precision', 2) # Floating point precision
pd.set_option('display.float_format', '{:,}'.format) # Comma-separated numbers
pd.set_option('display.float_format', '{:,.2f}'.format) # Commas and 2 decimal places
pd.set_option('display.float_format', '{:.2f}%'.format) # Percent sign formatting
pd.set_option('plotting.backend', 'altair') # Change the plotting backend
pd.set_option('display.max_info_columns', 200) # Maximum number of columns analyzed by info()
pd.set_option('display.max_info_rows', 5) # Row threshold for null counting in info()
pd.describe_option() # Show all settings and their descriptions
pd.reset_option('all') # Reset all options to their defaults

robot learner

https://datasciencebyexample.github.io/2022/08/06/2022-08-06-1/

Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: robot learner.
