Introducing the DataFrame and Cleaning NaN values

Import and alias Pandas:

import pandas as pd

Load up the table from the link, and extract the dataset out of it. If you're having issues with this, look carefully at the sample code provided in the reading:

url='http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2018/seasontype/2'
df = pd.read_html(url, header=0)[0]

Next up, rename the columns so that they are similar to the column definitions provided to you on the website. Be careful and don't accidentally use any column names twice. If a column uses special characters, you can replace them with regular characters to make it easier to work with:

df.columns=['RK', 'PLAYER','TEAM','GP','G','A','PTS','+/-','PIM','PTS/G','SOG','PCT','GWG','\G','\A','+G','+A']

type(df)
df.head()

	RK	PLAYER	TEAM	GP	G	A	PTS	+/-	PIM	PTS/G	SOG	PCT	GWG	\G	\A	+G	+A
0	RK	PLAYER	TEAM	GP	G	A	PTS	+/-	PIM	PTS/G	SOG	PCT	GWG	G	A	G	A
1	1	Connor McDavid, C	EDM	82	41	67	108	20	26	1.32	274	15.0	7	5	15	1	3
2	2	Claude Giroux, LW	PHI	82	34	68	102	28	20	1.24	193	17.6	1	9	27	0	0
3	3	Nikita Kucherov, RW	TB	80	39	61	100	15	42	1.25	279	14.0	7	8	28	0	0
4	4	Evgeni Malkin, C	PIT	78	42	56	98	16	87	1.26	239	17.6	7	14	24	0	0

Convert argument to a numeric type.

If ‘raise’, then invalid parsing will raise an exception
If ‘coerce’, then invalid parsing will be set as NaN
If ‘ignore’, then invalid parsing will return the inpu

df['GP']= pd.to_numeric(df.GP, errors='coerce')
df['G']= pd.to_numeric(df.GP, errors='coerce')
df['A']= pd.to_numeric(df.GP, errors='coerce')
df['PTS']= pd.to_numeric(df.GP, errors='coerce')

Detect missing values (NaN in numeric arrays, None/NaN in object arrays)

selector=df.GP.isnull() & df.G.isnull()& df.A.isnull() & df.PTS.isnull()

Locate the NaN data using the index

#selector
bad_rows=df[selector].index

Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names

df.drop(bad_rows, inplace=True)

Delete the 'RK' column:

del df['RK']
df.head()

	PLAYER	TEAM	GP	G	A	PTS	+/-	PIM	PTS/G	SOG	PCT	GWG	\G	\A	+G	+A
1	Connor McDavid, C	EDM	82.0	82.0	82.0	82.0	20	26	1.32	274	15.0	7	5	15	1	3
2	Claude Giroux, LW	PHI	82.0	82.0	82.0	82.0	28	20	1.24	193	17.6	1	9	27	0	0
3	Nikita Kucherov, RW	TB	80.0	80.0	80.0	80.0	15	42	1.25	279	14.0	7	8	28	0	0
4	Evgeni Malkin, C	PIT	78.0	78.0	78.0	78.0	16	87	1.26	239	17.6	7	14	24	0	0
5	Nathan MacKinnon, C	COL	74.0	74.0	74.0	74.0	11	55	1.31	284	13.7	12	12	20	0	1

Make sure there are no holes in your index by resetting it. There is an example of this in the reading material. By the way, drop the original index.

df.reset_index()
df.head()

	PLAYER	TEAM	GP	G	A	PTS	+/-	PIM	PTS/G	SOG	PCT	GWG	\G	\A	+G	+A
1	Connor McDavid, C	EDM	82.0	82.0	82.0	82.0	20	26	1.32	274	15.0	7	5	15	1	3
2	Claude Giroux, LW	PHI	82.0	82.0	82.0	82.0	28	20	1.24	193	17.6	1	9	27	0	0
3	Nikita Kucherov, RW	TB	80.0	80.0	80.0	80.0	15	42	1.25	279	14.0	7	8	28	0	0
4	Evgeni Malkin, C	PIT	78.0	78.0	78.0	78.0	16	87	1.26	239	17.6	7	14	24	0	0
5	Nathan MacKinnon, C	COL	74.0	74.0	74.0	74.0	11	55	1.31	284	13.7	12	12	20	0	1

Do the data-types of each column reflect the values you see when you look through the data using a text editor / spread sheet program? If you see object where you expect to see int32 or float64, that is a good indicator that there might be a string or missing value or erroneous value in the column.

Check the data type of all columns, and ensure those that should be numeric are numeric.

df.dtypes

    PLAYER     object
    TEAM       object
    GP        float64
    G         float64
    A         float64
    PTS       float64
    +/-        object
    PIM        object
    PTS/G      object
    SOG        object
    PCT        object
    GWG        object
    \G         object
    \A         object
    +G         object
    +A         object
    dtype: object

Convert argument to a numeric type

df['+/-']= pd.to_numeric(df['+/-'], errors='coerce')
df['PIM']= pd.to_numeric(df.PIM, errors='coerce')
df['PTS/G']= pd.to_numeric(df['PTS/G'], errors='coerce')
df['SOG']= pd.to_numeric(df.SOG, errors='coerce')
df['PCT']= pd.to_numeric(df.PCT, errors='coerce')
df['GWG']= pd.to_numeric(df['GWG'], errors='coerce')
df['\G']= pd.to_numeric(df['\G'], errors='coerce')
df['\A']= pd.to_numeric(df['\A'], errors='coerce')
df['+G']= pd.to_numeric(df['+G'], errors='coerce')
df['+A']= pd.to_numeric(df['+A'], errors='coerce')
df.head()

	PLAYER	TEAM	GP	G	A	PTS	+/-	PIM	PTS/G	SOG	PCT	GWG	\G	\A	+G	+A
1	Connor McDavid, C	EDM	82.0	82.0	82.0	82.0	20	26	1.32	274	15.0	7	5	15	1	3
2	Claude Giroux, LW	PHI	82.0	82.0	82.0	82.0	28	20	1.24	193	17.6	1	9	27	0	0
3	Nikita Kucherov, RW	TB	80.0	80.0	80.0	80.0	15	42	1.25	279	14.0	7	8	28	0	0
4	Evgeni Malkin, C	PIT	78.0	78.0	78.0	78.0	16	87	1.26	239	17.6	7	14	24	0	0
5	Nathan MacKinnon, C	COL	74.0	74.0	74.0	74.0	11	55	1.31	284	13.7	12	12	20	0	1

df.dtypes

    PLAYER     object
    TEAM       object
    GP        float64
    G         float64
    A         float64
    PTS       float64
    +/-         int64
    PIM         int64
    PTS/G     float64
    SOG         int64
    PCT       float64
    GWG         int64
    \G          int64
    \A          int64
    +G          int64
    +A          int64
    dtype: object

Try use your_data_frame['your_column'].unique() or equally, your_data_frame.your_column.unique() to see the unique values of each column and identify the rogue values.

If you find any value that should be properly encoded to NaNs, you can convert them either using the na_values parameter when loading the dataframe. Or alternatively, use one of the other methods discussed in the reading.

df.PCT.unique()

    array([15. , 17.6, 14. , 13.7, 17.5, 13. ,  9.4, 11.7, 13.8, 12.7, 18.7,
           12.9,  8.8, 14.4, 16.3, 10.6, 11.8, 14.2, 14.9, 23.4, 11.9, 10.5,
           17.4,  9.5, 10.8, 10.1, 12.4, 18.3, 13.3, 14.8, 13.2, 11.3,  6.3])

Figure out which indexing method you would need to use in order to index your dataframe with: [2:4, 'col3']. Finally, display the results:

df.iloc[2:4, [3]]

	G
3	80.0
4	78.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introducing_DataFrame_and_Cleaning_NaN_values.md

Introducing_DataFrame_and_Cleaning_NaN_values.md

Introducing the DataFrame and Cleaning NaN values

Files

Introducing_DataFrame_and_Cleaning_NaN_values.md

Latest commit

History

Introducing_DataFrame_and_Cleaning_NaN_values.md

File metadata and controls

Introducing the DataFrame and Cleaning NaN values