Skip to content

Latest commit

 

History

History
698 lines (639 loc) · 13.5 KB

Introducing_DataFrame_and_Cleaning_NaN_values.md

File metadata and controls

698 lines (639 loc) · 13.5 KB

Introducing the DataFrame and Cleaning NaN values

Import and alias Pandas:

import pandas as pd

Load up the table from the link, and extract the dataset out of it. If you're having issues with this, look carefully at the sample code provided in the reading:

url='http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2018/seasontype/2'
df = pd.read_html(url, header=0)[0]

Next up, rename the columns so that they are similar to the column definitions provided to you on the website. Be careful and don't accidentally use any column names twice. If a column uses special characters, you can replace them with regular characters to make it easier to work with:

df.columns=['RK', 'PLAYER','TEAM','GP','G','A','PTS','+/-','PIM','PTS/G','SOG','PCT','GWG','\G','\A','+G','+A']
type(df)
df.head()
RK PLAYER TEAM GP G A PTS +/- PIM PTS/G SOG PCT GWG \G \A +G +A
0 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G SOG PCT GWG G A G A
1 1 Connor McDavid, C EDM 82 41 67 108 20 26 1.32 274 15.0 7 5 15 1 3
2 2 Claude Giroux, LW PHI 82 34 68 102 28 20 1.24 193 17.6 1 9 27 0 0
3 3 Nikita Kucherov, RW TB 80 39 61 100 15 42 1.25 279 14.0 7 8 28 0 0
4 4 Evgeni Malkin, C PIT 78 42 56 98 16 87 1.26 239 17.6 7 14 24 0 0

Convert argument to a numeric type.

  • If ‘raise’, then invalid parsing will raise an exception
  • If ‘coerce’, then invalid parsing will be set as NaN
  • If ‘ignore’, then invalid parsing will return the inpu
df['GP']= pd.to_numeric(df.GP, errors='coerce')
df['G']= pd.to_numeric(df.GP, errors='coerce')
df['A']= pd.to_numeric(df.GP, errors='coerce')
df['PTS']= pd.to_numeric(df.GP, errors='coerce')

Detect missing values (NaN in numeric arrays, None/NaN in object arrays)

selector=df.GP.isnull() & df.G.isnull()& df.A.isnull() & df.PTS.isnull()

Locate the NaN data using the index

#selector
bad_rows=df[selector].index

Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names

df.drop(bad_rows, inplace=True)

Delete the 'RK' column:

del df['RK']
df.head()
PLAYER TEAM GP G A PTS +/- PIM PTS/G SOG PCT GWG \G \A +G +A
1 Connor McDavid, C EDM 82.0 82.0 82.0 82.0 20 26 1.32 274 15.0 7 5 15 1 3
2 Claude Giroux, LW PHI 82.0 82.0 82.0 82.0 28 20 1.24 193 17.6 1 9 27 0 0
3 Nikita Kucherov, RW TB 80.0 80.0 80.0 80.0 15 42 1.25 279 14.0 7 8 28 0 0
4 Evgeni Malkin, C PIT 78.0 78.0 78.0 78.0 16 87 1.26 239 17.6 7 14 24 0 0
5 Nathan MacKinnon, C COL 74.0 74.0 74.0 74.0 11 55 1.31 284 13.7 12 12 20 0 1

Make sure there are no holes in your index by resetting it. There is an example of this in the reading material. By the way, drop the original index.

df.reset_index()
df.head()
PLAYER TEAM GP G A PTS +/- PIM PTS/G SOG PCT GWG \G \A +G +A
1 Connor McDavid, C EDM 82.0 82.0 82.0 82.0 20 26 1.32 274 15.0 7 5 15 1 3
2 Claude Giroux, LW PHI 82.0 82.0 82.0 82.0 28 20 1.24 193 17.6 1 9 27 0 0
3 Nikita Kucherov, RW TB 80.0 80.0 80.0 80.0 15 42 1.25 279 14.0 7 8 28 0 0
4 Evgeni Malkin, C PIT 78.0 78.0 78.0 78.0 16 87 1.26 239 17.6 7 14 24 0 0
5 Nathan MacKinnon, C COL 74.0 74.0 74.0 74.0 11 55 1.31 284 13.7 12 12 20 0 1

Do the data-types of each column reflect the values you see when you look through the data using a text editor / spread sheet program? If you see object where you expect to see int32 or float64, that is a good indicator that there might be a string or missing value or erroneous value in the column.

Check the data type of all columns, and ensure those that should be numeric are numeric.

df.dtypes
    PLAYER     object
    TEAM       object
    GP        float64
    G         float64
    A         float64
    PTS       float64
    +/-        object
    PIM        object
    PTS/G      object
    SOG        object
    PCT        object
    GWG        object
    \G         object
    \A         object
    +G         object
    +A         object
    dtype: object

Convert argument to a numeric type

df['+/-']= pd.to_numeric(df['+/-'], errors='coerce')
df['PIM']= pd.to_numeric(df.PIM, errors='coerce')
df['PTS/G']= pd.to_numeric(df['PTS/G'], errors='coerce')
df['SOG']= pd.to_numeric(df.SOG, errors='coerce')
df['PCT']= pd.to_numeric(df.PCT, errors='coerce')
df['GWG']= pd.to_numeric(df['GWG'], errors='coerce')
df['\G']= pd.to_numeric(df['\G'], errors='coerce')
df['\A']= pd.to_numeric(df['\A'], errors='coerce')
df['+G']= pd.to_numeric(df['+G'], errors='coerce')
df['+A']= pd.to_numeric(df['+A'], errors='coerce')
df.head()
PLAYER TEAM GP G A PTS +/- PIM PTS/G SOG PCT GWG \G \A +G +A
1 Connor McDavid, C EDM 82.0 82.0 82.0 82.0 20 26 1.32 274 15.0 7 5 15 1 3
2 Claude Giroux, LW PHI 82.0 82.0 82.0 82.0 28 20 1.24 193 17.6 1 9 27 0 0
3 Nikita Kucherov, RW TB 80.0 80.0 80.0 80.0 15 42 1.25 279 14.0 7 8 28 0 0
4 Evgeni Malkin, C PIT 78.0 78.0 78.0 78.0 16 87 1.26 239 17.6 7 14 24 0 0
5 Nathan MacKinnon, C COL 74.0 74.0 74.0 74.0 11 55 1.31 284 13.7 12 12 20 0 1
df.dtypes
    PLAYER     object
    TEAM       object
    GP        float64
    G         float64
    A         float64
    PTS       float64
    +/-         int64
    PIM         int64
    PTS/G     float64
    SOG         int64
    PCT       float64
    GWG         int64
    \G          int64
    \A          int64
    +G          int64
    +A          int64
    dtype: object

Try use your_data_frame['your_column'].unique() or equally, your_data_frame.your_column.unique() to see the unique values of each column and identify the rogue values.

If you find any value that should be properly encoded to NaNs, you can convert them either using the na_values parameter when loading the dataframe. Or alternatively, use one of the other methods discussed in the reading.

df.PCT.unique()
    array([15. , 17.6, 14. , 13.7, 17.5, 13. ,  9.4, 11.7, 13.8, 12.7, 18.7,
           12.9,  8.8, 14.4, 16.3, 10.6, 11.8, 14.2, 14.9, 23.4, 11.9, 10.5,
           17.4,  9.5, 10.8, 10.1, 12.4, 18.3, 13.3, 14.8, 13.2, 11.3,  6.3])

Figure out which indexing method you would need to use in order to index your dataframe with: [2:4, 'col3']. Finally, display the results:

df.iloc[2:4, [3]]
G
3 80.0
4 78.0