Tries to shrink your Pandas column dtypes with no data loss so you have more spare RAM

noklam/dtype_diet

dtype_diet

Attempt to shrink Pandas dtypes without losing data so you have more RAM (and maybe more speed)

Install

pip install dtype_diet

Documentation

https://noklam.github.io/dtype_diet/

How to use

This is a fork of https://github.com/ianozsvald/dtype_diet, created to continue supporting and developing the library with approval from the original author @ianozsvald.

This tool checks each column to see if larger dtypes (e.g. 8-byte float64 and int64) could be shrunk to smaller dtypes without causing any data loss. Each step down from an 8-byte type to a 4-, 2-, or 1-byte type halves the RAM requirement for that column. Categoricals are proposed for object columns, which can bring significant speed and RAM benefits.
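To make the halving concrete, here is a minimal sketch of a range check for integer columns. It is an illustration only, not the library's implementation, and the helper name `smallest_int_dtype` is hypothetical. It assumes a column is safe to shrink when its minimum and maximum both fit in the smaller type.

```python
import numpy as np
import pandas as pd


def smallest_int_dtype(series: pd.Series) -> str:
    """Hypothetical helper: name of the narrowest signed int dtype that holds every value."""
    lo, hi = series.min(), series.max()
    for name in ("int8", "int16", "int32", "int64"):
        info = np.iinfo(name)
        if info.min <= lo and hi <= info.max:
            return name
    return str(series.dtype)


s = pd.Series(np.arange(1000, dtype="int64"))  # values 0..999
target = smallest_int_dtype(s)                 # 999 does not fit in int8, so int16
shrunk = s.astype(target)

# index=False counts only the column's values, not the index
before = s.memory_usage(index=False)
after = shrunk.memory_usage(index=False)
print(f"{s.dtype} -> {target}: {before} bytes -> {after} bytes")
```

Going from 8 bytes to 2 bytes per value cuts this column to a quarter of its original size, and every value survives the conversion unchanged.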

Here's a minimal example, three lines of dtype_diet code running on a Kaggle dataset, that reduces memory usage from 957 MB to 85 MB. You can find the notebook in the repository:

#slow
# sell_prices.csv.zip 
# Source data: https://www.kaggle.com/c/m5-forecasting-uncertainty/
import pandas as pd
from dtype_diet import report_on_dataframe, optimize_dtypes
df = pd.read_csv('data/sell_prices.csv')
proposed_df = report_on_dataframe(df, unit="MB")
new_df = optimize_dtypes(df, proposed_df)
print(f'Original df memory: {df.memory_usage(deep=True).sum()/1024/1024} MB')
print(f'Proposed df memory: {new_df.memory_usage(deep=True).sum()/1024/1024} MB')
Original df memory: 957.5197134017944 MB
Proposed df memory: 85.09655094146729 MB
#slow
proposed_df
Column      Current dtype  Proposed dtype  Current Memory (MB)  Proposed Memory (MB)  Ram Usage Improvement (MB)  Ram Usage Improvement (%)
store_id    object         category        203763.920410        3340.907715           200423.012695               98.360403
item_id     object         category        233039.977539        6824.677734           226215.299805               97.071456
wm_yr_wk    int64          int16           26723.191406         6680.844727           20042.346680                74.999825
sell_price  float64        None            26723.191406         NaN                   NaN                         NaN

Recommendations:

  • Run report_on_dataframe(your_df) to get recommendations
  • Run optimize_dtypes(df, proposed_df) to convert to the recommended dtypes.
  • Consider if Categoricals will save you RAM (see Caveats below)
  • Consider if f32 or f16 will be useful (see Caveats - f32 is probably a reasonable choice unless you have huge ranges of floats)
  • Consider if int32, int16, int8 will be useful (see Caveats - overflow may be an issue)
  • Look at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html which recommends Pandas nullable dtype alternatives (e.g. to avoid promoting an int64 with NaN items to float64, instead you get Int64 with missing values stored as pd.NA and no data loss)
  • Look at Extension arrays like https://github.com/JDASoftwareGroup/rle-array (thanks @repererum for the tweet)

Look at report_on_dataframe(your_df) to get a printed report - no changes are made to your dataframe.
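As a small illustration of the nullable-dtype recommendation above, the sketch below uses only pandas' own `convert_dtypes` (dtype_diet itself is not involved). It shows a column of whole numbers with one missing value, which pandas stores as float64 by default, being converted to the nullable `Int64` dtype.

```python
import numpy as np
import pandas as pd

# Whole numbers plus one missing value: pandas falls back to float64
ints_with_gap = pd.Series([1, 2, np.nan])
original_dtype = str(ints_with_gap.dtype)

# convert_dtypes picks the nullable integer dtype and keeps the gap as <NA>
nullable = ints_with_gap.convert_dtypes()
nullable_dtype = str(nullable.dtype)
missing_count = int(nullable.isna().sum())

print(original_dtype, "->", nullable_dtype, "| missing values:", missing_count)
```

The missing entry is preserved, and the remaining values are stored as integers rather than floats.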

Caveats

  • reduced numeric ranges might lead to overflow (TODO document)
  • category dtype can have unexpected effects e.g. need for observed=True in groupby (TODO document)
  • f16 is likely to be simulated on modern hardware so calculations will be 2-3x slower than on f32 or f64
  • we could do with a link that explains binary representation of float & int for those wanting to learn more
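The first two caveats can be sketched with plain pandas. This is a hedged illustration with made-up data (the column names `store` and `sales` are placeholders), not output from dtype_diet. The first part shows that int8 arithmetic wraps around silently once a result exceeds the type's range, and the second shows how `observed` changes a groupby on a categorical column.

```python
import pandas as pd

# Caveat 1: a shrunk integer column can overflow in later arithmetic.
# int8 holds -128..127, so 100 + 100 wraps around instead of giving 200.
small = pd.Series([100], dtype="int8")
total = small + small
wrapped_value = int(total.iloc[0])
print("100 + 100 as int8:", wrapped_value)

# Caveat 2: a categorical column remembers categories that never occur.
df = pd.DataFrame({
    "store": pd.Categorical(["a", "a"], categories=["a", "b"]),
    "sales": [1, 2],
})
all_groups = df.groupby("store", observed=False)["sales"].sum()  # includes unused 'b'
seen_groups = df.groupby("store", observed=True)["sales"].sum()  # only 'a'
print(len(all_groups), "groups vs", len(seen_groups), "group")
```

Passing `observed` explicitly is a reasonable habit, since the two settings give different result shapes on the same data.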

Development

Contributors

Local Setup

$ conda create -n dtype_diet python=3.8 pandas jupyter pyarrow pytest
$ conda activate dtype_diet

Release

make release

Contributing

The repository is developed with nbdev, a system for developing libraries with notebooks.

Make sure you run the following command if you want to contribute to the library. For details, please refer to the nbdev documentation (https://github.com/fastai/nbdev).

nbdev_install_git_hooks

Some other useful commands

nbdev_build_docs
nbdev_build_lib
nbdev_test_nbs
