Polars DataFrame PlayGround

In this post I quickly covered what I view as the limitations of the Pandas library:

High memory usage
Limited multi-core algorithms
No ability to execute SQL statements (like SparkSQL & DataFrame)
No query planning/lazy-execution
NULL values only exist for floats not ints (this changed in Pandas 1.0+)
Using strings is inefficient (this too changed in Pandas 1.0+)
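
The last two points are easy to see in a few lines. The snippet below is a minimal sketch, assuming Pandas 1.0 or later is installed; the series names are made up for illustration. It shows that a missing value pushes an integer column to floats by default, while the opt-in nullable `Int64` and `string` dtypes keep the original type.

```python
import pandas as pd

# Default behaviour: a missing value forces the integers to become floats.
default_ints = pd.Series([1, 2, None])

# Opt-in nullable integer dtype (note the capital "I"): values stay integers
# and the gap is stored as pd.NA.
nullable_ints = pd.Series([1, 2, None], dtype="Int64")

# Opt-in dedicated string dtype instead of generic Python objects.
nullable_strings = pd.Series(["a", None], dtype="string")

print(default_ints.dtype)
print(nullable_ints.dtype)
print(nullable_strings.dtype)
```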

Many of these issues have been addressed by the Pandas 2.0 release, but I still feel the API is awkward!

So in this post I go over two alternatives:

Polars for dataframes
DuckDB for SQL queries

I cover how to get set up with JupyterLab using Docker on AWS, as well as some basics of Polars and DuckDB and how to use the two in combination. The benefits of Polars are that:

It allows for fast parallel querying on dataframes.
It uses Apache Arrow for backend datatypes, making it memory efficient.
It has both lazy and eager execution modes.
It allows for SQL queries directly on dataframes.
Its API is similar to Spark's API and allows for highly readable queries using method chaining.

DuckDB is a blazingly fast OLAP SQL query engine. In the context of this blog post I cover how to use it to run SQL queries against Pandas/Polars dataframes and even local Parquet files!

Using The Notebook

You can install the dependencies and access the notebook using Docker by building the Docker image with the following:

docker build -t polars_nb .

Then run the container with:

docker run -ip 8888:8888 -v `pwd`:/home/jovyan -t polars_nb

See here for more info.

Otherwise, without Docker, make sure to use Python 3.10 and install the libraries listed in requirements.txt. These can be installed with the command:

pip install -r requirements.txt
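
After installing, a quick sanity check can confirm the environment before opening the notebook. This is only a sketch: the package names below are an assumption based on the libraries discussed above, not a copy of requirements.txt, and `installed_version` is a helper written for this example.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Assumed package names; adjust to match requirements.txt.
packages = ("polars", "duckdb", "pandas")
report = {name: installed_version(name) for name in packages}

print("Python", ".".join(str(part) for part in sys.version_info[:3]))
for name, version in report.items():
    print(name, version if version is not None else "NOT INSTALLED")
```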

mdh266/PolarsDuckDBPlayGround

Folders and files

Latest commit

History

Repository files navigation

Polars DataFrame PlayGround

Using The Notebook

About

Topics

Resources

License

Stars

Watchers

Forks

Languages