Skip to content

mdh266/PolarsDuckDBPlayGround

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Polars DataFrame PlayGround


In this post I quickly covered what I view as the limitations in Pandas library:

  1. High memory usage
  2. Limited multi-core algorithms
  3. No ability to execute SQL statements (like SparkSQL & DataFrame)
  4. No query planning/lazy-execution
  5. NULL values only exist for floats not ints (this changed in Pandas 1.0+)
  6. Using strings is inefficient (this too changed in Pandas 1.0+)

Many of these issues have been addressed by the Pandas 2.0 release, but I still feel the API is awkward!

So in this post I go over two alternatives:

I cover how to get set up in with Juptyer lab using Docker on AWS as well as some basics of Polars, DuckDB and how to use the two in combination. The benefits of Polars is that,

  • It allows for fast parallel querying on dataframes.
  • It uses Apache Arrow for backend datatypes making it efficient for memory.
  • It has both lazy and eager execution mode.
  • It allows for SQL queries direcly on dataframes.
  • Its API is similar to Spark's API and allows for highly readable queries using method chaining.

DuckDB is a blazingly fast OLAP SQL query engine. In the context of this blog post I cover how to use it to run SQL queries against Pandas/Polars dataframes and even local Parquet files!

Using The Notebook


You can install the dependencies and access the notebook using Docker by building the Docker image with the following:

docker build -t polars_nb .

Followed by running the command container:

docker run -ip 8888:8888 -v `pwd`:/home/jovyan -t polars_nb

See here for more info.

Otherwise without Docker, make sure to use Python 3.10 and install the libraries listed in requirements.txt. These can be installed with the command,

pip install -r requirements.txt