
GSOC 2021 Application RADIS Gagan Aryan: Reduce Memory Usage

Gagan Aryan edited this page May 3, 2021 · 1 revision

SYNOPSIS

RADIS [1] is a fast line-by-line code for synthesizing and fitting infrared absorption and emission spectra, such as those encountered in laboratory plasmas or exoplanet atmospheres. In the line-by-line approach, each absorption or emission line of a molecule is computed individually, accounting for the influence of environmental parameters.

Even though the current implementation can compute high-temperature spectra of millions of lines in just a few seconds, it is quite memory-hungry, and the RAM available on a typical machine is limited. To compute high-temperature infrared spectra from databases of tens of millions of lines, it is essential to optimize memory consumption so that reasonably large databases can be processed on a regular machine (say, one with 8-16 GB of RAM).

RADIS currently uses Pandas DataFrames to handle all of its databases. In the words of Wes McKinney (the creator of pandas), "pandas rule of thumb: have 5 to 10 times as much RAM as the size of the dataset" [3]. This makes it impossible to load, say, a 5 GB database on a machine with 16 GB of RAM.

The goal of this project would be first to reduce the memory usage of the current calculations. Then, we replace pandas with libraries that are better suited for handling larger-than-memory databases, which would make it possible to compute spectral databases of up to billions of lines (of the scale of hundreds of GB or terabytes).

THE PROJECT

Aim

The aim of this project is to equip RADIS with the ability to compute spectral databases of up to billions of lines.

Deliverables

First Evaluation

  • #118 contains a checklist of several improvements that can be made to the existing pandas DataFrame handling. Implement and merge those changes.
  • Merge the code that tracks changes in the codebase's memory consumption into radis-benchmark.

Second Evaluation

  • Proof-of-concept with Vaex / Dask.
  • Code, tests, and documentation for the new library.

Implementation Details

The implementation of the project can be divided into two phases: first, improvements to the existing Pandas core, and second, the addition of an alternative core alongside it.

Improvements in the existing core

Pandas performance improves significantly when the number of loaded columns is reduced [3]. This can be achieved in two ways.

  1. Check for parameters that are related to each other and refrain from loading all of them.

    Currently, linestrengths ("int" in HITRAN/HITEMP) and Einstein emission coefficients are both loaded into memory. These two parameters, though different, are related, so we do not have to load both: we can load just one and recompute the other when needed.

  2. Avoid loading columns that are not required for calculations.

    Track down the columns that are never read during SpectrumFactory calculations and avoid loading them. In addition, some calculations require more parameters than others, and since the database is loaded at runtime, before we know which calculation will be performed, we cannot always identify the exact set of required columns in advance. Implement on-demand loading of these additional columns at runtime.

    For instance -

    1. Non-equilibrium calculations require more columns than equilibrium calculations.
    2. We need all the energy levels to compute the populations first, but there is no need to read the broadening coefficients before entering the broadening step.
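As an illustration of the first point, the HITRAN linestrength at a reference temperature and the Einstein A coefficient are related by a standard closed-form expression, so one can be recomputed from the other on demand instead of keeping both columns in memory. A minimal sketch (the function names are mine, the isotopologue abundance factor is omitted, and the partition function Q is assumed given):

```python
import math

C2 = 1.4387769          # second radiation constant hc/k [cm.K]
C_CM = 2.99792458e10    # speed of light [cm/s]

def A_from_linestrength(S, nu, g_up, E_low, T=296.0, Q=1.0):
    """Einstein A coefficient [1/s] from a linestrength S [cm-1/(molec.cm-2)].

    nu: transition wavenumber [cm-1], g_up: upper-state degeneracy,
    E_low: lower-state energy [cm-1], Q: partition function at T.
    """
    boltz = math.exp(-C2 * E_low / T)
    emiss = 1.0 - math.exp(-C2 * nu / T)
    return S * 8.0 * math.pi * C_CM * nu**2 * Q / (g_up * boltz * emiss)

def linestrength_from_A(A, nu, g_up, E_low, T=296.0, Q=1.0):
    """Exact inverse of A_from_linestrength: recompute S from A."""
    boltz = math.exp(-C2 * E_low / T)
    emiss = 1.0 - math.exp(-C2 * nu / T)
    return A * g_up * boltz * emiss / (8.0 * math.pi * C_CM * nu**2 * Q)

# round trip: dropping one of the two columns loses no information
S0 = 1.3e-21
A = A_from_linestrength(S0, nu=2100.0, g_up=5, E_low=100.0)
S1 = linestrength_from_A(A, nu=2100.0, g_up=5, E_low=100.0)
assert abs(S1 - S0) / S0 < 1e-12
```

Since the two functions are exact inverses, recomputation trades a small amount of CPU time for the memory of an entire float column.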

Replace the current core

After optimising the existing core, I will survey the best alternatives capable of reading databases on the order of tens of gigabytes. I will first set up proofs of concept with these libraries. In discussion with the mentors, we will then decide which one best fits our requirements.

Later on, we will decide whether to keep Pandas, replace it entirely, or run the two libraries side by side. Proper memory benchmarks will be set up, with detailed documentation. This part is fairly exploratory in nature.

Timeline

Due to COVID-19, our classes are currently running online and it is quite unlikely that there will be a summer term. Fortunately, the vacation overlaps with the GSoC period, so I have plenty of time.

Community Bonding Period - (May 17 - June 7)

I do not want to create a week-wise schedule for this period, since its main purpose is to become a part of the community. Nonetheless, these are some of the things I hope to do:

  • Engage with the community. Understand the use cases and the motivation of spectroscopy users.

  • Go through the radis papers [1] and [2].

  • Complete Spectroscopy 101 and 102 (if not done already by that time).

  • Become familiar with the tests and the memory performance benchmarks.

  • Get used to the RADIS architecture.

  • Ask a lot of questions and maybe resolve a few bugs as well :D

Coding Begins

June 7 - June 14

  • Figure out all the optimizations that can be made with respect to loading columns, as well as changing the datatypes of a few (to categorical, for instance). This period will also be used to further deepen my understanding of the RADIS architecture.

  • I will investigate base.py, calc.py, and factory.py, that is, most of the files of the radis/lbl module. By the end of this week, I aim to prepare a list of the columns I plan to drop, the columns whose loading can be avoided, and the columns whose dtype can be changed to categorical.

  • I will also implement a few of these planned changes.
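The dtype change above can be illustrated with a toy example (the column names mirror HITRAN-style fields, but the data is random, so this is only a sketch of the expected effect):

```python
import numpy as np
import pandas as pd

n = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "wav": rng.uniform(2000.0, 2500.0, n),      # line position [cm-1]
    "int": rng.uniform(1e-25, 1e-20, n),        # linestrength
    "iso": rng.choice(["1", "2", "3"], n),      # isotopologue label
})

before = df.memory_usage(deep=True).sum()
# a low-cardinality string column stored as 'category' keeps one copy
# of each label plus small integer codes, instead of n Python strings
df["iso"] = df["iso"].astype("category")
after = df.memory_usage(deep=True).sum()

assert after < before
print(f"memory: {before/1e6:.1f} MB -> {after/1e6:.1f} MB")
```

The same `memory_usage(deep=True)` call is what I plan to use to quantify each change.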

June 14 - June 21

  • Implement the remaining planned changes in the codebase.

  • Set up memory performance benchmarks to track the effect of the changes made in the code.

June 21 - July 5

  • I am not familiar with .h5 files, so I have set aside two weeks for the task of replacing the buffer database with a .h5 file.

  • I will begin with understanding the currently proposed implementation for this task laid out in #176.

July 5 - July 12

  • After implementing the .h5 file and resolving #176, I hope to have a good understanding of how RADIS works and how its files interact. If this new knowledge suggests further improvements, I will implement them as well.

  • I will avoid adding a new column for Qgas and instead read its values from a dictionary.

  • Even though I expect a considerable improvement in memory usage, we may face a trade-off in terms of speed. I will discuss these issues with the mentors and, if they arise, try to identify an optimal approach to tackle them.

First Evaluation - (July 12 - July 16)

July 16 - July 30

  • In order to choose the library best suited for a partial replacement of pandas, I will first define criteria for comparison. A few I can think of at this point: as little "bloat" as possible (a 4-byte type such as a float should really take 4 bytes), minimal loading time, minimal memory usage, and minimal conversion time from raw data to the library's data format.

  • I will take HITEMP CH4 as a reference case and compare which of the libraries performs best against these criteria. (This database has already been tested with the current core; see the #118 crash test case.)

July 30 - August 16

  • Decide on the portion of Pandas that is to be replaced with the new dataframe library.

  • Integrate Vaex / Dask into the code.
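Both Vaex and Dask avoid materializing the whole table in RAM. The underlying idea can be illustrated with plain pandas chunked reading (a sketch only: Vaex and Dask do this lazily and transparently, whereas here the chunking is explicit):

```python
import io
import math
import pandas as pd

# toy "database"; in practice this would be a multi-GB HITEMP file on disk
csv = "wav,int\n" + "\n".join(f"{2000 + 0.01*i:.2f},1.0e-22" for i in range(1000))

# process the file in fixed-size chunks, so only one chunk is in RAM at a time
chunked_sum = 0.0
for chunk in pd.read_csv(io.StringIO(csv), chunksize=100):
    chunked_sum += chunk["int"].sum()

# same reduction computed with everything in memory at once
full_sum = pd.read_csv(io.StringIO(csv))["int"].sum()
assert math.isclose(chunked_sum, full_sum, rel_tol=1e-9)
```

Any reduction that can be expressed per-chunk like this scales to databases far larger than RAM, which is exactly what the new core should provide without manual chunk management.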

August 16 - August 23

Prepare for the final evaluation. Clean up and tie loose ends (bug fixes and documenting any remaining undocumented work).

Final Evaluation - (August 24 - August 31)

ABOUT ME

Background

I am a sophomore undergraduate at IIT Kanpur, pursuing a Bachelor of Technology in Computer Science and Engineering. I spent most of my teenage years either reading fiction or solving hard math problems, so I was only exposed to programming in my first year of college, in the Programming Fundamentals course (which I aced xD). I started contributing to open source last summer, working on a cryptographically secure messaging application, and was instantly drawn to it. I especially liked how efficient things are: the idea of communicating your thoughts and ideas with different people and turning them into code excites me. After contributing for 2-3 months, I was invited to build a startup with the maintainer of the same repository. (Check https://www.poshle.com/)

Involvements with Radis

I am currently going through this highly informative tutorial on spectroscopy [!] and plan to switch to the original RADIS paper [1]. Prior to this, I created the following PRs that fixed a few bugs:

  • #204, which fixed #78 and #80: fixed errors caused by missing configuration file parameters. The fix involved a few minor changes in loader.py. I had to set up the dev version of RADIS to resolve this, and I ran my first spectral code. I learned about the ~/.radis config file, since the error was mainly due to missing parameters in this file. This was my first involvement with RADIS.

  • #217, which fixed #214 and #81: fixed a mismatch between the provided and output wavelength ranges. This involved changes in base.py and calc.py, and added a manual test in test_calc.py. Since the parameters were passed to the SpectrumFactory, I had to read and understand a few parts of calc.py as well. When I created my PR, my tests were failing, so I learned about the code linting style and the quality standards that must be met before committing code to the develop branch.

I plan to be associated with RADIS for a long time and to learn enough, on both the programming and spectroscopy sides, to plan, propose, and deliver various features. I will also set up a blog (or write on Medium) explaining my experience in this project, not only to articulate and document it, but also because I believe this project can serve as an excellent example of big-data handling with lazy-loading features.

What can I bring to the project?

  • I have been working with Python for a year now and have used it in various hackathons. I also have plenty of experience with dataframe libraries, and I brushed up on them since this project caught my eye early on. Therefore, I believe I satisfy the desired knowledge criteria.

  • I do not have any other major commitments during the summer, so giving 17-20 hours a week will not be a problem. I can give up to 25 hours as well if the project or a specific task demands it.

  • “I’m the type of person if you ask me a question, and I don’t know the answer, I’m gonna tell you I don’t know. But I bet you what. I know how to find the answer, and I will find the answer. Is that fair enough?”

Coding Platform

Device: Asus Vivobook 15
Operating System: Ubuntu 20.04.4 LTS x86_64
RAM: 8 GB
CPU: Intel i7-8565U (4 cores) @ 1.80 GHz
Preferred IDE: Visual Studio Code

Contact Details

Legal Name: Gagan Aryan
GitHub Username: gagan-aryan
University: Indian Institute of Technology, Kanpur
Phone Number: +91 7899725234
Primary Email: gagan@iitk.ac.in
Git Commits Email: gaganaryan19@gmail.com

References

[1] RADIS: A nonequilibrium line-by-line radiative code for CO2 and HITRAN-like database species, E. Pannier & C. O. Laux, doi.org/10.1016/j.jqsrt.2018.09.027

[2] A discrete integral transform for rapid spectral synthesis, D. v.d. Bekerom & E. Pannier https://linkinghub.elsevier.com/retrieve/pii/S0022407320310049

[3] Apache Arrow and the “10 Things I Hate About pandas”, Wes McKinney, https://wesmckinney.com/blog/apache-arrow-pandas-internals/

[4] Reduce Pandas memory usage #1: lossless compression, Itamar Turner-Trauring, https://pythonspeed.com/articles/pandas-load-less-data/

[5] Discussions on the topic scattered across Slack and GitHub.