Using eLabFTW as a flexible data entry system #4857
din14970 started this conversation in Show and tell (2 comments, 5 replies)
- Well, it is! It's not normalized because everything is tucked up in the […]
- Hey @din14970 Would you be interested in presenting your system (10 minutes) at the next Community Meeting?
Context & problem
I am sharing this because I think it will be recognizable for many research groups collecting lab data. Perhaps it can serve as an inspiration.
I currently work at an applied research institute, helping research teams with various data-related topics. A group in our chemistry department works on lignin depolymerization and has problems with data management. Currently they record all their data across multiple Excel files on SharePoint. This has the following issues:
So essentially, a database is being maintained that cannot be used as a database: it cannot be searched or queried, the data cannot be analyzed, and the quality of any given entry cannot be verified.
Solution requirements
The implemented solution, part 1
As I had prior experience using eLabFTW, and had similar issues when I was working as a researcher myself, I proposed we could pilot eLabFTW for their case. Since their data was quite structured I proposed to make heavy use of the recent metadata fields feature. We approached the problem as follows:
Sourced lignins come in from external parties or suppliers. These are then depolymerized in various ways, yielding a depolymerized lignin. There is a many-to-one relationship from depolymerized lignin to sourced lignin, which is why it is split out as a separate item. At the moment we include the depolymerization conditions directly in the depolymerized-lignin template, but one can imagine that if depolymerization becomes very complex, with multiple different protocols, it may need to be split into a separate template. Characterizations are split out into experiment templates, although we may move them to resource templates in the future (since we discovered that experiments do not track which templates they originate from, and experiment templates are generally not admin-controlled).
When filled in, these templates render as structured forms in view mode and in edit mode (screenshots omitted).
At this stage the researcher already has an environment for recording new data. Notice that we use the newest feature: linking to other entities directly from the metadata.
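As a concrete illustration, a resource template's metadata might look like the sketch below. The field names, options, and positions are invented for this example; the shape follows eLabFTW's `extra_fields` metadata convention as I understand it, with an `items`-type field providing the link to another resource (here, the sourced lignin a depolymerized lignin was made from):

```python
import json

# Hypothetical metadata for a "Depolymerized lignin" resource template.
# All field names are illustrative, not the group's real schema.
# The "items" type renders as a link to another eLabFTW resource.
DEPOLYMERIZED_LIGNIN_METADATA = {
    "extra_fields": {
        "Source lignin": {
            "type": "items",   # link to another resource item
            "value": "",       # filled in with the linked item's id
            "position": 1,
        },
        "Depolymerization method": {
            "type": "select",
            "options": ["hydrogenolysis", "oxidation", "acidolysis"],
            "value": "",
            "position": 2,
        },
        "Reaction temperature (C)": {
            "type": "number",
            "value": "",
            "position": 3,
        },
    }
}

def metadata_json() -> str:
    """Serialize the template metadata as a JSON string, which is the
    form eLabFTW stores in the template's metadata field."""
    return json.dumps(DEPOLYMERIZED_LIGNIN_METADATA)
```

The resulting JSON string can be pasted into the template's metadata editor or pushed via the API.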
Ultimately we end up with a system that behaves conceptually like this (diagram omitted):
What we achieve with this:
What we cannot yet achieve with this:
The implemented solution part 2
To address the requirements related to data analysis, there are several possible solutions. These are not necessarily eLabFTW-specific, but I thought I would share them anyway to show what we have done and to serve as inspiration.
The simplest approach would be to write a script that extracts and recombines all the data into tables, which are downloaded and stored locally in some tabular form. These could then be analyzed or queried with a tool of choice (Excel, Python, matplotlib, R, pandas, DuckDB, ...). The main downsides to this approach:
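For illustration, a minimal version of such an extraction script could look like the sketch below. The instance URL, API key, and category id are placeholders, and the endpoint and query-parameter shapes follow eLabFTW's v2 REST API as I understand it; the flattening of metadata extra fields into columns is the part that carries over to any variant of this approach:

```python
import csv
import json
import urllib.request

ELAB_URL = "https://elab.example.org/api/v2"  # assumption: your instance
API_KEY = "..."                               # assumption: a read-only API key

def flatten_item(item: dict) -> dict:
    """Turn one eLabFTW item into a flat row: id, title, plus every
    metadata extra field as its own column."""
    row = {"id": item.get("id"), "title": item.get("title")}
    extra = json.loads(item.get("metadata") or "{}").get("extra_fields", {})
    for name, field in extra.items():
        row[name] = field.get("value")
    return row

def fetch_items(category_id: int) -> list[dict]:
    """Download all resources of one category and flatten them."""
    req = urllib.request.Request(
        f"{ELAB_URL}/items?cat={category_id}&limit=9999",
        headers={"Authorization": API_KEY},
    )
    with urllib.request.urlopen(req) as resp:
        return [flatten_item(i) for i in json.load(resp)]

if __name__ == "__main__":
    # e.g. dump the (hypothetical) "Sourced lignin" category to CSV
    rows = fetch_items(category_id=1)
    with open("sourced_lignin.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```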
We opted for another approach. At our research institute we are also implementing an AWS-based research data platform that supports building automated data flows and pipelines. I used this system to extract and transform the eLabFTW data into tables whose data lives on AWS S3 and whose schemas are registered in AWS Glue databases. These tables can then be queried directly, e.g. with SQL in AWS Athena, or using tools like awswrangler (for a pandas interface).
The pipeline is written in Python, containerized, and deployed on Airflow; at the moment it runs once per day, at night.
I map each item back to a row, so each entity type becomes a table. Links are parsed so that they refer to row numbers in another table (enabling joins). The end result is a set of data artifacts on S3.
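The link-to-row-number translation described above can be sketched as a pure function. Table and column names below are hypothetical, and the input is assumed to be per-type lists of flattened items that still carry eLabFTW item ids in their link columns:

```python
def build_tables(items_by_type: dict[str, list[dict]],
                 link_columns: dict[str, str]) -> dict[str, list[dict]]:
    """Turn per-type item lists into join-ready tables.

    items_by_type: e.g. {"sourced_lignin": [...], "depolymerized_lignin": [...]}
        where each item dict has an eLabFTW "id" plus metadata columns.
    link_columns: maps a link column to the table it points into,
        e.g. {"source_lignin_id": "sourced_lignin"}.
    Link values (eLabFTW item ids) are rewritten to row numbers in the
    target table, so the tables can later be joined on row index.
    """
    # First pass: remember which row each eLabFTW id ends up on.
    row_of_id = {
        table: {item["id"]: row for row, item in enumerate(items)}
        for table, items in items_by_type.items()
    }
    # Second pass: copy rows, translating link columns to row numbers.
    tables: dict[str, list[dict]] = {}
    for table, items in items_by_type.items():
        rows = []
        for item in items:
            row = dict(item)
            for col, target in link_columns.items():
                if col in row and row[col] is not None:
                    row[col] = row_of_id[target].get(row[col])
            rows.append(row)
        tables[table] = rows
    return tables
```

Each resulting table is then written out (e.g. as Parquet) to S3 and its schema registered in Glue.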
Tables can be queried directly in Athena (screenshot omitted).
This now allows us to write custom queries to answer arbitrary questions. I don't have a screenshot of a notebook, but we have also shown that awswrangler lets us read the data as pandas tables, which we can then process and analyze with plots.
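As a sketch of what such a custom query might look like: the table and column names below are invented for illustration (not the group's real schema), while `awswrangler`'s `athena.read_sql_query` is the actual entry point for pulling the result back as a pandas DataFrame:

```python
def yield_by_method_query(database: str = "lignin_lab") -> str:
    """A hypothetical join across the extracted tables: average product
    yield per depolymerization method. The database, table, and column
    names are placeholders."""
    return f"""
        SELECT d.method, AVG(c.yield_pct) AS avg_yield
        FROM {database}.depolymerized_lignin d
        JOIN {database}.characterization c
          ON c.depolymerized_lignin_row = d.row_nr
        GROUP BY d.method
        ORDER BY avg_yield DESC
    """

if __name__ == "__main__":
    # Requires AWS credentials and the awswrangler package.
    import awswrangler as wr
    df = wr.athena.read_sql_query(yield_by_method_query(), database="lignin_lab")
    print(df.head())
```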
Finally, we are also experimenting with a data catalogue called DataHub, which should make these kinds of datasets searchable across the entire organization. It would also allow registering more metadata about the tables themselves (e.g. descriptions on columns).
DataHub also has a data lineage feature, which can show which tasks produced which datasets, which is pretty cool.
So, as a summary, the system now looks like this (diagram omitted).
Summary
We are testing eLabFTW as a flexible data-entry system that allows us to quickly implement forms for recording high-quality lab data. Through the API we can easily manage the templates, and even the items, and perform migrations as necessary. We can also implement automation scripts and tools. If we had to implement a custom web application for each research group, it would take longer than any research project; eLabFTW enabled us to build this entire project in about three months (not a full-time effort, with more than one month spent on migrating the Excel files).
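As an example of the kind of migration the API makes possible, here is a hypothetical sketch that renames a metadata field across items. The URL, API key, and field names are placeholders, and it assumes the v2 API's `PATCH /items/{id}` accepts an updated `metadata` string; the pure rename step is the part that generalizes:

```python
import json
import urllib.request

ELAB_URL = "https://elab.example.org/api/v2"  # assumption: your instance
API_KEY = "..."                               # assumption: a write API key

def rename_field(metadata_json_str: str, old: str, new: str) -> str:
    """One migration step: rename a metadata extra field on an item,
    keeping its value and settings intact."""
    md = json.loads(metadata_json_str or "{}")
    fields = md.get("extra_fields", {})
    if old in fields:
        fields[new] = fields.pop(old)
    return json.dumps(md)

def patch_item(item_id: int, new_metadata: str) -> None:
    """Push one item's updated metadata back via the v2 API."""
    req = urllib.request.Request(
        f"{ELAB_URL}/items/{item_id}",
        data=json.dumps({"metadata": new_metadata}).encode(),
        headers={"Authorization": API_KEY,
                 "Content-Type": "application/json"},
        method="PATCH",
    )
    urllib.request.urlopen(req).close()
```

In practice such a script would loop over all items of the affected template, applying `rename_field` and patching each one.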
We also employ our AWS data-analysis platform to make the same data available in a format that can easily be queried or analyzed from various programming languages. The researcher only has to care about entering data in eLab; they, or others, can then easily extract insights from that data later.
It is still early days, both for eLabFTW in this group and for our data platform, but we are excited to develop the system further, bring it to more teams, and see where it goes.