
message queue for meteorological data processing #3163

Open
mdietze opened this issue Apr 27, 2023 · 3 comments
@mdietze
Member

mdietze commented Apr 27, 2023

Description

Currently, the processing of input data for the PEcAn workflow is done sequentially at run time. As a first step toward a more cloud-based workflow that is asynchronous, distributed, and event driven, I propose we start with met.process as an initial test case.

Proposed Solution

Put either just met.process, or all of do.conversions, in its own container with its own message queue. Each message would need to carry the relevant portion of the settings: which met data source, which site (name, lat, lon) or vector of sites, what date range, which model's file format is the target, etc.
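A minimal sketch of what such a message might contain (field names here are hypothetical, illustrated in Python for brevity, and are not an existing PEcAn schema):

```python
import json

# Hypothetical message payload for a met.process worker container.
# Field names are illustrative, not an existing PEcAn schema.
def build_met_message(source, site, start_date, end_date, model):
    """Serialize the settings a met.process container would need."""
    return json.dumps({
        "met_source": source,        # e.g. "ERA5", "CRUNCEP"
        "site": site,                # name, lat, lon (or a list of sites)
        "start_date": start_date,
        "end_date": end_date,
        "target_model": model,       # selects the met2model converter
    })

msg = build_met_message(
    "ERA5",
    {"name": "US-WCr", "lat": 45.8059, "lon": -90.0799},
    "2010-01-01", "2010-12-31",
    "SIPNET",
)
```

Keeping the payload to plain settings (rather than database IDs) is what would let a worker run without direct BETY access.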

Some issues to consider:

  • met.process currently does a lot of talking to the BETY database. Do we want to continue to support this?

  • If so, do we want to give the met.process containers direct access to BETY, or do we want a single piece of code (e.g., in the workflow container) to be responsible for all database I/O? Do we want to use this task as an excuse/opportunity to reduce the dependence on BETY by logging less in the database?

  • Database communication is currently managed by convert.input, which already has the option to run a conversion step locally or on an HPC. This would in some ways be the "easiest" place to insert a RabbitMQ + Docker option, but it might require either putting each met operator (there are dozens) in its own container or creating a general container that holds all of them (meaning that the message would also need to specify which operator to apply). The latter seems easier to implement and maintain, but gives us less granular control over scaling.
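The "general container" option above implies a dispatch step: the message names the operator, and the worker looks it up in a registry. A rough Python sketch, with hypothetical operator and function names:

```python
# Hypothetical operator registry for a single "general" met container.
# The message carries an "operator" field; the worker dispatches on it.

def download_era5(args):
    return f"downloaded ERA5 for {args['site']}"

def met2cf_era5(args):
    return f"converted {args['site']} to netCDF CF"

OPERATORS = {
    "download.ERA5": download_era5,
    "met2CF.ERA5": met2cf_era5,
}

def handle_message(message):
    """Route a queue message to the requested met operator."""
    op = OPERATORS.get(message["operator"])
    if op is None:
        raise ValueError(f"unknown operator: {message['operator']}")
    return op(message["args"])

result = handle_message({"operator": "download.ERA5",
                         "args": {"site": "US-WCr"}})
```

With one container per operator, the registry disappears and the routing moves into the queue topology instead (one queue per operator), which is the trade-off between maintainability and granular scaling noted above.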

In general, met.process has the following steps:

  1. download the raw met data
  2. convert this to netCDF CF format
  3. extract (regional products) or gap-fill (site-level products) the data
  4. convert from netCDF CF to model specific format
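Chained as queue stages, the four steps above might look like the following sketch (all names hypothetical; in a real event-driven setup each stage would publish a message for the next stage's queue rather than call the next function directly):

```python
# Illustrative chain of the four met.process stages.

def download_raw(site):
    # stage 1: fetch raw met data for a site
    return {"site": site, "stage": "raw"}

def to_netcdf_cf(data):
    # stage 2: standardize to netCDF CF
    return {**data, "stage": "CF"}

def extract_or_gapfill(data, regional=False):
    # stage 3: regional products get extracted to the site;
    # site-level products get gap-filled
    return {**data, "stage": "extracted" if regional else "gapfilled"}

def met2model(data, model):
    # stage 4: convert CF to the model-specific format
    return {**data, "stage": f"{model}-ready"}

out = met2model(
    extract_or_gapfill(to_netcdf_cf(download_raw("US-WCr"))),
    "SIPNET",
)
```

Each stage only needs the output path of the previous stage plus the original settings, which is what makes the pipeline a natural fit for per-stage messages.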

Relevant bits of code to look at:

  1. base/workflow/R/do.conversions.R
  2. modules/data.atmosphere/R/met.process.R
  3. base/db/R/convert.input.R
  4. individual modules/data.atmosphere/R/download*, met2CF*, and extract* functions
  5. individual models/[model]/R/met2model.[model].R functions
@ankurdesai
Contributor

met.process is such a Swiss Army knife that I agree it would be useful as a standalone, cloud-compatible tool (a reprise of the Brown Dog functionality?). From that perspective, I would be in favor of separating the database portions from the general steps (download, standardize, extract, gap-fill, convert), with a wrapper that receives the necessary updates to be made to BETY's inputs and filepath records. That may make it easier to debug too.
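One way to read this suggestion: the processing steps stay pure (no database access) and return a record of what changed, while a thin wrapper is the only code that touches BETY. A hedged Python sketch, with hypothetical names:

```python
# Sketch of separating database writes from processing, per the
# suggestion above. Names are hypothetical, not actual PEcAn functions.

def process_met(site):
    """Pure processing: returns the output filepath plus the input/dbfile
    records that *would* need to be written to BETY."""
    path = f"/data/met/{site}.nc"
    return {"filepath": path, "bety_updates": [("inputs", site, path)]}

def bety_wrapper(result, apply_update):
    """The only place that talks to the database: replays the
    recorded updates, then reports the output filepath."""
    for update in result["bety_updates"]:
        apply_update(update)
    return result["filepath"]

written = []  # stand-in for a real BETY write function
path = bety_wrapper(process_met("US-WCr"), written.append)
```

Debugging then only requires inspecting the returned update records, without a live database connection.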

@computate

computate commented May 4, 2023

@mdietze Would you be able to share the "steps to reproduce" for a way to configure PEcAn that launches a workflow that could be more distributed, and the "acceptance criteria" for a first deliverable for this issue?


github-actions bot commented May 4, 2024

This issue is stale because it has been open 365 days with no activity.
