etl

ETL (extract, transform, load) is a core component of the M-Lab data processing pipeline. The ETL worker is responsible for parsing data archives produced by pusher and publishing M-Lab measurements to BigQuery.

Local Development

go install ./cmd/etl_worker
gcloud auth application-default login
~/bin/etl_worker -service_port :8080 -output_location ./output -output local

From the command line (or with a browser), make a request to the /v2/worker resource with a filename= parameter that names a valid M-Lab GCS archive.

URL=gs://archive-measurement-lab/ndt/ndt7/2021/06/14/20210614T003000.696927Z-ndt7-mlab1-yul04-ndt.tgz
curl "http://localhost:8080/v2/worker?filename=$URL"

Generating Schema Docs

To build a new Docker image with the generate_schema_docs command and run it against the current directory:

$ docker build -t measurementlab/generate-schema-docs .
$ docker run -v $PWD:/workspace -w /workspace \
  -it measurementlab/generate-schema-docs

Writing schema_ndtresultrow.md
...

GKE

The universal parser will run in GKE, using a parser node pool defined in terraform-support.

The parser images are built in the Cloud Build environment, pushed to gcr.io, and deployed to the data-pipeline cluster. The build trigger can be found with:

gcloud builds triggers list --filter=github.name=etl

Migrating to Sink interface

The parsers currently use etl.Inserter as the backend for writing records. This API is overly shaped by BigQuery, which complicates testing and extension.

The row.Sink interface and row.Buffer define cleaner APIs for the backend and for buffering and annotating. This will streamline migration to Gardener-driven table selection, column-partitioned tables, and possibly a future migration to BigQuery loads instead of streaming inserts.
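To make the separation concrete, here is a minimal sketch of a buffer that batches rows and hands them to a sink. It is not the actual row.Sink or row.Buffer code: the Commit signature, the NewBuffer constructor, and the memSink fake are all assumptions chosen for illustration. The point is that a parser written against a small interface can be tested with an in-memory sink and no BigQuery dependency.

```go
package main

import (
	"errors"
	"fmt"
)

// Sink is a hypothetical stand-in for a row sink: anything that can
// accept a batch of rows. It is NOT the real row.Sink definition.
type Sink interface {
	Commit(rows []interface{}, label string) (int, error)
}

// Buffer accumulates rows and commits them to a Sink in batches.
type Buffer struct {
	size int
	rows []interface{}
	sink Sink
}

// NewBuffer returns a Buffer that flushes once it holds size rows.
func NewBuffer(size int, sink Sink) *Buffer {
	return &Buffer{size: size, rows: make([]interface{}, 0, size), sink: sink}
}

// Append adds a row, flushing first if the buffer is already full.
func (b *Buffer) Append(row interface{}, label string) error {
	if len(b.rows) >= b.size {
		if err := b.Flush(label); err != nil {
			return err
		}
	}
	b.rows = append(b.rows, row)
	return nil
}

// Flush commits all buffered rows and resets the buffer.
func (b *Buffer) Flush(label string) error {
	if len(b.rows) == 0 {
		return nil
	}
	n, err := b.sink.Commit(b.rows, label)
	if err != nil {
		return err
	}
	if n != len(b.rows) {
		return errors.New("sink committed fewer rows than requested")
	}
	b.rows = b.rows[:0]
	return nil
}

// memSink is an in-memory Sink, the kind of fake a unit test can use.
type memSink struct{ committed []interface{} }

func (m *memSink) Commit(rows []interface{}, label string) (int, error) {
	m.committed = append(m.committed, rows...) // copies row values out of the batch
	return len(rows), nil
}

func main() {
	sink := &memSink{}
	buf := NewBuffer(2, sink)
	for i := 0; i < 5; i++ {
		if err := buf.Append(i, "ndt7"); err != nil {
			panic(err)
		}
	}
	if err := buf.Flush("ndt7"); err != nil {
		panic(err)
	}
	fmt.Println(len(sink.committed)) // prints 5
}
```

Swapping memSink for a sink that streams to BigQuery, or one that writes load files, would not require changes to the buffer or to the parsers that feed it.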

Factories

The TaskFactory aggregates a number of other factories for the elements required for a Task. Factory injection is used to generalize ProcessGKETask and to simplify testing:

SinkFactory produces a Sink for output.
SourceFactory produces a Source for the input data.
AnnotatorFactory produces an Annotator to be used to annotate rows.
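The factory arrangement above can be sketched as follows. Everything in this block is hypothetical: the interface methods, the use of function types for the factories, and the Process function are simplified stand-ins, not the signatures used by the etl packages or by ProcessGKETask. The sketch only shows how injecting factories lets the same processing code run against fakes.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical, simplified element interfaces.
type Source interface{ Next() (string, bool) }
type Sink interface{ Commit(rows []string) error }
type Annotator interface{ Annotate(row string) string }

// Hypothetical factories, modeled here as function types for brevity.
type SourceFactory func(path string) (Source, error)
type SinkFactory func(path string) (Sink, error)
type AnnotatorFactory func() (Annotator, error)

// TaskFactory aggregates the factories needed to process one archive.
type TaskFactory struct {
	Source    SourceFactory
	Sink      SinkFactory
	Annotator AnnotatorFactory
}

// Process reads every row from the source, annotates it, and commits
// the result. It depends only on the injected factories.
func Process(tf TaskFactory, path string) (int, error) {
	src, err := tf.Source(path)
	if err != nil {
		return 0, err
	}
	sink, err := tf.Sink(path)
	if err != nil {
		return 0, err
	}
	ann, err := tf.Annotator()
	if err != nil {
		return 0, err
	}
	var rows []string
	for line, ok := src.Next(); ok; line, ok = src.Next() {
		rows = append(rows, ann.Annotate(line))
	}
	return len(rows), sink.Commit(rows)
}

// Fakes for testing.
type fakeSource struct{ lines []string }

func (s *fakeSource) Next() (string, bool) {
	if len(s.lines) == 0 {
		return "", false
	}
	line := s.lines[0]
	s.lines = s.lines[1:]
	return line, true
}

type fakeSink struct{ rows []string }

func (s *fakeSink) Commit(rows []string) error {
	s.rows = append(s.rows, rows...)
	return nil
}

type upper struct{}

func (upper) Annotate(row string) string { return strings.ToUpper(row) }

func main() {
	sink := &fakeSink{}
	tf := TaskFactory{
		Source:    func(string) (Source, error) { return &fakeSource{lines: []string{"a", "b"}}, nil },
		Sink:      func(string) (Sink, error) { return sink, nil },
		Annotator: func() (Annotator, error) { return upper{}, nil },
	}
	n, err := Process(tf, "gs://example-bucket/archive.tgz") // placeholder path
	if err != nil {
		panic(err)
	}
	fmt.Println(n, sink.rows) // prints 2 [A B]
}
```

A production TaskFactory would supply factories backed by GCS and BigQuery, while a unit test supplies fakes like the ones in main.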