
Community Meetings

a_git_a edited this page Feb 28, 2024 · 47 revisions

28th of February 16:00 CET - How to build a RAG pipeline

Community meetings are held to keep everyone interested in the project up to date.
Meetings are recorded and available to the public.
Please don't hesitate to start a conversation, ask questions in the chat, or raise your hand.
If you don't feel shy, we'd appreciate it if you kept your camera on.

Agenda

  1. Quick intro of VDK, the team, and people who have been working on or using VDK (Agi)
  2. Latest Release (Dilyan)
  3. RAG pipeline demo (Yoan)

Data Sources - Guided Workshop w/ Versatile Data Kit - recording

Agenda

  1. Quick intro of VDK, the team, and people who have been working on or using VDK (Agi)
  2. Latest Release? (Antoni)
  3. Workshop (Antoni)

VDK Machine Learning Roadmap - recording

Agenda

  1. Quick intro of VDK, the team, and people who have been working on or using VDK
  2. Latest Release? (Stanley)
  3. VDK Roadmap for ML Projects

In this community meeting, Paul Murphy will present how VDK can help with all aspects of ML workflows. We'll discuss the benefits of running your data creation and model training on the platform.

Streamlining Dataset Creation and Debugging in AI/Data Models - recording

Agenda

  1. Latest Release (Antoni)
  2. Streamlining Dataset Creation and Debugging in AI/Data Models (Antoni)
  3. Data Makers Fest (Agi)
  4. Thank yous, ⭐ and next meeting - 29th of Nov.

Streamlining Dataset Creation and Debugging in AI/Data Models

In an era where data is vital for machine learning models, efficient dataset creation and debugging mechanisms are the need of the hour. While platforms like HuggingFace offer a plethora of pre-existing datasets, they offer few tools for easy dataset creation and management by end-users. Our presentation focuses on solving these challenges by extending the Versatile Data Kit (VDK), an open-source framework for developing and managing data pipelines.

Key Challenges:

  1. Dataset Creation: Existing platforms offer limited user-driven options for generating and managing datasets from diverse sources.
  2. Data Integrity: Ensuring the quality and integrity of mutable datasets is a significant concern.
  3. Traceability: Lack of transparency from data origin to consumption in data models complicates debugging.

Proposed Solution: We propose an integrated VDK-based solution encompassing:

  1. Source Plugins: To simplify dataset creation from diverse sources like databases, streams, or APIs.
  2. Metrics Abstraction: A layer in source plugins that calculates standard metrics for datasets.
  3. Automated Update Mechanism: Fetches the latest data modifications and computes metrics automatically.
  4. Report Generation: Produces detailed reports highlighting anomalies and metrics, streamlining debugging.
  5. Centralized Repository: Data and reports are stored centrally for easy access and examination.
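
To make the metrics and report items above more concrete, here is a minimal Python sketch of what "standard metrics" and "highlighting anomalies" could look like for one numeric column. It is not taken from the presentation or from any VDK plugin: the names `ColumnMetrics`, `compute_metrics`, `find_anomalies`, and the `max_null_ratio` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional

# A row is a plain dict; real source plugins would likely use richer types.
Row = Dict[str, Optional[float]]


@dataclass
class ColumnMetrics:
    """Hypothetical container for per-column dataset metrics."""
    count: int                 # total number of rows seen
    nulls: int                 # rows where the column is missing or None
    minimum: Optional[float]
    maximum: Optional[float]
    average: Optional[float]


def compute_metrics(rows: List[Row], column: str) -> ColumnMetrics:
    """Compute simple metrics for one numeric column."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return ColumnMetrics(
        count=len(rows),
        nulls=len(rows) - len(values),
        minimum=min(values) if values else None,
        maximum=max(values) if values else None,
        average=mean(values) if values else None,
    )


def find_anomalies(previous: ColumnMetrics, current: ColumnMetrics,
                   max_null_ratio: float = 0.1) -> List[str]:
    """Compare two metric snapshots and describe anything suspicious."""
    issues = []
    if current.count < previous.count:
        issues.append(
            f"row count dropped from {previous.count} to {current.count}")
    if current.count and current.nulls / current.count > max_null_ratio:
        issues.append(
            f"null ratio {current.nulls / current.count:.2f} "
            f"exceeds {max_null_ratio}")
    return issues


# Two snapshots of the same (made-up) dataset, before and after an update.
old = compute_metrics([{"price": 10.0}, {"price": 12.0}, {"price": 11.0}], "price")
new = compute_metrics([{"price": 10.0}, {"price": None}], "price")
report = find_anomalies(old, new)
```

Running the metric computation on every update and storing each snapshot next to the data is one way the "automated update" and "centralized repository" items could fit together; the comparison rules here are deliberately simplistic.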

Agenda

Quick intro of VDK, the team, and people who have been working on or using VDK

  1. Latest Release (Antoni)
  2. Multiple Python Versions Support (Andy)
  3. VDK Run logs: Simplified and readable. Quick overview (Dilyan)

Next meeting 25th of October 16:00 CET
Support us by ⭐

Agenda

  1. Release (Antoni)
  2. Productionizing Jupyter Notebooks (Duygu + Antoni)

Next meeting 27th of September 16:00 CET
Support us by ⭐

Agenda

Quick intro of VDK, the team, and people who have been working on or using VDK

  1. Latest release (Antoni)
  2. Hugging Face + VDK to train and use LLMs (Paul). He will show how running Hugging Face on VDK will augment its functionality. Workflows:
  • Finetuning an LLM
  • Creating a dataset
  • Catching regressions in LLMs ahead of time
  • Q&A
  3. Roadmap (Antonio)
  • Q&A

28th of June - Practical Kimball Patterns - Dimensional modeling 101 - Watch Recording

Agenda:

  1. VDK quick intro and latest release (Antoni)
  2. VDK team meeting/workshop (Agi)
  3. Dimensional modeling 101 - Practical Kimball data patterns (Antoni)

31st of May - Generative Data Packs and DevOps for Data - Watch Recording

Agenda:

  1. VDK’s latest release (Antoni)
  2. Generative Data Packs (Iva)
  3. DevOps for Data (Agi)

26th of April - VDK UI demo - Watch Recording

Agenda:

  1. Intro to the agenda and people - what are you working on lately?
  2. README updates and VDK intro (Agi)
  3. VDK’s latest release (Antoni)
  4. VDK UI demo (Paul)
  5. Next meeting topic. Date: 31st of May (Agi)

22nd of February Jupyter Integration - Watch recording

Shoutout to the recent VDK contributors and their work!

Agenda:

  1. VDK’s latest release
  2. VDK Jupyter integration
  3. FOSDEM experience and conclusions
  4. Next meeting date

11th of January - Watch recording

Agenda:

  1. VDK’s latest release - Stanislav
  2. Introduction to Versatile Data Kit Control Service - Paul
  3. Demo of the current installation process - Iva
  4. Discussion on a proposal to implement the “Three Click Rule” to make the installation faster and easier for users. - Iva
  5. Decide on a date for the next community meeting (provisionally 15th of Feb) - Paul

30th of November - Watch Recording

Agenda

  1. Welcome - Agi
  2. Release - Antoni
  3. Recent industry database adoption statistics suggest that PostgreSQL has been gaining traction. We have recently introduced embedded PostgreSQL support, so the database type deployed with the Control Service when no external data source is set is now a configuration option: either CockroachDB (the default) or PostgreSQL - Iva
  4. We have just returned from Data Science Conference Europe 2022, and we’ll talk about our experience there - Vic, Antoni, Dimira

Discussion:

  • templates for community meetings - do we need one?
  • next community meeting x-mas/NY themed - 21st of Dec
  • YT live community meetings

26th of October - Watch recording

Agenda

  1. Welcome and intro - Agi
  2. Release - Antoni
  3. Hackathon. We've applied for the Borathon, and we'll demo what we did there! - Antoni
  4. Demo of a new feature that allows skipping the remaining steps of a data job execution via the job input object - Momchil
  5. Latest articles about VDK
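
As a rough illustration of the skip feature in item 4, the sketch below shows the general idea with a toy step runner: a step signals through the job input object that the rest of the execution should be skipped. This is not VDK's implementation. `ToyJobInput`, `run_job`, and the step functions are hypothetical stand-ins, and the method name `skip_remaining_steps` is used here only as an assumption for the example.

```python
from typing import Callable, List


class ToyJobInput:
    """Hypothetical stand-in for a job input object (not VDK's real class)."""

    def __init__(self) -> None:
        self.skip_requested = False

    def skip_remaining_steps(self) -> None:
        # A step calls this to ask the runner to stop after the current step.
        self.skip_requested = True


Step = Callable[[ToyJobInput], None]


def run_job(steps: List[Step]) -> List[str]:
    """Run steps in order; stop early if a step requested a skip."""
    job_input = ToyJobInput()
    executed = []
    for step in steps:
        step(job_input)
        executed.append(step.__name__)
        if job_input.skip_requested:
            break
    return executed


def check_for_new_data(job_input: ToyJobInput) -> None:
    new_rows: list = []  # pretend the source returned nothing new
    if not new_rows:
        job_input.skip_remaining_steps()


def transform(job_input: ToyJobInput) -> None:
    pass  # would transform the new rows


def load(job_input: ToyJobInput) -> None:
    pass  # would load the transformed rows


executed = run_job([check_for_new_data, transform, load])
print(executed)  # ['check_for_new_data']
```

A typical use is the one shown: the first step checks whether there is anything to process and, if not, ends the execution cleanly instead of failing or running empty steps.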

28th of September: Creating First Data Job - Watch recording

Agenda:

  1. Welcome and intro. If you are new to VDK, I encourage you to say hi :)
  2. Quick intro to the project (Agi)
  3. Release announcement (Antoni)
  4. GitHub Star History example demo (Agi)
  5. Discussion topics:
  • Two PR reviewers
  • VDK catchphrase

VDK catchphrase (also anchor text):

  1. unique value
  2. clear
  3. short and sweet

Examples:

  • A high-performance observability data pipeline.
  • Declarative continuous deployment for Kubernetes.
  • The easiest way to coordinate your dataflow
  • A cloud-native Pipeline resource.
  • Always know what to expect from your data.
  • Data-Centric Pipelines and Data Versioning
  • An orchestration platform for the development, production, and observation of data assets.
  • Build powerful pipelines in any programming language.
  • Build data pipelines, the easy way
  • Machine Learning Pipelines for Kubeflow

I have included data pipeline and other tools that have more than 2000 stars. Only Airbyte has a longer message:

  • Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.

I think the difference is felt in the readability of the message. So, for this catchphrase it would be nice to come up with something that is #uniquevalue

Ideas:

  • Building and Managing Data Pipelines with SQL or Python
  • Data Pipelines covering full DataOps lifecycle
  • Building and managing your data pipelines with python or SQL on the cloud (or Kubernetes)
  • Build, run and manage your data jobs
  • Build, run and manage your data pipelines
  • Develop, run and manage your data pipelines on the cloud
  • Automate and abstract the Data and DevOps cycle
  • Automate and abstract the Data Journey and the DevOps cycle
  • Orchestrate

A bit more abstract and unclear ideas:

  • Efficient data engineering
  • Enable everyone to focus on work that requires their core skills

(because SQL or python is maybe not our unique value prop)

Questions:

  • Add cloud or Kubernetes?
  • Data Pipelines OR DataOps pipelines?

Helpful questions:

  • What do you think is the unique value of VDK?
  • How would you google to find this framework? (if you don't know it exists)

"VDK I think rather has a lot of possibilities in the “T” part - templates (kimball or generic), managed connection plugins enable quality, lineage (when implemented). And also in the abstracting DevOps part - though we need to do more around testing."

Action items: create a form where we can rank the catchphrases

24th of August: VDK Templates https://youtu.be/HIRt4bX4ddk

Attendees:

Agenda:

  1. Welcome and team (Agi)
  2. Intro to the project (Agi)
  3. Momchil Zhivkov about templates:
    Templates are reusable code in the context of data jobs. They are intended to solve a common use case among different users. A template is executed through a data job. An example of a common use case is loading data into a data warehouse.

This presentation will demo:

  • what is a template
  • how does it look
  • the purpose of templates
  • using and developing templates
  • our already existing templates that can be reused
  4. Duygu - csv-export
    A new feature was added to the already existing CSV plugin, which allows people to export the result of a SQL query to a CSV file.
  5. Toni - VDK release v0.6
  6. Open discussion
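
For readers unfamiliar with the csv-export idea in the agenda above, the following sketch shows the underlying operation (run a SQL query, write the result set with a header row to a CSV file) using only the Python standard library. It is not the CSV plugin's code or command-line interface: the function `export_query_to_csv`, the SQLite in-memory database, and the `stars` table are assumptions chosen to keep the example self-contained.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def export_query_to_csv(connection: sqlite3.Connection, query: str,
                        path: Path) -> int:
    """Run a query and write its result to a CSV file; return the row count."""
    cursor = connection.execute(query)
    # cursor.description holds one tuple per result column; item 0 is the name.
    header = [column[0] for column in cursor.description]
    rows = cursor.fetchall()
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)


# Illustrative data only: a tiny in-memory table.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE stars (repo TEXT, star_count INTEGER)")
connection.executemany(
    "INSERT INTO stars VALUES (?, ?)", [("a", 1), ("b", 2)])

out_path = Path(tempfile.mkdtemp()) / "result.csv"
written = export_query_to_csv(
    connection, "SELECT repo, star_count FROM stars ORDER BY repo", out_path)
```

In a real data job the connection would come from whatever database the job is configured to use; SQLite is used here only because it ships with Python.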

20th of July: How to promote an open-source project https://youtu.be/wmdx7ngocr4

15:00 (GMT+01:00) - Add to Google calendar

Attendees:

Agenda:

  1. Welcome and team (Agi)
  2. Michael Gasch on how to promote an open-source project: tips and questions

22nd of June: Airflow integration https://youtu.be/c3j1aOALjVU

11:00 (GMT+01:00) - Add to Google calendar

Attendees:

  • Agita Jaunzeme aka Agi (VDK Community Manager)
  • Gabriel Georgiev
  • Antoni Ivanov
  • Dimira Petrova

Agenda:

  1. Welcome and team (Agi)
  2. Intro to the project (Agi)
  3. Announcement of recent changes (Antoni)
  4. Airflow Provider Demo by Gabriel
  5. Discussion:
  • VDK community update (Agi)
  • how to find community meeting links
  • Community and Resources page
  • ODSC Europe conference: volunteering, speakers (Jacob Tomlinson, Guglielmo Iozzia, Carl Osipov, Shawn Kyzer on Data Mesh)
  • Invitation to be DataOps community lead for Techies of Baltics - devops.lv; also James Craig from CDK
  • Next meeting: possibly someone will join to tell us their story of growing an OSS community (govmomi or RAPIDS)
  • Next meeting date
  • Next meeting time (let's hold the next community meeting at a US-friendly time)

Useful links:

May 25, 2022: KubeCon https://youtu.be/w0teqOw9qjc

Attendees:

  • Agita Jaunzeme aka Agi (VDK Community Manager)

Agenda:

  1. Welcome and team (Agi)
    / intro to the project
  2. VDK community update (Agi):
    1. Latest release
    2. Roadmap (Dako)
      • Apache Airflow integration
      • Security Improvements
      • Provide users with better notifications/information about non-gracefully failed data job execution
  3. Open questions about KubeCon – discussion
  4. Conclusion and relevant links (Twitter / Slack / YT / blogs etc.)

Discussion Topics:

Useful links:

Attendees:

  • Agita Jaunzeme
  • Dimira Petrova
  • Dako Dakov
  • Antoni Ivanov
  • Gabriel Georgiev

Agenda:

  • Welcome (Agi)
  • Intro of the team (all)
  • Intro of the project (Agi)
  • What are we doing lately (Antoni)
  • What are we planning to do in the near future (Dako)
  • Discussion
  • Conclusion (Agi)

Discussion Topics:

  • Kubernetes / ..
  • Meeting frequency / next meeting - the week of 23rd of May
  • Agenda for the next meeting to be more specific

Useful links:
