GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

πŸ“ What is GraphGen?

GraphGen is a framework for synthetic data generation guided by knowledge graphs. See our paper and best practices for details.

GraphGen first constructs a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error (ECE) metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. It also uses multi-hop neighborhood sampling to capture complex relational information and style-controlled generation to diversify the resulting QA data.
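
To make the calibration step concrete, here is a minimal sketch of the standard binned expected calibration error. It is the textbook definition, not GraphGen's own code: the function name, the equal-width binning, and the `n_bins` default are assumptions made for illustration, and the framework's actual implementation may differ.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|.

    Hypothetical helper for illustration; not part of the GraphGen API.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("need at least one prediction")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Assign each prediction to one of n_bins equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

A value near 0 means the model's stated confidence tracks how often it is right. A larger value means confidence and accuracy diverge, which is the kind of signal the description above uses to locate knowledge gaps.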

🚀 Quick Start

Try GraphGen on the OpenXLab Application Center, and see the FAQ for common questions.

Gradio Demo

python webui/app.py

[Screenshot: Gradio web UI]

Run from PyPI

  1. Install GraphGen

    pip install graphg
  2. Run in CLI

    SYNTHESIZER_MODEL=your_synthesizer_model_name \
    SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
    SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
    TRAINEE_MODEL=your_trainee_model_name \
    TRAINEE_BASE_URL=your_base_url_for_trainee_model \
    TRAINEE_API_KEY=your_api_key_for_trainee_model \
    graphg --output_dir cache
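
The same CLI invocation can be driven from Python, which makes it easy to fail early when a variable is missing. This is a sketch under stated assumptions: it treats all six variables shown above as required, and the helper names (`missing_vars`, `build_command`, `run_graphg`) are hypothetical, not part of GraphGen. Only the `graphg --output_dir` command itself comes from the steps above.

```python
import os
import subprocess
from typing import List, Mapping

# Variable names taken from the CLI example above.
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def missing_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset or empty, in declaration order."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def build_command(output_dir: str = "cache") -> List[str]:
    """Build the argument list for the `graphg` CLI shown above."""
    return ["graphg", "--output_dir", output_dir]


def run_graphg(settings: Mapping[str, str], output_dir: str = "cache") -> None:
    """Merge `settings` into the current environment and launch the CLI."""
    env = {**os.environ, **settings}
    missing = missing_vars(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # check=True raises CalledProcessError if graphg exits with a non-zero status.
    subprocess.run(build_command(output_dir), env=env, check=True)
```

Calling `run_graphg({...})` with the six values filled in is then equivalent to the shell command above, with the validation happening before any model API is contacted.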

Run from Source

  1. Install dependencies
    pip install -r requirements.txt
  2. Configure the environment
    • Create a .env file in the root directory
      cp .env.example .env
    • Set the following environment variables:
      # Synthesizer is the model used to construct the knowledge graph (KG) and generate data
      SYNTHESIZER_MODEL=your_synthesizer_model_name
      SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
      SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
      # Trainee is the model to be trained on the generated data
      TRAINEE_MODEL=your_trainee_model_name
      TRAINEE_BASE_URL=your_base_url_for_trainee_model
      TRAINEE_API_KEY=your_api_key_for_trainee_model
  3. (Optional) To change the default generation settings, edit configs/graphgen_config.yaml.
    # configs/graphgen_config.yaml
    # Example configuration
    data_type: "raw"
    input_file: "resources/examples/raw_demo.jsonl"
    # more configurations...
  4. Run the generation script
    bash scripts/generate.sh
  5. Get the generated data
    ls cache/data/graphgen
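
Before running the generation script, it can help to confirm that the copied .env file no longer contains template values. The sketch below is a deliberately simple KEY=VALUE reader written for illustration. It assumes that unfilled entries are either empty or still begin with `your_`, as in the placeholders listed above. The function names are hypothetical, and a dedicated library such as python-dotenv handles quoting and edge cases more thoroughly.

```python
from pathlib import Path
from typing import Dict, List


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of matching-style quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def unfilled_keys(values: Dict[str, str], placeholder_prefix: str = "your_") -> List[str]:
    """Return keys whose value is empty or still looks like a template placeholder."""
    return sorted(
        key
        for key, value in values.items()
        if not value or value.startswith(placeholder_prefix)
    )


def check_env_file(path: str = ".env") -> List[str]:
    """Read an env file from disk and report the keys that still need values."""
    return unfilled_keys(parse_env_text(Path(path).read_text(encoding="utf-8")))
```

An empty list from `check_env_file()` suggests every entry has been filled in. It does not verify that the model names, URLs, or keys are actually valid.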

📌 Latest Updates

  • 2025.04.21: We have released the initial version of GraphGen.

πŸ—οΈ System Architecture

See the analysis by DeepWiki for a technical overview of the GraphGen system, its architecture, and core functionalities.

Workflow

[Figure: GraphGen workflow diagram]

πŸ€ Acknowledgements

  • SiliconCloud: abundant LLM APIs, some models are free
  • LightRAG: simple and efficient graph retrieval solution
  • ROGRAG: A Robustly Optimized GraphRAG Framework

📚 Citation

If you find this repository useful, please consider citing our work:

@software{Chen_GraphGen_2025,
author = {Chen, Zihong and Jiang, Wanli and Li, Jingzhe and Yuan, Zhonghang and Wang, Chenyang and Kong, Huanjun and Dong, Nanqing},
month = apr,
title = {{GraphGen}},
url = {https://github.com/open-sciencelab/GraphGen},
year = {2025}
}

📜 License

This project is licensed under the Apache License 2.0.