GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

📚 Table of Contents

📝 What is GraphGen?
🚀 Quick Start
📌 Latest Updates
🏗️ System Architecture
🍀 Acknowledgements
📚 Citation
📜 License

📝 What is GraphGen?

GraphGen is a framework for synthetic data generation guided by knowledge graphs. See our paper and best-practice guide for details.

It begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
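To make the calibration step more concrete, here is a minimal sketch of the standard binned expected calibration error (ECE). It is not GraphGen's internal implementation: the function name, the equal-width binning, and the input format (one confidence score and one correctness flag per item) are assumptions chosen for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    Hypothetical helper for illustration; confidences are expected in [0, 1].
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    conf_sum = [0.0] * n_bins   # sum of confidences per bin
    hit_sum = [0] * n_bins      # number of correct items per bin
    count = [0] * n_bins        # number of items per bin

    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += 1 if ok else 0
        count[idx] += 1

    ece = 0.0
    for i in range(n_bins):
        if count[i] == 0:
            continue
        accuracy = hit_sum[i] / count[i]
        mean_conf = conf_sum[i] / count[i]
        ece += (count[i] / total) * abs(accuracy - mean_conf)
    return ece
```

A model that answers with 90% confidence but is right only 75% of the time gets a nonzero ECE, which is the kind of gap the text above describes as a signal for where to generate more QA pairs.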

🚀 Quick Start

Try it on the OpenXLab Application Center, and see the FAQ for common questions.

Gradio Demo

```
python webui/app.py
```

Run from PyPI

Install GraphGen
```
pip install graphg
```

Run in CLI

```
SYNTHESIZER_MODEL=your_synthesizer_model_name \
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
TRAINEE_MODEL=your_trainee_model_name \
TRAINEE_BASE_URL=your_base_url_for_trainee_model \
TRAINEE_API_KEY=your_api_key_for_trainee_model \
graphg --output_dir cache
```

Run from Source

Install dependencies
```
pip install -r requirements.txt
```

Configure the environment

Create a .env file in the root directory
```
cp .env.example .env
```

Set the following environment variables:

```
# Synthesizer is the model used to construct the KG and generate data
SYNTHESIZER_MODEL=your_synthesizer_model_name
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
# Trainee is the model to be trained on the generated data
TRAINEE_MODEL=your_trainee_model_name
TRAINEE_BASE_URL=your_base_url_for_trainee_model
TRAINEE_API_KEY=your_api_key_for_trainee_model
```
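Before launching a run, it can help to confirm that all six variables are actually set. The sketch below is a standalone check, not part of GraphGen: `parse_env_text` and `missing_vars` are hypothetical helpers, and the parser handles only plain `KEY=VALUE` lines and `#` comments (no quoting or variable expansion).

```python
from typing import Dict, List, Mapping

# Variable names taken from the list above.
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are absent or empty, in order."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example usage (assumes a .env file in the current directory):
#     with open(".env", encoding="utf-8") as f:
#         print(missing_vars(parse_env_text(f.read())))
```

Passing `os.environ` to `missing_vars` checks the current shell environment instead of the file. Note that this only detects missing or empty values; placeholder strings such as `your_api_key_for_trainee_model` would still pass.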

(Optional) To change the default generation settings, edit configs/graphgen_config.yaml.

```
# configs/graphgen_config.yaml
# Example configuration
data_type: "raw"
input_file: "resources/examples/raw_demo.jsonl"
# more configurations...
```
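If you point `input_file` at your own `.jsonl` file, a quick syntax check can catch malformed lines before a long run. The sketch below assumes only that the file follows the JSON Lines convention of one JSON value per line. The record schema GraphGen expects is not described here, so the check does not inspect field names, and `check_jsonl` is a hypothetical helper.

```python
import json
from typing import Iterable, List, Tuple


def check_jsonl(lines: Iterable[str]) -> Tuple[int, List[int]]:
    """Return (count of parseable records, 1-based numbers of bad lines).

    Blank lines are skipped. Any valid JSON value counts as a record;
    this does not verify that records are objects or have specific keys.
    """
    valid = 0
    bad: List[int] = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(lineno)
        else:
            valid += 1
    return valid, bad


# Example usage with the path from the sample configuration:
#     with open("resources/examples/raw_demo.jsonl", encoding="utf-8") as f:
#         print(check_jsonl(f))
```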

Run the generation script
```
bash scripts/generate.sh
```

Get the generated data
```
ls cache/data/graphgen
```

📌 Latest Updates

2025.04.21: We have released the initial version of GraphGen.

🏗️ System Architecture

See the analysis by DeepWiki for a technical overview of the GraphGen system, its architecture, and core functionalities.

Workflow

🍀 Acknowledgements

SiliconCloud: abundant LLM APIs, some models are free
LightRAG: simple and efficient graph retrieval solution
ROGRAG: a robustly optimized GraphRAG framework

📚 Citation

If you find this repository useful, please consider citing our work:

```
@software{Chen_GraphGen_2025,
  author = {Chen, Zihong and Jiang, Wanli and Li, Jingzhe and Yuan, Zhonghang and Wang, Chenyang and Kong, Huanjun and Dong, Nanqing},
  month = apr,
  title = {{GraphGen}},
  url = {https://github.com/open-sciencelab/GraphGen},
  year = {2025}
}
```

📜 License

This project is licensed under the Apache License 2.0.
