Skip to content

This repository contains experiments on comparing the similarity of Python repositories using ML models.

License

Notifications You must be signed in to change notification settings

RepoAnalysis/RepoSim

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

58 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RepoSim

An approach to detect semantically similar python repositories using pre-trained language models.

About

This repository contains the notebooks and scripts conducted for our approach to detect semantically similar python repositories using pre-trained language models.

Currently our best performing model is UniXCoder fine-tuned on code search task with AdvTest dataset. For evaluations of different language models on repository similarity comparison, please refer to this Jupyter notebook: notebooks/BiEncoder/Embeddings_evaluation.ipynb

More details on our approach's implementations and applications can be found under the scripts folder.

Applications

RepoSnipy is a neural search engine for discoving similar Python repositories on GitHub, powered by RepoSim. Please feel free to give it a try!

Directory Structure

RepoSim
├── LICENSE
├── README.md
├── data
│   ├── df2txt.py  # Convert PoolC dataset for clone detection fine-tuning script
│   ├── repo_topic.json # Topic-Repos mapping
│   └── repo_topic.py  # Script to select repos from topics
├── notebooks
│   ├── BiEncoder
│   │   ├── Embeddings_evaluation.ipynb  # Evaluations for comparing different language models
│   │   ├── RepoSim.ipynb  # Our approach's implementation
│   │   └── UnixCoder_C4_Evaluation.ipynb
│   └── CrossEncoder
│       ├── Clone_Detection_C4_Evaluation.ipynb
│       ├── HungarianAlgorithm.ipynb  # Cross-encoder approaches for repo similarity comparison
│       └── keonalgorithms-TheAlgorithmsPython.csv  # Evaluation results by ungarianAlgorithm.ipynb
└── scripts
    ├── LICENSE
    ├── PlayGround.ipynb  # For experimenting with repo embeddings
    ├── README.md
    ├── pipeline.py  # Our approach's implementation as a HuggingFace pipeline
    ├── repo_sim.py
    └── requirements.txt

License

Distributed under the MIT License. See LICENSE for more information.

Acknowledgments

About

This repository contains experiments on comparing the similarity of Python repositories using ML models.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published