code_search/notebooks at master · hamelsmu/code_search · GitHub

diagram
1 - Preprocess Data.ipynb
2 - Train Function Summarizer With Keras + TF.ipynb
3 - Train Language Model Using FastAI.ipynb
4 - Train Model To Map Code Embeddings to Language Embeddings.ipynb
5 - Build Search Index.ipynb
README.md
fastai
general_utils.py
lang_model_utils.py
seq2seq_utils.py

README.md

Table of Contents

Each step in the above diagram corresponds to a Jupyter notebook in this repo. Below is a high-level description of each step:

1 - Preprocess Data: Describes how to get Python files from BigQuery and use the ast module to clean code and extract docstrings.
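
As a minimal sketch of the extraction part of this step, the standard-library `ast` module can parse a source file and pull out each function together with its docstring. The helper name `extract_function_docstring_pairs` and the sample source are assumptions made for illustration; the notebook's actual cleaning steps may differ.

```python
import ast


def extract_function_docstring_pairs(source):
    """Return (function_name, docstring) pairs for documented functions.

    Hypothetical helper for illustration; not taken from the notebook.
    """
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        # Methods are FunctionDef nodes nested in a ClassDef, so they are found too.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            docstring = ast.get_docstring(node)  # None when undocumented
            if docstring:
                pairs.append((node.name, docstring))
    return pairs


# Made-up sample input, standing in for a file fetched from BigQuery.
SAMPLE_SOURCE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def undocumented(x):
    return x
'''

pairs = extract_function_docstring_pairs(SAMPLE_SOURCE)
```

Functions without a docstring are skipped here because they provide no target text for the summarizer in step 2.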

2 - Train Function Summarizer: Build a sequence-to-sequence model that predicts a docstring given a Python function or method. The primary purpose of this model is transfer learning: later steps reuse it to extract features from code.

3 - Train Language Model: Build a language model on a corpus of docstrings using fastai. We will use this model for transfer learning to encode short phrases or sentences, such as docstrings and search queries.

4 - Train Code2Emb Model: Fine-tune the model from step 2 to predict vectors instead of docstrings. This model will be used to represent code in the same vector space as the sentence embeddings produced in step 3.
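
One way to read "predict vectors instead of docstrings" is as a regression problem: the model's output for a function is pushed toward the sentence embedding of that function's docstring. The sketch below shows such an objective for a single pair in plain Python. The choice of cosine distance and the function names are assumptions for illustration; the notebook may use a different loss.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def cosine_distance_loss(predicted, target):
    """Hypothetical training objective: 0 when the vectors point the same way."""
    return 1.0 - cosine_similarity(predicted, target)
```

Under this objective only the direction of the predicted vector matters, which matches a search setup that ranks results by cosine similarity.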

5 - Build Search Engine: Use the assets you created in steps 3 and 4 to create a semantic search tool.
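
Because code and queries end up in the same vector space, search reduces to a nearest-neighbor lookup. The following is a brute-force sketch under stated assumptions: the vectors and function names are made up, and a real index over many functions would more likely use an approximate nearest-neighbor library than a linear scan.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query_vector, code_vectors, top_k=2):
    """Return the names of the top_k code vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in code_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]


# Toy 3-dimensional embeddings; real embeddings come from the step 4 model.
CODE_VECTORS = {
    "read_csv_file": [0.9, 0.1, 0.0],
    "plot_histogram": [0.0, 0.2, 0.9],
    "parse_json_string": [0.7, 0.6, 0.1],
}

# Stand-in for a search query encoded by the step 3 language model.
results = search([1.0, 0.2, 0.0], CODE_VECTORS, top_k=2)
```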