This repository contains supplementary material for the paper "Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey", submitted to the ACM Computing Surveys journal.

Modern language models (LMs) have been successfully employed in source code generation and understanding, leading to a significant increase in research on learning-based code intelligence, such as automated bug repair and test case generation. Despite their great potential, language models for code intelligence (LM4Code) are susceptible to pitfalls that hinder realistic performance and, in turn, impact their reliability and applicability in real-world deployment. Such challenges drive the need for a comprehensive understanding: not just identifying these issues but delving into their possible implications and existing solutions to build more reliable language models tailored to code intelligence. Based on a well-defined systematic research approach, we conducted an extensive literature review to uncover the pitfalls inherent in LM4Code, identifying 67 primary studies from top-tier venues. After carefully examining these studies, we designed a taxonomy of pitfalls in LM4Code research and conducted a systematic study to summarize the issues, implications, current solutions, and challenges of different pitfalls for LM4Code systems. Our classification scheme dissects pitfalls across four crucial aspects: data collection and labeling, system design and learning, performance evaluation, and deployment and maintenance. Through this study, we aim to provide a roadmap that helps researchers and practitioners understand and use LM4Code in reliable and trustworthy ways.

Please feel free to send a pull request to add papers and relevant content that are not listed here. We have uploaded our complete paper lists, with detailed review information, to Google Drive.

(Review paper lists)

Papers

Surveys and Guidelines

  • Dos and Don'ts of Machine Learning in Computer Security (2022), USENIX Security, D Arp, et al. [pdf]
  • Machine/deep learning for software engineering: A systematic literature review (2022), TSE, Simin Wang, et al. [pdf]
  • Trustworthy AI: From principles to practices (2023), arXiv, BO Li, et al. [pdf]

Data Collection and Labeling

Unbalanced Distribution

  • Deep Learning Based Vulnerability Detection: Are We There Yet? (2021), arXiv, S Chakraborty, R Krishna, Y Ding, et al. [pdf]
  • Does data sampling improve deep learning-based vulnerability detection? Yeas! and Nays! (2023), ICSE, X Yang, et al. [pdf]
  • On the Value of Oversampling for Deep Learning in Software Defect Prediction (2021), TSE, R Yedida, T Menzies. [pdf]
  • Robust Learning of Deep Predictive Models from Noisy and Imbalanced Software Engineering Datasets (2022), ASE, Z Li, et al. [pdf]
  • An empirical study of deep learning models for vulnerability detection (2023), arXiv, B Steenhoek, et al. [pdf]

Label Errors

  • Robust Learning of Deep Predictive Models from Noisy and Imbalanced Software Engineering Datasets (2022), ASE, Z Li, et al. [pdf]
  • XCode: Towards Cross-Language Code Representation with Large-Scale Pre-Training (2022), TOSEM, Z Lin, et al. [pdf]
  • Understanding and Tackling Label Errors in Deep Learning-Based Vulnerability Detection (Experience Paper) (2023), ISSTA, X Nie, et al. [pdf]

Data Noise

  • Slice-Based Code Change Representation Learning (2023), SANER, F Zhang, et al. [pdf]
  • Are we building on the rock? on the importance of data preprocessing for code summarization (2022), FSE, L Shi, et al. [pdf]
  • Neural-Machine-Translation-Based Commit Message Generation: How Far Are We? (2018), ASE, Z Liu, et al. [pdf]

System Design and Learning

Data Snooping

  • AutoTransform: automated code transformation to support modern code review process (2022), ICSE, Thongtanunam, Patanamon, Chanathip Pornprasit, and Chakkrit Tantithamthavorn. [pdf]
  • Can Neural Clone Detection Generalize to Unseen Functionalities? (2021), ASE, C Liu, et al. [pdf]
  • CD-VulD: Cross-Domain Vulnerability Discovery Based on Deep Domain Adaptation (2020), TDSC, S Liu, et al. [pdf]
  • Deep just-in-time defect prediction: how far are we? (2021), ISSTA, Z Zeng, et al. [pdf]
  • Patching as translation: the data and the metaphor (2020), ASE, Y Ding, et al. [pdf]
  • An empirical study of deep learning models for vulnerability detection (2023), ICSE, B Steenhoek, et al. [pdf]
  • Keeping Pace with Ever-Increasing Data: Towards Continual Learning of Code Intelligence Models (2023), ICSE, S Gao, et al. [pdf]
  • Revisiting Learning-based Commit Message Generation (2023), ICSE, J Dong, Y Lou, D Hao, et al. [pdf]
  • Syntax and Domain Aware Model for Unsupervised Program Translation (2023), ICSE, F Liu, J Li, L Zhang. [pdf]
  • How Effective Are Neural Networks for Fixing Security Vulnerabilities (2023), ISSTA, Y Wu, N Jiang, HV Pham, et al. [pdf]
  • Towards More Realistic Evaluation for Neural Test Oracle Generation (2023), ISSTA, Z Liu, K Liu, X Xia, et al. [pdf]
  • On the Evaluation of Neural Code Summarization (2022), ICSE, E Shi, Y Wang, L Du, et al. [pdf]

Spurious Correlations

  • Deep Learning Based Vulnerability Detection: Are We There Yet? (2021), TSE, S Chakraborty, R Krishna, Y Ding, et al. [pdf]
  • Diet code is healthy: simplifying programs for pre-trained models of code (2022), FSE, Z Zhang, H Zhang, B Shen, et al. [pdf]
  • Explaining mispredictions of machine learning models using rule induction (2021), FSE, J Cito, I Dillig, S Kim, et al. [pdf]
  • Interpreting Deep Learning-based Vulnerability Detector Predictions Based on Heuristic Searching (2021), TOSEM, D Zou, Y Zhu, S Xu, et al. [pdf]
  • Thinking Like a Developer? Comparing the Attention of Humans with Neural Models of Code (2021), ASE, M Paltenghi, M Pradel. [pdf]
  • Vulnerability detection with fine-grained interpretations (2021), FSE, Y Li, S Wang, TN Nguyen. [pdf]
  • What do they capture? a structural analysis of pre-trained language models for source code (2022), ICSE, Y Wan, W Zhao, H Zhang, et al. [pdf]
  • An empirical study of deep learning models for vulnerability detection (2023), ICSE, B Steenhoek, MM Rahman, R Jiles, et al. [pdf]
  • Towards Efficient Fine-Tuning of Pre-trained Code Models: An Experimental Study and Beyond (2023), ISSTA, E Shi, Y Wang, H Zhang, et al. [pdf]

Inappropriate Model Design

  • Deep Learning Based Vulnerability Detection: Are We There Yet? (2021), TSE, S Chakraborty, R Krishna, Y Ding, et al. [pdf]
  • Enhancing DNN-Based Binary Code Function Search With Low-Cost Equivalence Checking (2022), TSE, H Wang, P Ma, Y Yuan, et al. [pdf]
  • Improving automatic source code summarization via deep reinforcement learning (2018), ASE, Y Wan, Z Zhao, M Yang, et al. [pdf]
  • Patching as translation: the data and the metaphor (2020), ASE, Y Ding, B Ray, P Devanbu, et al. [pdf]
  • Reinforcement-Learning-Guided Source Code Summarization Using Hierarchical Attention (2020), TSE, W Wang, Y Zhang, Y Sui, et al. [pdf]
  • XCode: Towards Cross-Language Code Representation with Large-Scale Pre-Training (2022), TOSEM, Z Lin, G Li, J Zhang, et al. [pdf]
  • RepresentThemAll: A Universal Learning Representation of Bug Reports (2023), ICSE, S Fang, T Zhang, Y Tan, et al. [pdf]
  • Template-based Neural Program Repair (2023), ICSE, X Meng, X Wang, H Zhang, et al. [pdf]

Performance Evaluation

Inappropriate Baseline

  • Towards More Realistic Evaluation for Neural Test Oracle Generation (2023), arXiv, Z Liu, K Liu, X Xia, et al. [pdf]

Inappropriate Evaluation Dataset

  • Deep Learning Based Program Generation From Requirements Text: Are We There Yet? (2020), TSE, H Liu, M Shen, J Zhu, et al. [pdf]
  • Generating realistic vulnerabilities via neural code editing: an empirical study (2022), FSE, Y Nong, Y Ou, M Pradel, et al. [pdf]

Low Reproducibility

  • An extensive study on pre-trained models for program understanding and generation (2022), ISSTA, Z Zeng, H Tan, H Zhang, et al. [pdf]

Inappropriate Performance Measures

  • Deep Learning Based Vulnerability Detection: Are We There Yet? (2021), TSE, S Chakraborty, R Krishna, Y Ding, et al. [pdf]
  • Improving automatic source code summarization via deep reinforcement learning (2018), ASE, Y Wan, Z Zhao, M Yang, et al. [pdf]
  • Multi-task learning based pre-trained language model for code completion (2020), ASE, F Liu, G Li, Y Zhao, et al. [pdf]
  • On the Value of Oversampling for Deep Learning in Software Defect Prediction (2021), TSE, R Yedida, T Menzies. [pdf]
  • Patching as translation: the data and the metaphor (2020), ASE, Y Ding, B Ray, P Devanbu, et al. [pdf]
  • Reinforcement-Learning-Guided Source Code Summarization Using Hierarchical Attention (2020), TSE, W Wang, Y Zhang, Y Sui, et al. [pdf]
  • SynShine: Improved Fixing of Syntax Errors (2022), TSE, Ahmed T, Ledesma N R, Devanbu P. [pdf]
  • An empirical study of deep learning models for vulnerability detection (2023), ICSE, B Steenhoek, MM Rahman, R Jiles, et al. [pdf]
  • Revisiting Learning-based Commit Message Generation (2023), ICSE, J Dong, Y Lou, D Hao, et al. [pdf]
  • Tare: Type-Aware Neural Program Repair (2023), ICSE, Q Zhu, Z Sun, W Zhang, et al. [pdf]
  • How Effective Are Neural Networks for Fixing Security Vulnerabilities (2023), ISSTA, Y Wu, N Jiang, HV Pham, et al. [pdf]
  • Towards More Realistic Evaluation for Neural Test Oracle Generation (2023), ISSTA, Z Liu, K Liu, X Xia, et al. [pdf]
  • GitHub Copilot AI pair programmer: Asset or Liability? (2023), JSS, AM Dakhel, V Majdinasab, A Nikanjam, et al. [pdf]

Deployment and Maintenance

Real-World Constraints

  • Examining Zero-Shot Vulnerability Repair with Large Language Models (2023), S&P, H Pearce, B Tan, B Ahmad, et al. [pdf]
  • A Performance-Sensitive Malware Detection System Using Deep Learning on Mobile Devices (2020), TIFS, R Feng, S Chen, X Xie, et al. [pdf]
  • Diet code is healthy: simplifying programs for pre-trained models of code (2022), FSE, Z Zhang, H Zhang, B Shen, et al. [pdf]
  • When Code Completion Fails: A Case Study on Real-World Completions (2019), ICSE, VJ Hellendoorn, S Proksch, HC Gall, et al. [pdf]
  • Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants (2023), arXiv, G Sandoval, H Pearce, T Nys, et al. [pdf]
  • Grounded Copilot: How Programmers Interact with Code-Generating Models (2023), OOPSLA1, S Barke, MB James, N Polikarpova. [pdf]
  • LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning (2023), arXiv, J Lu, L Yu, X Li, et al. [pdf]
  • Compressing Pre-trained Models of Code into 3 MB (2022), ASE, J Shi, Z Yang, B Xu, et al. [pdf]

Attack Threats

  • You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion (2021), USENIX Security, R Schuster, C Song, E Tromer, et al. [pdf]
  • Adversarial Robustness of Deep Code Comment Generation (2022), TOSEM, Y Zhou, X Zhang, J Shen, et al. [pdf]
  • An extensive study on pre-trained models for program understanding and generation (2022), ISSTA, Z Zeng, H Tan, H Zhang, et al. [pdf]
  • Generating Adversarial Examples for Holding Robustness of Source Code Processing Models (2020), AAAI, H Zhang, Z Li, G Li, et al. [pdf]
  • Semantic Robustness of Models of Source Code (2020), SANER, G Ramakrishnan, J Henkel, Z Wang, et al. [pdf]
  • You see what I want you to see: poisoning vulnerabilities in neural code search (2022), FSE, Y Wan, S Zhang, H Zhang, et al. [pdf]
  • Contrabert: Enhancing code pre-trained models via contrastive learning (2023), ICSE, S Liu, B Wu, X Xie, et al. [pdf]
  • On the robustness of code generation techniques: An empirical study on github copilot (2023), ICSE, A Mastropaolo, L Pascarella, E Guglielmi, et al. [pdf]
  • Two sides of the same coin: Exploiting the impact of identifiers in neural code comprehension (2023), ICSE, S Gao, C Gao, C Wang, et al. [pdf]
  • Multi-target Backdoor Attacks for Code Pre-trained Models (2023), ACL, Y Li, S Liu, K Chen, et al. [pdf]
  • Backdooring Neural Code Search (2023), ACL, W Sun, Y Chen, G Tao, et al. [pdf]
  • ReCode: Robustness Evaluation of Code Generation Models (2022), ACL, S Wang, Z Li, H Qian, et al. [pdf]
  • Natural Attack for Pre-trained Models of Code (2022), ICSE, Z Yang, J Shi, J He, et al. [pdf]
  • Coprotector: Protect open-source code against unauthorized training usage with data poisoning (2022), WWW, Z Sun, X Du, F Song, et al. [pdf]
  • On the Security Vulnerabilities of Text-to-SQL Models (2023), ISSRE, X Peng, Y Zhang, J Yang, et al. [pdf]

Security Concerns in Generated Code

  • Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions (2022), S&P, H Pearce, B Ahmad, B Tan, et al. [pdf]
  • Automated repair of programs from large language models (2023), ICSE, Z Fan, X Gao, M Mirchev, et al. [pdf]
  • Cctest: Testing and repairing code completion systems (2023), ICSE, Z Li, C Wang, Z Liu, et al. [pdf]
  • Analyzing Leakage of Personally Identifiable Information in Language Models (2023), S&P, N Lukas, A Salem, R Sim, et al. [pdf]
  • CodexLeaks: Privacy Leaks from Code Generation Language Models in GitHub Copilot (2023), USENIX Security, L Niu, S Mirza, Z Maradni, et al. [pdf]

Public Tools

  • ACCENT (Adversarial Code Comment gENeraTor)

    • masked training
    • an identifier substitution approach to craft adversarial code snippets, which are syntactically correct and semantically close to the original code snippet but may mislead the DNNs into producing completely irrelevant code comments
    • TOSEM 2022
    • [pdf], [code]
  • AutoTransform

    • a Transformer-based NMT architecture to handle long sequences
    • addresses the out-of-vocabulary problem via BPE (Byte-Pair Encoding)
    • ICSE 2022
    • [pdf], [code]
  • Functionality-generalization

    • training set with diverse functionalities, out-of-vocabulary problem, incorporating locality into the model architecture
    • ASE 2021
    • [pdf], [code]
  • CD-VulD

    • a new system for Cross Domain Software Vulnerability Discovery using deep learning (DL) and domain adaptation (DA).
    • learn cross-domain representations
    • TDSC 2022
    • [pdf]
  • DietCode

    • simplifies programs by pruning less-informative tokens from model inputs
    • aims at lightweight leverage of large pre-trained models for source code
    • FSE 2022
    • [pdf], [code]
  • BinUSE

    • a practical and efficient equivalence check, using under-constrained symbolic execution (USE)
    • TSE 2023
    • [pdf]
  • mmd

    • model-agnostic explanation
    • The output of the technique can be useful for understanding limitations of the training data or the model itself
    • FSE 2021
    • [pdf], [code]
  • Metropolis-Hastings Modifier (MHM)

    • Adversarial Training
    • generates adversarial examples for DL models specialized for source code processing
    • AAAI 2020
    • [pdf], [code]
  • ReVeal

  • ghost-dl

    • an oversampling method
    • a non-DL technique (artificially generating members of a minority class prior to running a learner) that dramatically improves deep learning
    • TSE 2021
    • [pdf], [code]
  • HAN

    • Hierarchical Attention Network
    • multiple structural code features (including control flow graph and AST) to reflect the code hierarchy, a two-layer attention network (a token layer and a statement layer)
    • TSE 2022
    • [pdf]
  • RobustTrainer

    • learns deep predictive models on raw training datasets where mislabelled samples and imbalanced classes coexist
    • ASE 2022
    • [pdf], [code]
  • SYNSHINE

    • input with compiler errors, large neural model leveraging unsupervised pre-training, multi-label classification
    • TSE 2023
    • [pdf], [code]
  • CARROTA

    • adversarial training and detection, an optimization-based attack technique
    • TOSEM 2022
    • [pdf], [code]
  • CAT

  • apr4codex

    • APR / prompts for repair, APR techniques fix the incorrect solutions produced by language models in LeetCode contests.
    • ICSE 2023
    • [pdf], [code]
  • CCTEST

    • tests and repairs code completion systems in a black-box setting
    • ICSE 2023
    • [pdf], [HomePage]
  • ContraBERT

    • an approach that aims to improve the robustness of pre-trained models via contrastive learning
    • ICSE 2023
    • [pdf], [HomePage]
  • REPEAT

    • a novel method for continual learning of code intelligence models
    • ICSE 2023
    • [pdf], [code]
  • RepresentThemAll

    • a pre-trained approach that can learn the universal representation of bug reports and handle multiple downstream tasks
    • ICSE 2023
    • [pdf], [code]
  • TENURE

    • Template-based Neural Program Repair, which simultaneously absorbs the advantages of template-based and NMT-based APR methods
    • ICSE 2023
    • [pdf], [code]
  • CREAM

    • A counterfactual reasoning-based framework, multi-task learning and counterfactual inference
    • ICSE 2023
    • [pdf], [code]
  • Telly

    • four probing tasks related to lexical, syntactic, semantic, and structural code properties
    • Telly-K (efficiently fine-tunes pre-trained code models via selective layer freezing)
    • ISSTA 2023
    • [pdf], [code]
  • TEval+

    • a more realistic evaluation method for neural test oracle generation (NTOG), together with seven rules of thumb for moving NTOG approaches toward practical usage
    • ISSTA 2023
    • [pdf], [code]
  • analysing_pii_leakage

    • rigorous game-based definitions for three types of PII leakage via black-box extraction, inference, and reconstruction attacks with only API access to an LM
    • S&P 2023
    • [pdf], [code]
  • CodexLeaks

    • a semi-automated filtering method using a blind membership inference attack
    • USENIX Security 2023
    • [pdf], [code]
  • CoProtector

    • uses data poisoning techniques to arm source code repositories against unauthorized training usage
    • WWW 2022
    • [pdf], [code]
  • LLaMA-Reviewer

    • an innovative framework that leverages the capabilities of LLaMA, a popular LLM, in the realm of code review
    • ISSRE 2023
    • [pdf], [code]
  • Compressor

    • a novel approach that can compress the pre-trained models of code into extremely small models with negligible performance sacrifice
    • ASE 2022
    • [pdf], [code]
  • NNGen

    • a simpler and faster approach to generate concise commit messages using the nearest neighbor algorithm
    • ASE 2018
    • [pdf], [code]
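The retrieval idea behind NNGen, the last tool above, can be sketched in a few lines: represent each training diff as a bag-of-words vector, find the training diff most similar to the query diff by cosine similarity, and reuse its commit message verbatim. This is a minimal illustration only; the actual NNGen additionally re-ranks top candidates (e.g., by BLEU), and the function names here are hypothetical.

```python
from collections import Counter
from math import sqrt

def bag_of_words(text):
    """Term-frequency vector over whitespace tokens."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nngen(query_diff, train_diffs, train_msgs):
    """Reuse the commit message of the most similar training diff."""
    vectors = [bag_of_words(d) for d in train_diffs]
    q = bag_of_words(query_diff)
    best = max(range(len(vectors)), key=lambda i: cosine(q, vectors[i]))
    return train_msgs[best]

# Toy usage with two training examples.
train_diffs = ["+ add null check in parser", "+ bump version to 1.2"]
train_msgs = ["fix NPE in parser", "release 1.2"]
print(nngen("+ add null check in lexer", train_diffs, train_msgs))
# → fix NPE in parser
```

The appeal of this baseline is that it involves no training at all, which is why it is a useful sanity check against neural commit-message generators.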

New model

  • Actor-critic network

    • incorporates an abstract syntax tree structure as well as the sequential content of code snippets into a deep reinforcement learning framework
    • ASE 2018
    • [pdf]
  • CugLM

    • a multi-task learning based pre-trained language model for code understanding and code generation with a Transformer-based neural architecture
    • ASE 2020
    • [pdf]
  • SDA-Trans

    • adversarial training, unsupervised training
    • a syntax and domain-aware model for program translation, which leverages the syntax structure and domain knowledge to enhance the cross-lingual transfer ability
    • ICSE 2023
    • [pdf]
  • Tare

    • a type-aware model for neural program repair to learn the typing rules
    • ICSE 2023
    • [pdf], [code]

XAI

  • Interpreting Deep Learning-based Vulnerability Detector Predictions Based on Heuristic Searching

    • a framework for interpreting predictions of deep learning-based vulnerability detectors
    • The framework centers on identifying a small number of tokens that make important contributions to a particular prediction. Its novelty: (1) it does not assume the detector's local decision boundary is linear; (2) it does not assume features are independent of each other, but instead embraces associations between features when searching for important ones; (3) it searches for important features by perturbing examples, considering feature combinations rather than individual features.
    • TOSEM 2021
    • [pdf]
  • Thinking Like a Developer? Comparing the Attention of Humans with Neural Models of Code

    • A methodology for recording human attention
    • ASE 2021
    • [pdf], [code]
  • Vulnerability detection with fine-grained interpretations

    • IVDetect, an interpretable vulnerability detector with the philosophy of using Artificial Intelligence (AI) to detect vulnerabilities
    • FSE 2021
    • [pdf], [code]
  • What do they capture? a structural analysis of pre-trained language models for source code

    • NaturalCC, a sequence modeling toolkit that allows researchers and developers to train custom models for many software engineering tasks
    • ICSE 2022
    • [pdf], [code]
  • ReCode: Robustness Evaluation of Code Generation Models

    • a Robustness Evaluation framework for Code, aiming to provide comprehensive assessment for robustness of code generation models
    • define robustness metrics based on over 30 transformations for code on docstrings, function and variable names, code syntax, and code format
    • ACL 2023
    • [pdf], [code]
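ReCode's transformations above include renaming functions and variables while preserving program behavior. A minimal sketch of one such semantics-preserving perturbation, identifier renaming via Python's `ast` module (requires Python 3.9+ for `ast.unparse`); the `RenameVars` class and the variable mapping are illustrative, not ReCode's actual implementation, and a real tool would also guard against capturing builtins or globals.

```python
import ast

class RenameVars(ast.NodeTransformer):
    """Rename selected identifiers while preserving semantics."""
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Covers both reads (Load) and writes (Store) of a variable.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Covers function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

def perturb(source, mapping):
    """Return a behaviorally equivalent program with renamed identifiers."""
    tree = RenameVars(mapping).visit(ast.parse(source))
    return ast.unparse(tree)

src = (
    "def total(xs):\n"
    "    acc = 0\n"
    "    for x in xs:\n"
    "        acc += x\n"
    "    return acc"
)
# Renames acc -> var_0 and x -> var_1; total([1, 2, 3]) still returns 6.
perturbed = perturb(src, {"acc": "var_0", "x": "var_1"})
```

A robustness evaluation then measures how often a model's prediction (a summary, a vulnerability label, a completion) changes between `src` and `perturbed`, even though the two programs are equivalent.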

New benchmark

  • Deep Learning Based Program Generation From Requirements Text: Are We There Yet?

    • [ReCa], a large-scale dataset composed of longer requirements as well as validated implementations
    • TSE 2022
    • [pdf]
  • Deep Learning Based Vulnerability Detection: Are We There Yet?

  • Generating realistic vulnerabilities via neural code editing: an empirical study

  • ReCode: Robustness Evaluation of Code Generation Models

    • robustness evaluation metrics for code-generation tasks: Robust Pass_s@k, Robust Drop_s@k, and Robust Relative_s@k
    • ACL 2023
    • [pdf]
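These robust metrics build on the standard unbiased pass@k estimator: given n generated samples of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch below; the "robust drop" calculation is only an illustration of comparing original versus perturbed prompts, not ReCode's exact definition.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n generations, c of which are
    correct, passes the tests."""
    if n - c < k:
        # Too few incorrect samples to fill k draws: success is certain.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustration: compare pass@1 on original vs. perturbed prompts.
orig = pass_at_k(n=10, c=4, k=1)       # 0.4
perturbed = pass_at_k(n=10, c=2, k=1)  # 0.2
drop = (orig - perturbed) / orig       # 0.5, i.e. a 50% relative drop
```

The closed form avoids the high variance of naively sampling k generations and checking whether any passes.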

Research Groups

Venues

Conferences

  • AI Domain
    • AAAI, the Association for the Advancement of Artificial Intelligence
    • ACL, the Association for Computational Linguistics
  • SE Domain
    • ICSE, the International Conference on Software Engineering
    • FSE, Symposium on the Foundations of Software Engineering
    • ASE, the International Conference on Automated Software Engineering
    • ISSTA, the International Symposium on Software Testing and Analysis
    • ISSRE, IEEE International Symposium on Software Reliability Engineering
    • SANER, IEEE International Conference on Software Analysis, Evolution, and Reengineering
    • OOPSLA, the ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications
  • Security Domain
    • S&P, IEEE Symposium on Security and Privacy
    • USENIX Security, USENIX Security Symposium
  • Internet and Web technology Domain
    • WWW, International World Wide Web Conference

Journals

  • SE Domain
    • TSE, the IEEE Transactions on Software Engineering
    • TOSEM, ACM Transactions on Software Engineering and Methodology
    • JSS, Journal of Systems and Software
  • Security Domain
    • TDSC, IEEE Transactions on Dependable and Secure Computing
    • TIFS, IEEE Transactions on Information Forensics and Security