CodeOrNot: does a repository contain software source code, or not?


CASICS CodeOrNot

CASICS CodeOrNot is a Python 3 package implementing heuristic methods for determining whether a file or directory contains software source code (or not). The analysis is oriented towards detecting code: a repository containing a mix of documents and even one source code file will be considered to contain code. Conversely, if there is no sign of source code, the repository will be labeled as "not" containing code.

Authors: Michael Hucka
Repository: https://github.com/casics/codeornot
License: Unless otherwise noted, this content is licensed under the GPLv3 license.

☀ Introduction

In performing source code repository analysis for classification tasks, a basic first step is to decide whether a source code repository actually contains code. Some repositories contain documents or other files and are not actually repositories for software; those are cases that a system for analyzing source code could skip. CodeOrNot is a Python 3 package that uses heuristics to answer the question "does it contain code, or not?"

Some cases are quite easy to decide: if a collection of files contains even one .c file, it can reasonably be assumed to contain C code, and thus the answer returned by CodeOrNot will be "code". Other cases are more difficult. For example, files may contain code but lack file name extensions, so determining whether they contain code requires examining their content. Still other cases fall into gray zones: should a repository containing LaTeX files and a single Makefile be considered to contain code? After all, a Makefile can contain code, so does that count? (The position taken by CodeOrNot is no: a single Makefile is not enough to consider the repository to be a code repository.)
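The easy case described above can be illustrated with a minimal sketch. This is not CodeOrNot's implementation or API: the function name `looks_like_code` and the extension set are hypothetical, chosen only to show the "one source file is enough, a lone Makefile is not" position.

```python
import os

# Hypothetical, deliberately tiny rule set for illustration only;
# CodeOrNot's real built-in lists are not reproduced here.
CODE_EXTENSIONS = {".c", ".h", ".py", ".java", ".js"}


def looks_like_code(filenames):
    """Return True if any file name carries a known source-code extension.

    A Makefile has no extension, so under this rule it never counts
    as evidence of code by itself.
    """
    for name in filenames:
        _, ext = os.path.splitext(os.path.basename(name))
        if ext.lower() in CODE_EXTENSIONS:
            return True
    return False


# One .c file among documents is enough to answer "code".
mixed = ["README.md", "docs/paper.pdf", "src/main.c"]
# LaTeX sources plus a single Makefile are treated as "not code".
latex_only = ["paper.tex", "refs.bib", "Makefile"]

print(looks_like_code(mixed))
print(looks_like_code(latex_only))
```

A name-based check like this cannot handle the harder case of files without extensions, which requires inspecting file content.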

CodeOrNot also provides some simple utility modules that may be useful in other contexts:

  • The textcheck module provides functions such as majority_language(), which takes a list of text strings and reports the most likely human language in which the text strings are written. (It does this by using a combination of ftfy, cld2, and a majority vote.)

  • The codecheck module provides functions such as code_filename() and noncode_filename(), which can be used to infer whether a file is likely to be code or noncode based on its name. These work by using built-in lists of file name rules.
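The majority-vote idea mentioned for majority_language() can be sketched as follows. This is an illustrative stand-in, not the textcheck module's code: `majority_label` and `toy_detect` are hypothetical names, and the toy detector replaces the ftfy and cld2 steps with a trivial keyword check so that the example runs with the standard library alone.

```python
from collections import Counter


def majority_label(texts, detect):
    """Apply `detect` to each string and return the most common label.

    Strings for which `detect` returns None are ignored; the result is
    None when no string yields a label.
    """
    labels = [detect(text) for text in texts]
    labels = [label for label in labels if label is not None]
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


def toy_detect(text):
    """Hypothetical stand-in for a real language detector."""
    words = set(text.lower().split())
    if words & {"the", "and", "is"}:
        return "en"
    if words & {"le", "la", "et", "est"}:
        return "fr"
    return None


samples = ["the cat is here", "le chat est ici", "this is the end"]
print(majority_label(samples, toy_detect))
```

Voting over many strings makes the overall answer less sensitive to an individual string that is too short or too mixed to classify reliably.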

⁇ Getting help and support

If you find an issue, please submit it in the GitHub issue tracker for this repository.

♬ Contributing — info for developers

A lot remains to be done on CASICS in many areas. We would be happy to receive your help and participation if you are interested. Please feel free to contact the developers either via GitHub or the mailing list casics-team@googlegroups.com.

Everyone is asked to read and respect the code of conduct when participating in this project.

❤️ Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant Number 1533792 (Principal Investigator: Michael Hucka). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


             
