Reduce database transactions during analysis #35

Open
brianwarner opened this issue Apr 1, 2019 · 0 comments
The analysis portion of a facade-worker.py run is very database intensive. When designing the analysis functions, I wanted to be able to log neurotically, so data is stashed as soon as it is computed; that way, if/when facade-worker.py fails, very little data needs to be recalculated. (FWIW, recovery from an unplanned exit isn't an issue: facade-worker.py just sees where it needs to pick up and resumes from there.)

However, when a commit touches a lot of files or there are a lot of commits in an analysis, the repeated database access can really slow things down. The mysqldb module is supposed to be the fastest connector, but at a certain point the sheer volume of transactions becomes the bottleneck.

One potential solution would be to accumulate analyzed data in a temporary in-memory database, and then write everything out to storage in one big transaction at the end of a repo analysis. This should have minimal impact on short runs, but potentially a much larger impact on long runs.
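A minimal sketch of that approach, using Python's stdlib sqlite3 as a stand-in for both the in-memory accumulator and the persistent store (Facade actually targets MySQL, and the table/column names here are hypothetical, not Facade's schema):

```python
import sqlite3

def analyze_repo(commits, storage_path):
    """Accumulate per-file analysis rows in memory, then flush them
    to persistent storage in one big transaction at the end.

    Hypothetical illustration: `commits` is an iterable of
    (commit_hash, filename, added, removed) tuples.
    """
    # Temporary in-memory accumulator: inserts here are cheap,
    # replacing per-file round trips to the persistent database.
    cache = sqlite3.connect(":memory:")
    cache.execute(
        "CREATE TABLE analysis "
        "(commit_hash TEXT, filename TEXT, added INT, removed INT)"
    )

    for commit_hash, filename, added, removed in commits:
        cache.execute(
            "INSERT INTO analysis VALUES (?, ?, ?, ?)",
            (commit_hash, filename, added, removed),
        )

    # One transaction against persistent storage at the end of the
    # repo analysis; rolls back as a unit if the flush fails.
    storage = sqlite3.connect(storage_path)
    storage.execute(
        "CREATE TABLE IF NOT EXISTS analysis "
        "(commit_hash TEXT, filename TEXT, added INT, removed INT)"
    )
    rows = cache.execute("SELECT * FROM analysis").fetchall()
    with storage:  # commits once on success
        storage.executemany(
            "INSERT INTO analysis VALUES (?, ?, ?, ?)", rows
        )
    cache.close()
    storage.close()
    return len(rows)
```

The trade-off noted above applies: for a short run the extra copy at the end costs more than it saves, but a long run replaces thousands of individual commits with one.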

One other massive advantage of reducing database transactions is that it could give us the option to use the pure-Python MySQL library, pymysql. In my tests pymysql is considerably slower than mysqldb for an individual transaction, but pymysql is necessary if we want to use PyPy. In past testing, PyPy runs were slower, which is counterintuitive; the best explanation I can think of is that PyPy was hamstrung by the sheer number of database transactions. There will be a push/pull performance tension here, but there's a pretty good chance that if we optimize database transactions, the gains from PyPy will make up for pymysql's per-transaction latency.

@brianwarner brianwarner self-assigned this Apr 1, 2019
@brianwarner brianwarner added this to the Next major release milestone Apr 1, 2019