The analysis portion of a facade-worker.py run is very database intensive, for a number of reasons. When designing the analysis functions, I wanted to be able to log neurotically and to stash data as soon as it was computed, so that if/when facade-worker.py failed, very little data would need to be recalculated. (FWIW, recovery from an unplanned exit isn't an issue, as facade-worker.py just sees where it needs to pick up and resumes from there.)
However, when a commit has a lot of files or there are a lot of commits in an analysis, the repeated database access can really slow things down. The mysqldb module is supposed to be the fastest connector, but at a certain point the sheer volume of transactions becomes the issue.
One potential solution would be to accumulate analyzed data in a temporary in-memory database, and then write everything out to storage in one big transaction at the end of a repo analysis. This should have minimal impact on short runs, but potentially a much larger impact on long runs.
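For illustration, here's a minimal sketch of that idea using the simplest possible accumulator (a plain Python list) and a single `executemany()` at the end of the repo analysis. The table and column names below are hypothetical, not Facade's actual schema:

```python
import MySQLdb  # the mysqldb connector (mysqlclient)

class AnalysisBuffer:
    """Accumulates analyzed rows in memory, then flushes them in one transaction."""

    def __init__(self, db):
        self.db = db
        self.rows = []

    def add(self, commit_hash, filename, added, removed):
        # Stash the computed result instead of INSERTing it immediately.
        self.rows.append((commit_hash, filename, added, removed))

    def flush(self):
        # One big write at the end of the repo analysis.
        cursor = self.db.cursor()
        cursor.executemany(
            "INSERT INTO analysis_data "
            "(commit_hash, filename, added, removed) "
            "VALUES (%s, %s, %s, %s)",
            self.rows)
        self.db.commit()
        cursor.close()
        self.rows = []

# Hypothetical usage during a repo analysis (connection parameters are placeholders):
db = MySQLdb.connect(host="localhost", user="facade",
                     passwd="facade_password", db="facade")
buffer = AnalysisBuffer(db)
buffer.add("abc123", "README.md", 10, 2)
buffer.flush()
```

The same accumulate-then-flush pattern would work with an in-memory SQLite database as the staging area if we want the intermediate data to be queryable before it's written out.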
One other massive advantage of reducing database transactions is that it could give us the option to use the pure-Python MySQL library, pymysql. In my tests pymysql is considerably slower than mysqldb for an individual transaction, but pymysql is necessary if we want to use PyPy. In past testing PyPy runs were slower, which is counterintuitive. The best explanation I can think of is that PyPy was hamstrung by the number of database transactions. There will be a push/pull performance tension here, but there's a pretty good chance that if we optimize database transactions, the gains from PyPy will make up for pymysql's latency.
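If we do want to keep both connectors on the table, pymysql can stand in for mysqldb without changing the calling code. A sketch of that fallback (connection parameters are placeholders):

```python
# Prefer the C-based MySQLdb where available; fall back to the pure-Python
# pymysql (e.g. when running under PyPy, where MySQLdb won't build).
try:
    import MySQLdb
except ImportError:
    import pymysql
    pymysql.install_as_MySQLdb()  # makes "import MySQLdb" resolve to pymysql
    import MySQLdb

db = MySQLdb.connect(host="localhost", user="facade",
                     passwd="facade_password", db="facade")
```

This keeps the rest of facade-worker.py agnostic about which connector is actually installed.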