and then I loaded 300 projects / 6,815 repos into facade and tried building cache 🤣 #31

sgoggins opened this issue Mar 29, 2019 · 3 comments

@sgoggins

128 GB of RAM, solid state drives ... got all the repos and analysis_data loaded ... it's been 21 hours loading project_weekly_cache with no CPU usage, so I am guessing the database is eating disk ...

I’ve already made a set of modifications and database config notes on the sgoggins/facade:augur branch of my fork ... I’m thinking I can rewrite the query that loads the cache to go after one repository or project group at a time. Since this is a nominal, 4-hour thing for me (a very experienced database guy / formerly well-compensated Oracle DBA), I thought I would circle back and see whether you would approve a pull request that modularizes some of the functions in facade-worker.py into another Python file, or how you would recommend doing this.

The refactoring would change how the cache is built and add options for how it is executed. I'm thinking:

  1. Cache would not be rebuilt at the project level when "recache" is tagged automatically
  2. Cache at the project level would be rebuilt one project at a time
  3. I will explore a process of accumulating project level details from repository level cache, which may require some changes to the repository cache.
  4. Cache would be rebuilt at the repo level one repo at a time (see the sketch after this list).
  5. It would take a parameter that enables wholesale cache building at the project and repo level, as is the case today, for smaller-scale implementations.
  6. I will explore the potential to keep cache without destroying it on each recache
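
To make items 2 and 4 concrete, here is a rough sketch of the shape I have in mind. The table and column names are guesses at the Facade schema, and rebuild_repo_cache / rebuild_project_cache are placeholders for the existing cache-building queries in facade-worker.py:

```
# Sketch only: rebuild the caches one entity at a time instead of wholesale.
# Table/column names are guesses; rebuild_repo_cache and rebuild_project_cache
# stand in for the existing cache queries in facade-worker.py.

def rebuild_repo_cache(cursor, repo_id):
    pass  # the existing per-repo INSERT ... SELECT, restricted to this repo_id

def rebuild_project_cache(cursor, project_id):
    pass  # the existing per-project INSERT ... SELECT, restricted to this project_id

def recache_one_at_a_time(cursor):
    # Item 4: repo-level cache, one repo at a time keeps each sort small.
    cursor.execute("SELECT id FROM repos")
    for (repo_id,) in cursor.fetchall():
        rebuild_repo_cache(cursor, repo_id)

    # Item 2: project-level cache, one project at a time.
    cursor.execute("SELECT id FROM projects")
    for (project_id,) in cursor.fetchall():
        rebuild_project_cache(cursor, project_id)
```

Item 5 would then just be a flag that falls back to today's all-at-once behavior instead of this loop.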

What do you think, @brianwarner?

@sgoggins

I do think I can put the MySQL temp directory on another device and squeeze more out of the database, btw ... but I think I have reached a scale where db optimizations alone are not going to be sufficient. As an old DBA, I recognize that when I can't blame the network, I can definitely blame the app 🤣

@brianwarner

I continue to be astounded by how you're using this at scale, @sgoggins. Very, very impressive (and way more than I've ever tried!)

I would agree, it's probably the database that's the issue. If I had to point my finger to one suspect, it's the way that Facade neurotically stuffs data into the database at every opportunity. I had originally done it this way so that if it failed mid-run, not much would be lost.

That said, the sheer volume of database transactions creates a maaaaaaasive amount of overhead in big databases. I think there’s a fairly simple fix that’s low-hanging fruit, whereby the database transactions for a single repo are accumulated into a temporary in-memory database and then only pushed into the analysis_data table at the very end. In the case where facade-worker.py fails, you’ll still lose whatever was in memory, but the performance gains should make up for it.
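
Roughly, I'm picturing something like this (a sketch only, with an invented column list; the real analysis_data insert in facade-worker.py has more fields):

```
import sqlite3

# Sketch: stage one repo's analysis rows in an in-memory SQLite table, then
# push them into analysis_data in a single batch when the repo is finished.
# The column list here is illustrative, not Facade's full schema.

def analyze_repo(mysql_cursor, repo_id, commits):
    staging = sqlite3.connect(':memory:')
    staging.execute("CREATE TABLE staged "
                    "(repos_id INTEGER, author_email TEXT, added INTEGER, removed INTEGER)")

    for commit in commits:
        # ... per-commit analysis happens here, writing only to the staging db ...
        staging.execute("INSERT INTO staged VALUES (?, ?, ?, ?)",
                        (repo_id, commit['author_email'], commit['added'], commit['removed']))

    rows = staging.execute(
        "SELECT repos_id, author_email, added, removed FROM staged").fetchall()
    mysql_cursor.executemany(
        "INSERT INTO analysis_data (repos_id, author_email, added, removed) "
        "VALUES (%s, %s, %s, %s)", rows)
    staging.close()
```

If facade-worker.py dies mid-repo you lose only that repo's staged rows, which is the tradeoff I mentioned above.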

The reason I say this is the lowest-hanging fruit is that by reducing database traffic, it may become feasible to use the pure-Python MySQL connector (PyMySQL). In a drag race it’s considerably slower than mysqldb, but using PyMySQL gives us the ability to use PyPy. Everything in Facade works with PyPy already, and it’ll even choose PyMySQL when run under PyPy.
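
The driver choice itself is basically an import fallback, something like this (illustrative only; the exact logic in facade-worker.py may differ):

```
import platform

# Illustrative connector selection: use the C-based MySQLdb under CPython for
# speed, and fall back to pure-Python PyMySQL under PyPy (or when MySQLdb
# isn't installed).
if platform.python_implementation() == 'PyPy':
    import pymysql as mysql_driver
else:
    try:
        import MySQLdb as mysql_driver
    except ImportError:
        import pymysql as mysql_driver
```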

@sgoggins

sgoggins commented Apr 3, 2019

@brianwarner: I did a lot of optimization of the MariaDB parameters (which are the same as MySQL’s, but I’m sure you know that story). I have them in a file in my fork right now, but perhaps they belong in a README instead.

These speed up sorting when building the Facade cache:

tmp_table_size = 16106127360
max_heap_table_size = 16106127360
query_cache_limit = 4M
query_cache_size = 512M
join_buffer_size = 8M

The timeouts deal with long-running connections; they are mostly needed when first scanning a large number of repos:

thread_pool_idle_timeout = 40000
connect_timeout = 280000
deadlock_timeout_short = 100000
deadlock_timeout_long = 50000000
delayed_insert_timeout = 120000
innodb_lock_wait_timeout = 200000
lock_wait_timeout = 150000
interactive_timeout = 86500
net_read_timeout = 30000
net_write_timeout = 50000
slave_net_timeout = 40000
wait_timeout = 1500000

Yes, this array of timeout parameters is a bit of a "shotgun" approach.

The server has 128 GB of RAM, so:

innodb_buffer_pool_size = 107374182400

These help with sorting:

key_buffer = 256M
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 16

Connections are not a Facade issue:

max_connections = 250

A little caching helps with some of the queries:

table_cache = 16K

Nice to know:

log_slow_queries = /var/log/mysql/mysql-slow.log
long_query_time = 20
expire_logs_days = 10
max_binlog_size = 100M
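
If anyone wants to confirm which of these actually took effect after a restart, a quick check from Python works (a throwaway sketch; the connection parameters are placeholders, and the size values print in bytes):

```
import pymysql

# Throwaway check: confirm the tuned MariaDB variables actually took effect.
# Connection parameters are placeholders for your own setup.
conn = pymysql.connect(host='localhost', user='facade', password='********', db='facade')
with conn.cursor() as cur:
    cur.execute("SHOW VARIABLES WHERE Variable_name IN "
                "('tmp_table_size', 'max_heap_table_size', "
                "'innodb_buffer_pool_size', 'join_buffer_size')")
    for name, value in cur.fetchall():
        print(name, "=", value)
conn.close()
```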
