
Performance issue with a lot of rows (20k+) #1125

Closed
sneko opened this issue Jun 9, 2020 · 5 comments

sneko commented Jun 9, 2020

Issue description

Hi,

I know the issue tracker is not for free support; I'm presenting my problem because I think I may be missing a concept when using the driver, and I'm interested in your thoughts (I tried a few Slack workspaces without much success).

Context: I need to query 20k+ rows from my database to render multiple charts, but with this library it takes around 6 seconds to get all the results (whereas a raw MySQL client takes only 80ms).

I ran the tests listed below; if you have an idea of what could improve performance, I would appreciate it.

Note that where I mention the "cluster" below, I mean where the client is hosted; the DB server is hosted on Google Cloud SQL. Both are in the same region/zone, so network latency "should not be" part of the problem, and I made sure the client program has enough CPU/memory (same for the database).

Test 1

As mentioned above:

  • Server: Google Cloud SQL
  • Client: Golang program in the cluster
rows, err := db.Query("SELECT * FROM analytics WHERE scenario_id='XXXXX'")

i := 0
for rows.Next() {
	i++
}

Duration: 6 seconds

Test 2

  • Server: Google Cloud SQL
  • Client: MySQL docker image deployed in the cluster
SELECT * FROM analytics WHERE scenario_id='XXXXX';

Duration: 0.08 seconds

Test 3

  • Server: local database (after exporting the Cloud database and importing it locally)
  • Client: Golang program locally
rows, err := db.Query("SELECT * FROM analytics WHERE scenario_id='XXXXX'")

i := 0
for rows.Next() {
	i++
}

Duration: 0.4 seconds

Test 4

The weird thing compared to the Go tests above is that if I do:

  • Server: Google Cloud SQL
  • Client: Golang program in the cluster
rows, err := db.Query("SELECT * FROM analytics WHERE scenario_id='XXXXX'")

Duration: 0.08 seconds

In this case, without calling .Next(), it matches the raw MySQL client query duration from Test 2.

Test 5

  • Server: Google Cloud SQL
  • Client: Golang program in the cluster
result, err := db.Exec("SELECT * FROM analytics WHERE scenario_id='XXXXX'")

Duration: 6 seconds

In this case, .Exec() is not expected to return any result rows, yet it takes the same time as iterating through all the .Next() calls.

"Conclusion"

I'm a bit lost; I can't figure out what could cause such a difference between the raw MySQL client and Go, nor the difference between .Query() and .Exec().

(I have only 9 columns; if I reduce the SELECT to just 1 column, the duration indeed drops.)

If you have any advice, I would really appreciate it 👍

Thank you,

Configuration

Driver version (or git SHA): v1.5.0
Go version: go version go1.14.3 darwin/amd64
Server version: MySQL 5.7
Server OS: (hosted by Google Cloud SQL)

julienschmidt (Member) commented Jun 9, 2020

This might be related to #1053.
Right now the driver transmits all results, regardless of whether you close the rows early.

Query and Exec are just conveniences of database/sql. Under the hood it still executes the exact same query. I guess at least in the Exec case we could handle that smarter in a rather easy and safe way. However, this is a corner case. You should never perform a SELECT on 20k rows if you don't want to actually read them.
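
To illustrate where the time goes under that model, here is a rough sketch (not code from the thread; the DSN is a placeholder and the query is the one from the issue) that times Exec against a Query that is closed immediately. With v1.5.0 both are expected to cost roughly as much as fully iterating, since the full result set is transmitted either way:

package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN; adapt to your own setup.
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/dbname")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	const q = "SELECT * FROM analytics WHERE scenario_id='XXXXX'"

	start := time.Now()
	if _, err := db.Exec(q); err != nil {
		log.Fatal(err)
	}
	fmt.Println("Exec:", time.Since(start))

	start = time.Now()
	rows, err := db.Query(q)
	if err != nil {
		log.Fatal(err)
	}
	rows.Close() // closing early still drains the remaining rows from the wire
	fmt.Println("Query + immediate Close:", time.Since(start))
}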

methane (Member) commented Jun 9, 2020

Please measure CPU/network usage too.

My expectation is that you have the compression option enabled in the MySQL CLI.
This driver doesn't support compression yet. (#24)

sneko (Author) commented Jun 9, 2020

Thanks both for your answers 😃

@julienschmidt you say all results are transmitted and that Query and Exec are the same, so I don't understand the duration differences between the following:

  1. Query + for Next() = 6 seconds
  2. Query = 0.08 second
  3. Exec = 6 seconds

If all results are "downloaded" as soon as Query is called, does it mean that "browsing" the downloaded rows with for Next() is what makes things longer?

(Note that my example code is just for testing; in practice I definitely want to use the 20k rows.)

@methane since the query from the "raw" MySQL docker image also takes around 80ms while displaying all the results in the terminal, there really does seem to be a difference, given that the environment and resource capacity are the same as for the Go program in the cluster.

I didn't specify --compress with the MySQL CLI, and from the MySQL docs it seems it's disabled by default.

I didn't see abnormal CPU/memory consumption in the metrics charts, but I will dig into it with pprof & co. within the cluster.

methane (Member) commented Jun 9, 2020

If all results are "downloaded" as soon as Query is called, does it mean that "browsing" the downloaded rows with for Next() is what makes things longer?

Query() doesn't wait for the result data. It only receives the result header (e.g. column names, etc.).
If you don't call rows.Next() or rows.Close(), your program doesn't wait for the result data at all.
That is why Query() without Next() is so fast.
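
To make that concrete, here is a rough sketch (again with a placeholder DSN and the query from the issue) that separates the time until Query() returns from the time spent draining rows with Next():

package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN; adapt to your own setup.
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/dbname")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	start := time.Now()
	rows, err := db.Query("SELECT * FROM analytics WHERE scenario_id='XXXXX'")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("after Query():", time.Since(start)) // only the result header has arrived

	n := 0
	for rows.Next() { // each Next() reads row data from the connection
		n++
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("after draining %d rows: %v\n", n, time.Since(start))
}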

I didn't specify --compress with the MySQL CLI, and from the MySQL docs it seems it's disabled by default.

The --compress option is not the only way to enable compression. A configuration file you didn't notice might enable it silently.
Try mysql --help | grep compress, and mysql --skip-compress -e 'SELECT * FROM analytics WHERE scenario_id="XXXXX"'.

sneko (Author) commented Jun 25, 2020

Just to follow up, and sorry for the late response; I was working on other things.

In the meantime I upgraded the CPU/memory of the database and also of the "client" service, which made it more suitable for production.

I can't say for sure it's solved on my side; it would need more testing. But since this is about feeding multiple frontend charts, we do have a solution: we're going to apply the Douglas-Peucker algorithm to remove meaningless chart points from the database (it keeps the extremes), while keeping the full data points in another dedicated database better suited for that.

Thank you for your answers; I'm marking this as closed since the problem probably doesn't come from your library.
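
For reference, a minimal hypothetical sketch of the Ramer-Douglas-Peucker simplification mentioned above (the actual implementation used is not shown in this thread); points are assumed to be (x, y) float64 pairs and epsilon is the maximum allowed perpendicular deviation:

package main

import (
	"fmt"
	"math"
)

type Point struct{ X, Y float64 }

// perpDist returns the perpendicular distance from p to the line through a and b.
func perpDist(p, a, b Point) float64 {
	dx, dy := b.X-a.X, b.Y-a.Y
	length := math.Hypot(dx, dy)
	if length == 0 {
		return math.Hypot(p.X-a.X, p.Y-a.Y)
	}
	return math.Abs(dy*p.X-dx*p.Y+b.X*a.Y-b.Y*a.X) / length
}

// simplify keeps both endpoints and recursively keeps every point whose
// distance from the current segment exceeds epsilon (so extremes survive).
func simplify(pts []Point, epsilon float64) []Point {
	if len(pts) < 3 {
		return pts
	}
	maxDist, maxIdx := 0.0, 0
	for i := 1; i < len(pts)-1; i++ {
		if d := perpDist(pts[i], pts[0], pts[len(pts)-1]); d > maxDist {
			maxDist, maxIdx = d, i
		}
	}
	if maxDist <= epsilon {
		return []Point{pts[0], pts[len(pts)-1]}
	}
	left := simplify(pts[:maxIdx+1], epsilon)
	right := simplify(pts[maxIdx:], epsilon)
	out := append([]Point{}, left[:len(left)-1]...) // drop the shared middle point once
	return append(out, right...)
}

func main() {
	pts := []Point{{0, 0}, {1, 0.1}, {2, -0.1}, {3, 5}, {4, 6.2}, {5, 7}, {6, 7.9}, {7, 9}}
	fmt.Println(simplify(pts, 1.0)) // keeps endpoints and significant turning points
}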

sneko closed this as completed Jun 25, 2020