
Evaluating query runtime without output #281

Open · AKheli opened this issue Feb 23, 2022 · 1 comment

AKheli commented Feb 23, 2022

Hello,

I am using PyDruid to evaluate a query's runtime in Druid without taking into account the result output that is returned by the API.

from pydruid.db import connect
import time

conn = connect(host='localhost', port=8082, path='/druid/v2/sql/', scheme='http')
curs = conn.cursor()
start = time.time()
curs.execute("""
    SELECT id_station, count(*) FROM bafu_comma where id_station IN (32, 54, 8, 25, 95, 13, 80, 16, 83, 27) group by id_station
""")
end1 = time.time()
print('execution runtime:', (end1 - start) * 1000, 'ms')
print('number of rows:', sum(1 for _ in curs))
end2 = time.time()
# for row in curs:
#     print(row)
print('total time:', (end2 - start) * 1000, 'ms')

Is this a correct way of measuring the runtime? My execution time is always around 200 ms or 50 ms, which is a bit suspicious. Also, the total runtime I obtain is much higher than what I see when querying the API directly.

Any ideas on how to properly evaluate a query execution time in Druid?

Thanks!

@betodealmeida betodealmeida self-assigned this Nov 3, 2022
@betodealmeida (Contributor)

I'm not sure that's correct. The DB API connector streams the results from Druid, so unless you have iterated over the whole result set, I don't think you can assume that the query execution has finished.

pydruid/pydruid/db/api.py

Lines 365 to 380 in bd7b741

# Druid will stream the data in chunks of 8k bytes, splitting the JSON
# between them; setting `chunk_size` to `None` makes it use the server
# size
chunks = r.iter_content(chunk_size=None, decode_unicode=True)
Row = None
for row in rows_from_chunks(chunks):
    # update description
    if self.description is None:
        self.description = (
            list(row.items()) if self.header else get_description_from_row(row)
        )
    # return row in namedtuple
    if Row is None:
        Row = namedtuple("Row", row.keys(), rename=True)
    yield Row(*row.values())

The correct time is probably closer to end2 - start in this case, I think.
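For what it's worth, one way to make sure the timer only stops after the query has fully finished is to drain the cursor first, e.g. with fetchall(). A minimal sketch, assuming the same local broker and bafu_comma table from the snippet above:

from pydruid.db import connect
import time

conn = connect(host='localhost', port=8082, path='/druid/v2/sql/', scheme='http')
curs = conn.cursor()

start = time.time()
curs.execute("""
    SELECT id_station, COUNT(*) FROM bafu_comma
    WHERE id_station IN (32, 54, 8, 25, 95, 13, 80, 16, 83, 27)
    GROUP BY id_station
""")
rows = curs.fetchall()  # drain the stream so execution + transfer are both included
end = time.time()

print('number of rows:', len(rows))
print('end-to-end time:', (end - start) * 1000, 'ms')

Note this still measures the full client round trip (network plus JSON parsing), not Druid's internal query time; for the server-side number you'd want to look at the query metrics Druid itself emits (e.g. query/time).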
