Intern Tasks 2023/2024 #58394
Comments
Hi @alexey-milovidov, can I work on #29922? It seems the best fit for my background right now. Thank you.
Hi @alexey-milovidov, I am interested in the task "Integration with SQLSmith". Could you assign it to me? Thanks!
Hi @alexey-milovidov, I am interested in working on "Default expressions in INSERT query", #51509.
Hi, are the tasks above open to everyone, or only available to ClickHouse interns?
Hi @alexey-milovidov, can I book 'Integration of ClickHouse with AWS Lambda'? I wanted to deep-dive into the AWS Lambda runtime and stumbled upon this task. Fortunately, there is already a C++ runtime implementation at https://github.com/awslabs/aws-lambda-cpp, so it was relatively easy to integrate it with ClickHouse by hooking up the Lambda payload as external table data and capturing the query result into a string to send as the Lambda response. Despite the Lambda payload limitation of 6 MB, it looks nice to be able to retrieve data directly from Lambda requests, for example, for Change Data Capture (CDC) scenarios. During the research, I thought it would be good to have the following features:
@alexey-milovidov May I ask if the above features are available for external contributions, or reserved for your company's official interns only?
This is the list of proposed tasks. It is to be extended. You can propose more tasks.
You can also find the previous lists here:
2022/2023: #42194
2021/2022: #29601
2020/2021: #15065
2019/2020: https://gist.github.com/alexey-milovidov/4251f71275f169d8fd0867e2051715e9
2018/2019: https://gist.github.com/alexey-milovidov/6735d193762cab1ad3b6e6af643e3a43
2017/2018: https://gist.github.com/alexey-milovidov/26cc3862eb87e52869b9dac64ab99156
The tasks should be:
Copy-on-Write and Backup disks
@yulikdaniel
#42155
#42154
Check `IDisk` and how various disk types are implemented.
With the backup disk, it will be possible to quickly attach a backup for read-only access without fully restoring it (without copying the data). For example, it can be used for quickly attaching example datasets (although that is already possible, see https://github.com/ClickHouse/web-tables-demo).
With COW disks, you can turn a read-only disk into a read/write one, but lazily: the data will be instantly accessible, and the changes will be written on demand.
Optimization of vector indices with Intel AMX and GPU
@omgronny
This is for those who are interested in low-level code and SIMD CPU instructions.
Here is the main task: #48831
For example, `partitionByHyperplanes` needs to calculate many scalar products, which is exactly the same as multiplying a vector by a matrix. Intel AMX is an instruction set to accelerate matrix products, and this operation is also needed for vector quantization.
If you have high-dimensional vectors and need to do an approximate nearest-neighbor search, the question is: how do you make the vectors smaller or coarser so that the search is faster, but without a large drop in quality?
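To make the matrix-product connection concrete, here is a minimal sketch in Python/NumPy. It is purely illustrative: the function name mirrors the `partitionByHyperplanes` mentioned above, not its actual signature, and the bit-packing of signs into bucket ids is one possible convention.

```python
import numpy as np

def partition_by_hyperplanes(vectors, hyperplanes):
    """Assign each vector a bucket id from the signs of its scalar
    products with a set of hyperplane normals. Computing all scalar
    products at once is a single matrix multiplication -- exactly the
    operation Intel AMX accelerates."""
    # (n_vectors, dim) @ (dim, n_planes) -> (n_vectors, n_planes)
    products = vectors @ hyperplanes.T
    signs = (products > 0).astype(np.uint64)
    # Pack the sign bits into one integer bucket id per vector.
    weights = np.left_shift(np.uint64(1),
                            np.arange(hyperplanes.shape[0], dtype=np.uint64))
    return (signs * weights).sum(axis=1)

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 64))
planes = rng.standard_normal((8, 64))   # 8 hyperplanes -> 256 buckets
buckets = partition_by_hyperplanes(vectors, planes)
```

With 8 hyperplanes, every vector falls into one of 2^8 = 256 buckets; nearby vectors tend to share a bucket, which is the basis for approximate search.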
Integration of ClickHouse with AWS Lambda
Available
This is a pretty straightforward task if you aren't afraid to read AWS docs.
Implement a mode (app) for the clickhouse binary that will listen to events according to the AWS Lambda API, run the corresponding queries, and output the result. This app is similar to clickhouse-local.
https://docs.aws.amazon.com/lambda/latest/dg/runtimes-custom.html
Frankly speaking, it is trivial to put clickhouse-local into AWS Lambda by using a small wrapper in bash or Python: https://github.com/aws-samples/aws-lambda-clickhouse
But we want the same without any wrappers, provided directly inside the clickhouse binary.
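For orientation, the custom-runtime protocol from the AWS docs above boils down to a long-poll/post loop. A minimal Python sketch of that loop follows; the actual task is C++ inside the clickhouse binary, and `run_query` shelling out to `clickhouse local` is just an illustrative stand-in.

```python
import os
import subprocess
import urllib.request

# The invocation API paths below are documented by AWS; the rest is
# an illustrative assumption, not ClickHouse's actual Lambda app.
API_VERSION = "2018-06-01"

def invocation_url(runtime_api, request_id=None):
    base = f"http://{runtime_api}/{API_VERSION}/runtime/invocation"
    if request_id is None:
        return f"{base}/next"          # long-poll for the next event
    return f"{base}/{request_id}/response"

def run_query(query):
    # Hypothetical stand-in: run the payload as a query via clickhouse local.
    return subprocess.run(["clickhouse", "local", "--query", query],
                          capture_output=True, text=True).stdout

def serve_forever():
    runtime_api = os.environ["AWS_LAMBDA_RUNTIME_API"]
    while True:
        with urllib.request.urlopen(invocation_url(runtime_api)) as resp:
            request_id = resp.headers["Lambda-Runtime-Aws-Request-Id"]
            query = resp.read().decode()      # Lambda payload = the query
        result = run_query(query)
        req = urllib.request.Request(
            invocation_url(runtime_api, request_id),
            data=result.encode(), method="POST")
        urllib.request.urlopen(req)           # send the query result back
```

The in-binary version would replace `run_query` with a direct call into the clickhouse-local machinery, avoiding the process-spawn overhead per invocation.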
Indexing with space-filling curves
@Artemmm91
ClickHouse already supports the calculation of the z-order (Morton) curve, as well as its usage in the index analysis.
This task is about adding support for more types of curves: Hilbert and Moore curves. See #55520
This is not too hard but gives beautiful results.
Extending LIMIT clause with boundary conditions
Available
The LIMIT clause in SQL cuts off the result after a number of records. What if we extend it to stop after certain criteria are met? This is useful for log analysis. Imagine something like "give me five records before and after every record with the column `text` containing 'upyachka'". #15341
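A sketch of the intended semantics, assuming a `grep -C`-style context window (the function name and shape are hypothetical, not a proposed API):

```python
def context_window(rows, predicate, n=5):
    """Return rows within n positions of any matching row: 'five
    records before and after every record matching the predicate'."""
    hits = [i for i, row in enumerate(rows) if predicate(row)]
    keep = set()
    for i in hits:
        keep.update(range(max(0, i - n), min(len(rows), i + n + 1)))
    return [rows[i] for i in sorted(keep)]

logs = ["ok"] * 10 + ["upyachka"] + ["ok"] * 10
window = context_window(logs, lambda r: "upyachka" in r)
# 11 rows: 5 before, the match, 5 after
```

The interesting part of the real task is doing this in a streaming pipeline, where "records before the match" must be buffered before it is known whether a match follows.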
Extensions for the ORDER BY operator
@aretaval
We will add the `SHUFFLE` clause for SQL queries, and `SHUFFLE ... LIMIT`, which will work faster than `ORDER BY rand()`: O(n) instead of O(n * log(n)).
As a bonus, we can add support for topological sorting: #34343
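The complexity claim can be illustrated with a partial Fisher-Yates shuffle, which yields a uniform random sample of k rows without sorting anything. This is an illustrative sketch, not ClickHouse's implementation:

```python
import random

def shuffle_limit(rows, k, seed=None):
    """SHUFFLE ... LIMIT k in O(n): materialize the rows, then do only
    the first k steps of a Fisher-Yates shuffle. Each prefix element
    ends up uniformly sampled, unlike ORDER BY rand()'s O(n log n)."""
    rng = random.Random(seed)
    rows = list(rows)
    n = len(rows)
    for i in range(min(k, n)):
        j = rng.randrange(i, n)          # pick from the unshuffled tail
        rows[i], rows[j] = rows[j], rows[i]
    return rows[:k]
```

A full `SHUFFLE` (no LIMIT) is the same loop run to the end; with a small k, only k swaps are performed after the O(n) materialization.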
Large-scale internet technology survey
Available
This is for a full-stack engineer or data analyst.
This task does not require C++ knowledge and is implemented mostly in SQL.
Given a dataset of 1 TB of HTML pages (ClickHouse's Minicrawl) or 200 TB of HTML pages (CommonCrawl), do research on the technologies used on the web and prepare a dataset with extracted metadata. Similar examples: https://w3techs.com/, https://builtwith.com/, https://httparchive.org/
You can check the preparation and current results in this task: #18842
The goal is to build a mini web app. The app will answer questions like which JavaScript frameworks are the most popular on the Internet (the correct answer is jQuery), and how their popularity changes depending on the website's popularity, country, SSL certificate vendor, etc. There is a boatload of information, but take care: if this topic is not interesting to you, it's unlikely you will make a good app.
Usability improvements for Pretty formats
@Volodyachan
A small but nice task, C++.
#56912
#49153
#58363
Additionally, we will:
Improvements for import/export data formats HDF5, ION, CBOR, XML, FlatBuffers
@zahartd, @ttser1085
ClickHouse supports a wide variety of data formats for import and export, like CSV, JSON, Protobuf, etc. See IInputFormat, IOutputFormat.
HDF5 - a format for multidimensional arrays of scientific/technical data.
ION - a format for semistructured data from Amazon, somewhat resembling MsgPack, Avro.
For example, take a look at the pull request introducing the Numpy format: #55982 - that was an easy task.
XML: #29822
Advanced code coverage collection on a per-test basis
I've accidentally implemented this task, so it is no longer available 😢
Integration with FSST
Booked by ...
https://github.com/cwida/fsst
See #34246
We will integrate it as a compression method and also provide a way to push-down filters into the decompressor for skipping the data without decompression.
Index advisor
Booked by ...
Users must currently choose primary and skipping indexes manually, e.g. by looking at which predicates in their queries will be costly. It would be better if ClickHouse could suggest candidate index columns automatically and show them in a system view. Most algorithms for automatic index selection are based on what-if cost estimation in the query optimizer (see here), given some workload. Since we don't have cost estimation in the optimizer yet, we can try to make a recommendation from the number of scanned rows (simplest approach) or use some light-weight statistics about output cardinalities of scans (during a collection phase).
Contact: @rschu1ze
GitHub public SSH keys as a user dictionary
Booked by @brezhart, and...
ClickHouse supports authentication via SSH keys: the private key resides on your machine, and you don't share it with anyone; the public key is located on the server, and you can publish it anywhere.
Then, you don't even need to copy the keys to the server; instead, you can instruct the server to fetch your public keys through some API. For example, here are my public keys for GitHub access: https://github.com/alexey-milovidov.keys
If we support key retrieval in this way on the server, we can achieve two things:
1. It will be sufficient to configure only the username (for example, that `alexey-milovidov` has access) and specify that the keys should be taken from GitHub; there is no need to copy anything to the server.
2. You can allow access to all GitHub users and let them be represented within ClickHouse, for example, with a certain prefix (e.g., `gh_alexey-milovidov`) if such a user is not explicitly configured; or allow authentication of all GitHub users as one user on the ClickHouse side (for example, `github_users`).
In addition to (2), you can allow access not to all users but only to users with certain properties. For example, let only those who have starred the ClickHouse repository on GitHub use our demo server (this is just a funny example).
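The `.keys` endpoint itself is trivial to consume; the real task is the server-side integration, caching, and user mapping. A hedged Python sketch of the fetch step:

```python
import urllib.request

# GitHub's documented endpoint: one authorized public key per line.
KEYS_URL = "https://github.com/{username}.keys"

def parse_keys(text):
    """Split the response into individual authorized-key lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def github_public_keys(username):
    with urllib.request.urlopen(KEYS_URL.format(username=username)) as resp:
        return parse_keys(resp.read().decode())
```

In the server, the fetched keys would need to be cached with a TTL so an authentication attempt doesn't hit GitHub on every connection.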
HTTP API and UI for ClickHouse Keeper
@speeedmaster
C++ and a bit of JavaScript.
#55332
Best effort sorting for data compression
@ElderlyPassionFruit
A very interesting task; see: #4413 (comment)
There is a proposal on how it could possibly be done.
Comparison of memory allocators in ClickHouse
@ilyaleshchik, @maximdervis, @h0tmi
#34157
Watch these two videos (for those who understand Russian; otherwise, you'll easily find better-quality videos on YouTube):
https://www.youtube.com/watch?v=BTkvaPefAKw
https://www.youtube.com/watch?v=i87W3KyZgPw
PostgreSQL as a dialect in ClickHouse
@nikitakeba, @seshWCS
ClickHouse supports switchable dialects: `SET dialect = 'prql' | 'kusto'`, etc. It is proposed to plug in a PostgreSQL query parser library and create a translator from the PostgreSQL AST to the ClickHouse AST. This way, compatibility with a subset of the PostgreSQL dialect can be achieved. It's a subset because PostgreSQL has many unusual operators and functions that would be translated into unimplemented functions. Additionally, there may be behavioral incompatibilities in edge cases.
Query translation layer for MongoDB
Available
Add support for the MongoDB protocol to ClickHouse and the ability to handle queries in that inconvenient JSON format, as expected by MongoDB (which has SQL support only in its commercial product). As an example, you can take a look at https://github.com/fcoury/oxide
Support for InfluxDB line protocol and streaming inserts
@gitnabi
https://docs.influxdata.com/influxdb/v2.7/reference/syntax/line-protocol/
And #18273
Default expressions in INSERT query
Available
#51509
Multithreaded data import in the native ClickHouse protocol
Available
This is a trivial idea. In the INSERT query, the server receives blocks of data from the client over the network for insertion. The blocks are read from the connection, deserialized, and inserted into the table sequentially. Let's make the insertion parallel, according to the `max_insert_threads` setting. #42372
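The proposed pipeline can be sketched as a single network reader feeding a pool of inserting threads. This is illustrative Python only; the real work happens in the C++ server, and `max_insert_threads` is the existing ClickHouse setting the sketch mimics.

```python
import queue
import threading

def parallel_insert(blocks, insert_block, max_insert_threads=4):
    """Read blocks sequentially (as from the client connection) but
    run insert_block concurrently on a pool of worker threads.
    insert_block must be thread-safe."""
    q = queue.Queue(maxsize=max_insert_threads * 2)  # bounded backlog

    def worker():
        while True:
            block = q.get()
            if block is None:        # poison pill: shut down
                return
            insert_block(block)

    threads = [threading.Thread(target=worker)
               for _ in range(max_insert_threads)]
    for t in threads:
        t.start()
    for block in blocks:             # the sequential "network read"
        q.put(block)
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
```

The bounded queue provides backpressure: the reader stalls if deserialization and insertion cannot keep up, just as the server should stop reading from the socket.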
Quantile Compression codec
@rgolovachev
Compression for floating point data types based on the statistical distribution in a block of data: https://github.com/mwlon/pcodec
It can be used both as lossy and lossless compression. This task is for algorithms and performance optimization.
Lossy Floating Point Compression codecs
Besides quantile compression (above), there are many other lossy FP compression codecs; see, e.g., #49888. The task is to create a small survey of how they work and how effective and popular they are, and to integrate one or two of them into ClickHouse. Quantile compression can be used as a baseline for comparison.
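As a baseline intuition for how lossy FP codecs trade precision for ratio, here is a toy round-then-compress sketch. Real codecs (e.g., pcodec linked above, or zfp) are far more sophisticated; this only illustrates that discarding precision makes the byte stream more compressible.

```python
import struct
import zlib

def lossy_compress(values, decimals=2):
    """Toy lossy codec: round floats to a fixed number of decimals
    (bounding the error at 0.5 * 10^-decimals), then hand the bytes
    to a generic compressor."""
    rounded = [round(v, decimals) for v in values]
    raw = struct.pack(f"{len(rounded)}d", *rounded)
    return zlib.compress(raw)

def lossy_decompress(blob, n):
    return list(struct.unpack(f"{n}d", zlib.decompress(blob)))

values = [i * 0.123456789 for i in range(1000)]
blob = lossy_compress(values)
restored = lossy_decompress(blob, len(values))
```

The survey in the task would compare such error bounds and ratios across codecs on real columnar data.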
Data transformations with URL table functions
@scanhex12
#45994
Streaming Queries
@Michicosun
That's a very large task, but even if only a single item is implemented, it will be more than enough. It requires a good understanding of the query pipeline.
#42990
Interactive metrics displaying in ClickHouse client
@mariaKhr
It is an absolutely great task, and if no one is interested, I will do it by myself.
#39694
API for query rewrite
Available
There are two features inside this task:
1. Provide out-of-band modifications for a query to run. For example, we already have the possibility to specify `limit` and `offset`, `additional_table_filters`, and `additional_result_filter` in settings. We can also add something like `override_select`, `override_order_by`, etc. This is useful when a user provides a single query, but we generate multiple reports on top of it.
2. Given a query and a set of desired modifications, return a modified, reformatted query. This is useful for various query UIs.
See #29922
Size-capped tables
@punkmunk
#37548
TOTAL, BY, and ORDER BY combinators for aggregate functions
@perst20
#34156
Integration with SQLSmith
Available
This is needed for testing:
https://github.com/anse1/sqlsmith
Improvements for INFORMATION SCHEMA
Available
This is something easy but enormously boring.
Time series analysis with aggregate functions
@LordVoldebug
The task offers quite a few possibilities. By the way, @aleks5d worked on this task last year and received an excellent grade. So, you can ask him how interesting it is.
Here's the pull request from last year that you can take a look at:
aleks5d#1
aleks5d#2
aleks5d#3
However, the task is still relevant because the functions haven't been added to the main codebase yet, not all planned options have been implemented, and the quality and performance haven't been tested on real data.
Time series prediction algorithms (with Holt-Winters as the main one) can be used for:
Dataset watermarking
Booked by ...
You distribute a dataset to several parties but want to add random, undetectable modifications that make it possible to identify a party if the dataset is distributed further.
Add random modifications to the data to track the dataset's source.
Encode a 64-bit identifier by pseudorandomly altering the data so it can be extracted back by xor-ing with the original dataset, but keep the following properties: the altered values should remain plausible (e.g., replace `make` with `cake`, not with `)ake`);
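A minimal sketch of the XOR-extractable scheme described above, on an integer column. This is illustrative only: a real watermark must also survive row reordering, partial leaks, and collusion between parties.

```python
import random

def watermark(values, ident, seed=0):
    """Hide a 64-bit identifier by flipping the low bit of 64
    pseudorandomly chosen values (one position per identifier bit)."""
    rng = random.Random(seed)
    positions = rng.sample(range(len(values)), 64)
    out = list(values)
    for bit, pos in enumerate(positions):
        if (ident >> bit) & 1:
            out[pos] ^= 1            # flip the least-significant bit
    return out

def extract(original, leaked, seed=0):
    """Recover the identifier by diffing the leaked copy against the
    original at the same pseudorandom positions."""
    rng = random.Random(seed)
    positions = rng.sample(range(len(original)), 64)
    ident = 0
    for bit, pos in enumerate(positions):
        if original[pos] != leaked[pos]:
            ident |= 1 << bit
    return ident

data = list(range(1000))
marked = watermark(data, 0x123456789ABCDEF0)
```

In practice, each party gets a different identifier (and possibly a different seed), and at most 64 values differ from the original by one least-significant bit.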
Intelligent auto-complete based on the history from query_log
@iyubondyrev
`clickhouse-client` will create a text model from the `query_log` and use it for completion. Recent queries have a higher weight. The text model can be built on top of the tokens after the Lexer. It can be a mix of order-0..2 frequencies. If you're lucky, you can use modern ML with deep learning (clickhouse-client will load pre-trained models); see LLamafile and https://github.com/jart/emacs-copilot
PS. It is important to start with a simple solution. Just load the data from `query_log` and predict with a Markov model.
Fuzzy GROUP BY for data clustering
@no-1dea
Add a modifier `WITH CLUSTER` to a `GROUP BY` element, defining a maximum distance between groups. Groups located close to each other will be merged.
Example:
SELECT user_id, min(event_time), max(event_time) FROM table GROUP BY user_id, event_time WITH CLUSTER 1800
This query will group together sessions of events for every user_id, where a session is a sequence of events with the distance between neighboring events no longer than 1800 seconds (half an hour).
It can be extended to the multidimensional case:
Example:
SELECT user_id, count() FROM table GROUP BY user_id, (x, y) WITH CLUSTER 10
This query will group together points (x, y) with L2-distance less than 10.
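In one dimension, the semantics behind `WITH CLUSTER` reduce to gap-based session clustering; a sketch of the idea (not a proposed implementation):

```python
def cluster_1d(values, max_distance):
    """Greedy 1-D clustering: sort the values, then start a new group
    whenever the gap to the previous value exceeds max_distance
    (e.g., 1800 seconds = half an hour for sessionization)."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= max_distance:
            groups[-1].append(v)     # close enough: same session
        else:
            groups.append([v])       # gap too large: new session
    return groups

events = [0, 100, 2500, 2600, 10000]
sessions = cluster_1d(events, 1800)
```

The multidimensional case with L2-distance is harder: it amounts to single-linkage clustering, which has no equally simple sorted-order formulation.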
JOIN optimization with in-memory compression
@p1rattttt
#21047
JOIN is usually performed by placing blocks from the right side of the JOIN into the RAM and creating a hash table referencing these blocks, followed by lookups in this table. However, data may not fit into memory. In this case, data structures in external memory are used. But why not try an option where data blocks are compressed and remain in the RAM, and only a small part of them (according to LRU) is decompressed as needed and kept in uncompressed form?
We already have an interface for this, `IColumn::compress`, and it wouldn't be difficult to use it.
Text classification with n-gram models
Available
Implement loading language models based on n-grams in the server. A model is represented by a map `(ngram, class) -> count`. Add a function that, given the model name and an input string, returns the most likely class or the probability distribution over classes using the naive Bayes classifier. This might sound archaic, but it will work decently for tasks such as charset and language detection, as well as style classification.
See previous attempts: #50247 and #48984
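The `(ngram, class) -> count` model described above fits in a few lines. A hedged sketch with add-one smoothing (class and method names are made up, and character bigrams stand in for generic n-grams):

```python
import math
from collections import Counter

def ngrams(text, n=2):
    """Character n-grams of the input string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NgramNaiveBayes:
    """Per-class n-gram counts scored with naive Bayes; add-one
    smoothing keeps unseen n-grams from zeroing out a class."""
    def __init__(self):
        self.counts = {}                       # class -> Counter of ngrams

    def train(self, cls, text):
        self.counts.setdefault(cls, Counter()).update(ngrams(text))

    def classify(self, text):
        def log_likelihood(cls):
            c = self.counts[cls]
            total = sum(c.values())
            return sum(math.log((c[g] + 1) / (total + 1))
                       for g in ngrams(text))
        return max(self.counts, key=log_likelihood)

model = NgramNaiveBayes()
model.train("en", "the quick brown fox jumps over the lazy dog")
model.train("de", "der schnelle braune fuchs springt ueber den faulen hund")
```

For language detection at scale, the server-side model would be trained on much larger corpora, and the function would also expose the full probability distribution, as the task suggests.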
Merge process visualization and optimization
@sodeniZzz, @arshtanko
The task has some parts in Python (modeling and optimization, although it can be implemented in any other language), C++ (implementing the algorithm in ClickHouse), and JavaScript (algorithm visualization).
https://presentations.clickhouse.com/hse_2019/merge_algorithm.pptx - there is something interesting here (previous developments on this task), but don't be afraid, it's not very complicated.
Integration with GGML
@HazyMrf, @m0r0zk01, @IArtNextI
Check GGML and llama.cpp, and how we can integrate them for text processing.
Also check https://jaykmody.com/blog/gpt-from-scratch/ and https://bbycroft.net/llm
Peephole optimizations for ClickHouse
@VKlin357
Different query optimizations by rewriting small pieces of expressions in them to simplify. For example: #50167
Many such optimizations can be imagined, and the challenge lies in implementing them correctly and in a user-friendly manner.
Ask Reddit: a large-scale RAG experiment
Available
We embed over 20 billion comments into vectors and will try to do something on top of that.
Improvements for ClickHouse Fiddle
Available
A guest task from https://github.com/lodthe/
Implementation of monotonic unique identifiers Snowflake ID and UUID v5/6/7
@kazalika
Quite a small task, but it also extends to the implementation of auto-increment in tables.
See #56228 (comment)
Streaming consumption from external tables with a monotonic key
@NDamirov, @dubr0vin
A good topic, quite easy to understand, and the results are easily presentable, but at the same time it requires attention to detail.
The task is to import data into ClickHouse from a table located in an external system, such as MySQL or PostgreSQL. Let's assume that new records are constantly added to the table and old ones do not change (append-only). Additionally, records have a monotonically increasing identifier (such as auto_increment or timestamp). We will simply remember the value of this identifier during the previous successful import and retrieve new records with a query.
This results in a straightforward solution for replication, although suitable for a limited number of scenarios.
An unusual detail is the need to support querying data with a time overlap. For example, the next query will extract data not precisely from the previous moment but from the previous moment minus 10 seconds and filter out already imported records based on their hashes.
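The overlap-plus-hash-dedup loop described above can be sketched as follows (all names are hypothetical; `fetch` stands in for a query like `SELECT ... WHERE key > ?` against the external table):

```python
import hashlib

class IncrementalImporter:
    """Re-query from (last_key - overlap) on every poll and drop rows
    whose hashes were already imported, so late-arriving rows inside
    the overlap window are not missed or duplicated."""
    def __init__(self, fetch, overlap=10):
        self.fetch = fetch        # fetch(min_key) -> iterable of (key, row)
        self.overlap = overlap    # e.g. 10 seconds for a timestamp key
        self.last_key = 0
        self.seen = set()         # row hashes; a real implementation
                                  # would prune hashes outside the window

    def poll(self):
        new_rows = []
        for key, row in self.fetch(self.last_key - self.overlap):
            h = hashlib.sha256(repr((key, row)).encode()).hexdigest()
            if h in self.seen:
                continue          # already imported on a previous poll
            self.seen.add(h)
            new_rows.append(row)
            self.last_key = max(self.last_key, key)
        return new_rows
```

The overlap window is what makes the scheme robust to rows committed slightly out of key order in the source system.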
Visualization and analysis of program tracing
@Krytog
First, we create a beautiful video illustrating how the instruction pointer jumps through addresses in machine code. Drawing can be done using shaders on the GPU. The catch is that the video may turn out to be not so beautiful :)
Next, we take the names of symbols (functions) and transform them into vectors using the word2vec algorithm based on how often they appear together in the program trace.
https://jalammar.github.io/illustrated-word2vec/
https://github.com/JosephSBoyle/skip_gram/blob/master/skip_gram/main.py
Then, we visualize them using umap: https://github.com/lmcinnes/umap/
Take a look at things like this, too: https://codisec.com/binary-data-visualization/
I'm afraid this task is suitable only for enthusiasts.
Streaming consumption from Apache Pulsar
@nikitagolu8ev, @SteveBalayanAKAMedian
ClickHouse implements streaming import from persistent queues, such as Kafka, RabbitMQ, NATS, and FileLog.
Apache Pulsar is a persistent queue similar to Kafka. Let's add support for it in the same way.
By the way, RabbitMQ and NATS were added in previous years' projects:
#11069
#37171
Data visualization in the terminal in clickhouse-client
@Novart-projects
Take a look at:
https://github.com/red-data-tools/YouPlot
https://github.com/JuliaPlots/UnicodePlots.jl
These projects are written in Ruby and Julia, so we cannot use them directly.
But we can implement something similar in C++ and offer it as an output format.
Maybe you can find a library in Rust that we are also allowed to use (Rust is an obscure programming language for terminal apps).
Improvement for SQL functions in ClickHouse
@cucumber95
This task has an uninteresting name; it includes #29978, #38002, #36470, and similar.
Data parallel implementation of hash functions for fixed-size strings
@piezzacondoleezza
Cryptographic hash functions, such as SHA-1, SHA-2, and SipHash, typically do not use SIMD instructions, as updating their state depends on previous bytes of data in a chain.
But if we need to apply a hash function to an array of strings, can we use SIMD by processing a group of strings simultaneously?
If the strings are of different lengths, it's difficult. However, if they are of fixed length, it becomes quite convenient.
This task is suitable only if you've already tried to do something with SIMD instructions.
Here is an article - an introduction to SIMD: https://woboq.com/blog/utf-8-processing-using-simd.html
Language Server for ClickHouse
@Minnakhmetov
ClickHouse should implement the language server protocol in a way that allows IDEs to properly understand its syntax.
ClickHouse does not provide an independent grammar, so we cannot implement it using ANTLR or, e.g., tree-sitter. It means we have to embed the language server directly into ClickHouse.
Then multiple questions arise - how to do a fallback on parser errors so you can continue and do proper syntax highlighting.
As a bonus, we can improve the syntax highlighting in clickhouse-client - currently, it is implemented using a lexer. But we'll do it with a parser.
Can we also implement context-aware completions, so it can find and use the information about database objects, functions, etc?
Can we also show a snippet from the docs in clickhouse-client while typing?