Intern Tasks 2023/2024 #58394

Open
alexey-milovidov opened this issue Dec 31, 2023 · 6 comments
Comments

alexey-milovidov (Member) commented Dec 31, 2023

This is the list of proposed tasks. It is to be extended. You can propose more tasks.
You can also find the previous lists here:

2022/2023: #42194
2021/2022: #29601
2020/2021: #15065
2019/2020: https://gist.github.com/alexey-milovidov/4251f71275f169d8fd0867e2051715e9
2018/2019: https://gist.github.com/alexey-milovidov/6735d193762cab1ad3b6e6af643e3a43
2017/2018: https://gist.github.com/alexey-milovidov/26cc3862eb87e52869b9dac64ab99156

The tasks should be:

  • not too hard (doable within about a month), but usually not less than a week of work;
  • not altering core components of the system;
  • mostly isolated, not requiring full knowledge of the system;
  • somewhat interesting to implement, or containing some element of research;
  • not on the critical path of our roadmap (it is ok if they are thrown away after a year);
  • mostly for C++ developers, but there should also be tasks for frontend developers, or tools/research that only require Go/Python/whatever;
  • allowing teamwork, at least for some of them;
  • covering various skills, e.g., systems programming, algorithm knowledge, etc.

Copy-on-Write and Backup disks

@yulikdaniel

#42155
#42154

Check IDisk and how various disk types are implemented.

With the backup-disk it will be possible to quickly attach a backup for read-only access without fully restoring it (without copying the data). As an example, it can be used for quickly attaching example datasets (although it is already possible, see https://github.com/ClickHouse/web-tables-demo).

With COW disks, you can turn a read-only disk into a read/write one, but lazily: the data is instantly accessible, while changes are written on demand.

Optimization of vector indices with Intel AMX and GPU

@omgronny

This is for those who are interested in low-level code and SIMD CPU instructions.

Here is the main task: #48831

For example, partitionByHyperplanes needs to calculate many scalar products - and this is exactly the same as multiplying a vector by a matrix.

Intel AMX is an instruction set that accelerates matrix products, and this operation is also needed for vector quantization.

If you have high-dimensional vectors and need to do an approximate nearest-neighbor search, the question is: how do you make the vectors smaller or coarser so that the search becomes faster, without a large drop in quality?

Integration of ClickHouse with AWS Lambda

Available

This is a pretty straightforward task if you aren't afraid to read AWS docs.

Implement a mode (app) for the clickhouse binary that will listen to events according to the AWS Lambda API, run the corresponding queries, and output the result. This app is similar to clickhouse-local.

https://docs.aws.amazon.com/lambda/latest/dg/runtimes-custom.html

Frankly speaking, it is trivial to put clickhouse-local into AWS Lambda by using a small wrapper in bash or Python: https://github.com/aws-samples/aws-lambda-clickhouse
But we want the same thing without any wrappers, provided directly by the clickhouse binary.
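For orientation, here is a minimal sketch of what such a mode could look like; the http_get/http_post and run_clickhouse_local helpers are hypothetical placeholders, and only the endpoint paths come from the AWS Lambda custom-runtime documentation linked above.

```cpp
// Sketch of the custom-runtime loop for a hypothetical `clickhouse lambda` app.
// http_get, http_post, and run_clickhouse_local are illustrative placeholders,
// not existing ClickHouse APIs.
#include <cstdlib>
#include <string>
#include <utility>

std::pair<std::string, std::string> http_get(const std::string & url);   /// returns {request_id, event body}
void http_post(const std::string & url, const std::string & body);
std::string run_clickhouse_local(const std::string & query);

int main()
{
    /// The Lambda service passes the runtime API endpoint in this environment variable.
    const char * api = std::getenv("AWS_LAMBDA_RUNTIME_API");
    const std::string base = std::string("http://") + (api ? api : "") + "/2018-06-01/runtime/invocation/";

    while (true)
    {
        /// Long-poll for the next invocation; treat the event payload as the query to run.
        auto [request_id, event] = http_get(base + "next");
        std::string result = run_clickhouse_local(event);
        /// Report the result of this invocation back to the Lambda service.
        http_post(base + request_id + "/response", result);
    }
}
```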

Indexing with space-filling curves

@Artemmm91

ClickHouse already supports the calculation of the z-order (Morton) curve, as well as its use in index analysis.

This task is about adding support for more types of curves: Hilbert and Moore curves. See #55520

This is not too hard but gives beautiful results.
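For orientation, a minimal sketch of 2-D Morton (z-order) encoding, which ClickHouse already exposes; the Hilbert and Moore variants from #55520 replace this plain bit interleaving with a rotate-and-reflect recurrence, while the surrounding index-analysis machinery stays similar.

```cpp
#include <cstdint>
#include <cstdio>

/// Interleave the bits of x and y: bit i of x goes to bit 2*i, bit i of y to bit 2*i + 1.
/// Points that are close on the resulting curve tend to be close in the plane.
uint64_t mortonEncode2D(uint32_t x, uint32_t y)
{
    uint64_t result = 0;
    for (int i = 0; i < 32; ++i)
    {
        result |= uint64_t((x >> i) & 1) << (2 * i);
        result |= uint64_t((y >> i) & 1) << (2 * i + 1);
    }
    return result;
}

int main()
{
    printf("%llu\n", (unsigned long long) mortonEncode2D(3, 5)); /// prints 39
    return 0;
}
```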

Extending LIMIT clause with boundary conditions

Available

The LIMIT clause in SQL cuts off the result after a given number of records. What if we extended it to stop after certain criteria are met? This is useful for log analysis. Imagine something like "give me five records before and after every record whose text column contains 'upyachka'".

#15341

Extensions for the ORDER BY operator

@aretaval

We will add a SHUFFLE clause for SQL queries, as well as SHUFFLE ... LIMIT, which will work faster than ORDER BY rand(): in O(n) instead of O(n * log(n)).

As a bonus, we can add support for topological sorting: #34343
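To illustrate why SHUFFLE ... LIMIT k can run in linear time, here is a minimal sketch of a partial Fisher-Yates shuffle (not the actual ClickHouse implementation): only the first k positions are touched, so the cost is O(n) for reading the data plus O(k) swaps, with no sort.

```cpp
#include <cstddef>
#include <random>
#include <vector>

/// Partial Fisher-Yates: after the call, the first k elements are a uniformly random
/// sample of the input, in random order. No O(n log n) sort is needed.
template <typename T>
void partialShuffle(std::vector<T> & data, size_t k, std::mt19937_64 & rng)
{
    const size_t n = data.size();
    for (size_t i = 0; i < k && i < n; ++i)
    {
        std::uniform_int_distribution<size_t> dist(i, n - 1);
        std::swap(data[i], data[dist(rng)]);
    }
}
```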

Large-scale internet technology survey

Available

This is for a full-stack engineer or data analyst.

This task does not require C++ knowledge and is implemented mostly in SQL.

Given a dataset of 1 TB of HTML pages (ClickHouse's Minicrawl) or 200 TB of HTML pages (CommonCrawl), do research on the technologies used on the web and prepare a dataset with the extracted metadata. Similar examples: https://w3techs.com/, https://builtwith.com/, https://httparchive.org/

You can check the preparation and current results in this task: #18842

The goal is to build a mini web app. The app will allow answering questions such as: which JavaScript frameworks are the most popular on the Internet (the correct answer is jQuery), and how their popularity changes depending on the website's popularity, country, SSL certificate vendor, etc. There is a boatload of information, but take care: if this topic is not interesting to you, it's unlikely you will make a good app.

Usability improvements for Pretty formats

@Volodyachan

A small but nice task, C++.

#56912
#49153
#58363

Additionally, we will:

  • Automatically switch to "Vertical" format if a single row with multiple columns is returned.
  • Improve the default cutoff at 10,000 records by displaying not only the first but also the last records.

Improvements for import/export data formats HDF5, ION, CBOR, XML, FlatBuffers

@zahartd, @ttser1085

ClickHouse supports a wide variety of data formats for import and export, like CSV, JSON, Protobuf, etc. See IInputFormat, IOutputFormat.

HDF5 is a format for multidimensional arrays of scientific/technical data.

ION is a format from Amazon for semistructured data, somewhat resembling MsgPack and Avro.

For example, take a look at the pull request introducing the Numpy format: #55982 - that was an easy task.

XML: #29822

Advanced code coverage collection on a per-test basis

I've accidentally implemented this task, so it is no longer available 😢

Integration with FSST

Booked by ...

https://github.com/cwida/fsst

See #34246

We will integrate it as a compression method and also provide a way to push down filters into the decompressor, so data can be skipped without decompression.

Index advisor

Booked by ...

Users must currently choose primary and skipping indexes manually, e.g. by looking at which predicates in their queries will be costly. It would be better if ClickHouse could suggest candidate index columns automatically and show them in a system view. Most algorithms for automatic index selection are based on what-if cost estimation in the query optimizer (see here), given some workload. Since we don't have cost estimation in the optimizer yet, we can try to make recommendations from the number of scanned rows (the simplest approach) or use some lightweight statistics about the output cardinalities of scans (collected during a dedicated phase).

Contact: @rschu1ze

GitHub public SSH keys as a user dictionary

Booked by @brezhart, and...

ClickHouse supports authentication via SSH keys: the private key resides on your machine, and you don't share it with anyone; the public key is located on the server, and you can publish it anywhere.

Then, you don't even need to copy the keys to the server; instead, you can instruct the server to fetch your public keys through some API. For example, here are my public keys for GitHub access: https://github.com/alexey-milovidov.keys

If we support key retrieval in this way on the server, we can achieve two things:

  1. It will be sufficient to configure only the username (for example, that alexey-milovidov has access) and specify that the keys should be taken from GitHub, and there is no need to copy anything to the server.

  2. You can allow access to all GitHub users and let them be represented within ClickHouse, for example, with a certain prefix (e.g., gh_alexey-milovidov), if such a user is not explicitly configured; or allow authentication of all GitHub users as one user on the ClickHouse side (for example, github_users).

In addition to (2), you can allow access not to all users but only to users with certain properties. For example, let only those who have given a star to the ClickHouse repository on GitHub use our demo server (this is just a funny example).

HTTP API and UI for ClickHouse Keeper

@speeedmaster

C++ and a bit of JavaScript.

#55332

Best effort sorting for data compression

@ElderlyPassionFruit

A very interesting task; see: #4413 (comment)

There is a proposal on how it could possibly be done.

Comparison of memory allocators in ClickHouse

@ilyaleshchik, @maximdervis, @h0tmi

#34157

Watch these two videos (for those who understand Russian; otherwise, you'll easily find better-quality videos on YouTube):

https://www.youtube.com/watch?v=BTkvaPefAKw
https://www.youtube.com/watch?v=i87W3KyZgPw

PostgreSQL as a dialect in ClickHouse

@nikitakeba, @seshWCS

ClickHouse supports switchable languages. SET dialect = 'prql' | 'kusto', etc. It is proposed to plug in a PostgreSQL query parser library and create a translator from the PostgreSQL AST to the ClickHouse AST. This way, compatibility with a subset of the PostgreSQL dialect can be achieved. It's a subset because PostgreSQL has many unusual operators and functions that will be transformed into unimplemented functions. Additionally, there may be incompatibilities in behavior in edge cases.

Query translation layer for MongoDB

Available

Add support for the MongoDB protocol to ClickHouse and the ability to handle queries in that inconvenient JSON format, as expected by MongoDB (which has SQL support only in its commercial product). As an example, you can take a look at https://github.com/fcoury/oxide

Support for InfluxDB line protocol and streaming inserts

@gitnabi

https://docs.influxdata.com/influxdb/v2.7/reference/syntax/line-protocol/

And #18273

Default expressions in INSERT query

Available

#51509

Multithreaded data import in the native ClickHouse protocol

Available

This is a trivial idea. In the INSERT query, the server receives blocks of data from the client over the network for insertion. The blocks are read from the connection, deserialized, and inserted into the table sequentially. Let's perform the insertion in parallel, according to the max_insert_threads setting.

#42372
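A minimal sketch of the idea, assuming a hypothetical Block type and insertBlock function: one thread keeps reading and deserializing blocks from the connection and pushes them into a queue, while max_insert_threads workers insert them concurrently. The real implementation would live in the server-side INSERT pipeline.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Block {};                        /// placeholder for a deserialized data block
void insertBlock(const Block & block);  /// hypothetical: inserts one block into the table

/// The reader thread calls push() for every received block; finish() waits for the workers.
class ParallelInserter
{
public:
    explicit ParallelInserter(size_t num_threads)
    {
        for (size_t i = 0; i < num_threads; ++i)
            workers.emplace_back([this] { run(); });
    }

    void push(Block block)
    {
        { std::lock_guard lock(mutex); queue.push(std::move(block)); }
        cv.notify_one();
    }

    void finish()
    {
        { std::lock_guard lock(mutex); done = true; }
        cv.notify_all();
        for (auto & worker : workers)
            worker.join();
    }

private:
    void run()
    {
        while (true)
        {
            std::unique_lock lock(mutex);
            cv.wait(lock, [this] { return done || !queue.empty(); });
            if (queue.empty())
                return;
            Block block = std::move(queue.front());
            queue.pop();
            lock.unlock();
            insertBlock(block);         /// insertions from different workers run concurrently
        }
    }

    std::queue<Block> queue;
    std::mutex mutex;
    std::condition_variable cv;
    std::vector<std::thread> workers;
    bool done = false;
};
```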

Quantile Compression codec

@rgolovachev

Compression for floating point data types based on the statistical distribution in a block of data: https://github.com/mwlon/pcodec

It can be used both as lossy and lossless compression. This task is for algorithms and performance optimization.

Lossy Floating Point Compression codecs

Besides quantile compression (above), there are many other lossy FP compression codecs; see e.g. #49888. The task is to create a small survey of how they work and how effective and popular they are, and to integrate one or two of them into ClickHouse. Quantile compression can be used as a baseline for comparison.

Data transformations with URL table functions

@scanhex12

#45994

Streaming Queries

@Michicosun

That's a very large task, but even if only a single item is implemented, it will be more than enough. It will require a good understanding of the query pipeline.

#42990

Interactive metrics displaying in ClickHouse client

@mariaKhr

It is an absolutely great task, and if no one is interested, I will do it by myself.

#39694

API for query rewrite

Available

There are two features inside this task:

  1. Provide out-of-band modifications for a query to run. For example, we already have the possibility to specify limit and offset, additional_table_filters, and additional_result_filter in settings. We can also add something like override_select, override_order_by, etc. This is useful when a user provides a single query, but we generate multiple reports on top of it.

  2. Given a query and a set of desired modifications, return a modified, reformatted query. This is useful for various query UIs.

See #29922

Size-capped tables

@punkmunk

#37548

TOTAL, BY, and ORDER BY combinators for aggregate functions

@perst20

#34156

Integration with SQLSmith

Available

This is needed for testing:

https://github.com/anse1/sqlsmith

Improvements for INFORMATION SCHEMA

Available

This is something easy but enormously boring.

Time series analysis with aggregate functions

@LordVoldebug

The task offers quite a few possibilities. By the way, @aleks5d worked on this task last year and received an excellent grade. So, you can ask him how interesting it is.

Here's the pull request from last year that you can take a look at:
aleks5d#1
aleks5d#2
aleks5d#3

However, the task is still relevant because the functions haven't been added to the main codebase yet, not all planned options have been implemented, and the quality and performance haven't been tested on real data.

Time series prediction algorithms (with Holt-Winters as the main one) can be used for the following (a small illustrative sketch follows the list):

  • Predicting future values;
  • Detecting anomalies in the history;
  • Smoothing data.
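A minimal sketch of simple (single) exponential smoothing, the building block that Holt-Winters extends with trend and seasonality terms; anomaly detection then compares each observation against the smoothed prediction.

```cpp
#include <cstddef>
#include <vector>

/// Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
/// Holt-Winters adds a trend component and a seasonal component updated in the same spirit.
std::vector<double> exponentialSmoothing(const std::vector<double> & x, double alpha)
{
    std::vector<double> s;
    s.reserve(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        s.push_back(i == 0 ? x[0] : alpha * x[i] + (1 - alpha) * s.back());
    return s;
}
```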

Dataset watermarking

Booked by ...

You distribute a dataset to several parties but want to add random, undetectable modifications that make it possible to identify a party if the dataset is distributed further.

Add random modifications to the data to track the dataset source.

Encode a 64-bit identifier by pseudorandomly altering the data so that it can be extracted back by xor-ing with the original dataset, but keep the following properties (a minimal sketch of the encoding step follows the list):

  • spread the modifications uniformly across the dataset, so even if only part of it has been redistributed, the watermark can still be read;
  • keep the ratio of modified bits of the dataset low;
  • the modifications should be undetectable by third parties - only make low-entropy changes under some model of the data (e.g., a Markov model) - for example, prefer to replace the word make with cake rather than )ake;
  • the watermark should not repeat in the data - you can use streaming encryption for this;
  • the watermark should be readable even if another watermark is applied or if the dataset has been slightly modified by a third party;
  • the watermark should be readable even if the data was reordered or if some columns were removed or new columns were added.
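As promised above, a minimal sketch of only the encoding step, under strong simplifying assumptions (a single numeric column and a hypothetical keyed hash used both to select rows and to pick which watermark bit each selected row carries); the low-entropy, reorder- and subset-resistant properties from the list are exactly what this sketch does not yet address.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

/// Hypothetical keyed hash; a real implementation would use a proper PRF keyed with the secret.
uint64_t keyedHash(uint64_t row_key, uint64_t secret)
{
    return std::hash<uint64_t>{}(row_key ^ secret);
}

/// Roughly one row in `rate` is selected; each selected row stores one bit of the 64-bit
/// watermark in the low bit of its value. Extraction recomputes the same hashes and reads
/// the low bits back (a majority vote per bit index makes it robust to partial redistribution).
void embedWatermark(std::vector<uint64_t> & values, const std::vector<uint64_t> & row_keys,
                    uint64_t watermark, uint64_t secret, uint64_t rate = 1000)
{
    for (size_t i = 0; i < values.size(); ++i)
    {
        const uint64_t h = keyedHash(row_keys[i], secret);
        if (h % rate != 0)
            continue;                                  /// most rows stay untouched
        const uint64_t bit_index = (h / rate) % 64;    /// which watermark bit this row encodes
        const uint64_t bit = (watermark >> bit_index) & 1;
        values[i] = (values[i] & ~1ULL) | bit;         /// write the bit into the value's low bit
    }
}
```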

Intelligent auto-complete based on the history from query_log

@iyubondyrev

clickhouse-client will create a text model from the query_log and use it for completion. Recent queries have a higher weight. The text model can be built on top of the tokens after Lexer. It can be a mix of order-0..2 frequencies.

If you're lucky, you can use modern ML with deep learning (clickhouse-client will load pre-trained models); see LLamafile and https://github.com/jart/emacs-copilot

PS. It is important to start with a simple solution. Just load the data from query_log and predict with a Markov model.
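A minimal sketch of the order-2 part of such a model, assuming queries from query_log have already been tokenized with the Lexer; recency weighting is passed in as a weight, and back-off to order-1/order-0 frequencies would be layered on top.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

/// Order-2 Markov model over query tokens: count (prev2, prev1) -> next token,
/// then suggest the most frequent continuation for the two tokens before the cursor.
class MarkovCompleter
{
public:
    void addQuery(const std::vector<std::string> & tokens, double weight = 1.0)
    {
        for (size_t i = 2; i < tokens.size(); ++i)
            counts[{tokens[i - 2], tokens[i - 1]}][tokens[i]] += weight;  /// recent queries get a larger weight
    }

    std::string suggest(const std::string & prev2, const std::string & prev1) const
    {
        auto it = counts.find({prev2, prev1});
        if (it == counts.end())
            return {};
        std::string best;
        double best_count = 0;
        for (const auto & [token, count] : it->second)
            if (count > best_count)
            {
                best = token;
                best_count = count;
            }
        return best;
    }

private:
    std::map<std::pair<std::string, std::string>, std::map<std::string, double>> counts;
};
```

For example, after addQuery({"SELECT", "count()", "FROM", "hits"}), the call suggest("count()", "FROM") returns "hits".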

Fuzzy GROUP BY for data clustering

@no-1dea

Add a modifier WITH CLUSTER to a GROUP BY element, defining a maximum distance between groups. Groups located close to each other will be merged.

Example:

SELECT user_id, min(event_time), max(event_time) FROM table GROUP BY user_id, event_time WITH CLUSTER 1800

This query will group together sessions of events for every user_id, where a session is a sequence of events with the distance between neighboring events no larger than 1800 seconds (half an hour).

It can be extended to the multidimensional case:

Example:

SELECT user_id, count() FROM table GROUP BY user_id, (x, y) WITH CLUSTER 10

This query will group together points (x, y) with L2-distance less than 10.
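A minimal illustrative sketch of the one-dimensional case (not the actual implementation): sort the key values and start a new group whenever the gap to the previous value exceeds the cluster distance. This is exactly the sessionization from the first example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

/// Assign cluster ids to the sorted values: consecutive values closer than max_distance
/// fall into the same group, mirroring GROUP BY ... WITH CLUSTER for one dimension.
std::vector<size_t> clusterByGap(std::vector<double> values, double max_distance)
{
    std::sort(values.begin(), values.end());
    std::vector<size_t> cluster_ids(values.size(), 0);
    for (size_t i = 1; i < values.size(); ++i)
    {
        cluster_ids[i] = cluster_ids[i - 1];
        if (values[i] - values[i - 1] > max_distance)
            ++cluster_ids[i];          /// the gap is too large: start a new session/cluster
    }
    return cluster_ids;
}
```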

JOIN optimization with in-memory compression

@p1rattttt

#21047

JOIN is usually performed by placing blocks from the right side of the JOIN into RAM and creating a hash table referencing these blocks, followed by lookups in this table. However, the data may not fit into memory. In this case, data structures in external memory are used. But why not try an option where the data blocks are compressed and remain in RAM, and only a small part of them (chosen by LRU) is decompressed on demand and kept in uncompressed form?

We already have an interface for this, IColumn::compress, and it wouldn't be difficult to use it.
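A minimal sketch of the LRU part, with hypothetical Block/CompressedBlock placeholders (in ClickHouse the compression hook would be IColumn::compress): all right-side blocks stay compressed, and at most `capacity` of them are kept decompressed at a time.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

struct Block {};                                                     /// placeholder: decompressed block
struct CompressedBlock { Block decompress() const { return {}; } };  /// placeholder: compressed block

/// Right-side blocks of the JOIN stay compressed in RAM; at most `capacity` of them
/// are kept decompressed, evicted in least-recently-used order.
class DecompressedBlockCache
{
public:
    DecompressedBlockCache(std::vector<CompressedBlock> blocks_, size_t capacity_)
        : blocks(std::move(blocks_)), capacity(capacity_) {}

    const Block & get(size_t index)
    {
        if (auto found = cache.find(index); found != cache.end())
        {
            lru.splice(lru.begin(), lru, found->second.second);      /// mark as most recently used
            return found->second.first;
        }
        if (cache.size() >= capacity && !lru.empty())
        {
            cache.erase(lru.back());                                 /// evict the least recently used block
            lru.pop_back();
        }
        lru.push_front(index);
        auto inserted = cache.emplace(index, std::make_pair(blocks[index].decompress(), lru.begin()));
        return inserted.first->second.first;
    }

private:
    std::vector<CompressedBlock> blocks;
    size_t capacity;
    std::list<size_t> lru;                                           /// most recently used at the front
    std::unordered_map<size_t, std::pair<Block, std::list<size_t>::iterator>> cache;
};
```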

Text classification with n-gram models

Available

Implement loading of n-gram-based language models in the server. A model is represented by a map (ngram, class) -> count. Add a function that, given the model name and an input string, returns the most likely class or the probability distribution over classes, using a naive Bayes classifier.

This might sound archaic, but it will work decently for tasks such as charset and language detection, as well as style classification.

See previous attempts: #50247 and #48984
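A minimal sketch of the classifier core, assuming a character-trigram model has already been loaded as the (ngram, class) -> count map plus per-class totals; add-one smoothing keeps unseen n-grams from zeroing out a class (a real implementation would also account for vocabulary size and class priors).

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

/// Naive Bayes over character n-grams: score(class) = sum over the text's n-grams of
/// log P(ngram | class), with add-one smoothing for unseen n-grams.
std::string classify(
    const std::string & text,
    const std::map<std::pair<std::string, std::string>, double> & ngram_class_count,  /// (ngram, class) -> count
    const std::map<std::string, double> & class_total,                                /// class -> total n-gram count
    size_t n = 3)
{
    std::string best_class;
    double best_score = -1e300;
    for (const auto & [cls, total] : class_total)
    {
        double score = 0;
        for (size_t i = 0; i + n <= text.size(); ++i)
        {
            auto it = ngram_class_count.find({text.substr(i, n), cls});
            const double count = (it == ngram_class_count.end()) ? 0 : it->second;
            score += std::log((count + 1) / (total + 1));    /// smoothed log-likelihood of this n-gram
        }
        if (score > best_score)
        {
            best_score = score;
            best_class = cls;
        }
    }
    return best_class;
}
```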

Merge process visualization and optimization

@sodeniZzz, @arshtanko

The task has some parts in Python (modeling and optimization, although it can be implemented in any other language), C++ (implementing the algorithm in ClickHouse), and JavaScript (algorithm visualization).

https://presentations.clickhouse.com/hse_2019/merge_algorithm.pptx - there is something interesting here (previous developments on this task), but don't be afraid, it's not very complicated.

Integration with GGML

@HazyMrf, @m0r0zk01, @IArtNextI

Check out GGML and llama.cpp, and how we can integrate them for text processing.

Also check https://jaykmody.com/blog/gpt-from-scratch/ and https://bbycroft.net/llm

Peephole optimizations for ClickHouse

@VKlin357

Various query optimizations that rewrite small pieces of expressions to simplify them. For example: #50167

Many such optimizations can be imagined, and the challenge lies in implementing them correctly and in a user-friendly manner.

Ask Reddit: a large-scale RAG experiment

Available

We will embed over 20 billion comments into vectors and try to do something on top of that.

Improvements for ClickHouse Fiddle

Available

A guest task from https://github.com/lodthe/

Implementation of monotonic unique identifiers Snowflake ID and UUID v5/6/7

@kazalika

Quite a small task, but it also extends to the implementation of auto-increment in tables.

See #56228 (comment)

Streaming consumption from external tables with a monotonic key

@NDamirov, @dubr0vin

A good topic: quite easy to understand, and the results are easily presentable, but, at the same time, it requires attention to detail.

The task is to import data into ClickHouse from a table located in an external system, such as MySQL or PostgreSQL. Let's assume that new records are constantly added to the table and old ones do not change (append-only). Additionally, records have a monotonically increasing identifier (such as auto_increment or timestamp). We will simply remember the value of this identifier during the previous successful import and retrieve new records with a query.

This results in a straightforward solution for replication, although suitable for a limited number of scenarios.

An unusual detail is the need to support querying data with a time overlap. For example, the next query will extract data not precisely from the previous moment but from the previous moment minus 10 seconds and filter out already imported records based on their hashes.

Visualization and analysis of program traces

@Krytog

First, we create a beautiful video illustrating how the instruction pointer jumps through addresses in machine code. Drawing can be done using shaders on the GPU. The catch is that the video may turn out to be not so beautiful :)

Next, we take the names of symbols (functions) and transform them into vectors using the word2vec algorithm based on how often they appear together in the program trace.
https://jalammar.github.io/illustrated-word2vec/
https://github.com/JosephSBoyle/skip_gram/blob/master/skip_gram/main.py

Then, we visualize them using umap: https://github.com/lmcinnes/umap/

Take a look at things like this, too: https://codisec.com/binary-data-visualization/

I'm afraid this task is suitable only for enthusiasts.

Streaming consumption from Apache Pulsar

@nikitagolu8ev, @SteveBalayanAKAMedian

ClickHouse implements streaming import from persistent queues, such as Kafka, RabbitMQ, NATS, FileLog.

Apache Pulsar is a persistent queue, similar to Kafka. Let's add support for it in a similar way.

By the way, RabbitMQ and NATS were added in previous years' projects:

#11069
#37171

Data visualization in the terminal in clickhouse-client

@Novart-projects

Take a look at:

https://github.com/red-data-tools/YouPlot
https://github.com/JuliaPlots/UnicodePlots.jl

These projects are written in Ruby and Julia, so we cannot use them directly.

But we can implement something similar in C++ and provide it as an output format.
Maybe you can find a library in Rust that we are also allowed to use (Rust is an obscure programming language for terminal apps).

Improvement for SQL functions in ClickHouse

@cucumber95

This task has an uninteresting name, and it includes #29978, #38002, #36470, and similar.

Data parallel implementation of hash functions for fixed-size strings

@piezzacondoleezza

Cryptographic hash functions, such as SHA-1, SHA-2, and SipHash, typically do not use SIMD instructions, as updating their state depends on previous bytes of data in a chain.

But if we need to apply a hash function to an array of strings, can we use SIMD by applying them simultaneously to a group of strings?

If the strings are of different lengths, it's difficult. However, if they are of fixed length, it becomes quite convenient.

This task is suitable only if you've already tried to do something with SIMD instructions.

Here is an article - an introduction to SIMD: https://woboq.com/blog/utf-8-processing-using-simd.html
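A minimal sketch of the data-parallel layout using a toy FNV-1a hash over several fixed-size strings at once: the lanes never interact, so the inner loop maps naturally onto SIMD registers (or auto-vectorizes); SHA-2 or SipHash would keep the same lane-per-string layout but plug in their own round functions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

/// Hash LANES fixed-size strings in lock-step with a toy FNV-1a. Because each lane's state
/// is independent, the compiler can keep all the states in a single SIMD register.
template <size_t LANES, size_t LENGTH>
std::array<uint64_t, LANES> hashLanes(const std::array<const uint8_t *, LANES> & strings)
{
    std::array<uint64_t, LANES> state;
    state.fill(0xcbf29ce484222325ULL);                 /// FNV offset basis
    for (size_t pos = 0; pos < LENGTH; ++pos)
        for (size_t lane = 0; lane < LANES; ++lane)    /// the lane loop is the vectorizable one
        {
            state[lane] ^= strings[lane][pos];
            state[lane] *= 0x100000001b3ULL;           /// FNV prime
        }
    return state;
}
```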

Language Server for ClickHouse

@Minnakhmetov

ClickHouse should implement the language server protocol in a way that allows IDEs to properly understand its syntax.

ClickHouse does not provide an independent grammar, so we cannot implement it using ANTLR or, e.g., tree-sitter. It means we have to embed the language server directly into ClickHouse.

Then multiple questions arise, for example: how to recover from parser errors so that parsing can continue and syntax highlighting remains correct.

As a bonus, we can improve the syntax highlighting in clickhouse-client - currently, it is implemented using a lexer. But we'll do it with a parser.

Can we also implement context-aware completions, so it can find and use the information about database objects, functions, etc?

Can we also show a snippet from the docs in clickhouse-client while typing?

fool1280 commented Jan 5, 2024

Hi @alexey-milovidov, can I work on #29922? It seems the best fit for my background now. Thank you.

wheatdog (Contributor) commented

Hi @alexey-milovidov, I am interested in the task "Integration with SQLSmith". Could you assign it to me? Thanks

thegalang commented

Hi @alexey-milovidov, I am interested in working on "Default expressions in INSERT query", #51509. Can you assign it to me? Thank you.

thinh2 commented Feb 14, 2024

Hi, are the tasks above open for everyone, or only available to ClickHouse interns?

acoolnev commented

Hi @alexey-milovidov, can I book 'Integration of ClickHouse with AWS Lambda'? I wanted to take a deep dive into the AWS Lambda runtime and stumbled upon this task. Fortunately, there is already a C++ runtime implementation at https://github.com/awslabs/aws-lambda-cpp, so it was relatively easy to integrate it with ClickHouse by hooking up the Lambda payload as external table data and capturing the query result into a string to send as the Lambda response. Despite the Lambda payload limitation of 6 MB, it is nice to have the ability to retrieve data directly from Lambda requests, for example, for Change Data Capture (CDC) scenarios.

During this research, I thought it would be good to have the following features:

  1. Warm up the Lambda function with a SQL script, for example, with dictionaries that will then be used during the processing of Lambda handler requests. This approach appears to be useful for processing a massive number of files with the same structure using a pool of Lambda instances or for parallel processing of a big data file split into multiple chunks.

  2. The Lambda function can be utilized in CDC scenarios to handle events from DynamoDB, S3, etc. However, the current set of JSON formats does not allow for direct integration. They could potentially be extended with another format, such as JSONAwsEvent. This would enable Lambda to process CDC events, e.g.:

INSERT INTO db.my_table SELECT * FROM lambda_payload FORMAT JSONAwsEvent 

h12567 commented Apr 25, 2024

@alexey-milovidov May I ask if the above features are available for external contributions, or reserved for your company's official interns only?
