Tiles and tiling

Tiles in HiGlass

HiGlass only requests small chunks of data corresponding to the visible region from the server. As seen on the left, any higlass view is composed of a number of "tiles" which are pieced together to form the visible region on the screen. Tiles are identified by their zoom level, x position and y position (shown as z/x/y on in the figure).

Tiles can be classified according to the dimensionality of the data they contain (1D or 2D) and according to the structure of the data (sparse or dense). Dense data containing tiles (e.g. for matrices or lines) store the data as an array. The positions of each data point can be determined by the tile's position and the index of the datapoint within the dense array. They are transferred between the server and client as base64 encoded strings.

Sparse tiles are more flexible in the type of data they contain and they contain position information for each data point. This makes it possible to display features such as gene annotation that are present in some locations and not others. Each dense. 2D tile contains the data for a 256x256 pixel region. Due to the free zooming, when a tile is displayed, it can take up an area smaller or larger than 256x256 pixels. Dense 1D tiles contain 1024 data points.

All tile data is compressed on the server and extracted on the client to minimize the amount of data which needs to be transferred.

Motivation

To display large datasets, HiGlass relies on aggregation and tiling to fetch only the visible region at any given time. A high-resolution (1Kb) genomic contact map will be matrix of roughly 3M rows and 3M columns. While the sparsity of the matrix implies that the majority of the cells in the matrix will be unpopulated, these cells still need to be rendered. Assuming one pixel per column, a monitor would need to be approximately two and a half football fields long to display it in its entirety. To fit the entire matrix in a single monitor, we need to aggregate data so that we are displaying multiple rows and column on each pixel.

Aggregation

Aggregation is the process of reducing larger datasets to smaller ones for the purposes of displaying more data than can fit on the screen at once. While there are a multitude of ways to aggregate large datasets, we make significant use of summation and prioritization for numerical and categorical data, respectively. The following sections will provide examples for how we use aggregation to reduce the size of larger datasets.

Summation

Aggregation by summation is simply aggregation of adjacent elements by summing their values. This can be generalized to any n-dimensional data set, but we only employ it for matrices and vectors.

Contact matrices

In the case of contact matrices, this is easily accomplished by summing adjacent cells. Consider the following matrix:

1	2	3	4
5	6	7	8
9	10	11	12
13	14	15	16

We can perform a round of aggregation, summing each block of four cells into one:

14	22
46	54

From this state, we could perform one more round of aggregation and end up with a matrix with just one entry which is simply the sum of all the values of our original matrix.

136

This procedure is an example of how we can lower the resolution of our matrix so that it can be displayed using fewer pixels.

Vectors

Similar to matrices, vectors can also be reduced by summation. The vector

1	2	3	4

reduces to

3	7

and can be further reduced to

10

While summation is neither the only way nor, perhaps, the best way to reduce large matrices and vectors, we find that it not only serves our purposes but meshes smoothly with the notion of binning in the creation of contact matrices. A contact matrix that starts out at 1Kb resolution becomes a matrix at 2Kb resolution after one round of aggregation. After 14 rounds of aggregation, an entire 1Kb resolution human contact matrix can fit into a 256x256 pixel image (3.2e9 < 256 * 2 ** 14 * 1000). The resolution of the data at this level of aggregation is 1Kbp * 2 ** 14 = 16384 Kbp per pixel (or bin).

Prioritization

Another method of aggregating data is by picking out entries from a dataset according to some importance function. This is commonly found in maps. Showing every village on an overview of the world would be useless because all of the labels would overlap. Showing every village when only a county is displayed, however, makes more sense. As the size of the area increases, labels are selectively hidden to show features with a higher priority (often population or area, in the case of maps). The same holds true for gene annotations and other genomic features.

Gene Annotations

To display every gene label in the genome on one monitor is impossible. The labels would overlap. By prioritizing some labels over others and selectively hiding those with lower priority, we can maintain a nearly constant number of non-overlapping labels at any resolution. To do this, we first declare that we will attempt to display no more than 100 gene labels in any 1024 pixel region. We then aggregate adjacent regions by taking the 100 most 'important' entries from the union of the genes in the two regions. This can be illustrated using a simple example where we begin with a list of prioritized regions