big-store-wti

Intro

Abstract

Big Store is meant to be a highly efficient system for storing large amounts of data with a guaranteed GET time.

About

Motivation

Modern systems often require strict time constraints to be met in order to assure the desired quality of service or SLA. This applies in particular to distributed systems, online machine learning and end-user solutions. With great scale come immense amounts of data, and storage access can become a bottleneck for any application. A reliable solution to this problem can bring a substantial performance improvement to all kinds of large-scale systems. Big Store aims to be a proof of concept for tackling these problems.

General assumptions

Several assumptions can be made about the data that apply to a large part of modern use cases:

  1. Data should be saved and accessed in a concurrency-safe manner.
  2. Storage should be redundant and scalable.
  3. Data should be accessible quickly enough to be used for online actions.
  4. Retrieved data must be as fresh as possible, but not necessarily the freshest available.

Proposed solution

Big Store is a data storage system. It is designed to handle a high volume of incoming and outgoing traffic while providing users with a stable data retrieval time. BS is based on two data levels: data is stored in the main high-volume store, which is managed by several companions that bridge the store with a cache storage.

Each request for retrieving data is directed by the Hub to the companion assigned to the desired part of the store. The companion then tries to provide the client with the freshest data available. It does so by querying the cache first; if the cached data is fresh enough, it is returned. If the data is present in the cache but too old to be considered the freshest possible, the companion asks the store for the data. If the store does not manage to handle the request in time, the entry from the cache is returned to the client. Late queries to the store are not cancelled; their results are used by a background process to update the cache.
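
The sketch below is a minimal Kotlin illustration of this retrieval flow, not the actual companion code: the Entry type, the CacheClient and StoreClient interfaces, and the freshness threshold and store time budget are all assumptions made for the example.

```kotlin
import kotlinx.coroutines.*
import java.time.Duration
import java.time.Instant

// Illustrative types; the real companion talks to Redis (cache) and Cassandra (store).
data class Entry(val value: String, val updatedAt: Instant)

interface CacheClient {
    suspend fun get(key: String): Entry?
    suspend fun put(key: String, entry: Entry)
}

interface StoreClient {
    suspend fun get(key: String): Entry?
}

class RetrievalService(
    private val cache: CacheClient,
    private val store: StoreClient,
    private val scope: CoroutineScope,
    private val freshness: Duration = Duration.ofSeconds(5), // assumed freshness threshold
    private val storeBudgetMs: Long = 50                     // assumed time budget for the store
) {
    suspend fun getEntry(key: String): Entry? {
        val cached = cache.get(key)
        // Fresh enough: answer straight from the cache.
        if (cached != null && Duration.between(cached.updatedAt, Instant.now()) <= freshness) {
            return cached
        }
        // Stale or missing: ask the store, but only wait for the allotted budget.
        val fromStore = scope.async { store.get(key) }
        withTimeoutOrNull(storeBudgetMs) { fromStore.await() }?.let { return it }
        // The store was late: serve the stale cache entry now and let the late result
        // update the cache in the background instead of being cancelled.
        scope.launch { fromStore.await()?.let { cache.put(key, it) } }
        return cached
    }
}
```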

Saving data into the store happens asynchronously. After submitting the data, the client is given a promise that the data will eventually be saved in the store. All requests are directed by the Hub to the companion assigned to the desired part of the store. A background process then queues the data and saves it while maintaining the order of arrival.
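
A minimal Kotlin sketch of this write path is shown below; the StoreWriter interface, the unbounded channel and the class name are assumptions made for the example, not the actual implementation.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel

// Illustrative write target; the real companion persists to Cassandra.
interface StoreWriter {
    suspend fun save(key: String, value: String)
}

class WriteBuffer(private val store: StoreWriter, scope: CoroutineScope) {
    private data class Pending(val key: String, val value: String, val done: CompletableDeferred<Unit>)

    private val queue = Channel<Pending>(capacity = Channel.UNLIMITED)

    init {
        // A single consumer drains the queue, so writes reach the store in arrival order.
        scope.launch {
            for (item in queue) {
                store.save(item.key, item.value)
                item.done.complete(Unit)
            }
        }
    }

    // The caller immediately receives a promise that the data will eventually be stored.
    suspend fun put(key: String, value: String): Deferred<Unit> {
        val pending = Pending(key, value, CompletableDeferred())
        queue.send(pending)
        return pending.done
    }
}
```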

Access to the system is provided by a REST API and an asynchronous RabbitMQ queue. REST provides the ability to both update and retrieve data. The asynchronous API is used only for updating data in the system.
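
For illustration, a client could talk to the REST API roughly as in the Kotlin snippet below. The host, port, endpoint path and payload shape are purely hypothetical, since the README does not document the actual routes.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

fun main() {
    val client = HttpClient.newHttpClient()

    // Hypothetical endpoint and payload -- the real routes are defined by the Hub.
    val put = HttpRequest.newBuilder(URI.create("http://localhost:8080/entries/user-42"))
        .header("Content-Type", "application/json")
        .PUT(HttpRequest.BodyPublishers.ofString("""{"value": "example"}"""))
        .build()
    println(client.send(put, HttpResponse.BodyHandlers.ofString()).statusCode())

    // Retrieve the same entry.
    val get = HttpRequest.newBuilder(URI.create("http://localhost:8080/entries/user-42"))
        .GET()
        .build()
    println(client.send(get, HttpResponse.BodyHandlers.ofString()).body())
}
```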

Technologies

Such a use case requires a technology that reliably handles long-running processes under continuous heavy load. One of the best available platforms that meets these constraints is the JVM. The Java Virtual Machine was designed to run under high-traffic conditions, which makes it a natural choice for Big Store.

The Hub is implemented in Scala using Akka HTTP. As its main role is to direct traffic to the correct companions, the Akka actor model makes it a perfect tool for the job.
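
The README does not describe how the Hub decides which companion owns which part of the store. One simple scheme consistent with the description is hashing the key, as in the sketch below (written in Kotlin for consistency with the other examples, although the Hub itself is Scala); the partitioning scheme is an assumption.

```kotlin
// Assumed partitioning scheme: hash the key and take it modulo the expected number
// of companions, so a given key is always served by the same companion instance.
fun companionFor(key: String, expectedNumber: Int): Int =
    Math.floorMod(key.hashCode(), expectedNumber)
```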

The companions use Kotlin with coroutines. This allows them to handle concurrent jobs (such as incoming GETs or background queries) with very few resources engaged, which is crucial since BS spawns multiple companions at once.

Data is stored in Cassandra because of its fast inserts and high scalability. Redis was chosen for the cache.

Sample deployment

The figure below shows Big Store deployed with 3 companions and both the REST and async APIs enabled.

Schema of sample deployment with 3 companions

Data flows

Data inflow

The figure below shows the flow of input data in Big Store.

Schema of data inflow

Data outflow

The figure below shows the flow of output data in Big Store.

Schema of data outflow

Implementation state

For the time being, the system works as described above, with a few exceptions. For instance, inserts into Cassandra do not go through any buffer. Nevertheless, the system as a whole works and was successfully demonstrated to the supervisor.

Future improvements

Some possible future improvements include:

  1. A user-configurable data model.
  2. A single queue for incoming data in the Hub.
  3. Performance analysis and tests for a possible switch from Cassandra to MongoDB or MySQL.
  4. Client libraries in Kotlin (compatible with Java) and Python.

Deployment

Prerequisites

Big Store is deployed with Docker and was tested on Docker Desktop for Windows and on Docker running on an ARM-powered computer.

Build

Building Big Store requires preparing fat JARs of the Hub and the companion.

To build the fat JAR of the Hub, run the following in the hub directory:

sbt assembly

To build the fat JAR of the companion, run the following in the cluster-companion directory:

./gradlew clean shadowJar

The Dockerfiles are configured to find the JARs in their default build locations.

Run

The system can be run by a single shell command:

docker-compose up --build --scale cluster-companion=N --scale cache=N

Here N is the desired number of companions. This number should also be set in the Hub's configuration, in the file hub\src\main\resources\application.conf under the path big-store.hub.companions.expectedNumber.
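
For example, with three companions the relevant fragment of application.conf might look like the snippet below. Only the path big-store.hub.companions.expectedNumber comes from this README; the surrounding layout of the file is assumed. It would match a run with --scale cluster-companion=3 --scale cache=3.

```hocon
big-store {
  hub {
    companions {
      # Assumed layout; must match the N used in the docker-compose --scale flags.
      expectedNumber = 3
    }
  }
}
```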

License

MIT (see license.md)

Credits

Big Store was made by Jakub Riegel and supervised by Andrzej Szwabe, PhD, as a project for the Selected Internet Technologies course at Poznań University of Technology.