Name		Name	Last commit message	Last commit date
parent directory ..
assembly		assembly
benchmark		benchmark
common		common
compiler		compiler
data-load-tool		data-load-tool
executor		executor
frontend		frontend
groot-client		groot-client
groot-module		groot-module
groot-server		groot-server
tests		tests
README-zh.md		README-zh.md
README.md		README.md
pom.xml		pom.xml

README.md

GraphScope Interactive Engine

GraphScope Interactive Engine (GIE) is a distributed system designed specifically to make it easy for a variety of users to analyze large and complex graph structures in an exploratory manner. It exploits Gremlin to provide a high-level language for interactive graph queries, and provides automatic parallel execution.

Scalable Gremlin

GIE is designed to faithfully preserve the programming model of Gremlin, which is widely supported by popular graph database vendors such as Neo4j, OrientDB, JanusGraph, Microsoft Cosmos DB, and Amazon Neptune. As a result, it can be used to scale existing Gremlin applications to large compute clusters with minimum modification.

Read the full documentation for more information on the current implementation of Gremlin.

High Performance

GIE achieves high performance for complex Gremlin traversal, using a parallelizing compiler and a corresponding system that can execute distributed dataflow computations efficiently on large clusters.

Here is a comparison between GIE (labeled as GraphScope in the figure) and JanusGraph using the LDBC Social Network Benchmark (Interactive Workload):

We choose five representative queries (one small query plus four large queries) to compare query latency. GraphScope runs 13~147x faster than JanusGraph, and can scale those large, complex traversal queries almost linearly across multiple servers in a cluster.

See benchmark for a more detailed performance report with explanations.

Getting Started

There are Getting Started instructions here.

Background and Roadmap

Since about two years ago, we heard an increasing demand from internal data analysts at Alibaba for an easier way to explore different types of patterns in big graph data, such as cycles and bipartite cliques [2], instead of having users (typically non-technical domain experts) to write specific algorithms (in Java or C++) for each particular task.

An example graph model for fraud detection.

As an example, the graph depicted in the above figure is a simplified version [3] of a real query employed at Alibaba for credit card fraud detection. By using a fake identifier, the "criminal" may obtain a short-term credit from a bank (vertex 4). He/she tries to illegally cash out money by forging a purchase (edge 2-->3) at time t1 with the help of a merchant (vertex 3). Once receiving payment (edge 4-->3) from the bank (vertex 4), the merchant tries to send the money back (edges 3-->1 and 1-->2) to the "criminal" via multiple accounts of a middle man (vertex 1) at time t3 and t4, respectively. This pattern eventually forms a cycle (2-->3-->1...-->2).

In reality, the graph could contain billions of vertices (e.g., users) and hundreds of billions to trillions of edges (e.g., payments), and the entire fraudulent process can involve much more complex chains of transactions, through many entities, with various constraints, which therefore requires complex interactive analysis to identify.

Thus, we started this project to offer a new distributed-system infrastructure for the new class of graph applications. This branch contains our first technology preview -- GAIA release. It has been in use by a small community of developers at Alibaba for over a year, resulting in tens of large applications and many more small programs.

Feedback from users has generally been very positive. Even so, with our experience with the system rapidly growing due to wider usage within Alibaba, we have identified important areas for improvement, such as:

Memory management. Graph traversal can produce paths of arbitrary length, leading to the memory usage growing exponentially with the number of hops. This can result in unbounded memory, especially in interactive environments with limited memory configuration. For example, a 5-hop traversal from a single vertex in a graph with an average degree of 100 (which is normal for graphs we tackle at Alibaba) can produce 10 billion paths.
Early termination. Traversing all candidate paths fully is often unnecessary, especially for interactive queries with dynamic conditions running on diverse input graphs. For example, in the following query, only the first k results are needed. For real-world queries on large graph data, such wasted computation can be hidden deeply in nested traversals (e.g., a predicate that can be evaluated early from partial inputs) and significantly impact query performance. While avoiding such wastage is straightforward in a sequential implementation, it leads to an interesting trade-off between parallelism and wasted computation for a fully-parallel execution.
```
g.V(2).repeat(out().simplePath())
 .times(4).path()
 .limit(k)
```

We have addressed the above issues for real production deployment at Alibaba as reported in this research paper [1]. The improvements are NOT yet included in the GAIA release, but they will most likely be shipped as the next major update in the first half of 2021.

We welcome community feedback on our plans (including issues, features, etc.). The best way to give feedback is to open an issue in this repository.

Publications

Zhengping Qian, Chenqiang Min, Longbin Lai, Yong Fang, Gaofeng Li, Youyang Yao, Bingqing Lyu, Zhimin Chen, Jingren Zhou. GraphScope: A System for Interactive Analysis on Distributed Graphs Using a High-Level Language. Accepted at NSDI ’21. The final version of the paper is due on March 2, 2021.
Bingqing Lyu, Lu Qin, Xuemin Lin, Ying Zhang, Zhengping Qian, and Jingren Zhou. Maximum biclique search at billion scale. Awarded Best Paper Runner-up in VLDB 2020. (pdf)
Xiafei Qiu, Wubin Cen, Zhengping Qian, You Peng, Ying Zhang, Xuemin Lin, and Jingren Zhou. Real-time constrained cycle detection in large dynamic graphs. In VLDB 2018. (pdf)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

interactive_engine

interactive_engine

assembly

assembly

benchmark

benchmark

common

common

compiler

compiler

data-load-tool

data-load-tool

executor

executor

frontend

frontend

groot-client

groot-client

groot-module

groot-module

groot-server

groot-server

tests

tests

README-zh.md

README-zh.md

README.md

README.md

pom.xml

pom.xml

README.md

GraphScope Interactive Engine

Scalable Gremlin

High Performance

Getting Started

Background and Roadmap

Publications

Files

interactive_engine

Directory actions

More options

Directory actions

More options

Latest commit

History

interactive_engine

Folders and files

parent directory

GraphScope Interactive Engine

Scalable Gremlin

High Performance

Getting Started

Background and Roadmap

Publications