Benchmark: Trovares xGT vs. Neo4j™

We are always testing our graph software internally to make sure we hold our lead in performance and scalability over our competition. Below are our most recent results from benchmarking vs. Neo4j (Mar 2022).

Temporal Triangles

The “temporal triangles” benchmark represents a key analytic kernel in graph search. It searches for all tightly-coupled interactions between groups of three actors during a narrow time window. Each group of three actors (triangle) could represent computers communicating with each other, people sending messages via social media, bank accounts that had money transfers to each other, as well as other cases of three actors of interest interacting with each other.

A key aspect of the benchmark is that all the tightly-coupled triangular interactions present in the graph must be found. There are no pre-determined sets of actors to start with or specific triangular patterns to search for. All triangular patterns (A, B, C) that satisfy the time-constrained ordering must be found.

Why Temporal Triangles?

The benchmark is centered on having a temporal data attribute on edges, which applies to a wide range of domains.
The idea of increasing timestamps around the 3-cycle is intended to represent events happening in a sequence, usually implying causality from a single external phenomenon.
The limit on the total elapsed time around the 3-cycle suggests that all three events (edges) occurred very close in time because they are part of the same phenomenon.
Three events recorded in the data for a single phenomenon together describe a behavior.
By abstracting away the specifics of any application, the 3-cycle and temporal constraints represent the kernel of many search problems and embody an intrinsically challenging search space. There are no seeds to localize the search. The entire graph must be searched for evidence of the desired behavior.
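As an illustration only, the search described above can be sketched in a few lines of Python. This is a brute-force reference, not how xGT or Neo4j evaluates the query. It assumes directed edges stored as (source, target, timestamp) tuples, strictly increasing timestamps around the cycle, and a single threshold on the time between the first and last edge; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def temporal_triangles(edges, threshold):
    """Find directed 3-cycles A -> B -> C -> A whose edge timestamps
    strictly increase and whose total elapsed time is <= threshold.

    edges: iterable of (src, dst, timestamp) tuples (assumed layout).
    Returns a list of (edge1, edge2, edge3) triples.
    """
    edges = list(edges)
    out_edges = defaultdict(list)  # src -> [(dst, timestamp), ...]
    for src, dst, t in edges:
        out_edges[src].append((dst, t))

    found = []
    # No seeds: every edge in the graph is a candidate first edge.
    for a, b, t1 in edges:
        if a == b:
            continue  # skip self-loops
        deadline = t1 + threshold
        for c, t2 in out_edges[b]:
            if c == a or c == b or not (t1 < t2 <= deadline):
                continue
            for back, t3 in out_edges[c]:
                if back == a and t2 < t3 <= deadline:
                    found.append(((a, b, t1), (b, c, t2), (c, a, t3)))
    return found


sample = [
    ("A", "B", 1),
    ("B", "C", 5),
    ("C", "A", 9),   # closes the triangle inside the window
    ("C", "A", 60),  # closes it too late
    ("B", "A", 3),
]
triangles = temporal_triangles(sample, threshold=42)
print(triangles)
```

Because the timestamps must strictly increase, each qualifying cycle is reported once, starting from its earliest edge. The nested loops also show why the problem is hard at scale: without seeds, high-degree vertices multiply the number of candidate paths that must be checked.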

Dataset Description

The temporal triangles benchmark uses synthetically generated graph datasets that can be scaled to different sizes in a straightforward manner. Each generated graph has an R-MAT (Recursive Matrix) topology to accurately represent graphs found in real-world datasets such as computer network interactions, social networks, and other actor interaction datasets. R-MAT graphs have a power-law (or exponential) degree distribution with most vertices/nodes having just a few neighbors and a few vertices having very large numbers of neighbors.
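For readers unfamiliar with R-MAT, the following is a minimal sketch of the general technique, not the generator used for this benchmark. R-MAT picks each edge by recursively choosing one of four quadrants of the adjacency matrix. The quadrant probabilities (0.57, 0.19, 0.19, 0.05), the uniformly random timestamps, and all names below are assumptions made for illustration; the article does not state the parameters actually used.

```python
import random


def rmat_edges(scale, num_edges, probs=(0.57, 0.19, 0.19, 0.05),
               max_time=1000, seed=0):
    """Generate (src, dst, timestamp) edges over 2**scale vertices.

    probs: assumed probabilities for the top-left, top-right,
    bottom-left, and bottom-right quadrants at every recursion level.
    Timestamps are drawn uniformly from [0, max_time) (an assumption).
    """
    rng = random.Random(seed)
    a, b, c = probs[:3]
    edges = []
    for _ in range(num_edges):
        src = dst = 0
        for _bit in range(scale):
            src <<= 1
            dst <<= 1
            r = rng.random()
            if r < a:
                pass              # top-left: both bits stay 0
            elif r < a + b:
                dst |= 1          # top-right
            elif r < a + b + c:
                src |= 1          # bottom-left
            else:
                src |= 1          # bottom-right
                dst |= 1
        edges.append((src, dst, rng.randrange(max_time)))
    return edges


edges = rmat_edges(scale=10, num_edges=1000)
print(len(edges), edges[:3])
```

Skewing the probabilities toward one quadrant is what concentrates edges on a small set of vertices and produces the heavy-tailed degree distribution described above.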

We use graph datasets corresponding to increasing numbers of edges/connections with vertices/nodes being about 10% of the number of edges. The datasets used in the experimental evaluation are as follows:

10,000 (10^4) edges.
100,000 (10^5) edges.
1,000,000 (10^6) edges.
10M (10^7) edges.
100M (10^8) edges.
1B (10^9) edges.
5B (5 x 10^9) edges.
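To give a sense of scale, the short sketch below applies the "about 10%" vertex-to-edge ratio stated above to each dataset size. The resulting vertex counts are approximations derived from that ratio, not figures reported by the benchmark, and the helper name is hypothetical.

```python
EDGE_COUNTS = [10**4, 10**5, 10**6, 10**7, 10**8, 10**9, 5 * 10**9]


def approx_vertices(num_edges):
    """Approximate vertex count, assuming vertices are ~10% of edges."""
    return num_edges // 10


for num_edges in EDGE_COUNTS:
    print(f"{num_edges:>13,} edges  ~{approx_vertices(num_edges):>11,} vertices")
```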

Platform Description & Experimental Setup

We have ported the temporal triangles benchmark to three different graph platforms:

Neo4j Community Edition v4.4.4.
TigerGraph free Enterprise Edition v3.5.0.
Trovares xGT v1.10.

All these graph platforms support ingesting large graph datasets and performing searches for tightly coupled triangular interactions as described previously.

We have evaluated the three versions of the benchmark (Neo4j, TigerGraph and xGT) on three different hardware platforms available in the Amazon Web Services (AWS) cloud. We have chosen three differently sized AWS instances corresponding to common server hardware available to customers:

Small: AWS m5.4xlarge, single-socket 64GB RAM, Intel Xeon system with 8 cores and two hyperthreads on each core (16 vCPUs in Amazon’s terminology).
Medium: AWS m5.24xlarge, multi-socket 384GB RAM, Intel Xeon system with a total of 48 cores and two hyperthreads on each core (96 vCPUs in Amazon’s terminology).
Large: AWS u-6tb.112xlarge, multi-socket 6TB RAM, Intel Xeon Skylake system with a total of 224 cores and two hyperthreads on each core (448 vCPUs in Amazon’s terminology).

Note that not all the platforms support all the dataset sizes. The exceptionally large 1B and 5B datasets could only be run on medium and large experimental platforms. We used a time threshold of 42 for all experimental runs.

Ingest Time: AWS m5.24xlarge

Here are the ingest time results from our benchmarking against Neo4j on the Temporal Triangles benchmark, using the setup described above. We found Trovares xGT ingest times to be significantly faster than those of Neo4j.

Query Time: AWS m5.24xlarge

Here are the query time results from our benchmarking against Neo4j on the Temporal Triangles query, using the setup described above. We found Trovares xGT query times to be 200-1000x faster than those of Neo4j.

Raw Numbers

Below are the results from the trial on the medium m5.24xlarge instance. While xGT scales relatively close to a 1:1 ratio, Neo4j's query time "hits a wall" at some point, causing the system to stall or run out of memory.

[Image: table of raw results from the m5.24xlarge trial]

Conclusion:
Why Does This Matter?

Sure, these numbers show that Trovares xGT is faster than Neo4j, but why does that matter to you? There are many use cases where ingest speed and query time are vitally important, especially where the data analysis has a time limit (cyber, shipping and hauling, etc.) and a new set of data arrives right after. Speed can even matter when working with a very large amount of static data, where Neo4j will simply “hit a wall”.

Below are some (but not all) fields where Trovares xGT has multiple use cases:

Biotech
- There are many large-scale problems that need solving in the biotech field, among them genomics and genome representation, neural networks, and pharmaceutical research.
Shipping and Handling
- With so many moving parts in a shipping pipeline, some companies are juggling massive amounts of data. With so much to keep track of, such as retailers, customers, planes, boats, trucks, their handlers, crates, and even loading machinery, many enterprises are dropping the ball along the way, leading to full crates in the port and long lines of truckers waiting hours for shipments. We offer a dynamic graph solution to manage the data as well as run analytics to find the weak points of the pipeline.
Fraud detection
- Large telecom companies with e-mail services deal with bad actors trying to use their services to commit fraud. These bad actors are smart and have crafty ways to get around most systems. The only way to solve a problem like this is to look at the entire data set (massive in most cases) and run queries to discover fraudulent activity. This is a case where sharding the graph could be detrimental to the result, because most of the loopholes bad actors use rely on the data being separated.
Digital Forensics
- In this case, a bad actor has penetrated your network and has been resident there for longer than you realize. Cyber intruders are known to wait months, or longer, to carry out their plans. Financial institutions face the risk of immense loss from cyber-attacks, so they must carry a large reserve of cash (negative for the bottom line) to hedge against such risk. Proof that an enterprise has invested in digital forensic searches to minimize the risk that its networks have been penetrated results in savings on the reserve (positive for the bottom line).
Deep Graph Search
- Graph search can be a simple “lookup” or it can be a deeper search. Lookups are measured in transactions per second. These are automated applications like checking a credit card number. Intuitive analytical queries are measured by their response time on a human scale: seconds, minutes, and hours. Analysts need a response before they forget the question. Then, they often wish to ask a follow-up question. A search engine such as Trovares xGT, which can provide answers in seconds when querying a 300-billion edge graph, is a necessary tool for the serious data scientist.
Advertising
- Advertising has changed before our eyes in the last decade. Market research and available statistics previously allowed corporations to discover a population's preferences. Now, with the recent explosion of individuals' data, we can see the preferences of specific people. This problem could be small for a mom-and-pop retailer, but for a larger corporation it becomes massive very quickly. Not only is it important to retain all the information about specific people, but fast analysis is necessary to deliver the right advertisements to the right people.