In-Depth Interpretation of the Latest VLDB 2021 Paper: Artificial Intelligence Driven Real-Time Decision System Database and Optimization Based on Persistent Memory
This acticle is published by MemArk Tech Community (https://memark.io/en/). All rights reserved.
One of the most reputable database academic conference VLDB 2021 ( https://vldb.org/2021/ ) is held this week in Denmark, the beautiful hometown of the little mermaid. One of the emerging hot spots of this conference is the optimization and application based on persistent memory. For the first time in history, the conference sets up two special research sessions to introduce a total of 8 academic papers related to persistent memory (conference programs: https://vldb.org/2021/?papers -research). These papers focus on the persistent basic data structures, database indexing algorithms, artificial intelligence systems and so on. What is the magic of persistent memory that makes the database top-level conference VLDB earn so many related papers this year? Today, I’ll take you to analyze in depth one of the papers contributed by the leading AI platform provider 4Paradigm, titled Optimizing In-Memory Database Engine for AI Powered On-Line Decision Augmentation Using Persistent Memory [pdf], which introduces the database engine in AI driven real-time decision-making systems, and how to optimize on persistent memory.
NOTE: The database named FEDB in the paper has been renamed as OpenMLDB, which is open sourced at https://github/4paradigm/OpenMLDB . In this blog, we will use the name OpenMLDB thoroughly.
Real-Time Decision System Driven by Artificial Intelligence
With the vigorous progress of artificial intelligence, more and more online real-time decision-making systems use AI technology to help in the decision-making. Typical applications include real-time credit card anti-fraud, personalized recommendation and so on. A typical AI driven real-time decision system consists of two subsystems: offline training and online prediction. As shown in Figure 1, we put massive historical data into the offline training system on the left. This system will help us extract features from historical data and train ultra-high-dimensional AI models. Then we deploy the trained model into the online prediction system. The online prediction system needs to extract the user’s historical behavior information according to the user’s real-time behavior (such as card swiping transaction), and import the AI model for real-time scoring, to predict. The whole online prediction system has high requirements for real-time performance, and the general latency requirement is a few milliseconds. This paper mainly focuses on the performance optimization of the database storage engine part of the online prediction system (OpenMLDB in Figure 1), which is also the performance bottleneck in the whole online prediction link.
Feature and Feature Extraction
As shown in Figure 2, we take the credit card anti-fraud application as an example. When the card swiping behavior occurs, the POS opportunity generates a record. From this record alone, it is impossible to accurately judge whether this card swiping is a normal card swiping or a stolen card swiping. In the artificial intelligence system, in addition to using this new card swiping record, we also need to extract the hidden information according to the new record for model training. All this information together is our feature, and the process of feature extraction is feature engineering. This can be seen in Figure 2, through the card ID, we can query the feature database to understand the information of this card and the associated account information. Through the calculation based on the time window, you can further understand the three stores with the largest total credit card consumption in the last 10 seconds, 1 minute, 5 minutes and 10 minutes of this card. These features need to be extracted in real time from the user’s recent historical data. In general, the more features are extracted, the more accurate the model prediction is.
Feature extraction is an expensive operation. Firstly, feature extraction involves multiple database query operations. Taking anti-fraud as an example, a user’s card swiping behavior will trigger thousands of database query requests. At the same time, many of these real-time features, such as calculating the amount difference between the most recent card swiping record and the latest card swiping record, can only be extracted after new user card swiping data is generated and transmitted to the back-end system, and data pre-calculation cannot be carried out. At the same time, most real-time features involve a large number of time window queries in different dimensions. The workload characteristics of these feature extraction are very different from traditional OLTP, OLAP or HTAP workloads. We select two most advanced mainstream commercial in-memory databases (DB-X and DB-Y) and use typical real-time feature extraction workload for performance test. As shown in Figure 3, with the increase of the number of time windows, the query latency of DB-X and DB-Y increases significantly. When the number of windows is greater than 4, the query performance of both databases has exceeded 100ms, which cannot meet our performance requirements at all. Moreover, in the real business scenario, the number of time windows is far greater than 4. Based on the above results, we need to design a new database engine for artificial intelligence feature extraction.
Optimized Feature Engineering Database for Artificial Intelligence Workload
In order to efficiently extract real-time features, we designed the machine learning database OpenMLDB (https://github.com/4paradigm/OpenMLDB). As shown in Figure 4, OpenMLDB includes: feature extraction execution engine (FEQL) and storage engine.
FEQL compiles and optimizes query statements using llvm, and optimizes multi-window queries, which greatly increases the execution efficiency of statement parsing of feature extraction. In addition to supporting SQL like syntax, FEQL also defines special syntax for feature extraction operations. As shown in Figure 5, FEQL allows multiple time windows to be defined in the same SQL statement. In the process of syntax parsing and optimization, FEQL can reuse the results extracted in different windows to the greatest extent. For example, when we need to extract the total amount of user consumption in each period within 10 seconds, 1 minute, 5 minutes, 30 minutes, 1 hour and 5 hours, we can define six time-windows in one FEQL only. When the executor runs FEQL, it can take all the data of the largest window (5 hours) and process them separately, instead of repeatedly extracting the data of 5 windows. Thus, the efficiency of multi-window TopN operation in feature extraction operation is greatly improved.
In the storage engine as shown in Figure 6, OpenMLDB adopts a two-layer skiplist data structure. The primary key of the feature is stored in the first layer skiplist, and each time-window corresponding to the primary key is stored in the second layer. When we are querying data within multiple time windows under a primary key, we only need to locate the corresponding primary key in the first layer skiplist, and then continue to query the time window in the second layer.
As shown in Figure 7, we compare OpenMLDB with other commercial databases. D-OpenMLDB in the figure represents the DRAM version of OpenMLDB. We vary the number of time windows and feature primary keys, and then run typical feature extraction queries to compare performance. As shown in the figure, the performance of the DRAM version of OpenMLDB is much higher than that of the traditional commercial database, up to 84x.
Persistent Memory Based Optimization for OpenMLDB
The DRAM version of OpenMLDB can well meet the real-time requirements of feature extraction, but we still find the following disadvantages in the real-world applications.
1. Huge memory consumption and high hardware cost
To ensure the online service performance, the index and data of OpenMLDB are stored in DRAM. Taking the credit card anti-fraud scenario as an example, our test data stores 1 billion card swiping records in the last three months for feature extraction. This part of data has occupied more than 3 TB of memory space. When we consider the case of multiple replicas, the data scale exceeds 10 TB. When we need more historical data for feature extraction (such as one year or even three years), the hardware cost is very high.
2. Recovery time
As a real-time online decision-making system database, OpenMLDB needs to read massive data from disk into main memory and reconstruct the in-memory data structures when nodes are offline (such as system failure or routine maintenance). The whole process takes several hours, which will seriously affect the quality of online service.
3. Long-tail latency
To ensure data consistency, traditional in-memory databases will periodically write data from volatile DRAM to low-speed disk or SSD through logs or snapshots. When the system pressure is high, this synchronous operation can easily cause long query latency, which is called long-tail latency.
We introduce persistent memory to provide new ideas to solve the above pain points. Persistent memory has the characteristics of low cost, large capacity, and data persistence. The features of low cost and large capacity help to solve the problem of high hardware cost of OpenMLDB. Data persistence brings two benefits: 1) greatly shorten the recovery time after the node is offline and ensure the online service quality; 2) due to the in-memory data persistence, the database no longer needs to persist logs and snapshots through low-speed disk devices, so it can effectively solve the problem of long-tail latency.
Persistent memory has two operating modes: memory mode and app direct mode (or AD mode). In memory mode, persistent memory will be treated as a part of main memory and transparent to users. The advantage of this mode is that programs can use large memory capacity without modification, but they cannot use persistence and fine-grained optimization. In app direct mode, persistent memory will be perceived by the system in the form of independent block devices. Programmers use the PMDK library provided by Intel for programming. In this mode, applications can enjoy the large capacity and make use of the in-memory data persistence.
As shown in Figure 8, the leftmost version is the original DRAM based OpenMLDB. The process of querying data is completed in DRAM, but it will be backed by writing logs and snapshots on external storage (SSD / HDD). If memory mode is used (the middle figure in Figure 8), DRAM is directly replaced with large persistent memory, but it cannot benefit from the data persistence feature. Therefore, as shown in the right figure of Figure 8, we use app direct mode to reconstruct the storage engine at the lower layer of OpenMLDB, use PMDK to implement the double-layer skiplist structure based on persistent memory, and remove the traditional log and snapshot mechanism.
The main challenge of developing a persistent memory-based storage engine is to ensure the correctness and efficiency of data persistence. The underlying data structure of OpenMLDB, double layer skiplist, runs in a multi-threaded environment, and there are a large number of multi-threaded concurrent read and write operations. To achieve more efficient concurrency, compare and swap (CAS) is usually used in DRAM environment. CAS is a common technique to implement thread-safety through hardware. It uses the CAS instructions of CPU to lock the cache or bus to realize the atomic operation between multiprocessors. However, when we apply CAS to persistent memory, it may cause the data inconsistency.
As shown in Figure 9, a typical scenario is when a new credit card swiping record T1 is inserted into OpenMLDB through CAS on thread 1, while another thread, thread 2 extracts feature F1 based on the new record T1. Since the flush instructions of CAS and CPU are independent of each other, the traditional DRAM based CAS does not have persistence semantics. If the system is going to persist F1 and T1 (i.e. the flush instruction in Figure 9), and the power is lost between the two flush commands, the new credit card record T1 on the system will be lost after restart, but the feature F1 calculated based on T1 exists, This causes the inconsistency issue.
To solve the problem of inconsistency in persistent memory, we propose persistent compare and swap (PCAs) technique. First, we use the flush-on-read rule to solve the correctness problem. Once reading in-memory data, we proceed with flush instruction to ensure that the read data is persistent. However, too many persistent flush operations will bring additional performance loss, especially unnecessary continuous flush of “clean” data that has not been modified. To efficiently identify whether the data is “clean”, we use a special pointer called smart pointer. Under x86, the memory address supporting 8-byte CAS atomic operation in 64-bit CPU must ensure 8-byte alignment. Since the basic unit of memory access is one byte, 8-byte alignment means that the lower 3 bits of CAS address are always 0. As shown in Figure 10, the smart pointer uses the lower 3 bits of the memory address that is always 0 when using CAS to mark whether the data is dirty. With the smart pointer, we only need to mark the dirty bit on the pointer of the data that is not flushed every time we modify the data. Then, in subsequent reads, only when the dirty bit is marked, the flush instruction is executed for persistence, to avoid unnecessary persistence overhead.
As shown in Figure 11, we use PCAs technology to optimize the persistent memory based skiplist. To further reduce the amount of writing on the persistent memory, we only put the last layer and data on the persistent memory. Therefore, we also designed a new reconstruction process to ensure that after power failure, the system can reconstruct the whole skiplist according to the information of the last layer. We also examined the performance impact of persistence strategies at different levels in the experiment.
Persistent Memory Optimization Experiment and Conclusion
We use a real-world anti-fraud application data for evaluations. The data contains 1 billion card swiping records in three months, and about 10 TB memory space is required in the actual deployment. We compared the DRAM version of OpenMLDB with the OpenMLDB variants of various persistent memory versions. Figure 12 shows the configurations of various database systems under test. The experiment draws the following conclusions.
1. Comparison of long tail latency TP-9999 (Figure 13): OpenMLDB with the persistent data structure on persistent memory, abandons the persistence mechanism based on external storage devices of the original design, and achieves an improvement of long tail latency (TP-9999) of nearly 20%.
2. Comparison of data recovery time (Figure 14): In the case of service interruption, the recovery time is reduced by 99.7%, which comprehensively reduces the impact on the online service quality. In the test scenario, the data recovery time is shortened from the original six hours to one minute.
3. Hardware cost comparison (Figure 15): We compared the deployment of 10 TB credit card anti-fraud business in different clusters and showed the machine configuration used in the paper experiment. In the business scenario of 10 TB data, the hardware cost of OpenMLDB based on persistent memory is only 41.6% of that based on DRAM only.
This paper analyses the challenges and solutions faced by the current artificial intelligence driven real-time decision-making system, and focuses on the following:
1. The architecture and workload of AI driven real-time decision-making system are summarized. We find that the existing commercial in-memory database cannot meet the performance requirement for this kind of applications.
2. The machine learning database OpenMLDB for artificial intelligence online real-time decision-making system is designed, and the execution engine and storage engine are optimized to meet the performance requirements of real-time decision-making.
3. To solve the disadvantages of high memory consumption, slow recovery time and long-tail latency of OpenMLDB, we redesigned the storage engine of OpenMLDB based on persistent memory and proposed the techniques of persistent CAS (PCAs) and smart pointer.
4. Experiments show that compared with DRAM based OpenMLDB, persistent memory based OpenMLDB can reduce the long-tail latency by 19.7%, shorten the recovery time by 99.7%, and reduce the cost by 58.4%.
For More Information
- Full text pdf: http://vldb.org/pvldb/vol14/p799-chen.pdf
- Open-source projects related to the paper:
- Open source machine learning database OpenMLDB: https://github.com/4paradigm/OpenMLDB
- Persistent skiplist data structure, pskiplist: https://github.com/4paradigm/pskiplist
- The PMem based storage engine integrated with pskiplist, PmemStore: https://github.com/4paradigm/pmemstore
- The paper project is led by the MemArk tech community founded by 4Paradigm: https://memark.io/en
The community focuses on modern storage architecture for system enhancement
- More open-source projects from MemArk
- An optimized Kafka version with persistent memory optimization for message queuing, Pafka: https://github.com/4paradigm/pafka/