Data Stream Processing Architecture (DSMS)
A Data Stream Management System (DSMS) is a computer program that manages the continuous flow of data. A DSMS is similar to a database management system (DBMS), but instead of executing a query once, it executes it continuously and permanently for as long as it is installed. Since most DSMSs are data-driven, a continuous query will continue to produce new results as long as data is ingested into the system.
The major challenge in a DSMS is to process potentially infinite data streams using a fixed amount of memory and without random access to the data. There are two classes of approaches for limiting the amount of data handled at one time. One is compression techniques, which attempt to summarize the data, and the other is windowing techniques, which attempt to divide the data into (finite) portions.
The idea behind compression techniques is to maintain only a synopsis of the data, rather than all the (raw) data points in the data stream. Algorithms range from sampling techniques that randomly select data points to summaries based on histograms, wavelets, or sketching. One simple example of a compression technique is the continuous calculation of an average. Because a summary cannot reflect the data exactly, processing that is based on summaries may produce approximate or inaccurate results.
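As an illustration of the synopsis idea, the following minimal sketch shows two constant-memory summaries: a running average and a uniform random sample (reservoir sampling). The names `RunningMean` and `reservoir_sample` are hypothetical and are not taken from any particular DSMS.

```python
import random


class RunningMean:
    """Constant-memory summary: stores only a count and the current mean."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental update; no past data points are kept.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k elements from a stream
    whose length is not known in advance."""
    if rng is None:
        rng = random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k elements.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


mean = RunningMean()
for price in [10.0, 20.0, 30.0]:
    mean.update(price)

sample = reservoir_sample(range(1000), 5, random.Random(0))
```

Both summaries use memory that does not grow with the stream, which is the point of the technique. The price is that individual data points can no longer be recovered from them.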
Instead of using synopses to compress the characteristics of the whole data stream, window techniques look at only a portion of the data. The approach is based on the idea that only the most recent data is relevant. A window therefore continuously cuts out a part of the data stream, for example the last 10 data elements, and only these elements are considered during processing. There are different kinds of windows, such as sliding windows, which behave like FIFO lists, and time-based windows, which consider, for example, only the last 10 seconds of data.
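The two kinds of windows mentioned above can be sketched as follows. This is an illustrative sketch only: the class names `CountWindow` and `TimeWindow` are hypothetical, and the time-based variant assumes that elements arrive with non-decreasing timestamps.

```python
from collections import deque


class CountWindow:
    """Element-based sliding window: keeps the last `size` elements (FIFO)."""

    def __init__(self, size):
        # A bounded deque drops the oldest element automatically.
        self.items = deque(maxlen=size)

    def push(self, value):
        self.items.append(value)
        return list(self.items)


class TimeWindow:
    """Time-based sliding window: keeps elements from the last `span` seconds.

    Assumes timestamps are non-decreasing.
    """

    def __init__(self, span):
        self.span = span
        self.items = deque()  # (timestamp, value) pairs, oldest first

    def push(self, timestamp, value):
        self.items.append((timestamp, value))
        # Evict everything that is `span` seconds old or older.
        while self.items[0][0] <= timestamp - self.span:
            self.items.popleft()
        return [v for _, v in self.items]


count_window = CountWindow(3)
for x in [1, 2, 3, 4, 5]:
    last_three = count_window.push(x)

time_window = TimeWindow(10)
time_window.push(0, "a")
time_window.push(5, "b")
recent = time_window.push(12, "c")
```

In both cases the memory used is bounded by the window definition rather than by the length of the stream.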
Many prototypes exist, and there is no standardized architecture. However, most DSMSs follow query processing in a DBMS: queries are expressed declaratively and then translated into a plan of operators. These plans can be optimized and executed.
Queries are usually expressed in a declarative language, like SQL in a DBMS. Since there is no standard for expressing continuous queries yet, many languages and variants exist. Most of them, such as Continuous Query Language (CQL), StreamSQL, and EPL, are based on SQL. There are also graphical approaches that represent processing steps as boxes and connect the boxes with arrows to represent the flow.
The language strongly depends on the processing model. For example, if windows are used for processing, the window definition must be part of the query expression. In StreamSQL, a query with a sliding window over the last 10 elements looks as follows:
SELECT AVG(price) FROM examplestream [SIZE 10 ADVANCE 1 TUPLES] WHERE value > 100.0
The declarative query is translated into a logical query plan. A query plan is a directed graph in which the nodes are operators and the edges describe the processing flow. Each operator in the query plan encapsulates the semantics of a specific operation, such as filtering or aggregation. In DSMSs that process relational data streams, the operators are equal or similar to the operators of relational algebra, so there are operators for selection, projection, join, and set operations. This operator concept makes the processing of a DSMS flexible and versatile.
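To make the graph structure concrete, here is a minimal sketch of a logical plan for the StreamSQL example query. The `LogicalOp` class and the `describe` helper are hypothetical. The operator order, with the selection applied before the window, is an assumption: whether the filter runs before or after the window depends on the semantics of the language.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalOp:
    """A node of a logical plan: describes what to do, not how to do it."""
    kind: str
    params: dict = field(default_factory=dict)
    inputs: list = field(default_factory=list)  # edges to upstream operators


def describe(op, depth=0):
    """Return an indented listing of the plan, from the sink side down."""
    lines = ["  " * depth + f"{op.kind} {op.params}"]
    for upstream in op.inputs:
        lines.extend(describe(upstream, depth + 1))
    return lines


# Assumed order: source -> selection -> window -> aggregation.
source = LogicalOp("source", {"stream": "examplestream"})
selection = LogicalOp("selection", {"predicate": "value > 100.0"}, [source])
window = LogicalOp("window", {"size": 10, "advance": 1, "unit": "tuples"}, [selection])
plan = LogicalOp("aggregation", {"function": "AVG", "attribute": "price"}, [window])

plan_lines = describe(plan)
print("\n".join(plan_lines))
```

The nodes carry only parameters and no algorithms, which matches the role of a logical operator.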
Since a logical operator is responsible only for the semantics of an operation and does not contain any algorithm, the logical query plan must be transformed into an executable counterpart. This is called a physical query plan. The distinction between logical and physical operators allows more than one implementation for the same logical operator. A join, for example, is logically the same whether it is implemented as a nested-loop join or as a sort-merge join. These algorithms also depend strongly on the stream and processing model used. At the end of this step, the query is available as a physical query plan.
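The following sketch illustrates how one logical operator can have several physical implementations. It joins two finite inputs (for example, the current contents of two windows) with a nested-loop join and with a sort-merge join. The function names and the `PHYSICAL_JOINS` lookup table are hypothetical and only illustrate the idea.

```python
def nested_loop_join(left, right, key):
    """Compare every left element with every right element."""
    return [(l, r) for l in left for r in right if key(l) == key(r)]


def sort_merge_join(left, right, key):
    """Sort both inputs by key, then merge them in a single pass."""
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Find the run of right elements that share this key.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == kl:
                j_end += 1
            # Pair every left element with that key with the whole run.
            while i < len(left) and key(left[i]) == kl:
                out.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return out


# One logical operator ("join"), two interchangeable physical operators.
PHYSICAL_JOINS = {"nested_loop": nested_loop_join, "sort_merge": sort_merge_join}

left = [(1, "a"), (2, "b"), (2, "c")]
right = [(2, "x"), (3, "y"), (2, "z")]
by_first = lambda t: t[0]

results = {name: join(left, right, by_first) for name, join in PHYSICAL_JOINS.items()}
```

Both implementations return the same set of pairs, possibly in a different order, so a system is free to choose between them on cost grounds.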
Because a physical query plan consists of executable algorithms, it can be executed directly. For this purpose, the physical query plan is installed in the system. The bottom of the graph is connected to the incoming sources, which can be anything from connectors to sensors. The top of the graph is connected to the outgoing sinks, such as a data visualization. Since most DSMSs are data-driven, a query is executed by pushing incoming data elements from the sources through the query plan to the sinks. Each time a data element passes through an operator, the operator performs its specific operation on the element and forwards the result to all subsequent operators.
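The data-driven execution described above can be sketched as a small push-based pipeline. All class names here are hypothetical, and the plan is one possible reading of the StreamSQL example query that assumes the filter is applied before the window.

```python
from collections import deque


class Operator:
    """Base physical operator: forwards results to all subsequent operators."""

    def __init__(self):
        self.subscribers = []

    def connect(self, downstream):
        self.subscribers.append(downstream)
        return downstream  # allows chaining

    def emit(self, element):
        for op in self.subscribers:
            op.process(element)

    def process(self, element):
        # Default behaviour: pass the element through unchanged.
        self.emit(element)


class Filter(Operator):
    def __init__(self, predicate):
        super().__init__()
        self.predicate = predicate

    def process(self, element):
        if self.predicate(element):
            self.emit(element)


class SlidingAverage(Operator):
    """Average of one attribute over the last `size` elements."""

    def __init__(self, attribute, size):
        super().__init__()
        self.attribute = attribute
        self.window = deque(maxlen=size)

    def process(self, element):
        self.window.append(element[self.attribute])
        self.emit(sum(self.window) / len(self.window))


class ListSink(Operator):
    """Sink that collects results, standing in for a visualization."""

    def __init__(self):
        super().__init__()
        self.results = []

    def process(self, element):
        self.results.append(element)


# Install the plan: source -> filter -> sliding average -> sink.
source = Operator()
sink = ListSink()
source.connect(Filter(lambda e: e["value"] > 100.0)) \
      .connect(SlidingAverage("price", 10)) \
      .connect(sink)

# Each incoming element is pushed through the plan as it arrives.
for element in [
    {"price": 10.0, "value": 150.0},
    {"price": 99.0, "value": 50.0},   # dropped by the filter
    {"price": 20.0, "value": 200.0},
]:
    source.process(element)
```

No operator polls for input in this sketch. Work happens only when a new element is pushed in at the source, which is why the query keeps producing results for as long as data arrives.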
DSMSs were originally researched at Stanford University’s InfoLab, and commercial systems such as Coral8 (since 2005), StreamBase (since 2003), and Esper (since 2006) are available.