Data mining to find patterns in data delivered in sequence

Machine Learning Natural Language Processing Artificial Intelligence Digital Transformation General Machine Learning Algorithm Recommendation Technology Navigation of this blog

Sequential Pattern Mining

Sequential pattern mining is a special case of structural data mining in which data mining finds statistically related patterns among data examples whose values are delivered in sequences.

Tasks performed in these include “efficient database and index construction of sequence information,” “extraction of frequently occurring patterns,” “comparison of sequence similarities,” and “completion of missing sequence data.

Specific examples of applications include analysis of gene and protein sequence information (e.g., nucleotide base A, G, C, and T sequences) and expression analysis of their functions, as well as pattern extraction of purchase items that occur in large-scale transactions in stock trading and e-commerce (e.g., if a customer purchases onions and potatoes ), and also for process mining of workflows.

Here I will describe a typical algorithm, apriori (proposed by Agrawal and Srikant in 1994). (Proposed by Agrawal and Srikant in 1994.) First, let’s talk about the “Association Rule”. For example, {milk, bread} ⇒ {eggs} (people who buy milk and bread at the same time buy eggs more frequently).

When there are {green tea, tuna onigiri} ⇒ {tara ko onigiri} and {green tea, tuna onigiri} ⇒ {kombu onigiri}, it is possible to make a set of {green tea, tuna, cod roe, kombu onigiri} and lower the price a little more than buying them separately. In addition, if {instant ramen} is replaced by {chashu}, the locations of these items can be moved closer to each other, and if {milk} is replaced by {bread}, the amount of bread stocked can be increased when milk is on special sale.

The ratio of transactions that satisfy condition X to the total number of transactions is defined as the support level (support(X)). As an example, if condition X={a,b}, and the total number of transactions is T1={a,b,c}, T2={a,d}, T3=[b,d,e}, T4={a,b,e}, T5={a,b,c}, T6={d,e}, then the total number of transactions is 6, and the number of transactions that satisfy the condition is 3. The total number of transactions is 6, and the number of transactions that satisfy the condition is 3, so support(X)=3/6=0.5.

Next, confidence(X,Y) is the ratio of the number of transactions that satisfy both conditions X and Y to the number of transactions that satisfy condition X. This is expressed as support(X) = support(X∪Y)/support(X). This is expressed as confidence(X,Y)=support(X∪Y)/support(X), using support(X) from earlier.

Here, the Apriori algorithm uses these to extract all correlation rules X ⇒ Y that satisfy the following conditions from all transaction data when the minimum support minsup and the minimum confidence minconf are given as input parameters. (Condition: support(X∪Y)≥minsup, confidence(X,Y)≥minconf)

If we do everything greedily in implementing this algorithm, it will be computationally very expensive and inefficient. For this reason, the apriori algorithm uses the features of support and confidence to search efficiently (the search is divided into two stages: the search for frequent items and the search for correlation rules).

Furthermore, several algorithms have been proposed to improve the Apriori algorithm, including GPS (Generalized Sequential Patterns), SPADE, FreeSpan, PrefixSpan, MAPress, Seq2Pat, SPIRIT, and CloSpan. The following algorithms have been proposed.

Tools that can use this algorithm include “SPMF“, a module of Java, “arules“, a library using R, “apriori“, a module of Clojure, and “Efficient-Apriori“, a module of python.

I will describe how to use spmf from among these.

First, make sure that you have Java 1.8 or higher as the operating environment.
Next, download the executable files spmf.jar and test_files.zip from the download site.
Double-click the downloaded spmf.jar. (For mac) For wondows, right-click on spmf.jar and select “open with…” and “Java Platform”. and “Java Platform”. to run it.

When spmf starts up, the screen will look like the following.

spmf画面

Here is an example with PrefixSpan.

From the “Choose an algorithm” box, select “PrefixSpan”.
Select a test file in the “Choose input file” box. Select “contextPrefixSpan.txt” from the downloaded and unzipped test_fiile.zip. (If you are using Java 11 on a mac, you may not be able to access the file properly. In that case, store the test_file in the top-level user folder.)
Specify the output file in “Choose output file” (create an empty result.txt file in advance).
Set the parameters in “Choose minsup (%)”.
Press “Run algorithm” to run the program.

spmf動作

The result will be output to the result.txt file that you just created.

1 -1 #SUP: 4
1 2 -1 #SUP: 2
1 2 -1 3 -1 #SUP: 2
1 2 -1 4 -1 #SUP: 2
1 2 -1 4 -1 3 -1 #SUP: 2
1 2 -1 6 -1 #SUP: 2
1 -1 1 -1 #SUP: 2
1 -1 2 -1 #SUP: 4
1 -1 2 3 -1 #SUP: 2
1 -1 2 3 -1 1 -1 #SUP: 2
1 -1 2 -1 1 -1 #SUP: 2
1 -1 2 -1 3 -1 #SUP: 2
.......

The algorithm itself is very simple and does not take much time to calculate. The key to using it is in the normalization of the data.