Machine Learning Professional Series – Relational Data Learning Post-Reading Notes

Summary

Relational data represents what kind of “relationship” holds for any pair among N objects. If we represent this data as a matrix, the relationships are the matrix elements themselves, and relational data learning can be described as learning to extract patterns from this matrix data.

There are two types of tasks to which relational data learning is applied: “prediction” and “knowledge extraction”.

Prediction is the problem of estimating the values of unobserved data using a statistical model learned and designed from observed data. Knowledge extraction is the task of extracting information that leads to useful knowledge or rules, by analyzing the characteristics of the relational data itself or by appropriately modeling the given observed data.

This section describes various algorithms for relational data learning and their applications, based on “Machine Learning Professional Series: Relational Data Learning”.

This post contains the post-reading notes.

Machine Learning Professional Series – Relational Data Learning Post-Reading Notes

Chapter 1 Introduction: What is Relational Data Analysis?

1.1 Statistical Machine Learning
1.1.1 Learning from Data
“Label” information is important
Using the observed data {X, y} = {(xi, yi), i = 1, 2, …, N}
Estimate the function (mapping) that computes the label yi from the unknown data x
1.1.2 Supervised Learning and Unsupervised Learning
For example, image data is assigned a “label”, and learning is performed so that the label can be reproduced as faithfully as possible.
Teacher information (data, labels)
Supervised learning
Unsupervised learning
1.2 What is relational data?
This book mainly deals with unsupervised learning.
Example Community analysis
Hyperlinks, spread of infectious diseases, etc.
Relationships not only between objects of the same type, but also between objects of different types
Example: purchase history data
“Product” and “Customer”
“Document”, “Keyword”, “Reference document”, “Author”
Technology is still developing for the case where a very large number of “relations” between objects can be defined.
References [30, 48]
Where to get datasets of relational data
Stanford Network Analysis Project
1.3 Representation of Relational Data
1.3.1 Graphical Representation of Relational Data
Graph theory
1.3.2 Matrix (multi-dimensional array) representation of relational data
Representation by adjacency matrices
Treatment in this book
Representation by connectivity matrix
Matrix representation of relational data
Enabling a linear algebra approach
1.3.3 Representing the value of a relation
A binary (discrete) value is given for the presence or absence of a relation.
Relation above or below a threshold value
Each relation can also be valued by a discrete symbol
Symbols are indexed.
For a graph representation, use triples e = (v1, v2, s) (the two vertices and the value of the relation)
For matrices, use notation such as xij=s (symbolic value), xij=0 (integer)
It is important to distinguish between observed data and unobserved data
Observable (observed)
Unobservable (missing)
1.4 Types of related data
1.4.1 Directed relational data and undirected relational data
Direction of the relationship
Directed relational data
xij ≠ xji
Undirected relational data (undirected)
1.4.2 Single and Multiple Domains
For the domain of an object in relational data
Single domain
All objects are of the same type
Same domain for rows and columns
Square matrix
Rows and columns have the same size
Multiple domains
Domains of different types
Vertex pairs (i,j) and (j,i) have different meanings
When the data is a bipartite graph
Objects in a domain do not have a defined relation
Only relations between domains are defined.
1.4.3 Symmetric and Asymmetric Relational Data
Symmetric relational data
Relational data whose matrix representation is always a symmetric matrix
Single domain and undirected relational data
Asymmetric relational data
Multi-domain or directed relational data
1.4.4 Binomial and Multinomial Relationships
Binary relation (two-place relation, dyadic relation)
Represents a relationship between two objects.
Binomial relation: two-dimensional matrix
Multinomial relation (multi-ary relation)
Ternary relation: a three-dimensional matrix
N-ary relation:N-dimensional matrix
Practical examples of multinomial relations
time series data
Tensor as a multidimensional extension
1.5 Relational data analysis
Task categories for relational data analysis
1.5.1 Prediction
Prediction
Estimate unobserved data using statistical models learned and designed from observed data.
A typical prediction problem
Link prediction problem
Estimation of the existence of links in a relational network
Estimation of item purchase (adoption) probability for each user using purchase data
Prediction of missing values in relational data matrix
Estimation of information diffusion in a network
Example of item recommendation
The problem of recommending a movie that the user has never watched before
Missing value prediction
1.5.2 Knowledge Extraction
Knowledge extraction
Knowledge mining, knowledge discovery
Analyze the properties of the relational data itself by computing the graph features
Extract information that leads to some useful knowledge or rules by appropriately modeling the given observation data.
Examples
Community extraction in a network
Clustering of networks
Mostly unsupervised learning
If the goal is to reveal the nature of the data involved through graph features
No need for supervised data or training, since we only calculate the already defined graph features
We don’t know whether the clusters really exist, so supervised (training) data cannot be created
Reflects the user’s hypothesis (intention) that “if clusters exist, they should follow this property”.
Examples of Knowledge Extraction
Relational data clustering
Operation to group together (only) data that are close to each other, without using a feature space
An example of clustering that can be easily distinguished by the human eye
Example of relational data clustering
1.5.3 Approach of this book: Capturing low-dimensional structures
We hypothesize that real relational graphs are not completely random, but have some kind of low-dimensional structure embedded in them.
Separate the parts that match the model (structure) from the parts that do not match (noise)
Typical method
Block structure of relational data matrix
Divide the relational data into several blocks
Divide the row and column indices of the relational data into roughly K < N groups
Reorder rows and columns so that objects in the same group are adjacent to each other
Low rank of relational data matrix
Represent NxN relational data matrix as a linear sum of K<N basis vectors
Representation of N relational patterns by K basis patterns
1.6 Objectives and structure of this book
Basics of Matrices
Basic Concepts
What is a matrix?
Matrix Products
The product (multiplication) is not always computable.
A is an axb matrix and B is a cxd matrix (an axb matrix has a rows vertically and b columns horizontally)
The product AB is computable only when b=c
The product will be an axd matrix
Zero matrix and unit matrix
A matrix that is zero no matter what it is multiplied by
AO = OA = O
A matrix that does not change whatever it multiplies (identity matrix)
AI = IA = A
Defined only for square matrices
Inverse matrix
There is no division in matrices
The inverse matrix A-1 of A is the matrix R such that AR = RA = I
Cannot be defined unless the original matrix is a square matrix
Orthogonal matrix
Transpose matrix is equal to inverse matrix
Transpose matrix
Obtained by flipping the matrix across its main diagonal (swapping rows and columns)
AtA = AAt = I
An orthogonal matrix considered in the range of complex numbers is called a unitary matrix
Diagonal matrix
A matrix in which the numbers are arranged only in the diagonal components.
Example
Vector
A matrix in which either the number of rows or the number of columns is 1
Example
Operations
Diagonalization
Diagonalizing a matrix A means
Find an orthogonal matrix P and a diagonal matrix D such that A = PDPt
D = PtAP
The result of sandwiching the orthogonal matrix is a diagonal matrix
Why is this useful?
By diagonalizing, we can transform A into a simple vector combination
The orthogonal matrix represents rotation and inversion in space
Work in the rotated coordinate system, then rotate back afterwards
Diagonal matrices involve only each direction, and the directions are not correlated with each other
Also applies to matrix similarity
Diagonalization, eigenvalue decomposition, singular value decomposition
Eigenvalue decomposition
Singular value decomposition
An extension of eigenvalue decomposition to general matrices
Diagonalization and the Jordan normal form
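The notes above on diagonalization, eigenvalue decomposition, and singular value decomposition can be checked numerically. Below is a minimal sketch with NumPy (not from the book; the matrices A and B are arbitrary examples):

```python
import numpy as np

# A small symmetric matrix (arbitrary example, not from the book)
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# Eigenvalue decomposition of a symmetric matrix: A = V diag(lam) V^T
lam, V = np.linalg.eigh(A)          # V is orthogonal, lam are real eigenvalues
D = np.diag(lam)
print(np.allclose(A, V @ D @ V.T))  # True: diagonalization A = PDP^T with P = V

# Singular value decomposition works for general (non-square) matrices
B = np.random.rand(4, 3)
U, s, Vt = np.linalg.svd(B, full_matrices=False)
print(np.allclose(B, U @ np.diag(s) @ Vt))  # True
```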

Chapter 2: Clustering Techniques for Symmetric Relational Data: Spectral Clustering

2.1 What is clustering of relational data?
Clustering as a knowledge discovery task
Given a large number of objects, the task is to collect objects that have similar characteristics in some sense and classify them into several groups.
Given a data set X = {x1, …, xN}, classify the data into K clusters
Method
K-means
Commonly used when the data set is a set of ordinary d-dimensional continuous feature vectors
Mean shift clustering
Latent Dirichlet allocation
When the data set is a set of discrete symbols such as document data
Target of clustering in relational data analysis
Objects (vertices, the main part of the relation) or edges (edges, links)
Clustering of objects in this book
Three definitions of similarity
Similarity of the features of objects i and j themselves, independent of the relation
It is assumed that each object (vertex) has similarity other than relation information.
Analyze using K-means, etc.
Similarity based on what kind of relationship each object i,j has with other objects
Each object is anonymized
There is no information (attributes, etc.) to characterize each object.
The only information available is the relationship
Use similarity of vertex sets connected by edges
Similarity that takes into account both of the above
2.2 Object Clustering for Symmetric Data: Spectral Clustering
Different methods are available depending on the nature of the cluster to be found and the characteristics of the relational data
2.2.1 Community detection and tightly coupled graphs
Focus on community detection
Community
A cluster of vertices that are tightly coupled within the cluster, but very loosely coupled with objects outside the cluster
Assortative cluster
Loosely coupled cluster (disassortative cluster)
Can you have serendipity in a loosely coupled cluster?
2.2.2 Graph cuts
What features can we measure to determine if a cluster is tightly coupled or not, or to discover tightly coupled clusters?
One of the features
Graph cut
Graph cut function
Defined for the partitioning of graph data vertices by relational data matrices or affinity matrices
Example
N objects are partitioned exclusively into K clusters
The index set of vertices belonging to the k-th cluster is Pk
The index set of vertices not belonging to the k-th cluster is P̄k
For a partition P = {P1, P2, …, PK}, the graph cut function is defined as
the sum of the weights of edges between vertices belonging to different clusters
Minimize the cut (mincut)
Extracts tightly connected clusters by splitting the vertex set at its weakest links
However, mincut tends to split off isolated vertices or very small clusters from the rest of the network
Normalize the cut so that each cluster has a reasonable size
2.2.3 Spectral clustering
The normalized minimum cut problem is NP-hard
Difficult to search in a realistic time
Solve the discrete normalized cut minimization problem by continuous relaxation
Spectral clustering
Represents a graph as an adjacency matrix
A relational data matrix with a value of 1 if there is a link between two nodes in the graph and 0 if there is no link.
Focuses on the eigenvalues of a matrix called graph Laplacian derived from the adjacency matrix
The second smallest eigenvalue and subsequent eigenvalues can be used to estimate the connectivity between nodes in the graph.
Eigenvalue decomposition and k-means clustering can be used for node clustering
Spectral clustering can be applied only to symmetric relational data matrices
Example
2.3 Spectral Clustering with Unnormalized Graph Laplacian
Describes spectral clustering based on the simplest algorithm
2.3.1 Input and Output
Inputs
Symmetric relational data of N objects X = (xi,j)
Number of nodes (number of objects) N ∈ ℕ
Since the data is symmetric, the ranges of indices i and j are identical
For simplicity, we assume xi,j ∈ {0, 1}
Assumed number of clusters K ∈ ℕ
Need to give the number of potential clusters assumed in advance for analysis
Output
Clustering results for each object
Main product
Result of classifying N objects into K tightly coupled clusters (communities)
By-product
Feature vector vi for each object i embedded in a K-dimensional continuous space
This is the “location” of the object.
Input/output example
2.3.2 Degree Matrix and Graph Laplacian
Description of the specific spectral clustering calculation process
Calculate the degree matrix from the input relational data matrix X
The degree matrix D=(di,j) is a diagonal matrix
All non-diagonal elements are zero.
The i-th diagonal element di,i is the degree of object i in X (i.e., the number of links, or the sum of the weights of the links).
It is necessary to distinguish between the in-degree, which is the total number of links received by the object, and the out-degree, which is the total number of links extended by the object.
In spectral clustering, the in-degree and out-degree are the same because only symmetric relation data is used.
The degree matrix D is an NxN diagonal matrix
Example of degree and Laplacian calculation
Next, we calculate a matrix called the graph Laplacian.
There are several types of graph Laplacians
A simple graph Laplacian L
L = D – X
unnormalized graph Laplacian
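As a concrete illustration of the calculation above, here is a minimal sketch (the small symmetric 0/1 adjacency matrix X is an arbitrary example):

```python
import numpy as np

# Example symmetric adjacency (relational data) matrix X (0/1, no self-loops)
X = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

degrees = X.sum(axis=1)   # degree of each object (row sums)
D = np.diag(degrees)      # degree matrix: diagonal, off-diagonal elements are zero
L = D - X                 # unnormalized graph Laplacian L = D - X
print(L)
```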
2.3.3 Cluster Extraction by Eigenvalue Decomposition
Rows (columns) with similar values appear in the result of graph Laplacian
If we can compute “similar” patterns from the graph Laplacian, we can cluster the objects.
Use “eigenvalue decomposition” to compute patterns
Review of eigenvalue decomposition
Eigenvalue decomposition is one of the most basic forms of linear algebra
Suppose we are given a symmetric matrix M ∈ ℝNxN
For K < N
Find a column orthogonal matrix V ∈ ℝNxK and a diagonal matrix Λ ∈ ℝKxK such that M =VΛVT
VT is a transpose of V
Let each diagonal element of the diagonal matrix Λ be an eigenvalue of the matrix M
Let each column vector of V be an eigenvector of matrix M
It holds that λkvk = Mvk
Each eigenvector is linearly independent of each other
Each column vector of matrix M is represented by a linear combination of eigenvectors
λ1 ≤ λ2 ≤ … ≤ λK
In spectral clustering, we perform an eigenvalue decomposition of the graph Laplacian L
L = VΛVT
or λkvk = Lvk
For positive semi-definite matrices such as the graph Laplacian, all eigenvalues are non-negative real numbers, so
λ1 ≤ λ2 ≤ … ≤ λK holds
The eigenvector matrix corresponding to these eigenvalues is V = (v1, v2, …, vK) ∈ ℝNxK
Each i-th row is a K-dimensional compressed representation of the i-th object
K is the number of classes we are assuming.
Why is this good?
Intuitive interpretation
For a graph with K clusters
The smallest K eigenvalues will be zero
The remaining eigenvalues will be strictly positive
The K eigenvectors with eigenvalue zero act as indicator vectors of the clusters
2.3.4 Calculation of Cluster Assignment Z
The above discussion is based on the assumption that each cluster is fully connected and that there are no edges between two clusters.
This is not the case in many real-world examples.
In that case, the smallest eigenvalues are not exactly zero
The closer they are to zero, the closer the structure is to ideal clusters
To obtain concrete cluster assignments, we apply k-means
Regard the rows of V as N K-dimensional vectors
Perform k-means with k=K in K-dimensional space
2.3.5 Summary: Algorithm
Can be computed using only eigenvalue decomposition and k-means
The only input parameter is K
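The whole algorithm of this section can be sketched as follows (a minimal illustration with NumPy and scikit-learn’s KMeans; variable and function names are mine, not the book’s):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering_unnormalized(X, K, seed=0):
    """Spectral clustering with the unnormalized graph Laplacian L = D - X.

    X : (N, N) symmetric adjacency / relational data matrix
    K : assumed number of clusters
    Returns cluster labels for the N objects and the embedding V.
    """
    D = np.diag(X.sum(axis=1))
    L = D - X
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    V = eigvecs[:, :K]                     # K eigenvectors with smallest eigenvalues
    # Each row of V is a K-dimensional representation of one object
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(V)
    return labels, V

# Example usage with the small adjacency matrix X shown earlier:
# labels, V = spectral_clustering_unnormalized(X, K=2)
```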
2.4 Spectral Clustering with Normalized Graph Laplacian

2.4.1 Relationship between Normalized Cut and Spectral Clustering Algorithm
Ideally, we want a small number of tightly coupled clusters whose sizes are neither too large nor too small.
Put constraints on the size of each cluster
The most famous of the many normalization methods
Ratio Cut
Normalizes the cut for each cluster k by dividing it by the size of the cluster (number of vertices in the cluster)
The larger the cluster, the smaller the cut
Clusters consisting of a single vertex are eliminated
Normalized Cut (NCut)
Normalized by dividing by the sum of the weights of the links in the cluster
The larger the clusters and the more closely they are connected to each other, the smaller the cut
This is more suitable for typical community detection.
2.4.2 Spectral Clustering Based on Symmetric Normalized Graph Laplacian
Rewrite the normalized cut as a quadratic form involving the graph Laplacian
Solve for the solution that minimizes that quadratic form with constraints
It becomes a new spectral clustering algorithm
Lsym = D^(-1/2) L D^(-1/2)
Symmetric normalized graph Laplacian algorithm
2.4.3 Spectral Clustering Based on the Random-Walk Normalized Graph Laplacian
Spectral clustering based on a different normalized graph Laplacian, proposed by Shi and Malik.
Proposed Graph Laplacian
Lrw = D^(-1) L
Random-walk normalized graph Laplacian
Why random-walk?
The normalized cut minimization problem leads to the problem of finding mutually independent communities that minimize the transition probability between clusters when the graph is partitioned into multiple clusters.
Algorithm
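A minimal sketch of the two normalized graph Laplacians discussed in this section (assuming every vertex has degree at least 1 so that D is invertible):

```python
import numpy as np

def normalized_laplacians(X):
    """Return (L, Lsym, Lrw) for a symmetric adjacency matrix X."""
    d = X.sum(axis=1)
    D = np.diag(d)
    L = D - X
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    D_inv = np.diag(1.0 / d)
    Lsym = D_inv_sqrt @ L @ D_inv_sqrt   # symmetric normalized: D^(-1/2) L D^(-1/2)
    Lrw = D_inv @ L                      # random-walk normalized: D^(-1) L
    return L, Lsym, Lrw
```

The rest of the procedure is the same as before: take the eigenvectors of the chosen Laplacian for the K smallest eigenvalues and run k-means on the rows.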
2.5 Example of Application to Real Data
Application to Zachary’s Karate Club network data
Each vertex is a member of a karate club at a university
Results of applying three types of spectral clustering
One of the measures of clustering accuracy
Adjusted Rand Index (ARI)
ARI is 0.5725 in the case of Lrw and L
ARI is 0.3291 in the case of Lsym
Lsym shows lower performance
2.6 Operational Considerations and References
2.6.1 Which algorithm to choose
As a first choice, use the random-walk normalized graph Laplacian (Lrw).
2.6.2 How to set K
Heuristic Methods
1. Set a threshold on the absolute value of the eigenvalues
2. Set a threshold on the difference between neighboring eigenvalues
3. Set a threshold on the ratio of neighboring eigenvalues
Determine K using model selection criteria
Akaike information criterion (AIC)
Bayes information criterion (BIC)
2.6.3 References
Tutorial on Spectral Clustering by von Luxburg
Reference [84]
An implementation is available in the Python scikit-learn module.
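For reference, a minimal usage sketch of scikit-learn’s SpectralClustering together with the adjusted Rand index from section 2.5 (the toy graph and “true” labels here are placeholders, not the karate club data):

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score

# Toy example: two tightly connected groups with one weak link between them
A = np.zeros((6, 6))
A[:3, :3] = 1
A[3:, 3:] = 1
np.fill_diagonal(A, 0)
A[2, 3] = A[3, 2] = 1

model = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = model.fit_predict(A)

true_labels = np.array([0, 0, 0, 1, 1, 1])   # placeholder ground truth
print(adjusted_rand_score(true_labels, labels))
```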
2.6.4 Limitations of Spectral Clustering and the State of the Art of Tightly Coupled Clustering
Can only be applied to undirected, single domain relational data (symmetric relational data)
Not applicable to
Directed relational data
Multi-domain relational data
memo
Introduction to Spectral Clustering
What is Spectral Clustering?
Spectral clustering is characterized by the fact that it generates a graph from the data and applies the connected component decomposition of the graph to clustering.
What makes it different from other learning methods
Connected component decomposition of graphs using matrix eigenvalue problems
Before explaining the algorithm of spectral clustering, we will explain how to perform connected component decomposition of graphs using matrix eigenvalue problem as a basis.
Assumptions
A positively weighted undirected graph G = (V = {v1, …, vn}, E)
Let the adjacency matrix be W = (wij)
Assume wij = wji, wii = 0, wij ≥ 0
For the adjacency matrix W=(wij), define its degree matrix as above
Define the unnormalized graph Laplacian of graph G as above
The higher the value of wij, the closer the vertices i, j should be.
Example
For graph G1, where vertex 1 and vertex 2 are connected by an edge with weight 100, and vertex 3 and vertex 4 are connected by an edge with weight 20:
The adjacency matrix is W = [[0, 100, 0, 0], [100, 0, 0, 0], [0, 0, 0, 20], [0, 0, 20, 0]]
The degree matrix is D = diag(100, 100, 20, 20), and the unnormalized graph Laplacian is L = D − W
In this case, the following theorem holds
Suppose graph G = (V = {v1, …, vn}, E) has K connected components, and
V = {v1, …, vc1} ∪ {vc1+1, …, vc2} ∪ … ∪ {vcK-1+1, …, vn} is the connected component decomposition
In this case, all eigenvalues of the unnormalized graph Laplacian L of G are non-negative real numbers
The eigenspace corresponding to eigenvalue 0 has the indicator vectors of the K connected components as its basis
Example
Let x be an eigenvector belonging to eigenvalue 0 of L
Then Lx = 0x, i.e., the above equation holds
From this, we know that the eigenspace belonging to eigenvalue 0 has (1,1,0,0)T, (0,0,1,1)T as its basis.
To obtain the connected component decomposition of the graph from this
First, arrange the basis vectors side by side to form a matrix
If we look at each of the row vectors that make up this matrix
The first and second rows are both (1, 0) and the same
The third and fourth rows are both (0, 1), which is the same.
This corresponds to the fact that the connected component decomposition of the graph is given by {v1, v2}, {v3, v4}.
In this way, the connected components of a graph can be obtained by solving the eigenvalue problem of a matrix.
Algorithm for decomposing connected components of a graph
Compute the adjacency matrix representation W and the degree matrix D from the graph, and compute the unnormalized graph Laplacian L.
Calculate the basis u1, …, uK of the eigenspace belonging to eigenvalue 0 of the unnormalized graph Laplacian L.
Arrange the basis vectors to form the matrix U = (u1 u2 … uK).
If the row vectors of the i-th and j-th row of the matrix U are equal, then the vertices vi and vj belong to the same connected component, otherwise they belong to different connected components.
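A minimal sketch of this connected-component decomposition procedure (assuming an exactly block-structured W, so the zero eigenvalues are exact up to numerical tolerance; the tolerance and rounding are my own choices):

```python
import numpy as np

def connected_components_via_laplacian(W, tol=1e-8):
    """Group vertices by the eigenvectors of L = D - W belonging to eigenvalue 0."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, eigvals < tol]          # basis of the eigenspace of eigenvalue 0
    # Vertices with (numerically) equal rows of U belong to the same component
    labels, components = [], []
    for row in np.round(U, 6):
        key = tuple(row)
        if key not in components:
            components.append(key)
        labels.append(components.index(key))
    return np.array(labels)

# Example: weight-100 edge between v1, v2 and weight-20 edge between v3, v4
W = np.array([[0, 100, 0, 0],
              [100, 0, 0, 0],
              [0, 0, 0, 20],
              [0, 0, 20, 0]], dtype=float)
print(connected_components_via_laplacian(W))   # e.g. [0 0 1 1]
```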
Spectral Clustering Algorithm and Calculation Example
The algorithm for spectral clustering is based on the connected component decomposition of graphs described above.
Specific algorithm
Given the number of clusters K
Construct a graph from the data
Compute the adjacency matrix representation W and the degree matrix D from the graph, and compute the unnormalized graph Laplacian L.
Compute the K eigenvectors u1, …, uK of the unnormalized graph Laplacian L corresponding to the K smallest eigenvalues.
Arrange the eigenvectors to make a matrix U=(u1 u2 … uK).
Cluster the row vectors of matrix U into K vectors using K-means.
The cluster that contains the i-th row as a result of K-means is the cluster that contains the vertex vi.
Differences with connected component decomposition
A step to construct a graph from the data is necessary.
Because rows i and j are no longer exactly equal or unequal as in connected component decomposition, K-means is needed to group them.
Example
Connected graph as shown above
The unnormalized graph Laplacian of this graph is
Its eigenvalues are
The corresponding eigenvectors are
If we take K=2 eigenvectors, starting with the one with the smallest eigenvalue, and put them together
If we cluster these four row vectors with K-means
The clusters of (1,1), (1, 0.99) in the first and second rows
Clusters of (1, -0.97) and (1, -1.0) in the third and fourth rows
Corresponding to clustering into v1, v2 and v3, v4
If we calculate with K=3, we can take K=3 eigenvectors from the one with the smallest eigenvalue
Clustering the four row vectors into three clusters with K-Means
Clusters of (1,1,1), (1,0.99,0.59)
Cluster of (1, -0.97, -0.64)
Cluster of (1, -1.0, 0.62)
Corresponding to clustering into {v1, v2}, {v3} and {v4}
Figure
Related topics
Spectral clustering first transforms the data into a graph
How to convert to a graph
ε-nearest neighbor method
k-nearest neighbor method
Fully connected (all-pairs) method
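A minimal sketch of converting feature-vector data into a graph (affinity matrix) with the k-nearest neighbor method, using scikit-learn (the data and parameter values are arbitrary examples):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Data: N points in a d-dimensional feature space (random example)
Z = np.random.default_rng(0).random((20, 2))

# k-nearest-neighbor graph; symmetrize so the affinity matrix is undirected
A = kneighbors_graph(Z, n_neighbors=3, mode="connectivity").toarray()
W = np.maximum(A, A.T)      # wij = 1 if i is among j's neighbors or vice versa
np.fill_diagonal(W, 0)
# (for the ε-nearest neighbor method, sklearn.neighbors.radius_neighbors_graph
#  can be used instead)
```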

Chapter 3 Clustering Techniques for Asymmetric Relational Data: Probabilistic Block Models and Infinite Relational Models

3.1 Probabilistic “block structure” clustering of asymmetric relational data
Analyzing the Problems of Spectral Clustering
3.1.1 Dissatisfaction with Spectral Clustering
Problems
It is applicable only to symmetric relational data matrices
Many of the actual data are not
The only clusters extracted are tightly coupled clusters (communities)
Not all clusters are tightly coupled clusters
It is difficult to set the number of clusters K to be extracted
3.1.2 Approach: Probabilistic model assuming block structure
Goal: To perform relational data clustering that can be applied to asymmetric relational data and that can extract clusters other than tightly coupled clusters
Probabilistic model approach that assumes the existence of a “potential block structure”
What is a “potential block structure”?
The set of row index i and the set of column index j of the relational data matrix can be divided into a small number of clusters.
The observed values of the relational data are defined by these clusterings
Contains both tightly and loosely coupled clusters
Tightly coupled clusters have strongly related (black) blocks lined up on the diagonal
Loosely coupled clusters, on the contrary, have weak diagonal elements and strongly related blocks on the non-diagonal elements
Probabilistic model approach
Assume the existence of a potential block structure
Depending on the characteristics of the block structure, the observed relational data will change.
Based on some hypothesis, create a model that answers the question “what kind of data will be observed if a particular block structure is present?”
This model behaves stochastically.
What is stochastic behavior?
Even if the parameters of the model are completely fixed, the observed data (and the variables in the model) will contain randomness.
Since there is a discrepancy between the true value and the model
The randomness of a stochastic model absorbs the effects of unexpected external factors and deviations.
Objective function
Graph cut function to be minimized
Step
Determine the stochastic block structure model to be assumed.
Select the objective function for the stochastic model
Algorithm for minimization or maximization
In the case of stochastic models in general, minimize the expected loss of the discriminant rule by the stochastic model as the objective function
3.1.3 Subjects of this chapter: Stochastic block models and infinite relation models
Stochastic block model (SBM)
The most basic method of relational data clustering based on block structure.
Depending on the combination of row and column block numbers, the distribution of relational data values within the block changes.
Infinite relational model (IRM)
Extensions of SBM
Uses the Bayesian nonparametric framework
Can theoretically contain an arbitrary (countably infinite) number of clusters without specifying the number of clusters K
A simple practical example of the infinite relational model
3.2 Probabilistic Generative Models
3.2.1 What is a probabilistic generative model?
A model that expresses, in probability, the process by which the values of each variable constituting a system are generated in order according to the dependencies.
The determination (generation) of the value of each variable in the model is realized by sampling from a specified probability distribution.
Rather than considering the existence of a single “true value,” emphasis is placed on probabilistic distributions that are “likely to be scattered in a range like this.
Independently and identically distributed (i.e., i.i.d.)
Example: Rolling the dice many times
xi | π ~ p(π), i = 1, 2, …, 100
Parametric probability distribution
Normal distribution
Discrete distribution
3.2.2 Bayesian estimation of probability models
An object of interest when using probabilistic generative models in statistical machine learning.
True parameters and shape of probability distribution are unknown
Estimate the shape of the probability distribution and parameter values from a large amount of observed data
Inference
Inference results are also represented by probability distributions
Posterior distribution
A set of hidden variables Z and a set of parameters {θ} after receiving the observed data X
Posterior distribution p(Z, {θ} | X) ∝ p(X | Z, {θ}) p(Z, {θ})
p(X | Z, {θ})
Likelihood
In a generative model, this is the part of the equation that generates the observables.
If what we want in the end is only one of the hidden variables
Posterior distribution that incorporates all the uncertainty of the distribution based on the parameter {θ} and is marginalized
p(Z | X) ∝ ∫ p(X | Z, {θ}) p(Z, {θ}) dθ
Marginalization sums (integrates) over all values a variable can take, weighted by their probabilities
This eliminates the variable from the function, leaving a distribution over the remaining variables
Difficult to find analytically
Use the Markov chain Monte Carlo method
3.3 Stochastic blockmodel (SBM)
3.3.1 Overview of SBMs
SBM is a stochastic generative model that constructs a data generation process based on the assumption that the relational data matrix has a potential block structure.
Defines the strength of the relationship “between clusters” in each domain
Example
Observed relational data matrix
N1 rows and N2 columns
Abstract the N1xN2 observed relation values into the strengths of relations between K row clusters and L column clusters
Assume that there are K and L potential clusters in each row and column.
The entire relational data matrix is represented by the strength of relationship parameter between a small number of clusters.
3.3.2 Formulation of SBM
Assumptions
Assume that the observed relational data is a binary relation of {1,0}.
X = {xi,j}, xi,j ∈ {1, 0}
Let i = 1, …, N1 be the index of the first domain (rows of the relational data matrix)
Let j = 1, …, N2 be the index of the second domain (columns of the relational data matrix)
Let k = 1, …, K be the index of the clusters in the first domain
Let l = 1, …, L be the index of the clusters in the second domain
Probabilistic Generative Model of SBM
π1 | α1 ~ Dirichlet(α1)
Mixing ratio parameter of clusters of potential objects in the first domain
α1: K-dimensional vector
Hyperparameter of SBM
Dirichlet
Dirichlet distribution
Takes a K-dimensional vector α = (α1, α2, …, αK) as its parameter
Intuitively, a probability distribution that randomly generates a K-sided die (a probability vector)
Specific probability distribution
𝛤 (*) is the Gamma function
π2 | α2 ~ Dirichlet(α2)
Mixing ratio parameter for clusters of potential objects in the second domain.
α2: L-dimensional vector
Hyperparameter of SBM
Z1,i = k | π1 ~ Discrete(π1)
Sample the cluster assignment of objects in each domain according to the k-sided dice (mixing ratio of clusters in each domain)
Discrete denotes a discrete distribution
Take a K-dimensional “dice” vector as a parameter and generate one of the values from 1 to K.
The probability of choosing a value is proportional to the magnitude of the value in each dimension of the vector.
Specific probability distribution
𝛱(·) here denotes the indicator function, which returns 1 when the condition in parentheses is satisfied and 0 otherwise
Z2,j = l | π2 ~ Discrete(π2)
Sample the cluster assignment of objects in the second domain according to the L-sided dice (the mixing ratio of clusters in that domain)
Same as above
Θk,l | a0, b0 ~ Beta(a0, b0)
The process of generating the parameter Θ that defines the strength of the relationship between clusters
Since we are assuming binary relational data, we assume a beta distribution for ease of later calculation.
Beta distribution is a probability distribution for real numbers θ between 0 and 1.
Probability distribution
xi, j | {θk,l}, z1,i, z2,j ~ Bernoulli(θz1,i, z2,j)
Specify the probability distribution that generates each entry of the observed relational data matrix X={0,1}N1xN2 using the hidden variable z that represents the cluster assignment and the parameter θ that represents the strength of the relationship between clusters.
Bernoulli distribution is adopted as the probability distribution of observed values.
Simply put, Bernoulli distribution is a probability model of coin tossing.
It generates the value x=1 (front) with probability θ and x=0 (back) with probability 1-θ.
Probability distribution
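The generative process of the SBM described above can be written down directly as sampling code. Below is a minimal sketch with NumPy (the sizes, cluster counts, and hyperparameter values are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)
N1, N2, K, L = 30, 40, 3, 2          # object counts and cluster counts (example values)
alpha1, alpha2 = np.ones(K), np.ones(L)
a0, b0 = 0.5, 0.5

pi1 = rng.dirichlet(alpha1)          # pi1 | alpha1 ~ Dirichlet(alpha1)
pi2 = rng.dirichlet(alpha2)          # pi2 | alpha2 ~ Dirichlet(alpha2)
z1 = rng.choice(K, size=N1, p=pi1)   # z1,i | pi1 ~ Discrete(pi1)
z2 = rng.choice(L, size=N2, p=pi2)   # z2,j | pi2 ~ Discrete(pi2)
theta = rng.beta(a0, b0, size=(K, L))        # theta_{k,l} ~ Beta(a0, b0)
X = rng.binomial(1, theta[np.ix_(z1, z2)])   # x_{i,j} ~ Bernoulli(theta_{z1_i, z2_j})
```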

3.3.3 Inference for SBM
Specifically derive the inference method using the Collapsed Gibbs Sampler.
(A) Collapsed Gibbs sampler (CGS)
A type of MCMC method
The stochastic model is
The probability model consists of a set of hidden variables Z = {zi}, i = 1, …, N, a set of parameters {θ}, and observed data X
The goal of Bayesian estimation is to compute the posterior distribution of the hidden variables and parameters
All we need is the posterior distribution of the hidden variables
Keep parameters marginalized and focus only on Z to perform inference
Perform inference using CGS
CGS Procedure
Iteratively sample the values of all the hidden variables
Sampling one variable at a time, instead of sampling the posterior distribution of all hidden variables at once
Example
In the s-th round of sampling zi, compute its posterior distribution assuming all other variables are known, and generate a sample
Perform this sampling for all zi, repeating for s = 1, 2, …, S
CGS is feasible
The above integrals can be computed analytically.
Prerequisites
The probability distributions of the parameters and hidden variables are in a “conjugate” relationship.
(B) Derivation of inference equation for SBM by CGS
Derivation of the CGS formula for SBM
What we want to find in SBM is Z1, Z2
Use CGS to infer the cluster-assignment hidden variables Z = {Z1, Z2} of the SBM
Posterior distribution sampling of CGS
Temporarily cancel cluster assignment for the i-th hidden variable
Compute the posterior distribution taking into account the cluster assignment of the other hidden variables, the mixture parameter π of the clusters, and the strength parameter of the relationship between the clusters
Set the cluster assignment again (sampling)
(C) Definition of sufficient statistics and specific update formulas
The sampling operation of CGS reduces to updating sufficient statistics associated with the assignment z1,i
Programmatically, we just add or subtract sufficient statistics to sample z1,i
Equation for updating sufficient statistics needed to update the hidden variables in the first domain of an SBM
For each term, consider the ranges k = 1, 2, …, K and l = 1, 2, …, L
Equation (3.14) is the number of objects in the first domain belonging to cluster k
Equations (3.15) and (3.16) represent the number of relational data for which x=1(0) in the block (k,l) of the relational data matrix defined by the kth first domain cluster and the lth second domain cluster.
The above quantity is used for the actual sampling of z1,i.
Sample z1,i using the leave-one-out sufficient statistics above (with object i’s contribution removed)
Update the original sufficient statistic according to the result
Actual calculation
Conjugacy
If the prior distribution is a combination of parameters with Dirichlet distribution and a likelihood term defined by a discrete distribution
The posterior distribution of the parameter will always be Dirichlet distribution.
Identity for the Dirichlet distribution
Using this identity, the integral can be computed analytically
Similarly, the second equation uses the conjugacy of the beta and Bernoulli distributions and the identity for the beta distribution
The posterior distribution of z1,i = k, k ∈ {1, 2, …, K}
Assign z1,i by rolling the “K-sided dice” defined by the above equation
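As an illustration of the collapsed Gibbs update described above, here is a minimal sketch of resampling z1,i for one object in the first domain. It is my own simplified implementation of the Dirichlet-Discrete / Beta-Bernoulli conjugate update (not the book’s code), with a symmetric hyperparameter alpha1:

```python
import numpy as np
from scipy.special import betaln

def resample_z1_i(i, X, z1, z2, K, L, alpha1=1.0, a0=0.5, b0=0.5, rng=None):
    """Collapsed Gibbs update of the cluster assignment z1[i] in an SBM.

    X  : (N1, N2) binary relational data matrix
    z1 : (N1,) current row-cluster assignments, z2 : (N2,) column-cluster assignments
    """
    rng = rng or np.random.default_rng()
    z1[i] = -1                                   # remove object i (leave-one-out)
    # counts of 1s and 0s that row i contributes to each column cluster l
    ones_i = np.array([X[i, z2 == l].sum() for l in range(L)])
    tot_i = np.array([(z2 == l).sum() for l in range(L)])
    zeros_i = tot_i - ones_i

    logp = np.zeros(K)
    for k in range(K):
        rows_k = (z1 == k)
        n_k = rows_k.sum()                       # objects currently in row cluster k
        logp[k] = np.log(n_k + alpha1)           # Dirichlet-Discrete part
        for l in range(L):
            block = X[np.ix_(rows_k, z2 == l)]
            n1 = block.sum()                     # 1s in block (k, l), without row i
            n0 = block.size - n1                 # 0s in block (k, l), without row i
            # Beta-Bernoulli predictive likelihood of row i's entries in column cluster l
            logp[k] += betaln(a0 + n1 + ones_i[l], b0 + n0 + zeros_i[l]) \
                       - betaln(a0 + n1, b0 + n0)
    p = np.exp(logp - logp.max())
    z1[i] = rng.choice(K, p=p / p.sum())         # roll the "K-sided dice"
    return z1
```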
3.4 Infinite relational model (IRM)
3.4.1 Overview of IRM
The most important feature of IRM is that it can automatically determine the number of potential clusters K and L.
The number of clusters is also automatically estimated as some kind of unknown parameter.
IRM is a generative model that assumes the existence of an infinite number of clusters.
3.4.2 Formulation of IRM
Generative model of IRM
Z1 | α1 ~ CRP(α1)
Z2 | α2 ~ CRP(α2)
Θk,l | a0, b0 ~ Beta(a0, b0)
xi,j | {θk,l}, z1,i, z2,j ~ Bernoulli(θz1,i, z2,j)
Difference from SBM
SBM uses a discrete distribution with π as a parameter to generate cluster assignments z1,i,z2,j
IRM generates Z1,Z2 directly from a distribution called CRP
(A) Chinese restaurant process (CRP)
CRP (Chinese restaurant process)
Stochastic process in the Bayesian nonparametric framework
Difference between a stochastic process and an ordinary probability distribution
An ordinary probability distribution assigns a plausibility to each value
A stochastic process assigns a plausibility to an infinite number of values (i.e., to a distribution or function)
Detailed explanation [69].
CRP is a stochastic process that gives a probability for partitioning of an infinite set of elements.
In the case applied to IRM, the N1 objects in the first domain and the N2 objects in the second domain are assigned to the appropriate clusters (to produce the result of the partitioning).
The number of clusters changes probabilistically with each sampling
Explanation of CRP (Chinese Restaurant Metaphor)
Assumption of a Chinese restaurant
There is no upper limit to the number of customers that can be seated at each table
There is no upper limit to the number of tables that can be placed in a Chinese restaurant
Which table to place the customers who enter the restaurant (how to cluster and divide the given data elements (customers) into tables)
Each table is served a specific dish, and the content of the dish distinguishes the tables.
Each table represents a different partition class and is distinguished by index k
Each guest i represents an element of data, an object, a feature vector, etc.
The i-th guest is seated at the k-th table, represented by the cluster assignment variable zi = k
The food served at each table represents the parameter θk of cluster k
Suppose N customers have entered the restaurant and are seated at a total of K tables
Table selection probability of the n+1th customer to enter the restaurant in CRP
Generate a new table (k=K+1th cluster) with probability proportional to the concentration hyperparameter α
Adaptively change the number of expected clusters according to the number of data points
New customers are more likely to be attracted to a table where many customers are already seated
Clustering of data elements by CRP is exchangeable with respect to index
No matter what order customers enter the restaurant, the probability of being assigned to the same table does not change.
The probability does not depend on the order of index i of each element, but only on the structure of the final cluster assignment.
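A minimal sketch of sampling a partition from the CRP using the table-selection probabilities described above (the concentration parameter alpha is an example value):

```python
import numpy as np

def sample_crp(N, alpha=1.0, rng=None):
    """Sample cluster (table) assignments for N customers from a CRP."""
    rng = rng or np.random.default_rng()
    assignments = []                 # table index of each customer
    counts = []                      # number of customers at each existing table
    for n in range(N):
        # probability proportional to table occupancy, or alpha for a new table
        probs = np.array(counts + [alpha], dtype=float)
        table = rng.choice(len(probs), p=probs / probs.sum())
        if table == len(counts):     # a new table (cluster) is created
            counts.append(0)
        counts[table] += 1
        assignments.append(table)
    return np.array(assignments)

print(sample_crp(10, alpha=1.0))     # e.g. [0 0 1 0 2 ...]; the number of tables varies
```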
(B) Marginalization of π
Where did π in SBM go?
CRP arises from the combination of an infinite-dimensional Dirichlet distribution and the discrete distribution.
Relation to infinite-dimensional Dirichlet distribution
Generation process of π = (π1, π2, …, πk, …)
If we follow the stick breaking process, we can achieve the correct Dirichlet process (sampling from an infinite dimensional Dirichlet distribution).
Generative process of Z when using the stick breaking process
CRP is obtained by marginalizing out π in the above generative model.
3.4.3 Inference for IRM
The only difference between SBM and IRM is the part about the cluster mixture ratio parameter π
In the CGS of the SBM, π is likewise marginalized out before sampling.
Most of the derivation of the inference algorithm formulae and the final update rule can be borrowed from the SBM results.
(A) Derivation of the inference formula for IRM by CGS
Apply the defining equation of CGS to the sampling of the cluster assignment z1,i of the i-th object in the first domain.
Fit the variables in the defining equation from the generative model of the IRM
For the hidden variables, Z\(i) = {Z1\(i), Z2} as in the SBM
The parameter set {θ} consists only of the inter-cluster relationship strengths {θk,l}, k = 1, 2, …, l = 1, 2, …
since the cluster mixing ratio parameters π1 and π2 are marginalized out by the CRP
Equation
In IRM, the number of clusters varies depending on CRP
Let K and L be the number of different clusters in Z1\(i) and Z2, respectively.
Final inference formula
One less integral than SBM
(B) Definition of sufficient statistics and specific update formula
Define the sufficient statistics required to update the hidden variables in the first domain of IRM
The formula is the same as for the SBM
When sampling Z1,i, first have Z1,i “stand up” from its table (i.e., remove its contribution and treat it as unassigned)
The use of “leave-one-out” sufficient statistics is the same as in SBM
Sample Z1,i using the leave-one-out sufficient statistics (and “seat” Z1,i at a table according to the result)
Posterior distribution
Determine the new seat for Z1,i by rolling a K+1-sided dice
Update the sufficient statistics according to the sampling result
3.4.4 Output Method
Procedures for outputting specific inference results of IRM
The purpose is to estimate the posterior distribution.
Specifically, the clustering results need to be post-processed.
The output value of CGS is a sampling realization.
If we run CGS S times, we get S sampling results for all N1+N2 objects.
For z1,i we have (z1,i(1), z1,i(2), …, z1,i(S))
For z2,j we have (z2,j(1), z2,j(2), …, z2,j(S))
Discard the first B samples (burn-in) out of the S samples
Keep only every M-th sample after that (thinning)
Procedure to ensure independence between samples
Using the T = (S−B)/M remaining samples, count how many times object i was assigned to each cluster k
By normalizing these counts, we can calculate the posterior distribution of the object’s cluster-assignment hidden variable z
Formula for probability distribution
To keep counting consistently even if a cluster disappears, give each cluster a unique serial number
From the posterior probabilities, find a representative cluster assignment C stating which cluster object (1, i) belongs to
Use the maximum a posteriori (MAP) solution
From the posterior distribution of the affiliation probabilities (weights) for multiple clusters, select the cluster with the largest affiliation probability (the one with the highest weight)
Formula
There is also a method to calculate some evaluation value f(Z1,Z2) during the Gibbs sampler cycle and use the sampling result of the cycle with the highest evaluation value.
The simplest method
Use the result of the Sth round (the last sampling round) as is
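A minimal sketch of the post-processing described above (burn-in, thinning, counting assignments, and taking the MAP cluster for each object); the sample array here is a random stand-in for the CGS output, not real inference results:

```python
import numpy as np

def map_assignments(samples, B, M):
    """samples: (S, N) array, samples[s, i] = cluster of object i at CGS sweep s."""
    kept = samples[B::M]                          # discard burn-in, then thin
    n_clusters = kept.max() + 1                   # clusters identified by serial number
    N = kept.shape[1]
    counts = np.zeros((N, n_clusters))
    for sweep in kept:                            # count how many times each object
        counts[np.arange(N), sweep] += 1          # was assigned to each cluster
    posterior = counts / counts.sum(axis=1, keepdims=True)
    return posterior.argmax(axis=1), posterior    # MAP cluster and posterior p(z_i | X)

# Example with fake sampling output: S = 50 sweeps, N = 4 objects
fake = np.random.default_rng(0).integers(0, 3, size=(50, 4))
c_map, post = map_assignments(fake, B=10, M=2)
```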
3.5 Summary of IRM
Features of IRM
Block structure modeling inherited from SBM
Decompose the first domain (rows) and the second domain (columns) into multiple clusters (Z1, Z2)
Generate the relational data matrix probabilistically according to the coupling parameters (θk,l) between clusters
The Chinese Restaurant Process (CRP) can be used to model arbitrary cluster structures.
In the above explanation, binary relational data (0,1) is used, but it can also be used for continuous and discrete values by appropriately changing the probability distribution of Θk,l and xi,j.
Reference [17]
Inference Method
In this book, Gibbs sampler is applied.
Relatively easy to derive and implement the inference method
Can be applied to asymmetric relational data, which spectral clustering cannot be applied to
Can extract both tightly and loosely coupled clusters
Input and output of IRM
Number of nodes (objects) in the first domain N1∈ℕ, number of nodes (objects) in the second domain N2∈ℕ
N1xN2 relational data matrix X = (xi,j). There is no restriction on the values of the relational data; they can be binary, continuous, integer, or discrete
Experience shows that setting α1=α2=0.1 when N is large, and α1=α2=1.0 when N is small often works well.
Hyperparameters for the inter-cluster connection probability parameters (a0, b0 in this chapter): for the beta distribution, a0 = b0 = 0.5 works well, for example
Constants M, B, S for the Gibbs sampler (there is no typical value for these constants, and they are determined by the size of N and the assumed K, as well as the available computing resources and time)
Output of IRM
Main product: Posterior distribution of hidden variables assigned to clusters
p(Z = {Z1, Z2} | X); z1,i and z2,j represent the posterior probabilities of the cluster assignments of objects in each domain
Byproduct: Representative value of cluster assignment C={C1,C2}
c1,i,c2,j represent the representative cluster assignment of the object in each domain
By-product: Estimated total number of clusters K1, K2 in each domain
The total number of clusters is the number of distinct cluster indices appearing in Z or C
Algorithm
3.6 Example of Application to Real Data
Application to Zachary’s Karate Club network data
Target data
Spectral clustering found 2 clusters, whereas here 4 clusters are found.
Use ARI to evaluate agreement with real-world split results
0.6077 for row clustering
0.6404 for column clustering
Better than the value for spectral clustering
Applied to Enron’s internal email data set
Asymmetric data
CGS results (C) show that the number of clusters is stable and converges to (K,L)=(6,3)
Since some employees have position information, the cluster results can be verified.
3.7 Operational Considerations and References
3.7.1 Which algorithm should be used?
SBM is preferable for practical use.
Easy to implement
Easy to apply if you don’t need to worry about the number of clusters or cluster number management
If the number of clusters is set too large in SBM, only the number of clusters that is appropriate for the complexity of the data will remain, and the other clusters will become small.
3.7.2 Limitations and Extensions of IRM
The generative model of SBM and IRM assumes that a row (column) object always belongs to a class.
Inappropriate example
It is not assumed that the probability of creating a link between friends depends on the pattern of “where they met”.
Response to the above example
Mixed membership stochastic block model
A relational model that uses an extension of the mixture model called the topic model.
A single object can choose the clusters to which it belongs depending on the people to which it is connected.
This model requires a finite number of clusters to be fixed in advance.
Some extended methods handle an unknown number of clusters automatically
Reference [54]
Changes in companionship networks over time
Modeling methods for time-series relational data
References [15, 25, 37, 86]
Methods for computing very large networks
Speedup by Parallel Computing
Reference [17]
Network models with emphasis on computational efficiency and scalability
References [20, 80, 92]
This book works under the implicit assumption that every object belongs to exactly one cluster (exhaustive and non-overlapping clustering)
Methods for dealing with data that does not satisfy this assumption:
Subset IRM (partial IRM) method
Reference [27]
Extensions to IRM
Have a mechanism to estimate whether row and column objects follow a block structure or not
Objects that do not follow the block structure are assigned to the “other” cluster
Non-exhaustive and overlapping (NEO)
Extracts only those blocks in the relational data matrix that have a block structure and whose values differ from those of the surrounding blocks
Otherwise, classified into “other” clusters
References [13, 26, 44]
3.7.3 References
There is an implementation in C, but considering the need for variable-length arrays (the number of clusters grows and shrinks), it may be easier to implement it by adapting an ordinary Gaussian mixture model implementation.

Chapter 4 Matrix Decomposition

4.1 Preparation
Transposition
Denote the transpose of matrix A as AT
Vertical and horizontal vectors
Any vector a ∈ ℝ𝐼 is treated as vertically aligned, i.e., as a 𝐼x1 matrix
If we want to change the orientation (i.e., treat it as a 1x𝐼 matrix), we use the transpose symbol and denote it as aT
Index set
For a natural number N ∈ ℕ, denote the set of natural numbers from 1 to N as [N] = {1, 2, …, N}.
Unit matrix and zero matrix
The unit matrix of NxN is denoted by 𝐼N
An NxN matrix where all elements are given by zero is denoted as ON
Indexes
Given a 𝐼x𝐽 matrix A, let
Let the i-th (i ∈ [I]) row of A, written as a column vector, be ai: = (ai1, ai2, …, aiJ)T
Let the j-th (j ∈ [J]) column of A be a:j = (a1j, a2j, …, aIj)T
In summary, we have the above
Notation in linear algebra
Inner product and norm of vectors
The inner product of vectors x, y ∈ ℝI is defined as ⟨x, y⟩ = Σi xi yi.
The ℓ2 norm naturally defined from this inner product is denoted as ∥x∥ = √⟨x, x⟩.
Inner product and norm of matrices
We define the inner product of matrices X, Y ∈ ℝIxJ as ⟨X, Y⟩ = Σi Σj xij yij
The Frobenius norm naturally defined from this inner product is denoted as ∥X∥Fro = √⟨X,X⟩
Hadamard product
The Hadamard product A*A’ of matrices A, A’ ∈ ℝIxJ is the IxJ matrix
whose (i, j) component (i ∈ [I], j ∈ [J]) is given by aij a’ij
Vectors and matrix products
Given vectors a, a’ ∈ ℝ𝐼, b ∈ ℝ𝑱 and matrices A ∈ ℝ𝐼x𝑱, B ∈ ℝ𝑱x𝑲 for K ∈ ℕ, their product is
aTa’ returns a scalar, whose value is ⟨a, a’⟩
abT returns the IxJ matrix whose (i, j) component (i ∈ [I], j ∈ [J]) is ai bj
Ab returns the I-dimensional vector whose i-th (i ∈ [I]) value is ai:Tb
AB returns an IxK matrix whose (i, k) component (i ∈ [I], k ∈ [K]) is ai:Tb:k
Rank.
For a matrix A ∈ ℝIxJ, the number of nonzero singular values is called the rank of A.
By definition, the rank of A is greater than or equal to 0 and less than or equal to min(𝑰,𝑱).
Differentiation
For a once-differentiable function 𝒇 : ℝ→ℝ with Θ ∈ ℝ as argument
Let the derivative be ∂𝒇 / ∂θ
The derivative coefficient at a point θ’ ∈ ℝ is denoted by ∂𝒇 / ∂θ|θ=θ’.
Gradient
For a once-differentiable function g : ℝI → ℝ with θ ∈ ℝI as argument
We denote the partial derivative with respect to a dimension 𝒊 ∈ [𝐼] as ∂𝔤/∂θ𝒊.
The vector of the partial derivatives for all dimensions is called the gradient ∇g = (∂g/∂θ1, ∂g/∂θ2, …, ∂g/∂θI)T
The value of the gradient at a point θ’ ∈ ℝ𝑰 is denoted as ∇g(θ’) = (∂g/∂θ1|θ=θ’, ∂g/∂θ2|θ=θ’, … ∂g/∂θ𝑰|θ=θ’).
4.2 Simple Matrix Decomposition
Introduction
Compression of relational matrix data
Example: Movie recommendation
A matrix where customers are the rows and movies are the columns, and the values of the elements are the evaluation values
Let I ∈ ℕ be the number of customers and J ∈ ℕ be the number of movies
Express as I x J matrix X
The higher the value of Xij, the higher the rating
Consider an arbitrary customer i ∈ [I].
Matrix decomposition
Summarize the R preference components of customer i in ui: ∈ ℝR, and stack them in the row direction to form U ∈ ℝIxR
Let vj: ∈ ℝR summarize the R components of movie j, and stack them in the row direction to form V ∈ ℝJxR
xij ≃ ui:Tvj:
X ≃ UVT
X with IJ components is represented by the product of U with IR components and V with JR components
Decomposition of X into the form “X ≃ UVT”
Rank-R matrix decomposition of X
U and V are called factor matrices.
4.2.1 Objective Function
How do we get U and V from X?
Find U and V such that “X is well represented”.
If the difference between X and UVT is small, then UVT is a good representation of X
E = X -UVT
The closer each element of E is to 0, the better the approximation.
How do we measure the size of E?
How to measure error in machine learning
Absolute error
Sum of absolute values of all elements
It is not easily affected even if some of the elements in X are outliers with large values.
Squared error
Sum of the squares of all elements
Susceptible to outliers
Differentiation is always defined, so optimization techniques using differentiation can be used
Final formula
Simple matrix decomposition
Coefficient 1/2 is introduced formally to cancel the coefficient of differentiation
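The final formula referred to above is, reconstructed in standard notation consistent with the surrounding definitions:

```latex
\min_{U, V} \; f(U, V) = \frac{1}{2} \left\| X - U V^{\top} \right\|_{\mathrm{Fro}}^{2}
```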

4.2.2 Optimization
Optimization methods for the objective function defined above
Singular value decomposition (SVD)
Several sufficiently reliable and efficient implementations are available, such as LAPACK (Linear Algebra PACKage).
Not generally applicable if X contains missing values
Not general purpose
Optimization by gradient method
Can be applied to any objective function as long as the gradient can be calculated
Relatively easy to add constraints
Can be combined with stochastic gradient method to increase computational efficiency
Effective when X is large
When R is more than 2, the solution has initial value dependency, and the same solution may not be obtained every time.
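A minimal sketch of obtaining the rank-R decomposition via truncated SVD when X has no missing values (the function and variable names are mine):

```python
import numpy as np

def svd_factorize(X, R):
    """Rank-R matrix decomposition X ≈ U V^T via truncated SVD (no missing values)."""
    U_full, s, Vt = np.linalg.svd(X, full_matrices=False)
    U = U_full[:, :R] * s[:R]        # absorb the singular values into U
    V = Vt[:R].T
    return U, V

X = np.random.default_rng(0).random((6, 5))
U, V = svd_factorize(X, R=2)
print(np.linalg.norm(X - U @ V.T))   # approximation error (Frobenius norm)
```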
4.2.3 Interpretation as similarity
What is the interpretation of U and V obtained by matrix factorization?
Decompose each element of X
xij = ui:Tvj: + eij
For simplicity, we assume that the norms of ui: and vj: are normalized to 1
For all i ∈ [I], j ∈ [J], ∥ui:∥ = ∥vj:∥ = 1
ui:Tvj: is the inner product of the normalized vectors
Similarity between customer i and movie j
1 if ui: and vj: point in the same direction, -1 if they point in opposite directions
Customer i’s evaluation of movie j = similarity between customer i and movie j + error
The similarity between customer i and movie j is sim(ui:, vj:)
Reduce the error = make xij and sim(ui:,vj:) as close as possible
If customer i seems to think highly of movie j, make ui: and vj: as close as possible in the same direction
U and V will capture the characteristics of the customer and the movie
Latent Space
4.3 Various Matrix Decompositions
4.3.1 ℓ2-Regularized Matrix Decomposition
In simple matrix factorization, we focus only on the error as the target of minimization
Minimizing only the error may produce bad results
When there is a large noise added to X
The noise component will be incorporated into U and V
One way to remove the effect of noise
Regularization
Narrow the range of solutions to prevent overfitting
ℓ2-regularized matrix decomposition
Equation with regularization added
λ≥0 is a coefficient that determines the strength of the regularization
λ=0 is a simple matrix decomposition
When λ is large, shrinking U and V is weighted more heavily than reducing the error
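The equation with regularization referred to above can be reconstructed as follows (ℓ2 regularization of both factor matrices; λ = 0 recovers simple matrix decomposition):

```latex
\min_{U, V} \; \frac{1}{2} \left\| X - U V^{\top} \right\|_{\mathrm{Fro}}^{2}
  + \frac{\lambda}{2} \left( \| U \|_{\mathrm{Fro}}^{2} + \| V \|_{\mathrm{Fro}}^{2} \right)
```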
4.3.2 Non-negative matrix factorization
In the above approaches, each component of X can be an arbitrary real number.
Can be restricted to a narrow class depending on the analysis target
Non-negative matrix (xij≥0 for all i,j)
Can benefit from adding non-negative constraints
Example:
Text data consisting of I documents
J types of words appearing in the documents
Each word is assigned an index from 1 to J.
Count the number of times each word appears in the document, and summarize the results as Z∈ℕIxJ
zij contains the frequency of occurrence of word j in document i
The total number of words (including duplicates) appearing in all documents is T=∑∑zij
Dividing each element of Z by this, we get X=Z/T
What does X represent?
Row direction
The i-th row vector xi: of X is the (unnormalized) probability distribution of words appearing in document i
Represents the simultaneous probability of document and word
Example
Document i is a book about Japanese food
Words related to Japanese food such as “dashi”, “soy sauce”, and “egg” appear frequently
The corresponding xi: value will be high
Document i’ is a book about machine learning
Technical words such as “matrix”, “network”, and “learning” occupy the top of the frequently occurring words.
Conceptual meaning
R is the total number of patterns
p(pattern r) is the ratio of pattern r to all patterns
Expanding the formula for the probability distribution of a word
uir = p(pattern r) p(document i | pattern r)
vjr=p(word j|pattern r)
which is the formula for matrix factorization
Non-negative matrix decomposition
In natural language processing, “pattern r” is called “topic”.
Extracting topics from matrix X (or Z) is called “topic extraction”.
In non-negative matrix decomposition, not only squared-error minimization but also the Kullback-Leibler divergence is used.
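For reference, a minimal sketch of non-negative matrix factorization with scikit-learn (the word-count matrix here is a random stand-in for real document data, and the KL option is selected via beta_loss):

```python
import numpy as np
from sklearn.decomposition import NMF

Z = np.random.default_rng(0).integers(0, 5, size=(8, 12)).astype(float)  # fake counts
X = Z / Z.sum()                                 # joint probability matrix

model = NMF(n_components=3, beta_loss="kullback-leibler", solver="mu",
            max_iter=500, init="nndsvda")
U = model.fit_transform(X)                      # document-topic factors (non-negative)
V = model.components_.T                         # word-topic factors (non-negative)
```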
4.4 Algorithm
4.4.1 First Order Alternating Gradient Descent
Alternating gradient descent
One of the optimization methods for functions of two or more variables
A method for optimizing a function h(θ1, θ2, …, θK) ∈ ℝ with K ∈ ℕ variables θ1, θ2, …, θK as arguments
The alternating gradient descent method updates the variables as follows
θknew = θk − η∇θk h(θ1new, …, θk−1new, θk, θk+1, …, θK)
η>0 is an optimization parameter called the learning rate
η is a parameter that determines the step width taken in the gradient direction in a single update.
If η is too small, the variables advance only a little at a time and it takes many iterations to reach a stationary point.
If it is too large, the solution will diverge or oscillate.
In general, it is better to start with a small value (e.g. η = 0.01) and gradually increase it.
Optimization of ℓ2 regularized matrix factorization by alternating gradient descent
First derive the gradient
From the definition of the Frobenius norm, we can rewrite it as a sum over each component of X
Gradient with respect to U: ∇U f = −(X − UVT)V + λU
Gradient with respect to V: ∇V f = −(X − UVT)TU + λV
Alternate optimization algorithm for ℓ2 regularized matrix factorization
When a variable is a matrix and we want to differentiate with respect to it
If you try to differentiate the matrix as it is, it becomes much more difficult than necessary.
Convert matrix norms and products into element-wise sums and rewrite the objective function in terms of scalar variables
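A minimal sketch of first-order alternating gradient descent for the ℓ2-regularized problem, using the standard gradients −(X − UVT)V + λU and −(X − UVT)TU + λV; the learning rate, iteration count, and initialization are arbitrary choices here.

```python
import numpy as np

def alternating_gradient_descent(X, R, eta=0.001, lam=0.1, n_iter=500, seed=0):
    """First-order alternating gradient descent for l2-regularized matrix
    factorization (a sketch): update U with V fixed, then V with the updated U."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    U = 0.1 * rng.standard_normal((I, R))
    V = 0.1 * rng.standard_normal((J, R))
    for _ in range(n_iter):
        E = X - U @ V.T                          # residual with current U, V
        U = U - eta * (-(E @ V) + lam * U)       # gradient step in U (V fixed)
        E = X - U @ V.T                          # residual with the updated U
        V = V - eta * (-(E.T @ U) + lam * V)     # gradient step in V (U fixed)
    return U, V
```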
4.4.2 Pseudo-second-order alternating gradient descent method
In the first-order alternating gradient descent method, the update equations are simple and implementation is relatively easy. However, convergence may be slow because only first-order derivative (gradient) information about the objective function is used.
In addition to the first-order derivative, information on the second-order derivative is used to speed up convergence.
Use of Hessian matrix
The Newton method: a sequential optimization method that uses the Hessian matrix
Approximating the Hessian matrix by a block-diagonal matrix allows efficient computation of its inverse
Inverse matrix calculation using the properties of block diagonal matrices
Only the inverses of Φ and Ψ need to be calculated
Since Φ and Ψ are themselves block-diagonal, the computation is even more efficient
Omit calculations during the process
Pseudo-second-order alternating gradient descent method
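A hedged sketch of one pseudo-second-order sweep, assuming the block-diagonal approximation in which every row of U shares the R×R block VTV + λI (this is exact when V is held fixed); under that assumption the Newton step coincides with an alternating least-squares update.

```python
import numpy as np

def block_newton_sweep(X, U, V, lam):
    """One pseudo-second-order (block-Newton) sweep, assuming each row of U shares
    the block Hessian V^T V + lam*I, and symmetrically for V. With this
    approximation the Newton step reduces to the closed-form updates below."""
    R = U.shape[1]
    I_R = np.eye(R)
    U = X @ V @ np.linalg.inv(V.T @ V + lam * I_R)      # Newton step for U (V fixed)
    V = X.T @ U @ np.linalg.inv(U.T @ U + lam * I_R)    # Newton step for V (U fixed)
    return U, V
```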
4.4.3 Constrained Optimization
If you want to include constraints, combine alternating optimization with the projected gradient descent method
Updating U using the projected gradient descent method
Pseudo-second-order alternating projected gradient descent algorithm for ℓ2 regularized nonnegative matrix factorization
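A minimal sketch of one projected gradient step on U for the non-negative case (a first-order version for simplicity): take an ordinary gradient step, then project back onto the non-negative orthant by clipping negative entries to zero.

```python
import numpy as np

def projected_gradient_step_U(X, U, V, eta, lam):
    """One projected gradient step on U for l2-regularized non-negative matrix
    factorization (sketch): gradient step followed by projection onto U >= 0."""
    grad_U = -(X - U @ V.T) @ V + lam * U
    return np.maximum(U - eta * grad_U, 0.0)   # clipping = projection onto the non-negative orthant
```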
4.4.4 Stochastic gradient descent method
When X is dense and I and J are large, and X does not fit into memory
Calculate the matrix decomposition using the stochastic gradient descent method.
Instead of using all samples to compute the gradient exactly, use a small number of samples to approximate the gradient and optimize it.
Intuitive interpretation
When the objective function f can be written as a sum of “functions fn that depend only on sample n” (i.e., f(x1,x2,…) = ∑ fn(xn))
Instead of the gradient for f, use the gradient for fn
Different stochastic gradient algorithms can be defined depending on how the samples are taken
(A) Element-wise stochastic gradient method
Most common method, using each element xij of X as a sample
Writing the objective as a per-element sum, it can be rewritten as a sum of functions that depend only on ui: and vj:
The gradient of this with respect to ui: and vj:
From this gradient, an update equation is obtained, and this is repeated for all i, j to optimize U, V
Element-wise stochastic gradient algorithm
If you want to solve a problem with large R, it is better to rely on the stochastic gradient descent method.
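A minimal sketch of the element-wise stochastic gradient method; treating every element of X as observed and applying the regularization per sampled element are simplifications made here, not prescriptions from the book.

```python
import numpy as np

def elementwise_sgd(X, R, eta=0.01, lam=0.01, n_epochs=20, seed=0):
    """Element-wise stochastic gradient method (sketch): each element x_ij is one
    sample, and a single update touches only u_i: and v_j:."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    U = 0.1 * rng.standard_normal((I, R))
    V = 0.1 * rng.standard_normal((J, R))
    indices = [(i, j) for i in range(I) for j in range(J)]
    for _ in range(n_epochs):
        rng.shuffle(indices)                       # visit elements in random order
        for i, j in indices:
            ui = U[i].copy()                       # keep the old u_i for a simultaneous update
            e = X[i, j] - ui @ V[j]                # residual for the sampled element
            U[i] += eta * (e * V[j] - lam * ui)
            V[j] += eta * (e * ui - lam * V[j])
    return U, V
```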
(B) Matrix-by-matrix stochastic gradient method
The element-wise method is useful in settings where access to the whole of X is expensive but random access to each component of X is cheap
What should we do if access to each component of X is also time consuming?
Example: Matrix Decomposition of a Word Co-occurrence Matrix
What is a co-occurrence matrix?
Given a set of documents, a co-occurrence matrix is a matrix constructed by counting the number of words that occur simultaneously in the same document.
By decomposing it, we can extract the relationship between words.
Given M documents, the total vocabulary is I ∈ ℕ
In the co-occurrence matrix X ∈ ℕIxI, each column and each row corresponds to a word index, and Xij contains the number of documents in which the words i and j appear simultaneously.
It is necessary to read all documents m = 1, …, M and compute xij = ∑m c(m)ij
If I and M are large, the computational complexity will be large.
When X is given as the sum of several matrices, we can derive the stochastic gradient method using each matrix as a sample.
Introduce ℙm[*] as an operator to take the average of the samples
If we denote the objective function of simple matrix factorization as fstd(X;U,V) = 1/2∥X – UVT∥2Fro
Let C(m) be the co-occurrence matrix of document m ∈ [M]
The objective function fstd(X;U,V) for decomposing X can then be written, up to a constant that does not depend on U and V, as the average over m of the objective functions fstd(MC(m);U,V) for decomposing MC(m)
Matrix-by-matrix stochastic gradient descent algorithm
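A hedged sketch of one matrix-by-matrix stochastic gradient step for the co-occurrence example; the binary per-document co-occurrence construction and the step sizes are assumptions for illustration.

```python
import numpy as np

def cooccurrence(doc_word_ids, I):
    """C(m) for one document m: c_ij = 1 if words i and j both appear in the
    document, else 0 (the handling of the diagonal is a simplifying assumption)."""
    present = np.zeros(I)
    present[list(doc_word_ids)] = 1.0
    return np.outer(present, present)

def matrixwise_sgd_step(U, V, doc_word_ids, I, M, eta=0.01, lam=0.1):
    """One step of the matrix-by-matrix stochastic gradient method (sketch).
    Sampling document m uniformly, E[M * C(m)] = sum_m C(m) = X, so the residual
    below gives an unbiased estimate of the full-data residual X - U V^T."""
    Rm = M * cooccurrence(doc_word_ids, I) - U @ V.T   # stochastic residual
    U_new = U + eta * (Rm @ V - lam * U)
    V_new = V + eta * (Rm.T @ U - lam * V)
    return U_new, V_new
```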
4.5 Matrix Decomposition in the Presence of Missing Values
4.5.1 Excluding missing values
A method that assumes missing values are completely unknown and does not use them at all in the training process.
Gradient equation
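A minimal sketch of the gradients when missing values are excluded, using an observation mask W (w_ij = 1 if x_ij is observed, 0 if missing).

```python
import numpy as np

def masked_gradients(X, W, U, V, lam):
    """Gradients for l2-regularized matrix factorization with missing values
    excluded (sketch): missing entries simply do not contribute to the residual."""
    E = W * (X - U @ V.T)                # residuals only at observed positions
    grad_U = -E @ V + lam * U
    grad_V = -E.T @ U + lam * V
    return grad_U, grad_V
```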
4.5.2 Completing missing values
The missing values are complemented with some value and included in the objective function.
Let Ẋ be the matrix that completes the missing values of X.
The values used for completion are the model’s predictions from the previous iteration
The update equation then becomes exactly equivalent to the exclusion case (the residual at completed positions is zero, so they contribute nothing to the gradient)
4.6 Related Topics
4.6.1 Interpretation as a probabilistic model
Matrix factorization can also be interpreted as a probabilistic model
Advantages of formulating matrix factorization as a probabilistic model and solving it as a parameter estimation problem
Extension of the noise model
When each element of X takes a real value, assuming Gaussian noise, which is also real-valued, is natural.
Gaussian noise is not appropriate for discrete values such as movie recommendations
Thinking of it as a probability model, we can deal with this kind of problem by changing the distribution of X (e.g., from normal to Poisson).
Generalized model of X distribution from normal distribution to exponential family of distributions
Generalized matrix factorization
Integration of prior knowledge
Depending on the problem, information other than X may also be available
Example: In the case of movie recommendation
Information about the customer (age, gender)
Information about the movie (release date, genre)
These information can be integrated as prior distributions for U and V using the concept of hierarchical Bayesian modeling
Bayesian estimation of U and V
By introducing a prior distribution and performing Bayesian estimation, we can capture the information of U and V as a distribution.
More detailed information can be obtained, such as the certainty of the estimated quantity, which could not be obtained by estimating U and V as points.
Determination of R
R can be determined by computing the marginal likelihood, i.e. the likelihood function marginalized with respect to the parameters.
Literature on probabilistic interpretation of matrix factorization
57
4.6.2 Extension to Nonlinear Models
In this chapter, linearity is assumed between X and U, V
Differentiation is easily obtained and optimization becomes easier
Support for nonlinearity
Kernel Principal Component Analysis
Matrix decomposition nonlinearized by kernel method
Gaussian process latent variable model (GPLVM)
A Bayesian extension of kernel principal component analysis
Models the mapping from the latent variables to the observations as a Gaussian process
Increasing computational complexity is a problem
The complexity of the model depends on R
4.6.3 How to determine R
How to determine R in a practical way
Look at the validation error
Randomly select some elements of the observed part of X to be the validation data
Treat the entries selected as validation data as missing values during learning
After learning, measure the error between the validation data and the prediction (validation error)
Perform the above steps for several Rs.
Select the R with the smallest validation error.
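A minimal sketch of this validation procedure; the plain masked gradient descent used as the learner and all hyperparameters are placeholder choices.

```python
import numpy as np

def choose_R(X, W_obs, candidate_Rs, val_ratio=0.1, eta=0.005, lam=0.1,
             n_iter=2000, seed=0):
    """Pick R by validation error (sketch): hide a random subset of the observed
    entries, fit each candidate R treating them as missing, and keep the R whose
    predictions on the hidden entries are best."""
    rng = np.random.default_rng(seed)
    obs_i, obs_j = np.nonzero(W_obs)                  # positions of observed entries
    val = rng.random(len(obs_i)) < val_ratio          # hold-out flags
    W_train = W_obs.copy()
    W_train[obs_i[val], obs_j[val]] = 0               # hide validation entries
    best_R, best_err = None, np.inf
    for R in candidate_Rs:
        U = 0.1 * rng.standard_normal((X.shape[0], R))
        V = 0.1 * rng.standard_normal((X.shape[1], R))
        for _ in range(n_iter):                       # masked gradient descent
            E = W_train * (X - U @ V.T)
            U, V = U + eta * (E @ V - lam * U), V + eta * (E.T @ U - lam * V)
        pred = U @ V.T
        err = np.mean((X[obs_i[val], obs_j[val]] - pred[obs_i[val], obs_j[val]]) ** 2)
        if err < best_err:
            best_R, best_err = R, err
    return best_R
```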
4.6.4 Real data
MovieLens data
Movie recommendation data
Medium scale
Music recommendation data from Yahoo!
Data used in KDD cup 2011
Largest size
KONECT
250 real network datasets of various sizes, with node counts ranging from hundreds to tens of millions
SNAP
Relatively large network data

Chapter 5: Higher-Order Relational Data and Tensors

Introduction
When there are three or more types of objects, a matrix does not have enough axes.
Tensors are an extension of matrices, adding more “axes” in addition to rows and columns.
5.1 Definition of Terms
Examples of data represented as tensors
From a binary to a ternary relationship
Example
Customer, movie, time.
Longitude, latitude, wind speed
Subject, predicate, object.
Example: movie
I customers, J movies, K time points (I,J,K∈ℕ)
Evaluation can be represented by an IxJxK tensor 𝓧
𝓧 is a 3-dimensional array that returns the value of the evaluation when three indices are determined.
Each element is xijk = the evaluation of movie j by customer i at time k
The number of object types (3 in this example) is called the order of the tensor
A tensor of order L is called an Lth-order tensor
Each axis of the tensor is called a mode.
The length of each axis is called the dimension
In the movie example, the dimension of the mode “customer” is I
The tuple of all the mode dimensions (IxJxK) is called the size of the tensor.
For convenience, tensors where the dimension of a mode is 1 will be omitted for that mode and treated as tensors of one lower order.
An IxJx1 tensor is treated as an IxJ matrix.
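A small numpy illustration of these terms; the shapes are arbitrary placeholders.

```python
import numpy as np

I, J, K = 4, 3, 2
X = np.random.rand(I, J, K)         # a third-order tensor: 3 modes (e.g. customer, movie, time)
print(X.ndim)                       # order (number of modes): 3
print(X.shape)                      # dimension of each mode, i.e. the size: (4, 3, 2)
Y = np.random.rand(I, J, 1)         # an I x J x 1 tensor ...
print(np.squeeze(Y, axis=2).shape)  # ... is treated as an I x J matrix (one order lower)
```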
5.2 Linear Operations on Tensors
Introduction
Comparison of operations on vectors, matrices, and tensors
“Contracting all modes”, which is equivalent to the inner product
“Reducing one mode”, which is equivalent to a matrix-vector product
“Changing the dimension of a mode”, which is equivalent to a matrix-matrix product
“Increasing the number of modes”, which is equivalent to the outer product
5.2.1 Addition, scalar multiplication, and inner product
In tensors, addition (subtraction) and scalar multiplication are defined in the same way as in matrices
Given the IxJxK tensor 𝓧 and 𝓨
Tensor addition 𝓧+𝓨 is the operation of adding 𝓧 and 𝓨 element by element
For all i ∈ [I], j ∈ [J], k ∈ [K], [𝓧+𝓨]ijk = xijk + yijk
The operation of multiplying all the elements of 𝓧 by some real number α is called scalar multiplication α𝓧 ([α𝓧]ijk = αxijk)
Inner product of tensors
Defined as the sum of the element-wise products of the two tensors
The Frobenius norm is defined for tensors in the same way
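A small numpy sketch of these operations; the shapes are arbitrary.

```python
import numpy as np

X = np.random.rand(3, 4, 5)
Y = np.random.rand(3, 4, 5)
Z = X + Y                      # element-wise addition of two tensors of the same size
aX = 2.5 * X                   # scalar multiplication
inner = np.sum(X * Y)          # inner product: sum of element-wise products
fro = np.sqrt(np.sum(X * X))   # Frobenius norm induced by that inner product
assert np.isclose(fro, np.linalg.norm(X))
```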
5.2.2 Modal Product
Preliminaries
The product of vectors and matrices
Given an I-dimensional vector x and an IxJ matrix A, ATx=∑xiai
The product with x can be viewed as an operator that takes an IxJ matrix as input and returns a J-dimensional vector.
Generalization of the above
Consider an operation with x that takes a tensor of order L ∈ ℕ as input and returns a tensor of order L−1
In the case of the product of a vector and a matrix, two different operations are possible depending on whether the summation is performed on the rows or the columns
For the product of a vector and a tensor, there are L different operations depending on which mode is summed over
The product that sums over mode l ∈ [L] is called the mode-l multiplication
The mode-l product A xl x = B of an I1xI2x…xIL tensor A and an Il-dimensional vector x is defined accordingly
B is a tensor of order L−1
Product of a matrix and a tensor
The product of a vector and a tensor is an operation where certain modes are “squashed”.
The product of a matrix and a tensor is an operation in which a particular mode is “stretched” or “shrunk”.
Mode l product of Lth order tensor A and IlxJ matrix Y
Equation for the case of L=3
Image
Example
For IxJ matrix A, IxK matrix B and JxN matrix C
A x1 B x2 C =BTAC
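A quick numpy check of this identity (shapes arbitrary), writing the two mode products as a single einsum contraction.

```python
import numpy as np

I, J, K, N = 3, 4, 5, 2
A = np.random.rand(I, J)      # treated as an I x J (second-order) tensor
B = np.random.rand(I, K)      # matrix for the mode-1 product (first dimension matches mode 1)
C = np.random.rand(J, N)      # matrix for the mode-2 product (first dimension matches mode 2)
# mode-1 product contracts the size-I mode of A with B, mode-2 product contracts the size-J mode with C
result = np.einsum('ij,ik,jn->kn', A, B, C)
assert np.allclose(result, B.T @ A @ C)   # A x1 B x2 C = B^T A C
```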
5.2.3 Slices
For matrices
For IxJ matrix X
By fixing the i ∈ [I]th row and moving the index in the column direction, we obtain the row vector xi:
By fixing the j ∈ [J]th column and moving the index in the row direction, we obtain the column vector x:j
In the case of tensors
The i-th mode-l slice
𝒳i(l)
The I1x…xIl−1xIl+1x…xIL tensor defined by fixing mode l of the I1xI2x…xIL tensor 𝓧 at index i ∈ [Il] and moving all the other indices
Slice image of the tensor

5.2.4 Outer Product
The case of vectors
The outer product xyT of vectors x and y is an operation that creates a second-order tensor (matrix) from first-order tensors (vectors).
Generalization with Tensors
Create a tensor of order L+1 from a tensor of order L
For any tensor 𝓧 of order L and any vector y, the mode-(L+1) product 𝓧 x(L+1) y is written as the outer product 𝓧 ∘ y
Image of the outer product of a matrix and a vector
5.3 Conversion to Matrix Operations
Introduction
Many tensor operations can be rewritten as equivalent matrix operations by using vectorized operators and Kronecker products.
The rewritten matrix operations can then use standard linear algebra libraries such as BLAS (Basic Linear Algebra Subprograms).
5.3.1 Vectorized Operators and Kronecker Products
As a preparation, we introduce the tensor vectorization operator vec.
An operator that rearranges the elements of a tensor into a vector, with the lowest-numbered mode varying fastest
Example
For an IxJxK tensor 𝓧, vec(𝓧) returns a vector of dimension IJK
[vec(𝓧)]i+I(j−1)+IJ(k−1) = xijk
Kronecker product
A⊗B
5.3.2 Modal Product, Outer Product, Inner Product
The modal product
The mode products of an I1xI2x…xIL tensor with Il x Il’ matrices Al (l ∈ [L]) can be rewritten using vec and the Kronecker product
The vectorization of a tensor defined by an outer product (i.e. a rank 1 tensor) can be written as a Kronecker product
The inner product of tensors can be written as the inner product of vectorized tensors by definition.
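A quick numpy check of the rank-1 case, assuming a column-major vec (mode 1 varying fastest, as in the index formula above).

```python
import numpy as np

x, y, z = np.random.rand(3), np.random.rand(4), np.random.rand(5)
T = np.einsum('i,j,k->ijk', x, y, z)      # rank-1 tensor x ∘ y ∘ z
vec_T = T.reshape(-1, order='F')          # column-major vec: mode 1 varies fastest
# vec of the rank-1 tensor equals a Kronecker product of the three vectors
assert np.allclose(vec_T, np.kron(z, np.kron(y, x)))
```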
5.3.3 Note on computational complexity
Difference between mode product and matrix arithmetic
In a straightforward tensor computation, the “inflating” and “compressing” work proceeds in parallel
In the matrix computation, the “inflating” work is done all at once (the Kronecker product), and then the “compressing” work is done (the product with the vectorized tensor)
The intermediate variables required therefore become large
In practice, we can still benefit from linear algebra libraries
5.3.4 Tips on expression expansion for tensor operations

Chapter 6 Tensor Decomposition

6.1 Dimensional Compression of Tensors
How can we perform dimensional compression on a tensor?
Consider a tensor 𝓧 with size IxJxK as data.
6.2 CP Decomposition
Introduction.
A method to decompose a given tensor into a sum of rank-1 tensors
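A minimal sketch of what a rank-R CP model looks like once factor matrices are given; the factor matrices here are random placeholders, not estimated ones.

```python
import numpy as np

I, J, K, R = 4, 5, 6, 3
# hypothetical factor matrices; in practice they are estimated from the data
U, V, W = np.random.rand(I, R), np.random.rand(J, R), np.random.rand(K, R)
# CP model: X_hat[i,j,k] = sum_r U[i,r] * V[j,r] * W[k,r]  (a sum of R rank-1 tensors)
X_hat = np.einsum('ir,jr,kr->ijk', U, V, W)
# equivalently, add up the R rank-1 (outer-product) tensors one by one
X_sum = sum(np.einsum('i,j,k->ijk', U[:, r], V[:, r], W[:, r]) for r in range(R))
assert np.allclose(X_hat, X_sum)
```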
6.2.1 Motivation
6.2.2 Objective Function
6.2.3 Algorithm
6.3 Tucker Decomposition
Introduction
A method that allows more complex expressions than CP decomposition.
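A hedged sketch of the Tucker model with a core tensor mixing all factor matrices; the core dimensions and the orientation of the factor matrices are assumptions that may differ from the book's notation.

```python
import numpy as np

I, J, K = 6, 5, 4
P, Q, R = 3, 3, 2                      # core dimensions (assumed for illustration)
G = np.random.rand(P, Q, R)            # core tensor
A = np.random.rand(I, P)               # factor matrix for mode 1
B = np.random.rand(J, Q)               # factor matrix for mode 2
C = np.random.rand(K, R)               # factor matrix for mode 3
# Tucker reconstruction: X_hat[i,j,k] = sum_{p,q,r} G[p,q,r] A[i,p] B[j,q] C[k,r],
# i.e. the core is multiplied along every mode (the mode products of Chapter 5)
X_hat = np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)
```

Unlike CP, where only matching components interact, the core tensor G lets any component of one mode interact with any component of another.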
6.3.1 Motivation
6.3.2 Objective Function
6.3.3 Algorithm
6.4 Difference between CP Decomposition and Tucker Decomposition
Introduction.
CP decomposition is a special case of Tucker decomposition
6.4.1 Interpretation as Similarity
Usage
Use CP decomposition if the components are roughly independent of one another
Use Tucker decomposition if they are intertwined with one another
6.5 Supplement
6.5.1 Practical Application Examples
An extension of the Tucker decomposition is proposed to extract relationships between countries
Country x Country x Time x Action
72
Application of CP decomposition to linguistic data through efficient computation
Subject x Verb x Object
32
Analysis of Brain Information Data
EEG (electroencephalogram)
Channel x frequency x time
Which EEG regions are activated in relation to what tasks?
6.5.2 Different Optimization and Learning Algorithms
Stochastic gradient descent method
Convex Optimization
6.5.3 Decomposition consisting of only second-order interactions
6.5.4 Given Multiple Tensors and Matrices
6.5.5 Other Tensor Decompositions
