Structural Learning


About Structural Learning

Learning the structure inherent in data is important for interpreting what the data represents. The simplest form of structural learning is learning hierarchical structure, as in hierarchical clustering or the tree structures built by decision tree learners. There are also relational data learning, which learns relationships between data items, graph structure learning, sparse structure learning, and so on.

In this blog, we discuss the following topics related to these structural learning methods.

Implementation

Structural Learning is a branch of machine learning that refers to methods for learning structures and relationships in data, usually in the framework of unsupervised or semi-supervised learning. Structural learning aims to identify and model patterns, relationships, or structures present in the data to reveal the hidden structure behind the data. Structural learning targets different types of data structures, such as graph structures, tree structures, and network structures.

This section discusses various applications and concrete implementations of structural learning.

Techniques for analyzing graph data that changes over time have been applied in a variety of areas, including social network analysis, web traffic analysis, bioinformatics, financial network modeling, and transportation system analysis. Here we provide an overview of these techniques, their algorithms, and examples of implementations.

Snapshot analysis is a method of data analysis that accounts for change over time by working with snapshots of the data taken at different points in time. This approach helps in analyzing time-stamped data sets to understand temporal patterns, trends, and changes, and when combined with graph data analysis it allows for a deeper understanding of temporal changes in network and relational data. This section provides an overview of this approach and examples of algorithms and implementations.

Dynamic community detection (dynamic community analysis) is a technique for tracking and analyzing temporal changes in communities (modules or clusters) within a network that carries time-related information (a dynamic network). It usually targets graph data (dynamic graphs) whose nodes and edges are annotated with time information, and it has been applied in various fields such as social network analysis, bioinformatics, Internet traffic monitoring, and financial network analysis.

Dynamic centrality metrics are a form of graph data analysis that takes changes over time into account. The usual centrality metrics (e.g., degree centrality, betweenness centrality, eigenvector centrality) are designed for static networks and provide only a single snapshot of the importance of a node. However, since real networks often have a temporal component, it is important to consider how the network, and hence centrality, changes over time.
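As a minimal illustration (assuming NetworkX is available; the snapshot graphs and node labels are made up), the sketch below computes degree and betweenness centrality for each time snapshot and tracks how one node's importance changes over time.

```python
import networkx as nx

# Hypothetical sequence of graph snapshots (t = 0, 1, 2); in practice these
# would come from time-stamped edge data.
snapshots = [
    nx.Graph([("A", "B"), ("B", "C")]),
    nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("B", "D")]),
    nx.Graph([("A", "C"), ("C", "D"), ("D", "E"), ("B", "D")]),
]

# Compute centrality metrics per snapshot to see how node importance evolves.
for t, G in enumerate(snapshots):
    deg = nx.degree_centrality(G)
    btw = nx.betweenness_centrality(G)
    print(f"t={t}  degree('B')={deg.get('B', 0.0):.2f}  "
          f"betweenness('B')={btw.get('B', 0.0):.2f}")
```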

Dynamic module detection is a method of graph data analysis that takes time variation into account. This method tracks changes in communities (modules) in a dynamic network and identifies the community structure at different time snapshots. Here we present more information about dynamic module detection and an example implementation.

Dynamic graph embedding is a powerful technique for graph data analysis that takes temporal variation into account. This approach aims to learn representations of nodes and edges that evolve along the time axis as the graph data changes over time.

Tensor decomposition (TD) is a method for approximating high-dimensional tensor data by low-rank tensors. This technique is used for dimensionality reduction and feature extraction and is a useful approach in a variety of machine learning and data analysis applications. Applying tensor decomposition to dynamic module detection is relevant to tasks such as module detection in time series and other dynamic data.
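As a rough sketch of the idea (pure NumPy, using a truncated higher-order SVD as one simple form of Tucker decomposition; the tensor shape, e.g. time x node x node, and the ranks are made-up examples):

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move the chosen axis to the front and flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_multiply(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    moved = np.moveaxis(T, mode, 0)
    return np.moveaxis(np.tensordot(M, moved, axes=1), 0, mode)

def hosvd(T, ranks):
    """Truncated higher-order SVD (a simple Tucker decomposition)."""
    factors = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = T
    for m, U in enumerate(factors):
        core = mode_multiply(core, U.T, m)
    return core, factors

# Toy dynamic-network tensor: 10 time steps x 20 nodes x 20 nodes (random data).
X = np.random.rand(10, 20, 20)
core, factors = hosvd(X, ranks=(3, 5, 5))

# Reconstruct the low-rank approximation and check the relative error.
X_hat = core
for m, U in enumerate(factors):
    X_hat = mode_multiply(X_hat, U, m)
print("relative reconstruction error:", np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```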

Network alignment is a technique for finding similarities between different networks or graphs and mapping them together. By applying network alignment to graph data analysis that takes into account temporal changes, it is possible to map graphs of different time snapshots and understand their changes.

Graph data analysis that takes into account changes over time using a time prediction model is used to understand temporal patterns, trends, and predictions in graphical data. This section discusses this approach in more detail.

Subsampling of large graph data reduces data size and controls computation and memory usage by randomly selecting portions of the graph, and is one technique to improve computational efficiency when dealing with large graph data sets. In this section, we discuss some key points and techniques for subsampling large graph data sets.
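A minimal sketch of random node sampling with NetworkX (the graph and sampling ratio are made-up examples; edge sampling or random-walk sampling would follow the same pattern):

```python
import random
import networkx as nx

def node_sample(G, ratio=0.1, seed=42):
    """Induce a subgraph on a random subset of nodes to shrink a large graph."""
    rng = random.Random(seed)
    k = max(1, int(ratio * G.number_of_nodes()))
    sampled_nodes = rng.sample(list(G.nodes()), k)
    return G.subgraph(sampled_nodes).copy()

# Toy example: a large random graph reduced to 10% of its nodes.
G = nx.gnp_random_graph(10_000, 0.001, seed=0)
H = node_sample(G, ratio=0.1)
print(H.number_of_nodes(), "nodes,", H.number_of_edges(), "edges after sampling")
```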

Displaying and animating graph snapshots on a timeline is an important technique for analyzing graph data, as it helps visualize changes over time and understand the dynamic characteristics of graph data. This section describes libraries and implementation examples used for these purposes.

This section describes how to create graph animations by combining NetworkX and Matplotlib, a technique for visually representing dynamic changes in networks in Python.
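A minimal sketch of such an animation using matplotlib.animation.FuncAnimation (the snapshot list is a made-up toy example):

```python
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import networkx as nx

# Toy sequence of graph snapshots to animate (all share the node set 0..4).
snapshots = [
    nx.path_graph(5),
    nx.cycle_graph(5),
    nx.complete_graph(5),
]
# Fix node positions so the layout does not jump between frames.
pos = nx.spring_layout(nx.complete_graph(5), seed=1)

fig, ax = plt.subplots()

def update(frame):
    ax.clear()
    nx.draw(snapshots[frame], pos=pos, ax=ax, with_labels=True, node_color="skyblue")
    ax.set_title(f"snapshot t={frame}")

anim = animation.FuncAnimation(fig, update, frames=len(snapshots), interval=1000)
plt.show()  # or anim.save("graph.gif", writer="pillow")
```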

Methods for plotting high-dimensional data in low dimensions using dimensionality reduction techniques to facilitate visualization are useful for many data analysis tasks, such as data understanding, clustering, anomaly detection, and feature selection. This section describes the major dimensionality reduction techniques and their methods.
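A minimal scikit-learn sketch with PCA and t-SNE as representative techniques (the digits data set is just a convenient example):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 64-dimensional features

# Linear reduction with PCA, nonlinear reduction with t-SNE, both to 2D.
X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=5, cmap="tab10")
axes[0].set_title("PCA")
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, s=5, cmap="tab10")
axes[1].set_title("t-SNE")
plt.show()
```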

Gephi is an open-source graph visualization software that is particularly suitable for network analysis and visualization of complex data sets. Here we describe the basic steps and functionality for visualizing data using Gephi.

Cytoscape.js is a graph theory library written in JavaScript that is widely used for visualizing network and graph data. Cytoscape.js makes it possible to add graph and network data visualization to web and desktop applications. Here are the basic steps and example code for data visualization using Cytoscape.js.

  • Visualization of Graph Data Using Sigma.js

Sigma.js is a web-based graph visualization library that can be a useful tool for creating interactive network diagrams. Here we describe the basic steps and functions for visualizing graph data using Sigma.js.

Similarity is a concept that describes the degree to which two or more objects share common features or properties and are considered alike; it plays an important role in evaluating, classifying, and grouping objects in terms of comparison and relatedness. This section describes the concept of similarity and general calculation methods for various cases.
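A small sketch of two common similarity calculations, cosine similarity for vectors and Jaccard similarity for sets (the data are made up):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard_similarity(s, t):
    """Jaccard similarity between two sets: |intersection| / |union|."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t) if s | t else 1.0

print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))  # 1.0
print(jaccard_similarity({"apple", "banana"}, {"banana", "cherry"}))            # 0.333...
```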

Technical Topics

The world around us is made up of two opposing concepts, “concrete” and “abstract.” The word “concrete” is most often used when explaining something in an easy-to-understand way, as in “In concrete terms…,” or when asking for clarification, as in “Could you be more specific?” The word “abstract,” on the other hand, tends to appear in contexts such as “I don’t understand what that person is saying because it’s so abstract.”

In this way, the generally accepted impression of these concepts is that “concrete” means easy to understand and “abstract” means difficult to understand. The word “abstract” thus often carries a negative impression, but abstraction is in fact the fundamental basis of human thought, and it is the faculty that makes us human and distinctly different from animals.

Hierarchical clustering using R is described.

While class recognition involves predicting the class to which a target object belongs, instance recognition is the task of identifying the target object itself. The central task of instance recognition is the image retrieval problem, which is to quickly find an image in a database from an input image. Instance recognition is the task of identifying the object itself, such that when we see the Tokyo Tower, we do not recognize it as a radio tower, but as the Tokyo Tower. This can be achieved by searching the database for images that show the same object as the one in the input image.

The implementation of instance recognition proceeds as follows: 1) extract local features from a set of stored images and build an image database, 2) extract local features of the query image, 3) for each local feature of the query image, compare it with all local features in the image database and cast one vote for the database image containing the most similar feature. The database image with the most votes is recognized as showing the same object as the query image.
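A rough sketch of this voting idea with OpenCV local features (ORB here; the image file names are placeholders, and votes are counted per database image as a simplified variant of the strict one-vote-per-feature scheme):

```python
import cv2

# Hypothetical database images and query image (file names are placeholders).
db_paths = ["db_image_0.jpg", "db_image_1.jpg", "db_image_2.jpg"]
query = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create()
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

# 1) Extract local features from the stored images to build the database.
db_descriptors = []
for path in db_paths:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, des = orb.detectAndCompute(img, None)
    db_descriptors.append(des)

# 2) Extract local features of the query image.
_, query_des = orb.detectAndCompute(query, None)

# 3) Match query features against each database image and count good matches
#    as votes; the image with the most votes is taken as the recognized instance.
votes = [0] * len(db_paths)
for i, des in enumerate(db_descriptors):
    if des is None or query_des is None:
        continue
    matches = bf.match(query_des, des)
    votes[i] = sum(1 for m in matches if m.distance < 50)

best = max(range(len(votes)), key=lambda i: votes[i])
print("recognized as database image:", db_paths[best], "with", votes[best], "votes")
```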

A decision rule is a simple IF-THEN statement consisting of a condition (also called a premise) and a prediction. For example: if it is raining today and it is April (condition), then it will rain tomorrow (prediction). Predictions can be made with a single decision rule or with a combination of several decision rules.

There are three common ways to learn rules from data (there are many more); a minimal sketch of the first approach follows the list below.

      1. OneR: Learning rules from a single feature, OneR is characterized by its simplicity and ease of understanding.
      2. Sequential covering: Learning rules iteratively and removing data points covered by new rules.
      3. Bayesian Rule Lists: Uses Bayesian statistics to integrate pre-discovered frequent patterns into a decision list. The use of pre-discovered patterns is also a common approach in algorithms that learn many rules.
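The sketch below illustrates the OneR idea in pure Python on a made-up categorical data set: for each feature, predict the majority class per feature value, and keep the single feature whose rules make the fewest training errors.

```python
from collections import Counter, defaultdict

# Toy categorical data set: (outlook, windy) -> play
X = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"),
     ("rainy", "yes"), ("overcast", "no"), ("overcast", "yes")]
y = ["yes", "no", "yes", "no", "yes", "yes"]

def one_r(X, y):
    """OneR: build one rule set per feature and keep the most accurate one."""
    best = None
    for f in range(len(X[0])):
        # Majority class for each value of feature f.
        by_value = defaultdict(list)
        for xi, yi in zip(X, y):
            by_value[xi[f]].append(yi)
        rules = {v: Counter(labels).most_common(1)[0][0]
                 for v, labels in by_value.items()}
        errors = sum(rules[xi[f]] != yi for xi, yi in zip(X, y))
        if best is None or errors < best[2]:
            best = (f, rules, errors)
    return best

feature, rules, errors = one_r(X, y)
print(f"use feature {feature}, rules={rules}, training errors={errors}")
```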

A decision tree learner is a powerful classifier that uses a tree structure to model the relationships between features and possible outcomes.

A key feature of decision tree algorithms is that the flowchart-like tree structure is not just internal to the learner: the output of the model can be read by humans, providing a strong hint as to why, or how, the model works well (or does not) for a particular task.

Such a mechanism can be particularly useful when the classification mechanism must be transparent for legal reasons, or when the results are shared with others in order to make business practices between organizations explicit.
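A small scikit-learn illustration of this readability: export_text prints the learned tree as human-readable IF-THEN style rules (the iris data set is just a convenient example).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Print the tree as readable rules, so humans can inspect why a prediction is made.
print(export_text(clf, feature_names=list(iris.feature_names)))
```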

When the data is distributed in a complex way in the feature space, a nonlinear classifier becomes effective. To construct a nonlinear classifier, kernel methods or neural networks can be used. In this section, we describe ensemble learning (also called collective learning), which constructs a nonlinear classifier by combining multiple simple classifiers.

As a form of ensemble learning, we describe bagging, which generates subsets from the training data set and trains a predictor on each subset. This method is particularly effective for unstable learning algorithms, i.e., algorithms in which small changes in the training data set have a large impact on the structure and parameters of the learned predictor. Neural networks and decision trees are examples of unstable learning algorithms.

The bootstrap method generates diverse subsets from a finite data set: M new data sets are created by repeatedly drawing random samples from the original data set with replacement, M times.
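A minimal NumPy sketch of bootstrap resampling (M and the data set are made up; each resample is drawn with replacement and has the same size as the original data):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)  # toy data set

M = 10  # number of bootstrap data sets
bootstrap_sets = []
for _ in range(M):
    idx = rng.integers(0, len(data), size=len(data))  # sampling with replacement
    bootstrap_sets.append(data[idx])

# Each resample could be used to train one predictor (bagging); here we just
# look at the spread of the sample means across resamples.
means = [s.mean() for s in bootstrap_sets]
print("bootstrap means:", np.round(means, 2))
```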

Objectively, language can be thought of as a sequence of symbols. Although, when looked at in detail, it is composed of letters, here we will assume that language, like English, is composed of words.

One thing that is immediately noticeable when looking at a language’s word sequence is that the frequency of words is highly skewed. The inverse relationship between word rank and word frequency is known as Zipf’s law, and is one of the basic facts discovered in the 1930s. In recent years, this law has become known as a power law common to many discrete phenomena in nature beyond language.
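Stated as a formula (the common form of Zipf's law, where f(r) is the frequency of the word with rank r and the exponent s is close to 1 for natural language):

\[
f(r) \propto \frac{1}{r^{s}}, \qquad s \approx 1.
\]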

To express such indeterminacy, we need a probability distribution for the location of p itself. The simplest such distribution is the Dirichlet distribution as in the following equation.
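The standard form of the Dirichlet density over a probability vector p = (p1, …, pK) with parameters α = (α1, …, αK) is

\[
\mathrm{Dir}(\mathbf{p} \mid \boldsymbol{\alpha})
= \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)}
\prod_{k=1}^{K} p_k^{\alpha_k - 1}.
\]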

In the simplest case, relational data is data that represents what kind of “relationship” exists for any given pair of N objects. If we consider a matrix as a form of representing the relationship between “something” and “something,” the data representing the relationship is the elements in the matrix itself.

Relational data learning is about extracting patterns in this matrix, and there are two main tasks that can be applied: prediction and knowledge extraction.

A prediction problem is the problem of estimating the value of unobserved data using a statistical model learned and designed from observed data. Two typical prediction problems are estimating the presence or absence of links in a relational network (the link prediction problem) and estimating the item purchase probability for each user from purchase data; both can be formulated as predicting missing values in a relational data matrix. Another important example of a prediction problem is the estimation of information dissemination or diffusion in a network.
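A rough sketch of treating link prediction as missing-value prediction via low-rank matrix factorization (plain NumPy gradient descent on observed entries only; the relation matrix, latent dimension, and hyperparameters are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy relational matrix (e.g. users x items); np.nan marks unobserved relations.
R = np.array([[1.0, 0.0, np.nan, 1.0],
              [np.nan, 1.0, 1.0, 0.0],
              [1.0, np.nan, 0.0, np.nan]])
mask = ~np.isnan(R)

k, lr, reg = 2, 0.05, 0.01                 # latent dimension, step size, L2 penalty
U = 0.1 * rng.standard_normal((R.shape[0], k))
V = 0.1 * rng.standard_normal((R.shape[1], k))

for _ in range(2000):
    # Gradient descent on squared error over observed entries only.
    E = np.where(mask, R - U @ V.T, 0.0)
    U += lr * (E @ V - reg * U)
    V += lr * (E.T @ U - reg * V)

R_hat = U @ V.T
print(np.round(R_hat, 2))  # predictions for the missing (nan) entries
```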

The knowledge extraction problem is to analyze the properties of the relational data itself by computing the graph features, or to extract some useful knowledge or information that leads to knowledge by appropriately modeling the given observation data.

In the following pages of this blog, we provide a theoretical overview, specific algorithms, and various applications of relational data learning.

An overview of graph neural network technology, a framework for performing deep learning on graph data. It is used for estimating physical properties of compounds, natural language processing using Attention, co-occurrence networks, and image information processing.
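A minimal NumPy sketch of the propagation rule used in a graph convolutional network layer, H' = σ(D̂^{-1/2} Â D̂^{-1/2} H W) (the adjacency matrix, node features, and weights are random toy data, not a real GNN library):

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0, 1, 0, 1],      # toy adjacency matrix (4 nodes)
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = rng.standard_normal((4, 3))   # node features
W = rng.standard_normal((3, 2))   # layer weights

# Normalized adjacency with self-loops: D^{-1/2} (A + I) D^{-1/2}
A_hat = A + np.eye(A.shape[0])
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt

# One layer of message passing: aggregate neighbor features, transform, apply ReLU.
H = np.maximum(A_norm @ X @ W, 0.0)
print(H)
```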

Object detection aims to find a rectangular region in an image that surrounds an object such as a person or a car. Many object detection methods propose multiple candidate object regions and use object class recognition methods to determine which object these regions are classified as. Since the number of candidate object regions proposed from images is often huge, methods with low computational cost are often used for object class recognition.

The sliding window method, the selective search method, and the branch-and-bound method are methods for proposing candidate object regions from images. There are also several methods for classifying these candidates, such as Exemplar-SVM, Random Forest, and R-CNN (Regions with CNN features).
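A small sketch of the sliding window method for generating candidate regions (the window size and stride are arbitrary example values; each window would then be passed to a classifier):

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield (x, y, w, h) candidate regions by sliding a fixed-size window."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)

# Example: a 640x480 image, 128x128 windows, stride 64.
candidates = list(sliding_windows(640, 480, 128, 128, 64))
print(len(candidates), "candidate regions, first:", candidates[0])
```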

We describe a graphical lasso (sparse structure learning of Gaussian graphical models) that introduces sparsity into the relationships, and its application to anomaly detection (extension of the Hotelling T2 score).
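A minimal scikit-learn sketch of fitting a sparse precision matrix with the graphical lasso (the data and the regularization strength alpha are made-up examples; near-zero entries in the precision matrix correspond to missing edges in the Gaussian graphical model):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Toy data: 4 variables, where x1 depends on x0 and x3 depends on x2.
x0 = rng.normal(size=500)
x1 = x0 + 0.3 * rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x2 + 0.3 * rng.normal(size=500)
X = np.column_stack([x0, x1, x2, x3])

model = GraphicalLasso(alpha=0.2).fit(X)

# Sparse precision (inverse covariance) matrix; (near-)zero entries indicate
# conditional independence, i.e. no edge in the Gaussian graphical model.
print(np.round(model.precision_, 2))
```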

When the data is given as pairs of a scalar output y and an M-dimensional vector x, D = {(y(n), x(n)) | n = 1, …, N}, we discuss how to obtain a sparse solution in linear regression (sparsifying the regression coefficients, or sparsifying in the sense of eliminating extra samples rather than variables).

As use cases, we discuss the case where the observed data is contaminated with noise and not all samples can be trusted, and, by extending the method to graph data, the case where a machine is constantly monitored by a fixed number of sensors.
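A minimal sketch of the coefficient-sparsifying case described above, using scikit-learn's Lasso on synthetic data where only a few of the M variables are actually relevant (alpha is a made-up value):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: y depends only on the first 3 of 20 variables.
N, M = 200, 20
X = rng.standard_normal((N, M))
true_w = np.zeros(M)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + 0.1 * rng.standard_normal(N)

# L1 regularization drives most coefficients exactly to zero (a sparse solution).
model = Lasso(alpha=0.1).fit(X, y)
print("non-zero coefficients:", np.flatnonzero(model.coef_))
print(np.round(model.coef_[:5], 2))
```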

Epidemiology is a field of study that investigates the frequency and distribution of health-related events in specific populations, elucidates the factors involved, and applies the research results to the prevention and control of health problems. In particular, epidemiological research that uses spatial data is called “spatial epidemiology,” and since it uses maps and location information, it is attracting attention as a research field that is closely related to geographic information systems (GIS).

When dealing with data collected in a certain regional unit, such as prefectures or health-center areas, looking only at the number of observed cases naturally favors areas with large populations. In addition, the age structure of the local population may affect some diseases, so a simple comparison of raw counts, or even of counts divided by the population of the area, may be insufficient. For this reason, disease maps using standardized risk indicators are often used to remove the effects of age and other factors as well as population differences. The standardized morbidity ratio (SMR), an indirect standardization method, is one of the most commonly used indicators.

In addition, Bayesian estimates are sometimes used for inter-regional comparisons in order to take into account the accuracy with which the risk indicator is estimated in each region. Here, a Poisson-Gamma model for the SMR is used to compute empirical Bayes estimates for each prefecture and health-center area, using Japan as the reference population and adjusting for age structure.
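A hedged sketch of the basic calculation: the SMR is the observed count divided by the expected count under reference (age-specific) rates, and a Poisson-Gamma model shrinks each area's SMR toward the overall level. The counts, rates, and Gamma parameters below are made-up illustrations; in practice the Gamma parameters would be estimated from the data (e.g., by moment matching) rather than fixed by hand.

```python
import numpy as np

# Made-up example: observed cases and age-stratified populations for 3 areas.
observed = np.array([12, 30, 7])
population = np.array([[10_000, 5_000],     # population[area, age_group]
                       [20_000, 15_000],
                       [3_000, 4_000]])
reference_rates = np.array([0.0005, 0.0015])  # reference rate per age group

# Expected counts under the reference rates (indirect standardization).
expected = population @ reference_rates
smr = observed / expected

# Empirical Bayes smoothing with a Poisson-Gamma model:
#   O_i ~ Poisson(E_i * theta_i),  theta_i ~ Gamma(a, b)
#   posterior mean of theta_i = (a + O_i) / (b + E_i)
a, b = 2.0, 2.0   # illustrative prior parameters (would be estimated in practice)
smr_eb = (a + observed) / (b + expected)

for i in range(len(observed)):
    print(f"area {i}: SMR={smr[i]:.2f}  EB-smoothed SMR={smr_eb[i]:.2f}")
```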

“My friend’s ex-boyfriend was my ex-boyfriend!” A small world where you can reach anyone in the world just by tracing six people. “Complex networks” is one of the biggest keywords of the 21st century. By applying the concept of complex networks, we can replace the seemingly complicated phenomena around us with simple relationships, and discover surprising rules in the infection routes of infectious diseases and computer viruses, the way neurons and proteins transmit information, and human relationships in companies and society.

In the monitoring of systems represented by multivariate data, it is important to know the contribution of each variable to an individual anomalous event. There are not many ways to do this, apart from a few extreme methods such as the naive Bayes method. In this section, we discuss a means of calculating the degree of anomaly of individual variables based on the idea of linking the breakdown of dependencies among variables to anomalies. In learning the dependencies among variables, we devise a calculation method that efficiently extracts the essential dependencies. This makes it possible to automatically extract the modular structure inherent in the system, even when the apparent dimension is high.

In this article, we discuss the problem of finding abnormal samples among data that may contain them, based on data that is known to be normal. This is basically an outlier detection problem, but instead of treating each sample for which the degree of abnormality is calculated separately, we deliberately consider the probability distribution of the entire data set and formulate the task as estimating the probability density ratio between distributions. The main advantages of this formulation are that it provides a systematic method of determining the parameters included in the anomaly detection model, and that it is expected to improve detection accuracy by suppressing, to some extent, the adverse effects of noise in individual samples.
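As one hedged illustration of the density-ratio idea (not the specific estimator discussed in the article): a probabilistic classifier trained to separate the normal reference data from the test data yields an estimate of the ratio of the two densities, which can be used as an anomaly score. The data and threshold below are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Reference data known to be normal, and test data that may contain anomalies.
X_normal = rng.normal(0.0, 1.0, size=(500, 2))
X_test = np.vstack([rng.normal(0.0, 1.0, size=(95, 2)),
                    rng.normal(5.0, 1.0, size=(5, 2))])   # 5 injected anomalies

# Train a classifier to discriminate "normal" (label 0) from "test" (label 1).
X = np.vstack([X_normal, X_test])
y = np.concatenate([np.zeros(len(X_normal)), np.ones(len(X_test))])
clf = LogisticRegression().fit(X, y)

# Density ratio estimate p_test(x) / p_normal(x) via the class-probability trick.
p = np.clip(clf.predict_proba(X_test)[:, 1], 1e-6, 1 - 1e-6)
ratio = (p / (1.0 - p)) * (len(X_normal) / len(X_test))

# Samples whose test density is much larger than the normal density are
# flagged as candidate anomalies.
print("top anomaly scores:", np.round(np.sort(ratio)[-5:], 2))
```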

In this article, we will discuss a method for change detection by directly estimating the density ratio without making arbitrary assumptions about the distribution. Change detection is a problem of macroscopically examining the presence or absence of changes in the entire distribution, but in addition, the problem of microscopically examining changes in the individual dependencies among variables can also be handled in the same framework. The problem setting of structural change detection has been attracting attention in recent years as a powerful tool for monitoring complex systems.

Emphasizing the role of probability distributions in the change detection problem, it is sometimes referred to as the distributional change detection problem. The problem is similar to the two-sample test in statistics, except that it does not fall within the framework of so-called test theory.

The degree of change is defined exactly as the Kullback-Leibler divergence, which measures the degree of difference between two distributions.
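For reference, the Kullback-Leibler divergence between two distributions p and q is

\[
\mathrm{KL}(p \,\|\, q) = \int p(x) \ln \frac{p(x)}{q(x)}\, dx \;\ge\; 0,
\]

which equals zero only when the two distributions coincide; note that it is not symmetric in p and q.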

Part-whole relations are important in many domains, but typically receive less attention than subsumption relations. In this paper we describe a method for finding part-whole relations. The method consists of two steps: (i) finding phrase patterns for both explicit and implicit part-whole relations, and (ii) applying these patterns to find part-whole relation instances. We show results of applying this method to the domain of finding sources of carcinogens.

Prediction for data with structure can be done with an extension of the SVM called the structured support vector machine. In this method, the constraints that grow in number with the introduction of complex structures can be handled efficiently by an optimization technique called the cutting plane method. Structured data can be, for example, data represented as tree structures or sequences: in syntactic tree prediction in natural language processing, grammatical relations between words are expressed as a tree structure, and in the problem of finding similar protein sequences, proteins are represented as sequences of amino acids.

In this article, we discuss methods for learning the graph structure itself from data. Specific approaches include learning graph structures from data with Bayesian networks and Markov random fields, such as Max-Min Hill Climbing (MMHC), the Chow-Liu algorithm, score-function maximization, the PC (Peter Spirtes and Clark Glymour) algorithm, the GS (Grow-Shrink) algorithm, the SGS (Spirtes, Glymour, and Scheines) algorithm, sparse regularization, and approaches based on independence conditions.
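As a hedged illustration of one of these approaches (the Chow-Liu algorithm): weight each pair of discrete variables by their mutual information and take a maximum-weight spanning tree as the learned tree-structured graph. The sketch below uses scikit-learn and NetworkX on made-up categorical data.

```python
import numpy as np
import networkx as nx
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)

# Toy discrete data: B depends on A, C depends on B, D is independent.
n = 2000
A = rng.integers(0, 2, n)
B = (A + (rng.random(n) < 0.2)) % 2
C = (B + (rng.random(n) < 0.2)) % 2
D = rng.integers(0, 2, n)
data = {"A": A, "B": B, "C": C, "D": D}

# Chow-Liu: weight each pair of variables by mutual information, then take a
# maximum-weight spanning tree as the learned tree structure.
G = nx.Graph()
names = list(data)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        mi = mutual_info_score(data[names[i]], data[names[j]])
        G.add_edge(names[i], names[j], weight=mi)

tree = nx.maximum_spanning_tree(G, weight="weight")
print(sorted(tree.edges()))   # expected to recover roughly A-B and B-C
```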
