Overview of DeepWalk and examples of algorithms and implementations

DeepWalk

DeepWalk is a machine learning algorithm for graph data analysis reported by Perozzi et al. in “DeepWalk: Online Learning of Social Representations”. It is particularly suited to the task of node embedding: it aims to embed each node in a low-dimensional vector space by capturing the information the node shares with its neighbors in the graph and learning an embedding from that information.

The characteristics of DeepWalk are as follows.

1. Random Walks: DeepWalk performs random walks by randomly selecting the next node to visit in the graph, capturing the relationships among nodes in the process. Specifically, by repeatedly performing random walks on the graph, a large number of node sequences are generated; these sequences are treated like word sequences in natural language and used for embedding. For details on random walks, see “Overview of Random Walks, Algorithms, and Examples of Implementations”.

2. Skip-gram: DeepWalk uses the node sequences collected during the random walks to train a skip-gram model. Skip-gram is used in natural language processing methods such as Word2Vec to create embeddings for each word in a word sequence; DeepWalk applies it to graphs to predict the context nodes around a given central node. For more details on SkipGram, please refer to “Overview of SkipGram, Algorithm and Example Implementation”.

3. Node Embedding: By training the skip-gram model, each node comes to be represented by a low-dimensional vector. In DeepWalk, the input is a graph and the output is a latent vector representation for each node.

DeepWalk is widely used as an effective method for learning semantically rich node embeddings from graph data and has shown high performance on multi-label classification problems in various social networks, even with few training examples. It can also be implemented in parallel to learn representations of large graphs, and the algorithm is scalable.

Experimental results in “DeepWalk: Online Learning of Social Representations” also show that it is more accurate than baseline methods such as Spectral Clustering, EdgeCluster, and Modularity, especially when the number of labeled vertices is small.

In addition, more advanced graph embedding algorithms have been proposed as successors to DeepWalk, such as Node2Vec, described in “Overview of Node2Vec, Algorithm, and Examples of Implementation”, and GraphSAGE, described in “Overview, Algorithms, and Example Implementations of GraphSAGE”. These methods are used in various applications to analyze graph data and understand node characteristics and relationships.

Specific procedures for DeepWalk

The DeepWalk problem definition is as follows.

Let \(\mathbf{V}\) be the node set of a given graph and \(\mathbf{E}\) the edge set, let each node have S kinds of attributes (\(\mathbf{X}\in\mathbb{R}^{|V|\times S}\)), let \(y\) be the set of node labels, and let \(\mathbf{Y}\in\mathbb{R}^{|V|\times |y|}\) denote the labels of all nodes. The partially labeled graph is then represented by \(G_L=(\mathbf{V},\mathbf{E},\mathbf{X},\mathbf{Y})\).

The goal of DeepWalk is to learn a vector representation \(\mathbf{X}_E\in\mathbb{R}^{|V|\times d}\) with a small latent dimension d. We denote the random walk starting from node \(v_i\) by \(W_{v_i}\) and its kth node by \(W_{v_i}^{k}\). The advantages of using random walks include the following.

  • Local search can be parallelized
  • By learning based on short random walks, there is no need to recompute everything when part of the graph changes; the model can be updated using random walks from the changed region only.

DeepWalk extends language modeling by treating a sequence of vertices obtained by a random walk as a sentence.

In a language model, the vocabulary is \(\mathbf{V}\), the words in it are \(\omega_i\in\mathbf{V}\), and a word sequence is \(\mathbf{W}_1^n=(\omega_0,\omega_1,\dots,\omega_n)\). The model estimates the plausibility of a particular word sequence appearing in the corpus by maximizing \(Pr(\omega_n|\omega_0,\omega_1,\dots,\omega_{n-1})\) over the whole training corpus.

DeepWalk extends this approach: the goal is to learn not only the probability distribution of node co-occurrences but also a latent representation. A mapping function \(\Phi:v\in\mathbf{V}\rightarrow\mathbb{R}^{|V|\times d}\) to the latent representation of each node in the graph is introduced, and the objective is to estimate the likelihood of the following equation.

\[Pr(v_i|(\Phi(v_1),\Phi(v_2),\dots,\Phi(v_{i-1})))\]

Since the computational complexity of this conditional probability explodes as the random walk becomes longer, the language-modeling problem is relaxed so that the model predicts a context from one word (rather than one word from a context), where the context consists of the words on both the left and right sides of the given word. Furthermore, the word-order constraint is removed, and the model is required to assign high probability to any word appearing in the context of a given word \(v_i\). This yields the following optimization problem:

\[\min_{\Phi}\;-\log Pr(\{v_{i-\omega},\dots,v_{i+\omega}\}\setminus v_i|\Phi(v_i))\]

From the distributed representation \(\Phi(v_i)\), the problem is to maximize the probability of the words \(v_{i-\omega},\dots,v_{i+\omega}\) around \(v_i\) (the whole expression is minimized because of the leading negative sign). The DeepWalk algorithm performs random walks from each node and repeatedly updates the distributed representation \(\Phi\) of the nodes using the language model SkipGram.
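
The overall flow of the algorithm can be sketched as follows, following the per-pass shuffle and per-node walks described in the paper. Here random_walk and skipgram_update are hypothetical placeholders; the Gensim-based implementation later in this article instead collects all walks first and trains Word2Vec once.

import random

# A minimal sketch of the DeepWalk outer loop (assumes hypothetical helpers
# random_walk(graph, start, length) and skipgram_update(walk, window))
def deepwalk(graph, num_passes, walk_length, window):
    nodes = list(graph.nodes())
    for _ in range(num_passes):      # several passes over all nodes
        random.shuffle(nodes)        # shuffle to decorrelate the updates
        for v in nodes:
            walk = random_walk(graph, v, walk_length)
            skipgram_update(walk, window)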

SkipGram is a language model that maximizes the co-occurrence probability of words within a window, and it approximates the above conditional probability by the following independence assumption:

\[Pr(\{v_{i-\omega},\dots,v_{i+\omega}\}\setminus v_i|\Phi(v_i))=\displaystyle\prod_{j=i-\omega,j\neq i}^{i+\omega}Pr(v_j|\Phi(v_i))\]

Since computing \(Pr(v_j|\Phi(v_i))\) directly is expensive, Hierarchical Softmax is used to decompose the conditional probability. Hierarchical Softmax builds a binary tree for multi-class classification, assigns each node of the graph to a leaf of the tree, and turns the prediction of \(Pr(v_j|\Phi(v_i))\) into the problem of maximizing the probability of a particular path through the hierarchical structure of the binary tree.

If the path from the root to node \(u_k\) is the sequence of tree nodes \((b_0,b_1,\dots,b_{\lceil \log|V|\rceil})\) (\(b_0=\text{root}\), \(b_{\lceil \log|V|\rceil}=u_k\)), then the following holds:

\[Pr(u_k|\Phi(v_j))=\displaystyle\prod_{l=1}^{\lceil \log|V|\rceil}Pr(b_l|\Phi(v_j))\]

\(Pr(b_l|\Phi(v_j))\) can be modeled by a sigmoid function as follows:

\[Pr(b_l|\Phi(v_j))=\frac{1}{1+e^{-\Phi(v_j)\cdot \Psi(b_l)}}\]

where \(\Psi(b_l)\in\mathbb{R}^d\) is the representation assigned to the parent of tree node \(b_l\). This reduces the computational complexity of \(Pr(u_k|\Phi(v_j))\) from \(O(|V|)\) to \(O(\log|V|)\).
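
As a concrete illustration, the path probability above can be computed as a product of sigmoid-based binary decisions. The following is a minimal sketch (the function name path_probability and its arguments are hypothetical): phi_v is \(\Phi(v_j)\), path_vectors holds the \(\Psi(b_l)\) vectors of the internal nodes on the path, and path_directions encodes whether the path branches left (+1) or right (-1) at each internal node.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def path_probability(phi_v, path_vectors, path_directions):
    # phi_v          : embedding Phi(v_j) of the input node, shape (d,)
    # path_vectors   : Psi(b_l) vectors of the internal nodes on the path to u_k
    # path_directions: +1 / -1 for branching left / right at each internal node
    prob = 1.0
    for psi, direction in zip(path_vectors, path_directions):
        # each internal node acts as a binary classifier on Phi(v_j)
        prob *= sigmoid(direction * np.dot(phi_v, psi))
    return prob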

The code for DeepWalk is available on Perozzi’s GitHub page.

DeepWalk Implementation Examples

DeepWalk can be implemented using Python and common libraries (e.g., NetworkX, NumPy, Gensim). A basic implementation of DeepWalk is shown below.

  1. Importing libraries:
import networkx as nx # Graph processing library
import numpy as np
from gensim.models import Word2Vec # Used to train Word2Vec model
  2. Reading or generating a graph:

DeepWalk takes graph data as input. The following is an example of using the NetworkX library to generate a simple undirected graph.

# Undirected graph generation
G = nx.Graph()
G.add_edge(0, 1)
G.add_edge(0, 2)
G.add_edge(1, 2)
G.add_edge(1, 3)
# Add other nodes and edges
  3. Generating random walks:

Generate random walks to collect node sequences. The following is an example of a simple function that generates random walks (node IDs are stored as strings so that they can be fed directly to Word2Vec).

def generate_random_walks(graph, num_walks, walk_length):
    walks = []
    for _ in range(num_walks):
        for start_node in graph.nodes():
            walk = [str(start_node)]  # store nodes as strings for Word2Vec
            current = start_node
            for _ in range(walk_length - 1):
                neighbors = list(graph.neighbors(current))
                if len(neighbors) == 0:
                    break  # dead end: end the walk early
                current = np.random.choice(neighbors)
                walk.append(str(current))
            walks.append(walk)
    return walks
  4. Training the skip-gram model:

Train a Word2Vec model on the generated random-walk sequences using, for example, the Gensim library (the sg=1 option below selects the skip-gram architecture).

# Generating a random walk
walks = generate_random_walks(G, num_walks=10, walk_length=5)

# Training for Word2Vec model
model = Word2Vec(walks, vector_size=64, window=5, sg=1, workers=4)

# Save the learned model
model.save("deepwalk_model")
  5. Obtaining node embeddings:

Use the learned Word2Vec model to obtain the embedding of each node.

# Get node embedding
node_embeddings = {str(node): model.wv[str(node)] for node in G.nodes()}

This completes the acquisition of node embeddings trained with DeepWalk. These embeddings can then be used for various graph-based tasks.
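
For example, the embeddings can be used for similarity queries or fed into a downstream classifier. The following sketch assumes the small example graph above; the node labels passed to the classifier are hypothetical.

from sklearn.linear_model import LogisticRegression
import numpy as np

# Nodes most similar to node 0 in the embedding space
print(model.wv.most_similar("0", topn=3))

# Node classification with the embeddings as features (labels are hypothetical)
X = np.array([node_embeddings[str(n)] for n in G.nodes()])
y = [0, 0, 1, 1]  # one hypothetical label per node in the example graph
clf = LogisticRegression().fit(X, y)
print(clf.predict(X))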

Challenges for DeepWalk

Although DeepWalk is an excellent graph embedding method, there are some challenges and limitations. The following describes the main challenges of DeepWalk.

1. Sensitivity to graph locality:

DeepWalk uses random walks to explore the graph and relies heavily on the starting point of the random walk. Therefore, it is sensitive to local structure within the graph and may not adequately capture the global information of the entire graph. This is a problem when the graph is not connected or when a random walk from a particular node does not cover the entire graph well.

2. Sampling bias:

DeepWalk generates sequences of nodes based on random walks, which may cause some nodes to be visited frequently and others to be ignored. This sampling bias can lead to unbalanced node embedding.

3. Hyperparameter tuning:

DeepWalk has several hyperparameters (e.g., window size, number of embedding dimensions, number and length of random walks), and their choice has a significant impact on the quality of the embedding, so they need to be tuned appropriately.

4. Scalability constraints:

DeepWalk requires a large number of random walks to train the Word2Vec model. This can be computationally expensive for large graphs and requires a more efficient algorithm when scalability is constrained.

5. Reliance on structural information only:

DeepWalk relies only on the structural information of the graph to learn embeddings. However, in real-world applications other information, such as node attributes and temporal information, may also be important; DeepWalk does not consider such information, so other methods are needed to take advantage of it.

How DeepWalk is Addressing the Challenges

The following approaches and improvements have been proposed to address DeepWalk’s challenges.

1. Methods to reduce bias:

There are ways to reduce the bias between frequently visited and rarely visited nodes in a random walk. For example, instead of choosing the starting point of a random walk uniformly at random, one could choose nodes based on their importance (e.g., based on PageRank scores, as described in “Overview and Implementation of the PageRank Algorithm”), as sketched below.
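
As a sketch of this idea, starting nodes can be sampled in proportion to their PageRank scores using NetworkX (the helper name pagerank_walk_starts is hypothetical):

import numpy as np
import networkx as nx

def pagerank_walk_starts(graph, num_walks):
    # Sample walk starting nodes in proportion to their PageRank scores
    pr = nx.pagerank(graph)                    # dict: node -> PageRank score
    nodes = list(pr.keys())
    probs = np.array([pr[n] for n in nodes])
    probs = probs / probs.sum()                # normalize to a distribution
    idx = np.random.choice(len(nodes), size=num_walks, p=probs)
    return [nodes[i] for i in idx]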

2. Weighting of adjacency information:

During a random walk, edges (connections between nodes) can be weighted so that important edges are traversed preferentially. This makes it possible to realize a random walk that takes the importance of edges into account, as in the sketch below.
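
A minimal sketch of such a weighted step, assuming each edge may carry a weight attribute (unweighted edges default to 1.0; the helper name weighted_step is hypothetical):

import numpy as np

def weighted_step(graph, current):
    # Choose the next node with probability proportional to edge weight
    neighbors = list(graph.neighbors(current))
    weights = np.array([graph[current][n].get("weight", 1.0) for n in neighbors])
    probs = weights / weights.sum()
    return neighbors[np.random.choice(len(neighbors), p=probs)]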

3. Higher-dimensional embeddings:

Increasing the dimensionality of the embedding can improve its expressiveness. However, high-dimensional embeddings should be handled with care, as the computational cost increases.
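
In the Gensim-based implementation above, this simply means increasing the vector_size parameter, for example:

# Higher-dimensional embeddings: more expressive, but more expensive to train
model = Word2Vec(walks, vector_size=256, window=5, sg=1, workers=4)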

4. Incorporation of temporal information:

If time information is available in the graph data, time-based random walks can be considered. By incorporating time information, it is possible to capture node changes and relationships over time.
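
One simple form of this is a time-respecting walk that only traverses edges whose timestamps do not go backwards. The following is a hypothetical sketch, assuming each edge has a time attribute:

import numpy as np

def temporal_step(graph, current, current_time):
    # Move only along edges whose timestamp is >= the current time
    candidates = [(n, graph[current][n]["time"])
                  for n in graph.neighbors(current)
                  if graph[current][n].get("time", -1) >= current_time]
    if not candidates:
        return None, None  # no time-respecting edge available
    next_node, next_time = candidates[np.random.choice(len(candidates))]
    return next_node, next_time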

5. Use of more advanced algorithms:

More advanced graph embedding algorithms have been proposed as successors to DeepWalk. Examples include Node2Vec (described in “About Node2Vec”), GraphSAGE (described in “About GraphSAGE”), and HARP (described in “About HARP”), which are designed to address DeepWalk’s challenges.

6. Integration of multiple views:

If a graph has multiple views (e.g., structural information, attribute information, temporal information, etc.), a method can be used to integrate these views. This allows for learning richer node embeddings.

7. Graph preprocessing:

Graph preprocessing and cleaning can be used to remove unnecessary nodes and edges and reduce noise, which may improve the quality of the embedding. See also “Noise Removal and Data Cleansing and Missing Value Interpolation in Machine Learning”.
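
For example, with NetworkX, isolated nodes can be removed before generating walks (a minimal sketch):

import networkx as nx

# Remove isolated nodes (no edges) before generating walks
G.remove_nodes_from(list(nx.isolates(G)))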

Reference Information and Reference Books

For more information on graph data, see “Graph Data Processing Algorithms and Applications to Machine Learning/Artificial Intelligence Tasks”. Also see “Knowledge Information Processing Techniques” for details specific to knowledge graphs. For more information on deep learning in general, see “About Deep Learning”.

Reference books include:

Hands-On Graph Neural Networks Using Python: Practical techniques and architectures for building powerful graph and deep learning apps with PyTorch

Graph Neural Networks: Foundations, Frontiers, and Applications

Introduction to Graph Neural Networks

Graph Neural Networks in Action
