Overview and Implementation of the Page Rank Algorithm

Summary

An algorithm is a process that takes input, processes it according to some procedure, and finally returns an output. Algorithms are used in many fields, including computer science, mathematics, physics, economics, and psychology. An algorithm can also be described as a procedure that clearly defines the steps or method for solving a particular problem.

Algorithms are closely related to mathematics. Mathematics allows us to formalize abstract problems and transform them into concrete ones, which can then be turned into algorithms. A specific example is the design of search algorithms using set theory and logic from discrete mathematics.

Such a mathematical approach is used not only in the development of algorithms but also in their evaluation. Algorithms are evaluated in terms of correctness, efficiency, and generality: a correct algorithm should return the correct output for all inputs; an efficient algorithm should be fast, requiring little processing time and few resources; a general-purpose algorithm should be able to process different inputs with the same procedure. All of these evaluation criteria can be expressed mathematically.

Algorithms are particularly important in computer programming, where many programs operate by executing algorithms designed to solve a particular problem. In programming, designing an algorithm is one of the most fundamental and important steps in the design of a computer program.

Here we discuss the principles of various algorithms based on “The 9 Most Powerful Algorithms in the World” by John MacCormick.

In the previous article, we discussed search engine matching algorithms. This time we will discuss the page rank algorithm.

Page Rank Algorithm Overview

The Page Rank algorithm was invented by Larry Page, co-founder of Google, and is used to calculate the importance of a web page. It is this innovation in search technology based on the Page Rank algorithm that propelled Google to its current position. Page rank is one of the core technologies used by search engines to rank search results.

Hyperlink technology is a prerequisite for page rank. A hyperlink is a function that lets the user jump to another web page by clicking on it, like the underlined text in this blog. The idea behind page rank was already foreshadowed in “As We May Think,” written by the American engineer Vannevar Bush in 1945, which describes a machine called the memex that has “a mechanism for storing documents and automatically indexing them” and “associative indexing,” whereby any item can immediately and automatically select other items at will.

Building on this hyperlinking technique, the basic concept of page rank is that pages with more incoming links receive a higher rank.

The challenges with this approach are that it cannot handle negative links, where a page is linked with the meaning “that page is not good” rather than as a positive recommendation, and that pages mentioned (linked) by trusted authorities should ideally rank higher. However, it is difficult for a computer to determine the trustworthiness of such a link source.

A computational way to estimate this trustworthiness is to assume that the trustworthiness (authority score) of a page is the sum of the authority scores of the pages that link to it, and to propagate these scores in a cascade to arrive at the final score for each page. For example, a page linked from two pages with authority scores of 2 and 3 would receive an authority score of 5.

However, this method is difficult to put into practice when the hyperlink structure contains a closed loop (cycle), because the authority scores would grow without bound. The random surfer trick was devised as a countermeasure to this problem.

The random surfer trick assumes a person who surfs the Internet at random: a starting page is selected at random, a hyperlink on that page is followed at random, and the process continues in a cascade. As a device for starting new chains of links, each time the surfer arrives at a page there is a fixed probability (e.g., 15%) of jumping to a new page without following a hyperlink.

By repeating these steps many times, frequently visited pages accumulate higher scores, and scores almost equal to the authority scores are obtained even when circular loops exist. This method is a simulation technique, and a similar approach is used in machine learning with probabilistic models.
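As a rough sketch of this simulation, the following Python snippet counts how often a random surfer visits each page; the graph, the 15% jump probability, and the number of steps are hypothetical choices for illustration, and the normalized visit counts approximate the page scores.

import random

def random_surfer(graph, steps=100000, jump_prob=0.15):
    nodes = list(graph)
    visits = {node: 0 for node in nodes}
    current = random.choice(nodes)  # random starting point
    for _ in range(steps):
        visits[current] += 1
        # With some probability (here 15%), jump to a random page instead of following a link
        if random.random() < jump_prob or not graph[current]:
            current = random.choice(nodes)
        else:
            current = random.choice(graph[current])
    # Normalize visit counts so that they sum to 1
    return {node: count / steps for node, count in visits.items()}

print(random_surfer({"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}))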

The page rank algorithm calculates the ranking of each web page based on this approach. The basic steps of the page rank algorithm are described below.

  1. Create a graph with web pages as nodes.
  2. Set the initial ranking for each page. Normally, the same initial ranking is set for all pages.
  3. Follow the links from each page and calculate the ranking of each linked page. The value passed along each link is the ranking of the linking page divided by the number of pages it links to.
  4. Repeat step 3 to update the ranking of each page.
  5. Repeat step 4 until the change in ranking is sufficiently small.
  6. Obtain the final ranking for each page.
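Written as a formula, this update step is as follows (a standard formulation consistent with the code examples below, where d is the damping factor, N the number of pages, and L(q) the number of outgoing links of page q):

PR(p) = (1 - d) / N + d * Σ PR(q) / L(q)

where the sum runs over all pages q that link to p, and the random surfer's jump probability corresponds to 1 - d (e.g., d = 0.85 gives a 15% jump probability).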

The page rank algorithm is used to rank search engine results by quantifying the importance of web pages. Nowadays, however, search engines do not use the page rank algorithm alone but rely on complex algorithms that take various other factors into account.

Implementation in Python

Several libraries exist for implementing the pagerank algorithm using Python, but here is a simple example implementation. The following code represents a graph as a dictionary type and executes the pagerank algorithm.

def pagerank(graph, damping_factor=0.85, max_iterations=100, epsilon=1.0e-6):
    # Get the number of nodes in the graph
    n = len(graph)

    # Set initial ranking for each node in the graph
    initial_rank = 1.0 / n
    ranks = {node: initial_rank for node in graph}

    # Repeat ranking calculations until convergence
    for i in range(max_iterations):
        new_ranks = {}
        # Calculate a new ranking for all nodes
        for node in graph:
            # Calculate the total ranking of all link sources
            rank_sum = 0.0
            for referring_page in graph:
                if node in graph[referring_page]:
                    rank_sum += ranks[referring_page] / len(graph[referring_page])
            # Calculate new rankings
            new_rank = (1 - damping_factor) / n + damping_factor * rank_sum
            new_ranks[node] = new_rank

        # Convergence judgment is made, and if convergence is achieved, the calculation is terminated.
        converged = True
        for node in graph:
            if abs(ranks[node] - new_ranks[node]) > epsilon:
                converged = False
                break
        if converged:
            break

        # Reflects new rankings
        ranks = new_ranks

    return ranks

In this code, the graph is represented as a dictionary and the initial ranking of each node is set to the same value. The ranking is calculated iteratively until the specified number of iterations (max_iterations) is reached or the convergence condition (epsilon) is satisfied, and the final ranking is returned.

In this example implementation, 0.85 is used as the damping factor, but it can be changed to other values if necessary. The method of representing the graph in dictionary form is just one example, and other methods can be used.
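As a minimal usage sketch, the graph below is a hypothetical example in which each page maps to the list of pages it links to:

example_graph = {
    "A": ["B", "C"],  # page A links to pages B and C
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

ranks = pagerank(example_graph)
print(ranks)  # dictionary mapping each page to its pagerank score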

Algorithm using R

Next, we will implement a page rank algorithm using R. Here we describe an example implementation using the igraph package.

First, install and load the igraph package.

install.packages("igraph")
library(igraph)

Next, a graph is created. Here, we create a simple directed graph consisting of four nodes.

# Graph Definition
edges <- data.frame(from=c(1,1,2,3), to=c(2,3,3,4))
g <- graph_from_data_frame(edges, directed=TRUE)

# plot
plot(g)

Next, the page rank is calculated using the page.rank function included in the igraph package.

# Calculate the page rank
page_rank <- page.rank(g, damping=0.85)$vector
page_rank

Thus, the page.rank function takes a graph and a damping factor as arguments and calculates the page rank. The result is returned as a vector containing the page rank values for each node. Note that the igraph package includes various functions for graph analysis in addition to page rank.

Implementation in Clojure

This section describes an example implementation of a page rank algorithm using Clojure.

First, create a lein project and require the clojure.java.io and clojure.string libraries.

(ns page-rank.core
  (:require [clojure.java.io :as io]
            [clojure.string :as str]))

Next, the adjacency list representing the graph is loaded. Here, the graph is saved in a file called graph.txt (assumed to be on the classpath, e.g., under resources/, since io/resource is used to locate it) and read in as an adjacency list. The file format of graph.txt is as follows: the first line contains the number of nodes n and the number of edges m, and each subsequent line contains the start point and end point of one edge, separated by a space.

4 4
1 2
1 3
2 3
3 4

(defn load-graph [filename]
  (let [lines (-> filename
                  io/resource
                  io/reader
                  line-seq)
        ;; First line: number of nodes n and number of edges m
        [n _m] (->> (str/split (first lines) #" ")
                    (map read-string))
        ;; Remaining lines: edges as 1-indexed "from to" pairs, converted to 0-indexed
        edges (->> (rest lines)
                   (map #(->> (str/split % #" ")
                              (map read-string)
                              (map dec)
                              vec)))
        ;; Adjacency list: one atom holding the set of outgoing links of each node
        graph (vec (repeatedly n #(atom #{})))]
    (doseq [[from to] edges]
      (swap! (nth graph from) conj to))
    graph))

Next, define the function pagerank, which calculates the page rank. To calculate pagerank, it is necessary to specify the number of iterations, the damping factor, and the convergence condition. Here, the default values are 1000 iterations, a damping factor of 0.85, and a convergence condition of 0.001.

(defn pagerank
  [graph & {:keys [max-iter damping-factor epsilon]
            :or {max-iter 1000
                 damping-factor 0.85
                 epsilon 0.001}}]
  (let [n (count graph)]
    (loop [iter 0
           ranks (vec (repeat n (/ 1.0 n)))]
      (let [new-ranks
            (vec (for [node (range n)]
                   ;; Sum the contributions of every page that links to this node
                   (let [rank-sum (reduce + 0.0
                                          (for [src (range n)
                                                :when (contains? @(nth graph src) node)]
                                            (/ (nth ranks src)
                                               (count @(nth graph src)))))]
                     (+ (/ (- 1.0 damping-factor) n)
                        (* damping-factor rank-sum)))))
            ;; Converged when every node's ranking changed by at most epsilon
            converged? (every? #(<= (Math/abs (- (nth ranks %) (nth new-ranks %)))
                                    epsilon)
                               (range n))]
        (if (or (= iter max-iter) converged?)
          new-ranks
          (recur (inc iter) new-ranks))))))

Finally, define a main function to calculate the page rank and output the result.

(defn -main [& args]
  (let [graph (load-graph "graph.txt")   ; graph.txt is assumed to be under resources/
        ranks (pagerank graph)]
    (println ranks)))
