An overview of Twitter’s recommendation algorithm


Overviews

Twitter Inc. has released its recommendation system, which is getting a lot of attention.
Below are the officially released technical blog post and the source code on GitHub.

Technology Blog：https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm
GitHub：https://github.com/twitter/the-algorithm

Twitter has two timelines, “Following” and “For You.” “Following” is the traditional one, in which the tweets of users you follow flow in chronological order. The algorithm released this time is the one behind the “For You” (recommended for you) timeline.

With as many as 500 million tweets posted on Twitter every day, it is a painstaking task to select and display your favorite tweets from among them. The following is an overview of the algorithms that make this possible.

The flow in the above figure is Data (data acquisition) → Feature (feature creation) → Home Mixer (recommendation step). In Twitter’s technical blog mentioned above, the last step, Home Mixer, is described as consisting of three steps: Candidate Sourcing, Ranking, and Heuristics & Filtering. These are described below.

About Data (Data Acquisition)

<Overview>

The data to be used is mainly composed of the following three components.

Social graph
Tweet engagement
User data

The details are described below.

<Social graph>

First, let’s talk about the Social graph, a graph structure that represents the interrelationships among users. This graph shows the relationship between users and their followers or followed users, where followers are users who follow a particular user’s account and followed users are the accounts that the particular user follows.

Twitter’s Social graph is used to analyze connections and relationships among users. It can be used, for example, to determine the status of a user’s followers to gauge their influence and popularity, and it can also be used to identify users who are interested in a particular topic or a particular group of users.
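
As a minimal sketch of this structure, assuming a simple in-memory directed graph (the class and method names are hypothetical and are not taken from Twitter’s code), followers and followed users can be kept as two adjacency maps:

```python
from collections import defaultdict


class FollowGraph:
    """Directed follow graph: an edge src -> dst means "src follows dst"."""

    def __init__(self):
        self.following = defaultdict(set)  # user -> accounts the user follows
        self.followers = defaultdict(set)  # user -> accounts that follow the user

    def follow(self, src, dst):
        # Record the edge in both directions so either lookup is cheap.
        self.following[src].add(dst)
        self.followers[dst].add(src)

    def follower_count(self, user):
        # A rough proxy for influence or popularity.
        return len(self.followers[user])

    def mutual_follows(self, user):
        # Accounts that the user follows and that follow the user back.
        return self.following[user] & self.followers[user]


graph = FollowGraph()
graph.follow("alice", "bob")
graph.follow("carol", "bob")
graph.follow("bob", "alice")
print(graph.follower_count("bob"), graph.mutual_follows("alice"))
```

Storing both directions doubles the memory used, but it makes “who follows this user” and “whom does this user follow” equally fast to answer.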

The Social graph can be accessed using Twitter’s API, which can be used to obtain a list of a particular user’s followers or followed users, or to investigate the relationship between a particular user and other users. However, due to recent privacy considerations and security reasons, Twitter may place restrictions on the use of its API, so it may be difficult for the average user to obtain a complete view of the Social graph.

<Tweet engagement>

Next, let’s look at Tweet engagement, which is a measure of how much interest or interaction a tweet has generated. Tweet engagement is calculated based on several factors, including:

Retweets: Retweets are the number of times a Tweet is retweeted by another user, and are a way for users to share Tweets of interest with their followers.
Likes: Likes are the number of times a tweet is liked, and are a way for users to show a positive reaction to a tweet.
Replies: Replies are the number of replies to a tweet and are a way for users to comment or respond to a tweet.
Mentions: Mentions are the number of times a user mentions another user in a Tweet.

These factors are taken together to calculate the Tweet engagement count. A higher engagement number indicates that the Tweet was retweeted or liked by more people and caused a buzz.
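
As an illustration only, the four factors can be combined into a single count as follows. The field names and the optional per-factor weights are assumptions made for this sketch; the actual weighting used by Twitter is not stated here.

```python
from dataclasses import asdict, dataclass


@dataclass
class TweetStats:
    retweets: int = 0
    likes: int = 0
    replies: int = 0
    mentions: int = 0


def engagement_count(stats, weights=None):
    """Sum the engagement factors; any factor without a weight counts once."""
    weights = weights or {}
    return sum(weights.get(name, 1) * value for name, value in asdict(stats).items())


stats = TweetStats(retweets=3, likes=10, replies=2, mentions=1)
print(engagement_count(stats))                   # plain sum of the four factors
print(engagement_count(stats, {"retweets": 2}))  # hypothetical: retweets count double
```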

The Tweet engagement count is used to measure the performance and impact of a tweet. It can also help evaluate marketing campaigns and social media strategies. In order for companies and individuals to increase Tweet engagement, it is important to provide interesting content and encourage interaction with followers.

<User data>

Finally, User data in Twitter refers to information related to a Twitter user account. Some common types of User data are described below.

Profile information: Information that users provide to introduce themselves, such as their interests and location. This includes the user name, display name, biography, profile picture, and header image.
Tweets: Contains data on tweets posted by the user. Tweets may include text, images, videos, URLs, etc., and may also include statistics such as the date and time the tweet was posted, number of likes, number of retweets, etc.
Followers and users followed: This includes information about other users that the user follows and the user’s followers. This allows for analysis of relationships and connections between users.
Mentions and replies: This includes information on mentions a user has made to other users and replies a user has received from other users. This allows us to understand patterns of communication and interaction between users.
User Activity: This includes information about activity on the user’s account. This includes login date and time, tweets viewed, likes and retweets, etc.

The User data that can be obtained from Twitter is limited to general information. Some data can be accessed through the API, but authentication is required and certain restrictions apply. It should also be noted that there are restrictions on the acquisition and use of user data based on privacy policies and terms of use.

About Feature (Feature Creation)

<Overview>

Using the above data, Twitter uses the following six components to create features.

GraphJet
SimClusters
TwHIN
RealGraph
TweepCred
Trust & Safety (T&S)

They are described in detail below.

<GraphJet>

Developed by Twitter, GraphJet is a real-time graph search and ranking engine for social networks that enables fast and efficient data retrieval, recommendation, and ranking within large social graphs.

One of GraphJet’s unique features is its use of Inverse Link Indexing, which tracks the number of times a node is linked to another node and uses this information to rapidly search and rank nodes in the network. This makes it possible to analyze people’s connections on social networks and identify relevant content and users.

GraphJet can be useful in the following scenarios:

Content Searching and Filtering: GraphJet can efficiently search for content within a social graph. For example, you can search for the most popular posts and users related to a specific topic or keyword.
Personalizing feeds: GraphJet analyzes a user’s social graph and interests to determine the optimal order in which to display feeds and recommend content. This allows you to provide information that is most relevant to your users.
User Ranking: GraphJet can rank users within their social graph. This feature may be used to identify influencers, discover trends, or identify groups of users with specific attributes.

GraphJet uses efficient algorithms and data structures for real-time data processing and fast searches, resulting in high performance and scalability on large social networks.
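
To give a feel for graph-based recommendation of this kind, here is a heavily simplified sketch over a bipartite user-tweet interaction graph. It is not GraphJet’s API: the class, its methods, and the scoring rule (counting how many paths lead from a user to an unseen tweet via users with shared engagements) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


class InteractionGraph:
    """Bipartite graph linking users to the tweets they engaged with."""

    def __init__(self):
        self.user_to_tweets = defaultdict(set)
        self.tweet_to_users = defaultdict(set)

    def add_engagement(self, user, tweet):
        self.user_to_tweets[user].add(tweet)
        self.tweet_to_users[tweet].add(user)

    def recommend(self, user, k=3):
        # Walk user -> tweet -> other user -> tweet and count arrivals
        # at tweets the user has not engaged with yet.
        seen = self.user_to_tweets[user]
        scores = Counter()
        for tweet in seen:
            for neighbor in self.tweet_to_users[tweet]:
                if neighbor == user:
                    continue
                for candidate in self.user_to_tweets[neighbor]:
                    if candidate not in seen:
                        scores[candidate] += 1
        # Highest count first; ties are broken by tweet id for stable output.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [tweet for tweet, _ in ranked[:k]]


g = InteractionGraph()
for user, tweet in [("alice", "t1"), ("alice", "t2"), ("bob", "t1"),
                    ("bob", "t3"), ("carol", "t2"), ("carol", "t3"), ("carol", "t4")]:
    g.add_engagement(user, tweet)
print(g.recommend("alice"))
```

A production system would typically sample random walks instead of expanding every path, since exhaustive expansion becomes too expensive on a large graph.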

<SimClusters>

SimClusters is an open source similarity clustering tool developed by Twitter that uses feature vectors to cluster data. SimClusters enables fast and efficient clustering of large data sets.

The main features of SimClusters are:

Fast clustering: SimClusters can perform fast clustering even on very large data sets. This is achieved by combining efficient algorithms with parallel processing.
Scalability: SimClusters can scale out horizontally. This means that multiple machines or clusters can be used to parallelize processing to accommodate fast and large data sets.
Customizable Similarity Metrics: SimClusters allows users to customize similarity metrics. By using the appropriate similarity metric for different types of data, more accurate clustering results can be obtained.

Similarity clustering of this kind is used in a variety of applications, for example, in natural language processing to cluster documents based on sentence similarity and in image processing to group similar images using image feature vectors. SimClusters is an open source project and is available from Twitter’s GitHub repository.
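
The idea of clustering with a pluggable similarity metric can be sketched as follows. This is a toy illustration and not SimClusters’ implementation: the sparse dictionary vectors, the centroid names, and the function names are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_cluster(vector, centroids, similarity=cosine):
    # The similarity function is a parameter, so a different metric
    # can be swapped in without touching the assignment logic.
    return max(centroids, key=lambda name: similarity(vector, centroids[name]))


centroids = {
    "sports": {"football": 1.0, "nba": 1.0},
    "tech": {"python": 1.0, "ml": 1.0},
}
user_vector = {"python": 0.9, "nba": 0.1}
print(assign_cluster(user_vector, centroids))
```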

<TwHIN>

TwHIN is an algorithm that combines multiple types of interactions and uses them for recommendation. It calculates embeddings of users and Tweets separately for each type of interaction possible on Twitter (like, RT, follow, etc.). Once the embeddings for each type are obtained, clustering is performed with k-means for each type of interaction, and an embedding is calculated for each cluster. Finally, the embeddings are weighted according to the contribution of the clusters that interact with the user or Tweet, and a single embedding that takes multiple interactions into account is calculated. In this way, an embedding based on multiple types of interactions is obtained, and the follow-recommendation-service (FRS) uses this embedding.
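
The final merging step described above can be pictured with a small sketch. It covers only the weighted combination of per-interaction embeddings; the k-means step is omitted, and the function name, the example vectors, and the weights are hypothetical.

```python
def combine_embeddings(per_type, weights):
    """Weighted average of one embedding per interaction type."""
    dim = len(next(iter(per_type.values())))
    total = sum(weights[kind] for kind in per_type)
    combined = [0.0] * dim
    for kind, vector in per_type.items():
        for i, value in enumerate(vector):
            combined[i] += weights[kind] * value / total
    return combined


# Hypothetical 2-dimensional embeddings of one user, one per interaction type.
per_type = {"like": [1.0, 0.0], "retweet": [0.0, 1.0], "follow": [1.0, 1.0]}
# Hypothetical contribution of each interaction type.
weights = {"like": 1.0, "retweet": 1.0, "follow": 2.0}
print(combine_embeddings(per_type, weights))
```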

The abstract of the attached paper is transcribed below. “Social networks such as Twitter form heterogeneous information networks (HINs) where nodes represent domain entities (users, content, advertisers, etc.) and edges represent one of many entity interactions (e.g., users re-sharing content or “following” others). Interactions from multiple relationship types can encode valuable information about social network entities that is not fully captured by a single relationship. For example, the preferences of accounts that users follow may depend on both the user-content engagement interaction and the other users they follow. In this work, we study the knowledge graph embedding of Twitter HIN (TwHIN) entities. We show that these pre-trained representations provide significant offline and online improvements in diverse downstream recommendation and classification tasks such as personalized ad ranking, account following recommendations, offensive content detection, and search ranking. We also address design choices and practical challenges in deploying industry-wide HIN embeddings, such as compression to reduce end-to-end model latency and dealing with parameter drift between versions.”

<RealGraph>

Since the relationship between users who view recommendations and Tweeters who may appear in the recommendations also affects the score, an algorithm called RealGraph is applied.

RealGraph estimates the strength of this relationship and assigns a higher score to users who are likely to interact with the user in the near future.

Internally, RealGraph estimates the relationship between users by using logistic regression to predict the probability of interactions (likes, replies, etc.) between users in the near future, based on the information of the most recent user behavior.
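
A minimal sketch of this kind of prediction is shown below. The feature names, weights, and bias are made-up values for illustration; in a real system they would be learned from interaction logs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def interaction_probability(features, weights, bias):
    """Logistic regression: probability that two users interact soon."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return sigmoid(z)


# Hypothetical coefficients (not taken from RealGraph).
weights = {"recent_likes": 0.4, "recent_replies": 0.8, "days_since_last_interaction": -0.1}
bias = -2.0

close_pair = {"recent_likes": 5, "recent_replies": 2, "days_since_last_interaction": 1}
distant_pair = {"recent_likes": 0, "recent_replies": 0, "days_since_last_interaction": 30}

print(interaction_probability(close_pair, weights, bias))
print(interaction_probability(distant_pair, weights, bias))
```

The pair with frequent recent interactions gets a probability above 0.5, while the pair with no recent activity gets a probability close to 0.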

<TweepCred>

TweepCred is a PageRank-based algorithm that calculates the user’s reputation within Twitter. For more information on PageRank, see “Overview and Implementation of the PageRank Algorithm.”
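
For reference, the basic PageRank computation on a follow graph can be sketched with power iteration. This is the textbook algorithm, not Twitter’s implementation; the damping factor of 0.85 and the toy graph are assumptions.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """Power-iteration PageRank; follows maps a user to the accounts they follow."""
    nodes = set(follows) | {v for targets in follows.values() for v in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = follows.get(node, [])
            if targets:
                # A user passes reputation to the accounts they follow.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A user who follows nobody spreads reputation evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


follows = {"alice": ["bob"], "carol": ["bob"], "bob": ["alice"], "dave": ["bob", "alice"]}
ranks = pagerank(follows)
print(max(ranks, key=ranks.get))
```

In this toy graph, "bob" is followed by the most accounts and ends up with the highest score.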

<Trust & Safety (T&S)>

Trust & Safety (T&S) in recommendation algorithms is the concept of providing safe and reliable recommendations to users. T&S aims to minimize harm and risk for users of platforms and services, and includes the following specific initiatives.

Improving the user experience: T&S is important to ensure that users feel safe when using a service. Recommendation algorithms provide personalized content based on a user’s past behavior and preferences, and in the process may introduce inappropriate or harmful content. Incorporating the concept of T&S can reduce the risk of displaying content that users may find offensive or dangerous.
Improving Platform Reliability: T&S is also important for enhancing the reliability of the platform. Without proper T&S measures in place, users are more likely to encounter harmful content, fraud, or spam. This can damage the platform’s reputation and lead to user defection and other negative effects. Implementing T&S measures will increase the platform’s credibility and maintain user loyalty.

Home Mixer (Recommendation Step)

<Overview>

Two-stage recommendation is an approach to recommendation systems.

A normal recommendation system aims to recommend both individual items and related items that may be of interest to a user in a single step, whereas a two-stage recommendation system splits this work into two phases. In the first stage, features and feedback are collected to model the user’s interests and preferences: the user’s past behavior, profile information, and ratings are taken into account to understand the user’s preferences and to find related items that may be of interest to the user.

Next, the information obtained in the initial phase is processed to generate a list of related items and recommendations. In this process, more specific recommendations are made based on the user’s characteristics. These allow the first stage to model the user’s general preferences, whereas the second stage allows for more personalized recommendations, taking into account individual context and requirements.

The advantage of the two-stage recommendation is that it can make personalized recommendations that take into account the preferences of individual users. Two-stage recommendation may also address the issue of data sparseness (insufficient data) by collecting information at the initial stage. However, Two-stage recommendation has the caveat that it requires more complex processing.

Using this approach, Twitter makes recommendations in the following steps:

Candidate Sourcing (1st stage)
Ranking (2nd stage)
Heuristics & Filtering

The details of these steps are described below.
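
Before going into the details, the overall flow of the three steps can be pictured as a small pipeline. This is a conceptual sketch only: the function signature, the toy sources, and the like-count score are assumptions and do not reflect the Home Mixer code.

```python
def recommend(user, sources, score, filters, limit=10):
    # 1st stage: gather candidates from every source.
    candidates = []
    for source in sources:
        candidates.extend(source(user))
    # 2nd stage: order the candidates by predicted relevance.
    ranked = sorted(candidates, key=lambda tweet: score(user, tweet), reverse=True)
    # Heuristics & filtering: drop anything a filter rejects.
    for keep in filters:
        ranked = [tweet for tweet in ranked if keep(user, tweet)]
    return ranked[:limit]


def in_network(user):
    return [{"id": 1, "author": "bob", "likes": 5},
            {"id": 2, "author": "carol", "likes": 50}]


def out_of_network(user):
    return [{"id": 3, "author": "spam", "likes": 99}]


def like_score(user, tweet):
    return tweet["likes"]


def not_blocked(user, tweet):
    return tweet["author"] != "spam"


feed = recommend("alice", [in_network, out_of_network], like_score, [not_blocked], limit=2)
print([tweet["id"] for tweet in feed])
```

Keeping the sources, the scoring function, and the filters as separate arguments mirrors the separation of the three steps, so each can be changed independently.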

<Candidate Sourcing>

Candidate Sourcing (collection of candidate items) in a recommendation algorithm is the process or method by which the recommendation system collects candidate items for recommendation to the user. This step plays an important role in providing relevant items to the user. While the method of candidate sourcing may vary depending on the design and purpose of a particular recommendation system, there are several common approaches, including:

Using the user’s historical data: This approach involves analyzing the user’s past behavioral and purchase history and collecting relevant items based on that information. This could be, for example, recommending items that are similar to products the user has purchased or content they have viewed in the past.
Content-based filtering: This approach collects items that match the user’s preferences based on the characteristics and attributes of the item itself. This could be, for example, selecting relevant items based on movie genres or book categories.
Collaborative Filtering: This approach uses similarities and common preferences among users to collect items that other similar users might like. This allows for recommendations of items that users have not yet evaluated.
Real-time data feed: This approach gathers the latest trends, popular items, new arrivals, etc. in real-time to provide potential recommendations.

While these methods are sometimes used independently, they are usually combined to collect candidate items with higher accuracy. It is important to select the most appropriate Candidate Sourcing method according to the objectives and constraints of the recommendation algorithm and the characteristics of the data.

In the case of Twitter, 1500 “recommended candidate” tweets are selected for each user in Candidate Sourcing. The candidates come from users you follow (In-Network Sources) and users you do not follow (Out-of-Network Sources), which are roughly 50% each.
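
As a rough illustration of the 1500-candidate budget and the even split, the sketch below merges two already-sorted candidate lists. The function name, the fixed 50% share, and the backfill behavior when one source runs short are assumptions; the real split is described only as an approximate average.

```python
def source_candidates(in_network, out_of_network, total=1500, in_share=0.5):
    """Take about in_share of the budget in-network and fill the rest out-of-network."""
    n_in = min(len(in_network), int(total * in_share))
    # If the in-network source is short, the remaining budget goes out-of-network.
    n_out = min(len(out_of_network), total - n_in)
    return in_network[:n_in] + out_of_network[:n_out]


in_net = [f"in-{i}" for i in range(1000)]
out_net = [f"out-{i}" for i in range(2000)]
candidates = source_candidates(in_net, out_net)
print(len(candidates))
```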

There are four components that handle Candidate Sourcing:

search-index
cr-mixer
user-tweet-entity-graph (UTEG)
follow-recommendation-service (FRS)

They are described below.

<search-index>

A search index is a data structure used to support fast retrieval of data, and is mainly used in systems such as text search engines and databases. Types of search indexes include the hash table, a data structure that stores key-value pairs; the binary tree, a tree structure in which each node has at most two child nodes; the B-tree, a balanced multiway tree in which each node can have multiple children; and the hash index.

In Twitter’s algorithm, a search index is used to retrieve candidate recommendations: based on the list of users that the viewer follows, the candidate set is created by retrieving the Tweets of those users.

This search is prioritized by a light ranker using EarlyBird, a Tweet search system within Twitter. Even though the search is limited to the Tweets of users the viewer currently follows, there are still a large number of Tweets, so scoring is performed based on static and dynamic features, and the top-ranked Tweets are selected.
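
The combination of an index lookup and a light ranker can be sketched as follows. This is not EarlyBird’s implementation: the index keyed by author, the feature names, and the scoring weights are all hypothetical.

```python
import heapq
from collections import defaultdict


class AuthorIndex:
    """Hash-table index from an author to that author's tweets."""

    def __init__(self):
        self.by_author = defaultdict(list)

    def add(self, tweet):
        self.by_author[tweet["author"]].append(tweet)

    def in_network_candidates(self, following, light_score, k):
        # Look up only the followed authors, then keep the k best-scoring tweets.
        pool = (tweet for author in following for tweet in self.by_author.get(author, []))
        return heapq.nlargest(k, pool, key=light_score)


def light_score(tweet):
    # Static feature: whether the tweet has media.
    # Dynamic features: current like and retweet counts.
    return 0.5 * tweet["has_media"] + 1.0 * tweet["likes"] + 2.0 * tweet["retweets"]


index = AuthorIndex()
for tweet in [
    {"id": 1, "author": "bob", "likes": 3, "retweets": 0, "has_media": False},
    {"id": 2, "author": "bob", "likes": 1, "retweets": 2, "has_media": True},
    {"id": 3, "author": "carol", "likes": 10, "retweets": 0, "has_media": False},
    {"id": 4, "author": "dave", "likes": 100, "retweets": 0, "has_media": False},
]:
    index.add(tweet)

top = index.in_network_candidates({"bob", "carol"}, light_score, k=2)
print([tweet["id"] for tweet in top])
```

The tweet by "dave" has the highest score but is never considered, because the lookup is restricted to followed authors.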

<cr-mixer>

The cr-mixer is a component that builds a single candidate set by integrating the candidates produced by many algorithms.

<user-tweet-entity-graph (UTEG)>

The User-Tweet-Entity Graph (UTEG) is a graph structure generated from Twitter data, which represents the relationships among users, tweets, and entities within tweets (keywords, hashtags, mentions, etc.).

The User-Tweet-Entity Graph takes the form of user-generated tweets containing various entities. Nodes in the graph represent users, tweets, and entities, and edges represent relationships between them. This could be in the form, for example, of an edge from a user node to a tweet node and an edge from a tweet node to an entity node.

This User-Tweet-Entity Graph is used to model the relevance of information on Twitter and to analyze connections and topic trends among users. It can provide insights such as the following:

Analysis of user relationships: Through the graph, it is possible to visualize followings and interconnections between users and understand their social networks.
Topic identification and analysis: entity nodes can be used to extract tweets related to specific topics or keywords to analyze topic trends and topic diffusion.
Event Detection: Analyze relationships and entity co-occurrence among tweets to detect specific events and trends.
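
The node and edge layout described above can be sketched with a few dictionaries. The class, its methods, and the sample hashtags are hypothetical and are only meant to show how user → tweet and tweet → entity edges support topic lookup and co-occurrence analysis.

```python
from collections import Counter, defaultdict


class UserTweetEntityGraph:
    def __init__(self):
        self.user_tweets = defaultdict(set)     # edges: user -> tweet
        self.tweet_entities = defaultdict(set)  # edges: tweet -> entity
        self.entity_tweets = defaultdict(set)   # reverse edges: entity -> tweet

    def add_tweet(self, user, tweet_id, entities):
        self.user_tweets[user].add(tweet_id)
        for entity in entities:
            self.tweet_entities[tweet_id].add(entity)
            self.entity_tweets[entity].add(tweet_id)

    def tweets_about(self, entity):
        # Topic identification: all tweets that contain the entity.
        return set(self.entity_tweets.get(entity, set()))

    def co_occurring_entities(self, entity):
        # Entity co-occurrence: which entities appear in the same tweets.
        counts = Counter()
        for tweet_id in self.entity_tweets.get(entity, set()):
            for other in self.tweet_entities[tweet_id]:
                if other != entity:
                    counts[other] += 1
        return counts


uteg = UserTweetEntityGraph()
uteg.add_tweet("alice", "t1", ["#python", "#ml"])
uteg.add_tweet("bob", "t2", ["#python"])
uteg.add_tweet("carol", "t3", ["#ml", "#jax"])
print(uteg.tweets_about("#python"), uteg.co_occurring_entities("#ml"))
```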

<follow-recommendation-service (FRS)>

A Follow Recommendation Service is a type of recommendation system used by social media platforms and social networking sites. Its purpose is to recommend other users who may be of interest to the user.

The follow recommendation service is implemented using the following methods and algorithms:

User’s behavioral history: The accounts a user has followed in the past and related accounts are analyzed, and based on this information, users with similar interests are recommended. This can be done, for example, by recommending users with a common following or similar interests.
Content-based recommendation: Recommends other users with similar attributes based on the user’s profile information, posted content, and interests. This may allow users to share common interests and activities.
Social Graph Analysis: Analyzes follow graphs and social networks among users and makes recommendations based on user relationships and connections. This could be, for example, recommending users who are followed by friends or users with strong connections.
Real-time data: Collect the latest trends and popular users in real-time and make recommendations based on this information.

In the case of Twitter, this would be the accounts to follow and the tweets from those accounts. Twitter has a system called “Who to follow” that recommends users who are not yet followed.

Internally, a series of processes such as Candidate Generation -> Filtering -> Ranking -> Transform -> Truncation regarding recommended users seem to be performed, and algorithms such as SimClusters and TwHIN are used in addition to GraphJet.
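
The five-step flow can be sketched on a toy follow graph. The candidate rule used here (accounts followed by the accounts you follow) and all names are assumptions for illustration; the actual service combines several candidate sources and learned rankers.

```python
from collections import Counter


def who_to_follow(user, following, limit=2):
    already = following.get(user, set())
    # Candidate Generation: accounts followed by the accounts the user follows.
    counts = Counter(c for friend in already for c in following.get(friend, set()))
    # Filtering: drop the user and accounts that are already followed.
    for excluded in already | {user}:
        counts.pop(excluded, None)
    # Ranking: most shared connections first, ties broken by name.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Transform: turn each pair into a display-ready record.
    records = [{"account": name, "followed_by_friends": n} for name, n in ranked]
    # Truncation: keep only the top entries.
    return records[:limit]


following = {
    "alice": {"bob", "carol"},
    "bob": {"dave", "erin", "alice"},
    "carol": {"dave", "bob"},
}
print(who_to_follow("alice", following))
```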

<Ranking>

Once the candidate recommendations are obtained, the next step is to re-prioritize them. The component is called heavy-ranker.

The features used for ranking are broadly classified into three categories: Aggregate/NonAggregate/Embedding Features, in which features are further defined and used in detail. A detailed explanation is given below.

In addition, a CTR prediction model called MaskNet is used as the ranking algorithm. CTR (click-through rate) represents the ratio of the number of times an ad or content is actually clicked to the number of times it is displayed. In general, CTR is expressed in the following format:

\[\mathrm{CTR} = \frac{\text{number of clicks}}{\text{number of times an advertisement or content is displayed}} \times 100\]

CTR is an important metric for measuring the effectiveness and interest in an advertisement or content, and a high CTR suggests that the advertisement or content attracts the user’s attention and encourages a click action. In general, ads and content with high CTRs are considered indicators of more effective marketing and content strategies.

CTR is used to evaluate campaign performance and ad revenue in digital marketing and online advertising, and is also an important metric in A/B testing and marketing campaign optimization.
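
As a quick worked example of the definition, with made-up numbers, 25 clicks on 1,000 displays give a CTR of 2.5%:

```python
def ctr_percent(clicks, impressions):
    """Click-through rate as a percentage of impressions."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return 100.0 * clicks / impressions


print(ctr_percent(25, 1000))
```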

A distinctive feature of MaskNet is the use of an instance-guided mask, a layer that extracts the importance of each feature. This instance-guided mask automatically adjusts the features to be considered important and efficiently learns the interactions between features.
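
The sketch below is a minimal illustration of an instance-guided mask, assuming that the mask is produced from the input embedding by two fully connected layers with a ReLU in between and is then multiplied element-wise with the embedding. Layer normalization and training are omitted, and all weights are made-up values rather than parameters of Twitter’s model.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def relu(vector):
    return [max(0.0, x) for x in vector]


def instance_guided_mask(embedding, w1, b1, w2, b2):
    # The mask is computed from the instance itself, so every input
    # gets its own per-dimension importance weights.
    hidden = relu(add(matvec(w1, embedding), b1))
    return add(matvec(w2, hidden), b2)


def apply_mask(embedding, mask):
    # Element-wise product: each dimension is scaled by its mask value.
    return [m * x for m, x in zip(mask, embedding)]


embedding = [1.0, 2.0]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 2 -> 3 (wider hidden layer)
b1 = [0.0, 0.0, -1.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]    # 3 -> 2 (back to embedding size)
b2 = [0.0, 0.0]

mask = instance_guided_mask(embedding, w1, b1, w2, b2)
print(mask, apply_mask(embedding, mask))
```

Because the mask multiplies the embedding rather than being added to it, the output contains products of input features, which is one way to picture how interactions between features are captured.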

<Heuristics & Filtering>

Heuristics & Filtering makes adjustments to the rankings created in the previous step to create a balanced and diverse feed, and includes the following features:

Exclusion of blocked or muted tweets
Ensuring user diversity so that no one person appears in the Feed consecutively
Adjustments to balance In/Out of Network
Reducing the score of some tweets based on Feedback
Ensure that someone you follow is involved in the tweet or follows the tweeter (intent of recommendation quality assurance)
Threaded tweets are connected from the original tweet (contextual visualization)
Edited tweets are changed to the latest version.
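
Two of these adjustments, removing blocked or muted authors and keeping one author from dominating the feed, can be sketched as follows. The decay factor of 0.5 and the data layout are assumptions for illustration, not values from Twitter’s code.

```python
def filter_and_diversify(scored, hidden_authors, decay=0.5):
    """scored is a list of (tweet, score) pairs; returns tweets in feed order."""
    # Exclusion of blocked or muted authors.
    visible = [(t, s) for t, s in scored if t["author"] not in hidden_authors]
    visible.sort(key=lambda pair: pair[1], reverse=True)
    # Author diversity: each further tweet by the same author is discounted.
    seen = {}
    adjusted = []
    for tweet, score in visible:
        repeats = seen.get(tweet["author"], 0)
        adjusted.append((tweet, score * decay ** repeats))
        seen[tweet["author"]] = repeats + 1
    adjusted.sort(key=lambda pair: pair[1], reverse=True)
    return [tweet for tweet, _ in adjusted]


scored = [
    ({"id": 1, "author": "bob"}, 0.9),
    ({"id": 2, "author": "bob"}, 0.8),
    ({"id": 3, "author": "carol"}, 0.6),
    ({"id": 4, "author": "troll"}, 0.95),
]
feed_order = filter_and_diversify(scored, hidden_authors={"troll"})
print([tweet["id"] for tweet in feed_order])
```

The second tweet by "bob" is discounted from 0.8 to 0.4, so the tweet by "carol" moves ahead of it and the two tweets by "bob" no longer appear consecutively.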

The above is a description of Twitter’s recommendation algorithm. As you can see, a great many recommendation algorithms are in use. For the basic theory and specific implementation of the recommendation techniques mentioned above, please refer to “Recommendation Techniques.”