Recommendation systems for streaming data
Matrix Factorisation, described in ‘Recommendation systems in Netflix’, is also a useful method when dealing with streaming data. Matrix Factorisation usually learns feature vectors by collecting all of the rating data into a single matrix, but with streaming data, which arrives continuously, it is not possible to train on all of the data at once.
In such cases, online learning, as described in ‘Overview of online learning, various algorithms, application examples and concrete implementations’, can be used so that new data is learnt immediately as it streams in. Applied to Matrix Factorisation, online learning makes it possible to learn from streaming data in real time.
Online learning commonly uses Stochastic Gradient Descent (SGD), described in ‘Overview of Stochastic Gradient Descent (SGD), its algorithms and examples of implementation’, in which randomly selected data points are used to update the model incrementally. This makes it possible to learn from streaming data in real time.
Matrix Factorisation can also be combined with time-series models to handle streaming data more effectively. A time-series model predicts future data from past data, and this prediction can be used to update the current feature vectors so that they take the expected future data into account.
In practice, when Matrix Factorization is applied to streaming data, it is common to base predictions on past data in this way, which improves prediction accuracy for current data and allows streaming data to be handled effectively.
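For reference, the following minimal sketch shows the basic Matrix Factorisation model assumed above: each user and each item is represented by a latent feature vector, and a rating is predicted as their dot product. The matrix sizes and the number of latent factors are illustrative assumptions.
import numpy as np

n_users, n_items, n_factors = 5, 10, 8   # illustrative sizes

# Latent feature matrices: one row per user / per item, randomly initialised
P = np.random.normal(scale=0.1, size=(n_users, n_factors))   # user factors
Q = np.random.normal(scale=0.1, size=(n_items, n_factors))   # item factors

def predict(user_index, item_index):
    # A rating is predicted as the dot product of the two latent vectors
    return P[user_index] @ Q[item_index]

print(predict(0, 3))  # predicted rating of user 1 for item 4 (0-based indices)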
In order to apply online learning to Matrix Factorization, the usual batch-learning algorithm needs to be slightly modified. Whereas batch learning uses all of the training data at once to find the optimal parameters, online learning updates the parameters immediately for each piece of training data as it streams in.
To perform Matrix Factorization with online learning, the following modifications are made to the normal batch-learning algorithm.
1. Initialisation: initialise the parameters randomly, as in normal batch learning.
2. Streaming of training data: read the training data as a stream.
3. Parameter update: each time a piece of streaming data is read, update the parameters for that data point. Stochastic Gradient Descent (SGD) is commonly used for this update.
4. Convergence check: check for convergence after each parameter update. If convergence has been reached, training is terminated; if not, return to step 3 as long as streamed training data remains.
Online learning has a lower computational cost per update than batch learning and makes real-time learning on streaming data possible. However, it tends to take longer to converge to the optimal parameters than batch learning, and it assumes that sufficient training data will keep arriving. Parameter initialisation and the choice of convergence criterion are also important issues in online learning.
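A minimal sketch of the per-observation SGD update in step 3 is shown below; the learning rate, regularisation strength and number of latent factors are illustrative assumptions and would need tuning in practice.
import numpy as np

n_users, n_items, n_factors = 5, 10, 8
lr, reg = 0.05, 0.02   # learning rate and regularisation strength (assumed values)

P = np.random.normal(scale=0.1, size=(n_users, n_factors))  # user factors
Q = np.random.normal(scale=0.1, size=(n_items, n_factors))  # item factors

def sgd_update(user_index, item_index, rating):
    # One online SGD step for a single streamed rating
    p_u = P[user_index].copy()
    q_i = Q[item_index].copy()
    error = rating - p_u @ q_i                      # prediction error for this observation
    P[user_index] += lr * (error * q_i - reg * p_u)
    Q[item_index] += lr * (error * p_u - reg * q_i)

# Example: a rating event (user 2, item 7, rating 4) arrives from the stream
sgd_update(user_index=1, item_index=6, rating=4)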
Implementation example
An example implementation of a recommendation system for streaming data is shown below. Streaming data requires rapid processing and recommendation because the data is updated in real time; in production, such pipelines are often built with stream-processing frameworks such as Apache Kafka and Apache Flink, while the system shown here makes recommendations based on user interactions in a simplified, simulated setting.
The following code is an example of a simple streaming recommendation system implemented in Python. Here, pandas and scikit-learn are used to perform collaborative-filtering-based recommendation, and the data frame is updated in real time to mimic streaming data.
Installing the required libraries: first, install the following libraries.
pip install pandas scikit-learn
Simulation of streaming data: in the following, simple streaming data that mimics user behaviour (item ratings) is simulated and recommendations are made based on this data.
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
import time
# Simulated streaming data:
# user ID, item ID and rating
def generate_streaming_data():
    # Randomly generate combinations of user ID, item ID and rating
    user_ids = np.random.choice(range(1, 6), size=10)
    item_ids = np.random.choice(range(1, 11), size=10)
    ratings = np.random.choice([1, 2, 3, 4, 5], size=10)

    # Build a data frame from the generated values
    data = pd.DataFrame({
        'user_id': user_ids,
        'item_id': item_ids,
        'rating': ratings
    })
    return data

# Function to create the user-item matrix
def create_user_item_matrix(data, n_users, n_items):
    user_item_matrix = np.zeros((n_users, n_items))
    for _, row in data.iterrows():
        user_item_matrix[row['user_id'] - 1, row['item_id'] - 1] = row['rating']
    return user_item_matrix
# Recommendation function using collaborative filtering
def recommend(user_item_matrix, user_id, n_recommendations=3):
    # Find similar users with NearestNeighbors (cosine similarity)
    model = NearestNeighbors(metric='cosine', algorithm='brute')
    model.fit(user_item_matrix)

    # User index (IDs starting at 1 are mapped to rows starting at 0)
    user_index = user_id - 1
    n_neighbors = min(len(user_item_matrix), n_recommendations + 1)
    distances, indices = model.kneighbors(
        user_item_matrix[user_index].reshape(1, -1), n_neighbors=n_neighbors)

    # Sum the ratings of the similar users (excluding the user themselves)
    scores = np.zeros(user_item_matrix.shape[1])
    for neighbor_index in indices[0]:
        if neighbor_index != user_index:
            scores += user_item_matrix[neighbor_index]

    # Recommend the highest-scoring items the user has not rated yet (item IDs are 1-based)
    unrated = user_item_matrix[user_index] == 0
    ranked_items = np.argsort(-scores)
    recommended_items = [int(item) + 1 for item in ranked_items if unrated[item]]
    return recommended_items[:n_recommendations]
# Main function for simulating streaming data and making recommendations
def main():
    n_users = 5       # Number of users
    n_items = 10      # Number of items
    data_stream = []  # List for accumulating streaming data

    while True:
        # Generate new streaming data
        new_data = generate_streaming_data()
        data_stream.append(new_data)

        # Combine the data frames and build the user-item matrix
        all_data = pd.concat(data_stream, ignore_index=True)
        user_item_matrix = create_user_item_matrix(all_data, n_users, n_items)

        # Recommendation for user 1
        recommended_items = recommend(user_item_matrix, user_id=1, n_recommendations=3)
        print(f"Recommended items for User 1: {recommended_items}")

        # New streaming data arrives every 5 seconds
        time.sleep(5)

if __name__ == "__main__":
    main()
Implementation overview
- Generating streaming data: the generate_streaming_data() function generates random user behaviour (rating data) to simulate a stream.
- Creation of the user-item matrix: the create_user_item_matrix() function represents the users' item ratings in matrix form. This matrix is the data structure required by the recommendation algorithm.
- Recommendation by collaborative filtering: the recommend() function performs collaborative filtering using NearestNeighbors, finding similar users based on cosine similarity and recommending items the target user has not yet rated.
- Streaming data update: In the main loop (main()), the streaming data is updated every five seconds and a recommendation is made to user 1 each time.
Execution flow
- When the program is executed, new rating data is generated every five seconds and a recommendation is made to user 1.
- The recommend() function recommends items that user 1 has not yet rated, based on the ratings of similar users.
Challenges in streaming data
- Real-time processing: when dealing with streaming data, the frequency of data updates and the processing time are critical, so distributed processing is required for large data sets.
- Scalability: as the data grows, methods for efficient recommendation (e.g. clustering and indexing of items) need to be considered.
- Freshness: retraining the model every time new data arrives is computationally demanding, so methods such as online learning or mini-batch processing are typically used; one simple way to bound the cost is sketched after this list.
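One simple way to keep the cost of the example above bounded as data keeps arriving is a sliding window over the most recent batches, so that old events drop out of the user-item matrix automatically. The sketch below replaces the unbounded data_stream list from the example with a fixed-length deque; the window size is an arbitrary assumption, and create_user_item_matrix() is the function defined in the example above.
from collections import deque
import pandas as pd

# Keep only the most recent batches of rating data (window size is an assumption)
WINDOW_SIZE = 20
data_stream = deque(maxlen=WINDOW_SIZE)

def update_matrix_with_window(new_data, n_users, n_items):
    # Append the newest batch; the oldest batch is discarded once the window is full
    data_stream.append(new_data)
    all_data = pd.concat(list(data_stream), ignore_index=True)
    # create_user_item_matrix() is defined in the example above
    return create_user_item_matrix(all_data, n_users, n_items)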
Ideas for extensions
- Processing real-time data: streaming data processing can be distributed using Kafka or Flink to improve scalability (a minimal consumer sketch follows this list).
- Online learning: introducing online learning algorithms to sequentially update the model as data grows allows for near real-time recommendations.
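As a rough illustration of the first extension, the sketch below reads rating events from a Kafka topic with the kafka-python client instead of simulating them. The broker address, the topic name 'ratings' and the JSON message format are assumptions, and create_user_item_matrix() and recommend() are the functions from the example above.
import json
import pandas as pd
from kafka import KafkaConsumer  # pip install kafka-python

# Assumed setup: a local Kafka broker and a topic 'ratings' whose messages are
# JSON objects such as {"user_id": 1, "item_id": 4, "rating": 5}
consumer = KafkaConsumer(
    'ratings',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
)

events = []
for message in consumer:
    events.append(message.value)
    all_data = pd.DataFrame(events)
    # create_user_item_matrix() and recommend() are defined in the example above
    user_item_matrix = create_user_item_matrix(all_data, n_users=5, n_items=10)
    print(recommend(user_item_matrix, user_id=1, n_recommendations=3))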
Reference books
Reference books related to recommendation systems for streaming data are described below.
1. ‘Streaming Systems: the What, Where, When, and How of Large-Scale Data Processing’ by Tyler Akidau, Slava Chernyak, and Reuven Lax
– Abstract: A comprehensive guide to the theory and practice of streaming data processing, covering data pipeline construction, streaming system design and distributed processing techniques using Apache Beam.
– Contents: provides in-depth knowledge of streaming systems in general, with applications to real-time data processing in recommendation systems.
2. ‘Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems’ by Martin Kleppmann
– Abstract: Focusing on the design of large-scale data-driven applications, the book delves deeply into concepts related to streaming data, real-time processing and scalability.
– Contents: learn about database design and streaming processing techniques that support recommender systems and real-time analysis.
3. ‘Recommender Systems: The Textbook’ by Charu C. Aggarwal
– Description: This textbook covers the theory and practice of recommender systems and provides the basics of recommendation techniques such as collaborative filtering, content-based recommendation and hybrid systems. It also introduces methods applicable to recommendation systems based on streaming data.
– Content: algorithms for dealing with streaming data and approaches for building real-time recommendation systems are also discussed.
4. “”
– Abstract: Focusing on techniques for analysing streaming data in real-time, the book details how to process and analyse streaming data.
– Contents: learn about recommendation systems that use streaming data and techniques for visualising and analysing data in real time.
5. “”
– Abstract: Book on the design and implementation of machine learning based recommender systems, also describes the implementation of recommender systems for streaming data.
– Contents: presents specific examples of recommendation techniques using machine learning algorithms, in particular how to handle streaming data.
6. ‘Streaming Data: Understanding the Real-Time Pipeline’ by Andrew G. Psaltis
– Abstract: A practical guide to working with streaming data. Focuses on how to build real-time data pipelines and how to process information from different data sources.
– Contents: learn about frameworks and tools for handling streaming data efficiently and understand how data flows and is processed in recommender systems.
7. ‘Practical Recommender Systems’ by Kim Falk
– Description: A guide to the design and implementation of practical recommender systems, enabling users to learn about streaming data and real-time recommendation methods.
– Contents: focuses on different recommender methods and their implementation, especially approaches for recommendations based on real-time data.
8. ‘Stream Processing with Apache Flink: Fundamentals, Implementation, and Operation of Streaming Applications’ by Fabian Hueske and Vasiliki Kalavri.
– Abstract: A practical book on streaming data processing with Apache Flink, useful for real-time analysis and design of streaming applications.
– Contents: presents the fundamentals and applications of real-time data processing using Flink, which can be applied to the construction of streaming recommendation systems.