Commentary on libraries and reference books on data analysis using Python

Summary

Python is a general-purpose programming language with many attractive features: it is easy to learn, encourages readable code, and can be used for a wide range of applications. It was created by Guido van Rossum and first released in 1991.

Python supports several programming styles, including object-oriented, procedural, and functional programming. Because many libraries and frameworks are available, it is widely used for web applications, desktop applications, scientific and technical computing, machine learning, artificial intelligence, and other fields. Python is an interpreted language, so no separate compilation step is needed, and its interactive REPL (read-eval-print loop) shortens the development cycle.
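As a rough illustration of the programming styles mentioned above, the following sketch computes the same sum of squares in a procedural, a functional, and an object-oriented way. The function and class names are made up for this example.

```python
from functools import reduce


# Procedural style: an explicit loop that updates a running total.
def sum_of_squares_procedural(values):
    total = 0
    for v in values:
        total += v * v
    return total


# Functional style: compose map and reduce without mutating any state.
def sum_of_squares_functional(values):
    return reduce(lambda acc, sq: acc + sq, map(lambda v: v * v, values), 0)


# Object-oriented style: bundle the data and the operation in a class.
class Numbers:
    def __init__(self, values):
        self.values = list(values)

    def sum_of_squares(self):
        return sum(v * v for v in self.values)


data = [1, 2, 3, 4]
print(sum_of_squares_procedural(data))
print(sum_of_squares_functional(data))
print(Numbers(data).sum_of_squares())
```

All three calls print the same value. Which style reads best depends on the surrounding code, and Python lets you mix them freely.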

Here, we provide an overview of machine learning techniques using Python, based on “Bayesian Inference and MCMC Free Software” from the Iwanami Data Science series.

Python and Machine Learning

Python is a general-purpose programming language like C and Java, unlike domain-specific languages (DSLs) such as R, which is specialized for statistics and data analysis, or SQL, which is specialized for database access. Unlike C and Fortran, Python is a dynamic language that does not require a separate compilation step, yet it is fast enough in practice to be widely used in scientific and technical computing.

For example, large Python libraries such as Biopython (http://biopython.org/wiki/Main_Page) and Astropy (http://www.astropy.org/) are actively developed on GitHub in the fields of life science and space science, respectively.

If you want to use Python for data analysis and numerical computation, set up a Python environment as described in “Setting up a Python development environment with SublimeText4 and VS code” and “Installing a python development environment and tensorflow package on a mac”, then use scientific and data analysis libraries such as NumPy.

About Pandas

Pandas is a library you need to understand before using Python for data analysis, especially for data manipulation including preprocessing. Wes McKinney began developing pandas in 2008 while working at AQR Capital Management, a world-renowned quantitative hedge fund, as a library for data manipulation in Python (rumor has it that he had grown tired of using the R language for data analysis). The name pandas comes from “panel data”, and as the name suggests, the library provides data structures and operations for manipulating numerical tables and time series.

Pandas makes it easy to read data from a variety of formats, including CSV files, Excel files, SQL databases, and JSON, and to process, merge, group, aggregate, and visualize the data in a DataFrame. Pandas also interoperates with NumPy and integrates with scientific computing and data visualization libraries such as Matplotlib, which allows more advanced data analysis.
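To make the merging, grouping, and aggregation steps concrete, here is a minimal sketch. The two small tables and their column names (customer_id, amount, region) are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical tables made up for this example.
orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "amount": [100, 250, 50, 75],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["east", "west", "east"],
})

# Merge: attach each customer's region to their orders.
merged = orders.merge(customers, on="customer_id", how="left")

# Group and aggregate: total and mean order amount per region.
summary = merged.groupby("region")["amount"].agg(["sum", "mean"])
print(summary)
```

The left merge keeps every order even if a customer were missing from the second table, which is usually the safer default when the order table is the one being analyzed.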

Some sample code using pandas is shown below.

1. Importing and displaying CSV files

import pandas as pd

# Read CSV file and convert to DataFrame
df = pd.read_csv("data.csv")

# Display DataFrame
print(df)

2. DataFrame Operations

import pandas as pd

# Read CSV file and convert to DataFrame
df = pd.read_csv("data.csv")

# Display only specific columns
print(df["column_name"])

# Display only rows that meet certain criteria
print(df[df["column_name"] > 0])

# Sum column values
print(df["column_name"].sum())

# Calculate the average value of a column
print(df["column_name"].mean())

3. Exporting DataFrame

import pandas as pd

# Read CSV file and convert to DataFrame
df = pd.read_csv("data.csv")

# Export DataFrame to Excel file
df.to_excel("output.xlsx", index=False)
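The three samples above assume that a file named data.csv with a numeric column called column_name already exists. As a self-contained variant that runs without any file, the sketch below reads CSV text from memory. The column names and values are invented for the example, and it writes CSV rather than Excel because to_excel typically needs an additional package such as openpyxl.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv.
csv_text = """name,score
alice,3
bob,-1
carol,5
"""

# read_csv accepts a file-like object as well as a file path.
df = pd.read_csv(io.StringIO(csv_text))

positive = df[df["score"] > 0]   # rows that satisfy a condition
total = df["score"].sum()        # sum of a column
average = df["score"].mean()     # mean of a column

# With no path argument, to_csv returns the CSV text as a string.
csv_out = positive.to_csv(index=False)
print(total, average)
print(csv_out)
```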

A good reference book for pandas is “Pandas for Everyone: Python Data Analysis”.

The book begins with the basics of using pandas, then introduces a series of standard techniques such as data cleaning and aggregation, visualization, model fitting, and regularization. The appendices cover installing Python and reviewing its syntax. The book serves as a preparatory step before data analysis and machine learning, and lets the reader review each method while keeping the whole data processing workflow in view.

The contents are listed below.

■Part 1: Basic Usage
Chapter 1: DataFrame Basics
Loading the first data set / Viewing columns, rows, and cells / Grouping and aggregation / Basic graphs

Chapter 2 Data Structures in pandas
Creating your own data / About Series / About DataFrame / Rewriting Series and DataFrame
Exporting and Importing Data

Chapter 3 Plotting Graphs
matplotlib / Statistical graphics with matplotlib
seaborn / pandas objects / seaborn themes and styles

■Part 2 Cleaning with Data Manipulation
Chapter 4 Assembling Data
Tidy data / Concatenation / Merging multiple data sets

Chapter 5 Dealing with Missing Data
What is NaN / Where do missing values come from / Dealing with missing data

Chapter 6: Creating "Tidy Data"
When columns contain values (not variables) / When a column contains multiple variables
When both rows and columns contain variables / When there are multiple units of observation in one table (normalization)
When the same observation unit spans multiple tables

■Part 3 Data Preparation - Conversion / Formatting / Merging, etc.
Chapter 7: Overview of Data Types and Conversion
Data types / type conversion / categorical data

Chapter 8 Manipulating Text Strings
Strings / String methods / Other string methods
String formatting / Regular expressions / regex library

Chapter 9. Applying Functions with apply
Functions / Basics of apply / Applications of apply / Vectorized functions / Lambda functions

Chapter 10. Split-apply-combine with groupby operations
Aggregation / Transform / Filtering
DataFrameGroupBy object / Using multiple indexes

Chapter 11. Manipulating Date/Time Data
Python datetime object / Converting to datetime
Loading data containing dates / Extracting date components
Calculating dates and timedelta / Methods of datetime
.........

■Part 4 Fitting the Model to the Data
Chapter 12: Linear Models
Simple linear regression / Multiple linear regression / Keeping index labels from sklearn

Chapter 13 Generalized Linear Models
Logistic regression / Poisson regression / Other generalized linear models / Survival analysis

Chapter 14 Diagnosing Models
Residuals / Comparing multiple models / k-fold cross-validation

Chapter 15: Regularization to deal with overfitting
Why Regularize / LASSO Regression / Ridge Regression
ElasticNet / Cross-validation

Chapter 16 Clustering
k-means method / Hierarchical clustering

■Part 5: Conclusion - Next Steps
Chapter 17 Powerful features around pandas
Chapter 18 Sources for Further Learning

■Part 6 Appendix
Installation / Command Line / Project Templates
Using Python / Working Directory / Environment
Installing packages / Importing libraries
Lists / Tuples / Dictionaries / Slicing Values / Loops / Comprehensions

Basic mathematical analysis library

Libraries that provide analysis methods for data preprocessed with pandas include NumPy and SciPy (basic array and numerical methods), scikit-learn and Shogun (machine learning), statsmodels (statistical analysis), OpenCV and scikit-image (computer vision), NLTK and gensim (natural language processing), and SymPy (symbolic mathematics).

About NumPy

NumPy is a high-performance scientific computing library for Python, designed to handle multidimensional array and matrix computations efficiently. NumPy’s main features include:

Creation, manipulation, and transformation of multidimensional arrays
Array arithmetic operations, broadcast, indexing, slicing
Linear algebra, Fourier transforms, random number generation, statistical functions
Interfaces to C and Fortran languages

NumPy can be used to perform calculations faster than those processed using standard Python functions alone, and since it is used as the basis for many scientific computing and data analysis libraries, it also plays an important role in the use of these libraries.

The following sample code imports NumPy, creates NumPy arrays, and performs some mathematical operations on them.

import numpy as np

# Create NumPy array
a = np.array([1, 2, 3, 4, 5])
b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Array Manipulation
print(a.shape)  # output: (5,)
print(b.shape)  # output: (3, 3)

# Array arithmetic operations
c = a + 2
d = b * 3
print(c)  # output: [3 4 5 6 7]
print(d)  # output: [[ 3  6  9]
          #        [12 15 18]
          #        [21 24 27]]

# Array slicing
e = a[2:4]
f = b[1:, :2]
print(e)  # output: [3 4]
print(f)  # output: [[4 5]
          #        [7 8]]

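# Linear Algebra Operations
# b is 3x3, so the vector must have length 3; use the first three elements of a.
g = np.dot(b, a[:3])
# b is singular (its rows are linearly dependent) and has no inverse,
# so invert a small non-singular matrix instead.
m = np.array([[2.0, 1.0], [1.0, 1.0]])
h = np.linalg.inv(m)
print(g)  # output: [14 32 50]
print(h)  # output: [[ 1. -1.]
          #        [-1.  2.]]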

# Statistical functions
i = np.mean(a)
j = np.std(b)
print(i)  # output: 3.0
print(j)  # output: 2.581988897471611
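Broadcasting, listed among the features above, is not exercised much by the sample, so here is a small additional sketch. It also compares a pure-Python loop with the equivalent array expression. The variable names are arbitrary, and no timing is claimed here since the speed difference depends on array size and hardware.

```python
import numpy as np

# Broadcasting: a (3, 1) column combined with a (3,) row gives a (3, 3) grid.
col = np.array([[0], [10], [20]])
row = np.array([1, 2, 3])
grid = col + row
print(grid)

# The same sum of squares, once with a Python loop and once vectorized.
values = np.arange(1_000)
loop_result = sum(int(v) * int(v) for v in values)   # element by element
vector_result = int(np.sum(values * values))          # one array expression
print(loop_result == vector_result)
```

The vectorized form hands the whole computation to NumPy's compiled routines instead of running the Python interpreter once per element, which is where the speed advantage mentioned earlier comes from.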

About scikit-learn

scikit-learn is a Python open-source machine learning library that provides various machine learning capabilities, including supervised learning (classification, regression), unsupervised learning (clustering, dimensionality reduction), model selection, and preprocessing.

scikit-learn is integrated with the Python scientific computing libraries mentioned above, such as NumPy, SciPy, and Matplotlib, and provides an easy-to-use API. It implements a variety of machine learning algorithms that can be combined to build models, and it also provides tools for evaluating and tuning them.

Below is a sample code for building a classification model using Scikit-learn.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Loading Data
iris = load_iris()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

# Model Building
model = DecisionTreeClassifier()

# Model Training
model.fit(X_train, y_train)

# Model Evaluation
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

The code loads the Iris dataset with scikit-learn, splits it into training and test data, builds a decision tree classifier, trains the model on the training data, and finally evaluates the model on the test data and prints the accuracy.
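A single train/test split gives only one accuracy estimate. As a sketch of how algorithms can be combined and evaluated more robustly, the following example chains a scaler and the same classifier into a pipeline and scores it with 5-fold cross-validation. The choice of StandardScaler and of five folds is an assumption made for illustration (decision trees do not actually need scaled inputs), not something prescribed by the sample above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Chain preprocessing and the classifier into a single estimator.
pipeline = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

# 5-fold cross-validation: fit and score on five different splits.
scores = cross_val_score(pipeline, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Because the scaler sits inside the pipeline, it is refit on the training part of each fold, so no information from the held-out part leaks into preprocessing.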

Reference books for scikit-learn include, for example, “scikit-learn Cookbook – Second Edition”.

The book starts with basic methods and introduces preprocessing, dimensionality reduction, linear models, cross-validation, SVM, ensemble learning (as described in “Overview of Ensemble Learning and Examples of Algorithms and Implementations”), text analysis, multi-class classification, and even neural networks. The contents are listed below.

Chapter 1 Understanding the Machine Learning Framework - From NumPy to Pipelines
[1] NumPy Basics
[2] Reading Iris Data Sets
[3] Visualizing Iris datasets
[4] Visualizing Iris datasets with pandas
[5] Plotting with NumPy and matplotlib
[6] The Smallest Machine Learning Recipe: SVM Classification
[7] Introduction to Cross-validation
[8] Putting it all together in one place
[9] Machine Learning Overview: Classification and Regression

Chapter 2: Workflow and preprocessing before model building - from sample data preparation to stochastic gradient descent
[10] Creating sample data for a simple analysis
[11] Scaling data to the standard normal distribution
......
[18] Using stochastic gradient descent for regression

Chapter 3 Dimensionality Reduction - From PCA to Performance Testing
[19] Dimensionality Reduction Using PCA
[20] Using factor analysis for decomposition
......
[25] Testing Dimensionality Reduction Methods in the Pipeline

Chapter 4 Linear Models - From Linear Regression to LARS
[26] Fitting a Line to Data
[27] Using machine learning to fit a straight line to data
......
[32] A More Basic Approach to Regularization with LARS

Chapter 5: Logistic regression - from data loading to pipeline
[33] Loading data from the UCI Machine Learning Repository
[34] Visualizing the Pima Indians Diabetes dataset using pandas
......
[40] Plotting ROC curves without context
[41] Combining dataset loading and ROC curve plotting in one place: the UCI Breast Cancer dataset

Chapter 6: Building a model using distance metrics - from the k-means method to the k-nearest neighbor method
[42] Clustering of data using the k-means method
[43] Optimizing the number of centroids
......
[49] Detecting outliers using the k-means method
[50] Using the k-nearest neighbor (KNN) method for regression

Chapter 7 Cross-validation and Post-model building workflow - From model selection to persistence
[51] Selecting a model using cross-validation
[52] K-fold cross-validation
......
[63] Feature selection using the L1 norm
[64] Persistence of models using joblib or pickle

Chapter 8 Support Vector Machines - From Linear SVM to Support Vector Regression
[65] Classifying data using linear SVM
......
[68] Support Vector Regression

Chapter 9 Decision Tree Algorithms and Ensemble Learning
[69] Basic Classification Using Decision Trees
[70] Visualizing Decision Trees Using pydot
......
[77] Tuning the AdaBoost regressor
[78] Creating a stacking aggregator using scikit-learn

Chapter 10 Text Classification and Multi-class Classification
[79] Using stochastic gradient descent for classification
[80] Using Naive Bayes to Classify Documents
[81] Label propagation using semi-supervised learning

Chapter 11 Neural Networks
[82] Perceptron Classifiers
[83] Neural Networks: Multilayer Perceptron
[84] Stacking with Neural Networks

Chapter 12 Creating Simple Estimators
[85] Creating a Simple Estimator

Other Libraries

In addition, there are libraries for visualizing results (Bokeh, matplotlib, Seaborn), as well as IPython Notebook, which can record the data analysis process itself to ensure that the analysis is reproducible.

Python-based approaches to deep learning are also popular. They are described in detail in “Hello World of Neural Networks, Implementation of Handwriting Recognition with MNIST Data” and related articles, so please refer to those as well.

Furthermore, PyPy, Julia, and other alternatives have been explored as ways to address slow processing speed, which is a weak point of Python.

Python libraries are also available for Bayesian estimation, nonparametric Bayes, and Gaussian processes, as described in “On Stochastic Generative Models” and “On Nonparametric Bayes and Gaussian Processes”.

Practice and Reference Books

For exercises on specific topics, see “python and algorithms”, “Machine Learning with python”, “Statistical modeling with python”, and “Optimization methods with python”.
