Overview of probability and statistics, their philosophy, and libraries in various languages for practical use

About Probability and Statistics

Probability and statistics is a field of mathematics that deals with the likelihood of uncertain events and with areas such as data analysis and modeling.

Probability expresses, as a number, how likely an event is to occur. For example, the probability of rolling a 1 with a fair die is 1/6. There are multiple approaches to interpreting and calculating probability, including the frequentist and Bayesian approaches.
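
The frequentist view can be illustrated with a short simulation (a minimal sketch; the number of throws is chosen arbitrarily for illustration): the relative frequency of 1s approaches the theoretical value 1/6 as the number of throws grows.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Theoretical (classical) probability of rolling a 1 with a fair die
p_theoretical = 1 / 6

# Frequentist estimate: the relative frequency over many simulated rolls
n_trials = 100_000
rolls = [random.randint(1, 6) for _ in range(n_trials)]
p_empirical = sum(r == 1 for r in rolls) / n_trials

print(f"theoretical: {p_theoretical:.4f}")
print(f"empirical:   {p_empirical:.4f}")
```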

Statistics is a discipline that deals with the collection, analysis, and interpretation of data; statistical methods can be used to extract patterns from data and to make predictions about the future. Typical statistical methods include the mean, standard deviation, regression analysis, the t-test, and ANOVA.
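
As a minimal illustration of these methods (with made-up measurements for two hypothetical groups), the mean, standard deviation, and a two-sample t statistic can be computed with Python's standard library alone:

```python
import math
import statistics

# Hypothetical measurements from two groups (made-up data for illustration)
group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1]
group_b = [5.6, 5.4, 5.7, 5.5, 5.8, 5.3, 5.6]

mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)

# Welch's two-sample t statistic: a large |t| suggests the means differ
t = (mean_a - mean_b) / math.sqrt(var_a / len(group_a) + var_b / len(group_b))
print(f"mean A = {mean_a:.3f}, mean B = {mean_b:.3f}, t = {t:.2f}")
```

In practice one would typically call scipy.stats.ttest_ind, which also returns a p-value for the test.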

Probability and statistics are closely related, and statistics is sometimes described as an application of probability. Probability theory serves as the foundational theory of statistics, and statistical models built on probability theory are used for data analysis and machine learning.

Probability, Statistics and AI Technology

Probability/statistics and AI technologies are closely related, and many AI technologies are based on probabilistic/statistical methods.

For example, machine learning algorithms often use probabilistic and statistical methods to extract patterns from data and make predictions about unknown data. Typical machine learning algorithms include supervised learning, unsupervised learning, and reinforcement learning, which use statistical methods to extract patterns from data and build prediction models.

Probabilistic and statistical methods are also used in fields such as natural language processing and image processing. In natural language processing, probabilistic and statistical methods are used in language modeling and machine translation, while in image processing, probabilistic and statistical methods are used in image recognition and object detection.
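
For example, a statistical language model assigns probabilities to word sequences based on counts estimated from a corpus. A minimal bigram sketch (the toy corpus is made up for illustration):

```python
from collections import Counter

# Toy corpus (made up) for a bigram language model sketch
corpus = "the cat sat on the mat the cat ate the fish".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

# "the" occurs 4 times; it is followed by "cat" twice, so P(cat | the) = 0.5
print(bigram_prob("the", "cat"))
```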

Thus, in developing AI technologies, probabilistic and statistical knowledge is essential for selecting appropriate algorithms and models. The use of probabilistic and statistical methods is also important in evaluating the output results of AI technologies.

Next, we discuss the philosophical foundations of probability and statistics.

Philosophy of Statistics

The philosophy of statistics refers to the ideas and principles for collecting, analyzing, and interpreting data. Statistics is based on probability theory, mathematics, and information science, and is widely used as a scientific method.

The philosophy of statistics includes the following key principles:

  • Concept of Population and Sample: In statistics, the entire set of subjects of a survey or experiment is called the “population,” and the portion actually observed is called the “sample.” The objective is to infer the characteristics of the entire population from the results obtained from the sample.
  • Probability and Probability Distribution: In statistics, probability is used to express uncertainty and the variability of data as a probability distribution. Probability distributions are widely used as a tool to describe how data are distributed.
  • Estimation and Testing: In statistics, “estimation” is used to infer the characteristics of the entire population from sample data, and “testing” is used to verify hypotheses about it. These methods enable statistical inference.
  • Modeling: In statistics, models are constructed to analyze data, and the models are used to predict and interpret the data. Statistical models can capture the characteristics of data by simplifying phenomena.
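
The estimation principle above can be sketched as follows: a sample is drawn from a (here simulated) population, and a point estimate plus an approximate 95% confidence interval for the population mean is computed. All numbers are made up for illustration.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed for reproducibility

# Hypothetical population; in practice only the sample would be observed
population = [random.gauss(mu=50, sigma=10) for _ in range(100_000)]
sample = random.sample(population, 100)

# Point estimate of the population mean and its standard error
m = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(len(sample))

# Approximate 95% confidence interval using the normal quantile 1.96
ci = (m - 1.96 * se, m + 1.96 * se)
print(f"estimate {m:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f})")
```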

These statistical ideas are widely applied not only in data collection and analysis, but also in scientific reasoning and decision making.

Philosophy of Probability

Next, we discuss the philosophy of probability, that is, the ideas and principles behind quantifying the likelihood of future events. Probability is defined as a number between 0 and 1 that indicates how likely an event is to occur: the closer to 0, the less likely the event; the closer to 1, the more likely.

The philosophy of probability includes the following key principles:

  • Definition of probability: Probability is a number that indicates the likelihood of an event occurring, and the properties and axioms of probability are based on this definition.
  • Conditional Probability: Conditional probability indicates the likelihood that another event will occur under the condition that one event has occurred. Conditional probability is important for more sophisticated probability calculations using principles such as Bayes’ theorem.
  • Probability distribution: A probability distribution is a function that indicates what values a variable is likely to have and with what probability. There are several types of probability distributions, including the Bernoulli, binomial, and normal distributions.
  • Statistical Inference: Probability is an important concept underlying statistical inference. Statistical inference is used to infer population characteristics from a sample and employs methods such as probability distributions and statistical hypothesis testing.
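
The role of conditional probability and Bayes’ theorem can be shown with the classic diagnostic-test calculation (all rates below are made-up illustrative values):

```python
# Bayes' theorem sketch: P(disease | positive test), with made-up rates
p_disease = 0.01            # prior: 1% of the population has the disease
p_pos_given_disease = 0.95  # test sensitivity
p_pos_given_healthy = 0.05  # false-positive rate

# Total probability of a positive result (law of total probability)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior via Bayes' theorem: P(D | +) = P(+ | D) P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")
```

Despite the 95% sensitivity, the posterior probability is only about 16%, because the disease is rare: a standard illustration of why the prior matters.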

These probability ideas are applied in various fields of science, engineering, and business. This is the case, for example, in signal processing and communication engineering, where noise and errors are modeled probabilistically to restore and transmit signals, and in risk assessment and investment analysis, where uncertainty and risk are evaluated probabilistically.

Philosophy of Data Science

Finally, we discuss the philosophy underlying data science using these probability and statistics. Data science is based on the fields of statistics, machine learning, computer science, and database technology, which are widely used as scientific methods. The philosophy of data science refers to the ideas and principles for extracting valuable information from large amounts of data.

The philosophy of data science includes the following key principles:

  • Data collection and preprocessing: In data science, the first step in extracting valuable information from large amounts of data is data collection and preprocessing. For data analysis, it is necessary to collect appropriate data and improve data quality by processing missing values and removing outliers.
  • Data Visualization and Exploratory Data Analysis: In data science, data visualization is used to identify trends and patterns in the data, and exploratory data analysis is performed. In this phase, it is important to discover trends and patterns in the data and formulate hypotheses.
  • Machine Learning and Data Mining: In data science, machine learning and data mining are used to extract valuable information from data. Machine learning automatically learns patterns from data and can make predictions and classifications on unknown data, while data mining extracts knowledge from data to solve business and scientific problems.
  • Modeling and Evaluation: Data science uses modeling to analyze data and make predictions or classifications on unknown data. Models can predict and interpret data by capturing the characteristics of the data, and in evaluating models, it is important to assess their accuracy and generalization capabilities.
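
The modeling-and-evaluation step can be sketched with an ordinary least squares fit whose accuracy is assessed on held-out data (the data is synthetic, made up for illustration):

```python
import random
import statistics

random.seed(2)  # fixed seed for reproducibility

# Synthetic data (made up): y depends linearly on x plus Gaussian noise
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.5) for x in xs]

# Hold out the last 20 points to evaluate generalization
x_tr, y_tr = xs[:80], ys[:80]
x_te, y_te = xs[80:], ys[80:]

# Fit y = a*x + b by ordinary least squares on the training data
mx, my = statistics.mean(x_tr), statistics.mean(y_tr)
a = (sum((x - mx) * (y - my) for x, y in zip(x_tr, y_tr))
     / sum((x - mx) ** 2 for x in x_tr))
b = my - a * mx

# Evaluate with mean squared error on the held-out data
mse = statistics.mean((y - (a * x + b)) ** 2 for x, y in zip(x_te, y_te))
print(f"a = {a:.2f}, b = {b:.2f}, test MSE = {mse:.3f}")
```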

These philosophical questions around probability, statistics, and data science are taken up in the September 2020 issue of Gendai Shiso, a special issue on statistics/data science. A summary of its contents is given below.

Gendai Shiso September 2020 Special Issue – Statistics/Data Science Reading Notes


What is the Idea of “Statistics” Surrounding Us?
Against the backdrop of advances in big data and AI technologies, statistics/data science has become increasingly prominent since the 2010s. But what kind of “philosophy” lies behind it? In this special issue, we will take a fresh look at the history and current state of statistical science from the post-Corona point of view, and examine how it relates to our survival from various perspectives, including a philosophical perspective.

Special Feature* Statistics/Data Science

Discussion
Statistics in Society and Science / Hiroyuki Kojima + Nobuhiro Minaka

Statistical Literacy in the 2000s / Hiroshi Kambayashi
The History of Statistics in Japan from the Perspective of the "Statistical Distrust Problem" / Masahiro Sato
A Trial of Statistical Thought: For the Conscious User / Masahiro Matsuo

The Bayesian Era
The Revelation of Bayesian Statistics / Nozomu Matsubara

A Place Where Data Breathe
The Various Faces of Data: Data Science in the Ecosystem / Masato Fukushima
The Dynamics of Numbers: Quantification of Crime Solving and Its Background / Mai Suzuki
The History of Actuarial Science from the Perspective of Data Science: The History of Life Chart Creation / Shinji Suzuki
The Strange Proliferation of "Evidence": EBM and Society from the Viewpoint of the History of "Evidence" / Kazushi Matsumura
Possibilities and Limitations of "Evidence-Based Education" / Takutaka Terasawa

The Body and "Measurement"
Statistics in the With-Corona Era / Akimichi Takemura
What is the Number of Infections?
Epidemics and Sour Grapes: The Power Means of Tracing the Route of Infection / Luo Zhixian
Lively Data: Postwar Community Medicine and Health / Junko Kitanaka

History of Crossing and Border Crossing
The Relationship between Statistics and Mathematics / Masafumi Akahira
Psychology and Statistics: Looking to the Future through Historical Review / Tatsuya Sato
Maxwell's Statistical Knowledge and Free Will / Hajime Inaba
R is Free Software / Chigusa Kita

The Society Revealed by Numerical Figures
Kotoba no kenkyu to kanbun no kenkyu: 100-nen no ouroborosu / Toshiki Sato
Quantitative and Qualitative Research in Family Sociology: Toward Standardization of Qualitative Research / Natsuki Nagata
Can We Show "Evidence" of Discrimination / Kikuko Nagayoshi

Philosophy of Data Science
What We Do When We Talk with Data: From the Perspective of Analytical Pragmatism / Xi Zhe Zhu
Artificial Intelligence and the Unlanguageable / Desu Momoki

New Series: Transcending "Postwar Knowledge" - Part 1
The Year of Plague: Introduction / Ryuichi Narita 

A Scientist's Walk: Part 71
Robbed by Machines: The Expansion of Mechanics within Machines and the Mind as a Residual / Fumitaka Sato 

100 Years to the Posthumanities - Part 8
terra incognita / Mitsuki Asanuma 

Research Handbook
Measuring Altruistic Behavior / Yuta Kawamura

Data analysis is performed based on these basic concepts. Without them, tools alone cannot extract the desired information from the data.

Finally, we introduce libraries in various languages and reference books for putting probability and statistics into practice.

Probability and statistics practice (libraries in various languages)
  • Python: Python libraries used for probability and statistical analysis include NumPy, a basic library for numerical computation; Pandas, a library specialized for data analysis; SciPy, a library for scientific and technical computing; matplotlib, a library specialized for data visualization; seaborn, an advanced data visualization library built on matplotlib; and statsmodels, a library for statistical analysis (regression analysis, time series analysis, multivariate analysis, design of experiments, etc.).
  • R language: R libraries used for probability and statistical analysis include dplyr, a library for data manipulation; ggplot2, a library for data visualization; tidyr, a library for data reshaping (transposition, transformation, handling of missing values, etc.); MASS, a library for statistical analysis (linear regression, principal component analysis, clustering, etc.); stats, a library for probability and statistics (generalized linear models, mixture models, time series analysis, etc.); caret, a library for machine learning (regression, classification, clustering, feature selection); BayesFactor, a library for Bayesian statistics (calculation of Bayes factors, etc.); and lme4, a library for linear mixed models (random effects models, hierarchical models, repeated measures data, etc.).
  • Java: Java libraries used for probability and statistics include Apache Commons Math, a library for mathematical functions such as probability and statistics, linear algebra, optimization, and numerical analysis; Weka, a library for statistical analysis and machine learning (classification, regression, clustering, principal component analysis, time series analysis, feature selection, etc.); Smile, a fast machine learning library (regression, classification, clustering, collaborative filtering); Apache Spark MLlib, a machine learning library that processes large data sets in parallel; Colt, a library for numerical analysis (matrix operations, random number generation, etc.); and Mahout, a library for big data analysis (machine learning, clustering, dimensionality reduction, etc.).
  • Clojure: Clojure can use the Java libraries above natively, and R and Python libraries can also be called as they are. Clojure-specific libraries include Incanter, a library for statistical analysis, data visualization, and machine learning; Bayadera, a library for Bayesian statistical analysis that supports parameter estimation with MCMC methods, graphical-model inference, and visualization of posterior distributions; Sampling, a library for Monte Carlo simulation with functions such as random sampling and Markov chain Monte Carlo; and Clatrix, a matrix arithmetic library with functions such as PCA, linear regression, and matrix factorization.
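
As a minimal sketch of the Python stack listed above (assuming NumPy is installed; the data is synthetic and made up for illustration), NumPy alone covers descriptive statistics and a simple least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator for reproducibility

# Synthetic normally distributed scores (made-up parameters)
data = rng.normal(loc=100, scale=15, size=1000)
print("mean:", data.mean())
print("std: ", data.std(ddof=1))  # sample standard deviation

# Fit a straight line y ~ a*x + b with numpy.polyfit (degree 1)
x = np.arange(50)
y = 3.0 * x + 7.0 + rng.normal(0, 2.0, size=50)
a, b = np.polyfit(x, y, 1)
print(f"slope = {a:.2f}, intercept = {b:.2f}")
```

For p-values, model diagnostics, and richer regression output, one would move up the stack to SciPy or statsmodels.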
Reference Books and Articles

For reference books on the theory and history of probability and statistics, see "Probability Theory for Beginners: A Reading Memo," "Introduction to Probability Theory: A Reading Memo," "Nine Stories of Probability and Statistics that Changed Humans and Society: A Reading Memo," and "134 Stories of Probability and Statistics that Changed the World: A Reading Memo." For specific implementations and applications, see "Statistical Modeling with Python," "Statistical Analysis and Correlation Evaluation Using Clojure/Incanter," "Probability Distributions Used in Probabilistic Generative Models," etc.
