Statistical Hypothesis Testing and Machine Learning Techniques

INTRODUCTION

Statistical Hypothesis Testing is a statistical method for probabilistically evaluating whether a hypothesis is supported by the data. It is used not only to evaluate statistical methods, but also in machine learning to assess the reliability of predictions and to select and evaluate models. It also underlies the evaluation of feature selection described in “Explainable Machine Learning” and the verification of discrimination performance between normal and abnormal states described in “Anomaly Detection and Change Detection Technology”, making it a fundamental technology. This section describes various statistical hypothesis testing methods and their concrete implementations.

Statistical Hypothesis Testing Overview

Statistical Hypothesis Testing is a method in statistics that probabilistically evaluates whether a hypothesis is consistent with the data, and is primarily based on the following steps:

  1. Hypothesis formulation: The first step begins with the researcher formulating the hypothesis to be tested. In general, the following two hypotheses are considered:
    • Null Hypothesis (H0): The hypothesis that the effect or association of interest does not exist, for example that two means are equal or that two variables are independent.
    • Alternative Hypothesis (H1 or Ha): The hypothesis that opposes the null hypothesis, for example that the two means differ or that the variables are associated.
  2. Data collection and summarization: Collect the necessary data, then organize and summarize them appropriately. The method of summarization depends on the type and purpose of the data.
  3. Calculating the test statistic: A test statistic is computed from the collected data. This statistic indicates how likely it is that such data would be observed if the null hypothesis were true. The choice of statistic depends on the type of hypothesis test and the nature of the data, e.g., t-test, chi-square test, ANOVA.
  4. Setting the significance level: The significance level is the probability threshold used as the criterion for rejecting the null hypothesis. A typical value is 0.05 (5%), but it can be changed depending on the purpose and field of study.
  5. Interpretation of results and judgment: Based on the computed statistic, the corresponding p-value is determined and compared with the significance level. If the p-value is less than the predetermined significance level, the null hypothesis is rejected and the alternative hypothesis is adopted. Conversely, if the p-value is at or above the significance level, the null hypothesis is not rejected and the data are interpreted as not providing evidence against it.

Statistical hypothesis testing is a widely used method for making objective, data-driven judgments in scientific research and practical problem solving. However, care must be taken in interpreting test results, and it is important to correctly distinguish between statistical significance and practical significance. A minimal end-to-end sketch of the steps above follows.
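
The following is a minimal sketch illustrating the five steps above with a one-sample t-test in Python’s scipy library; the data and the hypothesized population mean of 10 are arbitrary assumptions for this example.

import numpy as np
from scipy.stats import ttest_1samp

# 1-2. Formulate hypotheses (H0: the population mean is 10) and collect/summarize data
np.random.seed(0)
sample = np.random.normal(10.5, 2, 40)  # hypothetical measurements

# 3. Compute the test statistic (t-value) and the corresponding p-value
t_value, p_value = ttest_1samp(sample, popmean=10)

# 4. Set the significance level
alpha = 0.05

# 5. Interpret: reject H0 if the p-value is below the significance level
print("t-value:", t_value, "p-value:", p_value)
print("Reject the null hypothesis" if p_value < alpha else "The null hypothesis is not rejected")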

Types of statistical hypothesis testing

There are various types of statistical hypothesis tests, each used for different purposes and depending on the nature of the data. The following is a description of typical types of statistical hypothesis testing.

t-test

<Overview>

The t-test is a method for testing whether there is a difference between two means; it uses the means and variances of two samples to test the null hypothesis (the means are equal) against the alternative hypothesis (the means differ). There are two main variants: the independent two-sample t-test and the paired t-test.

The independent two-sample t-test and the paired t-test are described below.

  • Independent two-sample t-test (Independent Samples t-test): This test evaluates whether the difference between the means of two independent samples is statistically significant. It is used when comparing the means of two different groups (e.g., a group receiving a new drug and a group receiving a placebo). The hypotheses are as follows:
    • The null hypothesis (H0): The means of the two groups are equal.
    • Alternative hypothesis (Ha): The means of the two groups are not equal.
  • Paired Samples t-test: The paired t-test evaluates whether the change in means before and after, measured on the same subjects, is statistically significant. It is used, for example, when comparing pre-treatment and post-treatment data. The hypotheses are as follows:
    • The null hypothesis (H0): the difference in means is zero (no change).
    • Alternative hypothesis (Ha): the difference in means is non-zero (there is a change).

The procedure for the t-test is as follows.

  1. Calculate the mean and the standard deviation from the sample.
  2. Calculate the difference between the two means and determine the standard error.
  3. Calculate the t-value, which is the difference of the means divided by the standard error.
  4. Based on the t-value, determine the p-value using a t-distribution table or statistical software.
  5. The calculated p-value is compared to a pre-defined significance level (usually 0.05) to determine if the null hypothesis should be rejected.

If the p-value is less than the significance level, the null hypothesis is rejected and the alternative hypothesis is adopted; otherwise the null hypothesis is not rejected. Whether to accept a negative result or conduct further investigation depends on the purpose and context of the study.
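
Before turning to library functions, the following is a hedged sketch of steps 1-4 computed by hand for two independent samples, assuming equal variances and a pooled standard error; the data are arbitrary.

import numpy as np
from scipy.stats import t

np.random.seed(0)
x1 = np.random.normal(10, 2, 30)
x2 = np.random.normal(12, 2, 30)

# 1. Means and standard deviations of each sample
m1, m2 = x1.mean(), x2.mean()
s1, s2 = x1.std(ddof=1), x2.std(ddof=1)
n1, n2 = len(x1), len(x2)

# 2. Pooled variance and the standard error of the difference of means
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))

# 3. t-value = difference of the means divided by the standard error
t_value = (m1 - m2) / se

# 4. Two-sided p-value from the t-distribution with n1 + n2 - 2 degrees of freedom
p_value = 2 * t.sf(abs(t_value), df=n1 + n2 - 2)
print("t-value:", t_value, "p-value:", p_value)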

<Implementation>

Here we show how to implement the independent two-sample t-test and the paired t-test using Python’s scipy library.

The independent two-sample t-test tests whether the means of two independent samples are statistically significantly different.

import numpy as np
from scipy.stats import ttest_ind

# Creating dummy data
np.random.seed(42)
group1 = np.random.normal(10, 2, 30)
group2 = np.random.normal(12, 2, 30)

# Running an independent two-group t-test
statistic, p_value = ttest_ind(group1, group2)

# Display p-value
print("p値:", p_value)

# Set significance level
alpha = 0.05

# Compare p-values and significance levels to determine hypotheses
if p_value < alpha:
    print("Reject the null hypothesis: the means of the two groups are statistically significantly different")
else:
    print("The null hypothesis is not rejected: the means of the two groups are not statistically significantly different")

The paired t-test tests whether the means of two measurements on the same subjects are statistically significantly different.

import numpy as np
from scipy.stats import ttest_rel

# Creating dummy data
np.random.seed(42)
before = np.array([15, 20, 25, 30, 35])
after = np.array([12, 18, 24, 29, 32])

# Running a paired t-test
statistic, p_value = ttest_rel(before, after)

# Display p-value
print("p値:", p_value)

# Set significance level
alpha = 0.05

# Compare p-values and significance levels to determine hypotheses
if p_value < alpha:
    print("Reject the null hypothesis: the means of the two measurements are statistically significantly different")
else:
    print("The null hypothesis is not rejected: the means of the two measurements are not statistically significantly different")

In the above example, the ttest_ind function implements the t-test for two independent groups and the ttest_rel function implements the paired t-test. The obtained p-value is used to determine whether to reject the null hypothesis.
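
When equal variances cannot be assumed between the two groups, ttest_ind also supports Welch’s t-test via its equal_var argument; the following minimal sketch mirrors the earlier dummy data but gives the second group a larger spread.

import numpy as np
from scipy.stats import ttest_ind

np.random.seed(42)
group1 = np.random.normal(10, 2, 30)
group2 = np.random.normal(12, 3, 30)  # different standard deviation to motivate Welch's test

# Welch's t-test: does not assume equal variances between the two groups
statistic, p_value = ttest_ind(group1, group2, equal_var=False)
print("p-value (Welch):", p_value)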

Chi-squared test

<Overview>

The chi-square test is a method for testing the goodness of fit or the independence of categorical data. In particular, it compares observed frequencies with expected frequencies to test the null hypothesis (the categorical data follow the theoretical distribution) against the alternative hypothesis (they do not). There are two main types of chi-square test:

  • Goodness-of-Fit Test: This is used to test whether a given set of categorical data conforms to a theoretical distribution assumed a priori. Specifically, the observed number of occurrences of each category is compared to the number of occurrences predicted based on the theoretical distribution, and the results of the Chi-square test are used to evaluate whether there is a statistically significant difference between the observed data and the theoretical expectation.
  • Independence Test: The chi-square independence test is used to test whether two or more categorical variables are independent of each other. This is done by creating a cross-tabulation (contingency) table and evaluating the difference between the observed data and the data expected under independence. The independence test is useful for investigating the association or influence between two variables.

The procedure for the chi-square test is to first compute the difference between the observed frequencies and the theoretical expectations (or the frequencies expected under independence), summarize it as the chi-square statistic, and then determine whether that statistic is statistically significant based on the degrees of freedom and the significance level.
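
As a hedged sketch of this computation for the goodness-of-fit case (with arbitrary frequencies), the chi-square statistic is the sum of (observed - expected)^2 / expected, and the p-value comes from the chi-square distribution with (number of categories - 1) degrees of freedom.

import numpy as np
from scipy.stats import chi2

observed = np.array([10, 15, 12, 18, 20, 25])
expected = np.full(6, observed.sum() / 6)  # expected frequencies under a uniform distribution

# Chi-square statistic: sum of (O - E)^2 / E over all categories
chi2_value = np.sum((observed - expected) ** 2 / expected)

# p-value from the chi-square distribution (df = number of categories - 1)
p_value = chi2.sf(chi2_value, df=len(observed) - 1)
print("chi2:", chi2_value, "p-value:", p_value)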

However, the chi-square test requires several assumptions to be met. For example, the expected frequency of each cell must not fall below a certain threshold. In addition, the chi-square test is effective when the sample size is large, and should be applied with caution when using small sample sizes.

The chi-square test is a powerful statistical method widely used to analyze categorical data and investigate associations, but appropriate assumptions and careful interpretation are important.

<Implementation>

Here we show how to implement the chi-square test (chi-square goodness-of-fit test and chi-square independence test) using Python’s scipy library.

The chi-square goodness-of-fit test tests whether the difference between observed and expected frequencies can be attributed to chance. As an example, consider testing whether the distribution of dice rolls is uniform.

import numpy as np
from scipy.stats import chisquare

# Creating dummy data
observed_frequencies = np.array([10, 15, 12, 18, 20, 25])  # Observed frequency of each dice face
expected_frequencies = np.full(6, observed_frequencies.sum() / 6)  # Expected frequencies under a uniform distribution (must sum to the observed total)

# Running a Chi-square goodness-of-fit test
statistic, p_value = chisquare(observed_frequencies, f_exp=expected_frequencies)

# Display p-value
print("p値:", p_value)

# Set significance level
alpha = 0.05

# Compare p-values and significance levels to determine hypotheses
if p_value < alpha:
    print("Reject the null hypothesis: the distribution of stakes is not uniform.")
else:
    print("The null hypothesis is not rejected: the distribution of stakes is uniform")

The chi-square independence test tests whether two variables are independent of each other. As an example, consider testing whether gender and favorite sport choice are independent.

import numpy as np
from scipy.stats import chi2_contingency

# Creating dummy data
data = np.array([[50, 30], [40, 60]])  # Row: Gender, Column: Sport Selection

# Running a Chi-square Independence Test
statistic, p_value, dof, expected = chi2_contingency(data)

# Display p-value
print("p値:", p_value)

# Set significance level
alpha = 0.05

# Compare p-values and significance levels to determine hypotheses
if p_value < alpha:
    print("Rejecting the null hypothesis: gender and sport choice are not independent")
else:
    print("The null hypothesis is not rejected: gender and sport choice are independent")

In the above example, the chisquare function is used to implement a chi-square goodness-of-fit test and the chi2_contingency function is used to implement a chi-square independence test. The obtained p-values are used to determine whether to reject the null hypothesis.

ANOVA (Analysis of Variance)

<Overview>

ANOVA extends the t-test, which compares the means of two groups, to comparisons among three or more groups. There are two main variants: one-way analysis of variance and two-way analysis of variance.

  • One-Way ANOVA: One-way ANOVA analyzes a continuous objective variable that is affected by a single explanatory variable (group or condition). It is applied to multiple groups to test whether there is a statistically significant difference in means among those groups. The hypotheses are as follows:
    • The null hypothesis (H0): the means of all groups are equal.
    • Alternative hypothesis (Ha): the mean of at least one group differs from the others.
  • Two-Way ANOVA: Two-Way ANOVA analyzes a continuous objective variable that is affected by two explanatory variables (factors). This allows simultaneous evaluation of the effects of the two explanatory variables and their interactions. This technique is used to examine main and interaction effects.

The procedure for ANOVA is as follows.

  1. Calculate the mean and standard deviation for each group from the sample.
  2. Calculate the variability between the group means (between-group variability) and within each group (within-group variability).
  3. Compute the F-value as the ratio of the between-group variability to the within-group variability.
  4. Based on the F-value, determine the p-value using the F-distribution table or statistical software.
  5. The calculated p-value is compared to a pre-defined significance level (usually 0.05) to determine if the null hypothesis should be rejected.

If the p-value is less than the significance level, the null hypothesis is rejected, indicating that the mean of at least one group differs statistically significantly from the others.
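
Before using a library function, the following is a hedged sketch of steps 1-4 computed by hand for a one-way layout; the three groups are arbitrary dummy data.

import numpy as np
from scipy.stats import f

np.random.seed(0)
groups = [np.random.normal(m, 2, 30) for m in (10, 12, 15)]
k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares and the F-value (ratio of between-group to within-group mean squares)
df_between, df_within = k - 1, n_total - k
f_value = (ss_between / df_between) / (ss_within / df_within)

# p-value from the F-distribution
p_value = f.sf(f_value, df_between, df_within)
print("F-value:", f_value, "p-value:", p_value)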

<Implementation>

Here we show how to implement a simple one-way ANOVA using Python’s scipy library.

The assumption is that we have the following data: there are multiple groups, and we perform an ANOVA to determine if the means of each group are the same.

import numpy as np
from scipy.stats import f_oneway

# Creating dummy data
np.random.seed(42)
group1 = np.random.normal(10, 2, 30)  # Mean 10, Standard deviation 2
group2 = np.random.normal(12, 2, 30)  # Mean 12, Standard deviation 2
group3 = np.random.normal(15, 2, 30)  # Mean 15, Standard deviation 2

# Treat all three groups together as data.
data = [group1, group2, group3]

# Perform a one-way ANOVA
statistic, p_value = f_oneway(*data)

# Display p-value
print("p値:", p_value)

# Set significance level
alpha = 0.05

# Compare p-values and significance levels to determine hypotheses
if p_value < alpha:
    print("Reject the null hypothesis: means of at least one group are different")
else:
    print("The null hypothesis is not rejected: all groups have the same mean")

The code creates three dummy data groups and uses ANOVA to test whether the means of each group are the same. The resulting p-value is used to determine if the null hypothesis (all groups have the same mean) should be rejected.

The f_oneway function takes multiple groups as input and computes the one-way ANOVA statistic (F-value) and p-value; the p-value is then compared to the chosen significance level to determine whether the null hypothesis should be rejected.

Regression Analysis

<Overview>

Regression analysis is a method for examining the relationship between variables; it models the relationship between explanatory variables (independent variables) and an objective variable (dependent variable) and evaluates that relationship. Simple regression analysis examines the relationship between one explanatory variable and the objective variable, while multiple regression analysis examines the relationship between multiple explanatory variables and the objective variable; in both cases the significance of the coefficients and the goodness of fit of the model as a whole are tested.

  • Simple Linear Regression: Simple linear regression analysis is a method that models the relationship between one explanatory variable and one objective variable. A linear regression model is used to evaluate the impact of the explanatory variable on the objective variable and how much of the variation in the objective variable can be explained by the explanatory variable.
  • Multiple Linear Regression: Multiple linear regression analysis is a technique that models the relationship between two or more explanatory variables and an objective variable. The influence of several explanatory variables on the objective variable is simultaneously evaluated, and the coefficient of each explanatory variable and the goodness of fit of the model as a whole are assessed.

The main elements of statistical hypothesis testing in regression analysis concern the coefficients of the explanatory variables, and the following hypotheses are usually considered:

  • The null hypothesis (H0): the coefficient of the explanatory variable is zero and has no effect on the objective variable.
  • Alternative hypothesis (Ha): the coefficient of the explanatory variable is non-zero and has an effect on the objective variable.

Statistical hypothesis testing is done through a t-test on the coefficients of the explanatory variables. The specific procedure is as follows.

  1. Construct a regression model and estimate the coefficients and intercepts of the explanatory variables.
  2. Calculate t-values for the coefficients of each explanatory variable, where the t-value is the coefficient divided by the standard error.
  3. Based on the t-values, determine the p-values using a t-distribution table or statistical software.
  4. The calculated p-value is compared to a pre-defined significance level (usually 0.05) to determine if the null hypothesis should be rejected.

If the p-value is less than the significance level, the null hypothesis is rejected and the alternative hypothesis is supported. In other words, the explanatory variable is shown to have a statistically significant effect on the objective variable. When interpreting the results of a regression analysis, it is important to consider not only the p-value, but also the actual magnitude and significance of the coefficient.

<Implementation>

The following example implements regression analysis and t-tests using Python’s statsmodels library.

The assumption is that we have the following data: an independent variable X and a dependent variable y. A simple regression analysis is performed to examine the effect of X on predicting y.

import numpy as np
import statsmodels.api as sm

# Creating dummy data
np.random.seed(42)
X = np.random.rand(50) * 10
y = 2 * X + 3 + np.random.randn(50)

# Add constant term
X = sm.add_constant(X)

# Construction of a simple regression model
model = sm.OLS(y, X).fit()

# Display results of regression analysis
print(model.summary())

The code creates dummy data, builds a simple regression model, and displays the results. The regression output includes the regression coefficients and their t-values.

In addition, statistical hypothesis testing is performed based on the results obtained from the regression analysis. As an example, consider testing whether the coefficient of the independent variable in a simple regression analysis is zero (no effect). In this case, the null hypothesis (H0) is “coefficient = 0” and the alternative hypothesis (H1) is “coefficient ≠ 0”.

The output of model.summary() also displays the t-values and p-values of the coefficients. From here, the p-value is obtained and a statistical hypothesis test is performed.

# Run a t-test
t_statistic = model.tvalues[1]  # t-values of the coefficients of the independent variables
p_value = model.pvalues[1]  # p-value of the coefficient of the independent variable

# Show p-value
print("p値:", p_value)

# Set significance level
alpha = 0.05

# Compare p-values and significance levels to determine hypotheses
if p_value < alpha:
    print("Reject the null hypothesis: coefficient is not zero")
else:
    print("The null hypothesis is not rejected: the coefficient is not different from zero")

In this example, the obtained p-value is compared to a pre-defined significance level (0.05 in this case) to determine whether to reject the null hypothesis. If the p-value is less than the significance level, we reject the null hypothesis and conclude that the coefficient of the independent variable is non-zero.
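
To also gauge the magnitude of the effect rather than only its significance, statsmodels can report confidence intervals for the estimated coefficients; the following minimal sketch reuses the fitted model from the code above.

# 95% confidence intervals for the intercept (row 0) and the coefficient of X (row 1)
conf_int = model.conf_int(alpha=0.05)
print("95% CI for the coefficient of X:", conf_int[1])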

Nonparametric Tests

<Overview>

A non-parametric test is a statistical hypothesis testing technique used when it is difficult to make a priori assumptions about the nature of the data, for example when the data do not follow a normal distribution; it tests a hypothesis without relying on a particular data distribution. Because of this, nonparametric tests are robust to departures from normality and to outliers.

Some representative nonparametric tests are described below.

  • Wilcoxon Signed-Rank Test: This method is used to evaluate whether the difference between the means or medians of two paired samples is statistically significant. The hypothesis is tested using the rank of the data.
  • Mann-Whitney U Test: This method is used to evaluate if the difference between the means or medians of two independent samples is statistically significant. It compares two samples using the rank order of the data.
  • Kruskal-Wallis test: This method is used to evaluate if the difference between the means or medians of three or more independent samples is statistically significant. It is a non-parametric test corresponding to ANOVA.
  • Friedman test: This method evaluates whether the difference between the means or medians of three or more paired (repeated) samples is statistically significant. It is a nonparametric counterpart of repeated-measures ANOVA.

The procedure for a nonparametric test is to compute the rank or order of the data and then compute a statistical measure (e.g., U-value) based on it. This measure is used to determine if the null hypothesis is rejected. Non-parametric tests can be a useful approach, especially when the data do not follow a normal distribution or when the sample size is small.

However, since nonparametric tests can discard information contained in the raw values, it is sometimes better to use parametric tests when their assumptions are reasonably satisfied.

<Implementation>

Here we present an example implementation of the Wilcoxon Signed-Rank Test and the Mann-Whitney U Test (Wilcoxon Rank-Sum Test), which are representative nonparametric tests.

The following code is a simple example of running the Wilcoxon signed-rank test and the Mann-Whitney U test using Python’s scipy library.

import numpy as np
from scipy.stats import wilcoxon, mannwhitneyu

# Example: Wilcoxon signed-rank test
# Two related samples (with correspondence)
data_before = np.array([15, 20, 25, 30, 35])
data_after = np.array([12, 18, 24, 29, 32])

# Wilcoxon signed-rank test
statistic, p_value = wilcoxon(data_before, data_after)
print("Wilcoxon signed-rank test results:")
print("statistic:", statistic)
print("p-value:", p_value)
print()

# Example: Mann-Whitney U test
# Two independent samples (no correspondence)
group1 = np.array([23, 28, 32, 35, 40])
group2 = np.array([18, 25, 30, 33, 36])

# Mann-Whitney U test
statistic, p_value = mannwhitneyu(group1, group2)
print("Mann-Whitney U test results:")
print("statistic:", statistic)
print("p-value:", p_value)

Statistical Hypothesis Testing Challenges and Remedies

<Challenges>

Statistical hypothesis testing is a powerful tool, but there are some challenges and limitations. Some of the main challenges are discussed below.

  1. Impact of sample size: Small sample sizes can limit the power of statistical tests. Small sample sizes reduce the power to detect true effects and associations, leading to erroneous conclusions.
  2. Multiple comparisons problem: When many comparisons are performed, the probability of falsely rejecting at least one null hypothesis increases, potentially leading to erroneous results. Methods such as the Bonferroni correction are used to control this, but even then it can be difficult to eliminate all errors.
  3. Assumption of exact distribution: Hypothesis testing methods are based on the assumption that the data follow a particular probability distribution. However, it is not always guaranteed that the actual data follow that distribution exactly. In particular, it can be affected by small sample sizes and outliers.
  4. Interpretation and significance of results: In addition to statistical significance, actual significance must also be considered. Even if statistically significant, it is important to consider the magnitude and practical significance of the effect. Even if a small effect is determined to be statistically significant, whether it is important in practice is a separate issue.
  5. Selection Bias: Selection bias can occur when an analyst tries multiple hypothesis tests and selects significant results to report. This increases the risk of false positives (false significance).
  6. Discontinuity of effects: When an actual phenomenon or effect is continuous, comparisons with discrete null hypotheses may not yield appropriate conclusions. Alternative methods may be necessary to properly model continuous effects.

<Remedies>

The following measures are available to address the challenges of statistical hypothesis testing.

  1. Actions to address the effect of sample size: When sample sizes are small, the power to detect effects is reduced. Using a larger sample size or estimating an appropriate effect size can improve the reliability of the results.
  2. Countermeasures against the multiple comparisons problem: To reduce errors due to multiple comparisons, significance levels can be corrected using methods such as the Bonferroni or Holm correction (see the sketch after this list). Another way to avoid arbitrary multiple comparisons is to perform specific hypothesis tests based on explicit predictions.
  3. A response to the assumption of an exact distribution: For non-normally distributed data, nonparametric tests can be used to perform hypothesis testing without resorting to distributional assumptions. The distribution may also be estimated using the bootstrap method.
  4. Interpretation of Results and Measures of Significance: In addition to statistical significance, it can be useful to calculate confidence intervals and effect sizes to assess effect sizes and practical implications. It is also important to evaluate the significance of the results through discussions with experts and stakeholders.
  5. Strategies to address selection bias: A preplanned plan of analysis prior to conducting a hypothesis test can reduce selection bias. It is also important to report results honestly and to clearly state if additional research or reconsideration is needed.
  6. Measures to address discontinuities in effects: To model continuous effects, methods such as regression analysis can be used to obtain more realistic results. It is also important to select the appropriate test method depending on the nature of the effect.
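
As a hedged illustration of item 2 above, the statsmodels library provides the multipletests function for correcting a set of p-values; the p-values below are arbitrary example numbers.

from statsmodels.stats.multitest import multipletests

# Arbitrary example p-values from several hypothesis tests
p_values = [0.01, 0.04, 0.03, 0.20]

# Bonferroni correction (method="holm" would apply the Holm correction instead)
reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
print("Corrected p-values:", p_corrected)
print("Rejected hypotheses:", reject)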

Application of Statistical Hypothesis Testing in Practice

Statistical Hypothesis Tests are used in a variety of real-world problems and situations. Some examples of applications are described below.

  1. Medical Research:
    • Evaluating the efficacy of a new drug: Statistical comparisons are made between treatment and control groups to verify the therapeutic effect of a new drug.
    • Epidemiological studies: statistical comparisons between patient and control groups to investigate risk factors for a particular disease.
  2. Business Analysis:
    • Market research: to compare the purchasing behavior of customers in different market segments to determine if specific factors are influencing them.
    • Advertising Effectiveness Measurement: to compare the effectiveness of different advertising strategies and to examine which advertisements are most effective.
  3. Social Science:
    • Educational intervention evaluation: compare pre- and post-education performance to assess the effectiveness of a new educational program.
    • Survey analysis: to compare survey response patterns among different population groups to determine if there are differences in opinion.
  4. Engineering and Quality Control:
    • Product Quality Assessment: Examine the impact of different manufacturing processes and materials on product quality.
    • Quality Improvement Projects: When a new process is introduced to improve quality, compare the quality of the old process with that of the new process.
  5. Environmental Science:
    • Environmental Impact Assessment: examines the impact of a specific activity (e.g., construction of a factory, construction of a new road) on the environment and evaluates the magnitude of the impact.
    • Environmental Policy Evaluation: Evaluate the environmental impact of introducing different environmental policies and determine the most appropriate policy.

Examples of Applying Statistical Hypothesis Testing to Machine Learning Techniques

The concept of statistical hypothesis testing is widely applied in machine learning techniques. The following are examples of the application of statistical hypothesis tests to machine learning.

  1. Feature Selection and Feature Extraction:
    • Feature usefulness evaluation: In training machine learning models, which features contribute to the prediction of the target variable is statistically evaluated, and important features are selected or eliminated.
  2. Model Selection and Evaluation:
    • Model comparison: comparing the performance of multiple models and selecting the best model using statistical methods, including A/B testing (see the sketch at the end of this section).
    • Parameter Tuning: In the selection of hyperparameter values, statistical tests are sometimes used to find the optimal hyperparameters.
  3. Anomaly Detection:
    • Evaluation of anomaly detection models: Statistical evaluation of the performance of anomaly detection algorithms to verify their ability to discriminate between normal and abnormal conditions.
  4. Reliability evaluation:
    • Prediction reliability evaluation: To evaluate how reliable the predictions of machine learning models are, confidence intervals are calculated using statistical methods.
  5. Domain Adaptation:
    • Evaluation of domain adaptation: When using models in different domains, statistical hypothesis testing may be used to select appropriate adaptation methods.
  6. Adversarial Attack Detection:
    • Evaluating adversarial attack detection models: Statistical hypothesis tests may be used to evaluate a model’s resistance to adversarial attacks.

Statistical hypothesis tests can be a useful method in many aspects of machine learning, including model evaluation, selection, adaptation, and reliability assessment.
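
As a hedged sketch of the model-comparison case mentioned above, the fold-wise scores of two classifiers can be compared with a paired t-test; the dataset, models, and cross-validation setup below are arbitrary choices for illustration, and this simple approach ignores the dependence between folds.

import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=42)

# Fold-wise accuracy of two candidate models evaluated on the same folds
scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=cv)

# Paired t-test on the fold-wise scores
statistic, p_value = ttest_rel(scores_a, scores_b)
print("p-value:", p_value)

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: the two models' mean scores differ")
else:
    print("The null hypothesis is not rejected: no significant difference between the models")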

Reference Information and Reference Books

For a mathematical approach to statistics, see “Mathematics in Machine Learning”. Also see “General Machine Learning and Data Analysis”, “Noise Removal, Data Cleansing, and Missing Value Interpolation in Machine Learning”, and “Explainable Machine Learning” for approaches in machine learning.

Reference books include:

  • Testing Statistical Hypotheses
  • Statistical Hypothesis Testing with SAS and R
  • Testing Statistical Hypotheses of Equivalence
