Machine Learning Professional Series “Submodular Optimization and Machine Learning” reading notes
Submodular functions are a concept corresponding to convex functions on discrete variables and are an important concept in combinatorial optimization problems. "Combinatorial" here refers to the procedure of selecting a part of a set of possible choices, and to the various computational properties associated with it. Submodular functions are an important tool for finding efficient solutions to optimization and decision-making problems, and they are applied in areas such as social network analysis, image segmentation, query optimization, and combinatorial optimization.
Here we describe approaches that use submodular functions, based on the Machine Learning Professional Series book "Submodular Optimization and Machine Learning".
We cannot simply wait for machines to get faster. This book explains how to apply combinatorial optimization methods to key machine learning problems, focusing on examples that can be run in an ordinary computing environment. In addition to the theory of optimization for NP-hard problems, it also discusses how to speed up submodular optimization algorithms using data structures.
Preface
The theory of submodular functions and their optimization is a central topic in the field of combinatorial optimization
The concept of a submodular function corresponds to a convex function on discrete variables
1970 paper by Edmonds
Chapter 1: Submodularity in Learning
1.1 Introduction
What is a combination?
The procedure of "selecting a part of some collection of selectable items"
Combinatorial computation plays an important role in machine learning
Examples of combinations
Hospital patient data
Predicting required length of stay for new patients
Machine learning approach
Build a regression model from various patient tests/examinations to length of stay and apply it to new patients
Selecting irrelevant "items" will degrade the performance of the regression model
Combinatorial computation, in which a subset of useful items is selected for use, is important for obtaining a valuable regression model
Selecting a portion of a collection of available targets (or elements)
Optimize a discrete function called a set function
Entirety of the object to be selected
Ground set
V={1,… ,n}
Set function
Assigns a real number to each subset S (⊆ V) of a ground set V
Set function f: 2^V → ℝ
Collection of subsets of V (power set): 2^V
Figure Example
A ground set consisting of 7 elements
Set function assigns a real number to a subset S of 3 elements
In set function optimization, the function f represents some goodness (or badness) of each subset S and maximizes (or minimizes) it.
Given a ground set V, the number of subsets S ⊆ V grows exponentially with the number of elements |V| of V
Example
4 subsets when |V|=2
1024 subsets when |V|=10
About 10^30 subsets when |V|=100
About 10^301 subsets when |V|=1000
1.2 Introduction to Submodularity
Submodularity is the counterpart of convexity for set functions
1.2.1 Definition of a submodular function and its intuitive interpretation
Definition.
f(S) + f(T) ≥ f(S∪ T) + f(S∩T) (∀S,T ⊆ V)
A function f satisfying this inequality is said to satisfy submodularity
A function satisfying submodularity is called a submodular function
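As a concrete illustration (my own sketch, not from the book), the following Python snippet checks the inequality f(S) + f(T) ≥ f(S∪T) + f(S∩T) by brute force over all pairs of subsets of a small ground set:

```python
from itertools import combinations
import math

def is_submodular(f, V):
    """Check f(S) + f(T) >= f(S | T) + f(S & T) for all pairs of subsets S, T of V."""
    subsets = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]
    return all(f(S) + f(T) >= f(S | T) + f(S & T) - 1e-12
               for S in subsets for T in subsets)

V = {1, 2, 3, 4}
# f(S) = sqrt(|S|), a concave function of the cardinality, is submodular
print(is_submodular(lambda S: math.sqrt(len(S)), V))   # True
# f(S) = |S|^2, a convex function of the cardinality, is not
print(is_submodular(lambda S: len(S) ** 2, V))         # False
```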
Intuitive explanation
Consider the cases V={1} and V={1,2}
Preparation
Correspondence between a set function and a real-valued function (pseudo-Boolean function) defined on 0-1 vectors
Given a ground set V
The ground set is just a plain set of elements
A set with no additional structure ("bare bones")
Each subset S ⊆ V corresponds one-to-one to an n-dimensional 0-1 vector whose component is 1 for each element included in S and 0 for each element not included
A set function can be said to be a real-valued function on an n-dimensional 0-1 vector
An n-dimensional 0-1 vector is called a characteristic vector
More precisely, for any set S ⊆ V, the characteristic vector x_S = (v_1, …, v_n)^T is defined by
v_i = 1 if i ∈ S
v_i = 0 otherwise
Example representation
Example: x_{2,3} ∈ ℝ^3 is (0,1,1)^T
Submodularity and Convexity of Continuous Functions
A set function f can be regarded as a function defined on 0-1 vectors
For V={1}, there are two subsets: {}, {1}
1-dimensional
For V={1,2}, there are 4 subsets: {}, {1}, {2}, {1,2}
Two dimensions
f is defined only for points on a 1D or 2D 0-1 lattice
Mathematically rigorous definition of interpolation (convex relaxation) that produces a convex function for a submodular function
The relationship between submodularity and convexity can be rigorously described using Lovász extensions
Necessary and sufficient condition that the set function is a submodular function
Its Lovász extension must be a convex function
Definition of Submodularity
Equivalent definition (diminishing returns): f(S ∪ {i}) − f(S) ≥ f(T ∪ {i}) − f(T) for all S ⊆ T ⊆ V and i ∈ V \ T; each side is the increment of the function value when adding i to S or T, respectively
The symbol \ denotes set difference
Example: S={1,2}, T={1,4,5}, S \ T={2}, T \ S={4,5}
The increase in function value when adding element i (∈ VT) to the “small” set S contained in T is greater than that to the “large” set T containing S
Diminishing returns: the gain decreases gradually as more is added
Functions with diminishing returns frequently appear in the information field.
Intrinsic nature of information
Intuitive interpretation
The value of newly obtained information is greater when there is little known information
If more is already known, the value of the same information is relatively small
1.2.2 Examples of submodular functions
For any n-dimensional real vector a ∈ ℝ^n, the set function f(S) = Σ_{i∈S} a_i satisfies submodularity (with equality)
Modular function
Can also be expressed as a linear function f(S) = a^T x_S with the characteristic vector x_S ∈ {0,1}^n
Given an arbitrary undirected graph g=(V,E)
A function that returns the number of branches between a selected set of vertices S ⊆ V and its complement V \ S
Submodular functions
Given a finite set of elements and a set V of groups, each of which contains some of the elements
A function that returns the number of elements covered by a selected collection of groups S ⊆ V is called a coverage function
Submodular functions
(Joint) entropy, often used in the information field
Is submodular when viewed as a set function on the set of random variables
Using entropy in sensor placement
Given a matrix M, the rank of the submatrix obtained by selecting a subset of rows S ⊆ V from among all rows V
Submodular function
Mutual information and information gain, utility functions in economics (supermodular functions), determinants of positive-definite symmetric matrices, convex games, squared multiple correlation coefficients
Many set functions in information-related fields (and neighboring fields) are submodular functions
1.3 Submodularity in Machine Learning
Feature selection is formulated as a problem of maximizing a submodular function
Recently, there has been much research formulating the problem of maximizing the utility of the selected set as submodular function maximization.
Applications not limited to feature selection, but also in key machine learning algorithms such as active learning, nonparametric Bayesian estimation, spectral methods, and graph structure estimation
Applied in a wide variety of application areas such as graph mining, viral marketing, and network analysis
Submodular functions are suitable for representing many discrete data structures
Structural dependencies between variables
Represented using a submodular function
There is a hierarchical relationship between variables
Variables contain redundancy or similarity
Structural regularization
Streamlining calculations by incorporating pre-known structures
1.4 Topics covered in this book
Fundamentals of Submodular Optimization
Maximization of submodular functions and application of the greedy method
Maximum flow and graph cuts
Structural Regularization Learning with Submodular Optimization
1.5 Available Software
Matlab toolbox
Chapter 2: Fundamentals of Submodular Optimization
2.1 Definition and Examples of Submodular Functions
2.1.1 Cover functions
One example of a basic submodular function
Coverage function
Example: Sensor installation
fcov(S): the area covered by the selected sensors
Measured as the number of points covered (in the figure)
Sensor 1 ON→12 points covered⇨fcov({1})=12
Sensor 2 ON→10 points covered⇨fcov({2})=10
Sensors 1 and 2 are ON→21 points are covered⇨fcov({1,2})=21
When V={1,2,3,4}, there are 2^4=16 subsets of V
The coverage function fcov: 2^V → ℝ assigns a value to each of these 2^4=16 subsets
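A minimal sketch of a coverage function in Python; the point sets below are made up for illustration and are not the sensor layout in the book's figure:

```python
# Each sensor covers a (hypothetical) set of points.
coverage = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d", "e"},
    3: {"e", "f", "g", "h"},
    4: {"g", "h", "i"},
}

def f_cov(S):
    """Number of points covered by the sensors in S (a monotone submodular function)."""
    covered = set()
    for i in S:
        covered |= coverage[i]
    return len(covered)

print(f_cov({1}), f_cov({2}), f_cov({1, 2}))  # 3 4 5  (< 3 + 4: diminishing returns)
```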
2.1.2 Cutting functions for graphs
Cut function of a graph (cut function)
Undirected graph
Undirected graphs and branch subsets
Each branch e has a fixed positive value ce called branch capacity
Example
Assign a real value fcut(S) to each vertex subset S
Sum of the capacities of the branches spanning S and V \ S
fcut(S) = fcut(V \ S)
Directed graph
Directed graph and a subset of branches
Branches in the opposite direction (from V \ S into S) are not counted
Cut function
In general, fcut(S) ≠ fcut(V \ S)
The cut function of a directed graph is submodular
Therefore, the cut function of an undirected graph is also submodular
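A small sketch of a cut function for an undirected graph with weighted branches (a made-up example; for a directed graph only the branches leaving S would be counted):

```python
# Cut function of a small, hypothetical undirected graph, given as weighted branches.
edges = {(1, 2): 2.0, (2, 3): 1.0, (1, 3): 3.0, (3, 4): 1.5}

def f_cut(S, edges):
    """Sum of capacities of branches with exactly one endpoint in S (submodular, symmetric)."""
    return sum(c for (u, v), c in edges.items() if (u in S) != (v in S))

print(f_cut({1}, edges))        # 2.0 + 3.0 = 5.0
print(f_cut({1, 2}, edges))     # 1.0 + 3.0 = 4.0
print(f_cut({1, 2}, edges) == f_cut({3, 4}, edges))  # True: symmetric, f(S) = f(V \ S)
```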
2.1.3 Set functions generated by concave functions
Submodular functions generated by concave functions of one variable
A one-variable concave function on ℝ generates a submodular function on {0,1}^n
That is, the set function generated by a one-variable concave function is submodular (proof below)
concave function of one variable
Consider the change in function value h(x+∆x)-h(x) when x is changed to x+∆x
The larger x is, the smaller this change is.
Example of function h(x)
√x, x(α−x), min{x, β}, where α, β are constants
Set function fcnc: 2^V → ℝ defined by fcnc(S) = h(|S|)
|S| is the number of elements in S
Submodular function
Weighted version fwcnc(S) = h(w(S)), where w(S) = Σ_{i∈S} w_i
Let w = (w_1, w_2, …, w_n) be an n-dimensional constant vector with non-negative components
Submodular functions
Proof of Submodularity of Set Functions Generated by Concave Functions
A=S∪T, B=S∩T
1/2(h(w(S))+h(w(T))) ≥ 1/2(h(w(S∪T))+h(w(S∩T)))
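A small sketch of the set functions fcnc(S) = h(|S|) and fwcnc(S) = h(w(S)) with h = √·, checking one instance of the submodular inequality (the weights are made up):

```python
import math

def f_cnc(S, h=math.sqrt):
    return h(len(S))                         # h(|S|)

def f_wcnc(S, w, h=math.sqrt):
    return h(sum(w[i] for i in S))           # h(w(S)), w(S) = sum of nonnegative weights

w = {1: 0.5, 2: 2.0, 3: 1.0}
S, T = {1, 2}, {2, 3}
lhs = f_wcnc(S, w) + f_wcnc(T, w)
rhs = f_wcnc(S | T, w) + f_wcnc(S & T, w)
print(lhs >= rhs)  # True: one instance of f(S) + f(T) >= f(S|T) + f(S&T)
```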
2.1.4 Characterization of Submodularity
The two expressions of submodularity are equivalent
2.2 Basic Properties of Submodular Functions
2.2.1 Various types of set functions
In the submodular optimization
It is important what properties a function f has in addition to its submodularity
Computational efficiency and algorithms vary greatly depending on the function f
Submodular function typology
Consider a finite set V={1,…,n} and a set function f: 2^V → ℝ
Introduction to terminology
f is normalized (normalization)
f({})=0 holds.
f is non-negative
f(S)≥0 for all S⊆V
f is monotone (nondecreasing)
f(S)≤f(T) for all S⊆T⊆V
f is symmetric
When f(V \ S) = f(S) for any S ⊆ V
Type
The cover function fcov is a monotonic submodular function
The cut function fcut of an undirected graph is a symmetric submodular function
The cut function fdcut of a directed graph is a non-negative function, but neither monotonic nor symmetric
The submodular functions fcnc and fwcnc generated by concave functions are in general neither monotone nor symmetric, but can be monotone or symmetric depending on how h is chosen
Supermodular Functions, Modular Functions
Supermodular function
If g(S)+g(T)≤g(S∪T)+g(S∩T) for any S,T⊆V
We say g is a supermodular function
If g is both submodular and supermodular
g(S)+g(T)=g(S∪T)+g(S∩T) for any S,T⊆V
We say that g is a modular function
2.2.2 Basic Operations on Submodular Functions
Given a set function such as a submodular function, what are the basic operations that preserve submodularity?
Sum and scalar multiplication
Simplification and contraction
Restriction (simplification) and contraction, which are operations that reduce the size of the ground set
Restriction
The restriction f_S is the function obtained by restricting f to the subsets of S
Submodularity
Contraction
The contraction f^S is the function obtained by restricting f to the subsets of V containing S and subtracting the constant f(S) so that f^S({})=0
Submodularity
A submodular function obtained by repeated simplification and contraction with respect to a submodular function f is called a minor of f.
2.3 Concept of Submodular Optimization
Introduction
Minimum cut problem and maximum cut problem
Example
Formulation of the minimum cut problem
Objective.
Cut function fcut(S) of an undirected graph → minimum
Constraints
S ∈ 2^V \ {{}, V}
Need to explicitly exclude {} and V as S
Formulation of the maximum cut problem
Objective.
Cut function fcut(S) of an undirected graph → max
Constraints
S ∈ 2^V
The minimum cut problem can be solved efficiently in theory, but the maximum cut problem is hard to solve exactly
Basic form of submodular optimization
Basic form of submodular optimization
Objective.
f(S) → minimum or maximum
Constraints
S ∈ F ⊆ 2^V
F is a set family consisting of the entire subset considered in the optimization
Terminology Introduction
Objective function
f, the target to be optimized
Feasible region
F ⊆ 2^V is a family of sets consisting of the subsets of V that satisfy the given constraints
Feasible solution
S contained in the feasible region F
Optimal solution
A feasible solution S*∈F that optimizes the value of the objective function
Optimal value
Objective function value f(S*) at optimal solution
Minimizer
Optimal solution when minimization problem
Maximizer
Optimal solution for maximization problem
Different Types of Submodular Problems
Is the submodular function f a general function? Non-negative? Monotone? Symmetric? Or does it have some special structure?
Is it a minimization problem? Maximization problem?
Unconstrained (i.e., F = 2^V)? If constrained, what kind of constraint?
Information about the submodular function and the computational complexity of the algorithm
What information about the submodular function is available to design an algorithm for a submodular optimization problem?
Given function-valued oracles, their calls yield the value of f(S)
Function value oracle
"Oracle" originally means a divine revelation
In the fields of theoretical computer science and optimization, a device that performs some kind of predetermined calculation.
However, the process of computation is assumed to be a black box
Submodular optimization algorithms reduce both the number of basic operations and the number of function value calls to the order of polynomials with respect to n.
2.3.1 Minimizing a submodular function
For the problem of minimizing an unconstrained submodular function or a symmetric submodular function
General Submodular Function Minimization
Unconstrained minimization
The feasible region F is 2V
Eq.
The minimum of f can in principle be obtained by simply evaluating and comparing f(S) for all 2^n subsets S ⊆ V
This takes exponential time and is impractical
If non-negativity (or monotonicity or symmetry) is assumed, S={} is the optimal solution to the problem
A note on minimizing a submodular function in practice
The theory of submodular functions provides a baseline guarantee on the tractability of the problem
If you can prove submodularity, you can use various results of the theory of submodular functions
There are (relatively) practical submodular function minimization algorithms that are not guaranteed to run in polynomial time
A method with no guarantee of polynomiality in computation time, called the minimum norm point algorithm, can solve submodular minimization problems faster than many polynomial-time algorithms.
Depending on the function f, the problem can be solved even faster (both theoretically and practically)
Well-known minimum s-t cut problem with fast solution and wide range of applications
Symmetric submodular function minimization
Proper subsets
A subset S ⊆ V is called proper if it is different from both {} and V
Symmetric submodular function minimization
Can be solved by Nagamochi-Ibaraki’s algorithm
Generalized by Queyranne's algorithm
2.3.2 Maximization of submodular functions
Introduction.
NP-hard problems with and without constraints
There are fairly good approximation algorithms for some types of maximization problems
Maximizing monotonic submodular functions
Problem equation
The greedy algorithm of Nemhauser, Wolsey, and Fisher
Maximization of non-negative submodular functions
Problem expression
NP-hard
As for the maximum cut problem
0.878-approximation algorithm based on semidefinite programming by Goemans and Williamson
Randomized 0.5-approximation algorithm by Buchbinder et al.
2.4 Submodular optimization and polyhedra
Introduction
V={1,… ,n}
Let f: 2^V → ℝ be a normalized submodular function
The function f defines two polyhedra (the submodular polyhedron P(f) ⊆ ℝ^n and the base polyhedron B(f) ⊆ ℝ^n)
2.4.1 Submodular and base polyhedra
Submodular polyhedron
Defined as the set of all x ∈ ℝ^n that satisfy the 2^n − 1 inequalities of the form x(S) ≤ f(S), where x(S) = Σ_{i∈S} x_i
P(f) ⊆ ℝ^n
P(f) = {x ∈ ℝ^n : x(S) ≤ f(S) ∀S ⊆ V}
Base polyhedron
x ∈ P(f) satisfying x(V) = f(V)
B(f)= {x∈P(f): x(V)=f(V)}
Submodular and base polyhedra for n=2
When V={1,2}
The submodular polyhedron P(f) ⊆ ℝ^2 is a polyhedron determined by three inequalities
The base polyhedron B(f) ⊆ ℝ^2 is a polyhedron determined by two inequalities and one equality
P(f) and B(f) when f({1})=4, f({2})=3, f({1,2})=5
Submodular and base polyhedra for n=3
When V={1,2,3}
The submodular polyhedron P(f) ⊆ ℝ^3 is the set of 3-dimensional vectors (x1,x2,x3) satisfying all seven inequalities
The base polyhedron B(f) ⊆ ℝ^3 is the set of all 3-dimensional vectors such that (x1,x2,x3) ∈ P(f) and x1+x2+x3 = f({1,2,3})
Examples of P(f) and B(f) for n=3
Properties of submodular and base polyhedra
2.4.2 Endpoints of the base polyhedron
Introduction.
For a polyhedron P ⊆ ℝ^n, a point x ∈ P is an extreme point if
x cannot be represented as the midpoint of any two points y, z ∈ P with y ≠ z
When the polyhedron P is bounded, then
P coincides with the convex hull of the endpoint set of P (the smallest convex set containing the endpoint set of P)
The base polyhedron B(f) is a subset of the submodular polyhedron P(f)
In general, the endpoint sets of B(f) and P(f) coincide
The base polyhedron B(f) for n=2 has two endpoints
Base polyhedron B(f) with n=3 has 6 endpoints
Endpoints and linear ordering of the base polyhedron
A linear ordering L is an arrangement obtained by permuting the index set V={1,…,n} of the n components of ℝ^n
When n=2, there are two linear orderings L: (1,2) and (2,1)
When n=3, there are 6 linear orderings: (1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), (3,2,1)
If we fix one linear ordering L of V, then the corresponding endpoint x^L = (x_1^L, …, x_n^L) ∈ ℝ^n of the base polyhedron B(f) is determined
The linear ordering L and the endpoint x^L correspond to each other via the greedy algorithm for the base polyhedron (or submodular polyhedron) proposed by Edmonds in 1970
Algorithm
Start at an appropriate point x0
x0 satisfies x0≤x for any point x∈B(f) in the base polyhedron
Within P(f), increase the components one by one in the order i_1, …, i_n given by L, each as much as possible, and let x^L be the resulting point
The final xL is always an endpoint of the base polyhedron B(f)
Explanation with a real example
Since n=2, the algorithm is from step 0 to step 2
Step 0
For example, let x0=(0,0)
Step 1
Maximize the first component within P(f), so x1=(4,0)
Step 2
Maximize the second component within P(f) and x2=x(1,2)=(4,1)
Figure: illustration of the procedure (steps 0-2)
A greedy algorithm to generate endpoints x^L ∈ ℝ^n
Let L(j) be a subset of V consisting of the first j elements of L
The endpoint x^L of the base polyhedron B(f) corresponding to the linear ordering L is easily obtained by calling the n function values f(L(1)), …, f(L(n))
Example: n=2
x(1,2)=(f({1}), f({1,2}) – f({1}))
x(2,1)=(f({1,2}) – f({2}), f({2}))
Example n=3
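The greedy construction above amounts to taking successive marginal gains along the ordering. A small sketch, reusing the example values f({1})=4, f({2})=3, f({1,2})=5:

```python
# Greedy computation of the extreme point x^L of B(f) for a linear ordering L (Edmonds):
# the component of the j-th element in L is the marginal gain f(L(j)) - f(L(j-1)).
def greedy_extreme_point(f, L):
    x = {}
    prefix = frozenset()
    prev = f(prefix)               # f({}) = 0 for a normalized function
    for i in L:
        prefix = prefix | {i}
        x[i] = f(prefix) - prev    # marginal gain of adding i
        prev = f(prefix)
    return x

vals = {frozenset(): 0, frozenset({1}): 4, frozenset({2}): 3, frozenset({1, 2}): 5}
f = lambda S: vals[frozenset(S)]
print(greedy_extreme_point(f, [1, 2]))  # {1: 4, 2: 1}  -> x^(1,2) = (4, 1)
print(greedy_extreme_point(f, [2, 1]))  # {2: 3, 1: 2}  -> x^(2,1) = (2, 3)
```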
2.4.3 Fundamental Polyhedra and Submodular Function Minimization
2.5 Submodular functions and convexity
2.5.1 Lovász extension of set functions
2.5.2 Submodular and Convex Functions
2.5.3 Multiple Linear Extensions of Submodular Functions
Chapter 3: Maximization of Submodular Functions and Application of the Greedy Method
Introduction
On the problem of maximizing a submodular function
f: 2^V → ℝ is a monotone submodular function
Constraint: At most k elements in the subset to be selected
3.1 Submodular Maximization and Greedy Method
3.1.1 Submodular maximization and approximation algorithms
Example of Cover Functions
Cover function maximization problem
Choose two sets that cover as much of P1,P2,P3,P4 as possible
V={1,2,3,4}
There are 6 subsets: {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}
By hand, the combination that achieves the maximum value is fcov({2,4})=11
Approach by approximation algorithm
α: approximation ratio or approximation guarantee
The closer α is to 1, the better
greedy method
(1 − 1/e) ≈ 0.63-approximation algorithm
For any monotone submodular function, the greedy method is guaranteed to achieve an objective function value of at least (1 − 1/e) ≈ 0.63 times the optimal value
3.1.2 Greedy Method for Submodular Maximization
Greedy algorithm
Start with S={} and “greedily” increase the number of elements until the number of elements in S⊆V is k
Specific Algorithm
Example
Initialization
Let S={} in step 0
Iteration 1
Step 1 does not stop since |S|=0 < 2
In step 2, fcov({}∪{1})=3, fcov({}∪{2})=5, fcov({}∪{3})=7, fcov({}∪{4})=6, so i=3 is chosen and S={}∪{3}={3}
Iteration 2
Step 1 does not stop since |S|=1 < 2
In step 2, fcov({3}∪{1})=10, fcov({3}∪{2})=9, fcov({3}∪{4})=10, so i=1 is chosen (i=4 would also do) and S={3}∪{1}={1,3}
Iteration 3
In step 1, since |S|=2, SGA={1,3} is output and the algorithm stops
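A minimal sketch of the greedy method under a cardinality constraint; the coverage instance below is made up, but its singleton values match those used in iteration 1 above:

```python
# Greedy maximization of a monotone submodular function under |S| <= k
# (a sketch, not the book's exact pseudocode).
def greedy_max(f, V, k):
    S = set()
    while len(S) < k:
        i = max(V - S, key=lambda j: f(S | {j}) - f(S))  # largest marginal gain
        S.add(i)
    return S

# Toy coverage instance (made-up point sets, not the book's figure):
cover = {1: {"a", "b", "c"}, 2: {"c", "d", "e", "f", "g"},
         3: {"f", "g", "h", "i", "j", "k", "l"}, 4: {"a", "j", "k", "l", "m", "n"}}
f_cov = lambda S: len(set().union(*(cover[i] for i in S)))
print(greedy_max(f_cov, {1, 2, 3, 4}, 2))   # {1, 3} in this instance
```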
3.1.3 Approximation Ratio for Greedy Method*.
Proof of approximation ratio (omitted)
3.2 Application Example 1: Application to Document Summarization
3.2.1 Formulation of Document Summarization as Submodular Maximization
Conceptual Diagram of the Definition Formula for the Submodular Function in Document Summarization
Given a document, a finite set V={1,… n}, where n is the number of sentences that comprise the document.
fdoc(S)=L(S) + λR(S)
S is the set of sentences selected in the document summary
L(S) measures how relevant the selected sentences are to the document
(i) Define L(S) = Σ s_ij, where s_ij is the similarity between two sentences i and j
Sentences that are highly relevant to the document as a whole
(ii) Let Γ(S) be the set of concepts contained in the sentence set S, and define L(S) = Σ γ_i as the sum of the importances γ_i of the concepts i ∈ Γ(S)
Definition by Lin and Bilmes
C_i: 2^V → ℝ is a monotone submodular function expressing how similar the selected set of sentences is to sentence i (i.e., how well sentence i is covered by S)
0≤γ≤1 is a parameter to adjust the threshold
R(S) is a term that rewards low redundancy (high diversity) among the selected sentences
Impose some penalty for redundant sentences
(i) Add a reward for the diversity of selecting a set of sentences S
P_k (k=1,…,K) form a partition of the whole set of sentences, and r_j > 0 is the reward for adding a new sentence j to the empty set
The partition P_k can be obtained by clustering, for example
Reward for selecting sentence i from a partition that has never been selected
Parameters to adjust trade-offs λ
Greedy algorithm for submodular function maximization under knapsack constraints
Consider the length of a sentence as a cost and impose a constraint on the sum of the costs of the selected sentences
Normalized by cost in step 3.
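A sketch of the cost-normalized greedy step under a knapsack (budget) constraint; the simple skip-if-over-budget rule and the names are illustrative, not the book's exact algorithm:

```python
# Greedy maximization under a knapsack constraint: pick the element with the best
# marginal-gain-per-cost ratio, and add it only if it still fits in the budget.
def cost_greedy(f, V, cost, budget):
    S, spent = set(), 0.0
    candidates = set(V)
    while candidates:
        i = max(candidates, key=lambda j: (f(S | {j}) - f(S)) / cost[j])
        candidates.remove(i)
        if spent + cost[i] <= budget:      # skip items that would exceed the budget
            S.add(i)
            spent += cost[i]
    return S
```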
3.2.2 Other Criteria for Document Summarization
Other Document Summarization Criteria
Maximum marginal relevance
Equation for the increment when a new sentence i∈V is added to the set of already selected sentences S
Sim1(i,q) is the similarity between sentence i and query q
Sim2(j,i) is the similarity between sentences j and i
λ:0≤λ≤1 is a coefficient that adjusts the balance between these
[8]
ROUGE-N score proposed by Lin and put into practice in recent years
[34]
Recall based on n-grams between a candidate summary and reference summaries
3.3 Application Example 2: Sensor Placement Problem
Introduction.
Placement of sensors so that the observation error is as small as possible
Modeled with Gaussian process regression and formulated as a submodular maximization problem
3.3.1 Estimating the Distribution by Gaussian Process Regression
Assumptions.
Possible sensor locations x1,… xn, where n is the number of possible sensor locations.
The set consisting of subscripts V={1,… n}.
Place sensors at locations corresponding to S ⊆ V
Sensors are noisy; multiple sensors complement each other's information to improve accuracy
Accuracy is higher in the vicinity of placed sensors
The distribution of the observed quantity ys at the location S ⊆ V where the sensor is placed follows a multidimensional normal distribution
Probability density function
μs is the mean vector
∑s is the variance-covariance matrix
|∑s| is the determinant
Consider the distribution of an observable y(x) at an arbitrary location x
Mean μ(x) and variance σ2(x) can be expressed nonlinearly as a function of location x
p(y(x)) does not yet account for sensor observations
We want to infer the quantity y(x) at an arbitrary location x, using the quantities observed at the sensor locations as a guide
The conditional probability p(y(x)|ys) (posterior probability distribution) is the probability we want to find.
ys is the observed quantity at the location where the sensor is placed
What does the conditional distribution p(y(x)|ys) look like?
The covariance between the observables at each location is given by the kernel function K(x,x’)
A radial basis function (RBF) kernel is used here
Also known as: Gaussian kernel
γ: Parameter of the kernel function
Simultaneous distribution of observables y(x) and ys
K0:=K(x,x)
Each element of the vector k is K(x, x_i)
The conditional distribution p(y(x)|y_S) = N(μ(x|S), σ^2(x|S)) is given by the above equation
Using the partitioning formula
Partition of the mean vector μ and variance-covariance matrix ∑
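A minimal sketch of the Gaussian-process posterior computation with an RBF kernel (zero prior mean and a small noise term are assumed; the data are made up):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def gp_posterior(x, X_S, y_S, gamma=1.0, noise=1e-6):
    """Posterior mean mu(x|S) and variance sigma^2(x|S) given sensor readings y_S."""
    K_S = np.array([[rbf(xi, xj, gamma) for xj in X_S] for xi in X_S])
    K_S += noise * np.eye(len(X_S))
    k = np.array([rbf(x, xi, gamma) for xi in X_S])   # covariances K(x, x_i)
    k0 = rbf(x, x, gamma)                              # K(x, x)
    K_inv = np.linalg.inv(K_S)
    mu = k @ K_inv @ y_S
    var = k0 - k @ K_inv @ k
    return mu, var

X_S = np.array([[0.0], [1.0], [2.0]])   # sensor locations
y_S = np.array([0.0, 1.0, 0.5])         # observed values
print(gp_posterior(np.array([1.5]), X_S, y_S))
```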
3.3.2 Criteria for sensor placement and submodularity
What are the criteria for evaluating good sensor placement?
How small is the uncertainty of the observation at any point x in the region under consideration?
Entropy-based approach
The joint entropy H(y) of a random variable y and the conditional entropy H(y|y') given another variable y' are as above
The conditional entropy of an observed quantity y(x) at a location x when a sensor is placed at location S is
Assuming a normal distribution
Can be calculated using σ^2(x|S)
How much uncertainty can be reduced by placing the sensor at location S?
In the actual calculation
Can’t calculate everything at an arbitrary location.
Some representative locations x1,… xm pre-selected
Calculate this quantity for these locations and for the locations V \ S where no sensor is placed
Mutual information between y_S and (y_{V\S}, y(x))
Proposition.
Entropy is a monotone (nondecreasing) set function
The more observations we get, the more information we have and the less uncertainty we face
Constraints are placed on the installation fee for sensors.
To obtain accuracy
Monotonicity is sufficient for all sets with up to 2k elements
Make |V| sufficiently large relative to 2k
[30]
3.4 Application example 3: Active learning
Introduction.
A method for selecting and labeling only important samples when the cost of labeling is high in supervised learning
Pool-based active learning
Batch-mode active learning
Selecting multiple samples to be labeled at the same time
3.4.1 Batch-Mode Active Learning and Submodularity
Binary classification is used for simplicity
Prerequisite.
Unlabeled samples x1,… ,xn∈ℝd.
The set of its indexes is V={1,… ,n}.
Let yi ∈ {-1, +1} be the (unknown) label of each sample xi
The objective is to select as few samples as possible (or no more than k samples given a priori) to label so that the classifier p(y|x) can be trained to perform as well as possible.
The usefulness of the set of samples to be labeled (how much information they have for the classifier) is given based on the Fisher information matrix described in “Overview of the Fisher Information Matrix and Related Algorithms and Examples of Implementations.
A criterion that gives the amount of information that the observed random variable has for the model (parameters) under consideration.
Since the classifier p(y|x) is described by a finite number of parameters w
The Fisher information matrix Iq(w) is given by the above equation
q(x) is the distribution of samples of interest
Criteria for active learning using the Fisher information matrix
The distribution over all samples is p(x)
Let q(x) be that for the sample to be labeled
A criterion that tries to select samples so that the information held by the sample to be labeled is as close as possible to that of all samples.
Formulated using (linear) logistic regression as a classifier
Logistic regression equation
logistic sigmoid function
Fisher matrix for logistic regression
Only a finite sample is available, so the integral cannot be computed exactly
An approximation is used instead
I_d is the d-by-d identity matrix
Δ≪1 is a small real value used to avoid singular matrices
S⊆V is the set of samples to be labeled
Substitute into tr(I_q^{-1} I_p), the quantity to be minimized
The final objective function to be minimized
The first term is evaluated with respect to V
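A sketch of the resulting criterion for logistic regression: approximate Fisher information matrices built from the sigmoid weights, and the quantity tr(I_q^{-1} I_p) to be minimized. The δ regularizer avoids singular matrices; data, parameter values, and the simple averaging used here are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fisher_info(X, w, delta=1e-3):
    """Approximate Fisher information for logistic regression: avg of p(1-p) x x^T + delta I."""
    d = X.shape[1]
    p = sigmoid(X @ w)
    return (X.T * (p * (1 - p))) @ X / len(X) + delta * np.eye(d)

def criterion(S, X, w):
    """tr(I_q^{-1} I_p): smaller means the labeled set S carries more of the information
    of the full sample (I_p over all samples, I_q over the selected rows)."""
    I_p = fisher_info(X, w)
    I_q = fisher_info(X[list(S)], w)
    return np.trace(np.linalg.solve(I_q, I_p))

rng = np.random.default_rng(0)
X, w = rng.normal(size=(50, 3)), np.zeros(3)
print(criterion({0, 1, 2, 3, 4}, X, w))
```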
3.5 Other applications
Thoma et al.: subgraph selection in graph mining formulated as submodular function maximization
[52]
Das and Kempe: submodularity of criteria for variable selection in linear regression models
Reed and Ghahramani: a type of nonparametric Bayesian estimation called the Indian buffet process
Efficient implementation by formulating the optimization required during the computation as an approximate submodular function maximization.
Kempe et al. formulation of influence maximization on a network as a submodular function maximization
Applied to marketing in social networks
Document summarization by Lin, Bilmes, et al.
Use of hierarchical structure of documents
Simultaneous summarization of multiple documents
Proposed criteria and robust algorithms for sensor placement that differ from mutual information content
Particularly important extensions in active learning
Extension to active learning in ever-changing circumstances
Adaptive submodularity
[9, 17]
Further related application topics
[30]
3.6 Supplement: Setting of possible sensor placement locations*.
Sensor placement problem
Lemma 3.6
Lipschitz continuous
Corollary 3.7
Theorem 3.8
Chapter 4: Maximum Flow and Graph Cut
Introduction
Minimizing a submodular function can be done in polynomial time
For a general submodular function on a ground set V with n elements, minimization takes O(n^5·EO + n^6) time in the worst case (EO denotes one function-value oracle call)
For symmetric submodular functions, there is Queyranne's algorithm, which runs in O(n^3·EO) in the worst case
Even the minimum norm point algorithm is not fast enough for problems with many element points or for iterative calculations
It is necessary to consider a special class of fast minimizable submodular functions
Graph cut algorithms used in computer vision
Introduction.
On the s-t cut function of directed graphs and maximum flow
The s-t cut function is one of the most basic submodular functions
Maximum flow and s-t cuts are closely connected by a principle called the maximum flow minimum cut theorem
4.1.1 Cut and flow
On the concepts of cut and flow in directed graphs
Example of a directed graph
Number of vertices: 6 (source s, sink t, and V={1,2,3,4})
Number of branches 9
The number on each branch is the branch capacity
s-t cut and minimum s-t cut problems
An ordered partition (V1,V2) of the vertex set V of a directed graph g
In particular, those satisfying s ∈ V1, and t ∈ V2
s-t cut (s-t cut)
Any s-t cut can be expressed as above by using some subset S ⊆ V of V
For an s-t cut, express the capacity Ks-t in the above equation
For a subset V' ⊆ V of the vertices of g
Let δ_g^out(V') ⊆ ε be the set of branches e ∈ ε in g that exit a vertex of V' and enter a vertex outside V'
The set function Ks-t: 2^V → ℝ is called the s-t cut function
The s-t cut function values in the example directed graph above are
Ks-t({}) = c(s,1) + c(s,2) = 10
Ks-t({1}) = c(1,3) +c(1,2) + c(s,2) =9
Ks-t({2}) = c(s,1) + c(2,3) + c(2,4) = 13
The s-t cut function Ks-t is the minor (plus a constant) of the cut function of the directed graph
Minimum s-t cut problem
Special case of the submodular function minimization problem
Can be solved quickly using a maximum flow algorithm
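A small sketch of an s-t cut function on a made-up directed graph (not the book's figure); the same graph is reused for the maximum-flow sketch in the next subsection:

```python
# s-t cut function: sum of capacities of branches leaving {s} ∪ S and entering the rest.
cap = {("s", 1): 4, ("s", 2): 6, (1, 2): 2, (1, "t"): 3, (2, "t"): 5}

def kappa_st(S, cap):
    source_side = {"s"} | set(S)
    return sum(c for (u, v), c in cap.items()
               if u in source_side and v not in source_side)

print(kappa_st(set(), cap))    # c(s,1) + c(s,2) = 10
print(kappa_st({1}, cap))      # c(1,2) + c(1,t) + c(s,2) = 11
print(kappa_st({1, 2}, cap))   # c(1,t) + c(2,t) = 8
```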
Flow and maximum flow problems
In a directed graph g=({s}∪{t}∪V, ε)
Consider a flow, i.e., a movement of material from the source s (the inlet) to the sink t (the outlet)
Let each branch e ∈ ε correspond to a real value 𝝃e ∈ ℝ representing the flow over e
Introduce a real vector 𝛏=(𝛏e)e∈ε
A flow 𝛏 is said to be a feasible flow if it satisfies the following conditions
For each branch e ∈ ε, the flow 𝝃e on e is greater than or equal to 0 and less than or equal to ce
0≤𝝃e≤ce (∀e∈ε)
For each vertex i ∈ V (i.e., all vertices except s and t), the amount of flow leaving i equals the amount of flow entering i
Flow conservation constraints
Example of feasible flow
Representation of a flow 𝝃e flowing over a branch e of branch capacity ce
For a feasible flow, the flow value is expressed above as the net amount of flow leaving the source
Problem of finding a feasible flow 𝝃 that maximizes flow
The optimal solution is called the maximum flow.
Example
Linear optimization problem
Can be solved using linear optimization solvers
Can also be solved quickly using a maximum flow algorithm
Relationship between flow and cut
Relation between an s-t cut of a directed graph and any feasible flow
Maximum flow minimum cut theorem
Example
4.1.2 Maximum flow algorithm
Typical algorithm for finding maximum flow
Aiming for the maximum flow
A directed path from source s to sink t in a directed graph is called an s-t path
Simple method for finding the maximum flow in a directed graph g
Procedure
Initialize flow 𝛏 as 𝛏e=0 (e ∈ ε)
Push an amount α of flow along an s-t path P in g
Branch (i,j) is on path P → increase 𝝃(i,j) by α
Branch (i,j) is not on path P → 𝝃(i,j) remains the same
Problems
Maximum flow is not always obtained
Residual network and flow updates
Define a directed graph that expresses how much 𝝃 can be varied while meeting capacity limits
Residual capacity c_e^ξ
Composition of the residual network g𝝃 with respect to feasible flows 𝝃
Black if branch (i,j) ∈ ε_for^ξ (forward branch)
Blue if branch (i,j) ∈ ε_rev^ξ (reverse branch)
Update flows using residual network
An s-t path in the residual network (an augmenting path)
The flow can be increased by the minimum residual capacity along the augmenting path
Augmenting path algorithm for the maximum flow problem
augmenting path algorithm
1956, Ford and Fulkerson (Ford-Fulkerson algorithm)
Example
Continued
Image in Fig.
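A minimal sketch of the augmenting-path method; it uses breadth-first search to find s-t paths in the residual network (i.e., the Edmonds-Karp variant mentioned below), on the same made-up graph as the cut-function sketch above:

```python
from collections import defaultdict, deque

def max_flow(cap, s, t):
    residual = defaultdict(float)
    residual.update(cap)                       # residual capacities, updated in place
    flow = 0.0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:       # BFS for an augmenting s-t path
            u = queue.popleft()
            for (a, b), r in list(residual.items()):
                if a == u and r > 0 and b not in parent:
                    parent[b] = u
                    queue.append(b)
        if t not in parent:
            return flow                        # no augmenting path: the flow is maximum
        path, v = [], t                        # reconstruct the path and its bottleneck
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        alpha = min(residual[e] for e in path)
        for (u, v) in path:
            residual[(u, v)] -= alpha          # forward branch loses capacity
            residual[(v, u)] += alpha          # reverse branch gains capacity
        flow += alpha

cap = {("s", 1): 4, ("s", 2): 6, (1, 2): 2, (1, "t"): 3, (2, "t"): 5}
print(max_flow(cap, "s", "t"))   # 8, equal to the minimum s-t cut value above
```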
Computational complexity of the maximum flow algorithm
If there are multiple candidates, select the augmenting path with the fewest branches
Edmonds-Karp algorithm
Sohma’s Algorithm
Goldberg-Tarjan algorithm
Allows excess flow to accumulate at vertices other than the sink
Gallo-Grigoriadis-Tarjan algorithm
Application to parametric maximum flow problem
Validity of the augmenting path algorithm and proof of the maximum flow minimum cut theorem
Proposition.
Computation of minimum s-t cut using maximum flow
For any maximum flow ξ, the residual network has no augmenting path
Algorithm for minimizing s-t cut function using maximum flow
4.2 Inference and Graph Cuts in Markov Random Fields
Introduction.
If the variables in the data have some structure, that structure can be exploited to describe various inference algorithms
Represent Markovianity among variables using undirected graphs
Maximum a posteriori estimation in Markov random fields and its relation to submodular optimization
Graph cuts in computer vision
4.2.1 Markov Random Fields
Consider a finite set of labels L := {0, 1, …, l} and d random variables x_i (i = 1, …, d)
Each random variable xi takes on the value of one of the l+1 labels
We are given an undirected graph g=(V,ε) whose vertex set V={1,…,d} corresponds to the variables x_i
The information in the branch set ε represents the relationship between random variables
A Markov random field has the probabilistic (Markov) property expressed by the above equation
B_i := {x_j : (i,j) ∈ ε} denotes the set of variables adjacent to x_i on the graph g
Decompose probability distributions based on conditional independence
C is the set of all maximal cliques on g (maximal sets of mutually adjacent vertices)
Θ_c is some function defined on the variables x_c = (x_i)_{i∈c} corresponding to the vertices in the clique c
Probability distributions decomposed in this way based on Markovianity (decomposable models)
Used in probability propagation, etc.
In general, most distributions handled by Markov random fields belong to the exponential family, such as the normal distribution
The discussion proceeds in terms of the function E(x) defined by the above equation
Function E(x): Cost function, called energy
Because maximum a posteriori estimation of p(x) is equivalent to minimizing the function E(x) (energy minimization), in connection with statistical-physics models such as the Ising model described in "Statistical physics and its application to artificial intelligence technology"
Discussion of the form of the undirected graph g, restricted to a lattice
First-order Markov random field
In a lattice graph, a clique is a set of 2 elements consisting of pairs of variables adjacent to each other in the graph.
The function is decomposed for each pair of branches
Some inference
Given data, an operation that assigns a value to x according to some criterion
Equation for Optimization Problem of Inference
The range of possible values of x is denoted L^V
The first term on the right-hand side
An addition to incorporate quantities defined for each variable
θconst is a constant
Essentially, even without the first term, the second term alone can express the energy function of a first-order Markov random field
Example: Conditional random field
Conceptual diagram of a concrete example
Example: Recovering the original image x from a noisy image y
x: Variable not directly observed
Probability distribution p(x) of x
C is some constant
Y: Observable variable
Equation for the maximum posterior probability of obtaining an observed value for variable y
4.2.2 Submodularity in Energy Minimization
Energy minimization problems are NP-hard
If x is a binary variable, it can be computed in polynomial time if the pair term θij satisfies certain conditions
Known conditions for solving in polynomial time, even with many labels
[27, 28]
Condition for polynomial-time solvability with binary variables: θ_ij(0,0) + θ_ij(1,1) ≤ θ_ij(0,1) + θ_ij(1,0)
Denotes some sort of smoothness of the pair terms
Adjacent vertices on the original graph tend to be assigned the same label (both 0 or both 1)
Interpretation applied to the case of images
In an image, a pixel's label often coincides with those of its neighbors
The energy is large at label boundaries, but such boundaries are rare overall
Example of applying energy function minimization to denoising
Submodularity
The pair term can be viewed as a set function on V={1,2}, with four subsets: {}, {1}, {2}, V
When each random variable is binary, the problem can be solved quickly with the graph cut algorithm if the pairwise terms satisfy submodularity
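A small sketch that checks this pairwise submodularity condition θ_ij(0,0) + θ_ij(1,1) ≤ θ_ij(0,1) + θ_ij(1,0) for given pair terms (the tables are made up):

```python
def is_graph_cut_solvable(pairwise):
    """Check the submodularity condition for every pairwise energy table."""
    return all(th[(0, 0)] + th[(1, 1)] <= th[(0, 1)] + th[(1, 0)]
               for th in pairwise.values())

# A smoothness term that penalizes disagreement satisfies the condition ...
smooth = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): 1.0, (1, 0): 1.0}
# ... while a term that rewards disagreement does not.
anti = {(0, 0): 1.0, (1, 1): 1.0, (0, 1): 0.0, (1, 0): 0.0}
print(is_graph_cut_solvable({(1, 2): smooth}))   # True
print(is_graph_cut_solvable({(1, 2): anti}))     # False
```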
Specific Methods
For graph cutting, the energy function must be converted to its standard form by repeating an operation called reparameterization
Graph Cut Algorithm
s-t graph used for graph cuts to first-order energy minimization
4.3 Graphically Representable Submodular Functions
Introduction.
Extension of s-t cut function for directed graphs
graph-representable submodular function
Functions that can be represented as s-t cut functions of directed graphs by adding some auxiliary vertices
4.3.1 Generalization of s-t cut functions
4.3.2 Minimizing the generalized graph cut function
4.4 Supplement: pre-flow push method*.
Chapter 5: Structural Regularization Learning with Submodular Optimization
Introduction.
Machine learning using the structure between each variable in the data
Approaches to Sparsity Modeling
Using prior information to reduce the effective dimensionality and improve learnability
5.1 Sparsity Model Estimation by Regularization
Introduction
Estimating sparsity models by regularization
5.1.1 Regularization by the 𝓁p norm
Minimize loss function
Squared error function
Cross-entropy error function described in “Overview of Cross-Entropy and Related Algorithms and Implementation Examples,”
Reducing the loss function alone causes over-fitting
If you make the model more complex, you can reduce the loss on the data D as much as you like, but you will also fit random noise in the data.
Regularization approach
In addition to the loss function, add a penalty term Ω for increasing parameter w (increasing model complexity) and minimize at the same time
As a penalty term, the norm is often used, which has the effect of shrinking the parameters (i.e., making them smaller).
The norm is a mathematical tool to give a distance to a vector
The ℓp-norm is often used
The further the parameters are from the origin (the point where all components of w are 0), the larger the value (the stronger the penalty under minimization)
Regularization with norm
It is easier to get many components of the solution to zero using the L1 norm
5.1.2 Dual Norms and Fenchel Conjugates*.
Dual norm
Dual norm Ω* of norm Ω
Fenchel conjugate
For any continuous function h: ℝ^d → ℝ, the Fenchel conjugate h*: ℝ^d → ℝ is given by the above expression
Proposition
5.2 Structural Sparsity from Submodular Functions
Introduction.
Regularization by the ℓp-norm treats each variable w_1, …, w_d uniformly
Extend regularization to actively exploit relationships between variables in the data
Two types of sparsity obtained by regularization
Sparsity in the form of many parameters being degenerate and taking the value of 0
ℓ1 regularization, etc.
Can also be thought of in terms of groups on variables given in advance
Sparsity such that multiple parameters that are close in nature tend to have the same value.
5.2.1 Regularization of group types
Group regularization
Assumptions.
Example: Consider the case where the parameters are 3-dimensional w=(w1,w2,w3)T
Of the three parameters, w1 and w2 are in a group and w3 is not
Within a group, parameters are only shrunk, as in ℓ2 regularization
Between groups, there are sparsity and shrinkage effects as in ℓ1 regularization
Description as a penalty
w1 and w2 tend to become zero at the same time
Non-overlapping ℓ1/ℓp-group regularization
Variables that tend to become zero simultaneously are divided into groups; let G be a collection of subsets of {1,…,d}
The penalty term is the above equation
Non-overlapping ℓ1/ℓp-group regularization
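A minimal sketch of the non-overlapping ℓ1/ℓp group penalty Ω(w) = Σ_{g∈G} ||w_g||_p (group weights are omitted for simplicity):

```python
import numpy as np

def group_penalty(w, groups, p=2):
    """Sum of the p-norms of the parameter sub-vectors, one per group."""
    return sum(np.linalg.norm(w[list(g)], ord=p) for g in groups)

w = np.array([0.0, 0.0, 1.5])
groups = [{0, 1}, {2}]            # w1, w2 form one group, w3 is on its own (0-indexed)
print(group_penalty(w, groups))   # 0.0 + 1.5: the whole first group is zero together
```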
Figure: group regularization (left: without overlap, right: with overlap)
Group-type regularization term obtained from the submodular function
V:={1,… d} as a finite set
For w ∈ ℝ^d, supp(w) = {i ∈ V : w_i ≠ 0} is called the support set
A function that takes into account both the ℓp-norm of w and a cost on its support set
Proposition.
Relationship to ℓp regularization and ℓ1/ℓp regularization
Numerical example with group regularization
Relationship to submodular functions
5.2.2 Regularization of coupled types
Provide sparsity such that similar variables have the same value
Express similarity given an undirected graph defined on the variables
If ordered, simply a single chain graph
For cases such as pixels in an image, the graph is a 2-dimensional lattice
An undirected graph is denoted by g=(V,ε) such that pairs of variables whose values we want to be close are adjacent to each other.
The coupled regularization term is expressed by the above equation
dij>0 is the weight for each pair (i,j)
0 if adjacent variables have the same value
The further apart adjacent variables are, the greater the penalty
Since the ℓ1 norm is used, differences of adjacent pairs are easily driven to exactly 0, as in the lasso
Often used for training with image data
Known as the total variation (used, e.g., for denoising)
If g is a chain graph, it is called a fused Lasso
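A minimal sketch of the coupled (total-variation / fused-lasso) penalty Σ_{(i,j)∈ε} d_ij |w_i − w_j| on a chain graph (the values are made up):

```python
import numpy as np

def coupled_penalty(w, edges):
    """Sum of weighted absolute differences over the branches of the graph."""
    return sum(d * abs(w[i] - w[j]) for (i, j), d in edges.items())

w = np.array([1.0, 1.0, 1.0, 0.2, 0.2])               # piecewise-constant parameters
chain = {(i, i + 1): 1.0 for i in range(len(w) - 1)}  # chain graph -> fused lasso
print(coupled_penalty(w, chain))                       # only the one "jump" is penalized: 0.8
```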
Conceptual Diagram of Coupled Regularization
Example of Least Squares Regression with Coupled Regularization
Relationship with Submodular Functions
The coupled regularization term is the Lovász extension of the cut function of an undirected graph
Sparsity obtained by joint regularization
Proposition
5.3 Reduction to Separable Convex Function Minimization on a Submodular Polyhedron
5.3.1 Application of the proximity gradient method
The objective function of the optimization problem is a convex function for both the loss function and the regularization term
The loss function l(w) is differentiable, but the structural regularization term Ω(w) is not
Therefore ordinary gradient-based convex minimization methods, which require gradients, cannot be applied directly
Replace the non-differentiable part by a kind of projection and use a procedure similar to the gradient method
For the simplest gradient method (steepest descent for differentiable convex functions)
Find the minimum of a differentiable convex function l(w) with vector w as an argument
For differentiable functions, the gradient (first-order partial derivative) ∂l(w)/∂wi can be calculated
If we move the parameter in the opposite direction of its gradient, we can reduce the function value.
Iterative calculations
w(k) and w(k+1) are parameters before/after update, respectively
∇l(w) = (∂l(w)/∂w_1, …, ∂l(w)/∂w_d)^T
ηk>0 is a parameter that controls the degree of update
Algorithm
The proximity gradient method applies this gradient procedure to the differentiable part
Solve the minimization problem obtained by linearly approximating l(w) around the current solution w(k) and
The last second-order term is the term that ensures that updates do not cause l to deviate too far from the current linear approximation.
L>0 is a parameter to adjust it
Update using that solution
Equivalent to above equation
Without the regularization term Ω, the form is similar to that of the steepest descent method
Just think of ηk=1/L
More general equation
This optimization problem is computing a kind of projection of u
Let proxλΩ(u) be the optimal solution to this problem
The operator that maps U to proxλΩ(u) is called the proximity operator
Updated using information from previous iteration’s solution
FISTA (Fast Iterative Shrinkage-Thresholding Algorithm)
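A minimal sketch of the proximal gradient method (ISTA) for an ℓ1-regularized squared-error loss, where the proximity operator is soft-thresholding; FISTA would add a momentum step on top of this (the data are synthetic):

```python
import numpy as np

def soft_threshold(u, tau):
    """Proximity operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def ista(X, y, lam, n_iter=200):
    L = np.linalg.norm(X, 2) ** 2              # Lipschitz constant of the gradient of l
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)               # gradient step on the differentiable part
        w = soft_threshold(w - grad / L, lam / L)   # proximal step on the l1 term
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
w_true = np.zeros(10); w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.01 * rng.normal(size=30)
print(np.round(ista(X, y, lam=1.0), 2))        # most components shrink to exactly 0
```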
In ℓ1/ℓp group regularization with overlap, to avoid increasing computational cost
Introduce auxiliary variables that avoid group overlap
The computation of the proximity operator can be reduced to
submodular minimization
Problems that involve solving a submodular minimization with a variety of possible parameters
5.3.2 Proximity operator by separating convex function minimization on a submodular polyhedron
Attributing the optimization problem to the optimization of a submodular function
The structural regularization term Ω_{f,p}(w) is expressed in a unified manner as in the above equation
Substituting this into the general optimization problem equation (above) yields
Finally, the above equation becomes
𝛹i(ti) given in the last line is defined as above for ti≥0
Since Ψ_i(t_i) is a convex function of t_i, the computation of the proximity operator reduces to minimizing a separable convex function (a sum of convex functions of the individual variables) over a submodular polyhedron
Application of the decomposition algorithm
The submodular function is recursively partitioned via restriction and contraction
Minimization calculations are performed for each to compute the proximity operator
Partitioning Algorithm
5.4 Acceleration by Network Flow Computation
5.4.1 Formulation as parametric submodular minimization
Optimization in which the objective function contains parameters that can vary and a sequence of solutions is computed for those variations
For the moment, consider the above optimization
Minimization problem for a separable convex function over a base polyhedron
f is a monotone submodular function
Complementary question
Complementary problem
Parametric optimization problems can be solved as parametric maximum flow problems
The function f that produces a group-type/connected structural regularization is
The parametric maximum flow problem can be solved with essentially the same computational cost as the preflow-push method
Corollary
5.4.2 Speed-up by parametric flow computation
Gallo Grigoriadis Tarjan (GGT) algorithm
A class of problems called monotone source-sink
Parametric Preflow Method Algorithm
5.5 Supplement: Calculation of Equation (5.21)