Machine Learning SS: Kyoto U.
Information Geometry
and Its Applications to Machine Learning
Shun-ichi Amari
RIKEN Brain Science Institute
Information geometry studies the invariant geometrical structure of a family of probability distributions, which forms a manifold. It carries a unique Riemannian metric, given by the Fisher information matrix, and a dual pair of affine connections that determine two kinds of geodesics. When the manifold is dually flat, there exists a canonical divergence (the KL-divergence), and strong theorems such as the generalized Pythagorean theorem, the projection theorem and the orthogonal foliation theorem hold even though the manifold is not Euclidean. Machine learning exploits the stochastic structure of environmental information, so information geometry is not only useful for understanding the essential aspects of machine learning but also provides tools for constructing new algorithms. The present talk demonstrates its usefulness for understanding SVM, belief propagation, the EM algorithm, boosting and others.
Information Geometry
-- Manifolds of Probability Distributions
M = { p (x)}
2
Information Geometry
Related fields: systems theory, statistics, combinatorics, mathematics, information theory, neural networks, physics, information sciences, AI, vision, optimization
Geometric structure: Riemannian manifold with dual affine connections
Common object: the manifold of probability distributions
3
Information Geometry ?
Gaussian distributions:
$S = \{p(x; \mu, \sigma)\}$, $\quad p(x; \mu, \sigma) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\Big\{-\dfrac{(x-\mu)^2}{2\sigma^2}\Big\}$
More generally $S = \{p(x; \xi)\}$, here with $\xi = (\mu, \sigma)$
4
Manifold of Probability Distributions
Example: $x \in \{1, 2, 3\}$, $p = (p_1, p_2, p_3)$, $p_1 + p_2 + p_3 = 1$
$S_n = \{p(x)\}$ is the probability simplex; a model $M = \{p(x; \xi)\}$ is a submanifold of it.
[Figure: the simplex with coordinates $p_1, p_2, p_3$.]
5
Invariance
$S = \{p(x, \xi)\}$
The geometry must be invariant under a change of representation by a sufficient statistic $y = y(x)$:
$D[p(x, \xi_1) : p(x, \xi_2)] = D[p(y, \xi_1) : p(y, \xi_2)]$
6
Two Geometrical Structures
Riemannian metric; affine connection (geodesics)
$ds^2 = \sum g_{ij}(\xi)\,d\xi^i d\xi^j$
Fisher information: $g_{ij} = E\big[\partial_i \log p\;\partial_j\log p\big]$
Orthogonality via the inner product $\langle d\xi_1, d\xi_2\rangle = d\xi_1^{T} G\, d\xi_2$
7
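The Fisher metric above can be checked numerically. Below is a minimal sketch (not from the slides) that estimates $g_{ij} = E[\partial_i\log p\,\partial_j\log p]$ for the Gaussian family $p(x;\mu,\sigma)$ by Monte Carlo and compares it with the closed form $\mathrm{diag}(1/\sigma^2,\,2/\sigma^2)$; the function names are mine.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score vector (d/dmu, d/dsigma) of log p(x; mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = (x - mu)**2 / sigma**3 - 1.0 / sigma
    return np.stack([d_mu, d_sigma], axis=-1)

def fisher_monte_carlo(mu, sigma, n=200_000, seed=0):
    """G_ij = E[ d_i log p * d_j log p ], estimated by sampling."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=n)
    s = gaussian_score(x, mu, sigma)          # shape (n, 2)
    return s.T @ s / n

mu, sigma = 1.0, 2.0
print(np.round(fisher_monte_carlo(mu, sigma), 3))
print(np.diag([1.0 / sigma**2, 2.0 / sigma**2]))  # known closed form
```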
Affine Connection
Covariant derivative $\nabla_X Y$; parallel transport along a curve
Geodesic $\xi = \xi(t)$: $\nabla_{\dot\xi}\dot\xi = 0$ (a "straight line")
Arc length $s = \int\sqrt{g_{ij}(\xi)\,d\xi^i d\xi^j}$; the Levi-Civita geodesic is the minimal-distance curve
8
Duality: two affine connections
$\{S, g, \nabla, \nabla^*\}$
$X\langle Y, Z\rangle = \langle\nabla_X Y, Z\rangle + \langle Y, \nabla^*_X Z\rangle$, with $\langle X, Y\rangle = g_{ij}X^iY^j$
[Figure: parallel transport of a vector by $\nabla$ and by $\nabla^*$.]
Riemannian (Levi-Civita) geometry is the self-dual case $\nabla = \nabla^*$.
9
Dual Affine Connections
e-geodesic: $\log r(x, t) = t\log p(x) + (1-t)\log q(x) + c(t)$, i.e. $\nabla_{\dot\xi}\dot\xi(t) = 0$
m-geodesic: $r(x, t) = t\,p(x) + (1-t)\,q(x)$, i.e. $\nabla^*_{\dot\xi}\dot\xi(t) = 0$
[Figure: the two geodesics connecting $p(x)$ and $q(x)$.]
10
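As an illustration (mine, not from the slides), the two geodesics between discrete distributions can be computed directly: the m-geodesic is the linear mixture and the e-geodesic is the normalized log-linear mixture.

```python
import numpy as np

def m_geodesic(p, q, t):
    """Mixture geodesic: r(x, t) = t*p(x) + (1-t)*q(x)."""
    return t * p + (1 - t) * q

def e_geodesic(p, q, t):
    """Exponential geodesic: log r = t*log p + (1-t)*log q + c(t)."""
    log_r = t * np.log(p) + (1 - t) * np.log(q)
    r = np.exp(log_r)
    return r / r.sum()          # c(t) is fixed by normalization

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
for t in (0.0, 0.5, 1.0):
    print(t, m_geodesic(p, q, t), e_geodesic(p, q, t).round(3))
```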
Mathematical structure of S = { p ( x, )}
$g_{ij}(\xi) = E[\partial_i l\,\partial_j l]$, $\quad T_{ijk}(\xi) = E[\partial_i l\,\partial_j l\,\partial_k l]$, $\quad l = \log p(x, \xi)$, $\partial_i = \partial/\partial\xi_i$
Invariant structure: $\{M, g, T\}$
$\alpha$-connection: $\Gamma^{(\alpha)}_{ijk} = \{i, j; k\} - \frac{\alpha}{2}T_{ijk}$
$\nabla^{(\alpha)}$ and $\nabla^{(-\alpha)}$ are dually coupled:
$X\langle Y, Z\rangle = \langle\nabla^{(\alpha)}_X Y, Z\rangle + \langle Y, \nabla^{(-\alpha)}_X Z\rangle$
11
Divergence:
$D[z : y] \ge 0$, with $D[z : y] = 0$ iff $z = y$
$D[z : z + dz] = \sum g_{ij}\,dz_i dz_j$, with $(g_{ij})$ positive definite
[Figure: two points $Z$, $Y$ on the manifold $M$.]
12
Kullback-Leibler Divergence
A quasi-distance:
$D[p(x) : q(x)] = \sum_x p(x)\log\dfrac{p(x)}{q(x)}$
$D[p : q] \ge 0$, $= 0$ iff $p(x) = q(x)$; in general $D[p : q] \ne D[q : p]$
13
f-divergence on $\tilde{S}$
$\tilde{S} = \{\tilde{p}\}$, $\tilde{p}_i > 0$ ($\sum\tilde{p}_i = 1$ need not hold)
$D_f[\tilde{p} : \tilde{q}] = \sum \tilde{p}_i\, f\!\Big(\dfrac{\tilde{q}_i}{\tilde{p}_i}\Big)$, $\quad f$ convex, $f(1) = 0$
$D_f[\tilde{p} : \tilde{q}] = 0 \iff \tilde{p} = \tilde{q}$
On $\tilde{S}$ it is not invariant under the replacement $\bar{f}(u) = f(u) - c(u - 1)$.
14
$\alpha$-divergence
$D_\alpha[\tilde{p} : \tilde{q}] = \dfrac{4}{1-\alpha^2}\sum\Big\{\dfrac{1-\alpha}{2}\tilde{p}_i + \dfrac{1+\alpha}{2}\tilde{q}_i - \tilde{p}_i^{\frac{1-\alpha}{2}}\tilde{q}_i^{\frac{1+\alpha}{2}}\Big\}$
KL-divergence:
$D[\tilde{p} : \tilde{q}] = \sum\Big\{\tilde{p}_i\log\dfrac{\tilde{p}_i}{\tilde{q}_i} - \tilde{p}_i + \tilde{q}_i\Big\}$
15
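A small sketch (mine) of the f-divergence on discrete distributions, with $f$ chosen so that it reproduces the KL divergence and, on normalized distributions, the $\alpha$-divergence above; the helper names are assumptions.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f[p : q] = sum_i p_i * f(q_i / p_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * f(q / p)))

def kl(p, q):
    # f(u) = -log u  recovers  sum p_i log(p_i / q_i)
    return f_divergence(p, q, lambda u: -np.log(u))

def alpha_divergence(p, q, alpha):
    # f(u) = 4/(1-alpha^2) * (1 - u**((1+alpha)/2)); on normalized
    # distributions this agrees with the alpha-divergence formula above.
    c = 4.0 / (1.0 - alpha**2)
    return f_divergence(p, q, lambda u: c * (1.0 - u**((1 + alpha) / 2)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
print(kl(p, q), alpha_divergence(p, q, alpha=0.0))
print(alpha_divergence(p, q, alpha=-0.999))   # close to KL[p : q] as alpha -> -1
```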
$(\alpha, \beta)$-divergence
$D_{\alpha,\beta}[\tilde{p} : \tilde{q}]$: a two-parameter family built from the power terms $\tilde{p}_i^{\,\alpha+\beta}$, $\tilde{q}_i^{\,\alpha+\beta}$ and $\tilde{p}_i^{\,\alpha}\tilde{q}_i^{\,\beta}$
Special choices of $(\alpha, \beta)$ recover the $\alpha$-divergence.
16
Metric and Connections Induced by Divergence (Eguchi)
Riemannian metric:
$g_{ij}(z) = \partial_i\partial_j D[z : y]\big|_{y=z}$, so that $D[z : z+dz] = \frac{1}{2}g_{ij}(z)\,dz_i dz_j$
Affine connections:
$\Gamma_{ijk}(z) = -\,\partial_i\partial_j\partial'_k D[z : y]\big|_{y=z}$
$\Gamma^*_{ijk}(z) = -\,\partial'_i\partial'_j\partial_k D[z : y]\big|_{y=z}$
where $\partial_i = \partial/\partial z_i$, $\partial'_i = \partial/\partial y_i$; this yields a dual pair $\{\nabla, \nabla^*\}$.
17
Duality:
$X\langle Y, Z\rangle = \langle\nabla_X Y, Z\rangle + \langle Y, \nabla^*_X Z\rangle$
$\partial_k g_{ij} = \Gamma_{kij} + \Gamma^*_{kji}$, $\qquad \Gamma^*_{ijk} - \Gamma_{ijk} = T_{ijk}$
recovering the triple $\{M, g, T\}$
18
Dually flat manifold
exponential family; mixture family; {p(x); x discrete}
$p(x, \theta) = \exp\{\sum\theta^i x_i - \psi(\theta)\}$: exponential family
$\theta$-coordinates: affine coordinates, e-flat, e-geodesics; $\eta$-coordinates: dual affine coordinates, m-flat, m-geodesics
Not Riemannian-flat; canonical divergence $D(P : P')$: the KL divergence
19
Dually flat manifold
Potential functions $\psi(\theta)$, $\varphi(\eta)$, related by the Legendre transformation:
$\eta_i = \partial_i\psi(\theta)$, $\quad \theta^i = \partial^i\varphi(\eta)$, $\quad \psi(\theta) + \varphi(\eta) - \sum\theta^i\eta_i = 0$
$g_{ij}(\theta) = \partial_i\partial_j\psi(\theta)$, $\quad g^{ij}(\eta) = \partial^i\partial^j\varphi(\eta)$
For $p(x, \theta) = \exp\{\sum\theta^i x_i - \psi(\theta)\}$: $\psi$ is the cumulant generating function, $\varphi$ the negative entropy
Canonical divergence: $D(P : P') = \psi(\theta) + \varphi(\eta') - \sum\theta^i\eta'_i$
20
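A quick numerical check (mine, not on the slide) of this Legendre structure for the Bernoulli exponential family: $\psi(\theta) = \log(1 + e^{\theta})$, $\eta = \psi'(\theta)$, $\varphi(\eta)$ the negative entropy, and the canonical divergence $\psi(\theta) + \varphi(\eta') - \theta\eta'$ agreeing with the KL divergence $\mathrm{KL}[P' : P]$.

```python
import numpy as np

def psi(theta):
    """Cumulant generating function of Bernoulli: p(x,theta) = exp(theta*x - psi)."""
    return np.log1p(np.exp(theta))

def eta(theta):
    """Dual (expectation) coordinate eta = psi'(theta) = E[x]."""
    return 1.0 / (1.0 + np.exp(-theta))

def phi(e):
    """Legendre dual: phi(eta) = max_theta {theta*eta - psi(theta)} = negative entropy."""
    return e * np.log(e) + (1 - e) * np.log(1 - e)

def canonical_divergence(theta, eta_prime):
    """D(P : P') = psi(theta) + phi(eta') - theta * eta'."""
    return psi(theta) + phi(eta_prime) - theta * eta_prime

def kl_bernoulli(p1, p2):
    return p1 * np.log(p1 / p2) + (1 - p1) * np.log((1 - p1) / (1 - p2))

theta, theta_p = 0.3, -1.2
# for an exponential family, D(P : P') equals KL[P' : P]
print(canonical_divergence(theta, eta(theta_p)), kl_bernoulli(eta(theta_p), eta(theta)))
```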
Manifold with Convex Function
$S$ with coordinates $\theta = (\theta^1, \theta^2, \ldots, \theta^n)$ and a convex function $\psi(\theta)$
Examples: $\psi(\theta) = \frac{1}{2}\sum(\theta^i)^2$; $\quad \varphi(p) = \int p(x)\log p(x)\,dx$ (negative entropy)
21
Riemannian metric and flatness: $\{S, \psi(\theta), \theta\}$ (affine structure)
Bregman divergence:
$D(\theta : \theta') = \psi(\theta) - \psi(\theta') - \nabla\psi(\theta')\cdot(\theta - \theta')$
$D(\theta : \theta + d\theta) = \frac{1}{2}g_{ij}(\theta)\,d\theta^i d\theta^j$, $\quad g_{ij} = \partial_i\partial_j\psi(\theta)$, $\partial_i = \partial/\partial\theta^i$
Affine flatness: the $\theta$-coordinate curves are geodesics of the flat connection (not the Levi-Civita one)
22
Legendre Transformation
$\eta_i = \partial_i\psi(\theta)$ (one-to-one)
$\psi(\theta) + \varphi(\eta) - \sum\theta^i\eta_i = 0$, $\quad \theta^i = \partial^i\varphi(\eta)$
$\varphi(\eta) = \max_\theta\{\sum\theta^i\eta_i - \psi(\theta)\}$
$D(\theta : \eta') = \psi(\theta) + \varphi(\eta') - \sum\theta^i\eta'_i$
23
Two affine coordinate systems
$\theta$: affine coordinates (e-geodesics); $\eta$: dual affine coordinates (m-geodesics)
$(\theta, \eta)$ dually orthogonal: $\langle e_i, e^j\rangle = \delta_i^{\,j}$, where $e_i = \partial/\partial\theta^i$, $e^j = \partial/\partial\eta_j$
$X\langle Y, Z\rangle = \langle\nabla_X Y, Z\rangle + \langle Y, \nabla^*_X Z\rangle$
24
Pythagorean Theorem
(dually flat manifold)
$D[P : Q] + D[Q : R] = D[P : R]$, provided the m-geodesic joining $P$ and $Q$ is orthogonal at $Q$ to the e-geodesic joining $Q$ and $R$
Euclidean space is the self-dual case: $\psi(\theta) = \frac{1}{2}\sum(\theta^i)^2$, $\theta = \eta$
25
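A numeric illustration (mine): take a joint distribution $P$, let $Q$ be its m-projection onto the independent (e-flat) family, i.e. the product of its marginals, and let $R$ be any other independent distribution; then $D[P:R] = D[P:Q] + D[Q:R]$ with $D$ the KL divergence.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# joint distribution P over (x, y) with x, y in {0, 1}
P = np.array([[0.30, 0.10],
              [0.15, 0.45]])

# m-projection of P onto the independent family = product of its marginals
px, py = P.sum(axis=1), P.sum(axis=0)
Q = np.outer(px, py)

# any other independent distribution R
R = np.outer([0.6, 0.4], [0.25, 0.75])

print(kl(P, R), kl(P, Q) + kl(Q, R))   # equal: KL[P:R] = KL[P:Q] + KL[Q:R]
```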
Projection Theorem
$\min_{Q\in M} D[P : Q]$: $\hat{Q}$ is the m-geodesic projection of $P$ onto $M$
$\min_{Q\in M} D[Q : P]$: $\hat{Q}$ is the e-geodesic projection of $P$ onto $M$
26
Two Types of Divergence
Invariant divergences (Chentsov, Csiszár): f-divergences, giving the Fisher metric and $\alpha$-connections
$D_f[p : q] = \int p(x)\,f\!\Big(\dfrac{q(x)}{p(x)}\Big)dx$
Flat divergences (Bregman): derived from a convex function
KL-divergence belongs to both classes: flat and invariant
27
divergence
$S = \{p\}$: space of probability distributions
Invariance ⟹ invariant divergence: f-divergence, Fisher information metric, $\alpha$-connections
Dually flat space ⟹ flat divergence: Bregman divergence of a convex function
KL divergence lies in the intersection:
$D[p : q] = \int p(x)\log\dfrac{p(x)}{q(x)}\,dx$
28
Space of positive measures: vectors, matrices, arrays
$\tilde{S} = \{\tilde{p}\}$, $\tilde{p}_i > 0$ ($\sum\tilde{p}_i = 1$ need not hold)
On $\tilde{S}$, the divergences that are both f-divergences and Bregman divergences are the $\alpha$-divergences.
29
Applications of Information Geometry
Statistical inference
Machine learning and AI
Computer vision
Convex programming
Signal processing (ICA, sparse)
Information theory, systems theory
Quantum information geometry
30
Applications to Statistics
Curved exponential family $p(x, u)$ inside the exponential family $p(x, \theta) = \exp\{\theta\cdot x - \psi(\theta)\}$:
$p(x, u) = \exp\{\theta(u)\cdot x - \psi(\theta(u))\}$
Data $x_1, x_2, \ldots, x_n$; estimator $\hat{u}(x_1, \ldots, x_n)$
$\bar{x} = \frac{1}{n}\sum_{k=1}^{n} x_k$, and $\hat{\eta} = \bar{x}$
31
$x$: discrete, $X = \{0, 1, \ldots, n\}$
$S_n = \{p(x) \mid x \in X\}$ is an exponential family:
$p(x) = \sum_{i=0}^{n} p_i\,\delta_i(x) = \exp\big[\sum_{i=1}^{n}\theta^i x_i - \psi(\theta)\big]$
$\theta^i = \log p_i - \log p_0$, $\quad x_i = \delta_i(x)$, $\quad \psi(\theta) = -\log p_0$
A statistical model is a submanifold $p(x, u) \subset S_n$.
32
High-Order Asymptotics
$p(x, \theta(u))$; estimator $\hat{u} = \hat{u}(x_1, \ldots, x_n)$ from $x_1, \ldots, x_n$; $\hat{\eta} = \bar{x}$
Error covariance: $e = E\big[(\hat{u} - u)(\hat{u} - u)^T\big] = \dfrac{1}{n}G_1 + \dfrac{1}{2n^2}G_2 + \cdots$
$G_1 = G^{-1}$: Cramér–Rao bound (linear theory)
$G_2 = (\Gamma^{(m)})^2 + (H_M^{(e)})^2 + (H_A^{(m)})^2$: quadratic approximation, built from the squared mixture-connection term and the squared embedding curvatures of the model and the ancillary family
33
Information Geometry of Belief Propagation
Shun-ichi Amari (RIKEN BSI) Shiro Ikeda (Inst. Statist. Math.) Toshiyuki Tanaka (Kyoto U.)
34
Stochastic Reasoning
$p(x, y, z, r, s)$ and the conditional $p(x, y, z \mid r, s)$
Binary variables: $x, y, z, \ldots \in \{1, -1\}$
35
Stochastic Reasoning
$q(x_1, x_2, x_3, \ldots \mid \text{observation})$, $\quad x = (x_1\, x_2\, x_3 \ldots)$, $x_i \in \{1, -1\}$
Maximum likelihood: $\hat{x} = \arg\max q(x_1, x_2, x_3, \ldots)$
Least bit-error-rate estimator: $\hat{x}_i = \mathrm{sgn}\,E[x_i]$
36
Mean Value
Marginalization: projection to independent distributions
$\Pi_0\,q(x) = q_1(x_1)q_2(x_2)\cdots q_n(x_n) = q^0(x)$
$q_i(x_i) = \int q(x_1, \ldots, x_n)\,dx_1\cdots dx_{i-1}\,dx_{i+1}\cdots dx_n$
$\eta = E_q[x] = E_{q^0}[x]$
37
$q(x) = \exp\big\{\sum k_i x_i + \sum_{r=1}^{L} c_r(x) - \psi_q\big\}$, $\quad c_r(x) = c_r\,x_{i_1}\cdots x_{i_s}$, $r = (i_1\cdots i_s)$
Boltzmann machines, spin glasses, neural networks: $p(x) = \exp\{\sum w_{ij}x_i x_j + \sum h_i x_i - \psi\}$, i.e. pairwise $r = (i_1, i_2)$
Turbo codes, LDPC codes; $x_i \in \{-1, 1\}$
38
Computationally Difficult
Computing $\eta = E_q[x]$ for $q(x) = \exp\{\sum c_r(x) - \psi_q\}$ is hard.
Approximations: mean-field approximation, belief propagation (tree propagation), CCCP (convex–concave procedure)
39
Information Geometry of Mean Field Approximation
$D[q : p] = \sum_x q(x)\log\dfrac{q(x)}{p(x)}$
m-projection: $\Pi^m_0\,q = \arg\min_{p\in M_0} D[q : p]$
e-projection: $\Pi^e_0\,q = \arg\min_{p\in M_0} D[p : q]$
$M_0 = \{\prod_i p_i(x_i)\}$: the independent (mean-field) submanifold
40
Information Geometry
$q(x) = \exp\{\sum c_r(x) - \psi\}$
$M_0 = \{p_0(x, \theta) = \exp\{\theta\cdot x - \psi_0\}\}$
$M_r = \{p_r(x, \zeta_r) = \exp\{c_r(x) + \zeta_r\cdot x - \psi_r\}\}$, $\quad r = 1, \ldots, L$
[Figure: the submanifolds $M_r$, $M_{r'}$ and $M_0$ approximating $q(x)$.]
41
Belief Propagation
$M_r:\ p_r(x, \zeta_r) = \exp\{c_r(x) + \zeta_r\cdot x - \psi_r\}$
$\xi_r^{t+1} = \pi_0\circ p_r(x, \zeta_r^t) - \zeta_r^t$: the belief for $c_r(x)$ ($\pi_0$ denotes the m-projection onto $M_0$)
$\theta^{t+1} = \sum_r \xi_r^{t+1}$, $\quad \zeta_r^{t+1} = \theta^{t+1} - \xi_r^{t+1}$
42
Belief Prop Algorithm
[Figure: messages $\xi_r, \xi_{r'}$ and parameters $\zeta_r, \zeta_{r'}$ exchanged between $M_r$, $M_{r'}$ and $M_0$.]
43
Equilibrium of BP
1) m-condition: $p_0(x, \theta^*) = \pi_0\circ p_r(x, \zeta_r^*)$ for every $r$ (the equilibrium lies on an m-flat submanifold $M(\theta^*)$)
2) e-condition: $\theta^* = \dfrac{1}{L-1}\sum_r \zeta_r^*$ ($q(x)$, the $p_r$ and $p_0$ lie on a common e-flat submanifold $E(\theta^*)$)
[Figure: $M_r$, $M_{r'}$, $M_0$ at the equilibrium.]
44
Free energy:
$F(\theta, \zeta_1, \ldots, \zeta_L) = D[p_0 : q] - \sum_r D[p_0 : p_r]$
Critical point: $\dfrac{\partial F}{\partial\theta} = 0$ (e-condition), $\quad \dfrac{\partial F}{\partial\zeta_r} = 0$ (m-condition)
$F$ is not convex.
45
Belief propagation: the e-condition holds at every step; iterate $(\zeta_1, \zeta_2, \ldots, \zeta_L) \to (\zeta'_1, \zeta'_2, \ldots, \zeta'_L)$ with $\theta' = \frac{1}{L-1}\sum_r\zeta'_r$ until the m-condition is also satisfied.
CCCP: the m-condition holds at every step, $\theta':\ \pi_0\circ p_r(x, \zeta'_r) = p_0(x, \theta')$ with $\zeta'_r = \pi_r\circ p_0(x, \theta')$; iterate until the e-condition is also satisfied.
46
Iterate: $\xi_r^{t+1} = \pi_0\circ p_r(x, \zeta_r^t) - \zeta_r^t$, $\quad \theta^{t+1} = \sum_r\xi_r^{t+1}$, $\quad \zeta_r^{t+1} = \theta^{t+1} - \xi_r^{t+1}$
47
Convex-Concave Computational Procedure (CCCP) Yuille
$F(\theta) = F_1(\theta) - F_2(\theta)$ (difference of convex functions)
Update: $\nabla F_1(\theta^{t+1}) = \nabla F_2(\theta^t)$
Eliminates the double loop.
48
Boltzmann Machine
$p(x_i = 1) = \sigma\big(\sum_j w_{ij}x_j + h_i\big)$
Stationary distribution: $p(x) = \exp\{\sum w_{ij}x_i x_j + \sum h_i x_i - \psi(w)\}$
[Figure: approximating a given $q(x)$ by $p(x)$ in the Boltzmann-machine manifold $B$.]
49
Boltzmann machine ---hidden units
EM algorithm: alternate the e-projection and the m-projection between the data submanifold $D$ and the model manifold $M$.
[Figure: $D$ and $M$ with the two projections.]
50
EM algorithm
Hidden variables $y$: model $M = \{p(x, y; u)\}$, data $D = \{x_1, \ldots, x_N\}$
Data submanifold: $D_M = \{p(x, y) \mid p(x) = p_D(x)\}$
$\min_{p\in M}\mathrm{KL}[p(x, y) : p]$: m-projection of the current $p(x, y) \in D_M$ onto $M$ (M-step)
$\min_{p\in D_M}\mathrm{KL}[p : p(x, y; u)]$: e-projection of the current model onto $D_M$ (E-step)
51
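A minimal sketch (mine, not from the slides) of EM for a two-component 1-D Gaussian mixture, with the E-step playing the role of the e-projection onto the data submanifold (computing the conditional distribution of the hidden label) and the M-step the m-projection onto the model manifold.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a 2-component 1-D Gaussian mixture."""
    mu = np.array([x.min(), x.max()])          # crude initialization
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step (e-projection): posterior responsibilities of the hidden label
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step (m-projection): maximize the expected complete-data log-likelihood
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 700)])
print(em_gmm_1d(x))
```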
SVM : support vector machine
Embedding: $z_i = \varphi_i(x)$, $\quad f(x) = \sum w_i\varphi_i(x) = \sum\alpha_i y_i K(x_i, x)$
Kernel: $K(x, x') = \sum_i\varphi_i(x)\varphi_i(x')$
Conformal change of kernel: $\tilde{K}(x, x') = \rho(x)\rho(x')K(x, x')$, $\quad \rho(x) = \exp\{-|f(x)|^2\}$
52
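A sketch (mine) of the conformal change of kernel $\tilde{K}(x,x') = \rho(x)\rho(x')K(x,x')$ for an RBF kernel; the decision function `f` and the constant `c` in $\rho$ are placeholders, not part of the slide.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def conformal_kernel(X1, X2, f, K=rbf_kernel, c=1.0):
    """K_tilde(x, x') = rho(x) * rho(x') * K(x, x'), rho(x) = exp(-c |f(x)|^2).

    rho is largest near the decision boundary f(x) = 0, which magnifies the
    induced Riemannian metric there (the idea behind the conformal change)."""
    rho1 = np.exp(-c * f(X1) ** 2)
    rho2 = np.exp(-c * f(X2) ** 2)
    return rho1[:, None] * K(X1, X2) * rho2[None, :]

# toy decision function standing in for a first-pass SVM (hypothetical)
f = lambda X: X[:, 0] - X[:, 1]
X = np.random.default_rng(1).normal(size=(5, 2))
print(conformal_kernel(X, X, f).round(3))
```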
Signal Processing
ICA : Independent Component Analysis
$x_t = A s_t$: observe $x_t$, recover $s_t$
Sparse component analysis; positive matrix factorization
53
mixture and unmixture of independent signals
[Figure: independent sources $s_1, s_2, \ldots, s_n$ mixed into observations $x_1, \ldots, x_m$.]
$x_i = \sum_{j=1}^{n} A_{ij}s_j$, i.e. $x = As$
54
Independent Component Analysis
$x = As$, $\quad y = Wx$, $\quad x_i = \sum A_{ij}s_j$; ideally $W = A^{-1}$
[Figure: $s \to A \to x \to W \to y$.]
Observations: $x(1), x(2), \ldots, x(t)$; recover: $s(1), s(2), \ldots, s(t)$
55
56
Space of Matrices : Lie group
$W \mapsto W + dW$; translate to the identity: $dX = dW\,W^{-1}$, $\quad I \mapsto I + dX$
Invariant metric: $|dW|^2 = \mathrm{tr}\big(dX\,dX^T\big) = \mathrm{tr}\big(dW\,W^{-1}W^{-T}dW^T\big)$
Natural gradient: $\tilde{\nabla}l = \nabla l\;W^T W$
$dX$: non-holonomic basis
57
Information Geometry of ICA
$S = \{p(y)\}$; independent submanifold $I = \{q_1(y_1)q_2(y_2)\cdots q_n(y_n)\}$; model $\{p(Wx)\}$
Natural gradient of the loss $l(W) = \mathrm{KL}[p(y; W) : q(y)]$
Estimating functions $r(y)$: stability, efficiency
58
Semiparametric Statistical Model
$p(x; W, r) = |\det W|\;r(Wx)$, $\quad r(s) = \prod r_i(s_i)$: unknown, $\quad W = A^{-1}$
Observations: $x(1), x(2), \ldots, x(t)$
59
Natural Gradient
$\tilde{\nabla}l(y, W) = \dfrac{\partial l(y, W)}{\partial W}\,W^{T}W$
60
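For the ICA loss, the natural-gradient form above leads to the well-known update $W \leftarrow W + \eta\,(I - \varphi(y)y^T)W$. The sketch below is mine; the tanh score function is an assumption suitable for super-Gaussian sources.

```python
import numpy as np

def natural_gradient_ica(X, n_iter=2000, eta=0.01, seed=0):
    """Natural-gradient ICA: W <- W + eta * (I - phi(y) y^T) W.

    X: observations, shape (n_channels, n_samples); phi = tanh is assumed."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    W = np.eye(n) + 0.01 * rng.normal(size=(n, n))
    for _ in range(n_iter):
        Y = W @ X
        phi = np.tanh(Y)
        W += eta * (np.eye(n) - phi @ Y.T / X.shape[1]) @ W
    return W

# toy demo: two super-Gaussian sources mixed by a random matrix
rng = np.random.default_rng(2)
S = rng.laplace(size=(2, 5000))
A = rng.normal(size=(2, 2))
W = natural_gradient_ica(A @ S)
print(np.round(W @ A, 2))   # approximately a scaled permutation matrix
```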
Basis Given: overcomplete case Sparse Solution
$x = As = \sum s_i a_i$, $A$ overcomplete: many solutions
Seek a sparse one, with many $s_i = 0$
61
$x = As$, $A$ overcomplete
$L_2$-norm, $\min\sum s_i^2$: generalized-inverse solution
$L_1$-norm, $\min\sum|s_i|$: sparse solution
62
63
Overcomplete Basis and Sparse Solution
$\min\|s\|_1 = \sum|s_i|$ subject to $x = \sum s_i a_i = As$
Non-linear denoising: $\min\|As - x\|_p + \lambda\|s\|_{p'}$
64
Sparse Solution
$\min\varphi(\theta)$ with a penalty $F_p(\theta) = \sum|\theta_i|^p$ (a Bayes prior)
$F_0(\theta) = \#\{i : \theta_i \ne 0\}$: sparsest solution
$F_1(\theta) = \sum|\theta_i|$: $L_1$ solution
$F_p(\theta)$, $0 \le p \le 1$: sparse solutions (overcomplete case)
$F_2(\theta) = \sum\theta_i^2$: generalized-inverse solution
65
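For the $L_1$ penalty, the penalized problem $\min\frac{1}{2}\|y - X\theta\|^2 + \lambda\sum|\theta_i|$ can be solved by iterative soft-thresholding (ISTA); this particular algorithm is my illustration, not taken from the slides.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for  min 0.5*||y - X theta||^2 + lam*||theta||_1."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))             # overcomplete: more columns than rows
theta_true = np.zeros(200)
theta_true[[3, 50, 120]] = [2.0, -1.5, 1.0]
theta_hat = ista(X, X @ theta_true, lam=0.1)
print(np.flatnonzero(np.abs(theta_hat) > 0.1))   # indices of the larger recovered coefficients
```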
Optimization under Sparsity Constraint
$\min\varphi(\theta)$: convex function, under the constraint $F(\theta) \le c$
Typical case: $\varphi(\theta) = \frac{1}{2}\|y - X\theta\|^2 = \frac{1}{2}(\theta - \theta^*)^T G(\theta - \theta^*)$, $\quad F(\theta) = \sum|\theta_i|^p$, $\;p = 2,\ 1,\ 1/2$
66
L1-constrained optimization
$P_c$ problem: $\min\varphi(\theta)$ under $F(\theta) \le c$; solution path $\theta(c)$ (LASSO, LARS)
$P_\lambda$ problem: $\min\varphi(\theta) + \lambda F(\theta)$; solution path $\theta(\lambda)$
For $p \ge 1$ the two solutions coincide: $\theta(\lambda) = \theta(c)$ with $\lambda = \lambda(c)$
For $p < 1$ the correspondence $\lambda = \lambda(c)$ can be multiple-valued and discontinuous, and the stability of the two formulations differs
67
Projection from $\theta^*$ onto the constraint surface $F(\theta) = c$ (information geometry)
[Figure: $\theta^*$ and its projection onto $\{F = c\}$.]
68
Convex Cone Programming
Cone of positive semi-definite matrices, with a convex potential (barrier) function
$\min\,c\cdot x$ subject to $Ax = b$ and the cone constraint
Dual-geodesic approach
Support vector machines
69
[Fig. 1: constraint regions $R_c$ for $n = 2$: (a) $p > 1$, (b) $p = 1$, (c) $p < 1$ (non-convex).]
70
Orthogonal projection vs. dual projection: minimizing $\varphi(\theta) = D[\theta^* : \theta]$ under $F(\theta) = c$ is the dual-geodesic projection of $\theta^*$ onto the constraint surface.
71
[Fig. 5: the normal direction $n = \nabla F$ of the constraint surface; for non-smooth $F$ the subgradient replaces the gradient.]
72
LASSO path and LARS path (stagewise solution)
$\min\varphi(\theta)$ under $F(\theta) \le c$ vs. $\min\varphi(\theta) + \lambda F(\theta)$
Solution paths $\theta(c)$ and $\theta(\lambda)$; correspondence $c \leftrightarrow \lambda$
73
Active set and gradient
Active set: $A(\theta) = \{i \mid \theta_i \ne 0\}$
$\partial_i F_p(\theta) \propto \mathrm{sgn}(\theta_i)\,|\theta_i|^{p-1}$ for $i \in A$; for $i \notin A$ the subgradient component (case $p = 1$) lies in $[-1, 1]$
74
Solution path
$\nabla_A\varphi(\theta_c) + \lambda_c\nabla_A F(\theta_c) = 0$
Differentiating along the path:
$\big(\nabla_A\nabla_A\varphi(\theta_c) + \lambda_c\nabla_A\nabla_A F(\theta_c)\big)\dot{\theta}_c = -\dot{\lambda}_c\nabla_A F(\theta_c)$
$\dot{\theta}_c = -\dot{\lambda}_c\,K^{-1}\nabla_A F(\theta_c)$, $\quad K = G(\theta_c) + \lambda_c\nabla\nabla F(\theta_c)$, $\quad \dot{\theta}_c = \dfrac{d\theta_c}{dc}$
(For $L_1$: $\nabla\nabla F_1 = 0$, $\nabla F_1 = (\mathrm{sgn}\,\theta_i)$.)
75
Solution path in the subspace of the active set
$\nabla_A\varphi(\theta) + \lambda\nabla_A F(\theta) = 0$, $\quad \dot{\theta} = -K_A^{-1}\nabla_A F(\theta)\,\dot{\lambda}$
$A$: active directions; the active set changes $A \to A'$ at turning points.
76
Gradient Descent Method
$\nabla L = \{\partial L/\partial x_i\}$: covariant
$\min_a L(x + a)$ subject to $g_{ij}a^i a^j = \varepsilon^2$
$\tilde{\nabla}L = \{g^{ij}\partial_j L(x)\}$: contravariant (natural gradient)
$x_{t+1} = x_t - c\,\tilde{\nabla}L(x_t)$
77
Extended LARS (p = 1) and Minkovskian grad
$L_p$-norm: $\|a\|_p = \big(\sum|a_i|^p\big)^{1/p}$
Steepest direction: $\max_a\varphi(\theta + a)$ under $\|a\|_p = 1$
For $p = 1 + \varepsilon$ the steepest direction concentrates on the largest gradient component:
$a_i = \mathrm{sgn}(\partial_i\varphi)$ for $i = \arg\max_j|\partial_j\varphi(\theta)|$ (the active coordinate), $a_i = 0$ otherwise
78
$i^* = \arg\max_i|f_i|$; when several coordinates tie, $|f_{i^*}| = |f_j|$, all of them enter:
$(\tilde{\nabla}F)_i = 1$ for $i = i^*$ and the tied $j$, $\;0$ otherwise
$\theta^{t+1} = \theta^t - \varepsilon\,\tilde{\nabla}F$: LARS
79
Euclidean case ($p = 2$): $\tilde{\nabla}F \propto \nabla f$
General $p$: $(\tilde{\nabla}F)_i = c\,\mathrm{sgn}(f_i)\,|f_i|^{\frac{1}{p-1}}$
As $p \to 1$: $\tilde{\nabla}F = c\,\mathrm{sgn}(f_{i^*})\,(0, \ldots, 1, \ldots, 0)^T$, with the single nonzero entry at the maximal component $i^*$
80
L1/2 constraint: non-convex optimization
$\lambda$-trajectory and $c$-trajectory
Example (1-dim): $\varphi(\theta) = \frac{1}{2}(\theta - \theta^*)^2$, $\quad f(\theta) = \varphi + \lambda F = \frac{1}{2}(\theta - \theta^*)^2 + \lambda\sqrt{|\theta|}$
81
$P_c$: $\min\varphi(\theta)$ under the $L_{1/2}$-type constraint; solution $\theta_c$, $\;\theta_{c=0} = 0$
$P_\lambda$: stationary points $f'(\theta) = 0$, solved by the half-thresholding operator $\theta = R_\lambda(\theta^*)$ (Xu Zongben's operator)
Correspondence $\lambda_c = \lambda(c)$
82
ICCN-Huangshan
Sparse Signal Analysis
Shun-ichi Amari
RIKEN Brain Science Institute; collaborator: Masahiro Yukawa (Niigata University)
83
Solution path $\lambda(c)$:
not continuous, not monotone; jumps can occur
84
An Example of the greedy path
85
Linear Programming
Constraints $\sum_j A_{ij}x_j \ge b_i$, objective $\max c\cdot x$
Barrier: $\varphi(x) = -\sum_i\log\big(\sum_j A_{ij}x_j - b_i\big)$
Interior-point ("inner") method
86
Convex Programming Inner Method
LP: $Ax \ge b$, $x \ge 0$, $\min c\cdot x$
$\varphi(x) = -\sum_i\log\big(\sum_j A_{ij}x_j - b_i\big) - \sum_i\log x_i$
Simplex method vs. interior-point (inner) method
87
Polynomial-Time Algorithm
Curvature determines the step size: the central trajectory $x^*(t) = \arg\min_x\{t\,c\cdot x + \varphi(x)\}$ deviates from a geodesic, and its embedding curvature $\|H^{(m)}\|^2$ governs the polynomial-time complexity bound.
88
Neural Networks
Multilayer perceptrons
Higher-order correlations
Synchronous firing
89
Multilayer Perceptrons
$y = \sum v_i\varphi(w_i\cdot x) + n$, $\quad x = (x_1, x_2, \ldots, x_n)$
$p(y\mid x;\theta) = c\,\exp\big\{-\frac{1}{2}\big(y - f(x,\theta)\big)^2\big\}$, $\quad f(x,\theta) = \sum v_i\varphi(w_i\cdot x)$
$\theta = (w_1, \ldots, w_m;\,v_1, \ldots, v_m)$
90
Multilayer Perceptron
Neuromanifold: a space of functions
$y = f(x, \theta) = \sum v_i\varphi(w_i\cdot x)$, $\quad \theta = (v_1, \ldots, v_m;\,w_1, \ldots, w_m)$
91
singularities
92
Geometry of singular model
$y = v\,\varphi(w\cdot x) + n$
The model is singular on the set $v\,|w| = 0$ (i.e. where $v = 0$ or $w = 0$).
93
Backpropagation ---gradient learning
Examples: $(y_1, x_1), \ldots, (y_t, x_t)$
$E = \frac{1}{2}\big|y - f(x,\theta)\big|^2 = -\log p(y, x;\theta)$, $\quad f(x,\theta) = \sum v_i\varphi(w_i\cdot x)$
$\theta_{t+1} = \theta_t - \varepsilon_t\,\dfrac{\partial E}{\partial\theta}$
Natural gradient (Riemannian steepest descent): $\tilde{\nabla}E = G^{-1}\nabla E$
94
95
96
Conformal transformation
q-Fisher information: $g^{(q)}_{ij}(p) = h_q(p)\,g^{F}_{ij}(p)$
q-divergence: $D_q[p(x) : r(x)] = \dfrac{1 - \int p(x)^{q}\,r(x)^{1-q}\,dx}{(1-q)\,h_q(p)}$
97
Total Bregman Divergence and its Applications to Shape Retrieval
Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari, Frank Nielsen
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010
98
Total Bregman Divergence
$TD[x : y] = \dfrac{D[x : y]}{\sqrt{1 + \|\nabla f(y)\|^2}}$
Rotational invariance; conformal geometry
99
Total Bregman divergence (Vemuri)
$TBD(p : q) = \dfrac{\psi(p) - \psi(q) - \nabla\psi(q)\cdot(p - q)}{\sqrt{1 + \|\nabla\psi(q)\|^2}}$
100
Clustering : t-center
$E = \{x_1, \ldots, x_m\}$
t-center of $E$: $\bar{x} = \arg\min_x\sum_i TD[x, x_i]$
[Figure: the t-center $\bar{x}$ of the cluster $E$.]
101
t-center $\bar{x}$:
$\nabla f(\bar{x}) = \dfrac{\sum_i w_i\,\nabla f(x_i)}{\sum_i w_i}$, $\quad w_i = \dfrac{1}{\sqrt{1 + \|\nabla f(x_i)\|^2}}$
102
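For $f(x) = \frac{1}{2}\|x\|^2$ the total Bregman divergence and the t-center take the simple closed forms below; this small demo (mine) shows how the t-center down-weights a far-away point compared with the ordinary mean.

```python
import numpy as np

def total_bregman_sq(x, y):
    """Total Bregman divergence for f(x) = 0.5*||x||^2:
    TD[x : y] = 0.5*||x - y||^2 / sqrt(1 + ||grad f(y)||^2), grad f(y) = y."""
    return 0.5 * np.sum((x - y) ** 2) / np.sqrt(1.0 + np.sum(y ** 2))

def t_center_sq(points):
    """t-center for f = 0.5*||x||^2: a weighted mean with
    w_i = 1 / sqrt(1 + ||x_i||^2), following the formula above."""
    w = 1.0 / np.sqrt(1.0 + np.sum(points ** 2, axis=1))
    return (w[:, None] * points).sum(axis=0) / w.sum()

E = np.array([[0.0, 0.1], [0.2, -0.1], [0.1, 0.0], [8.0, 9.0]])  # last point: outlier
print(np.mean(E, axis=0))     # ordinary mean is dragged toward the outlier
print(t_center_sq(E))         # t-center down-weights the far-away point
```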
q super-robust estimator (Eguchi)
Instead of $\max\prod p(x_i, \theta)$, maximize $\dfrac{1}{h_{q+1}(\theta)}\prod p(x_i, \theta)$ (the normalized q-likelihood)
Bias-corrected q-estimating function: $s_q(x, \theta) = p(x, \theta)^q\,\big\{\partial_i\log p - \partial_i c_{q+1}(\theta)\big\}$, $\quad c_{q+1}(\theta) = \frac{1}{q+1}\log h_{q+1}(\theta)$
Estimating equation: $\sum_{i=1}^{N}s_q(x_i, \theta) = 0$
103
Conformal change of divergence
$\tilde{D}(p : q) = \sigma(p)\,D[p : q]$
$\tilde{g}_{ij} = \sigma(p)\,g_{ij}$
$\tilde{T}_{ijk} = \sigma\,(T_{ijk} + s_k g_{ij} + s_j g_{ik} + s_i g_{jk})$, $\quad s_i = \partial_i\log\sigma$
104
t-center is robust
$E = \{x_1, \ldots, x_n;\,y\}$ (one outlier $y$): $\bar{x}_\varepsilon = \bar{x} + \varepsilon\,z(\bar{x}; y)$, $\quad \varepsilon = \frac{1}{n}$
Influence function $z(\bar{x}; y)$: the estimator is robust if $\|z\| < c$ as $y \to \infty$
105
$z(\bar{x}, y) = G^{-1}\big(\nabla f(y) - \nabla f(\bar{x})\big)\,w(y)$, $\quad G = \frac{1}{n}\sum w(x_i)\,\nabla\nabla f(x_i)$
Robust because $z$ is bounded: $w(y)\,\nabla f(y) = \dfrac{\nabla f(y)}{\sqrt{1 + \|\nabla f(y)\|^2}}$ stays bounded
Euclidean case, $f = \frac{1}{2}\|x\|^2$: $\;\big\|z(\bar{x}, y)\big\| = \Big\|G^{-1}\dfrac{y}{\sqrt{1 + \|y\|^2}}\Big\| < \infty$, whereas the ordinary mean has $z(\bar{x}, y) = y$ (unbounded)
106
MPEG7 database
Great intraclass variability, and small interclass dissimilarity.
107
Other TBD applications
Diffusion tensor imaging (DTI) analysis
[Vemuri]
Interpolation Segmentation
Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari and Frank Nielsen, Total Bregman Divergence and its Applications to DTI Analysis, IEEE TMI, to appear
108
TBD application-shape retrieval
Using MPEG7 database; 70 classes, with 20 shapes each class (Meizhu Liu)
109
Multiterminal Information & Statistical Inference
Terminals $X: x^n$ and $Y: y^n$, encoded at rates $R_X$, $R_Y$
$p(x, y; \theta)$; codebook sizes $M_X = 2^{R_X n}$, $M_Y = 2^{R_Y n}$
110
$p(x, y)$, $p(x, y, z)$: estimate the correlation parameter from marginal (rate-compressed) data
Fisher information decomposes as $G = G_M + G_C$; 0-rate case; Slepian–Wolf
111
Linear Systems
ARMA model: $u_t \to$ [ARMA] $\to x_t$
$x_{t+1} = \dfrac{1 + b_1 z^{-1} + \cdots + b_q z^{-q}}{1 + a_1 z^{-1} + \cdots + a_p z^{-p}}\,u_t$, i.e. $x_{t+1} = f(\omega, z^{-1}, u_t)$, $\quad \omega = (a_1, \ldots, a_p;\,b_1, \ldots, b_q)$
AR models: e-flat; MA models: m-flat
112
Machine Learning
Boosting : combination of weak learners
$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, $\quad y_i = \pm 1$
Weak learner: $y = h(x, u) = \mathrm{sgn}\,f(x, u)$
113
Weak Learners
$H(x) = \mathrm{sgn}\big(\sum_t\alpha_t h_t(x)\big)$
$\varepsilon_t = \mathrm{Prob}\{h_t(x_i) \ne y_i\}$ under the weight distribution $W_t$
$W_{t+1}(i) = c\,W_t(i)\exp\{-\alpha_t y_i h_t(x_i)\}$
114
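A compact AdaBoost-style sketch (mine) following the weight update above, with decision stumps as weak learners; the choice $\alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$ is the standard AdaBoost one and is assumed here rather than taken from the slide.

```python
import numpy as np

def stump_predict(x, feat, thresh, sign):
    return sign * np.where(x[:, feat] > thresh, 1, -1)

def best_stump(x, y, w):
    """Weak learner h_t: the decision stump with minimal weighted error under W_t."""
    best_err, best_h = 1.0, None
    for feat in range(x.shape[1]):
        for thresh in np.quantile(x[:, feat], np.linspace(0.05, 0.95, 10)):
            for sign in (1, -1):
                err = np.sum(w * (stump_predict(x, feat, thresh, sign) != y))
                if err < best_err:
                    best_err, best_h = err, (feat, thresh, sign)
    return best_err, best_h

def adaboost(x, y, T=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                             # weight distribution W_t
    hs, alphas = [], []
    for _ in range(T):
        eps, h = best_stump(x, y, w)
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # standard AdaBoost choice
        w = w * np.exp(-alpha * y * stump_predict(x, *h))  # W_{t+1}(i) = c W_t(i) exp{-a y h}
        w /= w.sum()                                    # the constant c enforces normalization
        hs.append(h)
        alphas.append(alpha)
    return lambda xq: np.sign(sum(a * stump_predict(xq, *h) for a, h in zip(alphas, hs)))

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = np.where(x[:, 0] + x[:, 1] > 0, 1, -1)              # labels y_i = +/-1
H = adaboost(x, y)                                      # H(x) = sgn(sum_t alpha_t h_t(x))
print((H(x) == y).mean())                               # training accuracy
```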
Boosting generalization
$\tilde{Q}_t(y\mid x) \propto Q_{t-1}(y\mid x)\exp\{\alpha_t\,y\,h_t(x)\}$
$F_t = \{P(y, x) \mid E[y\,h_t(x)] = \text{const}\}$
$\alpha_t$: chosen to $\min D[\hat{P} : \tilde{Q}_t]$, so that $D[\hat{P}, \tilde{Q}_{t+1}] < D[\hat{P}, \tilde{Q}_t]$
115
Integration of evidences:
$x_1, x_2, \ldots, x_m$
arithmetic mean, geometric mean, harmonic mean, $\alpha$-mean
116
Various Means
arithmetic: $\dfrac{a+b}{2}$
geometric: $\sqrt{ab}$
harmonic: $\dfrac{2}{\frac{1}{a} + \frac{1}{b}}$
Any other mean?
117
Generalized mean: f-mean
$f(u)$: monotone; the f-representation of $u$
$m_f(a, b) = f^{-1}\Big\{\dfrac{f(a) + f(b)}{2}\Big\}$
Scale-free means, $m_f(ca, cb) = c\,m_f(a, b)$, correspond to $f(u) = u^{\frac{1-\alpha}{2}}$ ($\alpha \ne 1$) or $f(u) = \log u$ ($\alpha = 1$): the $\alpha$-representation
118
$\alpha$-mean:
$\alpha = 1$: $\sqrt{ab}$ (geometric); $\quad \alpha = -1$: $\dfrac{a+b}{2}$ (arithmetic)
$\alpha = 0$: $\dfrac{1}{4}(a + b) + \dfrac{1}{2}\sqrt{ab}$
$\alpha \to \infty$: $m_\alpha = \min(a, b)$; $\quad \alpha \to -\infty$: $m_\alpha = \max(a, b)$
$\alpha$-mean of distributions: $m_\alpha\big(p_1(s), p_2(s)\big)$
119
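The f-mean and the $\alpha$-representation can be coded directly; the sketch below (mine) reproduces the special cases listed above.

```python
import numpy as np

def f_mean(a, b, f, f_inv):
    """m_f(a, b) = f^{-1}( (f(a) + f(b)) / 2 )."""
    return f_inv(0.5 * (f(a) + f(b)))

def alpha_mean(a, b, alpha):
    """alpha-representation: f(u) = u**((1-alpha)/2), or log u when alpha = 1."""
    if alpha == 1:
        return f_mean(a, b, np.log, np.exp)                 # geometric mean
    k = (1.0 - alpha) / 2.0
    return f_mean(a, b, lambda u: u**k, lambda v: v**(1.0 / k))

a, b = 1.0, 4.0
print(alpha_mean(a, b, -1))   # arithmetic: 2.5
print(alpha_mean(a, b, 1))    # geometric: 2.0
print(alpha_mean(a, b, 3))    # harmonic: 1.6
print(alpha_mean(a, b, 0))    # (a+b)/4 + sqrt(ab)/2 = 2.25
```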
Family of Distributions
$\{p_1(s), \ldots, p_k(s)\}$
f-mean family: $p(x; t) = f^{-1}\big\{\sum t_i\,f(p_i(x))\big\}$
Mixture family: $p_{\mathrm{mix}}(s) = \sum_{i=1}^{k} t_i\,p_i(s)$, $\quad \sum t_i = 1$
Exponential family: $\log p_{\exp}(s) = \sum t_i\log p_i(s)$ (up to normalization)
120