Information Geometry and its Applications to Machine Learning

Shun-Ichi Amari

Information geometry studies the invariant geometrical structures of a family of probability distributions, which forms a geometrical manifold. The manifold carries a unique Riemannian metric given by the Fisher information matrix, together with a dual pair of affine connections which determine two types of geodesics. When the manifold is dually flat, there exists a canonical divergence (the KL-divergence), and strong theorems such as the generalized Pythagorean theorem, the projection theorem and the orthogonal foliation theorem hold even though the manifold is not Euclidean. Machine learning exploits the stochastic structure of environmental information, so information geometry is not only useful for understanding the essential aspects of machine learning but also provides tools for constructing new algorithms. The present talk demonstrates its usefulness for understanding SVM, belief propagation, the EM algorithm, boosting and others.


Slide: Machine Learning SS: Kyoto U.

Information Geometry
and Its Applications to Machine Learning

Shun-ichi Amari
RIKEN Brain Science Institute


1

Slide: Information Geometry
-- Manifolds of Probability Distributions

M = { p (x)}


2

Slide: Information Geometry

Information sciences: systems theory, statistics, combinatorics, mathematics, information theory, neural networks, physics, AI, vision, optimization

Riemannian manifold with dual affine connections: the manifold of probability distributions

3

Slide: Information Geometry?
Gaussian distributions

S = { p(x; μ, σ) }

p(x; μ, σ) = 1/(√(2π) σ) exp{ -(x - μ)^2 / (2σ^2) }

{ p(x) },  S = { p(x; ξ) },  ξ = (μ, σ)

4

Slide: Manifold of Probability Distributions

x = 1, 2, 3
p = (p_1, p_2, p_3)

S_n = { p(x) },  p_1 + p_2 + p_3 = 1   (the probability simplex with coordinates p_1, p_2, p_3)

M = { p(x; ξ) }

5

Slide: Invariance

S = { p(x, ξ) }

Invariant under different representations
y = y(x),  p(y, ξ)   (sufficient statistics)

∫ | p(x, ξ_1) - p(x, ξ_2) |^2 dx  ≠  ∫ | p(y, ξ_1) - p(y, ξ_2) |^2 dy

6

Slide: Two Geometrical Structures
Riemannian metric; affine connection --- geodesic

ds^2 = Σ g_ij(θ) dθ^i dθ^j

Fisher information:  g_ij = E[ ∂_i log p · ∂_j log p ]

Orthogonality: inner product  <dθ_1, dθ_2> = dθ_1^T G dθ_2
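The Fisher metric can be checked numerically for simple families. A minimal sketch (not part of the talk) for the Bernoulli family p(x; θ) = θ^x (1-θ)^(1-x), whose Fisher information has the closed form 1/(θ(1-θ)):

```python
# Fisher information of the Bernoulli family p(x; theta) = theta^x (1 - theta)^(1 - x),
# computed as E[(d/dtheta log p)^2] by summing over x in {0, 1}.
def fisher_bernoulli(theta):
    score = lambda x: x / theta - (1 - x) / (1 - theta)   # d/dtheta log p(x; theta)
    return sum(p * score(x) ** 2 for x, p in {0: 1 - theta, 1: theta}.items())

theta = 0.3
print(fisher_bernoulli(theta))        # numerical value
print(1.0 / (theta * (1 - theta)))    # closed form 1 / (theta (1 - theta))
```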

7

Slide: Affine Connection
Covariant derivative ∇_X Y; parallel transport Π_c X

Geodesic:  ∇_Ẋ Ẋ = 0,  X = X(t)   ("straight line")

Minimal distance:  s = ∫ √( Σ g_ij(θ) dθ^i dθ^j )

8

Slide: Duality: two affine connections
{ S, g, ∇, ∇* }

X<Y, Z> = <∇_X Y, Z> + <Y, ∇*_X Z>,   <X, Y> = Σ g_ij X^i Y^j

Riemannian (Levi-Civita) geometry:  ∇ = ∇*

9

Slide: Dual Affine Connections
(∇, ∇*)

e-geodesic:  log r(x, t) = t log p(x) + (1 - t) log q(x) + c(t),   ∇_ẋ ẋ(t) = 0
m-geodesic:  r(x, t) = t p(x) + (1 - t) q(x),   ∇*_ẋ ẋ(t) = 0

connecting p(x) and q(x)

10

Slide: Mathematical structure of S = { p(x, θ) }

g_ij(θ) = E[ ∂_i ℓ ∂_j ℓ ],   T_ijk(θ) = E[ ∂_i ℓ ∂_j ℓ ∂_k ℓ ],   ℓ = log p(x, θ),  ∂_i = ∂/∂θ^i

{M, g, T}

α-connection:  Γ^(α)_ijk = {i, j; k} - (α/2) T_ijk

∇^(α), ∇^(-α) : dually coupled

X<Y, Z> = <∇^(α)_X Y, Z> + <Y, ∇^(-α)_X Z>

11

Slide: Divergence
D[z : y] ≥ 0;  D[z : y] = 0  iff  z = y

D[z : z + dz] = ½ Σ g_ij dz_i dz_j   (positive definite)

12

Slide: Kullback-Leibler Divergence
quasi-distance

D[p(x) : q(x)] = Σ_x p(x) log( p(x) / q(x) )

D[p(x) : q(x)] ≥ 0,  = 0 iff p(x) = q(x);   D[p : q] ≠ D[q : p]
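A small numerical illustration (added for concreteness) of the KL divergence between discrete distributions and its asymmetry:

```python
import numpy as np

# KL divergence between two discrete distributions (illustrative sketch).
def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))   # asymmetric: D[p:q] != D[q:p] in general
```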

13

Slide: f-divergence on the space S̃ of positive measures
S̃ = { p̃ },  p̃_i > 0   (Σ p̃_i = 1 need not hold)

D_f[p̃ : q̃] = Σ p̃_i f( q̃_i / p̃_i ),   f convex,  f(1) = 0

D_f[p̃ : q̃] = 0  ⇔  p̃ = q̃

not invariant under  f(u) → f(u) - c(u - 1)

14

Slide: α-divergence
D_α[p̃ : q̃] = 4/(1 - α^2) Σ { (1-α)/2 p̃_i + (1+α)/2 q̃_i - p̃_i^{(1-α)/2} q̃_i^{(1+α)/2} }

KL-divergence (α → -1):

D[p̃ : q̃] = Σ { p̃_i log( p̃_i / q̃_i ) - p̃_i + q̃_i }

15

Slide: (α, β)-divergence
D_{α,β}[p : q] = Σ { α/(α+β) p_i^{α+β} + β/(α+β) q_i^{α+β} - p_i^α q_i^β }   (up to normalization)

α + β = 1: α-divergence;   β = 1: β-divergence

16

Slide: Metric and Connections Induced by Divergence (Eguchi)
Riemannian metric

g_ij(z) = - ∂_i ∂'_j D[z : y] |_{y=z} :   D[z : z + dz] = ½ Σ g_ij(z) dz_i dz_j

affine connections

Γ_ijk(z) = - ∂_i ∂_j ∂'_k D[z : y] |_{y=z}
Γ*_ijk(z) = - ∂'_i ∂'_j ∂_k D[z : y] |_{y=z}

∂_i = ∂/∂z_i,  ∂'_i = ∂/∂y_i   →   {∇, ∇*}

17

Slide: Duality
X<Y, Z> = <∇_X Y, Z> + <Y, ∇*_X Z>

∂_k g_ij = Γ_kij + Γ*_kji,   Γ*_ijk - Γ_ijk = T_ijk

{M, g, T}

18

Slide: Dually flat manifold
exponential family; mixture family; { p(x); x discrete }

p(x, θ) = exp{ Σ θ^i x_i - ψ(θ) } : exponential family

θ-coordinates: affine coordinates, flat, geodesics
η-coordinates: (dual) affine coordinates, flat, geodesics
not Riemannian-flat; canonical divergence D(P : P') : KL-divergence

19

Slide: Dually flat manifold
potential functions ψ(θ), φ(η)

η_i = ∂ψ(θ)/∂θ^i,   θ^i = ∂φ(η)/∂η_i,   ψ(θ) + φ(η) - Σ θ^i η_i = 0

Legendre transformation

g_ij(θ) = ∂^2 ψ(θ)/∂θ^i ∂θ^j,   g^ij(η) = ∂^2 φ(η)/∂η_i ∂η_j

p(x, θ) = exp{ Σ θ^i x_i - ψ(θ) } : exponential family
ψ : cumulant generating function;  φ : negative entropy

canonical divergence  D(P : P') = ψ(θ) + φ(η') - Σ θ^i η'_i

20

Slide: Manifold with Convex Function

S :  coordinates  θ = (θ_1, θ_2, ..., θ_n)

ψ(θ) : convex function

Example:  ψ(θ) = ½ Σ (θ^i)^2
φ(p) = ∫ p(x) log p(x) dx   (negative entropy)

21

Slide: Riemannian metric and flatness  {S, ψ(θ), θ}  (affine structure)
Bregman divergence

D(θ, θ') = ψ(θ) - ψ(θ') - grad ψ(θ') · (θ - θ')

D(θ, θ + dθ) = ½ Σ g_ij(θ) dθ^i dθ^j,   g_ij = ∂_i ∂_j ψ(θ),  ∂_i = ∂/∂θ^i

θ : geodesic coordinates (the connection is not Levi-Civita)

Flatness (affine)

22

Slide: Legendre Transformation
η_i = ∂_i ψ(θ),   θ ↔ η : one-to-one

ψ(θ) + φ(η) - Σ θ^i η_i = 0

η_i = ∂ψ(θ)/∂θ^i,   θ^i = ∂φ(η)/∂η_i

φ(η) = max_θ { Σ θ^i η_i - ψ(θ) }

D(θ : θ') = ψ(θ) + φ(η') - Σ θ^i η'_i
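As an illustration of the Legendre structure (a sketch, not from the slides), take ψ(θ) = log(1 + e^θ), the cumulant generating function of a Bernoulli distribution; its dual potential is the negative entropy φ(η) = η log η + (1-η) log(1-η), and the identity ψ(θ) + φ(η) - θη = 0 can be checked directly:

```python
import numpy as np

# Bregman divergence induced by a convex potential psi (sketch).
def bregman(psi, grad_psi, t1, t2):
    return psi(t1) - psi(t2) - grad_psi(t2) * (t1 - t2)

# psi(theta) = log(1 + exp(theta)): cumulant generating function of a Bernoulli variable.
psi = lambda t: np.log1p(np.exp(t))
eta = lambda t: 1.0 / (1.0 + np.exp(-t))                    # eta = psi'(theta), the mean parameter
phi = lambda e: e * np.log(e) + (1 - e) * np.log(1 - e)     # dual potential (negative entropy)

t = 0.7
print(psi(t) + phi(eta(t)) - t * eta(t))   # Legendre identity: approximately 0
print(bregman(psi, eta, 0.7, -0.3))        # Bregman divergence between two natural parameters
```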

23

Slide: Two affine coordinate systems

θ : geodesic (e-geodesic);  η : dual geodesic (m-geodesic)

(θ, η) : dually orthogonal

<∂/∂θ^i, ∂/∂η_j> = δ^i_j,   η_i = ∂_i ψ(θ),   θ^i = ∂^i φ(η)

X<Y, Z> = <∇_X Y, Z> + <Y, ∇*_X Z>

24

Slide: Pythagorean Theorem
(dually flat manifold)

D[P : Q] + D[Q : R] = D[P : R]
when the m-geodesic from P to Q and the e-geodesic from Q to R meet orthogonally at Q

Euclidean space: self-dual,  ψ(θ) = ½ Σ (θ^i)^2,   θ = η

25

Slide: Projection Theorem
min_{Q ∈ M} D[P : Q] :  Q̂ = m-geodesic projection of P to M

min_{Q ∈ M} D[Q : P] :  Q̂ = e-geodesic projection of P to M

26

Slide: Two Types of Divergence
Invariant divergence (Chentsov, Csiszár), f-divergence:  Fisher-α structure

D_f[p : q] = ∫ p(x) f( q(x)/p(x) ) dx

Flat divergence (Bregman): convex function

KL-divergence belongs to both classes: flat and invariant

27

Slide: Divergence
S = {p} : space of probability distributions

invariance → invariant divergence:  f-divergence, Fisher information metric, α-connection

dually flat space ← flat divergence:  convex functions, Bregman

KL-divergence (in both classes):

D[p : q] = ∫ p(x) log( p(x)/q(x) ) dx

28

Slide: Space of positive measures: vectors, matrices, arrays
S̃ = { p̃ },  p̃_i > 0   (Σ p̃_i = 1 need not hold)

f-divergence,  α-divergence,  Bregman divergence

29

Slide: Applications of Information Geometry
Statistical Inference; Machine Learning and AI; Computer Vision; Convex Programming; Signal Processing (ICA; Sparse); Information Theory; Systems Theory; Quantum Information Geometry

30

Slide: Applications to Statistics
curved exponential family:  p(x, u)

observations x_1, x_2, ..., x_n

p(x, θ) = exp{ θ·x - ψ(θ) }

p(x, u) = exp{ θ(u)·x - ψ(θ(u)) }

estimator:  û(x_1, ..., x_n),   x̄ = (1/n) Σ_{k=1}^n x_k,   η̂ = x̄

31

Slide: x : discrete

X = {0, 1, ..., n}

S_n = { p(x) | x ∈ X } :  exponential family

p(x) = Σ_{i=0}^n p_i δ_i(x) = exp[ Σ_{i=1}^n θ^i x_i - ψ(θ) ]

θ^i = log p_i - log p_0;   x_i = δ_i(x);   ψ(θ) = - log p_0

statistical model :  p(x, u)

32

Slide: High-Order Asymptotics
p(x, θ(u)),   û = û(x_1, ..., x_n),   η̂ = x̄

error covariance  e = E[ (û - u)(û - u)^T ] = (1/n) G_1 + (1/n^2) G_2

G_1 = Ḡ^{-1} (inverse Fisher information) :  Cramér-Rao, linear theory

G_2 = (Γ^(m))^2 + (H_M^(e))^2 + (H_A^(m))^2 :  squared mixture-connection and curvature terms of the model and the ancillary family (quadratic approximation)

33

Slide: Information Geometry of Belief Propagation
Shun-ichi Amari (RIKEN BSI); Shiro Ikeda (Inst. Statist. Math.); Toshiyuki Tanaka (Kyoto U.)

34

Slide: Stochastic Reasoning

p(x, y, z, r, s)

p(x, y, z | r, s),   x, y, z, ... = ±1

35

Slide: Stochastic Reasoning
q(x_1, x_2, x_3, ... | observation),   x = (x_1 x_2 x_3 ...),  x_i = ±1

maximum likelihood:  x̂ = argmax q(x_1, x_2, x_3, ...)

least bit-error-rate estimator:  x̂_i = sgn E[x_i]

36

Slide: Mean Value
Marginalization: projection to independent distributions

q_0(x) = q_1(x_1) q_2(x_2) ... q_n(x_n)

q_i(x_i) = Σ_{all variables except x_i} q(x_1, ..., x_n)

η = E_q[x] = E_{q_0}[x]

37

Slide: q(x) = exp{ Σ k_i x_i + Σ_r c_r(x) - ψ_q }

c_r(x) = c_r x_{i1} ⋯ x_{is},   r = (i1 ⋯ is),   x_i ∈ {-1, 1}

Boltzmann machine (spin glass, neural networks):  p(x) = exp{ Σ w_ij x_i x_j + Σ h_i x_i - ψ },   r = (i1, i2)

Turbo codes, LDPC codes

38

Slide: Computationally Difficult

q(x)  →  η = E[x]

q(x) = exp{ Σ c_r(x) - ψ_q }

mean-field approximation; belief propagation; tree propagation; CCCP (convex-concave)

39

Slide: Information Geometry of Mean Field Approximation
m-projection / e-projection

D[q : p] = Σ_x q(x) log( q(x)/p(x) )

m-projection:  q̂_0 = argmin_{p ∈ M_0} D[q : p]
e-projection:  q̂_0 = argmin_{p ∈ M_0} D[p : q]

M_0 = { Π_i p_i(x_i) }
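The m-projection onto the independent family M_0 is the product of marginals, and the Pythagorean relation D[q : p] = D[q : q_0] + D[q_0 : p] holds for any p in M_0. A small numerical check (illustrative, with an arbitrary 2x2 joint distribution):

```python
import numpy as np

# m-projection of a joint q(x1, x2) onto the independent family M0: the minimizer of
# D[q : p] over product distributions is the product of marginals q0. We also check
#   D[q : p] = D[q : q0] + D[q0 : p]   for a product distribution p.
def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

q = np.array([[0.30, 0.10],
              [0.15, 0.45]])              # joint distribution over x1, x2 in {0, 1}
q1, q2 = q.sum(axis=1), q.sum(axis=0)     # marginals
q0 = np.outer(q1, q2)                     # product of marginals (the m-projection)

p = np.outer([0.6, 0.4], [0.5, 0.5])      # an arbitrary member of M0
print(kl(q, p), kl(q, q0) + kl(q0, p))    # the two sides agree
```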

40

Slide: Information Geometry
q(x) = exp{ Σ_r c_r(x) - ψ }

M_0 = { p_0(x, θ) = exp{ θ·x - ψ_0 } }

M_r = { p_r(x, ζ_r) = exp{ c_r(x) + ζ_r·x - ψ_r } },   r = 1, ..., L

41

Slide: Belief Propagation
M_r :  p_r(x, ζ_r) = exp{ c_r(x) + ζ_r·x - ψ_r }

ξ_r^{t+1} = π_0 ∘ p_r(x, ζ_r^t) - ζ_r^t :  belief for c_r(x)   (π_0 : m-projection onto M_0)

θ^{t+1} = Σ_r ξ_r^{t+1}

42

Slide: Belief Prop Algorithm
(diagram: messages ξ_r, ξ_r' exchanged between the submanifolds M_r, M_r' and M_0)

43

Slide: Equilibrium of BP
1) m-condition:  θ* = π_0 ∘ p_r(x, ζ_r*)  for every r
   (the equilibrium and the p_r lie in a common m-flat submanifold M(θ*))

2) e-condition:  θ* = 1/(L-1) Σ_r ζ_r*
   (q(x) and the equilibrium lie in a common e-flat submanifold)

44

Slide: Free energy
F(θ, ζ_1, ..., ζ_L) = D[p_0 : q] - Σ_r D[p_0 : p_r]

critical point:  ∂F/∂θ = 0 : e-condition;   ∂F/∂ζ_r = 0 : m-condition

not convex

45

Slide: Belief Propagation
e-condition kept at each step:
(θ; ζ_1, ζ_2, ..., ζ_L) → (ζ'_1, ζ'_2, ..., ζ'_L),   θ' = 1/(L-1) Σ_r ζ'_r

CCCP:  θ → θ'
m-condition kept at each step:
ζ'_r determined from θ' by  θ' = π_0 ∘ p_r(x, ζ'_r)

46

Slide:

ξ_r^{t+1} = π_0 ∘ p_r(x, ζ_r^t) - ζ_r^t

θ^{t+1} = Σ_{r=1}^L ξ_r^{t+1}

ζ_r^{t+1} = θ^{t+1} - ξ_r^{t+1}

47

Slide: Convex-Concave Computational Procedure (CCCP), Yuille

F(θ) = F_1(θ) - F_2(θ)   (both convex)

∇F_1(θ^{t+1}) = ∇F_2(θ^t)

Elimination of double loops
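A one-dimensional CCCP sketch (toy example, not from the talk): split F into convex parts F_1, F_2 and iterate ∇F_1(θ^{t+1}) = ∇F_2(θ^t):

```python
# CCCP sketch: F = F1 - F2 with F1, F2 convex; iterate grad F1(theta_{t+1}) = grad F2(theta_t).
# Toy example: F1(t) = t^4 / 4, F2(t) = t^2 / 2, so F(t) = t^4/4 - t^2/2 (minima at t = ±1).
def cccp_step(theta):
    # grad F1(t) = t^3, grad F2(t) = t  =>  theta_next^3 = theta
    return theta ** (1.0 / 3.0) if theta >= 0 else -((-theta) ** (1.0 / 3.0))

theta = 0.2
for _ in range(30):
    theta = cccp_step(theta)
print(theta)   # approaches the stationary point theta = 1
```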

48

Slide: Boltzmann Machine
p(x_i = 1) = f( Σ_j w_ij x_j - h_i )

p(x) = exp{ Σ w_ij x_i x_j - Σ h_i x_i - ψ(w) }

q(x)  →  p(x) ∈ B

49

Slide: Boltzmann machine --- hidden units
EM algorithm: alternate e-projection and m-projection between the data manifold D and the model M

50

Slide: EM algorithm
hidden variables y

p(x, y; u),   D = { x_1, ..., x_N },   M = { p(x, y; u) }

D̂ = { p̂(x, y) :  p̂(x) = p_D(x) }   (x-marginal fixed to the empirical distribution)

M-step:  min_{p ∈ M} KL[ p̂(x, y) : p ]  :  m-projection to M
E-step:  min_{p̂ ∈ D̂} KL[ p̂ : p(x, y; u) ]  :  e-projection to D̂
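A minimal EM sketch for a two-component 1-D Gaussian mixture, illustrating the alternation that the slide describes geometrically (responsibilities for the E-step, parameter re-estimation for the M-step); the data and initial values are arbitrary:

```python
import numpy as np

# EM for a two-component 1-D Gaussian mixture (illustrative sketch).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

pi, mu, sigma = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    # E-step: responsibilities r[i, k] = posterior of component k given x_i
    dens = np.stack([
        np.exp(-(x - mu[k]) ** 2 / (2 * sigma[k] ** 2)) / (np.sqrt(2 * np.pi) * sigma[k])
        for k in range(2)], axis=1)
    w = dens * np.array([pi, 1 - pi])
    r = w / w.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixing weight, means and variances
    nk = r.sum(axis=0)
    pi = nk[0] / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi, mu, sigma)   # close to the generating values 0.5, (-2, 3), (1, 1)
```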

51

Slide: SVM: support vector machine
Embedding / kernel

z_i = φ_i(x),   f(x) = Σ w_i φ_i(x) = Σ α_i y_i K(x_i, x)

K(x, x') = Σ φ_i(x) φ_i(x')

Conformal change of kernel:  K(x, x') → σ(x) σ(x') K(x, x'),   σ(x) = exp{ -κ |f(x)|^2 }
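A sketch of the conformal change of kernel; the RBF base kernel, the toy decision function f and the constant κ below are illustrative assumptions:

```python
import numpy as np

# Conformal change of an SVM kernel, following the slide:
#   K~(x, x') = sigma(x) * sigma(x') * K(x, x'),  sigma(x) = exp(-kappa * f(x)**2),
# where f is the decision function of a first-pass SVM (here a toy function).
def rbf(x, xp, gamma=1.0):
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def conformal_kernel(x, xp, f, kappa=1.0, base=rbf):
    sigma = lambda z: np.exp(-kappa * f(z) ** 2)   # largest near the boundary f = 0
    return sigma(x) * sigma(xp) * base(x, xp)

f = lambda z: z[0] - z[1]                          # toy decision function
print(conformal_kernel(np.array([1.0, 0.0]), np.array([0.0, 1.0]), f))
```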

52

Slide: Signal Processing
ICA : Independent Component Analysis

x_t = A s_t,   x_t → ŝ_t

sparse component analysis; positive matrix factorization

53

Slide: Mixture and unmixture of independent signals

sources s_1, s_2, ..., s_n;  observations x_1, x_2, ..., x_m

x_i = Σ_{j=1}^n A_ij s_j,   x = As

54

Slide: Independent Component Analysis

x = As,   y = Wx,   x_i = Σ A_ij s_j,   W = A^{-1}

s → A → x → W → y

observations: x(1), x(2), ..., x(t);  recover: s(1), s(2), ..., s(t)

55

Slide:

56

Slide: Space of Matrices: Lie group
W → W + dW,   dX = dW W^{-1}   (increment carried back to the identity: I → I + dX)

ds^2 = tr( dX dX^T ) = tr( dW W^{-1} W^{-T} dW^T )

natural gradient:  ∇̃l = ∇l W^T W

dX :  non-holonomic basis

57

Slide: Information Geometry of ICA
S = { p(y) },   I = { q_1(y_1) q_2(y_2) ... q_n(y_n) },   { p(Wx) }

natural gradient:  l(W) = KL[ p(y; W) : q(y) ]
estimating function r(y); stability, efficiency

58

Slide: Semiparametric Statistical Model
p(x; W, r) = |W| r(Wx),   W = A^{-1},   r(s) = Π r_i(s_i) : unknown

x(1), x(2), ..., x(t)

59

Slide: Natural Gradient
∇̃l = ∂l(y, W)/∂W · W^T W
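Multiplying the ordinary gradient by W^T W gives the familiar natural-gradient ICA update ΔW = η (I - φ(y) y^T) W. A minimal sketch with Laplacian sources and φ = tanh (a common choice, assumed here):

```python
import numpy as np

# Natural-gradient ICA sketch: dW ∝ (I - phi(y) y^T) W, obtained by multiplying the
# ordinary gradient by W^T W as on the slide. phi is a score-like nonlinearity (tanh here).
rng = np.random.default_rng(0)
n, T, eta = 2, 5000, 0.01

s = rng.laplace(size=(n, T))                  # independent super-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])        # mixing matrix (unknown to the algorithm)
x = A @ s                                     # observations

W = np.eye(n)
for t in range(T):
    y = W @ x[:, t:t + 1]
    W += eta * (np.eye(n) - np.tanh(y) @ y.T) @ W   # natural-gradient update

print(W @ A)   # close to a scaled permutation matrix when separation succeeds
```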

60

Slide: Basis Given: overcomplete case, Sparse Solution

x = As = Σ s_i a_i :  many solutions

sparse:  many s_i = 0

61

Slide: sparse

A :  x = As

generalized inverse:  min Σ s_i^2   (L2-norm)
sparse solution:  min Σ |s_i|   (L1-norm)

62

Slide:

63

Slide: Overcomplete Basis and Sparse Solution

min ‖s‖_1 = Σ |s_i|,   x = Σ s_i a_i = As

non-linear denoising:  min ‖As - x‖_p + λ ‖s‖_{p'}

64

Slide: Sparse Solution
min φ(θ);  penalty F_p(θ) = Σ |θ_i|^p   (Bayes prior)

F_0(θ) = #{ i : θ_i ≠ 0 } :  sparsest solution
F_1(θ) = Σ |θ_i| :  L1 solution
F_2(θ) = Σ θ_i^2 :  generalized-inverse solution
F_p(θ) :  0 < p < 1

Sparse solution: overcomplete case

65

Slide: Optimization under Sparsity Condition
min φ(θ) :  convex function,  under the constraint F(θ) ≤ c

typical case:
φ(θ) = ½ ‖y - Xθ‖^2 = ½ (θ - θ*)^T G (θ - θ*)
F(θ) = Σ |θ_i|^p ;   p = 2,  p = 1,  p = 1/2
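For the p = 1 case, the penalized form min ½‖y - Xθ‖^2 + λ Σ|θ_i| can be solved by proximal-gradient iteration (ISTA). A minimal sketch with synthetic data; λ and the problem sizes are arbitrary choices:

```python
import numpy as np

# Proximal-gradient (ISTA) sketch for the penalized form of the slide's problem:
#   min 0.5 * ||y - X theta||^2 + lam * sum|theta_i|   (the p = 1 case).
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))             # overcomplete: more columns than rows
theta_true = np.zeros(100); theta_true[[3, 17, 42]] = [2.0, -1.5, 1.0]
y = X @ theta_true
print(np.nonzero(ista(X, y, lam=0.1))[0])  # recovers a sparse support
```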

66

Slide: L1-constrained optimization
P_c problem:  min φ(θ) under F(θ) ≤ c;  solution θ̂(c),  c = 0 → ∞   (LASSO, LARS)

P_λ problem:  min φ(θ) + λ F(θ);  solution θ̂(λ),  λ = ∞ → 0

for p ≥ 1 the solutions coincide:  θ̂(λ) = θ̂(c) with λ = λ(c)
for p < 1:  λ = λ(c) can be multiple-valued and discontinuous; stability differs

67

Slide: Projection from θ* to F = c  (information geometry)
(dual projection of the unconstrained minimizer θ* onto the constraint surface F = c)

68

Slide: Convex Cone Programming
P : positive semi-definite matrices;  convex potential function

dual geodesic approach

Ax = b,  min c·x

Support vector machine

69

Slide: Fig. 1
a) R_c : n = 2, p > 1;   b) R_c : n = 2, p = 1;   c) R_c : n = 2, p < 1 (non-convex)

70

Slide: Orthogonal projection, dual projection
min φ(θ) = D[θ : θ*],  F(θ) = c :  dual geodesic projection

71

Slide: Fig. 5: subgradient
(normal directions n ∝ ∇F on the constraint surface)

72

Slide: LASSO path and LARS path (stagewise solution)
min φ(θ) :  F(θ) = c  →  θ̂(c)
min φ(θ) + λ F(θ)  →  θ̂(λ)

c ↔ λ correspondence

73

Slide: Active set and gradient
A(θ) = { i : θ_i ≠ 0 }

∇F_p(θ) :  sgn(θ_i) |θ_i|^{-(1-p)}  for i ∈ A;   subgradient in [-1, 1]  for i ∉ A  (p = 1)

74

Slide: Solution path

∇_A φ(θ_c) + λ_c ∇_A F(θ_c) = 0

∇_A ∇_A φ(θ_c) θ̇_c + λ_c ∇_A ∇_A F(θ_c) θ̇_c = - λ̇_c ∇_A F(θ_c)

θ̇_c = - λ̇_c K^{-1} ∇_A F(θ_c),   θ̇_c = dθ_c/dc,   K = G(θ_c) + λ_c ∇∇F(θ_c)

( for L1 :  ∇∇F_1 = 0,  ∇F_1 = (sgn θ_i) )

75

Slide: Solution path in the subspace of the active set
∇_A φ(θ_λ) + λ ∇_A F(θ_λ) = 0,   θ̇_λ = - K_A^{-1} ∇_A F(θ_λ)

θ̇_A : active direction;  turning point:  A → A'

76

Slide: Gradient Descent Method
∇L = { ∂L/∂x_i } :  covariant

min L(x + a)  under  Σ g_ij a^i a^j = ε^2

∇̃L = { Σ_j g^{ij} ∂L/∂x^j } :  contravariant (natural gradient)

x_{t+1} = x_t - c ∇̃L(x_t)
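A minimal sketch of the contravariant (natural-gradient) update with a fixed, assumed metric G:

```python
import numpy as np

# Natural-gradient descent sketch: precondition the covariant gradient with the inverse
# metric, as on the slide: x_{t+1} = x_t - c * G^{-1} grad L(x_t).
G = np.array([[5.0, 0.0], [0.0, 0.5]])     # a fixed (assumed) Riemannian metric
L = lambda x: 0.5 * x @ G @ x              # toy loss whose Hessian equals G
grad = lambda x: G @ x

x = np.array([1.0, 1.0])
for _ in range(20):
    x = x - 0.5 * np.linalg.solve(G, grad(x))   # natural-gradient step
print(x)   # converges quickly toward the minimum at the origin
```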

77

Slide: Extended LARS (p = 1) and Minkovskian gradient
norm ‖a‖_p = Σ |a_i|^p

max ∇φ(θ)·a  under  ‖a‖_p = 1   (steepest direction:  φ(θ + εa) ≈ φ(θ) + ε ∇φ·a)

p → 1+ :  a_i = sgn(∇_i φ)  for i with |∇_i φ| = max{ |∇_1 φ|, ..., |∇_N φ| },  a_i = 0 otherwise

a = ∇̃φ(θ)

78

Slide: i* = argmax_i f_i
when the maximum is attained jointly:  max f_i = f_{i*} = f_{j*}

( ∇̃F )_i = 1  for i = i* (and j*),   0 otherwise

θ^{t+1} = θ^t - ε ∇̃F    (LARS)

79

Slide: ∇̃F = ∇f

Euclidean case

1 < p :  ( ∇̃F )_i = c sgn(f_i) |f_i|^{1/(p-1)}

p → 1 :  ∇̃F = c sgn(f_{i*}) e_{i*},   e_{i*} = (0, ..., 1, ..., 0)^T   (only the largest component survives)

80

Slide: L1/2 constraint: non-convex optimization

λ-trajectory and c-trajectory

Ex. 1-dim:  φ(θ) = ½ (θ - θ*)^2

f_λ(θ) = φ + λF = ½ (θ - θ*)^2 + λ √|θ|

81

Slide: P_c :  min (θ - θ*)^2  under  √|θ| ≤ c;   θ̂_c = c^2 (constraint active),  θ̂_c = θ* (constraint inactive)

P_λ :  f'_λ(θ) = 0,   θ̂_λ = R_λ(θ*) :  Xu Zongben's operator

λ_c = λ_c(θ*, θ̂_c)

82

Slide: ICCN-Huangshan

Sparse Signal Analysis
Shun-ichi Amari
RIKEN Brain Science Institute (collaborator: Masahiro Yukawa, Niigata University)

83

Slide: Solution Path: θ̂_c
not continuous, not monotone; jumps as c varies

84

Slide: An Example of the greedy path


85

Slide: Linear Programming

Σ_j A_ij x_j ≥ b_i,   max Σ c_i x_i

φ(x) = - Σ_i log( Σ_j A_ij x_j - b_i )   (barrier)

inner (interior-point) method

86

Slide: Convex Programming, Inner Method
LP :  Ax ≥ b,  x ≥ 0,   min c·x

φ(x) = - Σ_i log( Σ_j A_ij x_j - b_i ) - Σ_i log x_i = Σ φ_i(x)

Simplex method; inner method

87

Slide: Polynomial-Time Algorithm
min :  t c·x + φ(x),   central path x = x*(t) : geodesic

the (mixture) curvature (H^(m))^2 of the central path governs the step size

88

Slide: Neural Networks
Multilayer Perceptron; Higher-order correlations; Synchronous firing

89

Slide: Multilayer Perceptrons

y = Σ v_i φ(w_i · x) + n,   x = (x_1, x_2, ..., x_n)

p(y | x; θ) = c exp{ -½ ( y - f(x, θ) )^2 }

f(x, θ) = Σ v_i φ(w_i · x),   θ = (w_1, ..., w_m; v_1, ..., v_m)

90

Slide: Multilayer Perceptron
neuromanifold: space of functions

y = f(x, θ),   θ = (v_1, ..., v_m; w_1, ..., w_m),   f = Σ v_i φ(w_i · x)

91

Slide: singularities


92

Slide: Geometry of singular model

y = v φ(w · x) + n

singular set:  v |w| = 0

93

Slide: Backpropagation --- gradient learning
examples :  (y_1, x_1), ..., (y_t, x_t)

E = ½ | y - f(x, θ) |^2 = - log p(y, x; θ)

Δθ_t = - ε_t ∂E/∂θ,   f(x, θ) = Σ v_i φ(w_i · x)

natural gradient (Riemannian):  ∇̃E = G^{-1} ∇E --- steepest descent

94

Slide:

95

Slide:

96

Slide: Conformal transformation
q-Fisher information:  g^(q)_ij(p) = g^F_ij(p) h_q(p)

q-divergence:  D_q[p(x) : r(x)] = 1 / ((1 - q) h_q(p)) · ( 1 - ∫ p(x)^q r(x)^{1-q} dx )

97

Slide: Total Bregman Divergence and its Applications to Shape Retrieval

Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari, Frank Nielsen
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010


98

Slide: Total Bregman Divergence
TD[x : y] = D[x : y] / √( 1 + ‖∇f(y)‖^2 )

rotational invariance; conformal geometry

99

Slide: Total Bregman divergence (Vemuri)
TBD(p : q) = ( f(p) - f(q) - ∇f(q) · (p - q) ) / √( 1 + ‖∇f(q)‖^2 )
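A direct transcription of this formula for the simplest convex generator f(x) = ½‖x‖^2 (so ∇f(q) = q); the inputs are arbitrary test vectors:

```python
import numpy as np

# Total Bregman divergence sketch:
#   TBD(p : q) = (f(p) - f(q) - <grad f(q), p - q>) / sqrt(1 + ||grad f(q)||^2)
def tbd(p, q, f, grad_f):
    num = f(p) - f(q) - grad_f(q) @ (p - q)
    return num / np.sqrt(1.0 + grad_f(q) @ grad_f(q))

f = lambda x: 0.5 * x @ x
grad_f = lambda x: x

p, q = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(tbd(p, q, f, grad_f))   # 0.5*||p - q||^2 divided by sqrt(1 + ||q||^2) here
```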

100

Slide: Clustering: t-center
E = { x_1, ..., x_m }

t-center of E :  x̄ = argmin_x Σ_i TD[x, x_i]

101

Slide: t-center x̄

∇f(x̄) = Σ w_i ∇f(x_i) / Σ w_i,   w_i = 1 / √( 1 + ‖∇f(x_i)‖^2 )

102

Slide: q super-robust estimator (Eguchi)
p(x, θ) :  max Π p(x_i, θ)  →  max Π p(x_i, θ) / h_{q+1}(θ)

bias-corrected q-estimating function:
s_q(x, θ) = p(x, θ)^q { ∂_i log p - ∂_i c_{q+1}(θ) },   c_{q+1}(θ) = 1/(q+1) log h_{q+1}(θ)

estimating equation:  Σ_i s_q(x_i, θ) = 0

103

Slide: Conformal change of divergence
D̃(p : q) = σ(p) D[p : q]

g̃_ij = σ(p) g_ij

T̃_ijk = σ ( T_ijk + s_k g_ij + s_j g_ik + s_i g_jk ),   s_i = ∂_i log σ

104

Slide: t-center is robust
E_ε = { x_1, ..., x_n; y },   perturbed center  x̄(E_ε) = x̄ + ε z(x̄; y),   ε = 1/n

influence function z(x̄; y)

‖z‖ < c  as  y → ∞ :  robust

105

Slide: z(x̄, y) = G^{-1} w(y) ( ∇f(y) - ∇f(x̄) ),   G = (1/n) Σ w(x_i) ∇∇f(x_i)

Robust: z is bounded, since  w(y) ∇f(y) = ∇f(y) / √( 1 + ‖∇f(y)‖^2 )  has norm < 1

Euclidean case f = ½ ‖x‖^2 :   ‖z(x̄, y)‖ = ‖ G^{-1} y / √(1 + ‖y‖^2) ‖ < ∞
(ordinary Bregman divergence:  z(x̄, y) = y, unbounded)

106

Slide: MPEG7 database
Great intraclass variability, and small interclass dissimilarity.

107

Slide: Other TBD applications
Diffusion tensor imaging (DTI) analysis [Vemuri]

Interpolation; Segmentation

Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari and Frank Nielsen, Total Bregman Divergence and its Applications to DTI Analysis, IEEE TMI, to appear

108

Slide: TBD application: shape retrieval
Using MPEG7 database; 70 classes, with 20 shapes each class (Meizhu Liu)

109

Slide: Multiterminal Information & Statistical Inference

X : x^n  →  encoding rate R_X;   Y : y^n  →  encoding rate R_Y

p(x, y; θ),   M_X = 2^{n R_X},   M_Y = 2^{n R_Y}

110

Slide: p(x, y) : correlation;  p(x, y, z) : marginal

Ḡ = G_M + G_C;   0-rate;   Slepian-Wolf

111

Slide: Linear Systems
ARMA:  u_t → x_t

x_{t+1} = ( 1 + b_1 z^{-1} + ⋯ + b_q z^{-q} ) / ( 1 + a_1 z^{-1} + ⋯ + a_p z^{-p} ) u_t

θ = (a_1, ..., a_p; b_1, ..., b_q),   x_{t+1} = f(θ, z^{-1}, u_t)

AR --- e-flat;  MA --- m-flat

112

Slide: Machine Learning
Boosting : combination of weak learners

D = { (x_1, y_1), (x_2, y_2), ..., (x_N, y_N) },   y_i = ±1

weak learner f(x, u) :   y = h(x, u) = sgn f(x, u)

113

Slide: Weak Learners
H(x) = sgn( Σ α_t h_t(x) )

ε_t = Prob_{W_t}{ h_t(x_i) ≠ y_i }

W_{t+1}(i) = c W_t(i) exp{ -α_t y_i h_t(x_i) }

W_t : weight distribution
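A compact AdaBoost-style sketch with decision stumps as weak learners, following the weight update above; the data, stump search and number of rounds are illustrative choices:

```python
import numpy as np

# AdaBoost-style sketch: W_{t+1}(i) ∝ W_t(i) * exp(-alpha_t * y_i * h_t(x_i)),
# with one-dimensional threshold stumps h(x) = sign(x - s) as weak learners.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.sign(x + 0.3 * rng.normal(size=200))            # noisy labels in {-1, +1}

W = np.full(len(x), 1.0 / len(x))
alphas, stumps = [], []
for t in range(10):
    thresholds = np.linspace(x.min(), x.max(), 50)
    errs = [np.sum(W * (np.sign(x - s) != y)) for s in thresholds]
    s = thresholds[int(np.argmin(errs))]               # best weighted stump
    h = np.sign(x - s)
    eps = np.sum(W * (h != y))
    alpha = 0.5 * np.log((1 - eps) / eps)
    W = W * np.exp(-alpha * y * h)
    W /= W.sum()                                       # renormalize (the constant c)
    alphas.append(alpha); stumps.append(s)

H = np.sign(sum(a * np.sign(x - s) for a, s in zip(alphas, stumps)))
print(np.mean(H == y))                                 # training accuracy of the combined learner
```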

114

Slide: Boosting generalization
Q̃_t(y | x) = Q̃_{t-1}(y | x) exp{ α_t y h_t(x) - ψ }

F_t = { P(y, x) :  E[ y h_t(x) ] = const }

α_t :  min D[ P̂ : Q̃_t ]

D( P̂, Q̃_{t+1} ) < D( P̂, Q̃_t )

115

Slide: Integration of evidences
x_1, x_2, ..., x_m

arithmetic mean; geometric mean; harmonic mean; α-mean

116

Slide: Various Means
arithmetic:  (a + b) / 2

geometric:  √(ab)

harmonic:  2 / ( 1/a + 1/b )

Any other mean?

117

Slide: Generalized mean: f-mean
f(u) : monotone;  f-representation of u

m_f(a, b) = f^{-1}( ( f(a) + f(b) ) / 2 )

scale-free:  m_f(ca, cb) = c m_f(a, b)   ⇔   f(u) = u^{(1-α)/2}  (α ≠ 1),  log u  (α = 1)
(α-representation)
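A small sketch of the f-mean with the α-representation f(u) = u^{(1-α)/2} (log u for α = 1); the special values below recover the arithmetic, geometric and harmonic means:

```python
import numpy as np

# f-mean: m_f(a, b) = f^{-1}((f(a) + f(b)) / 2) with the alpha-representation
# f(u) = u^{(1-alpha)/2} (alpha != 1) or log u (alpha = 1).
def alpha_mean(a, b, alpha):
    if alpha == 1:
        return np.sqrt(a * b)                      # f = log  ->  geometric mean
    p = (1 - alpha) / 2.0
    return ((a ** p + b ** p) / 2.0) ** (1.0 / p)  # f(u) = u^p

a, b = 2.0, 8.0
for alpha in (-1, 0, 1, 3):
    print(alpha, alpha_mean(a, b, alpha))
# alpha = -1: arithmetic mean, alpha = 1: geometric mean, alpha = 3: harmonic mean
```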

118

Slide: α-mean
m_α( p_1(s), p_2(s) )

α = 1 :  √(ab);   α = -1 :  (a + b)/2

α = 0 :  ( (√a + √b)/2 )^2 = (a + b)/4 + √(ab)/2

α = ∞ :  m = min(a, b);   α = -∞ :  m = max(a, b)

119

Slide: α-Family of Distributions
{ p_1(s), ..., p_k(s) }

p(x; θ) = f^{-1}( Σ θ_i f( p_i(x) ) )

mixture family:  p_mix(s) = Σ_{i=1}^k t_i p_i(s),   Σ t_i = 1

exponential family:  log p_exp(s) = Σ t_i log p_i(s)   (up to normalization)

120