Graphical Models and Message-passing: Some Introductory Lectures

Martin Wainwright

Graphical models allow for flexible modeling of large collections of random variables and play an important role in many areas of statistics and machine learning. In this series of introductory lectures, we cover the basics of graphical models, along with the associated message-passing algorithms for computing marginals, modes, and likelihoods. We also discuss methods for learning graphical models from data.


Slide: Graphical models and message-passing Part I: Basics and MAP computation
Martin Wainwright

UC Berkeley, Departments of Statistics and EECS. Tutorial materials (slides, monograph, lecture notes) available at: www.eecs.berkeley.edu/~wainwrig/kyoto12
September 2, 2012


Slide: Introduction
graphical model: graph G = (V, E) with N vertices
random vector: (X_1, X_2, \ldots, X_N)

(Figure: (a) Markov chain; (b) multiscale quadtree; (c) two-dimensional grid)

useful in many statistical and computational fields:
- machine learning, artificial intelligence
- computational biology, bioinformatics
- statistical signal/image processing, spatial statistics
- statistical physics
- communication and information theory


Slide: Graphs and factorization
(Figure: graph on vertices 1-7, with highlighted cliques such as {4,5,6} and {2,4,7})

- clique C is a fully connected subset of vertices
- compatibility function \psi_C defined on variables x_C = \{x_s, s \in C\}
- factorization over all cliques:

    p(x_1, \ldots, x_N) = (1/Z) \prod_{C \in \mathcal{C}} \psi_C(x_C)

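To make the factorization concrete, here is a minimal Python sketch (not from the lectures; the tables and values are illustrative) that stores pairwise compatibility functions as arrays and evaluates the unnormalized product \prod_C \psi_C(x_C):

    import numpy as np

    # Pairwise model on the chain 0 - 1 - 2 with binary states.
    # psi[(s, t)] is an m x m table of values psi_st(x_s, x_t).
    psi = {
        (0, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),  # favors x_0 = x_1
        (1, 2): np.array([[1.0, 3.0], [3.0, 1.0]]),  # favors x_1 != x_2
    }

    def unnormalized_p(x):
        """Evaluate prod_C psi_C(x_C) for one configuration x."""
        val = 1.0
        for (s, t), table in psi.items():
            val *= table[x[s], x[t]]
        return val

    print(unnormalized_p((0, 0, 1)))  # 2.0 * 3.0 = 6.0

Dividing by Z = \sum_x \prod_C \psi_C(x_C) gives the normalized distribution; computing Z efficiently is exactly one of the core challenges discussed below.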

Slide: Example: Optical digit/character recognition

Goal: correctly label digits/characters based on noisy versions.
E.g., mail sorting; document scanning; handwriting recognition systems.


Slide: Example: Optical digit/character recognition

Goal: correctly label digits/characters based on noisy versions.
- strong sequential dependencies captured by (hidden) Markov chain
- message-passing spreads information along chain
(Baum & Petrie, 1966; Viterbi, 1967, and many others)


Slide: Example: Image processing and denoising

- 8-bit digital image: matrix of intensity values {0, 1, \ldots, 255}
- enormous redundancy in typical images (useful for denoising, compression, etc.)


Slide: Example: Image processing and denoising

- 8-bit digital image: matrix of intensity values {0, 1, \ldots, 255}
- enormous redundancy in typical images (useful for denoising, compression, etc.)
- multiscale tree used to represent coefficients of a multiscale transform (e.g., wavelets, Gabor filters, etc.)
(e.g., Willsky, 2002)


Slide: Example: Depth estimation in computer vision

Stereo pairs: two images taken from horizontally-offset cameras


Slide: Modeling depth with a graphical model
Introduce a variable at each pixel location (a, b):

    x_{ab} \equiv offset (disparity) between left and right images at position (a, b)

(Figure: grid-structured model linking the left and right images, with node compatibilities \psi_{ab}(x_{ab}) and edge compatibilities \psi_{ab,cd}(x_{ab}, x_{cd}))

Use message-passing algorithms to estimate the most likely offset/depth map. (Szeliski et al., 2005)


Slide: Many other examples
- natural language processing (e.g., parsing, translation)
- computational biology (gene sequences, protein folding, phylogenetic reconstruction)
- social network analysis (e.g., politics, Facebook, terrorism)
- communication theory and error-control decoding (e.g., turbo codes, LDPC codes)
- satisfiability problems (3-SAT, MAX-XORSAT, graph colouring)
- robotics (path planning, tracking, navigation)
- sensor network deployments (e.g., distributed detection, estimation, fault monitoring)
- ...


Slide: Core computational challenges
Given an undirected graphical model (Markov random field):

    p(x_1, x_2, \ldots, x_N) = (1/Z) \prod_{C \in \mathcal{C}} \psi_C(x_C)

How to efficiently compute?
- most probable configuration (MAP estimate):

    Maximize: x^* = \arg\max_{x \in \mathcal{X}^N} p(x_1, \ldots, x_N) = \arg\max_{x \in \mathcal{X}^N} \prod_{C \in \mathcal{C}} \psi_C(x_C)

- the data likelihood or normalization constant:

    Sum/integrate: Z = \sum_{x \in \mathcal{X}^N} \prod_{C \in \mathcal{C}} \psi_C(x_C)

- marginal distributions at single sites, or subsets:

    Sum/integrate: p(X_s = x_s) = (1/Z) \sum_{x_t, t \neq s} \prod_{C \in \mathcal{C}} \psi_C(x_C)

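For tiny models, all three quantities can be computed by brute-force enumeration; a baseline sketch (illustrative potentials; the cost is O(m^N), which is why message-passing matters):

    import itertools
    import numpy as np

    m, N = 2, 3                                   # binary chain on 3 nodes
    psi = {(0, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),
           (1, 2): np.array([[1.0, 3.0], [3.0, 1.0]])}

    def weight(x):
        """prod_C psi_C(x_C) for a full configuration x."""
        w = 1.0
        for (s, t), table in psi.items():
            w *= table[x[s], x[t]]
        return w

    configs = list(itertools.product(range(m), repeat=N))
    Z = sum(weight(x) for x in configs)           # normalization constant
    x_map = max(configs, key=weight)              # MAP configuration
    p1 = [sum(weight(x) for x in configs if x[1] == v) / Z
          for v in range(m)]                      # marginal p(X_1 = x_1)
    print(Z, x_map, p1)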

Slide: 1. Max-product message-passing on trees

Goal: Compute the most probable configuration (MAP estimate) on a tree:

    x^* = \arg\max_{x \in \mathcal{X}^N} \Big\{ \prod_{s \in V} \exp(\theta_s(x_s)) \prod_{(s,t) \in E} \exp(\theta_{st}(x_s, x_t)) \Big\}

(Figure: chain 1 - 2 - 3 with messages M_{12} and M_{32} flowing into node 2)

    \max_{x_1, x_2, x_3} p(x) = \max_{x_2} \exp(\theta_2(x_2)) \prod_{t \in \{1,3\}} \max_{x_t} \exp[\theta_t(x_t) + \theta_{2t}(x_2, x_t)]

Max-product strategy: divide and conquer: break global maximization into simpler sub-problems. (Lauritzen & Spiegelhalter, 1988)


Slide: Max-product on trees
Decompose:

    \max_{x_1, \ldots, x_5} p(x) = \max_{x_2} \Big\{ \exp(\theta_2(x_2)) \prod_{t \in N(2)} M_{t2}(x_2) \Big\}

(Figure: tree on nodes 1-5 with messages M_{12}, M_{32}, M_{43}, M_{53})

Update messages:

    M_{32}(x_2) = \max_{x_3} \exp(\theta_3(x_3) + \theta_{23}(x_2, x_3)) \prod_{v \in N(3) \setminus \{2\}} M_{v3}(x_3)


Slide: Putting together the pieces
Max-product is an exact algorithm for any tree.

(Figure: node t with neighbors u, w, v, their subtrees T_u, T_w, T_v, and the message M_{ts} passed from t to s)

    M_{ts} \equiv message from node t to s
    N(t) \equiv neighbors of node t

Update:

    M_{ts}(x_s) \leftarrow \max_{x_t' \in \mathcal{X}_t} \Big\{ \exp[\theta_{st}(x_s, x_t') + \theta_t(x_t')] \prod_{v \in N(t) \setminus \{s\}} M_{vt}(x_t') \Big\}

Max-marginals:

    \widetilde{p}_s(x_s; \theta) \propto \exp\{\theta_s(x_s)\} \prod_{t \in N(s)} M_{ts}(x_s)


Slide: Summary: max-product on trees
- converges in at most (graph diameter) # of iterations
- updating a single message is an O(m^2) operation
- overall algorithm requires O(N m^2) operations
- upon convergence, yields the exact max-marginals:

    \widetilde{p}_s(x_s) \propto \exp\{\theta_s(x_s)\} \prod_{t \in N(s)} M_{ts}(x_s)

- when \arg\max_{x_s} \widetilde{p}_s(x_s) = \{x_s^*\} for all s \in V, then x^* = (x_1^*, \ldots, x_N^*) is the unique MAP solution
- otherwise, there are multiple MAP solutions and one can be obtained by back-tracking

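A minimal sketch of these updates on a 3-node chain (illustrative potentials; log-domain \theta as on the slides; synchronous updates run for at least the graph diameter):

    import numpy as np

    m = 2
    theta = {0: np.array([0.5, 0.0]), 1: np.zeros(m), 2: np.zeros(m)}  # theta_s
    theta_e = {(0, 1): np.eye(m), (1, 2): np.eye(m)}                   # theta_st
    nbrs = {0: [1], 1: [0, 2], 2: [1]}

    def et(s, t):
        """theta_st(x_s, x_t); tables stored under the key with s < t."""
        return theta_e[(s, t)] if (s, t) in theta_e else theta_e[(t, s)].T

    # M[(t, s)] is the message M_ts(x_s), initialized to all ones
    M = {(t, s): np.ones(m) for t in nbrs for s in nbrs[t]}
    for _ in range(3):
        M = {(t, s): np.max(np.exp(et(s, t) + theta[t][None, :])
                            * np.prod([M[(v, t)] for v in nbrs[t] if v != s],
                                      axis=0),
                            axis=1)
             for (t, s) in M}

    # max-marginals p~_s(x_s) and the MAP configuration by local argmax
    for s in nbrs:
        q = np.exp(theta[s]) * np.prod([M[(t, s)] for t in nbrs[s]], axis=0)
        print(s, np.argmax(q), q)

In this example each max-marginal has a unique maximizer, so the local argmaxes assemble into the unique MAP solution, as stated above.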

Slide: 2. Max-product on graph with cycles?
(Figure: same local message-passing picture as on trees)

- max-product can be applied to graphs with cycles (no longer exact)
- empirical performance is often very good

    M_{ts} \equiv message from node t to s
    N(t) \equiv neighbors of node t


Slide: Partial guarantees for max-product
- single-cycle graphs and Gaussian models (Aji & McEliece, 1998; Horn, 1999; Weiss, 1998; Weiss & Freeman, 2001)
- local optimality guarantees:
  - tree-plus-loop neighborhoods (Weiss & Freeman, 2001)
  - optimality on more general sub-graphs (Wainwright et al., 2003)
- existence of fixed points for general graphs (Wainwright et al., 2003)
- exactness for certain matching problems (Bayati et al., 2005, 2008; Jebara & Huang, 2007; Sanghavi, 2008)
- no general optimality results

Questions:
- Can max-product return an incorrect answer with high confidence?
- Any connection to classical approaches to integer programs?


Slide: Standard analysis via computation tree
standard tool: computation tree of message-passing updates (Gallager, 1963; Weiss, 2001; Richardson & Urbanke, 2001)

(Figure: (a) original graph on nodes 1-4; (b) computation tree after 4 iterations of message-passing, rooted at node 1)

level t of tree: all nodes whose messages reach the root (node 1) after t iterations of message-passing


Slide: Example: Inexactness of standard max-product
(Wainwright et al., 2005)

Intuition:
- max-product solves (exactly) a modified problem on the computation tree
- nodes are not equally weighted in the computation tree
- => max-product can output an incorrect configuration

(Figure: (a) diamond graph G_dia on nodes 1-4; (b) computation tree, 4 iterations)

for example: asymptotic node fractions \alpha in this computation tree:

    (\alpha(1), \alpha(2), \alpha(3), \alpha(4)) = (0.2393, 0.2607, 0.2607, 0.2393)


Slide: A whole family of non-exact examples
(Figure: diamond graph on nodes 1-4)

    \theta_s(x_s) = \alpha x_s if s = 1 or s = 4, and \beta x_s if s = 2 or s = 3
    \theta_{st}(x_s, x_t) = \gamma if x_s = x_t, and 0 otherwise

- for \gamma sufficiently large, the optimal solution is always either 1^4 = (1, 1, 1, 1) or (-1)^4 = (-1, -1, -1, -1)
- first-order LP relaxation always exact for this problem
- max-product and LP relaxation give different decision boundaries:

    Optimal/LP boundary: x^* = 1^4 if 0.25\alpha + 0.25\beta \geq 0, and (-1)^4 otherwise
    Max-product boundary: x = 1^4 if 0.2393\alpha + 0.2607\beta \geq 0, and (-1)^4 otherwise


Slide: 3. A more general class of algorithms

- by introducing weights on edges, obtain a more general family of reweighted max-product algorithms
- with suitable edge weights, connected to linear programming relaxations
- many variants of these algorithms:
  - tree-reweighted max-product (W., Jaakkola & Willsky, 2002, 2005)
  - sequential TRMP (Kolmogorov, 2005)
  - convex message-passing (Weiss et al., 2007)
  - dual updating schemes (e.g., Globerson & Jaakkola, 2007)


Slide: Tree-reweighted max-product algorithms
(Wainwright, Jaakkola & Willsky, 2002)

Message update from node t to node s:

    M_{ts}(x_s) \leftarrow \max_{x_t' \in \mathcal{X}_t} \Bigg\{ \exp\Big[ \frac{\theta_{st}(x_s, x_t')}{\rho_{st}} + \theta_t(x_t') \Big] \cdot \frac{\prod_{v \in N(t) \setminus \{s\}} [M_{vt}(x_t')]^{\rho_{vt}}}{[M_{st}(x_t')]^{(1 - \rho_{ts})}} \Bigg\}

with the reweighted edge potential \theta_{st}/\rho_{st}, reweighted incoming messages [M_{vt}]^{\rho_{vt}}, and the opposite-direction message [M_{st}]^{(1-\rho_{ts})} in the denominator.

Properties:
1. Modified updates remain distributed and purely local over the graph. Messages are reweighted with \rho_{st} \in [0, 1].
2. Key differences: the potential on edge (s, t) is rescaled by \rho_{st} \in [0, 1], and the update involves the reverse-direction edge.
3. The choice \rho_{st} = 1 for all edges (s, t) recovers the standard update.

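A sketch of a single reweighted update in the same setting (a hypothetical helper, with illustrative \rho values; at \rho identically 1 it reduces to the standard max-product update):

    import numpy as np

    def trw_max_message(theta_t, theta_st, M_in, rho_in, M_st, rho_st, rho_ts):
        """One reweighted max-product message M_ts(x_s).

        theta_st[x_s, x_t] -- edge potential, rescaled by 1/rho_st
        M_in, rho_in       -- incoming messages M_vt and weights rho_vt
        M_st               -- opposite-direction message, exponent (1 - rho_ts)
        """
        A = np.exp(theta_st / rho_st + theta_t[None, :])
        for M_vt, rho_vt in zip(M_in, rho_in):
            A = A * (M_vt ** rho_vt)[None, :]
        A = A / (M_st ** (1.0 - rho_ts))[None, :]
        return A.max(axis=1)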

Slide: Edge appearance probabilities
Experiment: What is the probability \rho_e that a given edge e \in E belongs to a tree T drawn randomly under \rho?

(Figure: (a) original graph with edges labeled b, e, f; (b)-(d) three spanning trees T^1, T^2, T^3, each with \rho(T^i) = 1/3)

In this example:

    \rho_b = 1;  \rho_e = 2/3;  \rho_f = 1/3.

The vector \rho_e = \{\rho_e \mid e \in E\} must belong to the spanning tree polytope. (Edmonds, 1971)

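Edge appearance probabilities can be checked numerically by enumerating spanning trees of a small graph; a brute-force sketch (uniform \rho over all spanning trees, which is only one valid choice; the 4-cycle used here is an illustrative graph, not the slide's):

    import itertools

    def spanning_trees(n, edges):
        """Yield all spanning trees of a graph on vertices 0..n-1."""
        for subset in itertools.combinations(edges, n - 1):
            parent = list(range(n))
            def find(a):
                while parent[a] != a:
                    parent[a] = parent[parent[a]]
                    a = parent[a]
                return a
            acyclic = True
            for u, v in subset:
                ru, rv = find(u), find(v)
                if ru == rv:          # adding (u, v) would close a cycle
                    acyclic = False
                    break
                parent[ru] = rv
            if acyclic:               # n-1 acyclic edges => spanning tree
                yield subset

    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # 4-cycle: rho_e = 3/4 per edge
    trees = list(spanning_trees(4, edges))
    for e in edges:
        print(e, sum(e in T for T in trees) / len(trees))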

Slide: 4. Reweighted max-product and linear programming

MAP as integer program:

    f^* = \max_{x \in \mathcal{X}^N} \Big\{ \sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \Big\}

define local marginal distributions (e.g., for m = 3 states):

    \mu_s(x_s) = ( \mu_s(0), \mu_s(1), \mu_s(2) )
    \mu_{st}(x_s, x_t) = \begin{pmatrix} \mu_{st}(0,0) & \mu_{st}(0,1) & \mu_{st}(0,2) \\ \mu_{st}(1,0) & \mu_{st}(1,1) & \mu_{st}(1,2) \\ \mu_{st}(2,0) & \mu_{st}(2,1) & \mu_{st}(2,2) \end{pmatrix}

alternative formulation of MAP as linear program?

    g^* = \max_{(\mu_s, \mu_{st}) \in M(G)} \Big\{ \sum_{s \in V} E_{\mu_s}[\theta_s(x_s)] + \sum_{(s,t) \in E} E_{\mu_{st}}[\theta_{st}(x_s, x_t)] \Big\}

Local expectations: E_{\mu_s}[\theta_s(x_s)] := \sum_{x_s} \theta_s(x_s) \mu_s(x_s).

Key question: What constraints must local marginals \{\mu_s, \mu_{st}\} satisfy?

Slide: Marginal polytopes for general undirected models
M(G) \equiv set of all globally realizable marginals \{\mu_s, \mu_{st}\}:

    M(G) := \Big\{ \mu \in \mathbb{R}^d \;\Big|\; \mu_s(x_s) = \sum_{x_t, t \neq s} p(x), \;\; \mu_{st}(x_s, x_t) = \sum_{x_u, u \neq s,t} p(x) \text{ for some } p(\cdot) \text{ over } (X_1, \ldots, X_N) \in \{0, 1, \ldots, m-1\}^N \Big\}

(Figure: the polytope M(G), described by half-spaces \langle a_i, \mu \rangle \leq b_i)

- a polytope in d = m|V| + m^2|E| dimensions (m per vertex, m^2 per edge)
- with m^N vertices
- number of facets?


Slide: Marginal polytope for trees
M(T) \equiv special case of marginal polytope for tree T

local marginal distributions on nodes/edges (e.g., m = 3):

    \mu_s(x_s) = ( \mu_s(0), \mu_s(1), \mu_s(2) ), \quad \mu_{st}(x_s, x_t) = [\mu_{st}(j,k)]_{j,k=0}^{2}

Deep fact about tree-structured models: if \{\mu_s, \mu_{st}\} are non-negative and locally consistent:

    Normalization: \sum_{x_s} \mu_s(x_s) = 1
    Marginalization: \sum_{x_t'} \mu_{st}(x_s, x_t') = \mu_s(x_s),

then on any tree-structured graph T, they are globally consistent.

Follows from junction tree theorem (Lauritzen & Spiegelhalter, 1988).


Slide: Max-product on trees: Linear program solver
MAP problem as a simple linear program:

    f(x^*) = \max_{\mu \in M(T)} \Big\{ \sum_{s \in V} E_{\mu_s}[\theta_s(x_s)] + \sum_{(s,t) \in E} E_{\mu_{st}}[\theta_{st}(x_s, x_t)] \Big\}

subject to \mu in the tree marginal polytope:

    M(T) = \Big\{ \mu \geq 0 \;\Big|\; \sum_{x_s} \mu_s(x_s) = 1, \; \sum_{x_t} \mu_{st}(x_s, x_t) = \mu_s(x_s) \Big\}

Max-product and LP solving:
- on tree-structured graphs, max-product is a dual algorithm for solving the tree LP (Wai. & Jordan, 2003)
- max-product message M_{ts}(x_s) \equiv Lagrange multiplier for enforcing the constraint \sum_{x_t} \mu_{st}(x_s, x_t) = \mu_s(x_s)


Slide: Tree-based relaxation for graphs with cycles
Set of locally consistent pseudomarginals for a general graph G:

    L(G) = \Big\{ \tau \in \mathbb{R}^d \;\Big|\; \tau \geq 0, \; \sum_{x_s} \tau_s(x_s) = 1, \; \sum_{x_t} \tau_{st}(x_s, x_t) = \tau_s(x_s) \Big\}

(Figure: M(G) nested inside L(G); integral vertices of M(G), fractional vertices of L(G))

Key: For a general graph, L(G) is an outer bound on M(G), and yields a linear-programming relaxation of the MAP problem:

    f(x^*) = \max_{\mu \in M(G)} \theta^T \mu \;\leq\; \max_{\tau \in L(G)} \theta^T \tau

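The relaxation is easy to exercise numerically; a sketch (assuming scipy is available; potentials illustrative) that builds the first-order LP over L(G) for a frustrated 3-cycle and exhibits a loose, fractional optimum:

    import numpy as np
    from scipy.optimize import linprog

    nodes, edges, m = [0, 1, 2], [(0, 1), (1, 2), (0, 2)], 2

    # variable layout: 6 node entries tau_s(x), then 12 edge entries tau_st(x_s, x_t)
    def nvar(s, x):       return 2 * s + x
    def evar(k, xs, xt):  return 6 + 4 * k + 2 * xs + xt

    c = np.zeros(18)
    for k, _ in enumerate(edges):      # theta_st = 1 on disagreement, else 0
        c[evar(k, 0, 1)] = c[evar(k, 1, 0)] = 1.0

    A_eq, b_eq = [], []
    for s in nodes:                    # normalization of each tau_s
        row = np.zeros(18); row[nvar(s, 0)] = row[nvar(s, 1)] = 1.0
        A_eq.append(row); b_eq.append(1.0)
    for k, (s, t) in enumerate(edges): # marginalization in both directions
        for xs in range(m):
            row = np.zeros(18); row[nvar(s, xs)] = -1.0
            for xt in range(m): row[evar(k, xs, xt)] = 1.0
            A_eq.append(row); b_eq.append(0.0)
        for xt in range(m):
            row = np.zeros(18); row[nvar(t, xt)] = -1.0
            for xs in range(m): row[evar(k, xs, xt)] = 1.0
            A_eq.append(row); b_eq.append(0.0)

    res = linprog(-c, A_eq=np.array(A_eq), b_eq=np.array(b_eq))  # tau >= 0 by default
    print(-res.fun)  # 3.0, while the best integral configuration achieves only 2.0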

Slide: Looseness of L(G) with graphs with cycles

(Figure: single cycle on nodes 1, 2, 3, annotated with locally consistent (pseudo)marginals at the nodes and edges)

Pseudomarginals satisfy the obvious local constraints:

    Normalization: \sum_{x_s'} \tau_s(x_s') = 1 for all s \in V
    Marginalization: \sum_{x_s'} \tau_{st}(x_s', x_t) = \tau_t(x_t) for all edges (s, t)


Slide: TRW max-product and LP relaxation
First-order (tree-based) LP relaxation:

    f(x^*) \leq \max_{\tau \in L(G)} \Big\{ \sum_{s \in V} E_{\tau_s}[\theta_s(x_s)] + \sum_{(s,t) \in E} E_{\tau_{st}}[\theta_{st}(x_s, x_t)] \Big\}

Results: (Wainwright et al., 2005; Kolmogorov & Wainwright, 2005)

(a) Strong tree agreement: any TRW fixed point that satisfies the strong tree agreement condition specifies an optimal LP solution.
(b) LP solving: for any binary pairwise problem, TRW max-product solves the first-order LP relaxation.
(c) Persistence for binary problems: let S \subseteq V be the subset of vertices for which there exists a single point x_s^* \in \arg\max_{x_s} \tau_s(x_s). Then for any optimal solution y, it holds that y_s = x_s^*.


Slide: On-going work on LPs and conic relaxations
- tree-reweighted max-product solves first-order LP for any binary pairwise problem (Kolmogorov & Wainwright, 2005)
- convergent dual ascent scheme; LP-optimal for binary pairwise problems (Globerson & Jaakkola, 2007)
- convex free energies and zero-temperature limits (Wainwright et al., 2005; Weiss et al., 2006; Johnson et al., 2007)
- coding problems: adaptive cutting-plane methods (Taghavi & Siegel, 2006; Dimakis et al., 2006)
- dual decomposition and sub-gradient methods (Feldman et al., 2003; Komodakis et al., 2007; Duchi et al., 2007)
- solving higher-order relaxations; rounding schemes (e.g., Sontag et al., 2008; Ravikumar et al., 2008)


Slide: Hierarchies of conic programming relaxations
- tree-based LP relaxation using L(G): first in a hierarchy of hypertree-based relaxations (Wainwright & Jordan, 2004)
- hierarchies of SDP relaxations for polynomial programming (Lasserre, 2001; Parrilo, 2002)
- intermediate between LP and SDP: second-order cone programming (SOCP) relaxations (Ravikumar & Lafferty, 2006; Kumar et al., 2008)
- all relaxations: particular outer bounds on the marginal polytope

Key questions:
- when are particular relaxations tight?
- when does more computation (e.g., LP -> SOCP -> SDP) yield performance gains?


Slide: Stereo computation: Middlebury stereo benchmark set
- standard set of benchmarked examples for stereo algorithms (Scharstein & Szeliski, 2002)
- Tsukuba data set: image sizes 384 x 288 x 16 (W x H x D)

(Figure: (a) original image; (b) ground truth disparity)


Slide: Comparison of different methods

(Figure: four disparity maps: (a) scanline dynamic programming; (b) graph cuts; (c) ordinary belief propagation; (d) tree-reweighted max-product)

((a), (b): Scharstein & Szeliski, 2002; (c): Sun et al., 2002; (d): Weiss et al., 2005)


Slide: Ordinary belief propagation


Slide: Tree-reweighted max-product


Slide: Ground truth


Slide: Graphical models and message-passing Part II: Marginals and likelihoods
Martin Wainwright

UC Berkeley, Departments of Statistics and EECS. Tutorial materials (slides, monograph, lecture notes) available at: www.eecs.berkeley.edu/~wainwrig/kyoto12
September 3, 2012


Slide: Graphs and factorization
(Figure: graph on vertices 1-7, with highlighted cliques such as {4,5,6} and {2,4,7})

- clique C is a fully connected subset of vertices
- compatibility function \psi_C defined on variables x_C = \{x_s, s \in C\}
- factorization over all cliques:

    p(x_1, \ldots, x_N) = (1/Z) \prod_{C \in \mathcal{C}} \psi_C(x_C)


Slide: Core computational challenges
Given an undirected graphical model (Markov random field):

    p(x_1, x_2, \ldots, x_N) = (1/Z) \prod_{C \in \mathcal{C}} \psi_C(x_C)

How to efficiently compute?
- most probable configuration (MAP estimate):

    Maximize: x^* = \arg\max_{x \in \mathcal{X}^N} p(x_1, \ldots, x_N) = \arg\max_{x \in \mathcal{X}^N} \prod_{C \in \mathcal{C}} \psi_C(x_C)

- the data likelihood or normalization constant:

    Sum/integrate: Z = \sum_{x \in \mathcal{X}^N} \prod_{C \in \mathcal{C}} \psi_C(x_C)

- marginal distributions at single sites, or subsets:

    Sum/integrate: p(X_s = x_s) = (1/Z) \sum_{x_t, t \neq s} \prod_{C \in \mathcal{C}} \psi_C(x_C)


Slide: 1. Sum-product message-passing on trees

Goal: Compute the marginal distribution at node u on a tree, for

    p(x) \propto \prod_{s \in V} \exp(\theta_s(x_s)) \prod_{(s,t) \in E} \exp(\theta_{st}(x_s, x_t))

(Figure: chain 1 - 2 - 3 with messages M_{12} and M_{32} flowing into node 2)

    \sum_{x_1, x_2, x_3} p(x) = \sum_{x_2} \exp(\theta_2(x_2)) \prod_{t \in \{1,3\}} \sum_{x_t} \exp[\theta_t(x_t) + \theta_{2t}(x_2, x_t)]


Slide: Putting together the pieces
Sum-product is an exact algorithm for any tree.

(Figure: node t with neighbors u, w, v, their subtrees T_u, T_w, T_v, and the message M_{ts} passed from t to s)

    M_{ts} \equiv message from node t to s
    N(t) \equiv neighbors of node t

Update:

    M_{ts}(x_s) \leftarrow \sum_{x_t' \in \mathcal{X}_t} \Big\{ \exp[\theta_{st}(x_s, x_t') + \theta_t(x_t')] \prod_{v \in N(t) \setminus \{s\}} M_{vt}(x_t') \Big\}

Sum-marginals:

    p_s(x_s; \theta) \propto \exp\{\theta_s(x_s)\} \prod_{t \in N(s)} M_{ts}(x_s)


Slide: Summary: sum-product on trees
- converges in at most (graph diameter) # of iterations
- updating a single message is an O(m^2) operation
- overall algorithm requires O(N m^2) operations
- upon convergence, yields the exact node and edge marginals:

    p_s(x_s) \propto e^{\theta_s(x_s)} \prod_{u \in N(s)} M_{us}(x_s)
    p_{st}(x_s, x_t) \propto e^{\theta_s(x_s) + \theta_t(x_t) + \theta_{st}(x_s, x_t)} \prod_{u \in N(s) \setminus \{t\}} M_{us}(x_s) \prod_{u \in N(t) \setminus \{s\}} M_{ut}(x_t)

- messages can also be used to compute the partition function:

    Z = \sum_{x_1, \ldots, x_N} \prod_{s \in V} e^{\theta_s(x_s)} \prod_{(s,t) \in E} e^{\theta_{st}(x_s, x_t)}

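A minimal sketch on the same 3-node chain as before (illustrative potentials), computing node marginals and Z by sum-product and checking Z against brute force:

    import itertools
    import numpy as np

    m, nbrs = 2, {0: [1], 1: [0, 2], 2: [1]}
    theta = {0: np.array([0.5, 0.0]), 1: np.zeros(m), 2: np.zeros(m)}
    theta_e = {(0, 1): np.eye(m), (1, 2): np.eye(m)}
    et = lambda s, t: theta_e[(s, t)] if (s, t) in theta_e else theta_e[(t, s)].T

    M = {(t, s): np.ones(m) for t in nbrs for s in nbrs[t]}
    for _ in range(3):
        M = {(t, s): (np.exp(et(s, t) + theta[t][None, :])
                      * np.prod([M[(v, t)] for v in nbrs[t] if v != s], axis=0)
                      ).sum(axis=1)
             for (t, s) in M}

    marg = {s: np.exp(theta[s]) * np.prod([M[(t, s)] for t in nbrs[s]], axis=0)
            for s in nbrs}
    Z = marg[0].sum()                  # on a tree, any unnormalized marginal sums to Z
    print({s: q / q.sum() for s, q in marg.items()}, Z)

    Z_brute = sum(np.exp(sum(theta[s][x[s]] for s in nbrs)
                         + sum(et(s, t)[x[s], x[t]] for (s, t) in theta_e))
                  for x in itertools.product(range(m), repeat=3))
    print(Z_brute)  # matches Z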

Slide: 2. Sum-product on graph with cycles

as with max-product, a widely used heuristic with a long history:
- error-control coding: Gallager, 1963
- artificial intelligence: Pearl, 1988
- turbo decoding: Berrou et al., 1993
- etc.

some concerns with sum-product with cycles:
- no convergence guarantees
- can have multiple fixed points
- final estimate of Z is not a lower/upper bound

as before, can consider a broader class of reweighted sum-product algorithms

Slide: Tree-reweighted sum-product algorithms
Message update from node t to node s:

    M_{ts}(x_s) \leftarrow \sum_{x_t' \in \mathcal{X}_t} \Bigg\{ \exp\Big[ \frac{\theta_{st}(x_s, x_t')}{\rho_{st}} + \theta_t(x_t') \Big] \cdot \frac{\prod_{v \in N(t) \setminus \{s\}} [M_{vt}(x_t')]^{\rho_{vt}}}{[M_{st}(x_t')]^{(1 - \rho_{ts})}} \Bigg\}

with the reweighted edge potential \theta_{st}/\rho_{st}, reweighted incoming messages [M_{vt}]^{\rho_{vt}}, and the opposite-direction message [M_{st}]^{(1-\rho_{ts})} in the denominator.

Properties:
1. Modified updates remain distributed and purely local over the graph. Messages are reweighted with \rho_{st} \in [0, 1].
2. Key differences: the potential on edge (s, t) is rescaled by \rho_{st} \in [0, 1], and the update involves the reverse-direction edge.
3. The choice \rho_{st} = 1 for all edges (s, t) recovers the standard update.


Slide: Bethe entropy approximation
define local marginal distributions (e.g., for m = 3 states):

    \mu_s(x_s) = ( \mu_s(0), \mu_s(1), \mu_s(2) ), \quad \mu_{st}(x_s, x_t) = [\mu_{st}(j,k)]_{j,k=0}^{2}

define node-based entropy and edge-based mutual information:

    Node-based entropy: H_s(\mu_s) = -\sum_{x_s} \mu_s(x_s) \log \mu_s(x_s)
    Mutual information: I_{st}(\mu_{st}) = \sum_{x_s, x_t} \mu_{st}(x_s, x_t) \log \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s) \mu_t(x_t)}

\rho-reweighted Bethe entropy:

    H_{Bethe}(\mu) = \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E} \rho_{st} I_{st}(\mu_{st})

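A sketch computing H_s, I_st, and the reweighted Bethe entropy from given pseudomarginals (values illustrative; the 0 log 0 = 0 convention is handled by masking):

    import numpy as np

    def H(mu_s):
        """Node entropy H_s = -sum mu log mu."""
        p = mu_s[mu_s > 0]
        return -np.sum(p * np.log(p))

    def I(mu_st):
        """Edge mutual information I_st under the pairwise marginal."""
        mu_s, mu_t = mu_st.sum(axis=1), mu_st.sum(axis=0)
        mask = mu_st > 0
        return np.sum(mu_st[mask] * np.log(mu_st[mask]
                                           / np.outer(mu_s, mu_t)[mask]))

    def bethe_entropy(node_marg, edge_marg, rho):
        return (sum(H(mu) for mu in node_marg.values())
                - sum(rho[e] * I(mu) for e, mu in edge_marg.items()))

    node_marg = {0: np.array([0.5, 0.5]), 1: np.array([0.5, 0.5])}
    edge_marg = {(0, 1): np.array([[0.4, 0.1], [0.1, 0.4]])}
    print(bethe_entropy(node_marg, edge_marg, {(0, 1): 1.0}))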

Slide: Bethe entropy is exact for trees
exact for trees, using the factorization:

    p(x; \theta) = \prod_{s \in V} \mu_s(x_s) \prod_{(s,t) \in E} \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s) \mu_t(x_t)}


Slide: Reweighted sum-product and Bethe variational principle

Define the local constraint set

    L(G) = \Big\{ \tau_s, \tau_{st} \;\Big|\; \tau \geq 0, \; \sum_{x_s} \tau_s(x_s) = 1, \; \sum_{x_t} \tau_{st}(x_s, x_t) = \tau_s(x_s) \Big\}

Theorem. For any choice of positive edge weights \rho_{st} > 0:

(a) Fixed points of reweighted sum-product are stationary points of the Lagrangian associated with

    A_{Bethe}(\theta; \rho) := \max_{\tau \in L(G)} \Big\{ \sum_{s \in V} \langle \theta_s, \tau_s \rangle + \sum_{(s,t) \in E} \langle \theta_{st}, \tau_{st} \rangle + H_{Bethe}(\tau; \rho) \Big\}

(b) For valid choices of edge weights \{\rho_{st}\}, the fixed points are unique and moreover \log Z(\theta) \leq A_{Bethe}(\theta; \rho). In addition, reweighted sum-product converges with appropriate scheduling.

Slide: Lagrangian derivation of ordinary sum-product
- let's try to solve this problem by a (partial) Lagrangian formulation
- assign a Lagrange multiplier \lambda_{ts}(x_s) for each constraint C_{ts}(x_s) := \tau_s(x_s) - \sum_{x_t} \tau_{st}(x_s, x_t) = 0
- will enforce the normalization (\sum_{x_s} \tau_s(x_s) = 1) and non-negativity constraints explicitly
- the Lagrangian takes the form:

    L(\tau; \lambda) = \langle \theta, \tau \rangle + \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E(G)} I_{st}(\tau_{st}) + \sum_{(s,t) \in E} \Big[ \sum_{x_t} \lambda_{st}(x_t) C_{st}(x_t) + \sum_{x_s} \lambda_{ts}(x_s) C_{ts}(x_s) \Big]


Slide: Lagrangian derivation (part II)
taking derivatives of the Lagrangian w.r.t. \tau_s and \tau_{st} yields

    \frac{\partial L}{\partial \tau_s(x_s)} = \theta_s(x_s) - \log \tau_s(x_s) + \sum_{t \in N(s)} \lambda_{ts}(x_s) + C
    \frac{\partial L}{\partial \tau_{st}(x_s, x_t)} = \theta_{st}(x_s, x_t) - \log \frac{\tau_{st}(x_s, x_t)}{\tau_s(x_s) \tau_t(x_t)} - \lambda_{ts}(x_s) - \lambda_{st}(x_t) + C'

setting these partial derivatives to zero and simplifying:

    \tau_s(x_s) \propto \exp(\theta_s(x_s)) \prod_{t \in N(s)} \exp(\lambda_{ts}(x_s))
    \tau_{st}(x_s, x_t) \propto \exp(\theta_s(x_s) + \theta_t(x_t) + \theta_{st}(x_s, x_t)) \prod_{u \in N(s) \setminus \{t\}} \exp(\lambda_{us}(x_s)) \prod_{v \in N(t) \setminus \{s\}} \exp(\lambda_{vt}(x_t))

enforcing the constraint C_{ts}(x_s) = 0 on these representations yields the familiar update rule for the messages M_{ts}(x_s) = \exp(\lambda_{ts}(x_s)):

    M_{ts}(x_s) \propto \sum_{x_t} \exp[\theta_t(x_t) + \theta_{st}(x_s, x_t)] \prod_{u \in N(t) \setminus \{s\}} M_{ut}(x_t)


Slide: Convex combinations of trees
Idea: Upper bound A(\theta) := \log Z(\theta) with a convex combination of tree-structured problems.

    \theta = \rho(T^1)\theta(T^1) + \rho(T^2)\theta(T^2) + \rho(T^3)\theta(T^3)
    A(\theta) \leq \rho(T^1)A(\theta(T^1)) + \rho(T^2)A(\theta(T^2)) + \rho(T^3)A(\theta(T^3))

    \rho = \{\rho(T)\} \equiv probability distribution over spanning trees
    \theta(T) \equiv tree-structured parameter vector


Slide: Finding the tightest upper bound

Observation: For each fixed distribution \rho over spanning trees, there are many such upper bounds.

Goal: Find the tightest such upper bound over all trees.

Challenge: Number of spanning trees grows rapidly in graph size.

Example: On the 2-D lattice:

    Grid size | # trees
    9         | 192
    16        | 100352
    36        | 3.26 x 10^13
    100       | 5.69 x 10^42

By a suitable dual reformulation, this problem can be avoided.

Key duality relation:

    \min_{\{\theta(T)\}:\; \sum_T \rho(T)\theta(T) = \theta} \; \sum_T \rho(T) A(\theta(T)) \;=\; \max_{\tau \in L(G)} \Big\{ \langle \tau, \theta \rangle + H_{Bethe}(\tau; \rho_{st}) \Big\}

Slide: Edge appearance probabilities
Experiment: What is the probability \rho_e that a given edge e \in E belongs to a tree T drawn randomly under \rho?

(Figure: (a) original graph with edges labeled b, e, f; (b)-(d) three spanning trees T^1, T^2, T^3, each with \rho(T^i) = 1/3)

In this example:

    \rho_b = 1;  \rho_e = 2/3;  \rho_f = 1/3.

The vector \rho_e = \{\rho_e \mid e \in E\} must belong to the spanning tree polytope. (Edmonds, 1971)


Slide: Why does entropy arise in the duality?

Due to a deep correspondence between two problems:

Maximum entropy density estimation:

    Maximize entropy H(p) = -\sum_x p(x_1, \ldots, x_N) \log p(x_1, \ldots, x_N)
    subject to expectation constraints of the form \sum_x p(x) \phi_\alpha(x) = \widehat{\mu}_\alpha.

Maximum likelihood in exponential family:

    Maximize the likelihood of parameterized densities
    p(x_1, \ldots, x_N; \theta) = \exp\{ \langle \theta, \phi(x) \rangle - A(\theta) \}.

Slide: Conjugate dual functions
conjugate duality is a fertile source of variational representations any function f can be used to dene another function f  as follows: f  (v) := sup
uRn

v, u  f (u) .

easy to show that f  is always a convex function how about taking the dual of the dual? I.e., what is (f  ) ? when f is well-behaved (convex and lower semi-continuous), we have (f  ) = f , or alternatively stated: f (u) = sup
vRn

u, v  f  (v)

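As a quick numerical illustration of the definition (a sketch of my own, not from the lecture), the conjugate can be approximated on a grid. Taking $f$ to be the Bernoulli log-partition function $A(\theta) = \log(1 + e^\theta)$ from the next slide, the grid conjugate matches the negative entropy $\mu\log\mu + (1-\mu)\log(1-\mu)$:

import numpy as np

def conjugate(f, vs, us):
    """Grid approximation of f*(v) = sup_u { v*u - f(u) }."""
    return np.array([np.max(v * us - f(us)) for v in vs])

A = lambda t: np.logaddexp(0.0, t)        # log(1 + exp(t)), numerically stable
us = np.linspace(-30.0, 30.0, 20001)      # grid over the primal variable u
mus = np.linspace(0.05, 0.95, 10)         # dual variables to evaluate
A_star = conjugate(A, mus, us)
neg_ent = mus * np.log(mus) + (1 - mus) * np.log(1 - mus)
print(np.max(np.abs(A_star - neg_ent)))   # tiny, up to grid resolution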

Slide: Geometric view: Supporting hyperplanes
Question: Given all hyperplanes in $\mathbb{R}^n \times \mathbb{R}$ with normal $(v, -1)$, what is the intercept of the one that supports $\mathrm{epi}(f)$?

[Figure: graph of $f(u)$ with two candidate hyperplanes $\langle v, u\rangle - c_a$ and $\langle v, u\rangle - c_b$; only the first supports the epigraph.]

Epigraph of $f$:
\[
\mathrm{epi}(f) := \{(u, \beta) \in \mathbb{R}^{n+1} \mid f(u) \le \beta\}.
\]
Analytically, we require the smallest $c \in \mathbb{R}$ such that
\[
\langle v, u\rangle - c \le f(u) \quad \text{for all } u \in \mathbb{R}^n.
\]
By re-arranging, we find that this optimal $c^*$ is the dual value:
\[
c^* = \sup_{u \in \mathbb{R}^n} \bigl\{ \langle v, u\rangle - f(u) \bigr\}.
\]


Slide: Example: Single Bernoulli
Random variable $X \in \{0, 1\}$ yields an exponential family of the form:
\[
p(x; \theta) \propto \exp(\theta x), \qquad \text{with } A(\theta) = \log\bigl(1 + \exp(\theta)\bigr).
\]
Let's compute the dual:
\[
A^*(\mu) := \sup_{\theta \in \mathbb{R}} \bigl\{ \mu\theta - \log[1 + \exp(\theta)] \bigr\}.
\]
(Possible) stationary point: $\mu = \exp(\theta)/[1 + \exp(\theta)]$.

[Figure: (a) for $\mu \in (0, 1)$, the epigraph can be supported; (b) for $\mu$ outside $[0, 1]$, the epigraph cannot be supported.]

We find that:
\[
A^*(\mu) = \begin{cases} \mu\log\mu + (1-\mu)\log(1-\mu) & \text{if } \mu \in [0, 1] \\ +\infty & \text{otherwise.} \end{cases}
\]
Leads to the variational representation:
\[
A(\theta) = \max_{\mu \in [0,1]} \bigl\{ \mu\theta - A^*(\mu) \bigr\}.
\]

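For the record, the calculus behind the stated dual is short (a standard derivation, spelled out here for completeness): solving the stationarity condition for $\theta$ and substituting back gives
\[
\mu = \frac{e^{\theta}}{1 + e^{\theta}}
\;\Longrightarrow\; \theta = \log\frac{\mu}{1-\mu}
\;\Longrightarrow\;
A^*(\mu) = \mu \log\frac{\mu}{1-\mu} + \log(1-\mu)
         = \mu\log\mu + (1-\mu)\log(1-\mu).
\]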

Slide: Geometry of Bethe variational problem
[Figure: the marginal polytope $M(G)$ and its polyhedral outer bound $L(G)$, with points labeled $\mu_{\mathrm{int}}$ and $\mu_{\mathrm{frac}}$.]

belief propagation uses a polyhedral outer approximation to $M(G)$:
- for any graph, $L(G) \supseteq M(G)$
- equality holds $\iff$ $G$ is a tree

Natural question: Do BP fixed points ever fall outside of the marginal polytope $M(G)$?

Slide: Illustration: Globally inconsistent BP fixed points
Consider the following assignment of pseudomarginals $\tau_s$, $\tau_{st}$:

[Figure: a 3-cycle on nodes 1, 2, 3, annotated with singleton and pairwise pseudomarginal tables; the numeric entries were lost in extraction.]

Locally consistent (pseudo)marginals:
- can verify that $\tau \in L(G)$, and that $\tau$ is a fixed point of belief propagation (with all constant messages)
- however, $\tau$ is globally inconsistent
Note: More generally, for any $\tau$ in the interior of $L(G)$, one can construct a distribution with $\tau$ as a BP fixed point.
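The numeric pseudomarginal tables in this figure did not survive extraction, so the following sketch uses illustrative values of the standard kind for such counterexamples (my choice, not necessarily the slide's). It checks the marginalization constraints that define $L(G)$ on a binary 3-cycle; the assignment passes, yet no joint distribution can realize it.

import numpy as np

# singleton pseudomarginals tau_s and pairwise tables tau_st on the 3-cycle
tau_node = {s: np.array([0.5, 0.5]) for s in (1, 2, 3)}
tau_edge = {
    (1, 2): np.array([[0.4, 0.1], [0.1, 0.4]]),   # favors X1 = X2
    (2, 3): np.array([[0.4, 0.1], [0.1, 0.4]]),   # favors X2 = X3
    (1, 3): np.array([[0.1, 0.4], [0.4, 0.1]]),   # favors X1 != X3
}

def in_L(tau_node, tau_edge, tol=1e-9):
    """Local consistency: each tau_st sums to 1 with margins tau_s, tau_t."""
    for (s, t), T in tau_edge.items():
        ok = (np.isclose(T.sum(), 1.0, atol=tol)
              and np.allclose(T.sum(axis=1), tau_node[s], atol=tol)
              and np.allclose(T.sum(axis=0), tau_node[t], atol=tol))
        if not ok:
            return False
    return True

print(in_L(tau_node, tau_edge))   # True: tau lies in L(G)
# Still globally inconsistent: the first two edges force P[X1 = X3] >= 0.6,
# while the third edge asserts P[X1 = X3] = 0.2.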

Slide: High-level perspective: A broad class of methods
message-passing algorithms (e.g., mean field, belief propagation) are solving approximate versions of an exact variational principle in exponential families
there are two distinct components to the approximations:
(a) can use either inner or outer bounds to $M$
(b) various approximations to the entropy function $-A^*(\mu)$
Refining one or both components yields better approximations:
- BP: polyhedral outer bound and non-convex Bethe approximation
- Kikuchi and variants: tighter polyhedral outer bounds and better entropy approximations (e.g., Yedidia et al., 2002)
- Expectation-propagation: better outer bounds and Bethe-like entropy approximations (Minka, 2002)


Slide: Graphical models and message-passing: Part III: Learning graphs from data
Martin Wainwright

UC Berkeley Departments of Statistics, and EECS


Slide: Introduction
previous lectures on forward problems: given a graphical model, perform some type of computation
- Part I: compute most probable (MAP) assignment
- Part II: compute marginals and likelihoods
inverse problems concern learning the parameters and structure of graphs from data
many instances of such graph learning problems:
- fitting graphs to politicians' voting behavior
- modeling diseases with epidemiological networks
- traffic flow modeling
- interactions between different genes
- and so on...


Slide: Example: US Senate network (2004–2006 voting)
[Figure: estimated network of voting interactions among senators.]
(Banerjee et al., 2008; Ravikumar, W. & Lafferty, 2010)


Slide: Example: Biological networks
gene networks during the Drosophila life cycle (Ahmed & Xing, PNAS, 2009)
many other examples:
- protein networks
- phylogenetic trees


Slide: Learning for pairwise models
given $n$ samples drawn from
\[
Q(x_1, \dots, x_p; \theta) = \frac{1}{Z(\theta)} \exp\Bigl\{ \sum_{s \in V} \theta_s x_s^2 + \sum_{(s,t) \in E} \theta_{st}\, x_s x_t \Bigr\}
\]
graph $G$ and matrix $[\theta]_{st} = \theta_{st}$ of edge weights are unknown
data matrix:
- Ising model (binary variables): $X_1^n \in \{0, 1\}^{n \times p}$
- Gaussian model: $X_1^n \in \mathbb{R}^{n \times p}$
estimator $X_1^n \mapsto \widehat{\theta}$
various loss functions are possible:
- graph selection: $\mathrm{supp}[\widehat{\theta}] = \mathrm{supp}[\theta]$?
- bounds on Kullback–Leibler divergence $D(Q_{\widehat{\theta}} \,\|\, Q_{\theta})$
- bounds on operator norm $|||\widehat{\theta} - \theta|||_{\mathrm{op}}$

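To make "drawn $n$ samples from $Q$" operational for the Ising case, here is a minimal Gibbs-sampler sketch (my own illustration; the burn-in and thinning values are arbitrary). For $x \in \{0,1\}^p$, the conditional log-odds of $x_s = 1$ is $\theta_s + \sum_{t \in N(s)} \theta_{st} x_t$.

import numpy as np

rng = np.random.default_rng(0)

def gibbs_samples(theta, n, burn=500, thin=10):
    """Approximate samples from the pairwise Ising model Q(x; theta), x in {0,1}^p.
    theta: symmetric (p, p) array; diagonal = theta_s, off-diagonal = theta_st."""
    p = theta.shape[0]
    x = rng.integers(0, 2, size=p).astype(float)
    out = []
    for it in range(burn + n * thin):
        for s in range(p):
            # conditional log-odds of x_s = 1 given all other coordinates
            logit = theta[s, s] + theta[s] @ x - theta[s, s] * x[s]
            x[s] = float(rng.random() < 1.0 / (1.0 + np.exp(-logit)))
        if it >= burn and (it - burn) % thin == 0:
            out.append(x.copy())
    return np.array(out)

p = 5
theta = np.zeros((p, p))
for s in range(p - 1):                    # chain graph, attractive couplings
    theta[s, s + 1] = theta[s + 1, s] = 0.8
X = gibbs_samples(theta, n=500)           # data matrix X in {0,1}^{n x p}
print(X.mean(axis=0))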

Slide: Challenges in graph selection
For pairwise models, the negative log-likelihood takes the form:
\[
\ell(\theta; X_1^n) := -\frac{1}{n} \sum_{i=1}^n \log Q(x_{i1}, \dots, x_{ip}; \theta)
= \log Z(\theta) - \sum_{s \in V} \theta_s\, \widehat{\mu}_s - \sum_{(s,t)} \theta_{st}\, \widehat{\mu}_{st}
\]
(where $\widehat{\mu}$ denotes the empirical moments)
maximizing the likelihood involves computing $\log Z(\theta)$ or its derivatives (marginals)
for Gaussian graphical models, this is a log-determinant program
for discrete graphical models, various work-arounds are possible:
- Markov chain Monte Carlo and stochastic gradient
- variational approximations to the likelihood
- pseudo-likelihoods (stated after this slide)

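For concreteness, the pseudo-likelihood work-around (Besag, 1974; stated here in its standard form, not copied from the slides) replaces the intractable likelihood by a product of node conditionals, so that only per-node normalizations are needed instead of $Z(\theta)$:
\[
\ell_{\mathrm{PL}}(\theta; X_1^n) := -\frac{1}{n}\sum_{i=1}^{n}\sum_{s\in V}
\log Q\bigl(x_{is} \,\big|\, x_{i,\,V\setminus s};\, \theta\bigr).
\]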

Slide: Methods for graph selection
for Gaussian graphical models:
- $\ell_1$-regularized neighborhood regression for Gaussian MRFs (e.g., Meinshausen & Buhlmann, 2005; Wainwright, 2006; Zhao & Yu, 2006)
- $\ell_1$-regularized log-determinant (e.g., Yuan & Lin, 2006; d'Aspremont et al., 2007; Friedman, 2008; Rothman et al., 2008; Ravikumar et al., 2008) (see the sketch after this slide)
methods for discrete MRFs:
- exact solution for trees (Chow & Liu, 1967)
- local testing (e.g., Spirtes et al., 2000; Kalisch & Buhlmann, 2008)
- various other methods:
  - distribution fits by KL-divergence (Abbeel et al., 2005)
  - $\ell_1$-regularized logistic regression (Ravikumar, W. & Lafferty, 2008, 2010)
  - approximate max. entropy approach and thinned graphical models (Johnson et al., 2007)
  - neighborhood-based thresholding method (Bresler, Mossel & Sly, 2008)
information-theoretic analysis:
- pseudolikelihood and BIC criterion (Csiszar & Talata, 2006)
- information-theoretic limitations (Santhanam & W., 2008, 2012)

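As one concrete instance of the $\ell_1$-regularized log-determinant approach, here is a minimal sketch using scikit-learn's GraphicalLasso (my choice of library and penalty level; the slides prescribe neither). The estimated edge set is the support of the off-diagonal of the fitted precision matrix.

import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
p = 5
Theta = np.eye(p)                          # ground-truth precision: chain graph
for s in range(p - 1):
    Theta[s, s + 1] = Theta[s + 1, s] = 0.4
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta), size=2000)

model = GraphicalLasso(alpha=0.05).fit(X)  # alpha = l1 penalty strength
edges = [(s, t) for s in range(p) for t in range(s + 1, p)
         if abs(model.precision_[s, t]) > 1e-6]
print(edges)                               # should roughly recover the chain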

Slide: Graphs and random variables
associate to each node $s \in V$ a random variable $X_s$

[Figure: a graph on vertices 1–7 with maximal cliques (123), (345), (456), (47), and a vertex cutset $S$ separating subsets $A$ and $B$.]

for each subset $A \subseteq V$, random vector $X_A := \{X_s, s \in A\}$
a clique $C \subseteq V$ is a subset of vertices all joined by edges
a vertex cutset is a subset $S \subset V$ whose removal breaks the graph into two or more pieces


Slide: Factorization and Markov properties
The graph $G$ can be used to impose constraints on the random vector $X = X_V$ (or on the distribution $Q$) in different ways.
Markov property: $X$ is Markov w.r.t. $G$ if $X_A$ and $X_B$ are conditionally independent given $X_S$ whenever $S$ separates $A$ and $B$.
Factorization: The distribution $Q$ factorizes according to $G$ if it can be expressed as a product over cliques:
\[
Q(x_1, x_2, \dots, x_p) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)
\]
where $Z$ is the normalization constant and $\psi_C$ is the compatibility function on clique $C$.

Theorem (Hammersley & Clifford, 1973): For strictly positive $Q(\cdot)$, the Markov property and the Factorization property are equivalent.


Slide: Markov property and neighborhood structure
Markov properties encode neighborhood structure:
\[
(X_s \mid X_{V \setminus s}) \;\stackrel{d}{=}\; (X_s \mid X_{N(s)})
\]

[Figure: conditioning on the full graph versus conditioning on the Markov blanket $N(s) = \{t, u, v, w\}$ of node $s$.]

- basis of pseudolikelihood method (Besag, 1974)
- basis of many graph learning algorithms (Friedman et al., 1999; Csiszar & Talata, 2005; Abbeel et al., 2006; Meinshausen & Buhlmann, 2006)


Slide: Graph selection via neighborhood regression
[Figure: an $n \times p$ binary data matrix; column $X_s$ is predicted from the remaining columns.]

Predict $X_s$ based on $X_{\setminus s} := \{X_t, t \neq s\}$.
1. For each node $s \in V$, compute the (regularized) max. likelihood estimate:
\[
\widehat{\theta}[s] := \arg\min_{\theta \in \mathbb{R}^{p-1}} \Bigl\{ \underbrace{\frac{1}{n} \sum_{i=1}^n L(\theta; X_{i, \setminus s})}_{\text{local log. likelihood}} \;+\; \underbrace{\lambda_n \|\theta\|_1}_{\text{regularization}} \Bigr\}
\]
2. Estimate the local neighborhood $\widehat{N}(s)$ as the support of the regression vector $\widehat{\theta}[s] \in \mathbb{R}^{p-1}$.

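A minimal sketch of steps 1–2 for binary data, using scikit-learn's $\ell_1$-penalized logistic regression as the local estimator (one plausible implementation; the slides fix the estimator, not the library; note sklearn's C corresponds to $1/(n\lambda_n)$):

import numpy as np
from sklearn.linear_model import LogisticRegression

def neighborhood_select(X, lam):
    """X: (n, p) binary data; returns {s: estimated neighborhood N(s)}."""
    n, p = X.shape
    nbrs = {}
    for s in range(p):
        idx = [t for t in range(p) if t != s]   # regress X_s on X_{\s}
        clf = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=1.0 / (n * lam)).fit(X[:, idx], X[:, s])
        support = np.flatnonzero(np.abs(clf.coef_[0]) > 1e-6)
        nbrs[s] = {idx[j] for j in support}
    return nbrs

def edges_and(nbrs):
    """AND rule: keep (s, t) only if each node selects the other."""
    return {(s, t) for s in nbrs for t in nbrs[s] if s < t and s in nbrs[t]}

Run on the Gibbs samples X from the earlier sketch, edges_and(neighborhood_select(X, 0.05)) should recover the chain edges (the regularization level is again illustrative).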

Slide: High-dimensional analysis
classical analysis: graph size $p$ fixed, sample size $n \to +\infty$
high-dimensional analysis: allow both dimension $p$, sample size $n$, and maximum degree $d$ to increase at arbitrary rates
take $n$ i.i.d. samples from the MRF defined by $G_{p,d}$
study probability of success as a function of three parameters:
\[
\mathrm{Success}(n, p, d) = Q[\text{Method recovers graph } G_{p,d} \text{ from } n \text{ samples}]
\]
theory is non-asymptotic: explicit probabilities for finite $(n, p, d)$


Slide: Empirical behavior: Unrescaled plots
[Figure: star graph with a linear fraction of neighbors; probability of success (0 to 1) versus number of samples (0 to 600), with curves for p = 64, 100, 225.]


Slide: Empirical behavior: Appropriately rescaled
[Figure: same star-graph experiment; probability of success versus the rescaled control parameter (0 to 2), with the curves for p = 64, 100, 225 now aligned.]


Slide: Rescaled plots (2-D lattice graphs)
[Figure: 4-nearest-neighbor grid (attractive); probability of success versus the control parameter (0 to 3), with curves for p = 64, 100, 225.]


Slide: Sufficient conditions for consistent Ising selection
graph sequences $G_{p,d} = (V, E)$ with $p$ vertices and maximum degree $d$
edge weights $|\theta_{st}| \ge \theta_{\min}$ for all $(s, t) \in E$
draw $n$ i.i.d. samples, and analyze probability of success indexed by $(n, p, d)$

Theorem (Ravikumar, W. & Lafferty, 2006, 2010)
Under incoherence conditions, for a rescaled sample size
\[
\theta_{LR}(n, p, d) := \frac{n}{d^3 \log p} > \theta_{\mathrm{crit}}
\]
and regularization parameter $\lambda_n \ge c_1 \sqrt{\log p / n}$, the following hold with probability greater than $1 - 2\exp(-c_2 \lambda_n^2 n)$:
(a) Correct exclusion: The estimated sign neighborhood $\widehat{N}_\pm(s)$ correctly excludes all edges not in the true neighborhood.
(b) Correct inclusion: For $\theta_{\min} \ge c_3 \lambda_n$, the method selects the correct signed neighborhood.

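To see what the $d^3 \log p$ scaling means numerically, a two-line sketch (the constant $\theta_{\mathrm{crit}}$ and the degree $d$ below are illustrative placeholders; the theorem leaves the constant unspecified):

import numpy as np

def control_parameter(n, p, d):
    """Rescaled sample size from the theorem: n / (d^3 log p)."""
    return n / (d ** 3 * np.log(p))

theta_crit = 1.0                      # illustrative placeholder constant
d = 3                                 # illustrative maximum degree
for p in (64, 100, 225):
    n_req = int(np.ceil(theta_crit * d ** 3 * np.log(p)))
    print(p, n_req)                   # required n grows like d^3 log p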

Slide: Some related work
- thresholding estimator (poly-time for bounded degree) works with $n \gtrsim 2^d \log p$ samples (Bresler et al., 2008)
- information-theoretic lower bound over the family $G_{p,d}$: any method requires at least $n = \Omega(d^2 \log p)$ samples (Santhanam & W., 2008)
- $\ell_1$-based method: sharper achievable rates, but also failure for $\theta$ large enough to violate incoherence (Bento & Montanari, 2009)
- empirical study: $\ell_1$-based method can succeed beyond the phase transition on the Ising model (Aurell & Ekeberg, 2011)


Slide: 3. Info. theory: Graph selection as channel coding
graphical model selection is an unorthodox channel coding problem:
- codewords/codebook: graph $G$ in some graph class $\mathcal{G}$
- channel use: draw sample $X_i = (X_{i1}, \dots, X_{ip})$ from the Markov random field $Q(\cdot \mid G)$
- decoding problem: use $n$ samples $\{X_1, \dots, X_n\}$ to correctly distinguish the codeword
\[
G \;\longrightarrow\; Q(X \mid G) \;\longrightarrow\; X_1, \dots, X_n \;\longrightarrow\; \widehat{G}
\]
Channel capacity for graph decoding is determined by the balance between:
- log number of models
- relative distinguishability of different models


Slide: Necessary conditions for $G_{d,p}$
$G \in G_{d,p}$: graphs with $p$ nodes and max. degree $d$
Ising models with:
- Minimum edge weight: $|\theta_{st}| \ge \theta_{\min}$ for all edges $(s, t)$
- Maximum neighborhood weight: $\omega(\theta) := \max_{s \in V} \sum_{t \in N(s)} |\theta_{st}|$

Theorem (Santhanam & W., 2008)
If the sample size $n$ is upper bounded by
\[
n \;<\; \max\Bigl\{ \frac{\exp(\omega(\theta)/4)\; d\,\theta_{\min}\, \log(pd/8)}{128\, \exp(3\theta_{\min}/2)},\;\; \frac{d}{8} \log\frac{p}{8d},\;\; \frac{\log p}{2\, \theta_{\min} \tanh(\theta_{\min})} \Bigr\}
\]
then the probability of error of any algorithm over $G_{d,p}$ is at least $1/2$.

Interpretation:
- Naive bulk effect: arises from the log cardinality $\log |G_{d,p}|$
- d-clique effect: difficulty of separating models that contain a near d-clique
- Small weight effect: difficulty of detecting edges with small weights


Slide: Some consequences
Corollary
For asymptotically reliable recovery over $G_{d,p}$, any algorithm requires at least $n = \Omega(d^2 \log p)$ samples.

- note that the maximum neighborhood weight satisfies $\omega(\theta^*) \ge d\,\theta_{\min}$ $\Longrightarrow$ require $\theta_{\min} = O(1/d)$
- from the small weight effect:
\[
n = \Omega\Bigl( \frac{\log p}{2\,\theta_{\min} \tanh(\theta_{\min})} \Bigr) = \Omega\Bigl( \frac{\log p}{\theta_{\min}^2} \Bigr),
\]
which is $\Omega(d^2 \log p)$ once $\theta_{\min} = O(1/d)$
- conclude that $\ell_1$-regularized logistic regression (LR) is optimal up to a factor $O(d)$ (Ravikumar, W. & Lafferty, 2010)


Slide: Proof sketch: Main ideas for necessary conditions
based on assessing the difficulty of graph selection over various sub-ensembles $\widetilde{\mathcal{G}} \subseteq G_{p,d}$
choose $G \in \widetilde{\mathcal{G}}$ uniformly at random, and consider the multi-way hypothesis testing problem based on the data $X_1^n = \{X_1, \dots, X_n\}$
for any graph estimator $\psi: \mathcal{X}^n \to \widetilde{\mathcal{G}}$, Fano's inequality implies that
\[
Q[\psi(X_1^n) \neq G] \;\ge\; 1 - \frac{I(X_1^n; G) + \log 2}{\log |\widetilde{\mathcal{G}}|}
\]
where $I(X_1^n; G)$ is the mutual information between the observations $X_1^n$ and the randomly chosen graph $G$
remaining steps:
1. Construct difficult sub-ensembles $\widetilde{\mathcal{G}} \subseteq G_{p,d}$.
2. Compute or lower bound the log cardinality $\log |\widetilde{\mathcal{G}}|$.
3. Upper bound the mutual information $I(X_1^n; G)$.


Slide: Summary
simple $\ell_1$-regularized neighborhood selection:
- polynomial-time method for learning neighborhood structure
- natural extensions (using block regularization) to higher-order models
information-theoretic limits of graph learning
Some papers:
- Ravikumar, W. & Lafferty (2010). High-dimensional Ising model selection using $\ell_1$-regularized logistic regression. Annals of Statistics.
- Santhanam & W. (2012). Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Transactions on Information Theory.


Slide: Two straightforward ensembles
1. Naive bulk ensemble: all graphs on $p$ vertices with max. degree $d$ (i.e., $\widetilde{\mathcal{G}} = G_{p,d}$)
- simple counting argument: $\log |G_{p,d}| = \Theta(pd \log(p/d))$
- trivial upper bound: $I(X_1^n; G) \le H(X_1^n) \le np$
- substituting into Fano yields the necessary condition $n = \Omega(d \log(p/d))$ (see the worked substitution after this slide)
- this bound was independently derived by a different approach by Bresler et al. (2008)
2. Small weight effect: ensemble $\widetilde{\mathcal{G}}$ consisting of graphs with a single edge with weight $\theta = \theta_{\min}$
- simple counting: $\log |\widetilde{\mathcal{G}}| = \log \binom{p}{2}$
- upper bound on the mutual information:
\[
I(X_1^n; G) \;\le\; \frac{n}{\binom{p}{2}^2} \sum_{(i,j),\,(k,\ell)} D\bigl( Q_{G_{ij}} \,\big\|\, Q_{G_{k\ell}} \bigr)
\]
- upper bound on symmetrized Kullback–Leibler divergences:
\[
D\bigl( Q_{G_{ij}} \,\|\, Q_{G_{k\ell}} \bigr) + D\bigl( Q_{G_{k\ell}} \,\|\, Q_{G_{ij}} \bigr) \;\le\; 2\, \theta_{\min} \tanh(\theta_{\min}/2)
\]
- substituting into Fano yields the necessary condition
\[
n = \Omega\Bigl( \frac{\log p}{\theta_{\min} \tanh(\theta_{\min}/2)} \Bigr)
\]

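Spelling out the Fano substitution for the naive bulk ensemble (a routine step, included for completeness): with $\log |G_{p,d}| = \Theta(pd \log(p/d))$ and $I(X_1^n; G) \le np$,
\[
Q[\widehat{G} \neq G] \;\ge\; 1 - \frac{I(X_1^n; G) + \log 2}{\log |G_{p,d}|}
\;\ge\; 1 - \frac{np + \log 2}{c\, pd \log(p/d)},
\]
so keeping the error probability above $1/2$ forces $np \gtrsim pd \log(p/d)$, i.e. $n = \Omega(d \log(p/d))$.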

Slide: A harder d-clique ensemble
Constructive procedure:
1. Divide the vertex set $V$ into $\lfloor p/(d+1) \rfloor$ groups of size $d + 1$.
2. Form the base graph $\widetilde{G}$ by making a $(d+1)$-clique within each group.
3. Form graph $G_{uv}$ by deleting edge $(u, v)$ from $\widetilde{G}$.
4. Form the Markov random field $Q_{G_{uv}}$ by setting $\theta_{st} = \theta_{\min}$ for all edges.

[Figure: (a) base graph $\widetilde{G}$; (b) graph $G_{uv}$; (c) graph $G_{st}$.]

For $d \le p/4$, we can form
\[
|\widetilde{\mathcal{G}}| \;\ge\; \frac{p}{d+1} \binom{d+1}{2} \;=\; \Omega(dp)
\]
such graphs.

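The count on the final line works out as claimed (simple arithmetic, spelled out for the record): each of the $\lfloor p/(d+1) \rfloor$ groups contributes one graph per deleted clique edge, so
\[
\frac{p}{d+1}\binom{d+1}{2} \;=\; \frac{p}{d+1}\cdot\frac{(d+1)d}{2} \;=\; \frac{pd}{2} \;=\; \Omega(dp).
\]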