Graph Mining

Koji Tsuda

Labeled graphs are general and powerful data structures that can be used to represent diverse kinds of objects such as XML code, chemical compounds, proteins, and RNAs. Over the past ten years, we have seen significant progress in statistical learning algorithms for graph data, such as supervised classification, clustering, and dimensionality reduction. Graph kernels and graph mining have been the main driving forces of these innovations. In this lecture, I start from the basics of the two techniques and cover several important algorithms for learning from graphs. Successful biological applications are featured. If time allows, I will also cover recent developments and show future directions.


Slide: Machine Learning Summer School Kyoto 2012

Graph Mining

Chapter 1: Data Mining
National Institute of Advanced Industrial Science and Technology Koji Tsuda


Slide: National Institute of Advanced Industrial Science and Technology (AIST)
Research institute under METI, covering six fields:
Life Science, Informatics, Environment and Energy, Nanotechnology and Materials, Geological Survey, Standards
2500 scientists, 700 administrative staff, 5200 visiting scientists from outside
Established in 2001

Slide: Computational Biology Research Center
Odaiba, Tokyo
Developing novel methods and tools for: Genome Informatics, Molecular Informatics, Cellular Informatics
Diverse collaboration with companies and universities

Slide: Koji Tsuda: Short Bio
1998: PhD, Kyoto University; joined ETL. 2000: GMD FIRST (Berlin, Germany). 2001: joined CBRC/AIST. 2003-2004 and 2006-2008: Max Planck Institute for Biological Cybernetics, Tuebingen, Germany. 2009: back to CBRC, Machine Learning Group Leader.

Slide: About this lecture
How to extract knowledge from structured data: itemset mining, tree mining, graph mining
The Reverse Search principle
Learning from structured data: kernels for structured data

Slide: Chapter 1 (Data Mining)
1. Structured Data in Biology
2. Itemset Mining
3. Closed Itemset Mining
4. Ordered Tree Mining
5. Unordered Tree Mining
6. Graph Mining
7. Dense Module Enumeration


Slide: Agenda 2 (Learning from Structured Data)
1. Preliminaries
2. Graph Clustering by EM
3. Graph Boosting
4. Regularization Paths in Graph Classification
5. Itemset Boosting for predicting HIV drug resistance


Slide: Agenda 3 (Kernel)
1. Kernel Method Revisited
2. Marginalized Kernels (Fisher Kernels)
3. Marginalized Graph Kernels
4. Weisfeiler-Lehman Kernels
5. Reaction Graph Kernels
6. Concluding Remarks


Slide: Part 1: Structural Data in Biology


Slide: Biological Sequences
DNA sequences (A,C,G,T): gene finding, splice sites
RNA sequences (A,C,G,U): microRNA discovery, etc.
Amino acid sequences (20 symbols): remote homolog detection, fold recognition, etc.


Slide: Structures hidden in sequences (I)
Exon/intron of DNA (Gene)


Slide: Structures hidden in sequences (II)
RNA secondary structure; protein 3D structures
It is crucial to infer hidden structures and exploit them for classification
Biological graphs


Slide: Molecular graphs
Structure: thiamine (vitamin B1) as a molecular graph
A graph is a set of dots and lines (nodes and edges). Node labels: hydrogen, carbon, oxygen, nitrogen, sulfur; edge labels: single bond, double bond.
The abstraction can keep hydrogens implicit or make them explicit.


Slide: Gene Expression Data
Measurement of many mRNAs in the cell; a rough estimate of the amount of proteins. Whether time-series or not, it is a snapshot of the underlying dynamic system.

Slide: Biological Networks
Protein-protein physical interactions, metabolic networks, gene regulatory networks, networks induced from sequence similarity
Thousands of nodes (genes/proteins), hundreds of thousands of edges (interactions)

Slide: Physical Interaction Network


Slide: Metabolic Network


Slide: Many possible prediction problems..


Slide: Part 2: Itemset mining


Slide: Data Mining
A formal study of efficient methods for extracting interesting rules and patterns from massive data
[Figure: an XML tree of people/person records with name, @age, @id, email, and tel nodes]
Frequent itemset mining (Agrawal and Srikant, 1994)
Closed pattern mining
Structured data mining (sequences, trees, and graphs)


Slide: Frequent Itemset Mining [Agrawal, Srikant, VLDB'94]
Finding all "frequent" sets of elements (items) appearing σ times or more in a database
[Figure: a five-transaction database over items 1-5 and the itemset lattice 2^{1,...,5}]
With minsup σ = 2, the frequent sets are ∅, 1, 2, 3, 4, 12, 13, 14, 23, 24, 124
X = {2, 4} appears three times, thus frequent


Slide: Definitions: Database
A set I = { 1, ..., n } of items (elements), the alphabet of items
Transaction database: a set T = { t1, ..., tm } of subsets of I; each subset t ⊆ I is called a transaction
Example: I = {1, 2, 3, 4}; transactions t1 = {1, 3}, t2 = {2, 4}, t3 = {1, 2, 3, 4}, t4 = {1, 2, 4}


Slide: Definitions: Frequent sets
Itemset X appears in transaction t: X ⊆ t
Occurrence of X in database T: Occ(X, T) = { t ∈ T : X ⊆ t }
Frequency of X: Fr(X, T) = | Occ(X, T) |
Minimum support (minsup): 0 ≤ σ ≤ |T|
X is frequent in T if Fr(X, T) ≥ σ.
Example (the database above): Occ(3, T) = {t1, t3}, Fr(3, T) = 2; Occ(24, T) = {t2, t3, t4}, Fr(24, T) = 3
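These definitions translate directly into code. A minimal Python sketch on the toy database above (the function names `occ` and `freq` are mine, not from the slides):

```python
def occ(x, db):
    """Occ(X, T): ids of the transactions that contain itemset x."""
    return [tid for tid, t in db.items() if x <= t]

def freq(x, db):
    """Fr(X, T) = |Occ(X, T)|."""
    return len(occ(x, db))

db = {"t1": {1, 3}, "t2": {2, 4}, "t3": {1, 2, 3, 4}, "t4": {1, 2, 4}}
print(occ({3}, db))      # ['t1', 't3']
print(freq({2, 4}, db))  # 3, so {2,4} is frequent for minsup 2
```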


Slide: Market Basket Data
A popular application of itemset mining: business and market data analysis
Transaction data of purchases: each row (a transaction, or "basket") records as a 0/1 vector which items (Chips, Mustard, Sausage, Softdrink, Beer) a customer bought
[Table: 10 transactions (IDs 001-010) over the 5 items, one 0/1 column per item]
Meaning of transaction 003: "Customer 003 bought Chips and Sausage together in his basket"


Slide: DAG of itemsets: Hasse diagram
Edge: adding one item
[Figure: the Hasse diagram over items {1,2,3,4}, from the empty set down to {1,2,3,4}]


Slide: Enumeration Tree by Lexicographical Order
Need a tree to avoid duplication
[Figure: the same lattice with tree edges chosen by lexicographical order, so each itemset has a unique parent]


Slide: Backtracking Algorithm: FP Growth etc.
Monotonicity: support only decreases as the itemset grows
Depth-first traversal; prune a branch if its support < σ
[Figure: the enumeration tree with the frequent-set border, below which the search is pruned]
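As a concrete illustration, here is a minimal backtracking frequent-itemset miner in Python. This is not FP-Growth itself, just the pruned depth-first traversal described above, with all names my own:

```python
def mine(db, items, minsup):
    """Depth-first search over the lexicographical enumeration tree.
    A branch is pruned as soon as support drops below minsup."""
    results = []

    def grow(itemset, tail):
        for i, item in enumerate(tail):
            candidate = itemset | {item}
            support = sum(1 for t in db if candidate <= t)
            if support < minsup:           # anti-monotonicity: no superset can be frequent
                continue
            results.append((candidate, support))
            grow(candidate, tail[i + 1:])  # extend only with larger items -> no duplicates

    grow(set(), items)
    return results

db = [{1, 3}, {2, 4}, {1, 2, 3, 4}, {1, 2, 4}]
print(mine(db, [1, 2, 3, 4], minsup=2))
```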


Slide: Association Rule Mining
Confidence: Supp(A ∪ B) / Supp(A)
The probability of B given A: what item is likely to be bought when A is bought
Search for rules with large support and large confidence
Post-processing of itemset mining


Slide: Summary: Itemset mining
Itemset mining is the simplest of all mining algorithms
Need to maintain the occurrence list of each pattern in the database
A tree defined by lexicographical order is (implicitly) used


Slide: Part 3: Closed Itemset mining


Slide: Problem in Frequent Pattern Mining
A huge number of frequent itemsets; hard to analyze; most of them are similar
[Figure: an input transaction database (minsup = 2) and the huge set of frequent itemsets mined from it]


Slide: Solution: Closed Pattern Mining
Find only closed patterns
Observation: most frequent itemsets X can be extended by adding new elements without changing their occurrence set
Def. ([Pasquier et al., ICDT'99]) An itemset X is a closed set if and only if there is no proper superset of X with the same frequency (thus the same occurrence set).


Slide: Closed Pattern Mining
A closed itemset is the maximal set among all itemsets with the same occurrences.
Equivalence class: [X] = { Y | Occ(X) = Occ(Y) }.
[Figure: database (1: ABCE, 2: AC, 3: BE, 4: BCE) with its itemsets grouped into equivalence classes w.r.t. occurrences; the closed sets are the maximal sets of each class]


Slide: Brute-force: Stupid Baseline
ALGORITHM Bruteforce:
First, generate all frequent itemsets; then check them one by one via a maximality test
Maximality test for a candidate frequent set X: add each element e of I \ X to X; if Freq(X ∪ {e}) equals Freq(X) for some e, X is not maximal, so reject it. X is closed iff Freq(X ∪ {e}) is properly less than Freq(X) for every such e.


Slide: Bruteforce
STEP 1) First, generate all frequent sets
[Figure: all itemsets of the database (1: ABCE, 2: AC, 3: BE, 4: BCE), annotated with their occurrence sets and grouped into equivalence classes; the closed (maximal) sets are highlighted]


Slide: Bruteforce
STEP 1) First, generate all frequent sets
STEP 2) Make a closedness test for each set
[Figure: the same lattice; each itemset is tested for maximality within its equivalence class]


Slide: Bruteforce
STEP 1) First, generate all frequent sets
STEP 2) Make a closedness test for each set
STEP 3) Finally, extract all closed sets
[Figure: the same lattice; only the maximal set of each equivalence class remains]
All closed sets are found!


Slide: Complexity of Enumeration Algorithms
The number of patterns is usually exponential in the input size
Delay: the time between two consecutive pattern outputs
Brute-force has delay exponential in the pattern size
[Figure: an input of size n mapped to an output of size N, with delay D between outputs and total time T]


Slide: To achieve linear delay,
we must jump from closed set to closed set. How to define the search tree? Reverse search! (Avis and Fukuda 1996)
[Figure: a search tree over the closed sets {2}, {7,9}, {2,5}, {1,7,9}, {2,7,9}, {1,2,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9}]


Slide: Reverse Search: It's a must
A general mathematical framework for designing enumeration algorithms. It can be used to prove the correctness of the algorithm. Popular in computational geometry. Data mining algorithms can be explained with remarkable simplicity.


Slide: Often, the search space comes as a DAG
Naive backtracking = duplication
Duplication check by marking = exponential memory
How to visit all nodes without duplication?
[Figure: the DAG of closed sets from the previous slide, where several paths reach the same node]


Slide: Reduction Map
A mapping from a child to its parent
Reduction map for closed itemsets: shrink the itemset until its occurrence set changes, then take the closure
Parent of X = closure(shrunken X)


Slide: Closure Operation
closure(X) of a set X: the closed set computed by closure(X) = ∩ { t in T : X ⊆ t } (the intersection of all transactions in T in which X occurs as a subset)
[Figure: closure maps every non-closed set of an equivalence class (itemsets with the same occurrence set) to its closed set]


Slide: Example of Closure Operation
Database T: 1: ABCE, 2: AC, 3: BE, 4: BCE
Non-closed itemset: (B,C); its occurrence set is {1, 4}
Take the intersection of transactions 1 and 4: (A,B,C,E) ∩ (B,C,E) = (B,C,E)
This is a closed itemset
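In code, the closure operation is just an intersection over the occurrence set. A minimal Python sketch on the database above (the function name is mine):

```python
from functools import reduce

def closure(x, db):
    """closure(X) = intersection of all transactions containing X."""
    containing = [t for t in db if x <= t]
    return reduce(set.intersection, containing) if containing else set(x)

db = [{"A", "B", "C", "E"}, {"A", "C"}, {"B", "E"}, {"B", "C", "E"}]
print(closure({"B", "C"}, db))  # {'B', 'C', 'E'}: the closed set of (B,C)
```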


Slide: By applying the reduction map to all nodes, the enumeration tree is defined
But the arrows point in the reverse direction..
[Figure: the DAG of closed sets again, with the tree edges induced by the reduction map pointing from children to parents]


Slide: Children generation
In backtracking, one has to generate all children of the current node: the inverse of the reduction map
Generate all children candidates, apply the reduction map to each, and remove a candidate if it does not come back to the parent
[Figure: a parent node, its children candidates, and the real children that map back to the parent]
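Putting the pieces together, reverse search is a small generic recursion. A hedged sketch, where `candidates` and `reduce_map` stand in for the pattern-specific definitions:

```python
def reverse_search(root, candidates, reduce_map, output):
    """Generic reverse-search traversal.
    candidates(x): a superset of the possible children of x.
    reduce_map(y): the unique parent of y.
    A candidate y is a real child iff reduce_map(y) == x."""
    output(root)
    for y in candidates(root):
        if reduce_map(y) == root:  # keep only candidates that map back
            reverse_search(y, candidates, reduce_map, output)
```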


Slide: Reverse Search Theorem
To prove correctness, prove the following:
The reduction map is uniquely defined on all nodes
By applying the reduction map repeatedly, one can reach the root node from any node
Children generation is the inverse of the reduction map
Easy to check!


Slide: LCM = Linear Time Closed Sets Miner (Uno et al., 2003)
Prefix-Preserving Closure extension: jump! = children generation from the reduction map
Linear delay!


Slide: Closure Extension
Repeat: add an item and take the closure
Step 1: start from a closed set X and add item i: Y = X ∪ {i}
Step 2: take the closed set Z = closure(X ∪ {i})
[Figure: the move from closed set X through the non-closed set Y to the closed set Z]


Slide: Naive Closure Extension: Duplication!
Closure extension yields a DAG of closed itemsets, so a naive traversal visits the same closed set many times
[Figure: the DAG of closed sets {2}, {7,9}, {2,5}, {1,7,9}, {2,7,9}, {1,2,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9} with closure-extension edges]


Slide: Prefix-Preserving Closure Extension
Ensures that any closed set is generated from a unique parent.
Def. The closure tail of a closed itemset P is the minimum j such that closure(P ∩ {1,...,j}) = P.
Def. H = closure(P ∪ {i}) is a PPC-extension of P iff i > the closure tail and H ∩ {1,...,i-1} = P ∩ {1,...,i-1}.
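A direct transcription of the two definitions into Python, reusing the `closure` function from the earlier sketch. This is only a readable sketch, not the efficient occurrence-based machinery of the real LCM; items are assumed to be positive integers:

```python
def below(s, i):
    """Elements of s strictly smaller than item i (the prefix up to i-1)."""
    return {e for e in s if e < i}

def closure_tail(P, db):
    """Minimum j with closure(P restricted to {1..j}) = P (0 for the root)."""
    for j in sorted(P):
        if closure({e for e in P if e <= j}, db) == P:
            return j
    return 0

def ppc_children(P, items, db):
    """PPC-extensions of closed set P: its children in the enumeration tree."""
    t = closure_tail(P, db)
    children = []
    for i in items:
        if i in P or i <= t:
            continue
        H = closure(P | {i}, db)
        if below(H, i) == below(P, i):  # the prefix is preserved -> real child
            children.append(H)
    return children
```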


Slide: Enumeration tree by PPC extension
Closure extension gives a DAG; PPC extension gives a tree
[Figure: the same closed sets, with PPC-extension edges forming a spanning tree of the closure-extension DAG]


Slide: Linear Delay in Pattern Size (Uno, Uchida, Asai, Arimura, Discovery Science 2004)
Theorem: The algorithm LCM finds all frequent closed sets X appearing in a transaction database D in O(l m n) time per closed set, without duplicates, where l is the maximum length of a transaction in D, n is the total size of D, and m is the size of the pattern X.
Note: the output-polynomial time complexity of closed set discovery was shown by [Makino et al., STACS 2002].


Slide: Summary: Closed Itemset Mining
Closure extension: jump from closed set to closed set
LCM: linear delay
Very fast in practice, too: winner of FIMI'04 (Frequent Itemset Mining Implementations workshop)
Relation to clique enumeration (Arimura, Uno, SDM 2009)


Slide: Part 4: Ordered Tree Mining


Slide: Frequent Ordered Tree Mining
A natural extension of the frequent itemset mining problem to trees: finding all frequent substructures in a given collection of labeled trees
How to enumerate them without duplicates?
Efficient DFS algorithms: FREQT [Asai, Arimura, SIAM DM 2002] and TreeMiner [Zaki, ACM KDD 2002], both based on the rightmost-expansion technique


Slide: Labeled Ordered Trees
Rooted; ordered: siblings are ordered from left to right; labeled: each node has a label
A model of HTML/XML, hierarchical records, and dependency trees of natural language texts
[Figure: an XML tree of people/person records with name, @age, @id, email, and tel nodes]


Slide: Matching between trees
Pattern tree T matches a data tree D (T occurs in D) if there is a matching function f from T into D:
f is 1-to-1
f preserves the parent-child relation
f preserves the (indirect) sibling relation
f preserves labels
[Figure: a pattern tree T embedded into a data tree D by a matching function f]


Slide: Frequency of a pattern tree
A root occurrence of pattern T: the node to which the root of T maps under a matching function
The frequency fr(T) = the number of root occurrences of T in D
[Figure: a pattern T with two embeddings into D; the root occurrence list is OccD(T) = {2, 8}]


Slide: Frequent Tree Mining Problem
Given: a collection S of labeled ordered trees and a minimum frequency threshold σ
Task: discover all frequent ordered trees in S (with frequency no less than σ) without duplicates
Example: a minimum frequency threshold (min-sup) of 50%


Slide: Key: How to enumerate ordered trees without duplicates?
A naive algorithm: starting from the smallest tree, grow a pattern tree by adding a new node one by one
Drawbacks: exponentially many different ways to generate the same pattern tree, so an explicit duplication test is needed
How to overcome this difficulty?


Slide: An idea: DFS Code of Ordered Tree
The depth-label sequence in the preorder (depth-first) traversal:
S = ((d(v1), l(v1)), ..., (d(vk), l(vk)))
Example: a root A with two B children, where the first B has a child A (which has a child C) and a child B, and the second B has a child C, gives the DFS code (0A, 1B, 2A, 3C, 2B, 1B, 2C)


Slide: Rightmost expansion
Extending the DFS code = attaching a new node on the rightmost branch:
(d1, l1), ..., (dn, ln) becomes (d1, l1), ..., (dn, ln), (dn+1, ln+1)
[Figure: a pattern tree and the trees obtained by attaching a new node D at each position on its rightmost branch]
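A sketch of the candidate generation only (support counting omitted): the new node may hang off any node on the rightmost branch, so its depth ranges from 1 up to one more than the depth of the last code element.

```python
def rightmost_extensions(code, labels):
    """All codes obtained by attaching one node on the rightmost branch.
    code: list of (depth, label) pairs; labels: the label alphabet."""
    if not code:
        return [[(0, l)] for l in labels]
    last_depth = code[-1][0]
    out = []
    for d in range(1, last_depth + 2):  # allowed depths on the rightmost branch
        for l in labels:
            out.append(code + [(d, l)])
    return out
```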


Slide: Searching frequent ordered trees
Enumerate all frequent ordered trees by backtracking
Trees are extended only by rightmost extension = no duplication
[Figure: the enumeration tree over patterns; infrequent branches are pruned]


Slide: Summary: Ordered tree mining
Convert the tree to a string (its DFS code)
Adding an element to the code = rightmost extension
It was relatively easy because nodes are ordered; how about the unordered case?


Slide: Part 5: Unordered Tree Mining


Slide: Frequent Unordered Tree Mining
Unordered trees: a non-trivial subclass of general graphs
Problem: exponentially many isomorphic ordered trees represent the same unordered tree
Efficient DFS algorithms: Unot [Asai, Arimura, DS'03], NK [Nijssen, Kok, MGTS'03]
[Figure: several ordered drawings of the same unordered labeled tree]


Slide: Canonical Ordered Representation
Given an ordering among siblings, the depth-first search (DFS) code is defined:
Code(T) = ((depth(v1), label(v1)), ..., (depth(vk), label(vk)))
Example: Code(T) = (0A, 1B, 2A, 3C, 2B, 1B, 2C)


Slide: Canonical representation
The canonical representation is the ordered tree T with the lexicographically maximum code among all sibling orderings.
Example codes of four orderings of the same unordered tree:
T1: (0A,1B,2A,3C,2B,1B,2C)
T2: (0A,1B,2B,2A,3C,1B,2C)
T3: (0A,1B,2C,1B,2A,3C,2B)
T4: (0A,1B,2C,1B,2B,2A,3C)


Slide: Left Heavy Condition (Nakano and Uno, 2002)
T(v): the subtree rooted at v
An ordered tree is canonical if and only if Code(T(v1)) ≥ Code(T(v2)) for any pair of sibling nodes v1 (left) and v2 (right)
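The left-heavy condition is easy to check recursively. A hedged sketch (my own helper names, reusing the (label, children) node structure from the earlier DFS-code sketch):

```python
def subtree_code(node, depth=0):
    """DFS code of the subtree rooted at node, where node = (label, children)."""
    label, children = node
    code = [(depth, label)]
    for c in children:
        code.extend(subtree_code(c, depth + 1))
    return code

def is_canonical(node):
    """Left-heavy check: every left sibling's code >= its right neighbour's."""
    _, children = node
    codes = [subtree_code(c) for c in children]
    if any(left < right for left, right in zip(codes, codes[1:])):
        return False
    return all(is_canonical(c) for c in children)
```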


Slide: Reduction Map
How to define the parent of a child in the enumeration tree: generate the canonical tree of size k-1 from a canonical tree of size k by removing the last element of the DFS code
Example: Code(T) = (0A, 1B, 2A, 3C, 2B, 1B, 2C); the parent's code drops the final 2C


Slide: Children Generation
Generate children candidates by rightmost extension
Check the maximality of each candidate based on the left-heavy property; discard it if not maximal


Slide: Maximality Check by Left Heavy Property
The code of a left subtree must be at least that of its right sibling's subtree
It suffices to check the rightmost sibling against the second rightmost sibling


Slide: Complexity of UNOT
Delay per pattern: O(k b^2 m n), where k is the pattern size, b is the branching factor of the data tree, m is the size of the data tree, and n is the database size


Slide: Summary: Mining Unordered Trees
The following three elements are necessary for a mining algorithm:
Canonical representation
Reduction map
Children generation including a maximality check
Backtracking on the resulting enumeration tree


Slide: Part 6: Graph Mining


Slide: Frequent Subgraph Mining
Enumerate all subgraphs occurring more than 3 times in a graph database
[Figure: a graph database and the frequent subgraph patterns mined from it]


Slide: gSpan (Yan and Han, 2002)
The most widely used graph mining algorithm. It can be interpreted with the reverse search principle: what are its canonical representation, reduction map, and children generation?


Slide: DFS Code for Graph
Depth-first search with preorder node labeling; each edge becomes a tuple (src, dest, src_label, edge_label, dest_label)
Some edges are not traversed by the DFS: backward edges (dest < src)
Elements are sorted in the code
Example: {(0,1,A,a,A), (1,2,A,a,B), (2,0,B,a,A), (2,3,B,b,A)}
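Written out as data, the example code from the slide looks like this (a plain Python transcription; the comments are my reading of the tuples):

```python
# The example graph's DFS code: (src, dest, src_label, edge_label, dest_label)
dfs_code = [
    (0, 1, "A", "a", "A"),  # forward edge: node 0 (A) -> node 1 (A)
    (1, 2, "A", "a", "B"),  # forward edge: node 1 (A) -> node 2 (B)
    (2, 0, "B", "a", "A"),  # backward edge (dest < src): closes the cycle
    (2, 3, "B", "b", "A"),  # forward edge: node 2 (B) -> node 3 (A)
]
```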


Slide: Canonical Representation
A graph has multiple DFS codes, from different starting points and children orderings
The canonical form is the minimum DFS code: the lexicographically minimum one


Slide: Reduction Map
Removing the tail of a minimum DFS code preserves minimality:
min DFS code {e1, e2, e3, e4} maps to parent {e1, e2, e3}


Slide: Children Generation
Create candidates by adding an element to the DFS code
Check whether each candidate is minimum; if not, remove it
[Figure: a parent, its children candidates, and the real children that survive the minimality check]


Slide: Minimality Check
Reconstruct the graph from the DFS code
Derive the minimum DFS code by trying all DFSs (sped up by traversing minimal labels only)
If the minimal code is not identical to the original, prune it
Example: {(0,1,A,a,A), (1,2,A,a,B), (2,0,B,a,A), (2,3,B,b,A)}


Slide: Summary: Graph Mining
gSpan is a typical example of reverse search
Not explained here: closed tree mining, closed graph mining
Delay is exponential in the pattern size; this cannot be avoided due to the hardness of (sub)graph isomorphism, yet it scales to millions of graphs for sparse molecular graphs
Applications are covered in the next chapter


Slide: Part 7: Dense Module Enumeration


Slide: Biological Motivation
Most cellular processes are performed by multi-component protein complexes, and an increasing amount of experimental protein interaction data is available
Our approach: predict complexes (modules) from a protein interaction network, exploiting additional information given by gene expression data, evolutionary conservation, phenotypic profiles, etc.



Slide: Protein interaction networks
Node: proteins; edge: physical interaction of two proteins
Challenge 1: false negative edges, so go beyond clique search!
Challenge 2: false positive edges, so assign confidence scores to edges and find node sets with a high density of high-confidence edges


Slide: Module Discovery
Previous work:
Clique percolation [Palla et al., 2005]
Partitioning: hierarchical clustering [Girvan and Newman, 2001], flow simulation [Krogan et al., 2006], spectral methods [Newman, 2006]
Heuristic local search [Bader and Hogue, 2003]
Our approach: exhaustively enumerate all dense subgraphs efficiently


Slide: Motivation for Enumeration Approach
Detects overlapping modules
Allows specifying a minimum density for the resulting modules
Outputs all modules satisfying the density threshold


Slide: Differential Expression Criterion
Incorporation of gene expression: the presence of proteins depends on the cell type
Additional criteria for modules: the number of conditions where the whole module is expressed, and the number of conditions where the whole module is not expressed; fix minimum values for both quantities


Slide: Problem Formalization
[Figure: the formal definition of the dense module enumeration problem]


Slide: Typical Enumeration Algorithms
Itemset mining, graph mining, etc.: enumerate all entities whose frequency >= 10, say
Set up a search tree and prune it by anti-monotonicity: an entity's frequency is always smaller than that of any sub-entity, so nothing below a failing node is generated


Slide: Network Example
[Figure: an example weighted protein interaction network]


Slide: Graph-shaped Search Space of Modules
[Figure: the lattice of node subsets, a graph-shaped search space]


Slide: Choosing a search tree
For efficient search, a search tree is needed; there are many possible search trees
Default: lexicographical ordering


Slide: Density is not a monotonic criterion
A subset of a dense set is not necessarily dense, so density does not decrease monotonically on a path of the default search tree; pruning is impossible
Example path: {1} (density 1.0) -> {1,2} (density 0.1) -> {1,2,3} (density 0.5)


Slide: Question
Is it possible to make a search tree such that density decreases monotonically?


Slide: Question
Is it possible to make a search tree such that density decreases monotonically? YES!
Use reverse search (Avis and Fukuda 1993)


Slide: Reverse search (Avis and Fukuda, 1993)
Specify a search tree in the graph-shaped search space via a reduction mapping, a rule that generates the parent from a child:
Remove the node with the smallest degree; density always increases under this removal
[Figure: a child module and the parent obtained by deleting its minimum-degree node]


Slide: Search Tree is uniquely specified by the reduction mapping
Condition: every node should converge to the root node by applying the reduction mapping repeatedly


Slide: Enumeration algorithm by reverse search
A set of children is generated from a parent node: try every possible child and keep the ones satisfying the reduction mapping; prune if no children exist


Slide: Constraint Integration
Differential expression constraint: monotonicity, since e_0 and e_1 decrease with extension of U
Can be used for extra pruning without difficulty


Slide: Statistical Significance of a module
k: the number of nodes in the module; the module's density; the number of size-k modules with at least that density
p-value: the probability that a random selection of k nodes yields a denser module


Slide: Benchmarking in yeast complex discovery
Combined interactions from CYGD-MPact and DIP: interactions among 3559 nodes, with confidence weights on edges due to (Jansen, 2003)
Methods in comparison: clique detection (Clique), the Clique Percolation Method (CPM), Markov clustering
Modules compared against MIPS complexes



Slide: Evolutionarily Conserved Yeast Modules
Using ortholog profiles (10 species, InParanoid), density >= 50%, and at least three orthologs: 1917 modules in 30 minutes
Recovered evolutionarily conserved complexes: the 20S proteasome, the 19S/22S regulator, the COPI vesicle coat complex, DNA polymerase I and II subunits, and the translation initiation factor eIF2B complex
These could not be recovered by simple DME due to their low density


Slide: MIPS Complexes discovered by DME (Conserved in Evolution)
[Figure: the recovered MIPS complexes]


Slide: Phenotype-associated yeast modules
Using growth phenotypic profiles (21 conditions, Dudley et al., 2005), requiring a growth defect in at least one condition
Each of the 13 highest-ranking modules covers the large subunit of the mitochondrial ribosome; found an additional protein, Mhr1
Exactly recovered the nucleoplasmic THO complex (Hpr1, Mft1, Rlr1, Thp2): transcription elongation, hyperrecombination, growth defect under ethanol


Slide: Phenotype Associated Modules: Large subunit of the mitochondrial ribosome
Mhr1: involved in homologous recombination of the mitochondrial genome


Slide: Human Settings
Tissue-specific gene expression data (Su et al., 2004), 79 different tissues; modules consistently expressed in 3 tissues and not expressed in 10 tissues
7763 proteins, density >= 35%, 5 minutes of computation, 1021 maximal modules
Evaluated against the MIPS human complex database (Ruepp et al., to appear)


Slide: Human-expression result
Around the MCM complex, we found inter-complex relationships with ORC, CDC7, Toposome, and the PLK1 protein
Module Uqcrc1, Uqcrc2, Uqcrb, Cyc1 (lg p = -13), no overlap with MIPS: the ubiquinol-cytochrome c reductase complex
SCF E3 ubiquitin ligase complex (marks proteins for degradation): 5 different modules with different tissue specificity; the peripheral proteins are substrate recognition particles, so target proteins are selected in a tissue-specific manner! Natural killer cells have all particles


Slide: High ranking modules around the MCM complex
Expressed in bone marrow cells; not expressed in brain, liver, kidney, etc.


Slide: Tissue-specific organization of the SCF ligase complex
[Figure: the SCF modules and their tissue specificity]


Slide: Summary: Dense Module Enumeration
A novel module enumeration algorithm based on reverse search, combined with other information sources and a statistical significance measure for dense modules
Successfully applied to yeast/human protein interaction networks


Slide: Reference
[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. 20th Int. Conf. Very Large Data Bases, 1994.
[2] T. Asai, H. Arimura, T. Uno, and S. Nakano, "Discovering frequent substructures in large unordered trees," Discovery Science, 2003.
[3] T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa, "Efficient substructure discovery from large semi-structured data," in SDM, 2002.
[4] D. Avis and K. Fukuda, "Reverse search for enumeration," Discrete Applied Mathematics, vol. 65, no. 1-3, pp. 21-46, 1996.
[5] E. Georgii, S. Dietmann, T. Uno, P. Pagel, and K. Tsuda, "Enumeration of condition-dependent dense modules in protein interaction networks," Bioinformatics, vol. 25, no. 7, pp. 933-940, Apr. 2009.
[6] S. Nijssen and J. Kok, "Efficient discovery of frequent unordered trees," MGTS, 2003.
[7] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, "Discovering frequent closed itemsets for association rules," Database Theory - ICDT'99, 1999.
[8] I. Takigawa and H. Mamitsuka, "Efficiently mining δ-tolerance closed frequent subgraphs," Machine Learning, vol. 82, no. 2, pp. 95-121, Sep. 2010.
[9] X. Yan and J. Han, "gSpan: graph-based substructure pattern mining," in Proc. 2002 IEEE International Conference on Data Mining, pp. 721-724.
[10] M. Zaki, "Efficiently mining frequent trees in a forest: algorithms and applications," IEEE Transactions on Knowledge and Data Engineering, 2005.


Slide: Machine Learning Summer School Kyoto 2012

Graph Mining

Chapter 2: Learning from Structured Data
National Institute of Advanced Industrial Science and Technology Koji Tsuda


Slide: Agenda 2 (Learning from Structured Data)
1. Preliminaries
2. Graph Clustering by EM
3. Graph Boosting
4. Regularization Paths in Graph Classification
5. Itemset Boosting for predicting HIV drug resistance


Slide: Part 1: Preliminaries


Slide: Clustering Graphs


Slide: Graph Regression
[Figure: training pairs of graphs with real-valued targets, e.g. (G1, -0.2), (G2, 0.7), (G3, -0.5), and a test graph whose target is unknown]


Slide: Substructure Representation
Each graph is represented as a 0/1 vector of pattern indicators, one dimension per subgraph pattern
Huge dimensionality! Graph mining is needed for selecting features


Slide: Graph Mining
Frequent substructure mining: enumerate all patterns that occur in at least m graphs
x_{ik}: indicator of pattern k in graph i
Support(k): the number of occurrences of pattern k


Slide: Enumeration on Tree-shaped Search Space
Each node has a pattern. Generate nodes from the root, adding an edge at each step.

8


Slide: Tree Pruning

Support(g): number of graphs in which pattern g occurs

Anti-monotonicity: support(g') <= support(g) for any supergraph g' of g.
If support(g) < m, stop exploring: the subtree below g is not generated (see the sketch after this slide).

9

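To make the enumerate-and-prune loop concrete, here is a minimal sketch in Python. It mines frequent itemsets rather than subgraphs (so no isomorphism machinery is needed), but the tree-shaped search and the anti-monotone pruning are exactly the principle above; the data and names are illustrative.

```python
# Minimal sketch: frequent-pattern enumeration with anti-monotone pruning.
# Itemsets stand in for subgraphs; the pruning argument is the same.

def frequent_patterns(transactions, m):
    """Enumerate all itemsets contained in at least m transactions."""
    items = sorted({i for t in transactions for i in t})
    results = {}

    def support(pattern):
        return sum(1 for t in transactions if pattern <= t)

    def grow(pattern, start):
        for idx in range(start, len(items)):
            child = pattern | {items[idx]}
            s = support(child)
            if s < m:
                continue  # anti-monotonicity: no superset can be frequent
            results[frozenset(child)] = s
            grow(child, idx + 1)  # extend only frequent nodes of the tree

    grow(set(), 0)
    return results

data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(frequent_patterns(data, m=2))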

Slide: gSpan

(Yan and Han, 2002)

Efficient frequent substructure mining method based on the DFS code
 - Efficient detection of isomorphic patterns

10


Slide: Depth First Search (DFS) Code
[Figure: a labeled graph G with vertices 0 (A), 1 (B), 2 (A), 3 (C) and edge labels a, b, c, together with the DFS code tree on G.]

A DFS code lists the edges as 5-tuples [i, j, l_i, l_e, l_j] (vertex ids, endpoint labels, edge label). The same graph is reached by several codes, e.g.,

  G0: [0,1,A,a,B] [1,2,B,c,A] [2,0,A,b,A] [0,3,A,b,C]
  G1: [0,1,A,a,B] [0,2,A,b,A] [0,3,A,b,C] ...

G0 and G1 are isomorphic, so only the minimum DFS code is extended. Non-minimum DFS code: prune it.

11


Slide: Discriminative patterns
w_i > 0: positive class; w_i < 0: negative class

Weighted substructure mining: find patterns with a large frequency difference between the classes. The objective is not anti-monotonic: use a bound.
12


Slide: Multiclass version
Multiple weight vectors w^(c), one per class: w_i^(c) takes one value if graph i belongs to class c and another otherwise.

Search for patterns overrepresented in a class.

13


Slide: Basic Bound
x_ij: occurrence of pattern j in graph i.
If k is a supergraph of pattern j, its occurrences are a subset of j's occurrences (x_ik <= x_ij), which yields an upper bound on the weighted gain of every such k (a sketch follows below).

14

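A sketch of how such a bound is evaluated, written in the style of the Morishita-Kudo bound used by gBoost (Saigo et al., 2008); treat the exact constants as an assumption of this sketch. Here x[:, j] is the 0/1 occurrence column of pattern j, y the class labels, u the example weights, and the pattern is used as a +/-1 stump h_j(x) = 2*x_j - 1.

```python
import numpy as np

def gain(u, y, xj):
    """Weighted gain of pattern j used as the stump h_j(x) = 2*x_j - 1."""
    return abs(np.sum(u * y * (2 * xj - 1)))

def gain_bound(u, y, xj):
    """Upper bound on gain(u, y, xk) for every supergraph k of pattern j,
    because the occurrences of k are a subset of the occurrences of j."""
    s = np.sum(u * y)
    pos = 2 * np.sum(u[(y > 0) & (xj > 0)]) - s
    neg = 2 * np.sum(u[(y < 0) & (xj > 0)]) + s
    return max(pos, neg)

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=20)
u = np.full(20, 1 / 20)
xj = rng.integers(0, 2, size=20)
# if gain_bound(...) is below the best gain found so far, the whole
# search subtree under pattern j can be pruned safely
print(gain(u, y, xj), gain_bound(u, y, xj))
```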

Slide: Pruning Condition
Summarizing the bound for all classes,

If it is negative, the search tree can be pruned safely

15


Slide: Summary: Preliminaries
Various graph learning problems
 - Supervised/unsupervised

Discovery of salient features by graph mining

Actual speed depends on the data; faster for
 - sparse graphs
 - diverse kinds of labels
16


Slide: Part 2: EM-based clustering of graphs

17


Slide: EM-based graph clustering
Motivation
 - Learning a mixture model in the feature space of patterns
 - Basis for more complex probabilistic inference

L1 regularization & graph mining: E-step -> Mining -> M-step
18


Slide: Probabilistic Model
Binomial mixture

    p(x) = Σ_l α_l p(x; θ_l)

Each component is a product of Bernoulli distributions over the 0/1 features:

    p(x; θ) = Π_k θ_k^{x_k} (1 - θ_k)^{1 - x_k}

x: feature vector of a graph (0 or 1); α_l: mixing weight for cluster l; θ_l: parameter vector for cluster l (a small EM sketch follows below)
19

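A minimal EM sketch for this Bernoulli mixture over 0/1 vectors. It assumes a dense, tractable feature matrix; the lecture's point is precisely that the full pattern space is intractable, so mining must restrict the computation to active patterns.

```python
import numpy as np

def em_bernoulli_mixture(X, n_clusters, n_iter=50, seed=0):
    """EM for a mixture of Bernoulli components over 0/1 feature vectors."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.full(n_clusters, 1.0 / n_clusters)        # mixing weights
    theta = rng.uniform(0.3, 0.7, size=(n_clusters, d))  # Bernoulli params
    for _ in range(n_iter):
        # E-step: posterior responsibilities from component log-likelihoods
        log_p = (X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
                 + np.log(alpha))
        log_p -= log_p.max(axis=1, keepdims=True)
        post = np.exp(log_p)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates
        nk = post.sum(axis=0)
        alpha = nk / n
        theta = np.clip((post.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
    return alpha, theta, post

X = np.vstack([np.random.default_rng(1).integers(0, 2, (30, 8)),
               np.ones((30, 8), dtype=int)])
alpha, theta, post = em_bernoulli_mixture(X, n_clusters=2)
print(alpha)
```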

Slide: Ordinary EM algorithm
Maximizing the log likelihood
 - E-step: get the posteriors
 - M-step: estimate the parameters using the posterior probabilities

Both are computationally prohibitive over the full pattern space (!)
20


Slide: Regularization
L1-regularized log likelihood, penalizing deviation from a baseline constant
 - Baseline constant: the ML parameter estimate using a single binomial distribution

In the solution, most parameters are exactly equal to the constants
21


Slide: E-step
Active pattern: a pattern whose parameter deviates from the baseline constant.

The E-step can be computed with active patterns only (computable!)

22


Slide: M-step
Putative cluster assignment; each parameter is solved separately.

Naive way: solve it for all parameters and identify the active patterns.

Instead: use graph mining to find the active patterns.
23


Slide: Solution

Occurrence probability in a cluster

Overall occurrence probability
24


Slide: Solution

25


Slide: Important Observation
For active pattern k, the occurrence probability in a graph cluster is significantly different from the average

26


Slide: Mining for Active Patterns
Active pattern
Equivalently written as

F can be found by graph mining! (multiclass)
27


Slide: Experiments: RNA graphs
Stem as a node; secondary structure computed by RNAfold; 0/1 vertex label (self-loop or not)

28


Slide: Clustering RNA graphs
Three Rfam families
 - Intron GP I (Int, 30 graphs)
 - SSU rRNA 5 (SSU, 50 graphs)
 - RNase bact a (RNase, 50 graphs)

Three bipartition problems

Results evaluated by ROC scores (area under the ROC curve)

29


Slide: Examples of RNA Graphs

30


Slide: ROC Scores

31


Slide: No of Patterns & Time

32


Slide: Found Patterns

33


Slide: Summary (graph EM)
- Substructure representation is better than paths
- Probabilistic inference helped by graph mining
- Extension to the Dirichlet mixture model, reported in Tsuda et al., SDM 2008

Possible extensions
 - Graph PCA, LFD, CCA
 - Semi-supervised learning

34


Slide: Part 3: Graph Boosting

35


Slide: Graph classification problem in chemoinformatics
Known as the SAR problem in chemoinformatics
 - (Quantitative) Structure-Activity Relationship analysis

Given a graph, predict a class label (+1 or -1)

Typically, features (descriptors) are given, e.g., Dragon descriptors, JOELIB2
36


Slide: SAR with conventional descriptors

[Table: molecules described by conventional descriptors (#atoms, #bonds, #rings) paired with activity labels, e.g., molecules with (22 atoms, 25 bonds), (20, 23), (11, 21), (21, 24), (11, 22), three of them active (+1) and two inactive (-1).]
37


Slide: Motivation of Graph Boosting
- Descriptors are not always available
- New features by obtaining informative patterns (i.e., subgraphs)
- Greedy pattern discovery by boosting + gSpan
- Linear Programming (LP) boosting
  - Reduces the number of graph mining calls
  - Faster than AdaBoost
- Accurate prediction & interpretable results
38


Slide: Molecule as a labeled graph
[Figure: a molecule drawn as a labeled graph; vertices carry atom labels (C, ..., C, O), edges are bonds.]

39


Slide: SAR with patterns
[Figure/table: each molecule is represented by +/-1 indicators of subgraph patterns φ1, φ2, φ3, ... (e.g., short carbon chains and a C-Cl fragment), paired with its activity label +1/-1.]
40


Slide: Sparse classification in a very high dimensional space
G: all possible patterns (intractably large); |G|-dimensional feature vector x for a molecule. Linear classifier

    f(x) = Σ_{j=1}^{d} β_j x_j

Use an L1 regularizer to make β sparse: select a tractable number of patterns (a sketch follows below)
41

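A quick sketch of such a sparse linear model once a finite set of pattern indicator columns is available, using scikit-learn's Lasso; the pattern matrix and coefficients below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sketch: with mined pattern indicators as columns, an L1 regularizer
# keeps only a few nonzero weights (sparse beta).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 50)).astype(float)  # pattern indicators
true_beta = np.zeros(50)
true_beta[[3, 17]] = [1.5, -2.0]                      # two salient patterns
y = X @ true_beta + 0.1 * rng.standard_normal(100)

model = Lasso(alpha=0.05).fit(X, y)
print("selected patterns:", np.flatnonzero(model.coef_))
```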

Slide: Problem formulation

Minimize the sum of the hinge loss and an L1 regularizer over the training examples (x_i, y_i), with slack variables ξ_i.

42


Slide: Dual LP
Primal: huge number of weight variables
Dual: huge number of constraints

Dual problem

43


Slide: Column Generation Algorithm for LP Boost (Demiriz et al., 2002)
Start from the dual with no constraints; add the most violated constraint each time; guaranteed to converge (a sketch follows below).

[Figure: the constraint matrix, of which only the generated ("used") part is kept.]
44

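A runnable sketch of the column generation loop on a toy problem, assuming all candidate pattern columns are precomputed in a matrix X; the argmax line is exactly the step that weighted substructure mining replaces when the columns cannot be enumerated. The LP follows the soft-margin LPBoost dual of Demiriz et al. (2002); the constant D and the data are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p, D = 60, 40, 0.2                       # D caps the example weights u
X = rng.integers(0, 2, (n, p))              # 0/1 pattern indicators
y = np.where(X[:, 0] + X[:, 7] >= 1, 1, -1)
H = y[:, None] * (2 * X - 1)                # H[i, j] = y_i * h_j(x_i)

active, u = [], np.full(n, 1.0 / n)
for _ in range(20):
    scores = u @ H                          # violation of every constraint
    j = int(np.argmax(scores))              # <- replaced by graph mining
    if active and scores[j] <= (u @ H[:, active]).max() + 1e-9:
        break                               # all constraints satisfied
    active.append(j)
    # restricted dual: min beta  s.t.  sum_i u_i H[i,j] <= beta (j active),
    #                  sum_i u_i = 1,  0 <= u_i <= D
    c = np.zeros(n + 1)
    c[-1] = 1.0                             # variables: (u_1..u_n, beta)
    A_ub = np.hstack([H[:, active].T, -np.ones((len(active), 1))])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(active)),
                  A_eq=np.hstack([np.ones((1, n)), np.zeros((1, 1))]),
                  b_eq=[1.0], bounds=[(0, D)] * n + [(None, None)],
                  method="highs")
    u = res.x[:n]
print("selected pattern columns:", active)
```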

Slide: Finding the most violated constraint
Constraint for a pattern (shown again)

Finding the most violated one

Searched by weighted substructure mining

45


Slide: Algorithm Overview
Iteration:
 1. Find a new pattern by graph mining with weight u
 2. If all constraints are satisfied, break
 3. Add a new constraint
 4. Update u by solving the dual problem

Convert the dual solution to obtain the primal solution

Return f(x)


46


Slide: Experimental Settings
Classification and Regression Datasets

47


Slide: Classification Results

48


Slide: Regression Results

49


Slide: Extracted patterns from CPDB

50


Slide: Memory Usage

51


Slide: Runtime

52


Slide: Comparison with AdaBoost

53


Slide: Summary (Graph Boosting)
- Graph boosting simultaneously generates patterns and learns their weights
- Finite convergence by column generation
- Interpretable by chemists
- Flexible constraints and speed-up by LP

54


Slide: Part 4: Entire Regularization Path

55


Slide: Overview
Entire regularization paths
 - LARS-LASSO (Efron et al., 2004), L1-SVM
 - Forward selection of features
 - Trace the solution trajectory of L1-regularized learning

Path following algorithm for graph data
 - Feature search -> pattern search
 - Branch-and-bound algorithm
 - DFS code tree, new bound

56


Slide: Path Following Algorithms
LASSO regression: min_β ||y - Xβ||² + λ||β||₁

Follow the complete trajectory of λ: infinity to zero

Active feature set: the features corresponding to nonzero weights
57


Slide: Piecewise Linear Path
At a turning point (event), either a new feature is included into the active set, or an existing feature is excluded from it; between turning points the solution path is linear in the regularization parameter.

58


Slide: Practical Merit of Path Following
Cross validation by grid search
 - Has to solve a QP many times
 - Especially time-consuming for graph data

Path following does not involve QP and determines the CV-optimal regularization parameter at the finest precision
59


Slide: Pseudo code of path following
Set initial point and direction
Do
  d1 = step size if next event is inclusion   <- main search problem
  d2 = step size if next event is exclusion
  d  = min(d1, d2)
  Update the active feature set
  Set the next direction
Until all features are included
60

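For plain vector features, scikit-learn's lars_path already implements this event loop; the small sketch below just exposes the turning points and the inclusion order. The lecture's contribution is replacing the per-event feature search with a search over graph patterns; the data here are made up.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X[:, 2] - 2 * X[:, 5] + 0.1 * rng.standard_normal(50)

# alphas: regularization values at the turning points (events);
# active: the order in which features enter the active set;
# coefs: the piecewise-linear coefficient trajectories.
alphas, active, coefs = lars_path(X, y, method="lasso")
print("inclusion order of features:", active)
print("turning points (lambda):", np.round(alphas, 3))
```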

Slide: Feature space of patterns
Graph training data; set of all subgraphs (patterns). Each graph is represented as a 0/1 indicator vector over all patterns.

61


Slide: Main Search problem
Step size d_t if pattern t is included next, where the needed constants are computed from the active set.

Find the pattern that minimizes d_t.

62


Slide: Tree-shaped Search Space
Each node has a pattern. Generate nodes from the root, adding an edge at each step.

63


Slide: Tree Pruning
If it is guaranteed that the optimal pattern is not in the downstream, the search tree can be pruned (the subtree is not generated).

64


Slide: Theorem (Pruning condition)
Traversed up to pattern t; keep the minimum value found so far. There is no better pattern in the downstream if a bound, computed only from the occurrences of t, cannot beat that minimum.

65


Slide: Initial Experiments
EDKB estrogen receptor database
 - 131 training graphs (chemical compounds)

Computation time: 4 sec/search
 - Pattern size limitation: 10 edges

66


Slide: Solution Path

67


Slide: Events

68


Slide: Summary: Regularization Path
- Path following implemented for graph data
- Pattern search by graph mining
- To do: classification; combination with itemset mining

69


Slide: Part 5: Itemset Boosting for predicting HIV drug resistance

70


Slide: Life cycle of an HIV Virus

Drug Target

71


Slide: Approved HIV Drugs
- 8 protease inhibitors (PI)
- 8 nucleoside/nucleotide reverse transcriptase inhibitors (NRTI)
- 3 non-nucleoside reverse transcriptase inhibitors (NNRTI)
- 1 fusion inhibitor

72


Slide: Drug resistance of HIV
- Exposure to a drug causes mutations in HIV's genes; as a result, HIV gains resistance against the drug
- The cost of identifying the genotype of HIV in a patient is relatively cheap
- Predict the drug resistance from HIV's genotype: effective pharmacotherapy for individuals
73


Slide: Drug resistance prediction problem as regression
- Input: set of mutations in a target protein
  (41L: amino acid at position 41 changed to L)
- Output: resistance against a drug (fold change)

  (40F, 41L, 43E, 210W, 211K, 215Y) -> 0.8
  (43Q, 75M, 122E, 210W, 211K)      -> 12.8
  (75I, 77L, 116Y, 151M, 184V)      -> ?
74


Slide: Simple vs Complex Genotypic Features
- Simple genotypic features: 0/1 indicators of single mutations
  (0, 1, 0, 1, 0, 0, 0, 1, ...) with entries such as 41L, 62V, F116Y
- Complex genotypic features: 0/1 indicators of mutation combinations
  (0, 1, 0, 1, 0, 0, 0, 1, ...) with entries such as {77L, 116Y}, {103N, 210W, 215Y}

75


Slide: Linear Regression on Simple Genotypic Features (Rhee et al., PNAS 2006)
- Mutation associations not discovered

76


Slide: Complex Genotypic Features
0/1 vector of pattern indicators over all mutation combinations, e.g., {77L, 116Y}, {103N, 210W, 215Y}. Huge dimensionality! Need itemset mining for selecting salient features.
77


Slide: Other methods
- Nonlinear SVM: high accuracy, but not interpretable
- Decision trees: interpretable, but low accuracy

78


Slide: Motivation of Itemset Boosting
- Impossible to maintain all complex features
- Greedy feature discovery by boosting + itemset mining
- Quadratic Programming (QP) boosting
  - Reduces the number of itemset mining calls
  - Faster than AdaBoost
- Accurate prediction & interpretable results

79


Slide: Sparse classification in a very high dimensional space
G: all possible patterns (intractably large); |G|-dimensional feature vector x. Linear model

    f(x) = Σ_{j=1}^{d} β_j x_j

Use an L1 regularizer to make β sparse (LASSO): select a tractable number of patterns.
80


Slide: Problem formulation: Quadratic Programming

Minimize the sum of the squared loss and an L1 regularizer over the training examples (x_i, y_i), with slack variables for the residuals.

81


Slide: Dual QP
Primal: huge number of weight variables
Dual: huge number of constraints

82


Slide: Column Generation Algorithm for QP Boost (Demiriz et al., 2002)
Start from the dual with no constraints; add the most violated constraint each time; guaranteed to converge.

[Figure: the constraint matrix, of which only the generated ("used") part is kept.]
83


Slide: Finding the most violated constraint
Constraint for a pattern (shown again)

Finding the most violated one: searched by weighted itemset mining (a sketch follows below)
84

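A sketch of that weighted itemset search: DFS over the itemset tree, where the prune step uses the fact that a superset occurs only in a subset of the current occurrence list (the same anti-monotone argument as in Chapter 1). The data and weights below are made up.

```python
import numpy as np

def most_violated_itemset(X, u, y):
    """Maximize |sum_i u_i y_i [t subset of x_i]| over itemsets t by DFS."""
    uy = u * y
    n, d = X.shape
    best = {"gain": 0.0, "itemset": ()}

    def dfs(itemset, occ, start):
        v = uy[occ]
        if itemset:
            g = abs(v.sum())
            if g > best["gain"]:
                best.update(gain=g, itemset=tuple(itemset))
        # any superset occurs only where occ holds, hence this bound:
        ub = max(v[v > 0].sum(), -v[v < 0].sum()) if v.size else 0.0
        if ub <= best["gain"]:
            return                          # prune the whole subtree
        for j in range(start, d):
            child = occ & (X[:, j] == 1)
            if child.any():
                dfs(itemset + [j], child, j + 1)

    dfs([], np.ones(n, dtype=bool), 0)
    return best

rng = np.random.default_rng(0)
X = rng.integers(0, 2, (40, 8))
y = rng.choice([-1, 1], 40)
u = np.full(40, 1 / 40)
print(most_violated_itemset(X, u, y))
```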

Slide: Algorithm Overview
Iteration:
 1. Find a new pattern by itemset mining with weight u
 2. If all constraints are satisfied, break
 3. Add a new constraint
 4. Update u by solving the dual problem

Convert the dual solution to obtain the primal solution

Return f(x)


85


Slide: Experimental settings
- Three classes of drugs
  - NRTI (nucleoside/nucleotide reverse transcriptase inhibitors)
  - PI (protease inhibitors)
  - NNRTI (non-nucleoside reverse transcriptase inhibitors)
- 5-fold cross validation
- Compared methods: linear SVM, ridge regression, LARS; nonlinear SVM, iBoost
86


Slide: Regression results

Accuracy:

87


Slide: Accuracy Summary
- NRTIs: iBoost performed best
- PIs: nonlinear methods were better than linear; SVMs were slightly better (non-significant)
- NNRTIs: linear methods were better; combination is not necessary
88


Slide: NRTI Drugs
89


Slide: Known mutation associations in RT
[Figure: RT of HIV-1 with mutation positions marked]

Red: thymidine-associated mutations (TAM): 41L, 67N, 70R, 210W, 215Y/F, 219Q
Blue: Q151M complex: 75I, 77L, 116Y, 151M
Also marked: 69i, 65R, 74V, 184I/V

90


Slide: PI Drugs
91


Slide: NNRTI Drugs (almost no combination found)
92


Slide: Computation Time of iBoost
- Training time for 3TC
- 507 isolates with 371 mutations on average
- QP time longer than mining time

93


Slide: Summary (HIV)
- Itemset boosting for finding mutation associations
- Good accuracy for NRTIs
- Our complex features re-discover known mutation clusters
- Broad applications: multiple SNP analysis, RNAi efficacy prediction, motif combinations, flu mutation analysis, p53 mutation analysis

94


Slide: Reference









[1] K. Tsuda and T. Kudo, "Clustering graphs by weighted substructure mining," in Proc. 23rd International Conference on Machine Learning (ICML '06), 2006, pp. 953-960.
[2] H. Saigo, N. Krämer, and K. Tsuda, "Partial least squares regression for graph mining," in Proc. 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), 2008, p. 578.
[3] H. Saigo, S. Nowozin, T. Kadowaki, T. Kudo, and K. Tsuda, "gBoost: a mathematical programming approach to graph classification and regression," Machine Learning, vol. 75, no. 1, pp. 69-89, Nov. 2008.
[4] H. Saigo, T. Uno, and K. Tsuda, "Mining complex genotypic features for predicting HIV-1 drug resistance," Bioinformatics, vol. 23, no. 18, pp. 2455-2462, Sep. 2007.
[5] K. Tsuda, "Entire regularization paths for graph data," in Proc. 24th International Conference on Machine Learning (ICML '07), 2007, pp. 919-926.

95


Slide: Machine Learning Summer School Kyoto 2012

Graph Mining
Chapter 3: Kernels for Structured Data
National Institute of Advanced Industrial Science and Technology Koji Tsuda

1


Slide: Agenda 3 (Kernel)
1. Kernel Method Revisited
2. Marginalized Kernels (Fisher Kernels)
3. Marginalized Graph Kernels
4. Weisfeiler-Lehman kernels
5. Reaction Graph kernels
6. Concluding Remark

2


Slide: Part 1: Kernel Method Revisited

3


Slide: Kernels and Learning
In kernel-based learning algorithms, problem solving is decoupled into:
 - a general-purpose learning algorithm (e.g., SVM, PCA, ...), often a linear algorithm
 - a problem-specific kernel

Complex learning task = simple (linear) learning algorithm + specific kernel function

4


Slide: Current Synthesis
Modularity and re-usability
 - Same kernel, different learning algorithms
 - Different kernels, same learning algorithm

[Diagram: Data 1 (sequence) and Data 2 (network) feed Kernel 1 and Kernel 2; each yields a Gram matrix (not necessarily stored) consumed by Learning Algo 1 and Learning Algo 2.]

5


Slide: Kernel Methods : intuitive idea
Find a mapping f such that, in the new space, problem solving is linear. The kernel represents the similarity between two objects, defined as the dot-product in this new vector space. But the mapping is left implicit: easy generalization of a lot of dot-product-based learning algorithms.
6


Slide: Kernel Methods : the mapping
[Figure: the mapping f takes points from the original space to the feature (vector) space.]
7


Slide: Kernel : more formal definition
A kernel k(x,y) is a similarity measure defined by an implicit mapping f such that k(x,y) = f(x)·f(y). This similarity measure implies:
 - invariance or other a priori knowledge
 - the class of functions the solution is taken from
 - possibly infinite dimension (hypothesis space for learning), but still computational efficiency when computing k(x,y)
8


Slide: Kernel Trick
Generalizes (nonlinearly) algorithms in clustering, classification, density estimation, ...
 - When these algorithms are dot-product based: replace the dot product (x·y) by k(x,y) = f(x)·f(y)
 - When these algorithms are distance-based: replace d²(x,y) by k(x,x) + k(y,y) - 2k(x,y)

Freedom of choosing f implies a large variety of learning algorithms
9


Slide: Valid Kernels
Theorem: k(x,y) is a valid kernel if k is positive definite and symmetric (Mercer kernel)
 - A function is p.d. if ∫∫ K(x,y) f(x) f(y) dx dy ≥ 0 for all f ∈ L₂
 - In other words, the Gram matrix K (whose elements are k(x_i, x_j)) must be positive definite for all x_i, x_j of the input space
 - One possible choice of f(x): k(·, x) (maps a point x to the function k(·, x): a feature space with infinite dimension!)
10


Slide: How to build new kernels
Kernel combinations, preserving validity:
K(x,y) = λ·K1(x,y) + (1-λ)·K2(x,y),   0 ≤ λ ≤ 1
K(x,y) = a·K1(x,y),                   a > 0
K(x,y) = K1(x,y)·K2(x,y)
K(x,y) = f(x)·f(y),                   f a real-valued function
K(x,y) = K3(φ(x), φ(y))
K(x,y) = x'Py,                        P symmetric positive definite
K(x,y) = K1(x,y) / sqrt(K1(x,x)·K1(y,y))

11

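These closure rules are easy to sanity-check numerically. The sketch below builds two Gram matrices that are PSD by construction and verifies that the listed combinations stay PSD (minimum eigenvalues nonnegative up to floating-point noise); everything here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 5))
B = rng.standard_normal((10, 3))
K1, K2 = A @ A.T, B @ B.T                 # PSD by construction

lam = 0.3
combos = {
    "convex sum": lam * K1 + (1 - lam) * K2,
    "positive scaling": 2.5 * K1,
    "elementwise product": K1 * K2,       # Schur product theorem
    "normalization": K1 / np.sqrt(np.outer(np.diag(K1), np.diag(K1))),
}
for name, K in combos.items():
    # eigvalsh: eigenvalues of a symmetric matrix; ~0 means PSD here
    print(f"{name}: min eigenvalue = {np.linalg.eigvalsh(K).min():.2e}")
```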

Slide: Strategies of Design
Convolution kernels: text is a recursively-defined data structure. How to build global kernels from local (atomic-level) kernels? Generative model-based kernels: the topology of the problem will be translated into a kernel function.
12


Slide: Family of kernels
Kernels for biological sequences
 - Spectrum kernel
 - Marginalized kernel
 - Profile kernel
 - Local alignment kernel
 (Kernel Methods in Computational Biology, MIT Press)

Tree kernels
 - Kernel for phylogenetic profiles
 - Kernel for natural language
 - Kernel for RNA sequences


Slide: Family of kernels
Kernels for nodes in a network
 - Diffusion kernel
 - Locally constrained diffusion kernel

Graph kernels
 - Marginalized graph kernels
 - MGK without tottering
 - Acyclic pattern kernels
 - Shortest path kernel
 - Weisfeiler-Lehman kernel

Slide: Weak points of kernel methods
Not interpretable
 - Not sure which features are used -> graph mining, boosting

Dense kernel matrices: slow
 - Take O(n^3) time to manipulate -> semi-supervised learning
15


Slide: Part 2. Marginalized kernels

16


Slide: Biological Sequences: Classification Tasks
DNA sequences (A,C,G,T)
 - Gene finding, splice sites
RNA sequences (A,C,G,U)
 - MicroRNA discovery, classification into Rfam families
Amino acid sequences (20 symbols)
 - Remote homolog detection, fold recognition

17


Slide: Kernels for Sequences
Similarity between sequences of different lengths, e.g.,

  ACGGTTCAA  vs  ATATCGCGGGAA

Later combined with SVMs and other kernel methods
18


Slide: Count Kernel
Inner product between symbol counts

Extension: spectrum kernels (Leslie et al., 2002) count the number of k-mers (k-grams) efficiently (see the sketch below). Not good for sequences with frequent context change, e.g., coding/non-coding regions in DNA.

19

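A minimal k-mer spectrum kernel, to make the counting concrete; the sequences are the slide's examples, and the kernel value is the inner product of the k-mer count vectors, so sequences of different lengths live in a common feature space.

```python
from collections import Counter

def spectrum_kernel(s, t, k=3):
    """Inner product of k-mer count vectors of two strings."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    # only k-mers present in both sequences contribute to the product
    return sum(cs[w] * ct[w] for w in cs.keys() & ct.keys())

print(spectrum_kernel("ACGGTTCAA", "ATATCGCGGGAA", k=3))
```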

Slide: Hidden Markov Models for Estimating Context
Visible variable x: symbol sequence
Hidden variable h: context

An HMM can estimate the posterior probability of the hidden variables from data

20


Slide: Marginalized kernels
Design a joint kernel K_z(z, z') for the combined variable z = (x, h); the hidden variable h is not usually available. Take the expectation with respect to the hidden variable: the marginalized kernel for visible variables is

    K(x, x') = Σ_{h, h'} p(h|x) p(h'|x') K_z((x,h), (x',h'))

21


Slide: Designing a joint kernel for sequences
Symbols are counted separately in each context. c_kl(z): count of the combined symbol (k, l), i.e., symbol k observed in context l. Joint kernel: the count kernel with context information,

    K_z(z, z') = Σ_{k,l} c_kl(z) c_kl(z')

22


Slide: Marginalization of the joint kernel
Joint kernel

Marginalized count kernel

23


Slide: Computing Marginalized Counts from HMM
The marginalized count is the posterior-weighted count of combined symbols.

The posterior probability of the i-th hidden variable is efficiently computed by the forward-backward algorithm (a sketch of the resulting kernel follows below).

24

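A sketch of the first-order marginalized count kernel, assuming the positional posteriors p(h_i = l | x) have already been produced by forward-backward; random matrices stand in for them here, and all names are illustrative.

```python
import numpy as np

def marginalized_counts(x, post, n_symbols):
    """Expected counts c[k, l] of symbol k emitted in context l."""
    c = np.zeros((n_symbols, post.shape[1]))
    for i, k in enumerate(x):
        c[k] += post[i]              # add posterior mass of position i
    return c / len(x)

def mck(x1, post1, x2, post2, n_symbols=4):
    """Marginalized count kernel: inner product of expected counts."""
    c1 = marginalized_counts(x1, post1, n_symbols)
    c2 = marginalized_counts(x2, post2, n_symbols)
    return float((c1 * c2).sum())

rng = np.random.default_rng(0)
x1, x2 = rng.integers(0, 4, 30), rng.integers(0, 4, 24)
post1 = rng.dirichlet(np.ones(2), size=30)   # stand-in HMM posteriors
post2 = rng.dirichlet(np.ones(2), size=24)
print(mck(x1, post1, x2, post2))
```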

Slide: 2nd order marginalized count kernel
If adjacency relations between symbols carry essential meaning, the count kernel is obviously not sufficient.

2nd order marginalized count kernel: 4 neighboring symbols (2 visible and 2 hidden) are combined and counted.

25


Slide: Fisher Kernel
Probabilistic model p(x|θ)

Fisher kernel: map each x to the Fisher score vector

    s(x) = ∇_θ log p(x|θ)

and take the inner product of Fisher scores,

    K(x, x') = s(x)' Z⁻¹ s(x'),    Z: Fisher information matrix
26


Slide: Fisher Kernel from HMM
Derive the FK from an HMM (Jaakkola et al., 2000)
 - Derivative for emission probabilities only
 - No Fisher information matrix
 - Counts are centralized and weighted

The FK from an HMM is a special case of marginalized kernels


27


Slide: Difference between FK and Marginalized Kernels
FK: Probabilistic model determines the joint kernel and the posterior probabilities
MK: You can determine them separately


More flexible design!

28


Slide: Protein clustering experiment
84 proteins containing five classes
 - gyrB proteins from five bacteria species

Clustering methods
 - HMM + {FK, MCK1, MCK2} + k-means

Evaluation
 - Adjusted Rand Index (ARI)

29


Slide: Kernel Matrices

30


Slide: Clustering Evaluation

31


Slide: Applications since then..
- Marginalized graph kernels (Kashima et al., ICML 2003)
- Sensor networks (Nguyen et al., ICML 2004)
- Labeling of structured data (Kashima et al., ICML 2004)
- Robotics (Shimosaka et al., ICRA 2005)
- Kernels for promoter regions (Vert et al., NIPS 2005)
- Web data (Zhao et al., WWW 2006)

32


Slide: Summary (Marginalized Kernels)
General Framework for using generative model for defining kernels Fisher kernel as a special case Broad applications

33


Slide: Part 3. Marginalized Graph Kernels

34


Slide: Motivations for graph analysis
Existing methods assume tables:

  Serial Num | Name | Age | Sex    | Address
  0001       | ...  | 40  | Male   | Tokyo
  0002       | ...  | 31  | Female | Osaka

Structured data go beyond this framework -> new methods for analysis

35


Slide: Graphs..

36


Slide: Graph Structures in Biology
[Figure: graph-structured data in biology - a DNA sequence as a chain of labeled vertices (A-C-G-C), an RNA secondary structure with base-pair edges (UA, CG, CG), chemical compounds as atom-bond graphs (C, H, O), and a dependency parse of text from the literature: "Amitriptyline inhibits adenosine uptake".]

37


Slide: Marginalized Graph Kernels
Going to define the kernel function; both vertices and edges are labeled.

(Kashima, Tsuda, Inokuchi, ICML 2003)

38


Slide: Label path
Sequence of vertex and edge labels
 - Generated by random walking
 - Uniform initial, transition, and terminal probabilities
39


Slide: Path-probability vector

40


Slide: Kernel definition
Kernels for paths, e.g., between the label paths (A, c, D, b, E) and (B, c, D, a, A)

Take the expectation over all possible paths: marginalized kernels for graphs

41


Slide: Computation
Transition probabilities given; initial and terminal probabilities omitted here.
A(v): set of paths ending at v.
K_V(v, v'): kernel computed from the paths ending at the vertex pair (v, v').

K_V is written recursively over the neighbors of (v, v'), so the kernel is computed by solving a system of linear equations, in polynomial time (a sketch follows below).
42

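A compact sketch of that computation on the product graph. It uses a delta kernel on vertex labels only (edge labels omitted for brevity), uniform transitions, and a constant stopping probability; these constants are assumptions of this sketch, under which the expectation over all paths reduces to a single linear solve.

```python
import numpy as np

def marginalized_graph_kernel(adj1, lab1, adj2, lab2, stop=0.1):
    """Random-walk kernel on the product graph via one linear solve."""
    n1, n2 = len(lab1), len(lab2)
    N = n1 * n2
    match = np.array([lab1[i] == lab2[j]
                      for i in range(n1) for j in range(n2)], dtype=float)
    W = np.zeros((N, N))
    for i in range(n1):
        for j in range(n2):
            if lab1[i] != lab2[j]:
                continue
            norm = (1 - stop) ** 2 / (adj1[i].sum() * adj2[j].sum())
            for a in np.flatnonzero(adj1[i]):
                for b in np.flatnonzero(adj2[j]):
                    if lab1[a] == lab2[b]:
                        W[i * n2 + j, a * n2 + b] = norm
    # recursion r = stop^2 * match + W r  <=>  (I - W) r = stop^2 * match
    r = np.linalg.solve(np.eye(N) - W, stop ** 2 * match)
    start = match / (n1 * n2)            # uniform initial probabilities
    return float(start @ r)

adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])   # a labeled triangle
print(marginalized_graph_kernel(adj, ["C", "C", "O"], adj, ["C", "O", "C"]))
```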

Slide: Graph Kernel Applications
- Chemical compounds (Mahé et al., 2005)
- Protein 3D structures (Borgwardt et al., 2005)
- RNA graphs (Karklin et al., 2005)
- Pedestrian detection
- Signal processing

43


Slide: Predicting Mutagenicity
MUTAG benchmark dataset
 - Mutation of Salmonella typhimurium
 - 125 positive data (effective for mutations)
 - 63 negative data (not effective for mutations)

Mahé et al., J. Chem. Inf. Model., 2005

44


Slide: Classification of Protein 3D structures
Graphs for protein 3D structures
 - Node: secondary structure elements
 - Edge: distance between two elements

Calculate the similarity by graph kernels

45  Borgwardt et al., Protein function prediction via graph kernels, ISMB 2005


Slide: Classification of proteins: Accuracy

46  Borgwardt et al., Protein function prediction via graph kernels, ISMB 2005


Slide: Pedestrian detection in images
(F. Suard et al., 2005)

47


Slide: Classifying RNA graphs
(Y. Karklin et al., 2005)

48


Slide: Strong points of MGK
Polynomial-time computation: O(n^3)

Positive definite kernel, usable with
 - support vector machines
 - kernel PCA
 - kernel CCA
 - and so on

49


Slide: Drawbacks of graph kernels
Global similarity measure
 - Fails to capture subtle differences
 - Long paths suppressed

Results not interpretable

Structural features ignored (e.g., loops)
 - No labels -> kernel always 1
50


Slide: Part 4. Weisfeiler-Lehman kernel

51


Slide: Convert a graph into a set of words
i)   Make the label set of adjacent vertices, e.g., {E, A, D}
ii)  Sort it: A, D, E
iii) Add the vertex label as a prefix: B, A, D, E
iv)  Map the label sequence to a unique value, e.g., B,A,D,E -> R
v)   Assign the value as the new vertex label; the graph becomes a bag-of-words {A, B, D, E, ..., R, ...}

(a sketch follows below)

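A sketch of this relabeling loop; the WL kernel between two graphs is then just the inner product of their accumulated bags of compressed labels. The table shared across graphs is what step iv) calls mapping a label sequence to a unique value; the toy graphs are made up.

```python
from collections import Counter

def wl_relabel(adj, labels, table):
    """One WL sweep: label -> compressed (label, sorted neighbor labels)."""
    new_labels = []
    for v, nbrs in enumerate(adj):
        word = (labels[v],) + tuple(sorted(labels[u] for u in nbrs))
        if word not in table:
            table[word] = len(table)        # assign a fresh unique value
        new_labels.append(table[word])
    return new_labels

def wl_bag(adj, labels, h, table):
    bag = Counter(labels)                   # iteration-0 labels count too
    for _ in range(h):
        labels = wl_relabel(adj, labels, table)
        bag.update(labels)                  # accumulate the bag-of-words
    return bag

table = {}                                  # shared across both graphs
adj1 = [[1, 2], [0, 2], [0, 1, 3], [2]]     # adjacency lists
adj2 = [[1], [0, 2], [1, 3], [2]]
bag1 = wl_bag(adj1, ["A", "B", "A", "C"], 2, table)
bag2 = wl_bag(adj2, ["A", "B", "A", "C"], 2, table)
print(sum(bag1[w] * bag2[w] for w in bag1.keys() & bag2.keys()))
```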

Slide: Courtesy K. Borgwardt
53


Slide: 54


Slide: Part 5. Reaction Graph Kernels

55


Slide: KEGG lysine degradation pathway


Slide: Missing enzymes in metabolic networks
- Many enzymatic reactions whose substrate and product are known, but the enzyme involved is unknown ("?" in the pathway diagram).
- Need to assign Enzymatic Classification (EC) numbers.


Slide: EC number
EC (Enzymatic Classification) number: a hierarchical categorization of
 - enzymes
 - enzymatic reactions

Example: EC 1.3.3.- reads class 1, subclass 3, sub-subclass 3.


Slide: Task
Given a pair of substrate and product as a query, find similar reactions in the database

  Query: (substrate, product, EC = ?)  ->  result of retrieval: EC 3.5.2.-

A similarity measure of reactions is necessary


Slide: Reaction Graph
- Represent an enzymatic reaction as a reaction graph
  - Node: molecules
  - Edge: chemical relation of molecules (main, leave, co-factor, transferase, ligase)
- Reaction graph kernel: a similarity measure of reaction graphs
  - Molecule = graph of atoms
  - A reaction graph has a graph-of-graphs structure
  - Extension of the existing walk-based graph kernel (Kashima et al., 2003)


Slide: Main Substrate and Main Products

[Figure: a reaction graph fragment - one main substrate linked to two main products; edges are labeled "main" and "leave".]


Slide: Reaction graph kernels (RGK)
- Two-layered kernels on graphs of graphs
  - Node kernel = walk-based graph kernel of molecules
  - Edge kernel = delta kernel of labels (main, leave, cofactor, transferase, ligase)

K_r(R, R'): the resulting kernel between reaction graphs R and R'


Slide: Simplified Settings
A query might not come in the complete form, so some edges are removed from the database entries:
 - RPAIR: use only reactant edges (main, leave)
 - main-only: use main edges only


Slide: Automatic classification of enzymatic reactions in KEGG
- KEGG/REACTION database
  - 4610 reactions with known EC number
  - 6 classes, 50 subclasses, 124 sub-subclasses
- Construct a nearest neighbor classifier based on the reaction graph kernel
  - Three different levels: class, subclass, sub-subclass
  - Three kernel versions: full, RPAIR, main-only
  - Measure leave-one-out classification error


Slide: Prediction Accuracy

- As expected, classification is easier for upper categories, but difficult for lower categories such as sub-subclass.
- The order of accuracy (full-edge > RPAIR > main-only) suggests that detailed edge information contributes to further accuracy.


Slide: Predicting unannotated reactions in plant secondary metabolism
- KEGG pathway "Biosynthesis of Secondary Metabolites"
- Out of 56 unannotated reactions, we manually assigned ECs to 36 reactions under chemists' guidance
- Comparison with an existing rule-based method: e-zyme (Kotera et al., J. Am. Chem. Soc., 2004)
- RGK's accuracy was better than e-zyme's (50% improvement for the top candidate)


Slide: Case 1: EC 3.1.1
Query: a substrate-product pair (C02046 -> C01479)

Manual annotation: 3.1.1. The structures of C02046 and C01479 are almost the same except structural isomerism; a hydrolysis occurs at a carboxylic-ester bond.

  method  | rank1 | rank2   | rank3
  RGK     | 3.1.1 | 1.14.11 | 1.14.11
  e-zyme  | 6.1.1 | 3.1.1   | NA


Slide: Case 2: EC 2.1.1+1.1.1
Query: C06175 -> C01735

Manual annotation: after removing a methyl group of C06175 (2.1.1), C01735 will be produced by oxidation of a CH-OH group (1.1.1). In the second reaction, enzymes usually use NAD+/NADP+ as an acceptor.

  method  | rank1 | rank2   | rank3
  RGK     | 1.1.1 | 1.1.1   | 1.1.1
  e-zyme  | 2.1.1 | 1.13.12 | 1.14.14


Slide: Case 3: EC1.14+2.4.1+5.2.1+3.2.1
Query: a multi-step reaction (structures shown on the slide)

Manual annotation: this reaction is thought to be a set of 5 reactions, by analogy with the pathway C00423 => C01772 => C05158 => C05839 => C05838, and on to C05851. The last reaction is spontaneous and not enzymatic.

  method  | rank1   | rank2 | rank3
  RGK     | 1.14.13 | 1.2.3 | 1.2.1
  e-zyme  | NA      | NA    | NA


Slide: A difficult case
No annotation

A very similar reaction (below) is found by manual inspection

Multi-step reactions are difficult to analyze: we don't know how many steps are hidden between the given substrate and product.


Slide: Concluding Remarks: New Frontier
Developing algorithms for learning from graphs: taming combinatorial explosion
 - Recursive fixed point iteration: graph kernels
 - Statistical pruning in the search tree: graph boosting
 - Hashing to a set of words: WL kernel

New ideas are necessary to get beyond the current level of speed and prediction accuracy: deeper integration of learning and mining
THANK YOU VERY MUCH!!
71


Slide: Reference
[1] H. Kashima, K. Tsuda, and A. Inokuchi, "Marginalized kernels between labeled graphs," ICML, pp. 321-328, 2003.
[2] H. Saigo, M. Hattori, H. Kashima, and K. Tsuda, "Reaction graph kernels predict EC numbers of unknown enzymatic reactions in plant secondary metabolism," BMC Bioinformatics, vol. 11, Suppl 1, p. S31, Jan. 2010.
[3] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. Borgwardt, "Weisfeiler-Lehman graph kernels," Journal of Machine Learning Research, vol. 12, pp. 2539-2561, 2011.
[4] K. Tsuda, T. Kin, and K. Asai, "Marginalized kernels for biological sequences," Bioinformatics, vol. 18, Suppl 1, pp. S268-S275, Jul. 2002.

72
