Machine Learning Summer School Kyoto 2012

Graph Mining

Chapter 1: Data Mining

National Institute of Advanced Industrial Science and Technology Koji Tsuda

1

Labeled graphs are general and powerful data structures that can be used to represent diverse kinds of objects such as XML code, chemical compounds, proteins, and RNAs. In these 10 years, we saw significant progress in statistical learning algorithms for graph data, such as supervised classification, clustering and dimensionality reduction. Graph kernels and graph mining have been the main driving force of such innovations. In this lecture, I start from basics of the two techniques and cover several important algorithms in learning from graphs. Successful biological applications are featured. If time allows, I will also cover recent developments and show future directions.


National Institute of Advanced Industrial Science and Technology (AIST)

Research institute under METI for 6 fields:

Life Science, Informatics, Environment and Energy, Nanotechnology and Materials, Geological Survey, Standard

2,500 scientists, 700 administrative staff, 5,200 visiting scientists from outside

Since 2001

2

2

Computational Biology Research Center

Odaiba, Tokyo

Developing Novel Methods and Tools for

Genome Informatics Molecular Informatics Cellular Informatics

Diverse Collaboration with Companies and Universities

3

3

Koji Tsuda: Short Bio

1998 PhD in Kyoto Univ., joined ETL
2000 GMD FIRST (Berlin, Germany)
2001 Joined CBRC/AIST
2003-2004, 2006-2008 Max Planck Institute for Biological Cybernetics, Tuebingen, Germany
2009 Back to CBRC, Machine Learning Group Leader

4

4

About this lecture

How to extract knowledge from structured data Itemset mining, tree mining, graph mining

Reverse Search Principle

Learning from structured data Kernels for structured data

5

5

Chapter 1 (Data Mining)

1. Structured Data in Biology
2. Itemset Mining
3. Closed Itemset Mining
4. Ordered Tree Mining
5. Unordered Tree Mining
6. Graph Mining
7. Dense Module Enumeration

6

6

Agenda 2 (Learning from Structured Data)

1. Preliminaries
2. Graph Clustering by EM
3. Graph Boosting
4. Regularization Paths in Graph Classification
5. Itemset Boosting for predicting HIV drug resistance

7

7

Agenda 3 (Kernel)

1. Kernel Method Revisited
2. Marginalized Kernels (Fisher Kernels)
3. Marginalized Graph Kernels
4. Weisfeiler-Lehman Kernels
5. Reaction Graph Kernels
6. Concluding Remarks

8

8

Part 1: Structural Data in Biology

9

9

Biological Sequences

DNA sequences (A,C,G,T): gene finding, splice sites

RNA sequences (A,C,G,U): microRNA discovery, etc.

Amino acid sequences (20 symbols): remote homolog detection, fold recognition, etc.

10

Structures hidden in sequences (I)

Exon/intron of DNA (Gene)

11

11

Structures hidden in sequences (II)

RNA Secondary Structure

Protein 3D Structures

It is crucial to infer hidden structures and exploit them for classification

Biological Graphs

12

12

Molecular graphs

Structure: Thiamine (Vitamin B1) molecular graph

graph = a set of dots & lines (or nodes & edges)

Hydrogen Carbon Oxygen Nitrogen Sulfur single bond double bond

abstraction

implicit hydrogens

explicit hydrogens

13

Gene Expression Data

Measurement of many mRNAs in the cell Rough estimate of amount of proteins Time-series or not Snapshot of the underlying dynamic system

14

14

Biological Networks

Protein-protein physical interaction Metabolic networks Gene regulatory networks Network induced from sequence similarity

Thousands of nodes (genes/proteins) 100000s of edges (interactions)

15

15

Physical Interaction Network

16

16

Metabolic Network

17

17

Many possible prediction problems..

18

18

Part 2: Itemset mining

19

19

Data Mining

A formal study of efficient methods for extracting interesting rules and patterns from massive data

[Figure: example XML record tree (people / person / name, email, tel)]

Frequent itemset mining

(Agrawal and Srikant 1994)

Closed pattern mining Structured data mining (Sequence, Trees, and Graphs)

20

20

Frequent Itemset Mining

[Agrawal, Srikant, VLDB'94]

Finding all "frequent" sets of elements (items) appearing at least minsup (σ) times in a database

Minsup = 2

[Figure: transaction database t1-t5 over items 1-5 and the frequent sets found in it]

Frequent sets: ∅, 1, 2, 3, 4, 12, 13, 14, 23, 24, 124

X = {2, 4} appears three times, thus frequent

21

database

The itemset lattice (2^I, ⊆)

21

Definitions: Database

A set I = {1, ..., n} of items (elements). A transaction database is a set T = {t1, ..., tm} of subsets of I. Each subset t is called a transaction.

Alphabet of items: I = {1, 2, 3, 4}

id | transaction
t1 | 1, 3
t2 | 2, 4
t3 | 1, 2, 3, 4
t4 | 1, 2, 4

22 22

22

Definitions: Frequent sets

Itemset X appears in transaction t: X ⊆ t. Occurrence of X in database T: Occ(X, T) = { t ∈ T : X ⊆ t }. Frequency of X: Fr(X, T) = |Occ(X, T)|. Minimum support (minsup): 0 ≤ σ ≤ |T|. X is frequent in T if Fr(X, T) ≥ σ.

I = {1, 2, 3, 4} (alphabet of items)

Transaction database:
id | transaction
t1 | 1, 3
t2 | 2, 4
t3 | 1, 2, 3, 4
t4 | 1, 2, 4

23

Occurrences and frequencies of itemsets: Occ(3, T) = {t1, t3}, Fr(3, T) = 2; Occ(24, T) = {t2, t3, t4}, Fr(24, T) = 3
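To make the definitions concrete, here is a minimal Python sketch (my own illustration; the toy database mirrors the example above) that computes Occ and Fr:

# Minimal sketch of Occ and Fr on the toy database above.
T = {"t1": {1, 3}, "t2": {2, 4}, "t3": {1, 2, 3, 4}, "t4": {1, 2, 4}}

def occ(X, T):
    """Occ(X, T): ids of transactions that contain itemset X."""
    return {tid for tid, t in T.items() if X <= t}

def freq(X, T):
    """Fr(X, T) = |Occ(X, T)|."""
    return len(occ(X, T))

print(occ({3}, T), freq({3}, T))        # {'t1', 't3'} 2
print(occ({2, 4}, T), freq({2, 4}, T))  # {'t2', 't3', 't4'} 3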

23

23

Market Basket Data

Popular application of itemset mining Business and Market data analysis

Transaction Data of purchase

[Table: binary purchase matrix; rows = customer IDs 001-010 (each row is a transaction, or "basket"), columns = items (Chips, Mustard, Sausage, Softdrink, Beer), entries are 0/1]

Meaning of transaction 003: "Customer 003 bought Chips and Sausage together in his basket"

24

24

DAG of itemsets: Hasse diagram

Edge: adding one item

[Figure: Hasse diagram of all itemsets over {1, 2, 3, 4}, from the empty set down to {1,2,3,4}]

25

25

Enumeration Tree by Lexicographical Order

Need a tree to avoid duplication

[Figure: enumeration tree over the itemsets of {1, 2, 3, 4}, in which each itemset has a unique parent determined by lexicographical order]

26

26

Backtracking Algorithm: FP Growth etc.

Monotonicity: support only decreases as a pattern grows. Depth-first traversal; prune if support < minsup (σ).

[Figure: enumeration tree with the frequent sets near the root; subtrees below infrequent sets are pruned]
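The following is a minimal illustrative sketch of such a depth-first enumeration with support-based pruning (my own code, not the FP-Growth implementation); patterns are extended in lexicographical order so that each itemset is generated exactly once:

def frequent_itemsets(T, minsup):
    """Enumerate frequent itemsets by depth-first search with pruning.
    T: list of transactions (sets of items), minsup: minimum support."""
    items = sorted(set().union(*T))
    results = []

    def support(X):
        return sum(1 for t in T if X <= t)

    def expand(X, start):
        for i in range(start, len(items)):
            Y = X | {items[i]}
            if support(Y) >= minsup:   # prune otherwise (anti-monotonicity)
                results.append(Y)
                expand(Y, i + 1)       # lexicographical extension avoids duplicates

    expand(frozenset(), 0)
    return results

T = [{1, 3}, {2, 4}, {1, 2, 3, 4}, {1, 2, 4}]
print(frequent_itemsets(T, minsup=2))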

27

27

Association Rule Mining

Confidence: Supp(A ∪ B) / Supp(A)

Probability of B, Given A

What item is likely to be bought when A is bought?

Search for rules with large support and large confidence; post-processing of itemset mining

28

28

Summary: Itemset mining

Itemset mining is the simplest of all mining algorithms

Need to maintain the occurrence of each pattern in the database

A tree by lexicographical order is (implicitly) used

29

29

Part 3: Closed Itemset mining

30

30

Problem in Frequent Pattern Mining

Huge Number of frequent itemsets Hard to analyze Most of them are similar

Huge number of frequent itemsets discovered in T

An input transaction database

1 t1 t2 t3 2 3 4 5

Frequent sets

minsup = 2

31

t4 t5

mining

database

31

Solution: Closed Pattern Mining

Find only closed patterns. Observation: most frequent itemsets X can be extended without changing their occurrence by adding new elements.

def ([Pasquier et al., ICDT'99]). An itemset X is a closed set if and only if there is no proper superset of X with the same frequency (thus the same occurrence set).

32

32

Closed Pattern Mining

A closed itemset is the maximal set among all itemsets with the same occurrences. Equivalence class [X] = { Y : Occ(X) = Occ(Y) }.

Database:
id | records
1 | ABCE
2 | AC
3 | BE
4 | BCE

[Figure: the itemset lattice over {A, B, C, E}, grouped into equivalence classes w.r.t. occurrences; the closed sets are the maximal sets of each class]

33

33

Brute-force: Stupid Baseline

ALGORITHM Bruteforce

First, generate all frequent itemsets Check them one by one via maximality test

Maximality test for each candidate frequent set X

Add each element e ∉ X to X. If Freq(X ∪ {e}) = Freq(X) for some e (i.e., not properly less), then X is not closed: reject X.

34

34

Bruteforce

STEP1) first, generate all frequent sets

Database T:
id | records
1 | ABCE
2 | AC
3 | BE
4 | BCE

[Figure: all itemsets in T annotated with their occurrence sets (sets of ids), grouped into equivalence classes w.r.t. occurrences; the closed sets are the maximal sets]

35

35

Bruteforce

STEP1) first, generate all frequent sets STEP 2) make closedness test for each set

[Figure: the same database and itemset lattice as in STEP 1, with the closedness (maximality) test applied to each frequent set]

36

36

Bruteforce

STEP1) first, generate all frequent sets STEP 2) make closedness test for each set STEP3) finally, extract all closed sets

[Figure: the same lattice; the maximal set of each equivalence class is extracted as a closed set]

All closed sets are found!

37

37

Complexity of Enumeration Algorithms

Number of patterns usually exponential to input size Delay: Time between two pattern outputs Brute-force is exponential delay w.r.t. pattern size

Input size Input Output size

Delay D Total Time T

38

38

To achieve linear delay,

Must jump from closed set to closed set How to define the search tree? Reverse search!

(Avis and Fukuda 1996)

{2} {7,9} {1,7,9} {2,5} {2,7,9}

{1,2,7,9}

{2,3,4,5}

{1,2,7,8,9}

{1,2,5,6,7,9} 39

39

Reverse Search: It's a must

A general mathematical framework to design enumeration algorithms Can be used to prove the correctness of the algorithm Popular in computational geometry Data mining algorithms can be explained in remarkable simplicity

40

40

Often, search space comes as a DAG

Naive backtracking = duplication

Duplication check by marking = exponential memory

How to visit all nodes without duplication?

[Figure: DAG-shaped search space of closed sets, e.g. {2}, {7,9}, {2,5}, {2,7,9}, {1,7,9}, {1,2,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9}]

41

Reduction Map

Mapping from a child to its parent. Reduction map for closed itemsets:

Shrink the itemset until occurrence changes Take closure operation

closure(X) closure shrink

Parent of X

42

Closed set X

42

42

Closure Operation

closure(X) of a set X:

Closed set computed by closure(X) = ∩ { t ∈ T : X ⊆ t } (the intersection of all transactions in T in which X occurs as a subset)

non-closed set

closure(X)

closed set

43

Equivalence class of itemsets with same occurrence

43

Example of Closure Operation

Database T

Non-closed itemset: (B,C) Occurrence: 1,4 Take Intersection of 1 and 4

id | records
1 | ABCE
2 | AC
3 | BE
4 | BCE

(A,B,C,E) ∩ (B,C,E) = (B,C,E): this is a closed itemset
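A small Python sketch of the closure operation on this database (my own code, directly following the definition):

# closure(X) = intersection of all transactions containing X (assumes Occ(X) is non-empty)
T = [set("ABCE"), set("AC"), set("BE"), set("BCE")]

def closure(X, T):
    occ = [t for t in T if X <= t]
    out = set(occ[0])
    for t in occ[1:]:
        out &= t
    return out

print(closure({"B", "C"}, T))  # {'B', 'C', 'E'}: the closed itemset containing (B, C)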

44

44

By applying the reduction map to all nodes, enumeration tree is defined.

But the arrows point in the reverse direction..

{2} {7,9} {1,7,9} {2,5} {2,7,9}

{1,2,7,9}

{2,3,4,5}

{1,2,7,8,9}

{1,2,5,6,7,9} 45

45

Children generation

In backtracking, one has to generate all children of the current node Parent Inverse of reduction map

Children candidates

real child

Generate all children candidates Apply reduction map to them Remove if not coming back

real child

46

46

Reverse Search Theorem

To prove the correctness, prove the following

1. The reduction map is uniquely defined on all nodes
2. By applying the reduction map repeatedly, one can reach the root node from any node
3. Children generation is the inverse of the reduction map

Easy to check !

47

47

LCM = Linear Time Closed Sets Miner (Uno et al., 2003)

Prefix Preserving Closure Extension Jump! = Children generation from the reduction map Linear Delay!

48

48

Closure Extension

Repeat: Add an item and taking closure

Step 1: start from a closed set X and add an item i: Y = X ∪ {i}

Step 2: take the closure: Z = closure(X ∪ {i}), a closed set

[Figure: closure extension maps the (possibly non-closed) set Y to the closed set Z]

49

49

Naive Closure Extension: Duplication!

closure extension DAG of closed itemsets {2} {7,9} 1,2,5,6,7,9 2,3,4,5 1,2,7,8,9 1,7,9 2,7,9 2 {1,7,9} {2,5} {2,7,9}

T

{1,2,7,9}

{2,3,4,5}

closure extension

{1,2,7,8,9}

{1,2,5,6,7,9}

50

50

Prefix Preserving Closure Extension

Ensure any closed set is generated from a unique parent.
Def. Closure tail of a closed itemset P: the minimum j such that closure(P ∩ {1,…,j}) = P.
Def. H = closure(P ∪ {i}) is a PPC-extension of P if i > closure tail and H ∩ {1,…,i-1} = P ∩ {1,…,i-1}.

51

51

Enumeration tree by PPC extension

closure extension DAG ppc extension tree {2} {7,9} 1,2,5,6,7,9 2,3,4,5 1,2,7,8,9 1,7,9 2,7,9 2 {1,7,9} {2,5} {2,7,9}

T

{1,2,7,9}

{2,3,4,5}

closure extension ppc extension

{1,2,7,8,9}

52 {1,2,5,6,7,9}

52

Linear Delay in Pattern Size

(Uno, Uchida, Asai, Arimura, Discovery Science 2004)

Theorem: The algorithm LCM finds all frequent closed sets X appearing in a transaction database D, without duplicates, in O(lmn) time per closed set, where l is the maximum length of a transaction in D, m is the size of the pattern X, and n is the total size of D.

Note: the output-polynomial time complexity of closed set discovery was shown by [Makino et al., STACS 2002]

53

53

Summary: Closed Itemset Mining

Closure extension: jump from closed set to closed set

LCM: linear delay

Very fast in practice, too

Winner of FIMI'04 (Frequent Itemset Mining Implementations workshop)

Relation to clique enumeration (Arimura, Uno, SDM 2009)

54

54

Part 4: Ordered Tree Mining

55

55

Frequent Ordered Tree Mining

Natural extension of the frequent itemset mining problem to trees

Finding all frequent substructures in a given collection of labeled trees. How to enumerate them without duplicates?

Efficient DFS Algorithm

FREQT [Asai, Arimura, SIAM DM2002] TreeMiner [Zaki, ACM KDD2002] Rightmost expansion technique

56

56

Labeled Ordered Trees

Rooted

Ordered: siblings are ordered from left to right.

Labeled: each node has a label.

A model of HTML/XML, hierarchical records, and dependency trees of natural language texts.

[Figure: an example XML record (people / person / name, email, tel) drawn as a labeled ordered tree]

57

Matching between trees

Pattern tree T matches a data tree D (T occurs in D) if there is a matching function f from T into D such that:
f is 1-to-1
f preserves the parent-child relation
f preserves the (indirect) sibling relation
f preserves labels

[Figure: a pattern tree T, a data tree D, and a matching function f between them]

58

58

Frequency of a pattern tree

A root occurrence of pattern T:

The node to which the root of T maps by a matching function

The frequency fr(T) = #root occurrences of T in D

[Figure: pattern tree T and data tree D; the root occurrence list is OccD(T) = {2, 8}, so fr(T) = 2]

59

59

Frequent Tree Mining Problem

Given: a collection S of labeled ordered trees and a minimum frequency threshold σ

Task: discover all frequent ordered trees in S (with frequency no less than σ) without duplicates

Example: minimum frequency threshold (min-sup) σ = 50%

60

60

Key: How to enumerate ordered trees without duplicates?

A

A naive algorithm

Starting from the smallest tree Grow a pattern tree by adding a new node one by one

B

A

A C

Drawbacks

Exponentially many different ways to generate the same pattern tree Explicit duplication test needed

B

A C

How to overcome this difficulty?

61

61

An idea: DFS Code of Ordered Tree

Depth-label sequence in the preorder traversal

(depth first search)

S = ((d(v1), l(v1)), …, (d(vk), l(vk)))

Example: the preorder (depth, label) pairs of the tree on this slide give the DFS code

id:  1  2  3  4  5  6  7
seq: 0A 1B 2A 3C 2B 1B 2C
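A minimal sketch of computing this depth-label code by preorder traversal (my own code; a node is represented as a (label, children) pair):

# A node is (label, [children]); children are kept in left-to-right order.
tree = ("A", [("B", [("A", [("C", [])]), ("B", [])]), ("B", [("C", [])])])

def dfs_code(node, depth=0):
    label, children = node
    code = [(depth, label)]
    for c in children:
        code.extend(dfs_code(c, depth + 1))
    return code

print(dfs_code(tree))
# [(0,'A'), (1,'B'), (2,'A'), (3,'C'), (2,'B'), (1,'B'), (2,'C')]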

62

62

Rightmost expansion

Extending the DFS code = attaching a new node on the rightmost branch: (d1, l1), …, (dn, ln), (dn+1, ln+1)

[Figure: a pattern tree S and the candidate trees obtained by attaching a new node D at each position on the rightmost branch]

63

Searching frequent ordered trees

Enumerate all frequent ordered trees by backtracking Tree extended only by rightmost extension = No duplication

[Figure: backtracking search over ordered trees by rightmost extension; infrequent branches are pruned]

64

64

Summary: Ordered tree mining

Convert a tree to a string (DFS code)

Adding an element to the code = rightmost extension

It was relatively easy because nodes are ordered

How about the unordered case?

65

65

Part 5: Unordered Tree Mining

66

66

Frequent Unordered Tree Mining

Unordered trees: Non-trivial subclass of general graphs Problem: Exponentially many isomorphic trees Efficient DFS Algorithm

Unot [Asai, Arimura, DS'03], NK [Nijssen, Kok, MGTS'03]

[Figure: several isomorphic unordered trees drawn with different sibling orderings]

67

67

Canonical Ordered Representation

Given ordering among siblings, depth-first search (DFS) code is defined

Code(T) = ((depth(v1), label(v1)), …, (depth(vk), label(vk)))

Example: Code(T) = (0A, 1B, 2A, 3C, 2B, 1B, 2C)

[Figure: the labeled ordered tree T corresponding to this code]

68

68

Canonical representation

Ordered tree T with lexicographically maximum code

[Figure: four isomorphic ordered trees T1-T4 and their DFS codes:
T1: (0A,1B,2A,3C,2B,1B,2C)
T2: (0A,1B,2B,2A,3C,1B,2C)
T3: (0A,1B,2C,1B,2A,3C,2B)
T4: (0A,1B,2C,1B,2B,2A,3C)
The canonical representation is the ordered tree whose code is lexicographically maximum.]

69

Left Heavy Condition (Nakano and Uno, 2002)

T(v): the subtree rooted at v

Ordered tree is canonical if and only if

Code(T(v1)) ≥ Code(T(v2))

for any pair of sibling nodes v1 (left) and v2 (right)
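A small sketch of this canonicality test (my own code, reusing the (label, children) representation and the DFS-code helper from the ordered-tree part): a tree is left-heavy if, at every node, each child's code is lexicographically no smaller than its right sibling's code.

def dfs_code(node, depth=0):
    label, children = node
    code = [(depth, label)]
    for c in children:
        code.extend(dfs_code(c, depth + 1))
    return code

def is_left_heavy(node):
    """Canonical (left-heavy) iff Code(T(v1)) >= Code(T(v2)) for all sibling pairs."""
    _, children = node
    codes = [dfs_code(c) for c in children]
    if any(codes[i] < codes[i + 1] for i in range(len(codes) - 1)):
        return False
    return all(is_left_heavy(c) for c in children)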

70

70

Reduction Map

How to define the parent from a child in the enumeration tree:

Generate the canonical tree of size k-1 from the canonical tree of size k

Remove the last element of the DFS code

Code(T) = (0A,1B,2A,3C,2B,1B,2C)

71

71

Children Generation

Generate children candidates by rightmost extension

Check the maximality of each candidate based on the left-heavy property

Discard candidates that are not maximal

[Figure: a pattern tree S and its rightmost-extension candidates]

72

72

Maximality Check by Left Heavy Property

The code of a left subtree must be no smaller than that of its right sibling's subtree

It suffices to check the rightmost sibling against the second-rightmost sibling

73

73

Complexity of UNOT

Delay per pattern: O(k b² m n)

k: pattern size
b: branching factor of the data tree
m: size of the data tree
n: database size

74

74

Summary: Mining Unordered Tree

The following three elements are necessary for a mining algorithm:

Canonical representation
Reduction map
Children generation, including the maximality check

Backtracking on the resulting enumeration tree

75

75

Part 6: Graph Mining

76

76

Frequent Subgraph Mining

Graph Database

Enumerate all subgraphs occurring more than 3 times

Patterns

77

77

Gspan (Yan and Han, 2002)

Most widely used graph mining algorithm Can be interpreted with reverse search principle

Canonical representation? Reduction map? Children generation?

78

78

DFS Code for Graph

Depth first search and preorder node labeling (src, dest, src_label, edge_label, dest_label) Some edges not traversed

backward edge (dest < src)

Elements sorted in the code

Example DFS code: {(0,1,A,a,A), (1,2,A,a,B), (2,0,B,a,A), (2,3,B,b,A)}

[Figure: the corresponding labeled graph on vertices 0-3 with vertex labels A/B and edge labels a/b]

79

79

Canonical Representation

Multiple DFS codes: different starting point and children ordering Minimum DFS Code: Lexicographically Minimum

80

80

Reduction Map

Removing the tail of minimum DFS code preserves minimality

min DFS: {e1, e2, e3, e4}

{e1, e2, e3}

81

81

Children Generation

Create candidates by adding an element to DFS code Check if each candidate is minimum If not, remove it

Children candidates

Parent

real child

real child

82

82

Minimality Check

Reconstruct the graph from DFS Code Derive the minimum DFS Code by trying all DFSs

Speed up by traversing minimal label only

{(0,1,A,a,A), (1,2,A,a,B), (2,0,B,a,A), (2,3,B,b,A)}

If the minimal code is not identical to the original, prune it

0

A a a B b A 3 A a 2

1

83

83

Summary: Graph Mining

gSpan is a typical example of reverse search. Not explained here: closed tree mining, closed graph mining. The delay is exponential in the pattern size.

This cannot be avoided due to the hardness of (sub)graph isomorphism, yet gSpan scales to millions of sparse molecular graphs.

Applications covered in next chapter

84

84

Part7: Dense Module Enumeration

85

85

Biological Motivation

Most cellular processes are performed by multi-component protein complexes

Increasing amounts of experimental protein interaction data are available

Our approach: predict complexes (modules) from the protein interaction network, exploiting additional information given by gene expression data, evolutionary conservation, phenotypic profiles, etc.

29/08/2012

86

86

29/08/2012

87

87

Protein interaction networks

Node: Proteins Edge: Physical interaction of two proteins

Challenge 1: false negative edges. Go beyond clique search!

Challenge 2: false positive edges. Assign confidence scores to edges and find node sets with a high density of high-confidence edges.

29/08/2012

88

88

Module Discovery

Previous work:
Clique percolation [Palla et al., 2005]
Partitioning: hierarchical clustering [Girvan and Newman, 2001], flow simulation [Krogan et al., 2006], spectral methods [Newman, 2006]
Heuristic local search [Bader and Hogue, 2003]

Our approach: exhaustively enumerate all dense subgraphs efficiently

29/08/2012

89

89

Motivation for Enumeration Approach

Detects overlapping modules. Allows specifying a minimum density for the output modules. Outputs all modules satisfying the density threshold.

29/08/2012

90

90

Differential Expression Criterion

Incorporation of gene expression: the presence of proteins depends on cell type. Additional criteria for modules: e1 = number of conditions where the whole module is expressed, e0 = number of conditions where the whole module is not expressed. Fix minimum values for both quantities.

29/08/2012

91

91

Problem Formalization

29/08/2012

92

92

Typical Enumeration Algorithms

Itemset mining, graph mining, etc. Enumerate all entities whose frequency >= 10. Set up a search tree. Tree pruning by anti-monotonicity: an entity's frequency is never larger than that of its sub-entities.

Not generated

29/08/2012

93

93

Network Example

29/08/2012

94

94

Graph-shaped Search Space of Modules

29/08/2012

95

95

Choosing a search tree

For efficient search, a search tree is needed There are many possible search trees Default: Lexicographical ordering

29/08/2012

96

96

Density is not a monotonic criterion

A subset of a dense set is not necessarily dense. Density does not decrease monotonically along a path, so pruning is impossible.

Example densities along a path: {1}: 1.0, {1,2}: 0.1, {1,2,3}: 0.5

29/08/2012

97

97

Question

Is it possible to make a search tree such that density decreases monotonically?

29/08/2012

98

98

Question

Is it possible to make a search tree such that density decreases monotonically? YES! Use Reverse Search (Avis and Fukuda 1993)

29/08/2012

99

99

Reverse search (Avis and Fukuda, 1993)

Specify a search tree in the graph-shaped search space. Reduction mapping: a rule to generate a parent from a child. Remove the node with the smallest degree; the density never decreases (and typically increases) by this removal.

[Figure: a child module and its parent, obtained by removing the minimum-degree node]
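A minimal sketch of this reduction map (my own code, with density taken as the fraction of present edges in the induced subgraph): removing a minimum-degree node never decreases the density.

def density(nodes, adj):
    """Edge density of the induced subgraph: |E| / (|V| choose 2)."""
    n = len(nodes)
    if n < 2:
        return 1.0
    e = sum(1 for v in nodes for u in adj[v] if u in nodes) // 2
    return e / (n * (n - 1) / 2)

def parent(nodes, adj):
    """Reduction map: drop a node of minimum degree within the module."""
    v = min(nodes, key=lambda w: sum(1 for u in adj[w] if u in nodes))
    return nodes - {v}

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
child = frozenset({1, 2, 3, 4})
print(density(child, adj), density(parent(child, adj), adj))  # 0.666... -> 1.0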

29/08/2012 100

100

Search Tree is uniquely specified by the reduction mapping

Condition: Every node should converge to the root node by applying the reduction mapping repeatedly

29/08/2012

101

101

Enumeration algorithm by reverse search

A set of children is generated from a parent node Try every possible children, and choose the ones satisfying the reduction mapping Prune if no children exist

29/08/2012

102

102

Constraint Integration

Differential expression constraint

Monotonicity: e_0 and e_1 decrease with extension of U

Can be used for extra pruning without difficulty

29/08/2012

103

103

Statistical Significance of a module

k : The number of nodes in the module : Density of the module : The number of modules of size k with density at least Probability of random selection making a denser module (p-value)

29/08/2012

104

104

Benchmarking in yeast complex discovery

Combined interactions from CYGD-Mpact and DIP: interactions among 3559 nodes. Confidence weights on edges due to (Jansen, 2003).

Methods in comparison: clique detection (Clique), Clique Percolation Method (CPM), Markov Clustering.

Modules compared with MIPS complexes.

29/08/2012

105

105

29/08/2012

106

106

Evolutionarily Conserved Yeast Modules

Use ortholog profiles (10 species, InParanoid). Density >= 50%, at least three orthologs. 1917 modules in 30 minutes. Recovered evolutionarily conserved complexes: the 20S proteasome, the 19S/22S regulator, the COPI vesicle coat complex, DNA polymerase I and II subunits, and the translation initiation factor eIF2B complex. These could not be recovered by simple DME due to low density.

29/08/2012 107

107

MIPS Complexes discovered by DME (Conserved in Evolution)

29/08/2012

108

108

Phenotype-associated yeast modules

Use growth phenotypic profiles (21 conditions, Dudley et al, 2005) Growth defect in at least one condition Each of the 13 highest ranking modules covers the large subunit of mitochondrial ribosome

Found additional protein, Mhr1

Exactly recovered the nucleoplasmic THO complex (Hpr1, Mft1, Rlr1, Thp2) Transcription elongation, hyperrecombination Growth defect under ethanol

29/08/2012

109

109

Phenotype Associated Modules Large subunit of mitochondrial ribosome

Mhr1: involved in homologous recombination of the mitochondrial genome

29/08/2012 110

110

Human Settings

Tissue-specific gene expression data (Su et al., 2004) 79 different tissues Consistently expressed in 3 tissues, not in 10 tissues 7763 proteins, density >= 35%, 5 minutes 1021 maximal modules MIPS human complex database (Ruepp et al., to appear)

29/08/2012

111

111

Human-expression result

Around the MCM complex, we found inter-complex relationships with ORC, CDC7, Toposome, and the PLK1 protein

Module Uqcrc1, Uqcrc2, Uqcrb, Cyc1 (lg p = -13): no overlap with the MIPS ubiquinol-cytochrome c reductase complex

SCF E3 ubiquitin ligase complex (marks proteins for degradation): 5 different modules with different tissue specificity. Peripheral proteins are the substrate recognition particles, so target proteins are selected in a tissue-specific manner! Natural killer cells have all particles.

29/08/2012

112

112

High ranking modules around the MCM complex

Expressed in bone marrow cells. Not expressed in brain, liver, kidney, etc.

29/08/2012 113

113

Tissue Specific organization of the SCF ligase complex

29/08/2012

114

114

Summary: Dense Module Enumeration

Novel module enumeration algorithm based on reverse search Combination with other information sources Statistical significance of dense modules Successfully applied to yeast/human protein interaction networks

29/08/2012

115

115

Reference

[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. 20th Int. Conf. Very Large Data Bases (VLDB), 1994.
[2] T. Asai, H. Arimura, T. Uno, and S. Nakano, "Discovering frequent substructures in large unordered trees," Discovery Science, 2003.
[3] T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa, "Efficient substructure discovery from large semi-structured data," in SDM, 2002.
[4] D. Avis and K. Fukuda, "Reverse search for enumeration," Discrete Applied Mathematics, vol. 65, no. 1-3, pp. 21-46, 1996.
[5] E. Georgii, S. Dietmann, T. Uno, P. Pagel, and K. Tsuda, "Enumeration of condition-dependent dense modules in protein interaction networks," Bioinformatics, vol. 25, no. 7, pp. 933-940, Apr. 2009.
[6] S. Nijssen and J. Kok, "Efficient discovery of frequent unordered trees," MGTS, 2003.
[7] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, "Discovering frequent closed itemsets for association rules," Database Theory - ICDT'99, 1999.
[8] I. Takigawa and H. Mamitsuka, "Efficiently mining δ-tolerance closed frequent subgraphs," Machine Learning, vol. 82, no. 2, pp. 95-121, Sep. 2010.
[9] X. Yan and J. Han, "gSpan: graph-based substructure pattern mining," in Proc. 2002 IEEE International Conference on Data Mining (ICDM), 2002, pp. 721-724.
[10] M. Zaki, "Efficiently mining frequent trees in a forest: algorithms and applications," IEEE Transactions on Knowledge and Data Engineering, 2005.

29/08/2012 116

116

Machine Learning Summer School Kyoto 2012

Graph Mining

Chapter 2: Learning from Structured Data

National Institute of Advanced Industrial Science and Technology Koji Tsuda

1

117

Agenda 2 (Learning from Structured Data)

1. Preliminaries
2. Graph Clustering by EM
3. Graph Boosting
4. Regularization Paths in Graph Classification
5. Itemset Boosting for predicting HIV drug resistance

2

118

Part 1: Preliminaries

3

119

Clustering Graphs

4

120

Graph Regression

[Figure: training graphs with real-valued targets (e.g., -0.2, 0.7, -0.5) and a test graph whose target is to be predicted]

5

121

Substructure Representation

0/1 vector of pattern indicators Huge dimensionality! Need Graph Mining for selecting features

patterns

6

122

Graph Mining

Frequent Substructure Mining

Enumerate all patterns that occur in at least m graphs

:Indicator of pattern k in graph i

Support(k): # of occurrence of pattern k

7

123

Enumeration on Tree-shaped Search Space

Each node has a pattern Generate nodes from the root:

Add an edge at each step

8

124

Support(g): # of occurrence of pattern g

Tree Pruning

Anti-monotonicity:

If support(g) < m, stop exploring!

Not generated

9

125

Gspan

(Yan and Han, 2002)

Efficient Frequent Substructure Mining Method DFS Code

Efficient detection of isomorphic patterns

10

126

Depth First Search (DFS) Code

[Figure: a labeled graph G and its DFS code tree; different traversals of G yield isomorphic codes built from edges such as [0,1,A,a,B], [0,2,A,b,A], [2,0,A,b,A], [1,2,B,c,A], [0,3,A,b,C]; non-minimum DFS codes are pruned]

127

Discriminative patterns

w_i > 0: positive class w_i < 0: negative class Weighted Substructure Mining

Patterns with large frequency difference Not Anti-Monotonic: Use a bound

12

128

Multiclass version

Multiple weight vectors

(graph belongs to class ) (otherwise)

Search patterns overrepresented in a class

13

129

Basic Bound

xij : Occurrence of pattern j If k is supergraph of pattern j,

14

130

Pruning Condition

Summarizing the bound for all classes,

If it is negative, the search tree can be pruned safely

15

131

Summary: Preliminaries

Various graph learning problems

Supervised/Unsupervised

Discovery of salient features by graph mining Actual speed depends on the data

Faster for..

Sparse graphs Diverse kinds of labels

16

132

Part 2: EM-based clustering of graphs

17

133

EM-based graph clustering

Motivation

Learning a mixture model in the feature space of patterns Basis for more complex probabilistic inference

L1 regularization & Graph Mining E-step -> Mining -> M-step

18

134

Probabilistic Model

Binomial Mixture

Each Component

x: feature vector of a graph (0 or 1 entries); π: mixing weight of a cluster; θ: parameter vector of a cluster
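In symbols, the model is the standard binomial (Bernoulli) mixture; the notation x, π, θ here is mine, matching the bullet above:

p(x) = \sum_{\ell=1}^{c} \pi_\ell \prod_{k} \theta_{\ell k}^{x_k} (1 - \theta_{\ell k})^{1 - x_k}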

19

135

Ordinary EM algorithm

Maximizing the log likelihood

E-step: Get posterior M-step: Estimate using posterior probs. Both are computationally prohibitive (!)

20

136

Regularization

L1-Regularized log likelihood

Baseline constant

ML parameter estimate using single binomial distribution

In solution, most parameters exactly equal to constants

21

137

E-step

Active pattern

E-step computed only with active patterns (computable!)

22

138

M-step

Putative cluster assignment Each parameter is solved separately

Naive way: solve it for all parameters and identify the active patterns

Use graph mining to find active patterns

23

139

Solution

Occurrence probability in a cluster

Overall occurrence probability

24

140

Solution

25

141

Important Observation

For active pattern k, the occurrence probability in a graph cluster is significantly different from the average

26

142

Mining for Active Patterns

Active pattern

Equivalently written as

F can be found by graph mining! (multiclass)

27

143

Experiments: RNA graphs

Stem as a node Secondary structure by RNAfold 0/1 Vertex label (self loop or not)

28

144

Clustering RNA graphs

Three Rfam families

Intron GP I (Int, 30 graphs) SSU rRNA 5 (SSU, 50 graphs) RNase bact a (RNase, 50 graphs)

Three bipartition problems

Results evaluated by ROC scores (Area under the ROC curve)

29

145

Examples of RNA Graphs

30

146

ROC Scores

31

147

No of Patterns & Time

32

148

Found Patterns

33

149

Summary (graph EM)

Substructure representation is better than paths

Probabilistic inference helped by graph mining

Extension to the Dirichlet mixture model reported in Tsuda et al., SDM 2008

Possible extensions: graph PCA, LFD, CCA; semi-supervised learning

34

150

Part 3: Graph Boosting

35

151

Graph classification problem in chemoinformatics

Known as SAR problem in chemical informatics

(Quantitative) Structure-Activity Analysis

Given a graph, predict a class-label (+1 or -1)

Typically, features (descriptors) are given e.g., Dragon Descriptors, JOELIB2

36

152

SAR with conventional descriptors

[Table: molecules described by conventional descriptors (#atoms, #bonds, #rings) together with their activity labels (+1 / -1)]

37

153

Motivation of Graph Boosting

Descriptors are not always available New features by obtaining informative patterns (i.e., subgraphs) Greedy pattern discovery by Boosting + gSpan Linear Programming (LP) Boosting

Reduce the number of graph mining calls Faster than AdaBoost

Accurate prediction & interpretable results

38

154

Molecule as a labeled graph

C C C C C C C O C C C

39

155

SAR with patterns

[Table: molecules represented by 0/1 indicators of subgraph patterns (e.g., carbon chains, rings, C-Cl fragments); each row is a molecule with its activity label (+1 / -1)]

40

156

Sparse classification in a very high dimensional space

G: all possible patterns (intractably large) |G|-dimensional feature vector x for a molecule Linear Classifier

f(x) = Σ_{j=1}^{d} w_j x_j

Use an L1 regularizer to make the weight vector sparse; select a tractable number of patterns

41

157

Problem formulation

Sum of the hinge loss and the L1 regularizer. (x_i, y_i): training examples; ξ_i: slack variables.

42

158

Dual LP

Primal: Huge number of weight variables Dual: Huge number of constraints

Dual problem

43

159

Column Generation Algorithm for LP Boost (Demiriz et al., 2002)

Start from the dual with no constraints Add the most violated constraint each time Guaranteed to converge

Constraint Matrix

Used Part

44

160

Finding the most violated constraint

Constraint for a pattern (shown again)

Finding the most violated one

Searched by weighted substructure mining

45

161

Algorithm Overview

Iteration

Find a new pattern by graph mining with weight u If all constraints are satisfied, break Add a new constraint Update u by solving the dual problem

Convert dual solution to obtain primal solution

Return
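A schematic Python sketch of this column-generation loop (my own illustration; solve_restricted_dual and most_violated_pattern are hypothetical placeholders standing for an LP solver on the current constraint set and the weighted-mining oracle, respectively):

def lp_boost(solve_restricted_dual, most_violated_pattern, tol=1e-6, max_iter=100):
    """Column-generation skeleton (illustrative only).

    solve_restricted_dual(constraints) -> (u, gamma): dual weights and threshold
    most_violated_pattern(u) -> (pattern, violation): mining oracle
    """
    constraints = []                      # start from the dual with no pattern constraints
    u, gamma = solve_restricted_dual(constraints)
    for _ in range(max_iter):
        pattern, violation = most_violated_pattern(u)   # weighted substructure mining
        if violation <= gamma + tol:      # all constraints satisfied: stop
            break
        constraints.append(pattern)       # add the new constraint (column)
        u, gamma = solve_restricted_dual(constraints)
    return constraints, u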

46

162

Experimental Settings

Classification and Regression Datasets

47

163

Classification Results

48

164

Regression Results

49

165

Extracted patterns from CPDB

50

166

Memory Usage

51

167

Runtime

52

168

Comparison with AdaBoost

53

169

Summary (Graph Boosting)

Graph Boosting simultaneously generate patterns and learn their weights Finite convergence by column generation Interpretable by chemists. Flexible constraints and speed-up by LP.

54

170

Part 4: Entire Regularization Path

55

171

Overview

Entire regularization paths

LARS-LASSO (Efron et al., 2004), L1-SVM: forward selection of features; trace the solution trajectory of L1-regularized learning

Feature search -> pattern search: branch-and-bound algorithm on the DFS code tree, with a new bound

56

Path following algorithm for graph data

172

Path Following Algorithms

LASSO regression

Follow the complete trajectory of the regularization parameter λ: from infinity to zero

Active feature set

Features corresponding to nonzero weights

57

173

Piecewise Linear Path

At a turning point,

A new feature included into the active set, or An existing feature excluded from the active set

58

174

Practical Merit of Path Following

Cross validation by grid search

Has to solve QP many times Especially time-consuming for graph data

Path following does not include QP Determine the CV-optimal regularization parameter in the finest precision

59

175

Pseudo code of path following

Set the initial point and direction
Do
  d1 = step size if the next event is an inclusion (main search problem)
  d2 = step size if the next event is an exclusion
  d = min(d1, d2)
  Update the active feature set
  Set the next direction
Until all features are included

60

176

Feature space of patterns

Graph training data Set of all subgraphs (patterns) Each graph is represented as

61

177

Main Search problem

Step size if pattern t is included next

: constants computed from active set

Find pattern

that minimizes

62

178

Tree-shaped Search Space

Each node has a pattern Generate nodes from the root:

Add an edge at each step

63

179

Tree Pruning

If it is guaranteed that the optimal pattern is not in the downstream, the search tree can be pruned

Not generated

64

180

Theorem (Pruning condition)

Traversed up to pattern t :Minimum value so far No better pattern in the downstream, if

where

65

181

Initial Experiments

EDKB Estrogen receptor database

131 training graphs (chemical compounds)

Computation Time: 4 sec/search

Pattern size limitation: 10 edges

66

182

Solution Path

67

183

Events

68

184

Summary: Regularization Path

Path following implemented for graph data Pattern search by graph mining Classification: To do Combination with item set mining

69

185

Part 5: Itemset Boosting for predicting HIV drug resistance

70

186

Life cycle of an HIV Virus

Drug Target

71

187

Approved HIV Drugs

8 Protease inhibitors (PI) 8 Nucleoside/nucleotide reverse transcriptase inhibitors (NRTI) 3 Non-nucleoside reverse transcriptase inhibitors (NNRTI) 1 Fusion inhibitor

72

188

Drug resistance of HIV

Exposure to a drug causes mutations in HIV's genes. As a result, HIV gains resistance against the drug.

The cost of identifying the genotypes of HIV in a patient is relatively cheap. Predict the drug resistance from HIV's genotypes!

Effective Pharmacotherapy for individuals

73

189

Drug resistance prediction problem as regression

Input: Set of mutations in a target protein

41L: Amino acid at position 41 changed to L

Output: resistance against a drug (fold change)

(40F, 41L, 43E, 210W, 211K, 215Y) -> 0.8
(43Q, 75M, 122E, 210W, 211K) -> 12.8
(75I, 77L, 116Y, 151M, 184V) -> ?

74

190

Simple vs Complex Genotypic Features

Simple genotypic features

(0, 1, 0, 1, 0, 0, 0, 1,)

41L

62V F116Y

Complex genotypic features

(0, 1, 0, 1, 0, 0, 0, 1,)

77L,116Y 103N,210W,215Y

75

191

Linear Regression on Simple Genotypic Features (Rhee et al., PNAS 2006)

Mutation associations not discovered

76

192

Complex Genotypic Features

0/1 vector of pattern indicators Huge dimensionality! Need itemset mining for selecting features Selection of salient features (0, 1, 0, 1, 0, 0, 0, 1,)

77L,116Y 103N,210W,215Y

patterns

77

193

Other methods

Nonlinear SVM

Not interpretable High accuracy

Decision trees

Interpretable Low accuracy

78

194

Motivation of Itemset Boosting

Impossible to maintain all complex features Greedy feature discovery by Boosting + itemset mining Quadratic Programming (QP) Boosting

Reduce the number of itemset mining calls Faster than AdaBoost

Accurate prediction & interpretable results

79

195

Sparse classification in a very high dimensional space

G: all possible patterns (intractably large) |G|-dimensional feature vector x Linear Classifier

f(x) = Σ_{j=1}^{d} w_j x_j

Use an L1 regularizer to make the weight vector sparse (LASSO); select a tractable number of patterns

80

196

Problem formulation: Quadratic Programming

Sum of the squared loss and the L1 regularizer. (x_i, y_i): training examples; ξ_i: slack variables.

81

197

Dual QP

Primal: Huge number of weight variables Dual: Huge number of constraints

82

198

Column Generation Algorithm for QP Boost (Demiriz et al., 2002)

Start from the dual with no constraints Add the most violated constraint each time Guaranteed to converge

Constraint Matrix

Used Part

83

199

Finding the most violated constraint

Constraint for a pattern (shown again)

Finding the most violated one

Searched by weighted itemset mining

84

200

Algorithm Overview

Iteration

Find a new pattern by itemset mining with weight u. If all constraints are satisfied, break. Add a new constraint. Update u by solving the dual problem.

Convert dual solution to obtain primal solution

Return

85

201

Experimental settings

Three classes of drugs

NRTI (Nucleotide Reverse Transcriptase Inhibitors) PI (Protease Inhibitors) NNRTI (Nonnucleotide Reverse Transcriptase Inhibitors)

5-fold cross validation

Methods compared: linear SVM, ridge regression, LARS, nonlinear SVM, iBoost

86

202

Regression results

Accuracy:

87

203

Accuracy Summary

NRTIs: iBoost performed best PIs:

Nonlinear methods were better than linear ones; SVMs were slightly better (not significant)

NNRTIs

Linear methods were better Combination is not necessary

88

204

NRTI Drugs

89

205

Known mutation associations in RT

Red: Thymidine-associated Mutations (TAM)

41L, 67N, 70R, 210W, 215Y/F, 219Q, 69i 75I, 77L, 116Y, 151M, 65R, 74V, 184I/V

RT of HIV-1

Blue: Q151M Complex

90

206

PI Drugs

91

207

NNRTI Drugs (almost no combination found)

92

208

Computation Time of iBoost

Training time for 3TC 507 isolates with 371 mutations on average QP time longer than mining time

93

209

Summary (HIV)

Itemset Boosting for finding mutation associations Good accuracy for NRTIs Our complex features re-discover known mutation clusters Broad applications

Multiple SNP analysis, RNAi efficacy prediction Motif combination, Flu mutation analysis P53 mutation analysis

94

210

Reference

[1] K. Tsuda and T. Kudo, "Clustering graphs by weighted substructure mining," in Proceedings of the 23rd International Conference on Machine Learning (ICML '06), 2006, pp. 953-960.
[2] H. Saigo, N. Krämer, and K. Tsuda, "Partial least squares regression for graph mining," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), 2008, p. 578.
[3] H. Saigo, S. Nowozin, T. Kadowaki, T. Kudo, and K. Tsuda, "gBoost: a mathematical programming approach to graph classification and regression," Machine Learning, vol. 75, no. 1, pp. 69-89, Nov. 2008.
[4] H. Saigo, T. Uno, and K. Tsuda, "Mining complex genotypic features for predicting HIV-1 drug resistance," Bioinformatics, vol. 23, no. 18, pp. 2455-2462, Sep. 2007.
[5] K. Tsuda, "Entire regularization paths for graph data," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), 2007, pp. 919-926.

95

211

Machine Learning Summer School Kyoto 2012

Graph Mining

Chapter 3: Kernels for Structured Data

National Institute of Advanced Industrial Science and Technology Koji Tsuda

1

212

Agenda 3 (Kernel)

1. Kernel Method Revisited
2. Marginalized Kernels (Fisher Kernels)
3. Marginalized Graph Kernels
4. Weisfeiler-Lehman Kernels
5. Reaction Graph Kernels
6. Concluding Remarks

2

213

Part 1: Kernel Method Revisited

3

214

Kernels and Learning

In Kernel-based learning algorithms, problem solving is now decoupled into:

A general purpose learning algorithm (e.g. SVM, PCA, ) Often linear algorithm A problem specific kernel

Simple (linear) learning algorithm

Complex Learning Task

Specific Kernel function 4

215

Current Synthesis

Modularity and re-usability

Same kernel, different learning algorithms; different kernels, same learning algorithm

Data 1 (Sequence) Data 2 (Network)

Kernel 1

Gram Matrix

(not necessarily stored)

Learning Algo 1 Learning Algo 2

5

Kernel 2 Gram Matrix

216

Kernel Methods : intuitive idea

Find a mapping f such that, in the new space, problem solving is linear Kernel represents the similarity between two objects, defined as the dot-product in this new vector space But the mapping is left implicit Easy generalization of a lot of dotproduct-based learning algorithms

6

217

Kernel Methods : the mapping

f

f

Original Space

f

Feature (Vector) Space

7

218

Kernel : more formal definition

A kernel k(x,y)

is a similarity measure defined by an implicit mapping f, such that: k(x,y)=f(x)f(y) This similarity measure implies:

Invariance or other a priori knowledge The class of functions the solution is taken from Possibly infinite dimension (hypothesis space for learning) but still computational efficiency when computing k(x,y)

8

219

Kernel Trick

Generalizes (nonlinearly) algorithms in clustering, classification, density estimation ..

When these algorithms are dot-product based, by replacing the dot product (x·y) by k(x,y) = f(x)·f(y). When these algorithms are distance-based, by replacing d²(x,y) by k(x,x) + k(y,y) - 2k(x,y).

Freedom of choosing f implies a large variety of learning algorithms

9

220

Valid Kernels

Theorem: k(x,y) is a valid kernel if k is positive definite and symmetric (Mercer Kernel)

A function is P.D. if ∫∫ K(x,y) f(x) f(y) dx dy ≥ 0 for all f ∈ L2. In other words, the Gram matrix K (whose elements are k(xi, xj)) must be positive definite for all xi, xj of the input space. One possible choice of f(x): k(·, x) (maps a point x to the function k(·, x): a feature space of infinite dimension!)

10

221

How to build new kernels

Kernel combinations, preserving validity:

K(x,y) = λ K1(x,y) + (1-λ) K2(x,y),  0 ≤ λ ≤ 1
K(x,y) = a K1(x,y),  a ≥ 0
K(x,y) = K1(x,y) K2(x,y)
K(x,y) = f(x) f(y),  f a real-valued function
K(x,y) = K3(φ(x), φ(y))
K(x,y) = xᵀ P y,  P symmetric positive definite
K(x,y) = K1(x,y) / √(K1(x,x) K1(y,y))

11

222

Strategies of Design

Convolution Kernels: text is a recursively-defined data structure. How to build global kernels form local (atomic level) kernels? Generative model-based kernels: the topology of the problem will be translated into a kernel function

12

223


Family of kernels

Kernels for biological sequences

Spectrum kernel Marginalized kernel Profile kernel Local alignment kernel

Kernel Methods in Computational Biology, MIT Press

Tree Kernels

Kernel for phylogenetic profiles

Kernel for natural language Kernel for RNA sequences

224


Family of kernels

Kernels for nodes in a network

Diffusion kernel Locally constrained diffusion kernel

Graph Kernels

Marginalized Graph Kernels MGK without tottering

Acyclic Pattern Kernels

Shortest Path Kernel Weisfeiler-Lehman Kernel

225

Weak points of kernel methods

Not Interpretable

Not sure which features are used -> Graph Mining, Boosting

Dense kernel matrices: Slow

Take O(n^3) time for manipulation -> Semi-supervised learning

15

226

Part 2. Marginalized kernels

16

227

Biological Sequences: Classification Tasks

DNA sequences (A,C,G,T): gene finding, splice sites

RNA sequences (A,C,G,U): microRNA discovery, classification into Rfam families

Amino acid sequences (20 symbols): remote homolog detection, fold recognition

17

228

Kernels for Sequences

Similarity between sequences of different lengths

ACGGTTCAA ATATCGCGGGAA

Later combined with SVMs and other kernel methods

18

229

Count Kernel

Inner product between symbol counts

Extension: Spectrum kernels (Leslie et al., 2002) Counts the number of k-mers (k-grams) efficiently Not good for sequences with frequent context change E.g., coding/non-coding regions in DNA
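A minimal sketch of the count/spectrum idea (my own code): represent each sequence by its k-mer counts and take the inner product of the count vectors.

from collections import Counter

def kmer_counts(s, k=3):
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k=3):
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return sum(cs[w] * ct[w] for w in cs)   # inner product of k-mer count vectors

print(spectrum_kernel("ACGGTTCAA", "ATATCGCGGGAA", k=2))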

19

230

Hidden Markov Models for Estimating Context

Visible variable: the symbol sequence

Hidden variable: the context

HMM can estimate the posterior probability of hidden variables from data

20

231

Marginalized kernels

Design a joint kernel for the combined (visible, hidden) variable. The hidden variable is not usually available.

Take expectation with respect to the hidden variable

The marginalized kernel for visible variables

21

232

Designing a joint kernel for sequences

Symbols are counted separately in each context

:count of a combined symbol (k,l) Joint kernel: count kernel with context information

22

233

Marginalization of the joint kernel

Joint kernel

Marginalized count kernel

23

234

Computing Marginalized Counts from HMM

Marginalized count is described as

Posterior probability of i-th hidden variable is efficiently computed as

24

235

2nd order marginalized count kernel

If adjacent relations between symbols have essential meanings, the count kernel is obviously not sufficient

2nd order marginalized count kernel 4 neighboring symbols (i.e. 2 visible and 2 hidden) are combined and counted

25

236

Fisher Kernel

Probabilistic model

Fisher Kernel Mapping to a feature vector (Fisher score vector)

Inner product of Fisher scores

Z: Fisher information matrix
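In formulas (a sketch of the standard form, notation mine; θ denotes the model parameters):

\phi(x) = \nabla_{\theta} \log p(x \mid \theta), \qquad K(x, x') = \phi(x)^{\top} Z^{-1} \phi(x')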

26

237

Fisher Kernel from HMM

Derive FK from HMM (Jaakkola et al. 2000)

Derivative for emission probabilities only No Fisher information matrix Counts are centralized and weighted

FK from HMM is a special case of marginalized kernels

27

238

Difference between FK and Marginalized Kernels

FK: Probabilistic model determines the joint kernel and the posterior probabilities

MK: You can determine them separately

More flexible design!

28

239

Protein clustering experiment

84 proteins containing five classes

gyrB proteins from five bacteria species HMM + {FK,MCK1,MCK2}+K-Means

Clustering methods

Evaluation

Adjusted Rand Index (ARI)

29

240

Kernel Matrices

30

241

Clustering Evaluation

31

242

Applications since then..

Marginalized Graph Kernels (Kashima et al., ICML 2003) Sensor networks (Nyugen et al., ICML 2004) Labeling of structured data (Kashima et al., ICML 2004) Robotics (Shimosaka et al., ICRA 2005) Kernels for Promoter Regions (Vert et al., NIPS 2005) Web data (Zhao et al., WWW 2006)

32

243

Summary (Marginalized Kernels)

General Framework for using generative model for defining kernels Fisher kernel as a special case Broad applications

33

244

Part 3 Marginalized Graph Kernels

34

245

Motivations for graph analysis

Existing methods assume tables

Serial Num | Name | Age | Sex | Address
0001 | ... | 40 | Male | Tokyo
0002 | ... | 31 | Female | Osaka

Structured data beyond this framework New methods for analysis

35

246

Graphs..

36

247

Graph Structures in Biology

[Figure: graph-structured data in biology: a DNA sequence, an RNA secondary structure, a chemical compound, and a dependency-parsed sentence from the literature ("Amitriptyline inhibits adenosine uptake")]

37

248

Marginalized Graph Kernels

Going to define the kernel function. Both vertices and edges are labeled.

(Kashima, Tsuda, Inokuchi, ICML 2003)

38

249

Label path

Sequence of vertex and edge labels

Generated by random walking Uniform initial, transition, terminal probabilities

39

250

Path-probability vector

40

251

Kernel definition

Kernels for paths

A c D b E

B c D a A

Take expectation over all possible paths! Marginalized kernels for graphs

41

252

Computation

Transition probability; initial and terminal probabilities are omitted here. K_V: the kernel computed from the pairs of paths ending at (v, v').

KV is written recursively

The kernel is computed by solving a system of linear equations, in polynomial time
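A sketch of the recursion in simplified notation (mine; k_v and k_e are vertex- and edge-label kernels, p_t the transition probability, and A(v) the neighbors of v):

K_V(v, v') = \sum_{u \in A(v)} \sum_{u' \in A(v')} p_t(u \mid v)\, p_t(u' \mid v')\, k_v\big(\ell(u), \ell(u')\big)\, k_e\big(\ell(v,u), \ell(v',u')\big)\, \big(1 + K_V(u, u')\big)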

[Figure: the recursion runs over the neighbor sets A(v) and A(v')]

42

253

Graph Kernel Applications

Chemical Compounds (Mahe et al., 2005) Protein 3D structures (Borgwardt et al, 2005) RNA graphs (Karklin et al., 2005) Pedestrian detection Signal Processing

43

254

Predicting Mutagenicity

MUTAG benchmark dataset

Mutation of Salmonella typhimurium 125 positive data (effective for mutations) 63 negative data (not effective for mutations)

Mahe et al. J. Chem. Inf. Model., 2005

44

255

Classification of Protein 3D structures

Graphs for protein 3D structures

Node: Secondary structure elements Edge: Distance of two elements

Calculate the similarity by graph kernels

45 Borgwardt et al. Protein function prediction via graph kernels, ISMB2005

256

Classification of proteins: Accuracy

46 Borgwardt et al. Protein function prediction via graph kernels, ISMB2005

257

Pedestrian detection in images

(F. Suard et al., 2005)

47

258

Classifying RNA graphs

Karklin et al.,, 2005)

(Y.

48

259

Strong points of MGK

Polynomial time computation O(n^3) Positive definite kernel

Support Vector Machines Kernel PCA Kernel CCA And so on

49

260

Drawbacks of graph kernels

Global similarity measure

Fails to capture subtle differences Long paths suppressed

Results not interpretable

Structural features ignored (e.g. loops)

No labels -> kernel always 1

50

261

Part 4. Weisfeiler Lehman kernel

51

262

Convert a graph into a set of words

i) Make a label set of the adjacent vertices, e.g. {E, A, D}
ii) Sort it, e.g. A, D, E
iii) Add the vertex's own label as a prefix, e.g. B, A, D, E
iv) Map the label sequence to a unique value, e.g. B,A,D,E -> R
v) Assign the value R as the new vertex label

Bag-of-words: {A, B, D, E, ..., R, ...}
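A minimal sketch of one such relabeling iteration (my own code): each vertex's new label compresses its own label together with the sorted labels of its neighbors.

def wl_iteration(adj, labels, table=None):
    """One Weisfeiler-Lehman relabeling step.

    adj: {vertex: set of neighbors}, labels: {vertex: label},
    table: dictionary compressing label strings to fresh short labels.
    """
    table = {} if table is None else table
    new_labels = {}
    for v in adj:
        word = labels[v] + "," + ",".join(sorted(labels[u] for u in adj[v]))
        if word not in table:
            table[word] = "R%d" % len(table)   # assign a fresh compressed label
        new_labels[v] = table[word]
    return new_labels, table

adj = {1: {2, 3}, 2: {1}, 3: {1}}
labels = {1: "B", 2: "A", 3: "D"}
print(wl_iteration(adj, labels)[0])   # e.g. {1: 'R0', 2: 'R1', 3: 'R2'}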

263

Courtesy K. Borgwardt

29/08/2012 53

264

54

265

Part 5. Reaction Graph Kernels

55

266

KEGG lysine degradation pathway

267

Missing enzymes in metabolic networks

Many enzymatic reactions whose substrate and product are known, but the enzyme involved is unknown.

?

Need to assign Enzymatic Classification numbers.

?

268

EC number

EC (Enzymatic Classification) number is a hierarchical categorization of

Enzymes Enzymatic reactions

[Figure: the EC number hierarchy class . subclass . subsubclass, e.g. EC 1.3.3.-]

269

Task

Given a pair of substrate and product as a query, find similar reactions in the database

Query ?

Result of Retrieval EC: 3.5.2

Similarity measure of reactions is necessary

270

Reaction Graph

Represent enzymatic reaction as reaction graph

Node: Molecules Edge: chemical relation of molecules (main, leave, co-factor, transferase, ligase)

Reaction graph kernel: Similarity measure of reaction graphs

Molecule = Graph of atoms Reaction graph has graph of graphs structure Extension of existing walk-based graph kernel

(Kashima et al., 2003)

271

[Figure: an example reaction graph with one main substrate and two main products; edges are labeled main / leave]

272

Reaction graph kernels (RGK)

Two-layered kernels on graphs of graphs

Node kernel = walk-based graph kernel of molecules Edge kernel = delta kernel of labels

main, leave, cofactor, transferase, ligase

[Figure: two reaction graphs R and R' compared by the reaction graph kernel Kr(R, R')]

273

Simplified Settings

Query might not come in the complete form Remove some edges in the database entries

RPAIR

Use only reactant edges (main, leave)

main-only Use main only

274

Automatic classification of enzymatic reactions in KEGG

KEGG/REACTION database 4610 reactions with known EC number 6 classes, 50 subclasses, 124 subsubclasses

Construct nearest neighbor classifier based on the reaction graph kernel Three different levels: class, subclass,subsubclass Three kernel versions: full, RPAIR, main-only Measure leave-one-out classification error

275

Prediction Accuracy

As expected, classification is easier for upper categories but difficult for lower categories such as the subsubclass. The order of accuracy (full-edge > RPAIR > main-only) suggests that detailed edge information contributes to further accuracy.

276

Predicting unannotated reactions in plant secondary metabolism

KEGG pathway "Biosynthesis of Secondary Metabolites". Out of 56 unannotated reactions, we manually assigned ECs to 36 reactions under chemists' guidance. Comparison with an existing rule-based method: e-zyme (Kotera et al., J. Am. Chem. Soc., 2004). RGK's accuracy was better than e-zyme's (50% improvement for the top candidate).

277

Case 1: EC 3.1.1

query

Manual

3.1.1

annotation

The structures of C02046 and C01479 are almost the same except for structural isomerism. A hydrolysis occurs at a carboxylic-ester bond.

method | rank1 | rank2 | rank3
RGK | 3.1.1 | 1.14.11 | 1.14.11
e-zyme | 6.1.1 | 3.1.1 | NA

278

Case 2: EC 2.1.1+1.1.1

query

Manual

annotation

After removing a methyl group of C06175 (2.1.1), C01735 will be produced by oxidation of a CH-OH group (1.1.1). In the second reaction, enzymes usually use NAD+/NADP+ as an acceptor.

method | rank1 | rank2 | rank3
RGK | 1.1.1 | 1.1.1 | 1.1.1
e-zyme | 2.1.1 | 1.13.12 | 1.14.14

279

Case 3: EC1.14+2.4.1+5.2.1+3.2.1

Query

Manual Annotation

This reaction is thought to be a set of 5 reactions by analogy with the pathway from C00423 => C01772 => C05158 => C05839 => C05838, and to C05851. The last reaction is spontaneous and not enzymatic.

method | rank1 | rank2 | rank3
RGK | 1.14.13 | 1.2.3 | 1.2.1
e-zyme | NA | NA | NA

280

A difficult case

No annotation

A very similar reaction (below) is found by manual inspection

A multi-step reaction is difficult to analyze. We don't know how many steps are hidden between the given substrate and product.

281

Concluding Remarks: New Frontier

Developing algorithms for learning from graphs: taming combinatorial explosion

Recursive fixed-point iteration: graph kernels
Statistical pruning in the search tree: graph boosting
Hashing to a set of words: WL kernel

New ideas are necessary to get beyond the current level of speed and prediction accuracy: deeper integration of learning and mining

THANK YOU VERY MUCH!!

71

282

Reference

[1] H. Kashima, K. Tsuda, and A. Inokuchi, "Marginalized kernels between labeled graphs," ICML, pp. 321-328, 2003.
[2] H. Saigo, M. Hattori, H. Kashima, and K. Tsuda, "Reaction graph kernels predict EC numbers of unknown enzymatic reactions in plant secondary metabolism," BMC Bioinformatics, vol. 11, Suppl 1, p. S31, Jan. 2010.
[3] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, "Weisfeiler-Lehman graph kernels," Journal of Machine Learning Research, vol. 12, pp. 2539-2561, 2011.
[4] K. Tsuda, T. Kin, and K. Asai, "Marginalized kernels for biological sequences," Bioinformatics, vol. 18, Suppl 1, pp. S268-S275, Jul. 2002.

72

283
