Machine Learning Summer School Kyoto 2012
Graph Mining
Chapter 1: Data Mining
National Institute of Advanced Industrial Science and Technology Koji Tsuda
1
Labeled graphs are general and powerful data structures that can be used to represent diverse kinds of objects such as XML code, chemical compounds, proteins, and RNAs. In these 10 years, we saw significant progress in statistical learning algorithms for graph data, such as supervised classification, clustering and dimensionality reduction. Graph kernels and graph mining have been the main driving force of such innovations. In this lecture, I start from basics of the two techniques and cover several important algorithms in learning from graphs. Successful biological applications are featured. If time allows, I will also cover recent developments and show future directions.
National Institute of Advanced Industrial Science and Technology (AIST)
Research institute under METI for 6 fields:
Life Science, Informatics, Environment and Energy, Nanotechnology and Materials, Geological Survey, Standard
2,500 scientists, 700 administrative staff, 5,200 scientists from outside
Since 2001
2
2
Computational Biology Research Center
Odaiba, Tokyo
Developing Novel Methods and Tools for
Genome Informatics Molecular Informatics Cellular Informatics
Diverse Collaboration with Companies and Universities
3
3
Koji Tsuda: Short Bio
1998 PhD in Kyoto Univ., joined ETL
2000 GMD FIRST (Berlin, Germany)
2001 Joined CBRC/AIST
2003-2004, 2006-2008 Max Planck Institute for Biological Cybernetics, Tuebingen, Germany
2009 Back to CBRC, Machine Learning Group Leader
4
4
About this lecture
How to extract knowledge from structured data Itemset mining, tree mining, graph mining
Reverse Search Principle
Learning from structured data Kernels for structured data
5
5
Chapter 1 (Data Mining)
1. Structured Data in Biology
2. Itemset Mining
3. Closed Itemset Mining
4. Ordered Tree Mining
5. Unordered Tree Mining
6. Graph Mining
7. Dense Module Enumeration
6
6
Agenda 2 (Learning from Structured Data)
1. Preliminaries
2. Graph Clustering by EM
3. Graph Boosting
4. Regularization Paths in Graph Classification
5. Itemset Boosting for Predicting HIV Drug Resistance
7
7
Agenda 3 (Kernel)
1. Kernel Method Revisited
2. Marginalized Kernels (Fisher Kernels)
3. Marginalized Graph Kernels
4. Weisfeiler-Lehman Kernels
5. Reaction Graph Kernels
6. Concluding Remarks
8
8
Part 1: Structured Data in Biology
9
9
Biological Sequences
DNA sequences (A,C,G,T): gene finding, splice sites
RNA sequences (A,C,G,U): microRNA discovery, etc.
Amino acid sequences (20 symbols): remote homolog detection, fold recognition, etc.
10
Structures hidden in sequences (I)
Exon/intron of DNA (Gene)
11
11
Structures hidden in sequences (II)
RNA Secondary Structure
Protein 3D Structures
It is crucial to infer hidden structures and exploit them for classification
Biological Graphs
12
12
Molecular graphs
Structure: Thiamine (Vitamin B1) molecular graph
graph = a set of dots & lines (or nodes & edges)
[Figure: molecular graph of thiamine; node labels hydrogen, carbon, oxygen, nitrogen, sulfur; edge labels single bond, double bond; abstraction from explicit to implicit hydrogens]
13
Gene Expression Data
Measurement of many mRNAs in the cell
Rough estimate of the amount of proteins
Time-series or not
Snapshot of the underlying dynamic system
14
14
Biological Networks
Protein-protein physical interaction Metabolic networks Gene regulatory networks Network induced from sequence similarity
Thousands of nodes (genes/proteins) 100000s of edges (interactions)
15
15
Physical Interaction Network
16
16
Metabolic Network
17
17
Many possible prediction problems..
18
18
Part 2: Itemset mining
19
19
Data Mining
A formal study of efficient methods for extracting interesting rules and patterns from massive data
[Figure: XML tree of person records (name, @age, @id, email, tel)]
Frequent itemset mining
(Agrawal and Srikant 1994)
Closed pattern mining Structured data mining (Sequence, Trees, and Graphs)
20
20
Frequent Itemset Mining
[Agrawal, Srikant, VLDB'94]
Finding all "frequent" sets of elements (items) appearing times or more in a database
Minsup = 2
1 t1 t2 t3 t4 t5 2 3 4 5
Frequent sets , 1, 2, 3, 4, 12, 13, 14, 23, 24, 124
X = {2, 4} appears three times, thus frequent
Frequent sets
21
database
The itemset lattice (2, )
21
Definitions: Database
Alphabet of items: a set I = { 1, ..., n } of items (elements)
Transaction database: a set T = { t1, ..., tm } of subsets of I; each subset t is called a transaction
Example: I = {1, 2, 3, 4}; transactions t1 = {1, 3}, t2 = {2, 4}, t3 = {1, 2, 3, 4}, t4 = {1, 2, 4}
22
Definitions: Frequent sets
Itemset X appears in transaction t: X ⊆ t
Occurrence of X in database T: Occ(X, T) = { t ∈ T : X ⊆ t }
Frequency of X: Fr(X, T) = | Occ(X, T) |
Minimum support (minsup): σ, 0 ≤ σ ≤ |T|
X is frequent in T if Fr(X, T) ≥ σ
Example (I = {1, 2, 3, 4}; t1 = {1, 3}, t2 = {2, 4}, t3 = {1, 2, 3, 4}, t4 = {1, 2, 4}):
Occ(3, T) = {t1, t3}, Fr(3, T) = 2; Occ(24, T) = {t2, t3, t4}, Fr(24, T) = 3
23
Market Basket Data
Popular application of itemset mining Business and Market data analysis
Transaction Data of purchase
[Table: 0/1 purchase matrix; rows = transaction IDs 001-010, columns = items Chips, Mustard, Sausage, Softdrink, Beer]
Each row is a transaction or a "basket"
Meaning of transaction 003: "Customer 003 bought Chips and Sausage together in his basket"
24
24
DAG of itemsets: Hasse diagram
Edge: adding one item
[Figure: Hasse diagram of all itemsets over {1, 2, 3, 4}, from the empty set up to {1, 2, 3, 4}]
25
25
Enumeration Tree by Lexicographical Order
Need a tree to avoid duplication
[Figure: enumeration tree over the same itemsets, children generated in lexicographical order]
26
26
Backtracking Algorithm: FP Growth etc.
Monotonicity: support only decreases along the tree
Depth-first traversal; prune if support < σ (see the sketch below)
Frequent sets
27
27
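A minimal sketch of the depth-first backtracking enumeration described on this slide, assuming transactions are Python sets. The function names and the naive occurrence scan are illustrative choices, not part of FP-Growth or the other cited algorithms:

```python
def frequent_itemsets(transactions, minsup):
    """Depth-first enumeration of frequent itemsets with support-based pruning."""
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    results = {}

    def expand(itemset, start):
        for idx in range(start, len(items)):
            cand = itemset | {items[idx]}
            sup = support(cand)
            if sup < minsup:          # anti-monotonicity: no superset can be frequent
                continue
            results[frozenset(cand)] = sup
            expand(cand, idx + 1)     # extend only in lexicographic order -> no duplicates

    expand(set(), 0)
    return results

# Database from the definitions slide, minsup = 2
db = [{1, 3}, {2, 4}, {1, 2, 3, 4}, {1, 2, 4}]
print(frequent_itemsets(db, minsup=2))
```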
Association Rule Mining
Confidence: Supp(A ∪ B) / Supp(A)
Probability of B, Given A
What item is likely to be bought when A
is bought
Search for rules with large support and large confidence; post-processing of itemset mining (see the sketch below)
28
28
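For illustration, a small sketch of the confidence computation defined above, reusing the example database from the earlier slides:

```python
def confidence(A, B, transactions):
    """Confidence of the rule A -> B: Supp(A | B) / Supp(A)."""
    supp = lambda X: sum(1 for t in transactions if X <= t)
    return supp(A | B) / supp(A)

# Confidence of {2} -> {4} on the example database
db = [{1, 3}, {2, 4}, {1, 2, 3, 4}, {1, 2, 4}]
print(confidence({2}, {4}, db))   # 3/3 = 1.0
```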
Summary: Itemset mining
Itemset mining is the simplest of all mining algorithms
Need to maintain the occurrence of each pattern in the database
A tree defined by lexicographical order is (implicitly) used
29
29
Part 3: Closed Itemset mining
30
30
Problem in Frequent Pattern Mining
Huge number of frequent itemsets
Hard to analyze
Most of them are similar
[Figure: an input transaction database T (t1..t5 over items 1..5) and the huge number of frequent itemsets discovered with minsup = 2]
31
Solution: Closed Pattern Mining
Find only closed patterns
Observation: most frequent itemsets X can be extended without changing their occurrence set by adding new elements
Def. ([Pasquier et al., ICDT'99]): An itemset X is a closed set if and only if there is no proper superset of X with the same frequency (thus the same occurrence set).
32
32
Closed Pattern Mining
A closed itemset is the maximal set among all itemsets with the same occurrences
Equivalence class: [X] = { Y | Occ(X) = Occ(Y) }
[Figure: database (1: ABCE, 2: AC, 3: BE, 4: BCE) and the equivalence classes of itemsets w.r.t. occurrences; closed sets are the maximal sets]
33
33
Brute-force: Stupid Baseline
ALGORITHM Bruteforce
First, generate all frequent itemsets; then check them one by one via a maximality test
Maximality test for each candidate frequent set X:
For each element e ∉ X, compare Freq(X ∪ {e}) with Freq(X); if they are equal for some e, then X is not maximal, so reject X
34
34
Bruteforce
STEP 1) First, generate all frequent sets
[Figure: database T (1: ABCE, 2: AC, 3: BE, 4: BCE); all itemsets in T with their occurrence sets (ids); equivalence classes w.r.t. occurrences; closed sets are the maximal sets]
35
35
Bruteforce
STEP 1) First, generate all frequent sets
STEP 2) Make the closedness test for each set
[Figure: the same lattice as above, with the closedness test applied to each itemset]
36
36
Bruteforce
STEP 1) First, generate all frequent sets
STEP 2) Make the closedness test for each set
STEP 3) Finally, extract all closed sets
[Figure: the same lattice as above, with the closed (maximal) sets extracted]
All closed sets are found!
37
37
Complexity of Enumeration Algorithms
Number of patterns is usually exponential in the input size
Delay: time between two consecutive pattern outputs
Brute-force has exponential delay w.r.t. the pattern size
[Figure: timeline showing the delay D between consecutive outputs and the total running time T]
38
38
To achieve linear delay,
Must jump from closed set to closed set
How to define the search tree? Reverse search! (Avis and Fukuda 1996)
[Figure: search tree over closed sets such as {2}, {7,9}, {1,7,9}, {2,5}, {2,7,9}, {1,2,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9}]
39
Reverse Search: It's a must
A general mathematical framework to design enumeration algorithms
Can be used to prove the correctness of the algorithm
Popular in computational geometry
Data mining algorithms can be explained with remarkable simplicity
40
40
Often, search space comes as a DAG
Naive backtracking = duplication
Duplication check by marking = exponential memory
How to visit all nodes without duplication?
[Figure: DAG-shaped search space over the closed sets shown above]
41
Reduction Map
Mapping from a child to its parent
Reduction map for a closed itemset X: shrink the itemset until its occurrence set changes, then take the closure operation
[Figure: closed set X, shrink, then closure(X) = parent of X]
42
42
Closure Operation
closure(X) of a set X:
closure(X) = ∩ { t ∈ T : X ⊆ t }  (the intersection of all transactions in T in which X occurs as a subset)
[Figure: closure maps every non-closed set to the closed set of its equivalence class (itemsets with the same occurrence set)]
43
43
Example of Closure Operation
Database T
Non-closed itemset: (B,C); occurrence: {1, 4}; take the intersection of transactions 1 and 4
Database T: 1: ABCE, 2: AC, 3: BE, 4: BCE
(A,B,C,E) ∩ (B,C,E) = (B,C,E), which is the closed itemset (see the sketch below)
44
44
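A minimal sketch of the closure operation on the example above; the set-based representation is an assumption made for illustration:

```python
def closure(itemset, transactions):
    """Closure of an itemset: intersect all transactions that contain it."""
    occ = [t for t in transactions if itemset <= t]
    if not occ:                      # itemset never occurs; return it unchanged by convention
        return frozenset(itemset)
    closed = set(occ[0])
    for t in occ[1:]:
        closed &= t
    return frozenset(closed)

# Example from the slide: closure({B, C}) in {ABCE, AC, BE, BCE} is {B, C, E}
db = [set("ABCE"), set("AC"), set("BE"), set("BCE")]
print(sorted(closure(set("BC"), db)))   # -> ['B', 'C', 'E']
```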
By applying the reduction map to all nodes, the enumeration tree is defined.
But the arrows point in the reverse direction..
[Figure: the reduction map drawn on the DAG of closed sets, defining the enumeration tree]
45
Children generation
In backtracking, one has to generate all children of the current node: the inverse of the reduction map
Generate all children candidates; apply the reduction map to each; remove a candidate if it does not map back to the parent
[Figure: parent node with children candidates; only those mapping back are real children]
46
46
Reverse Search Theorem
To prove the correctness, prove the following
The reduction map is uniquely defined on all nodes
By applying the reduction map repeatedly, one can reach the root node from any node
Children generation is the inverse of the reduction map
Easy to check !
47
47
LCM = Linear Time Closed Sets Miner (Uno et al., 2003)
Prefix Preserving Closure Extension
Jump! = children generation from the reduction map
Linear delay!
48
48
Closure Extension
Repeat: add an item and take the closure
Step 1: start from a closed set X and add an item i: Y = X ∪ {i}
Step 2: take the closure to obtain the closed set Z = closure(X ∪ {i})
[Figure: non-closed sets mapped to closed sets by the closure operation]
49
49
Naive Closure Extension: Duplication!
[Figure: DAG of closed itemsets; naive closure extension can reach the same closed set from multiple parents, causing duplication]
50
50
Prefix Preserving Closure Extension
Ensure that any closed set is generated from a unique parent
Def. Closure tail of a closed itemset P: the minimum j such that closure(P ∩ {1,…,j}) = P
Def. H = closure(P ∪ {i}) is a PPC-extension of P if i > closure tail and H ∩ {1,…,i-1} = P ∩ {1,…,i-1}
51
51
Enumeration tree by PPC extension
[Figure: the closure-extension DAG and the PPC-extension tree over the same closed itemsets; PPC extension selects a unique parent for each closed set]
52
Linear Delay in Pattern Size
(Uno, Uchida, Asai, Arimura, Discovery Science 2004)
Theorem: The algorithm LCM finds all frequent closed sets X appearing in a transaction database D in O(lmn) time per closed set, without duplicates, where l is the maximum length of a transaction in D, n is the total size of D, and m is the size of the pattern X.
Note: the output-polynomial time complexity of closed set discovery was shown by [Makino et al., STACS 2002]
53
53
Summary: Closed Itemset Mining
Closure extension: jump from closed set to closed set
LCM: linear delay
Very fast in practice, too
Winner of FIMI'04 (Frequent Itemset Mining Implementations Workshop)
Relation to clique enumeration (Arimura, Uno, SDM 2009)
54
54
Part 4: Ordered Tree Mining
55
55
Frequent Ordered Tree Mining
Natural extension of the frequent itemset mining problem to trees
Finding all frequent substructures in a given collection of labeled trees
How to enumerate them without duplicates?
Efficient DFS algorithms: FREQT [Asai, Arimura, SIAM DM 2002], TreeMiner [Zaki, ACM KDD 2002]; both use the rightmost expansion technique
56
56
Labeled Ordered Trees
Rooted: the tree has a designated root
Ordered: siblings are ordered from left to right
Labeled: each node has a label
A model of HTML/XML, hierarchical records, and dependency trees of natural language texts
[Figure: XML tree of person records (name, @age, @id, email, tel) drawn as a labeled ordered tree]
57
Matching between trees
Pattern tree T matches a data tree D
(T occurs in D) if there is a matching function f from T into D such that:
f is 1-to-1
f preserves the parent-child relation
f preserves the (indirect) sibling relation
f preserves labels
[Figure: pattern tree T, data tree D, and a matching function f between them]
58
58
Frequency of a pattern tree
A root occurrence of pattern T:
The node to which the root of T maps by a matching function
The frequency fr(T) = #root occurrences of T in D
[Figure: pattern tree T and data tree D; root occurrence list OccD(T) = {2, 8}]
59
59
Frequent Tree Mining Problem
Given: a collection S of labeled ordered trees and a minimum frequency threshold σ
Task: discover all frequent ordered trees in S (with frequency no less than σ) without duplicates
Example: a minimum frequency threshold (min-sup) σ = 50%
60
60
Key: How to enumerate ordered trees without duplicates?
A naive algorithm:
Starting from the smallest tree, grow a pattern tree by adding a new node one by one
Drawbacks:
Exponentially many different ways to generate the same pattern tree; an explicit duplication test is needed
How to overcome this difficulty?
[Figure: the same small tree over labels A, B, C generated along several different orders]
61
61
An idea: DFS Code of Ordered Tree
Depth-label sequence in the preorder traversal
(depth first search)
S = ((d(v1), l(v1)), …, (d(vk), l(vk)))
[Figure: ordered tree with labels A, B, C and its DFS code (0A, 1B, 2A, 3C, 2B, 1B, 2C); a sketch follows below]
62
62
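A small sketch of the depth-label DFS code by preorder traversal; the nested (label, children) tuple representation of ordered trees is an assumption made for illustration:

```python
def dfs_code(tree, depth=0):
    """Depth-label sequence of an ordered tree in preorder (DFS) order.

    `tree` is a (label, [children]) pair; children keep their left-to-right order.
    """
    label, children = tree
    code = [(depth, label)]
    for child in children:
        code.extend(dfs_code(child, depth + 1))
    return code

# Tree from the slide: code (0A, 1B, 2A, 3C, 2B, 1B, 2C)
t = ("A", [("B", [("A", [("C", [])]), ("B", [])]), ("B", [("C", [])])])
print(dfs_code(t))
```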
Rightmost expansion
Extending the DFS code = attaching a new node on the rightmost branch: (d1,l1), …, (dn,ln), (dn+1,ln+1)
[Figure: a pattern tree S and the trees obtained by attaching a new node D at positions on its rightmost branch]
63
Searching frequent ordered trees
Enumerate all frequent ordered trees by backtracking
Trees are extended only by rightmost extension, so there is no duplication
[Figure: search tree over pattern trees with labels A and B; infrequent branches are pruned]
64
64
Summary: Ordered tree mining
Convert a tree to a string (DFS code)
Adding an element to the code = rightmost extension
It was relatively easy because the nodes are ordered
How about the unordered case?
65
65
Part 5: Unordered Tree Mining
66
66
Frequent Unordered Tree Mining
Unordered trees: a non-trivial subclass of general graphs
Problem: exponentially many isomorphic ordered representations of the same unordered tree
Efficient DFS algorithms: Unot [Asai, Arimura, DS'03], NK [Nijssen, Kok, MGTS'03]
[Figure: several ordered drawings of the same unordered tree over labels A, B, C]
67
67
Canonical Ordered Representation
Given ordering among siblings, depth-first search (DFS) code is defined
Code(T) = ((depth(v1), label(v1)), …, (depth(vk), label(vk)))
[Figure: tree T with Code(T) = (0A, 1B, 2A, 3C, 2B, 1B, 2C)]
68
68
Canonical representation
Ordered tree T with lexicographically maximum code
[Figure: four ordered trees T1..T4 representing the same unordered tree, with codes (0A,1B,2A,3C,2B,1B,2C), (0A,1B,2B,2A,3C,1B,2C), (0A,1B,2C,1B,2A,3C,2B), (0A,1B,2C,1B,2B,2A,3C); the tree with the lexicographically maximum code is the canonical representation]
69
Left Heavy Condition (Nakano and Uno, 2002)
T(v): subtree rooted on v
Ordered tree is canonical if and only if
Code(T(v1)) ≥ Code(T(v2))  (a sketch follows below)
for any pair of sibling nodes v1 (left) and v2 (right)
70
70
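A hedged sketch of the left-heavy check. The exact ordering convention for comparing (depth, label) sequences is not spelled out on the slide, so the comparison is passed in as a parameter rather than fixed:

```python
def subtree_code(tree, depth=0):
    """Depth-label DFS code of a subtree, as in the sketch above."""
    label, children = tree
    code = [(depth, label)]
    for child in children:
        code.extend(subtree_code(child, depth + 1))
    return code

def is_left_heavy(tree, larger_or_equal=lambda a, b: a >= b):
    """Left-heavy condition: for every pair of adjacent siblings, the code of the
    left subtree must compare as larger than or equal to that of the right one.
    The comparison function is a design choice (depth/label ordering is assumed here)."""
    label, children = tree
    codes = [subtree_code(c) for c in children]
    for left, right in zip(codes, codes[1:]):
        if not larger_or_equal(left, right):
            return False
    return all(is_left_heavy(c, larger_or_equal) for c in children)
```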
Reduction Map
How to define the parent from a child in the enumeration tree:
Generate the canonical tree of size k-1 from the canonical tree of size k
Remove the last element of the DFS code
Code(T) = (0A,1B,2A,3C,2B,1B,2C)
71
71
Children Generation
Generate children candidates by rightmost extension
Check the maximality of each candidate based on the left-heavy property
Discard a candidate if it is not maximal
[Figure: pattern S and its rightmost-extension candidates]
72
72
Maximality Check by Left Heavy Property
The code of a left subtree must be no smaller than that of its right sibling
It suffices to check only the rightmost sibling and the second rightmost sibling
73
73
Complexity of UNOT
Delay per pattern: O(k b² m n)
k: pattern size
b: branching factor of the data tree
m: size of data tree
n: database size
74
74
Summary: Mining Unordered Tree
The following three elements are necessary for a mining algorithm:
Canonical representation
Reduction map
Children generation, including the maximality check
Backtracking on the resulting enumeration tree
75
75
Part 6: Graph Mining
76
76
Frequent Subgraph Mining
Graph Database
Enumerate all subgraphs occurring more than 3 times
Patterns
77
77
Gspan (Yan and Han, 2002)
Most widely used graph mining algorithm Can be interpreted with reverse search principle
Canonical representation? Reduction map? Children generation?
78
78
DFS Code for Graph
Depth-first search with preorder node labeling; each edge is a 5-tuple (src, dest, src_label, edge_label, dest_label)
Edges not traversed forward become backward edges (dest < src)
Elements are sorted in the code
[Figure: labeled graph with DFS code {(0,1,A,a,A), (1,2,A,a,B), (2,0,B,a,A), (2,3,B,b,A)}]
79
79
Canonical Representation
Multiple DFS codes: different starting point and children ordering Minimum DFS Code: Lexicographically Minimum
80
80
Reduction Map
Removing the tail of minimum DFS code preserves minimality
min DFS: {e1, e2, e3, e4}
{e1, e2, e3}
81
81
Children Generation
Create candidates by adding an element to DFS code Check if each candidate is minimum If not, remove it
Children candidates
Parent
real child
real child
82
82
Minimality Check
Reconstruct the graph from DFS Code Derive the minimum DFS Code by trying all DFSs
Speed up by traversing minimal label only
{(0,1,A,a,A), (1,2,A,a,B), (2,0,B,a,A), (2,3,B,b,A)}
If the minimal code is not identical to the original, prune it
[Figure: the graph reconstructed from the DFS code above]
83
83
Summary: Graph Mining
gSpan is a typical example of reverse search
Not explained: closed tree mining, closed graph mining
Delay is exponential in the pattern size; this cannot be avoided due to the NP-hardness of subgraph isomorphism
Yet it scales to millions of graphs for sparse molecular graphs
Applications covered in next chapter
84
84
Part 7: Dense Module Enumeration
85
85
Biological Motivation
Most cellular processes are performed by multi-component protein complexes
An increasing amount of experimental protein interaction data is available
Our approach:
Predict complexes (modules) from the protein interaction network
Exploit additional information given by gene expression data, evolutionary conservation, phenotypic profiles, etc.
29/08/2012
86
86
29/08/2012
87
87
Protein interaction networks
Node: Proteins Edge: Physical interaction of two proteins
Challenge 1: false negative edges, so go beyond clique search!
Challenge 2: false positive edges, so assign confidence scores to edges
Find node sets with a high density of high-confidence edges
29/08/2012
88
88
Module Discovery
Previous work:
Clique percolation [Palla et al., 2005]
Partitioning: hierarchical clustering [Girvan and Newman, 2001], flow simulation [Krogan et al., 2006], spectral methods [Newman, 2006]
Heuristic local search [Bader and Hogue, 2003]
Our approach: exhaustively enumerate all dense subgraphs efficiently
29/08/2012
89
89
Motivation for Enumeration Approach
Detects overlapping modules
Allows specifying a minimum density for the resulting modules
Outputs all modules satisfying the density threshold
29/08/2012
90
90
Differential Expression Criterion
Incorporation of gene expression: the presence of proteins depends on the cell type
Additional criteria for modules:
e_1: number of conditions where the whole module is expressed
e_0: number of conditions where the whole module is not expressed
Fix minimum values for both quantities
29/08/2012
91
91
Problem Formalization
29/08/2012
92
92
Typical Enumeration Algorithms
Itemset mining, graph mining, etc.: enumerate all entities whose frequency >= 10
Set up a search tree
Tree pruning by anti-monotonicity: an entity's frequency is never larger than that of its sub-entities
Not generated
29/08/2012
93
93
Network Example
29/08/2012
94
94
Graph-shaped Search Space of Modules
29/08/2012
95
95
Choosing a search tree
For efficient search, a search tree is needed There are many possible search trees Default: Lexicographical ordering
29/08/2012
96
96
Density is not a monotonic criterion
A subset of a dense set is not necessarily dense
Density does not decrease monotonically along a path, so pruning is impossible
Example path: {1} has density 1.0, {1,2} has density 0.1, {1,2,3} has density 0.5
29/08/2012
97
97
Question
Is it possible to make a search tree such that density decreases monotonically?
29/08/2012
98
98
Question
Is it possible to make a search tree such that density decreases monotonically? YES! Use Reverse Search (Avis and Fukuda 1993)
29/08/2012
99
99
Reverse search (Avis and Fukuda, 1993)
Specify a search tree in the graph-shaped search space
Reduction mapping: a rule to generate the parent from a child
Remove the node with the smallest degree within the module; density does not decrease by the removal (see the sketch below)
[Figure: child module mapped to its parent by removing the lowest-degree node]
29/08/2012 100
100
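A minimal sketch of the density criterion and the reduction map (remove the node of smallest degree within the module). The adjacency-dict representation and the deterministic tie-breaking are assumptions made for illustration:

```python
def density(module, adj):
    """Edge density of the induced subgraph: edges present / edges possible."""
    nodes = list(module)
    k = len(nodes)
    if k < 2:
        return 1.0
    edges = sum(1 for i in range(k) for j in range(i + 1, k)
                if nodes[j] in adj.get(nodes[i], ()))
    return edges / (k * (k - 1) / 2)

def parent(module, adj):
    """Reduction map: drop the node with the smallest degree inside the module."""
    def inner_degree(v):
        return sum(1 for u in module if u != v and u in adj.get(v, ()))
    victim = min(sorted(module), key=inner_degree)   # deterministic tie-breaking
    return frozenset(module) - {victim}

# Tiny example: removing the lowest-degree node raises the density
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
m = frozenset({1, 2, 3, 4})
print(density(m, adj), parent(m, adj))   # 0.67 and {1, 2, 3} (density 1.0)
```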
Search Tree is uniquely specified by the reduction mapping
Condition: Every node should converge to the root node by applying the reduction mapping repeatedly
29/08/2012
101
101
Enumeration algorithm by reverse search
A set of children is generated from a parent node
Try every possible child and keep those that map back to the parent under the reduction mapping
Prune if no children exist
29/08/2012
102
102
Constraint Integration
Differential expression constraint
Monotonicity: e_0 and e_1 decrease with extension of U
Can be used for extra pruning without difficulty
29/08/2012
103
103
Statistical Significance of a module
k: the number of nodes in the module
The density of the module
The number of modules of size k with density at least that of the module
p-value: the probability that a random selection of k nodes yields a denser module
29/08/2012
104
104
Benchmarking in yeast complex discovery
Combined interactions from CYGD-Mpact and DIP: interactions among 3559 nodes
Confidence weights on edges due to (Jansen, 2003)
Methods in comparison: clique detection (Clique), Clique Percolation Method (CPM), Markov Clustering
Modules compared with MIPS complexes
29/08/2012
105
105
29/08/2012
106
106
Evolutionary Conserved Yeast Modules
Use ortholog profiles (10 species, InParanoid)
Density >= 50%, at least three orthologs
1917 modules in 30 minutes
Recovered evolutionarily conserved complexes: 20S proteasome, 19S/22S regulator, COPI vesicle coat complex, DNA polymerase I and II subunits, translation initiation factor eIF2B complex
They could not be recovered by simple DME due to low density
29/08/2012 107
107
MIPS Complexes discovered by DME (Conserved in Evolution)
29/08/2012
108
108
Phenotype-associated yeast modules
Use growth phenotypic profiles (21 conditions, Dudley et al., 2005); growth defect in at least one condition
Each of the 13 highest-ranking modules covers the large subunit of the mitochondrial ribosome
Found an additional protein, Mhr1
Exactly recovered the nucleoplasmic THO complex (Hpr1, Mft1, Rlr1, Thp2): transcription elongation, hyperrecombination, growth defect under ethanol
29/08/2012
109
109
Phenotype Associated Modules Large subunit of mitochondrial ribosome
Mhr1: involved in homologous recombination of the mitochondrial genome
29/08/2012 110
110
Human Settings
Tissue-specific gene expression data (Su et al., 2004): 79 different tissues
Consistently expressed in 3 tissues, not in 10 tissues
7763 proteins, density >= 35%, 5 minutes
1021 maximal modules
MIPS human complex database (Ruepp et al., to appear)
29/08/2012
111
111
Human-expression result
Around the MCM complex, we found inter-complex relationships with ORC, CDC7, Toposome, and the PLK1 protein
Module Uqcrc1, Uqcrc2, Uqcrb, Cyc1 (lg p = -13): no overlap with the MIPS ubiquinol-cytochrome c reductase complex
SCF E3 ubiquitin ligase complex (marks proteins for degradation): 5 different modules with different tissue specificity
Peripheral proteins (substrate recognition particles): target proteins are selected in a tissue-specific manner! Natural killer cells have all particles
29/08/2012
112
112
High ranking modules around the MCM complex
Expressed in bone marrow cells; not expressed in brain, liver, kidney, etc.
29/08/2012 113
113
Tissue Specific organization of the SCF ligase complex
29/08/2012
114
114
Summary: Dense Module Enumeration
Novel module enumeration algorithm based on reverse search Combination with other information sources Statistical significance of dense modules Successfully applied to yeast/human protein interaction networks
29/08/2012
115
115
Reference
[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. 20th Int. Conf. Very Large Data Bases, 1994.
[2] T. Asai, H. Arimura, T. Uno, and S. Nakano, "Discovering frequent substructures in large unordered trees," Discovery Science, 2003.
[3] T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa, "Efficient substructure discovery from large semi-structured data," in SDM, 2002.
[4] D. Avis and K. Fukuda, "Reverse search for enumeration," Discrete Applied Mathematics, vol. 65, no. 1-3, pp. 21-46, 1996.
[5] E. Georgii, S. Dietmann, T. Uno, P. Pagel, and K. Tsuda, "Enumeration of condition-dependent dense modules in protein interaction networks," Bioinformatics, vol. 25, no. 7, pp. 933-940, Apr. 2009.
[6] S. Nijssen and J. Kok, "Efficient discovery of frequent unordered trees," MGTS, 2003.
[7] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, "Discovering frequent closed itemsets for association rules," Database Theory (ICDT'99), 1999.
[8] I. Takigawa and H. Mamitsuka, "Efficiently mining δ-tolerance closed frequent subgraphs," Machine Learning, vol. 82, no. 2, pp. 95-121, Sep. 2010.
[9] X. Yan and J. Han, "gSpan: graph-based substructure pattern mining," in Proc. 2002 IEEE International Conference on Data Mining, pp. 721-724.
[10] M. Zaki, "Efficiently mining frequent trees in a forest: algorithms and applications," IEEE Transactions on Knowledge and Data Engineering, 2005.
29/08/2012 116
116
Machine Learning Summer School Kyoto 2012
Graph Mining
Chapter 2: Learning from Structured Data
National Institute of Advanced Industrial Science and Technology Koji Tsuda
1
117
Agenda 2 (Learning from Structured Data)
1. Preliminaries
2. Graph Clustering by EM
3. Graph Boosting
4. Regularization Paths in Graph Classification
5. Itemset Boosting for Predicting HIV Drug Resistance
2
118
Part 1: Preliminaries
3
119
Clustering Graphs
4
120
Graph Regression
Training: graphs with real-valued outputs (e.g., 0.7, -0.2, -0.5)
Test: a graph with an unknown output (?)
5
121
Substructure Representation
0/1 vector of pattern indicators Huge dimensionality! Need Graph Mining for selecting features
patterns
6
122
Graph Mining
Frequent Substructure Mining
Enumerate all patterns occurring in at least m graphs
Indicator of pattern k in graph i (0/1)
Support(k): number of occurrences of pattern k
7
123
Enumeration on Tree-shaped Search Space
Each node has a pattern Generate nodes from the root:
Add an edge at each step
8
124
Support(g): # of occurrence of pattern g
Tree Pruning
Anti-monotonicity:
If support(g) < m, stop exploring!
Not generated
9
125
Gspan
(Yan and Han, 2002)
Efficient Frequent Substructure Mining Method DFS Code
Efficient detection of isomorphic patterns
10
126
Depth First Search (DFS) Code
[Figure: a labeled graph G and the DFS code tree on G; isomorphic patterns yield multiple DFS codes, and non-minimum DFS codes are pruned]
11
127
Discriminative patterns
w_i > 0: positive class; w_i < 0: negative class
Weighted substructure mining: patterns with a large frequency difference between classes
Not anti-monotonic: use a bound
12
128
Multiclass version
Multiple weight vectors
(graph belongs to class ) (otherwise)
Search patterns overrepresented in a class
13
129
Basic Bound
xij : Occurrence of pattern j If k is supergraph of pattern j,
14
130
Pruning Condition
Summarizing the bound for all classes,
If it is negative, the search tree can be pruned safely
15
131
Summary: Preliminaries
Various graph learning problems
Supervised/Unsupervised
Discovery of salient features by graph mining Actual speed depends on the data
Faster for..
Sparse graphs Diverse kinds of labels
16
132
Part 2: EM-based clustering of graphs
17
133
EM-based graph clustering
Motivation
Learning a mixture model in the feature space of patterns Basis for more complex probabilistic inference
L1 regularization & Graph Mining E-step -> Mining -> M-step
18
134
Probabilistic Model
Binomial Mixture
Each Component
:Feature vector of a graph (0 or 1) :Mixing weight for cluster :Parameter vector for cluster
19
135
Ordinary EM algorithm
Maximizing the log likelihood
E-step: get the posterior probabilities
M-step: estimate the parameters using the posteriors
Both are computationally prohibitive (!)
20
136
Regularization
L1-Regularized log likelihood
Baseline constant
ML parameter estimate using single binomial distribution
In solution, most parameters exactly equal to constants
21
137
E-step
Active pattern
E-step computed only with active patterns (computable!)
22
138
M-step
Putative cluster assignment; each parameter is solved separately
Naive way: solve it for all parameters and then identify the active patterns
Instead: use graph mining to find the active patterns
23
139
Solution
Occurrence probability in a cluster
Overall occurrence probability
24
140
Solution
25
141
Important Observation
For active pattern k, the occurrence probability in a graph cluster is significantly different from the average
26
142
Mining for Active Patterns
Active pattern
Equivalently written as
F can be found by graph mining! (multiclass)
27
143
Experiments: RNA graphs
Stem as a node Secondary structure by RNAfold 0/1 Vertex label (self loop or not)
28
144
Clustering RNA graphs
Three Rfam families
Intron GP I (Int, 30 graphs) SSU rRNA 5 (SSU, 50 graphs) RNase bact a (RNase, 50 graphs)
Three bipartition problems
Results evaluated by ROC scores (Area under the ROC curve)
29
145
Examples of RNA Graphs
30
146
ROC Scores
31
147
No of Patterns & Time
32
148
Found Patterns
33
149
Summary (graph EM)
Substructure representation is better than paths
Probabilistic inference helped by graph mining
Extension to the Dirichlet mixture model reported in Tsuda et al., SDM 2008
Possible extensions: graph PCA, LFD, CCA, semi-supervised learning
34
150
Part 3: Graph Boosting
35
151
Graph classification problem in chemoinformatics
Known as SAR problem in chemical informatics
(Quantitative) Structure-Activity Relationship analysis
Given a graph, predict a class-label (+1 or -1)
Typically, features (descriptors) are given e.g., Dragon Descriptors, JOELIB2
36
152
SAR with conventional descriptors
[Table: descriptor values (#atoms, #bonds, #rings) for several molecules, each with an activity label +1 or -1]
37
153
Motivation of Graph Boosting
Descriptors are not always available
New features obtained from informative patterns (i.e., subgraphs)
Greedy pattern discovery by boosting + gSpan
Linear Programming (LP) boosting: reduces the number of graph mining calls; faster than AdaBoost
Accurate prediction & interpretable results
38
154
Molecule as a labeled graph
[Figure: a molecule drawn as a labeled graph of C and O atoms]
39
155
SAR with patterns
[Table: molecules as rows and subgraph patterns as columns; 0/1 pattern indicators together with activity labels +1 / -1]
40
156
Sparse classification in a very high dimensional space
G: all possible patterns (intractably large)
|G|-dimensional feature vector x for a molecule
Linear classifier: f(x) = Σ_{j=1}^{d} α_j x_j
Use an L1 regularizer to make α sparse and select a tractable number of patterns
41
157
Problem formulation
Minimize the sum of hinge losses over the training examples plus the L1 regularizer (with slack variables)
42
158
Dual LP
Primal: Huge number of weight variables Dual: Huge number of constraints
Dual problem
43
159
Column Generation Algorithm for LP Boost (Demiriz et al., 2002)
Start from the dual with no constraints Add the most violated constraint each time Guaranteed to converge
Constraint Matrix
Used Part
44
160
Finding the most violated constraint
Constraint for a pattern (shown again)
Finding the most violated one
Searched by weighted substructure mining
45
161
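For illustration only, a sketch of the gain that weighted substructure mining maximizes when searching for the most violated constraint; in gBoost this maximization is done inside the mining search tree, not by scanning an explicit pattern list as below:

```python
def most_violated_pattern(patterns, X, y, u):
    """Pick the pattern whose constraint is most violated under the current dual weights u.

    X[i][j] in {0, 1} indicates whether pattern j occurs in graph i; y[i] is the class
    label; the gain of pattern j is |sum_i u_i * y_i * X[i][j]| (the absolute value
    covers the pattern and its negation as weak hypotheses).
    """
    def gain(j):
        return abs(sum(u[i] * y[i] * X[i][j] for i in range(len(y))))
    best = max(range(len(patterns)), key=gain)
    return patterns[best], gain(best)

# Toy usage with three graphs and two candidate patterns
X = [[1, 0], [1, 1], [0, 1]]
y = [+1, -1, -1]
u = [0.5, 0.3, 0.2]
print(most_violated_pattern(["p1", "p2"], X, y, u))
```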
Algorithm Overview
Iteration
Find a new pattern by graph mining with weights u
If all constraints are satisfied, break
Add a new constraint
Update u by solving the dual problem
Convert dual solution to obtain primal solution
Return
46
162
Experimental Settings
Classification and Regression Datasets
47
163
Classification Results
48
164
Regression Results
49
165
Extracted patterns from CPDB
50
166
Memory Usage
51
167
Runtime
52
168
Comparison with AdaBoost
53
169
Summary (Graph Boosting)
Graph boosting simultaneously generates patterns and learns their weights
Finite convergence by column generation
Interpretable by chemists
Flexible constraints and speed-up by LP
54
170
Part 4: Entire Regularization Path
55
171
Overview
Entire regularization paths: LARS-LASSO (Efron et al., 2004), L1-SVM
Forward selection of features; trace the solution trajectory of L1-regularized learning
Path following algorithm for graph data: feature search -> pattern search
Branch-and-bound algorithm on the DFS code tree with a new bound
56
172
Path Following Algorithms
LASSO regression
Follow the complete trajectory of the solution as the regularization parameter goes from infinity to zero
Active feature set: features corresponding to nonzero weights
57
173
Piecewise Linear Path
At a turning point,
A new feature included into the active set, or An existing feature excluded from the active set
58
174
Practical Merit of Path Following
Cross validation by grid search has to solve a QP many times, which is especially time-consuming for graph data
Path following does not involve a QP and determines the CV-optimal regularization parameter at the finest precision
59
175
Pseudo code of path following
Set the initial point and direction
Do
  d1 = step size if the next event is an inclusion  (main search problem)
  d2 = step size if the next event is an exclusion
  d = min(d1, d2)
  Update the active feature set
  Set the next direction
Until all features are included
60
176
Feature space of patterns
Graph training data Set of all subgraphs (patterns) Each graph is represented as
61
177
Main Search problem
Step size if pattern t is included next
: constants computed from active set
Find pattern
that minimizes
62
178
Tree-shaped Search Space
Each node has a pattern Generate nodes from the root:
Add an edge at each step
63
179
Tree Pruning
If it is guaranteed that the optimal pattern is not in the downstream, the search tree can be pruned
Not generated
64
180
Theorem (Pruning condition)
Traversed up to pattern t :Minimum value so far No better pattern in the downstream, if
where
65
181
Initial Experiments
EDKB Estrogen receptor database
131 training graphs (chemical compounds)
Computation Time: 4 sec/search
Pattern size limitation: 10 edges
66
182
Solution Path
67
183
Events
68
184
Summary: Regularization Path
Path following implemented for graph data Pattern search by graph mining Classification: To do Combination with item set mining
69
185
Part 5: Itemset Boosting for Predicting HIV Drug Resistance
70
186
Life cycle of an HIV Virus
Drug Target
71
187
Approved HIV Drugs
8 protease inhibitors (PI)
8 nucleoside/nucleotide reverse transcriptase inhibitors (NRTI)
3 non-nucleoside reverse transcriptase inhibitors (NNRTI)
1 fusion inhibitor
72
188
Drug resistance of HIV
Exposure to a drug causes mutations in HIV's genes; as a result, HIV gains resistance against the drug
Identifying the genotype of HIV in a patient is relatively cheap, so predict the drug resistance from HIV's genotype!
Effective Pharmacotherapy for individuals
73
189
Drug resistance prediction problem as regression
Input: Set of mutations in a target protein
41L: Amino acid at position 41 changed to L
Output: resistance against a drug (fold change)
Examples: (40F, 41L, 43E, 210W, 211K, 215Y) -> 0.8; (43Q, 75M, 122E, 210W, 211K) -> 12.8; (75I, 77L, 116Y, 151M, 184V) -> ?
74
190
Simple vs Complex Genotypic Features
Simple genotypic features
(0, 1, 0, 1, 0, 0, 0, 1,)
41L
62V F116Y
Complex genotypic features
(0, 1, 0, 1, 0, 0, 0, 1,)
77L,116Y 103N,210W,215Y
75
191
Linear Regression on Simple Genotypic Features (Rhee et al., PNAS 2006)
Mutation associations not discovered
76
192
Complex Genotypic Features
0/1 vector of pattern indicators Huge dimensionality! Need itemset mining for selecting features Selection of salient features (0, 1, 0, 1, 0, 0, 0, 1,)
77L,116Y 103N,210W,215Y
patterns
77
193
Other methods
Nonlinear SVM
Not interpretable High accuracy
Decision trees
Interpretable Low accuracy
78
194
Motivation of Itemset Boosting
Impossible to maintain all complex features
Greedy feature discovery by boosting + itemset mining
Quadratic Programming (QP) boosting: reduces the number of itemset mining calls; faster than AdaBoost
Accurate prediction & interpretable results
79
195
Sparse classification in a very high dimensional space
G: all possible patterns (intractably large)
|G|-dimensional feature vector x
Linear classifier: f(x) = Σ_{j=1}^{d} α_j x_j
Use an L1 regularizer (LASSO) to make α sparse and select a tractable number of patterns
80
196
Problem formulation: Quadratic Programming
Minimize the sum of squared losses over the training examples plus the L1 regularizer (with slack variables)
81
197
Dual QP
Primal: Huge number of weight variables Dual: Huge number of constraints
82
198
Column Generation Algorithm for QP Boost (Demiriz et al., 2002)
Start from the dual with no constraints Add the most violated constraint each time Guaranteed to converge
Constraint Matrix
Used Part
83
199
Finding the most violated constraint
Constraint for a pattern (shown again)
Finding the most violated one
Searched by weighted itemset mining
84
200
Algorithm Overview
Iteration
Find a new pattern by itemset mining with weights u
If all constraints are satisfied, break
Add a new constraint
Update u by solving the dual problem
Convert dual solution to obtain primal solution
Return
85
201
Experimental settings
Three classes of drugs
NRTI (Nucleotide Reverse Transcriptase Inhibitors) PI (Protease Inhibitors) NNRTI (Nonnucleotide Reverse Transcriptase Inhibitors)
5-fold cross validation
Linear SVM, Ridge regression, Lars Nonlinear SVM, iBoost
86
202
Regression results
Accuracy:
87
203
Accuracy Summary
NRTIs: iBoost performed best
PIs: nonlinear methods were better than linear; SVMs were slightly better (not significant)
NNRTIs: linear methods were better; combination features are not necessary
88
204
NRTI Drugs
89
205
Known mutation associations in RT
Red: thymidine-associated mutations (TAM)
Blue: Q151M complex
41L, 67N, 70R, 210W, 215Y/F, 219Q, 69i, 75I, 77L, 116Y, 151M, 65R, 74V, 184I/V
RT of HIV-1
90
206
PI Drugs
91
207
NNRTI Drugs (almost no combination found)
92
208
Computation Time of iBoost
Training time for 3TC: 507 isolates with 371 mutations on average; the QP time is longer than the mining time
93
209
Summary (HIV)
Itemset boosting for finding mutation associations
Good accuracy for NRTIs
Our complex features rediscover known mutation clusters
Broad applications: multiple SNP analysis, RNAi efficacy prediction, motif combination, flu mutation analysis, P53 mutation analysis
94
210
Reference
[1] K. Tsuda and T. Kudo, "Clustering graphs by weighted substructure mining," in Proceedings of the 23rd International Conference on Machine Learning (ICML '06), 2006, pp. 953-960.
[2] H. Saigo, N. Krämer, and K. Tsuda, "Partial least squares regression for graph mining," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), 2008, p. 578.
[3] H. Saigo, S. Nowozin, T. Kadowaki, T. Kudo, and K. Tsuda, "gBoost: a mathematical programming approach to graph classification and regression," Machine Learning, vol. 75, no. 1, pp. 69-89, Nov. 2008.
[4] H. Saigo, T. Uno, and K. Tsuda, "Mining complex genotypic features for predicting HIV-1 drug resistance," Bioinformatics, vol. 23, no. 18, pp. 2455-2462, Sep. 2007.
[5] K. Tsuda, "Entire regularization paths for graph data," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), 2007, pp. 919-926.
95
211
Machine Learning Summer School Kyoto 2012
Graph Mining
Chapter 3: Kernels for Structured Data
National Institute of Advanced Industrial Science and Technology Koji Tsuda
1
212
Agenda 3 (Kernel)
1. Kernel Method Revisited
2. Marginalized Kernels (Fisher Kernels)
3. Marginalized Graph Kernels
4. Weisfeiler-Lehman Kernels
5. Reaction Graph Kernels
6. Concluding Remarks
2
213
Part 1: Kernel Method Revisited
3
214
Kernels and Learning
In Kernel-based learning algorithms, problem solving is now decoupled into:
A general-purpose learning algorithm (e.g. SVM, PCA, ...), often a linear algorithm
A problem-specific kernel
[Figure: a complex learning task decomposed into a simple (linear) learning algorithm plus a specific kernel function]
4
215
Current Synthesis
Modularity and re-usability
Same kernel, different learning algorithms; different kernels, same learning algorithm
[Figure: Data 1 (sequence) and Data 2 (network) each mapped by a kernel to a Gram matrix (not necessarily stored), then fed to Learning Algorithm 1 or 2]
5
216
Kernel Methods : intuitive idea
Find a mapping f such that, in the new space, problem solving is linear
The kernel represents the similarity between two objects, defined as the dot product in this new vector space
But the mapping is left implicit
Easy generalization of many dot-product-based learning algorithms
6
217
Kernel Methods : the mapping
[Figure: the mapping f from the original space to the feature (vector) space]
7
218
Kernel : more formal definition
A kernel k(x,y) is a similarity measure defined by an implicit mapping f such that k(x,y) = f(x)·f(y)
This similarity measure implies:
Invariance or other a priori knowledge
The class of functions the solution is taken from
Possibly infinite dimension (hypothesis space for learning), but still computational efficiency when computing k(x,y)
8
219
Kernel Trick
Generalizes (nonlinearly) algorithms in clustering, classification, density estimation ..
When these algorithms are dot-product based, by replacing the dot product (x·y) by k(x,y) = f(x)·f(y)
When these algorithms are distance-based, by replacing the squared distance d²(x,y) by k(x,x) + k(y,y) - 2k(x,y)
Freedom of choosing f implies a large variety of learning algorithms
9
220
Valid Kernels
Theorem: k(x,y) is a valid kernel if k is positive definite and symmetric (Mercer Kernel)
A function is P.D. if ∫∫ K(x,y) f(x) f(y) dx dy ≥ 0 for all f ∈ L2
In other words, the Gram matrix K (whose elements are k(xi, xj)) must be positive definite for all xi, xj of the input space
One possible choice of f(x): k(·, x) (maps a point x to the function k(·, x); a feature space with infinite dimension!)
10
221
How to build new kernels
Kernel combinations, preserving validity:
K(x,y) = λ K1(x,y) + (1-λ) K2(x,y),  0 ≤ λ ≤ 1
K(x,y) = a·K1(x,y),  a ≥ 0
K(x,y) = K1(x,y)·K2(x,y)
K(x,y) = f(x)·f(y),  f a real-valued function
K(x,y) = K3(Φ(x), Φ(y))
K(x,y) = xᵀPy,  P symmetric positive definite
K(x,y) = K1(x,y) / sqrt(K1(x,x) K1(y,y))
11
222
Strategies of Design
Convolution Kernels: text is a recursively-defined data structure. How to build global kernels form local (atomic level) kernels? Generative model-based kernels: the topology of the problem will be translated into a kernel function
12
223
Family of kernels
Kernels for biological sequences
Spectrum kernel Marginalized kernel Profile kernel Local alignment kernel
Kernel Methods in Computational Biology, MIT Press
Tree Kernels
Kernel for phylogenetic profiles
Kernel for natural language Kernel for RNA sequences
224
Family of kernels
Kernels for nodes in a network
Diffusion kernel Locally constrained diffusion kernel
Graph Kernels
Marginalized Graph Kernels MGK without tottering
Acyclic Pattern Kernels
Shortest Path Kernel Weisfeiler-Lehman Kernel
225
Weak points of kernel methods
Not Interpretable
Not sure which features are used -> Graph Mining, Boosting
Dense kernel matrices: Slow
Take O(n^3) time for manipulation -> Semi-supervised learning
15
226
Part 2. Marginalized kernels
16
227
Biological Sequences: Classification Tasks
DNA sequences (A,C,G,T): gene finding, splice sites
RNA sequences (A,C,G,U): microRNA discovery, classification into Rfam families
Amino acid sequences (20 symbols): remote homolog detection, fold recognition
17
228
Kernels for Sequences
Similarity between sequences of different lengths
ACGGTTCAA ATATCGCGGGAA
Later combined with SVMs and other kernel methods
18
229
Count Kernel
Inner product between symbol counts
Extension: Spectrum kernels (Leslie et al., 2002) Counts the number of k-mers (k-grams) efficiently Not good for sequences with frequent context change E.g., coding/non-coding regions in DNA
19
230
Hidden Markov Models for Estimating Context
Visible Variable
Hidden Variable
: Symbol Sequence
: Context
HMM can estimate the posterior probability of hidden variables from data
20
231
Marginalized kernels
Design a joint kernel for the combined (visible, hidden) variables; the hidden variable is usually not available
Take expectation with respect to the hidden variable
The marginalized kernel for visible variables
21
232
Designing a joint kernel for sequences
Symbols are counted separately in each context
Count of a combined symbol (k, l); joint kernel: the count kernel with context information
22
233
Marginalization of the joint kernel
Joint kernel
Marginalized count kernel
23
234
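A hedged sketch of the marginalized count kernel, assuming the per-position posterior probabilities of the hidden contexts (e.g., from an HMM forward-backward pass) are already available; the argument names and the lack of length normalization are illustrative choices:

```python
def marginalized_count_kernel(x, post_x, y, post_y, contexts, alphabet):
    """Marginalized count kernel for sequences.

    post_x[i][k] is the posterior probability that position i of sequence x is in
    hidden context k. The feature for the combined symbol (k, l) is the expected
    count sum_i post[i][k] * [x_i == l]; the kernel is the inner product of these
    expected-count vectors.
    """
    def expected_counts(seq, post):
        c = {(k, l): 0.0 for k in contexts for l in alphabet}
        for i, sym in enumerate(seq):
            for k in contexts:
                c[(k, sym)] += post[i][k]
        return c
    cx, cy = expected_counts(x, post_x), expected_counts(y, post_y)
    return sum(cx[kl] * cy[kl] for kl in cx)

# Toy example: two contexts ("e"/"i", e.g. exon/intron posteriors) over alphabet ACGT
x = "ACGT"
post_x = [{"e": 0.9, "i": 0.1}, {"e": 0.8, "i": 0.2},
          {"e": 0.3, "i": 0.7}, {"e": 0.1, "i": 0.9}]
print(marginalized_count_kernel(x, post_x, x, post_x, ["e", "i"], "ACGT"))
```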
Computing Marginalized Counts from HMM
Marginalized count is described as
Posterior probability of i-th hidden variable is efficiently computed as
24
235
2nd order marginalized count kernel
If adjacent relations between symbols have essential meanings, the count kernel is obviously not sufficient
2nd order marginalized count kernel 4 neighboring symbols (i.e. 2 visible and 2 hidden) are combined and counted
25
236
Fisher Kernel
Probabilistic model
Fisher Kernel Mapping to a feature vector (Fisher score vector)
Inner product of Fisher scores
Z: Fisher information matrix
26
237
Fisher Kernel from HMM
Derive FK from HMM (Jaakkola et al. 2000)
Derivative for emission probabilities only No Fisher information matrix Counts are centralized and weighted
FK from HMM is a special case of marginalized kernels
27
238
Difference between FK and Marginalized Kernels
FK: Probabilistic model determines the joint kernel and the posterior probabilities
MK: You can determine them separately
More flexible design!
28
239
Protein clustering experiment
84 proteins from five classes: gyrB proteins from five bacteria species
Clustering methods: HMM + {FK, MCK1, MCK2} + K-Means
Evaluation: Adjusted Rand Index (ARI)
29
240
Kernel Matrices
30
241
Clustering Evaluation
31
242
Applications since then..
Marginalized graph kernels (Kashima et al., ICML 2003)
Sensor networks (Nguyen et al., ICML 2004)
Labeling of structured data (Kashima et al., ICML 2004)
Robotics (Shimosaka et al., ICRA 2005)
Kernels for promoter regions (Vert et al., NIPS 2005)
Web data (Zhao et al., WWW 2006)
32
243
Summary (Marginalized Kernels)
General Framework for using generative model for defining kernels Fisher kernel as a special case Broad applications
33
244
Part 3 Marginalized Graph Kernels
34
245
Motivations for graph analysis
Existing methods assume tables
[Table: a relational table with columns Serial Num, Name, Age, Sex, Address and rows such as (0001, ..., 40, Male, Tokyo) and (0002, ..., 31, Female, Osaka)]
Structured data beyond this framework New methods for analysis
35
246
Graphs..
36
247
Graph Structures in Biology
[Figure: graph-structured biological data: a DNA sequence (A, C, G, ...), an RNA secondary structure, a chemical compound drawn as a molecular graph, and a dependency graph of text in literature ("Amitriptyline inhibits adenosine uptake")]
37
248
Marginalized Graph Kernels
Going to define the kernel function Both vertex and edges are labeled
(Kashima, Tsuda, Inokuchi, ICML 2003)
38
249
Label path
Sequence of vertex and edge labels
Generated by random walking Uniform initial, transition, terminal probabilities
39
250
Path-probability vector
40
251
Kernel definition
Kernels for paths
[Figure: two label paths, e.g. (A, c, D, b, E) and (B, c, D, a, A), compared by a path kernel]
Take the expectation over all possible paths! Marginalized kernels for graphs
41
252
Computation
Transition probabilities; initial and terminal probabilities omitted for brevity
K_v: kernel computed from the pairs of paths ending at vertices (v, v')
K_v is written recursively
The kernel is computed by solving a system of linear equations: polynomial time (a simplified sketch follows below)
[Figure: recursion over the predecessor sets A(v) and A(v')]
42
253
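A simplified sketch of the marginalized graph kernel computation by fixed-point iteration, with uniform transition probabilities, a fixed stopping probability, and edge labels ignored; the actual method solves the corresponding linear system exactly:

```python
def marginalized_graph_kernel(g1, g2, labels1, labels2, n_iter=50, stop=0.1):
    """Fixed-point iteration for a simplified marginalized (random-walk) graph kernel.

    g1, g2: adjacency lists (dict of lists); labels1, labels2: vertex labels.
    """
    def k_v(a, b):                       # vertex-label kernel (delta kernel)
        return 1.0 if labels1[a] == labels2[b] else 0.0

    pairs = [(v, w) for v in g1 for w in g2]
    R = {p: 0.0 for p in pairs}          # R[(v, w)] ~ expected path kernel starting at (v, w)
    for _ in range(n_iter):              # iterate the recursive equation to a fixed point
        R_new = {}
        for v, w in pairs:
            acc = stop
            for u in g1[v]:
                for x in g2[w]:
                    p = (1 - stop) / len(g1[v]) * (1 - stop) / len(g2[w])
                    acc += p * k_v(u, x) * R[(u, x)]
            R_new[(v, w)] = k_v(v, w) * acc
        R = R_new
    return sum(R.values()) / (len(g1) * len(g2))   # uniform initial distribution

# Toy usage with two small labeled graphs
g1 = {0: [1], 1: [0, 2], 2: [1]}
g2 = {0: [1], 1: [0]}
print(marginalized_graph_kernel(g1, g2, {0: "A", 1: "B", 2: "A"}, {0: "A", 1: "B"}))
```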
Graph Kernel Applications
Chemical Compounds (Mahe et al., 2005) Protein 3D structures (Borgwardt et al, 2005) RNA graphs (Karklin et al., 2005) Pedestrian detection Signal Processing
43
254
Predicting Mutagenicity
MUTAG benchmark dataset
Mutation of Salmonella typhimurium 125 positive data (effective for mutations) 63 negative data (not effective for mutations)
Mahe et al. J. Chem. Inf. Model., 2005
44
255
Classification of Protein 3D structures
Graphs for protein 3D structures
Node: Secondary structure elements Edge: Distance of two elements
Calculate the similarity by graph kernels
45 Borgwardt et al. Protein function prediction via graph kernels, ISMB2005
256
Classification of proteins: Accuracy
46 Borgwardt et al. Protein function prediction via graph kernels, ISMB2005
257
Pedestrian detection in images
(F. Suard et al., 2005)
47
258
Classifying RNA graphs
(Y. Karklin et al., 2005)
48
259
Strong points of MGK
Polynomial-time computation: O(n³)
Positive definite kernel, so it can be combined with Support Vector Machines, kernel PCA, kernel CCA, and so on
49
260
Drawbacks of graph kernels
Global similarity measure
Fails to capture subtle differences Long paths suppressed
Results not interpretable
Structural features ignored (e.g. loops)
No labels -> kernel always 1
50
261
Part 4. Weisfeiler Lehman kernel
51
262
Convert a graph into a set of words
i) Make a label set of the adjacent vertices, e.g. {E, A, D}
ii) Sort it, e.g. A, D, E
iii) Add the vertex's own label as a prefix, e.g. B, A, D, E
iv) Map the label sequence to a unique value, e.g. B,A,D,E -> R
v) Assign the value as the new vertex label
Bag-of-words: {A, B, D, E, ..., R, ...} (see the sketch below)
263
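A minimal sketch of one Weisfeiler-Lehman relabeling iteration as described above; the dict-based graph representation and the string compression table are illustrative assumptions:

```python
def wl_iteration(labels, adj, label_table):
    """One Weisfeiler-Lehman relabeling step.

    For each vertex: collect the sorted neighbour labels, prefix the vertex's own
    label, and map the resulting signature to a compressed label.
    """
    new_labels = {}
    for v, own in labels.items():
        signature = own + "," + ",".join(sorted(labels[u] for u in adj.get(v, [])))
        if signature not in label_table:
            label_table[signature] = "c" + str(len(label_table))   # fresh compressed label
        new_labels[v] = label_table[signature]
    return new_labels

# The bag of words is the multiset of labels accumulated over iterations;
# the WL kernel is the inner product of the two graphs' label-count vectors.
adj = {1: [2, 3], 2: [1], 3: [1]}
labels = {1: "A", 2: "B", 3: "B"}
print(wl_iteration(labels, adj, {}))
```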
Courtesy K. Borgwardt
29/08/2012 53
264
54
265
Part 5. Reaction Graph Kernels
55
266
KEGG lysine degradation pathway
267
Missing enzymes in metabolic networks
Many enzymatic reactions whose substrate and product are known, but the enzyme involved is unknown.
?
Need to assign Enzymatic Classification numbers.
?
268
EC number
EC (Enzymatic Classification) number is a hierarchical categorization of
Enzymes Enzymatic reactions
[Figure: an EC number such as EC 1.3.3.- decomposed into class, subclass, and sub-subclass levels]
269
Task
Given a pair of substrate and product as a query, find similar reactions in the database
Query ?
Result of Retrieval EC: 3.5.2
Similarity measure of reactions is necessary
270
Reaction Graph
Represent an enzymatic reaction as a reaction graph
Node: molecules; edge: chemical relation between molecules (main, leave, cofactor, transferase, ligase)
Reaction graph kernel: a similarity measure between reaction graphs
Molecule = graph of atoms, so a reaction graph has a graph-of-graphs structure; extension of the existing walk-based graph kernel
(Kashima et al., 2003)
271
[Figure: a reaction graph with a main substrate, two main products, and edges labeled main and leave]
272
Reaction graph kernels (RGK)
Two-layered kernels on graphs of graphs
Node kernel = walk-based graph kernel between molecules
Edge kernel = delta kernel on labels (main, leave, cofactor, transferase, ligase)
[Figure: two reaction graphs R and R' compared by the reaction graph kernel K_r(R, R')]
273
Simplified Settings
A query might not come in the complete form, so remove some edges from the database entries
RPAIR: use only reactant edges (main, leave)
main-only: use main edges only
274
Automatic classification of enzymatic reactions in KEGG
KEGG/REACTION database: 4610 reactions with known EC numbers; 6 classes, 50 subclasses, 124 sub-subclasses
Construct a nearest-neighbor classifier based on the reaction graph kernel
Three different levels: class, subclass, sub-subclass
Three kernel versions: full, RPAIR, main-only
Measure the leave-one-out classification error
275
Prediction Accuracy
As expected, classification is easier for upper categories, but difficult for lower categories such as sub-subclass. The order of accuracy (full > RPAIR > main-only) suggests that detailed edge information contributes to further accuracy.
276
Predicting unannotated reactions in plant secondary metabolism
KEGG pathway "Biosynthesis of Secondary Metabolites"
Out of 56 unannotated reactions, we manually assigned ECs to 36 reactions under chemists' guidance
Comparison with an existing rule-based method: e-zyme (Kotera et al., J. Am. Chem. Soc., 2004)
RGK's accuracy was better than e-zyme's (50% improvement for the top candidate)
277
Case 1: EC 3.1.1
Query: [structures omitted]
Manual annotation: 3.1.1
Structures of C02046 and C01479 are almost the same except for structural isomerism. A hydrolysis occurs at a carboxylic-ester bond.
Method: RGK; rank 1: 3.1.1, rank 2: 1.14.11, rank 3: 1.14.11
Method: e-zyme; rank 1: 6.1.1, rank 2: 3.1.1, rank 3: NA
278
Case 2: EC 2.1.1+1.1.1
Query: [structures omitted]
Manual annotation: after removing a methyl group of C06175 (2.1.1), C01735 is produced by oxidation of a CH-OH group (1.1.1). In the second reaction, enzymes usually use NAD+/NADP+ as an acceptor.
Method: RGK; rank 1: 1.1.1, rank 2: 1.1.1, rank 3: 1.1.1
Method: e-zyme; rank 1: 2.1.1, rank 2: 1.13.12, rank 3: 1.14.14
279
Case 3: EC1.14+2.4.1+5.2.1+3.2.1
Query: [structures omitted]
Manual annotation: this reaction is thought to be a set of 5 reactions by analogy with the pathway C00423 => C01772 => C05158 => C05839 => C05838, and on to C05851. The last reaction is spontaneous and not enzymatic.
Method: RGK; rank 1: 1.14.13, rank 2: 1.2.3, rank 3: 1.2.1
Method: e-zyme; rank 1: NA, rank 2: NA, rank 3: NA
280
A difficult case
No annotation
A very similar reaction (below) is found by manual inspection
Multi-step reactions are difficult to analyze: we don't know how many steps are hidden between the given substrate and product.
281
Concluding Remarks: New Frontier
Developing algorithms for learning from graphs means taming combinatorial explosion:
Recursive fixed-point iteration: graph kernels
Statistical pruning in the search tree: graph boosting
Hashing to a set of words: WL kernel
New ideas are necessary to get beyond the current level of speed and prediction accuracy
Deeper integration of learning and mining
THANK YOU VERY MUCH!!
71
282
Reference
[1] H. Kashima, K. Tsuda, and A. Inokuchi, "Marginalized kernels between labeled graphs," ICML, pp. 321-328, 2003.
[2] H. Saigo, M. Hattori, H. Kashima, and K. Tsuda, "Reaction graph kernels predict EC numbers of unknown enzymatic reactions in plant secondary metabolism," BMC Bioinformatics, vol. 11, Suppl 1, p. S31, Jan. 2010.
[3] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. Borgwardt, "Weisfeiler-Lehman graph kernels," Journal of Machine Learning Research, vol. 12, pp. 2539-2561, 2011.
[4] K. Tsuda, T. Kin, and K. Asai, "Marginalized kernels for biological sequences," Bioinformatics, vol. 18, Suppl 1, pp. S268-S275, Jul. 2002.
72
283