Theory and Applications of Boosting
Rob Schapire
Boosting is a general method for producing a very accurate classification rule by combining rough and moderately inaccurate “rules of thumb.” While rooted in a theoretical framework of machine learning, boosting has been found to perform quite well empirically. This tutorial will focus on the boosting algorithm AdaBoost, and will explain the underlying theory of boosting, including explanations that have been given as to why boosting often does not suffer from overfitting, as well as interpretations based on game theory, optimization, statistics, and maximum entropy. Some practical applications and extensions of boosting will also be described.
1
Example: How May I Help You?
[Gorin et al.]
goal: automatically categorize type of call requested by phone customer (Collect, CallingCard, PersonToPerson, etc.)
- "yes I'd like to place a collect call long distance please" (Collect)
- "operator I need to make a call but I need to bill it to my office" (ThirdNumber)
- "yes I'd like to place a call on my master card please" (CallingCard)
- "I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill" (BillingCredit)
observation:
- easy to find rules of thumb that are often correct
  e.g.: IF "card" occurs in utterance THEN predict "CallingCard"
- hard to find single highly accurate prediction rule
2
The Boosting Approach
- devise computer program for deriving rough rules of thumb
- apply procedure to subset of examples; obtain rule of thumb
- apply to 2nd subset of examples; obtain 2nd rule of thumb
- repeat T times
3
Key Details
how to choose examples on each round?
- concentrate on hardest examples (those most often misclassified by previous rules of thumb)
how to combine rules of thumb into single prediction rule?
- take (weighted) majority vote of rules of thumb
4
Boosting
boosting = general method of converting rough rules of
thumb into highly accurate prediction rule
technically:
- assume given "weak" learning algorithm that can consistently find classifiers (rules of thumb) at least slightly better than random, say, accuracy ≥ 55% (in two-class setting)   [ "weak learning assumption" ]
- given sufficient data, a boosting algorithm can provably construct single classifier with very high accuracy, say, 99%
5
Outline of Tutorial
- basic algorithm and core theory
- fundamental perspectives
- practical extensions
- advanced topics
6
Preamble: Early History
7
Strong and Weak Learnability
boosting's roots are in "PAC" learning model   [Valiant '84]
get random examples from unknown, arbitrary distribution
strong PAC learning algorithm:
- for any distribution, with high probability, given polynomially many examples (and polynomial time), can find classifier with arbitrarily small generalization error
weak PAC learning algorithm   [Kearns & Valiant '88]:
- same, but generalization error only needs to be slightly better than random guessing (≤ 1/2 − γ)
does weak learnability imply strong learnability?
8
If Boosting Possible, Then...
- can use (fairly) wild guesses to produce highly accurate predictions
- if can learn "part way" then can learn "all the way"
- should be able to improve any learning algorithm
- for any learning problem:
  - either can always learn with nearly perfect accuracy
  - or there exist cases where cannot learn even slightly better than random guessing
9
First Boosting Algorithms
[Schapire '89]:
- first provable boosting algorithm
[Freund '90]:
- "optimal" algorithm that "boosts by majority"
[Drucker, Schapire & Simard '92]:
- first experiments using boosting
- limited by practical drawbacks
[Freund & Schapire '95]:
- introduced "AdaBoost" algorithm
- strong practical advantages over previous boosting algorithms
10
Basic Algorithm and Core Theory
- introduction to AdaBoost
- analysis of training error
- analysis of test error and the margins theory
- experiments and applications
11
Basic Algorithm and Core Theory
- introduction to AdaBoost
- analysis of training error
- analysis of test error and the margins theory
- experiments and applications
12
A Formal Description of Boosting
given training set (x1, y1), . . . , (xm, ym) where yi ∈ {−1, +1} is the correct label of instance xi ∈ X
for t = 1, . . . , T :
- construct distribution Dt on {1, . . . , m}
- find weak classifier ("rule of thumb") ht : X → {−1, +1} with small error εt on Dt :  εt = Pr_{i∼Dt}[ ht(xi) ≠ yi ]
output final/combined classifier H_final
13
AdaBoost
constructing Dt :   [with Freund]
- D1(i) = 1/m
- given Dt and ht :
    Dt+1(i) = (Dt(i)/Zt) × e^{−αt}  if yi = ht(xi)
    Dt+1(i) = (Dt(i)/Zt) × e^{αt}   if yi ≠ ht(xi)
  i.e.  Dt+1(i) = Dt(i) exp( −αt yi ht(xi) ) / Zt
  where Zt = normalization factor and αt = (1/2) ln( (1 − εt)/εt ) > 0
final classifier:
  H_final(x) = sign( Σt αt ht(x) )
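In code, this whole loop is short. The following is a minimal Python sketch of the updates above (the weak_learner argument is a placeholder for any rule-of-thumb finder, e.g. the stump learner sketched later; an illustration, not the tutorial's own implementation):

import numpy as np

def adaboost(X, y, weak_learner, T):
    # y: array in {-1, +1}; weak_learner(X, y, D) returns a function h with
    # h(X) giving {-1, +1} predictions for every row of X.
    m = len(y)
    D = np.full(m, 1.0 / m)                 # D_1(i) = 1/m
    hs, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)
        pred = h(X)
        eps = D[pred != y].sum()            # weighted error on D_t (assumed in (0, 1/2))
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)   # down-weight correct, up-weight incorrect
        D = D / D.sum()                     # divide by Z_t, the normalization factor
        hs.append(h)
        alphas.append(alpha)
    return lambda Xnew: np.sign(sum(a * h(Xnew) for a, h in zip(alphas, hs)))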
14
Toy Example
[figure: toy dataset with uniform initial distribution D1]
weak classifiers = vertical or horizontal half-planes
15
Round 1
[figure: first weak classifier h1 and reweighted distribution D2]   ε1 = 0.30, α1 = 0.42
16
Round 2
[figure: second weak classifier h2 and reweighted distribution D3]   ε2 = 0.21, α2 = 0.65
17
Round 3
[figure: third weak classifier h3]   ε3 = 0.14, α3 = 0.92
18
Final Classifier
H_final = sign( 0.42 h1 + 0.65 h2 + 0.92 h3 )
[figure: the three weighted half-planes combined into the final classifier]
19
Basic Algorithm and Core Theory
introduction to AdaBoost analysis of training error analysis of test error
and the margins theory
experiments and applications
20
Analyzing the Training Error
Theorem:   [with Freund]
- write εt as 1/2 − γt   [ γt = "edge" ]
- then
    training error(H_final) ≤ Πt [ 2 √(εt(1 − εt)) ] = Πt √(1 − 4γt²) ≤ exp( −2 Σt γt² )
- so: if ∀t : γt ≥ γ > 0 then training error(H_final) ≤ e^{−2γ²T}
- AdaBoost is adaptive: does not need to know γ or T a priori; can exploit γt ≫ γ
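As a quick numeric check of this bound, here is a short Python snippet (a sketch using the errors ε1, ε2, ε3 from the toy example above):

import math

eps = [0.30, 0.21, 0.14]                      # epsilon_t from the toy example
gammas = [0.5 - e for e in eps]               # edges gamma_t = 1/2 - epsilon_t
product_bound = math.prod(2 * math.sqrt(e * (1 - e)) for e in eps)
exp_bound = math.exp(-2 * sum(g * g for g in gammas))
print(product_bound, exp_bound)   # ~0.52 and ~0.60: the product bound is the tighter one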
21
Proof
let F(x) = Σt αt ht(x), so that H_final(x) = sign( F(x) )
Step 1: unwrapping recurrence:
  D_final(i) = (1/m) exp( −yi Σt αt ht(xi) ) / Πt Zt
             = (1/m) exp( −yi F(xi) ) / Πt Zt
22
Proof (cont.)
Step 2: training error(H_final) ≤ Πt Zt
Proof:
  training error(H_final) = (1/m) Σi [ 1 if yi ≠ H_final(xi), 0 else ]
                          = (1/m) Σi [ 1 if yi F(xi) ≤ 0, 0 else ]
                          ≤ (1/m) Σi exp( −yi F(xi) )
                          = Σi D_final(i) Πt Zt
                          = Πt Zt
23
Proof (cont.)
Step 3: Zt = 2 √(εt(1 − εt))
Proof:
  Zt = Σi Dt(i) exp( −αt yi ht(xi) )
     = Σ_{i : yi ≠ ht(xi)} Dt(i) e^{αt} + Σ_{i : yi = ht(xi)} Dt(i) e^{−αt}
     = εt e^{αt} + (1 − εt) e^{−αt}
     = 2 √(εt(1 − εt))
24
Basic Algorithm and Core Theory
introduction to AdaBoost analysis of training error analysis of test error
and the margins theory
experiments and applications
25
How Will Test Error Behave? (A First Guess)
[plot: idealized training and test error as a function of the number of rounds T]
expect:
- training error to continue to drop (or reach zero)
- test error to increase when H_final becomes "too complex"
  - "Occam's razor"
  - overfitting: hard to know when to stop training
26
Technically...
with high probability:
  generalization error ≤ training error + Õ( √(dT/m) )
bound depends on:
- m = # training examples
- d = "complexity" of weak classifiers
- T = # rounds
generalization error = E[test error]
predicts overfitting
27
Overfitting Can Happen
[plot: training and test error vs. # rounds, boosting "stumps" on the heart-disease dataset; test error eventually increases]
but often doesn't...
28
Actual Typical Run
[plot: training and test error vs. # of rounds T, boosting C4.5 on the "letter" dataset]
- test error does not increase, even after 1000 rounds (total size > 2,000,000 nodes)
- test error continues to drop even after training error is zero!

  # rounds          5      100    1000
  train error (%)   0.0    0.0    0.0
  test error (%)    8.4    3.3    3.1

- Occam's razor wrongly predicts "simpler" rule is better
29
A Better Story: The Margins Explanation
key idea:
[with Freund, Bartlett & Lee]
- training error only measures whether classifications are right or wrong
- should also consider confidence of classifications
recall: H_final is weighted majority vote of weak classifiers
measure confidence by margin = strength of the vote
  = (weighted fraction voting correctly) − (weighted fraction voting incorrectly)
[diagram: margin scale from −1 (high confidence, incorrect) through 0 (low confidence) to +1 (high confidence, correct)]
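For concreteness, the margin of one example could be computed as in this small Python sketch (assuming the alphas and hs produced by the AdaBoost sketch earlier, with each h(x) a ±1 prediction; an illustration, not the slides' code):

import numpy as np

def normalized_margin(x, y, hs, alphas):
    # Margin of labeled example (x, y): y * sum_t alpha_t h_t(x),
    # normalized by sum_t |alpha_t| so that it lies in [-1, +1].
    a = np.array(alphas)
    votes = np.array([h(x) for h in hs])     # each h(x) in {-1, +1}
    return y * float(np.dot(a, votes)) / np.abs(a).sum()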
30
Empirical Evidence: The Margin Distribution
margin distribution = cumulative distribution of margins of training examples
[plots: training/test error vs. # of rounds (left), and cumulative margin distributions after 5, 100, and 1000 rounds (right)]

  # rounds            5       100     1000
  train error (%)     0.0     0.0     0.0
  test error (%)      8.4     3.3     3.1
  % margins ≤ 0.5     7.7     0.0     0.0
  minimum margin      0.14    0.52    0.55
31
Theoretical Evidence: Analyzing Boosting Using Margins
Theorem: large margins ⇒ better bound on generalization error (independent of number of rounds)
- proof idea: if all margins are large, then can approximate final classifier by a much smaller classifier (just as polls can predict not-too-close election)
Theorem: boosting tends to increase margins of training examples (given weak learning assumption)
- moreover, larger edges ⇒ larger margins
- proof idea: similar to training error proof
so: although final classifier is getting larger, margins are likely to be increasing, so final classifier actually getting close to a simpler classifier, driving down the test error
32
More Technically...
with high probability, ∀θ > 0 :
  generalization error ≤ Pr[ margin ≤ θ ] + Õ( √(d/m) / θ )
( Pr[·] = empirical probability )
bound depends on:
- m = # training examples
- d = "complexity" of weak classifiers
- entire distribution of margins of training examples
Pr[ margin ≤ θ ] → 0 exponentially fast (in T) if εt < 1/2 − θ  (∀t)
so: if weak learning assumption holds, then all examples will quickly have large margins
33
Consequences of Margins Theory
predicts good generalization with no overfitting if:
- weak classifiers have large edges (implying large margins)
- weak classifiers not too complex relative to size of training set
e.g., boosting decision trees resistant to overfitting since trees often have large edges and limited complexity
overfitting may occur if:
- small edges (underfitting), or
- overly complex weak classifiers
e.g., heart-disease dataset:
- stumps yield small edges
- also, small dataset
34
Improved Boosting with Better Margin-Maximization?
can design algorithms more effective than AdaBoost at maximizing the minimum margin   [Breiman]
in practice, often perform worse
why?? more aggressive margin maximization seems to lead to:   [with Reyzin]
- more complex weak classifiers (even using same weak learner); or
- higher minimum margins, but margin distributions that are lower overall
35
Comparison to SVMs
both AdaBoost and SVMs:
- work by maximizing "margins"
- find linear threshold function in high-dimensional space
differences:
- margin measured slightly differently (using different norms)
- SVMs handle high-dimensional space using kernel trick; AdaBoost uses weak learner to search over space
- SVMs maximize minimum margin; AdaBoost maximizes margin distribution in a more diffuse sense
36
Basic Algorithm and Core Theory
- introduction to AdaBoost
- analysis of training error
- analysis of test error and the margins theory
- experiments and applications
37
Practical Advantages of AdaBoost
- fast
- simple and easy to program
- no parameters to tune (except T)
- flexible: can combine with any learning algorithm
- no prior knowledge needed about weak learner
- provably effective, provided can consistently find rough rules of thumb
  → shift in mind set: goal now is merely to find classifiers barely better than random guessing
- versatile
  - can use with data that is textual, numeric, discrete, etc.
  - has been extended to learning problems well beyond binary classification
38
Caveats
performance of AdaBoost depends on data and weak learner
consistent with theory, AdaBoost can fail if:
- weak classifiers too complex → overfitting
- weak classifiers too weak (γt → 0 too quickly) → underfitting → low margins → overfitting
empirically, AdaBoost seems especially susceptible to uniform noise
39
UCI Experiments
tested AdaBoost on UCI benchmarks   [with Freund]
used:
- C4.5 (Quinlan's decision tree algorithm)
- decision stumps: very simple rules of thumb that test on single attributes
  e.g.:  "eye color = brown?"  yes: predict +1, no: predict −1
         "height > 5 feet?"    yes: predict −1, no: predict +1
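A rough Python sketch of training such a stump on a weighted sample (an illustrative brute-force version matching the weak_learner interface sketched earlier, not the code used in these experiments):

import numpy as np

def best_stump(X, y, D):
    # Pick the attribute j, threshold, and sign minimizing weighted error under D.
    # X: (m, n) array of numeric attributes, y in {-1, +1}^m, D: distribution over examples.
    m, n = X.shape
    best_err, best_rule = 1.0, None
    for j in range(n):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] > thresh, sign, -sign)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best_rule = err, (j, thresh, sign)
    j, thresh, sign = best_rule
    return lambda Z: np.where(Z[:, j] > thresh, sign, -sign)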
40
UCI Results
[scatter plots: test error of C4.5 vs. boosting stumps (left) and C4.5 vs. boosting C4.5 (right), one point per UCI benchmark dataset]
41
Application: Detecting Faces
[Viola & Jones]
problem: find faces in photograph or movie
weak classifiers: detect light/dark rectangles in image
many clever tricks to make extremely fast and accurate
42
Application: Human-Computer Spoken Dialogue
[with Rahim, Di Fabbrizio, Dutton, Gupta, Hollister & Riccardi]
application: automatic store front or help desk for AT&T
Labs Natural Voices business
caller can request demo, pricing information, technical
support, sales agent, etc.
interactive dialogue
43
How It Works
[diagram of the dialogue loop: Human → raw utterance → automatic speech recognizer → text → natural language understanding → predicted category → dialogue manager → text response → text-to-speech → computer speech → Human]
NLU's job: classify caller utterances into 24 categories (demo, sales rep, pricing info, yes, no, etc.)
weak classifiers: test for presence of word or phrase
44
Problem: Labels are Expensive
for spoken-dialogue task:
- getting examples is cheap
- getting labels is expensive (must be annotated by humans)
how to reduce number of labels needed?
45
Active Learning
[with Tur & Hakkani-Tür]
idea:
- use selective sampling to choose which examples to label
- focus on least confident examples   [Lewis & Gale]
- for boosting, use (absolute) margin as natural confidence measure   [Abe & Mamitsuka]
46
Labeling Scheme
- start with pool of unlabeled examples
- choose (say) 500 examples at random for labeling
- run boosting on all labeled examples
  - get combined classifier F
- pick (say) 250 additional examples from pool for labeling, choosing those with minimum |F(x)| (proportional to absolute margin)
- repeat
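A Python sketch of this selective-sampling loop (the batch sizes 500/250 come from the scheme above; label_of and boost are stand-ins for the human annotator and the boosting run, so this is an illustration rather than the system's actual code):

import numpy as np

def selective_sampling(pool_X, label_of, boost, n_init=500, n_step=250, iterations=10):
    # label_of(i) returns the (expensive) human label of pool example i;
    # boost(X, y) returns a real-valued combined classifier F.
    rng = np.random.default_rng(0)
    chosen = list(rng.choice(len(pool_X), n_init, replace=False))
    for _ in range(iterations):
        X = pool_X[chosen]
        y = np.array([label_of(i) for i in chosen])
        F = boost(X, y)                                   # run boosting on all labeled examples
        rest = np.setdiff1d(np.arange(len(pool_X)), chosen)
        conf = np.abs(F(pool_X[rest]))                    # |F(x)|, proportional to absolute margin
        chosen += list(rest[np.argsort(conf)[:n_step]])   # label the least confident examples
    return chosen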
47
Results: How-May-I-Help-You?
[plot: % error rate vs. # labeled examples, random vs. active sampling, on the How-May-I-Help-You data]

  % error   first reached (random)   first reached (active)   % label savings
  28        11,000                   5,500                    50
  26        22,000                   9,500                    57
  25        40,000                   13,000                   68
48
Results: Letter
[plot: % error rate vs. # labeled examples, random vs. active sampling, on the "letter" dataset]

  % error   first reached (random)   first reached (active)   % label savings
  10        3,500                    1,500                    57
  5         9,000                    2,750                    69
  4         13,000                   3,500                    73
49
Fundamental Perspectives
- game theory
- loss minimization
- an information-geometric view
50
Fundamental Perspectives
- game theory
- loss minimization
- an information-geometric view
51
Just a Game
[with Freund]
can view boosting as a game, a formal interaction between
booster and weak learner
on each round t:
- booster chooses distribution Dt
- weak learner responds with weak classifier ht
game theory: studies interactions between all sorts of
players
52
Games
game defined by matrix M:

               Rock   Paper   Scissors
  Rock         1/2    1       0
  Paper        0      1/2     1
  Scissors     1      0       1/2

- row player ("Mindy") chooses row i
- column player ("Max") chooses column j (simultaneously)
- Mindy's goal: minimize her loss M(i, j)
- assume (wlog) all entries in [0, 1]
53
Randomized Play
usually allow randomized play:
- Mindy chooses distribution P over rows
- Max chooses distribution Q over columns (simultaneously)
Mindy's (expected) loss = Σ_{i,j} P(i) M(i, j) Q(j) = Pᵀ M Q ≡ M(P, Q)
- i, j = "pure" strategies
- P, Q = "mixed" strategies
- m = # rows of M
- also write M(i, Q) and M(P, j) when one side plays pure and other plays mixed
54
Sequential Play
say Mindy plays before Max
if Mindy chooses P then Max will pick Q to maximize M(P, Q)
  ⇒ loss will be L(P) ≡ max_Q M(P, Q)
so Mindy should pick P to minimize L(P)
  ⇒ loss will be min_P L(P) = min_P max_Q M(P, Q)
similarly, if Max plays first, loss will be max_Q min_P M(P, Q)
55
Minmax Theorem
playing second (with knowledge of other player's move) cannot be worse than playing first, so:
  min_P max_Q M(P, Q)   ≥   max_Q min_P M(P, Q)
  (Mindy plays first)       (Mindy plays second)
von Neumann's minmax theorem:
  min_P max_Q M(P, Q) = max_Q min_P M(P, Q)
in words: no advantage to playing second
56
Optimal Play
minmax theorem:
  min_P max_Q M(P, Q) = max_Q min_P M(P, Q) = value v of game
optimal strategies:
- P* = arg min_P max_Q M(P, Q) = minmax strategy
- Q* = arg max_Q min_P M(P, Q) = maxmin strategy
in words:
- Mindy's minmax strategy P* guarantees loss ≤ v (regardless of Max's play)
- optimal because Max has maxmin strategy Q* that can force loss ≥ v (regardless of Mindy's play)
e.g.: in RPS, P* = Q* = uniform
solving game = finding minmax/maxmin strategies
57
Weaknesses of Classical Theory
seems to fully answer how to play games: just compute minmax strategy (e.g., using linear programming)
weaknesses:
- game M may be unknown
- game M may be extremely large
- opponent may not be fully adversarial; may be possible to do better than value v
  e.g.:  Lisa (thinks): "Poor predictable Bart, always takes Rock."
         Bart (thinks): "Good old Rock, nothing beats that."
58
Repeated Play
if only playing once, hopeless to overcome ignorance of game
M or opponent
but if game played repeatedly, may be possible to learn to
play well
goal: play (almost) as well as if knew game and how
opponent would play ahead of time
59
Repeated Play (cont.)
M unknown
for t = 1, . . . , T :
- Mindy chooses Pt
- Max chooses Qt (possibly depending on Pt)
- Mindy's loss = M(Pt, Qt)
- Mindy observes loss M(i, Qt) of each pure strategy i
want:
  (1/T) Σ_{t=1}^T M(Pt, Qt)   ≤   min_P (1/T) Σ_{t=1}^T M(P, Qt)  +  [small amount]
  (actual average loss)            (best loss in hindsight)
60
Multiplicative-Weights Algorithm (MW)
[with Freund]
choose η > 0
initialize: P1 = uniform
on round t:
  Pt+1(i) = Pt(i) exp( −η M(i, Qt) ) / normalization
idea: decrease weight of strategies suffering the most loss
directly generalizes [Littlestone & Warmuth]
other algorithms: [Hannan '57] [Blackwell '56] [Foster & Vohra] [Fudenberg & Levine] . . .
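A minimal Python sketch of the MW update, played here against a best-responding opponent on the Rock-Paper-Scissors matrix from the earlier slide (η = 0.1 and T = 1000 are arbitrary illustrative choices):

import numpy as np

def mw_update(P, losses, eta):
    # One multiplicative-weights step: P is the distribution over rows,
    # losses[i] = M(i, Q_t) is the loss of pure strategy i against Max's play.
    P = P * np.exp(-eta * losses)       # shrink weight of strategies that lost more
    return P / P.sum()                  # renormalize

M = np.array([[0.5, 1.0, 0.0],          # Mindy's RPS loss matrix
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])
P = np.full(3, 1.0 / 3.0)
P_bar = np.zeros(3)
for t in range(1000):
    q = np.argmax(P @ M)                # Max best-responds to P_t
    P_bar += P / 1000.0
    P = mw_update(P, M[:, q], eta=0.1)
print(P_bar)   # average of the P_t's is approximately the uniform minmax strategy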
61
Analysis
Theorem: can choose η so that, for any game M with m rows, and any opponent,
  (1/T) Σ_{t=1}^T M(Pt, Qt)   ≤   min_P (1/T) Σ_{t=1}^T M(P, Qt)  +  ΔT
  (actual average loss)            (best average loss (≤ v))
where ΔT = O( √(ln m / T) ) → 0
regret ΔT is:
- logarithmic in # rows m
- independent of # columns
therefore, can use when working with very large games
62
Solving a Game
[with Freund]
suppose game M played repeatedly:
- Mindy plays using MW
- on round t, Max chooses best response:  Qt = arg max_Q M(Pt, Q)
let
  P̄ = (1/T) Σ_{t=1}^T Pt ,    Q̄ = (1/T) Σ_{t=1}^T Qt
can prove that P̄ and Q̄ are ΔT-approximate minmax and maxmin strategies:
  max_Q M(P̄, Q) ≤ v + ΔT    and    min_P M(P, Q̄) ≥ v − ΔT
63
Boosting as a Game
Mindy (row player) ↔ booster
Max (column player) ↔ weak learner
matrix M:
- row ↔ training example
- column ↔ weak classifier
- M(i, j) = 1 if j-th weak classifier correct on i-th training example, 0 else
- encodes which weak classifiers correct on which examples
- huge # of columns: one for every possible weak classifier
64
Boosting and the Minmax Theorem
γ-weak learning assumption:
- for every distribution on examples, can find weak classifier with weighted error ≤ 1/2 − γ
equivalent to:
  (value of game M) ≥ 1/2 + γ
by minmax theorem, implies that:
- ∃ some weighted majority classifier that correctly classifies all training examples with margin ≥ 2γ
- further, weights are given by maxmin strategy of game M
65
Idea for Boosting
maxmin strategy of M has perfect (training) accuracy and large margins
find approximately using earlier algorithm for solving a game
- i.e., apply MW to M
- yields (variant of) AdaBoost
66
AdaBoost and Game Theory
summarizing:
- weak learning assumption implies maxmin strategy for M defines large-margin classifier
- AdaBoost finds maxmin strategy by applying general algorithm for solving games through repeated play
consequences:
- weights on weak classifiers converge to (approximately) maxmin strategy for game M
- (average of) distributions Dt converges to (approximately) minmax strategy
- margins and edges connected via minmax theorem
- explains why AdaBoost maximizes margins
different instantiation of game-playing algorithm gives online learning algorithms (such as weighted majority algorithm)
67
Fundamental Perspectives
- game theory
- loss minimization
- an information-geometric view
68
AdaBoost and Loss Minimization
many (most?) learning and statistical methods can be viewed as minimizing loss (a.k.a. cost or objective) function measuring fit to data:
  e.g. least squares regression:  Σi (F(xi) − yi)²
AdaBoost also minimizes a loss function
helpful to understand because:
- clarifies goal of algorithm and useful in proving convergence properties
- decoupling of algorithm from its objective means:
  - faster algorithms possible for same objective
  - same algorithm may generalize for new learning challenges
69
What AdaBoost Minimizes
recall proof of training error bound:
  training error(H_final) ≤ Πt Zt
closer look:
- Zt = εt e^{αt} + (1 − εt) e^{−αt} = 2 √(εt(1 − εt))
- αt chosen to minimize Zt
- ht chosen to minimize εt, same as minimizing Zt (since increasing in εt on [0, 1/2])
so: both AdaBoost and weak learner minimize Zt on round t
  equivalent to greedily minimizing Πt Zt
70
AdaBoost and Exponential Loss
so AdaBoost is greedy procedure for minimizing exponential loss
  Πt Zt = (1/m) Σi exp( −yi F(xi) )    where F(x) = Σt αt ht(x)
why exponential loss?
- intuitively, strongly favors F(xi) to have same sign as yi
- upper bound on training error (but very loose)
- smooth and convex
how does AdaBoost minimize it?
71
Coordinate Descent
[Breiman]
{g1, . . . , gN} = space of all weak classifiers
then can write
  F(x) = Σt αt ht(x) = Σ_{j=1}^N λj gj(x)
want to find λ1, . . . , λN to minimize
  L(λ1, . . . , λN) = Σi exp( −yi Σj λj gj(xi) )
AdaBoost is actually doing coordinate descent on this optimization problem:
- initially, all λj = 0
- each round: choose one coordinate j (corresponding to ht) and update (increment by αt)
- choose update causing biggest decrease in loss
powerful technique for minimizing over huge space of functions
72
Functional Gradient Descent
[Mason et al.][Friedman]
want to minimize
  L(F) = L(F(x1), . . . , F(xm)) = Σi exp( −yi F(xi) )
say have current estimate F and want to improve
to do gradient descent, would like update
  F ← F − α ∇F L(F)
but update restricted in class of weak classifiers:
  F ← F + α ht
so choose ht "closest" to −∇F L(F)
equivalent to AdaBoost
73
Estimating Conditional Probabilities
[Friedman, Hastie & Tibshirani]
often want to estimate probability that y = +1 given x
AdaBoost minimizes (empirical version of):
  E_{x,y}[ e^{−y F(x)} ] = E_x[ Pr[y = +1|x] e^{−F(x)} + Pr[y = −1|x] e^{F(x)} ]
where x, y random from true distribution
over all F, minimized when
  F(x) = (1/2) ln( Pr[y = +1|x] / Pr[y = −1|x] )
or
  Pr[y = +1|x] = 1 / (1 + e^{−2F(x)})
so, to convert F output by AdaBoost to probability estimate, use same formula
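The conversion in code, as a one-function Python sketch of the formula above:

import numpy as np

def prob_positive(F_x):
    # Estimate Pr[y = +1 | x] from AdaBoost's real-valued score F(x).
    return 1.0 / (1.0 + np.exp(-2.0 * F_x))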
74
Calibration Curve
[plot: calibration curve, observed probability vs. predicted probability]
- order examples by F value output by AdaBoost
- break into bins of fixed size
- for each bin, plot a point:
  - x-value: average estimated probability of examples in bin
  - y-value: actual fraction of positive examples in bin
75
A Synthetic Example
x ∈ [−2, +2] uniform
Pr[y = +1|x] = 2^{−x²}
m = 500 training examples
[plots: true conditional probability (left) and probability estimates from AdaBoost with stumps (right)]
if run AdaBoost with stumps and convert to probabilities, result is poor: extreme overfitting
76
Regularization
AdaBoost minimizes
  L(λ) = Σi exp( −yi Σj λj gj(xi) )
to avoid overfitting, want to constrain λ to make solution "smoother"
(ℓ1) regularization:
  minimize: L(λ)  subject to:  ||λ||_1 ≤ B
or:
  minimize: L(λ) + β ||λ||_1   (β a constant)
- other norms possible
- ℓ1 ("lasso") currently popular since encourages sparsity   [Tibshirani]
77
Regularization Example
[plots: estimated conditional probabilities at six regularization settings, labeled 10^3, 10^2.5, 10^2, 10^1.5, 10^1, 10^0.5]
78
Regularization and AdaBoost
[Hastie, Tibshirani & Friedman; Rosset, Zhu & Hastie]
[plots: trajectories of individual classifier weights; Experiment 1 plotted against B, Experiment 2 against αT]
Experiment 1: ℓ1-regularized solution vectors plotted as function of B
Experiment 2: AdaBoost run with αt fixed to (small) α; solution vectors plotted as function of αT
plots are identical!
can prove under certain (but not all) conditions that results will be the same (as α → 0)   [Zhao & Yu]
79
Regularization and AdaBoost
suggests stopping AdaBoost early is akin to applying ℓ1-regularization
caveats:
- does not strictly apply to AdaBoost (only variant)
- not helpful when boosting run to convergence (would correspond to very weak regularization)
in fact, in limit of vanishingly weak regularization (B → ∞), solution converges to maximum margin solution   [Rosset, Zhu & Hastie]
80
Benefits of Loss-Minimization View
immediate generalization to other loss functions and learning problems
- e.g. squared error for regression
- e.g. logistic regression (by only changing one line of AdaBoost)
sensible approach for converting output of boosting into conditional probability estimates
helpful connection to regularization
basis for proving AdaBoost is statistically "consistent"
- i.e., under right assumptions, converges to best possible classifier   [Bartlett & Traskin]
81
A Note of Caution
tempting (but incorrect!) to conclude:
- AdaBoost is just an algorithm for minimizing exponential loss
- AdaBoost works only because of its loss function
- more powerful optimization techniques for same loss should work even better
incorrect because: other algorithms that minimize exponential loss can give very poor generalization performance compared to AdaBoost
for example...
82
An Experiment
data:
- instances x uniform from {−1, +1}^{10,000}
- label y = majority vote of three coordinates
- weak classifier = single coordinate (or its negation)
- training set size m = 1000
algorithms (all provably minimize exponential loss):
- standard AdaBoost
- gradient descent on exponential loss
- AdaBoost, but in which weak classifiers chosen at random
results (% test error [# rounds]):

  exp. loss    stand. AdaB.   grad. desc.   random AdaB.
  10^{−10}     0.0 [94]       40.7 [5]      44.0 [24,464]
  10^{−20}     0.0 [190]      40.8 [9]      41.6 [47,534]
  10^{−40}     0.0 [382]      40.8 [21]     40.9 [94,479]
  10^{−100}    0.0 [956]      40.8 [70]     40.3 [234,654]
83
An Experiment (cont.)
conclusions:
- not just what is being minimized that matters, but how it is being minimized
- loss-minimization view has benefits and is fundamental to understanding AdaBoost
- but is limited in what it says about generalization
- results are consistent with margins theory
[plot: cumulative margin distributions for standard AdaBoost, gradient descent, and random AdaBoost]
84
Fundamental Perspectives
- game theory
- loss minimization
- an information-geometric view
85
A Dual Information-Geometric Perspective
loss minimization focuses on function computed by AdaBoost (i.e., weights on weak classifiers)
dual view: instead focus on distributions Dt (i.e., weights on examples)
dual perspective:
- combines geometry and information theory
- exposes underlying mathematical structure
- basis for proving convergence
86
An Iterative-Projection Algorithm
say want to find point closest to x0 in set P = { intersection of N hyperplanes }
algorithm:   [Bregman; Censor & Zenios]
- start at x0
- repeat: pick a hyperplane and project onto it
[diagram: successive projections from x0 onto hyperplanes, approaching P]
if P ≠ ∅, under general conditions, will converge correctly
87
AdaBoost is an Iterative-Projection Algorithm
[Kivinen & Warmuth]
points = distributions Dt over training examples
distance = relative entropy:
  RE(P || Q) = Σi P(i) ln( P(i) / Q(i) )
reference point x0 = uniform distribution
hyperplanes defined by all possible weak classifiers gj:
  Σi D(i) yi gj(xi) = 0   ⇔   Pr_{i∼D}[ gj(xi) ≠ yi ] = 1/2
intuition: looking for "hardest" distribution
88
AdaBoost as Iterative Projection (cont.)
algorithm:
- start at D1 = uniform
- for t = 1, 2, . . . :
  - pick hyperplane/weak classifier ht ↔ gj
  - Dt+1 = (entropy) projection of Dt onto hyperplane
         = arg min_{D : Σi D(i) yi gj(xi) = 0} RE(D || Dt)
claim: equivalent to AdaBoost
further: choosing ht with minimum error ⇔ choosing farthest hyperplane
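A small numeric check of the claim in Python (the distribution, labels, and weak-classifier predictions below are made-up values): after the AdaBoost reweighting with αt = ½ ln((1 − εt)/εt), the new distribution satisfies the hyperplane constraint Σi D(i) yi ht(xi) = 0.

import numpy as np

rng = np.random.default_rng(0)
D = rng.random(10); D /= D.sum()          # current distribution D_t (made-up values)
y = np.array([1, -1, 1, 1, -1, 1, -1, -1, 1, -1])
h = y.copy(); h[[2, 5, 8]] *= -1          # weak classifier wrong on three examples
eps = D[h != y].sum()                     # weighted error
alpha = 0.5 * np.log((1 - eps) / eps)
D_new = D * np.exp(-alpha * y * h)
D_new /= D_new.sum()
print(np.dot(D_new, y * h))               # ~0: D_{t+1} lies exactly on the chosen hyperplane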
89
Boosting as Maximum Entropy
corresponding optimization problem:
  min_{D∈P} RE(D || uniform)   ⇔   max_{D∈P} entropy(D)
where
  P = feasible set = { D : Σi D(i) yi gj(xi) = 0  ∀j }
P ≠ ∅  ⇔  weak learning assumption does not hold
- in this case, Dt → (unique) solution
if weak learning assumption does hold then
- P = ∅
- Dt can never converge
- dynamics are fascinating but unclear in this case
90
Visualizing Dynamics
[with Rudin & Daubechies]
plot one circle for each round t:
- center at (Dt(1), Dt(2))
- radius increasing with t (color also varies with t)
[plot: trajectory of (Dt(1), Dt(2)) over the first several rounds and beyond]
in all cases examined, appears to converge eventually to cycle
open if always true
91
More Examples
[plot: another example of the dynamics of (Dt(1), Dt(2)) across rounds]
92
More Examples
[plot: a further example of the dynamics, shown in two other coordinates of Dt]
93
More Examples
[plot: another example of the cycling dynamics of Dt]
94
More Examples
[plot: one more example of the cycling dynamics of Dt]
95
Unifying the Two Cases
two distinct cases:
[with Collins & Singer]
- weak learning assumption holds ⇒ P = ∅ ⇒ dynamics unclear
- weak learning assumption does not hold ⇒ P ≠ ∅ ⇒ can prove convergence of Dt's
to unify: work instead with unnormalized versions of Dt's
- standard AdaBoost:  Dt+1(i) = Dt(i) exp( −αt yi ht(xi) ) / normalization
- instead:  dt+1(i) = dt(i) exp( −αt yi ht(xi) ),   Dt+1(i) = dt+1(i) / normalization
algorithm is unchanged
96
Reformulating AdaBoost as Iterative Projection
points = nonnegative vectors dt
distance = unnormalized relative entropy:
  RE(p || q) = Σi [ p(i) ln( p(i)/q(i) ) + q(i) − p(i) ]
reference point x0 = 1 (all 1's vector)
hyperplanes defined by weak classifiers gj:
  Σi d(i) yi gj(xi) = 0
resulting iterative-projection algorithm is again equivalent to AdaBoost
97
Reformulated Optimization Problem
optimization problem:
  min_{d∈P} RE(d || 1)
where
  P = { d : Σi d(i) yi gj(xi) = 0  ∀j }
note: feasible set P never empty (since 0 ∈ P)
98
Exponential Loss as Entropy Optimization
all vectors dt created by AdaBoost have form:
  d(i) = exp( −yi Σj λj gj(xi) )
let Q = { all vectors d of this form }
can rewrite exponential loss:
  inf_λ Σi exp( −yi Σj λj gj(xi) ) = inf_{d∈Q} Σi d(i) = min_{d∈Q̄} Σi d(i) = min_{d∈Q̄} RE(0 || d)
where Q̄ = closure of Q
99
Duality
[Della Pietra, Della Pietra & Lafferty]
presented two optimization problems:
- min_{d∈P} RE(d || 1)
- min_{d∈Q̄} RE(0 || d)
which is AdaBoost solving?  Both!
- problems have same solution
- moreover: solution given by unique point in P ∩ Q̄
- problems are convex duals of each other
100
Convergence of AdaBoost
can use to prove AdaBoost converges to common solution of both problems:
- can argue that d* = lim dt is in P
- vectors dt are in Q always ⇒ d* ∈ Q̄
- d* ∈ P ∩ Q̄ ⇒ d* solves both optimization problems
so:
- AdaBoost minimizes exponential loss
- exactly characterizes limit of unnormalized distributions
- likewise for normalized distributions when weak learning assumption does not hold
also, provides additional link to logistic regression
- only need slight change in optimization problem   [with Collins & Singer; Lebanon & Lafferty]
101
Practical Extensions
- multiclass classification
- ranking problems
- confidence-rated predictions
102
Practical Extensions
- multiclass classification
- ranking problems
- confidence-rated predictions
103
Multiclass Problems
say y ∈ Y where |Y| = k
direct approach (AdaBoost.M1):   [with Freund]
  ht : X → Y
  Dt+1(i) = (Dt(i)/Zt) × e^{−αt}  if yi = ht(xi)
  Dt+1(i) = (Dt(i)/Zt) × e^{αt}   if yi ≠ ht(xi)
  H_final(x) = arg max_{y∈Y} Σ_{t : ht(x) = y} αt
can prove same bound on error if ∀t : εt ≤ 1/2
- in practice, not usually a problem for "strong" weak learners (e.g., C4.5)
- significant problem for "weak" weak learners (e.g., decision stumps)
instead, reduce to binary
104
The One-Against-All Approach
break k-class problem into k binary problems and solve each separately
say possible labels are Y = { four classes, shown as shapes in the original slide }
[table: for each of the four classes, examples x1, . . . , x5 are relabeled + if they belong to that class and − otherwise]
to classify new example, choose label predicted to be most "positive"
= "AdaBoost.MH"   [with Singer]
problem: not robust to errors in predictions
105
Using Output Codes
[with Allwein & Singer][Dietterich & Bakiri]
reduce to binary using coding matrix M
- rows of M = "code words", one per class
- each column of M defines a binary relabeling of the examples x1, . . . , x5
[table: example coding matrix M with ± entries and the five induced binary problems]
to classify new example, choose "closest" row of M
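A Python sketch of that decoding step (Hamming-style distance to the code words; the coding matrix M and the binary classifiers feeding binary_preds are placeholders):

import numpy as np

def decode(M, binary_preds):
    # M: (k, L) coding matrix with entries in {-1, +1};
    # binary_preds: length-L vector of {-1, +1} predictions for one example.
    # Return the index of the class whose code word (row of M) is closest.
    dists = (M != binary_preds).sum(axis=1)   # Hamming distance to each row
    return int(np.argmin(dists))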
106
Output Codes (continued)
if rows of M far from one another,
will be highly robust to errors
potentially much faster when k (# of classes) large
disadvantage:
- binary problems may be unnatural and hard to solve
107
Practical Extensions
- multiclass classification
- ranking problems
- confidence-rated predictions
108
Ranking Problems
[with Freund, Iyer & Singer]
goal: learn to rank objects (e.g., movies, webpages, etc.) from
examples
can reduce to multiple binary questions of form:
is or is not object A preferred to object B?
now apply (binary) AdaBoost ⇒ "RankBoost"
109
Application: Finding Cancer Genes
[Agarwal & Sengupta]
examples are genes (described by microarray vectors)
want to rank genes from most to least relevant to leukemia
data sizes:
- 7129 genes total
- 10 known relevant
- 157 known irrelevant
110
Top-Ranked Cancer Genes
[table: top-ranked genes with relevance summary]
  1. KIAA0220
  2. G-gamma globin
  3. Delta-globin
  4. Brain-expressed HHCPA78 homolog
  5. Myeloperoxidase
  6. Probable protein disulfide isomerase ER-60 precursor
  7. NPM1 Nucleophosmin
  8. CD34
  9. Elongation factor-1-beta
  10. CD24
relevance summary categories: known therapeutic target, potential therapeutic target, known marker, potential marker, no link found (per-gene markings shown in the original table)
111
Practical Extensions
- multiclass classification
- ranking problems
- confidence-rated predictions
112
Hard Predictions Can Slow Learning
[diagram: data separated by a line L]
ideally, want weak classifier that says:
  h(x) = +1 if x above L, "don't know" else
problem: cannot express using "hard" predictions
if must predict ±1 below L, will introduce many bad predictions
- need to "clean up" on later rounds
- dramatically increases time to convergence
113
Confidence-Rated Predictions
[with Singer]
useful to allow weak classifiers to assign confidences to predictions
formally, allow ht : X → R
  sign(ht(x)) = prediction
  |ht(x)| = "confidence"
use identical update:
  Dt+1(i) = Dt(i) exp( −αt yi ht(xi) ) / Zt
and identical rule for combining weak classifiers
question: how to choose αt and ht on each round
114
Confidence-Rated Predictions (cont.)
saw earlier:
  training error(H_final) ≤ Πt Zt = (1/m) Σi exp( −yi Σt αt ht(xi) )
therefore, on each round t, should choose αt ht to minimize:
  Zt = Σi Dt(i) exp( −αt yi ht(xi) )
in many cases (e.g., decision stumps), best confidence-rated weak classifier has simple form that can be found efficiently
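For instance, for a weak classifier that partitions the domain into blocks (like a stump), the confidence-rated framework gives a simple closed form for the Zt-minimizing prediction on each block. A hedged Python sketch (the smoothing constant is an added safeguard for empty blocks, not part of the slides):

import numpy as np

def block_confidences(block_of, y, D, smooth=1e-8):
    # For each block b, the Z_t-minimizing confidence is 0.5 * ln(W_plus / W_minus),
    # where W_plus (W_minus) is the total weight under D of block-b examples
    # labeled +1 (-1).
    block_of = np.asarray(block_of)
    confs = {}
    for b in np.unique(block_of):
        in_b = block_of == b
        w_plus = D[in_b & (y == +1)].sum() + smooth
        w_minus = D[in_b & (y == -1)].sum() + smooth
        confs[b] = 0.5 * np.log(w_plus / w_minus)
    return confs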
115
Confidence-Rated Predictions Help a Lot
[plot: % train/test error vs. number of rounds, with and without confidence-rated predictions]

  % error   round first reached (conf.)   round first reached (no conf.)   speedup
  40        268                           16,938                           63.2
  35        598                           65,292                           109.2
  30        1,888                         >80,000
116
Application: Boosting for Text Categorization
[with Singer]
weak classifiers: very simple weak classifiers that test on simple patterns, namely, (sparse) n-grams
- find parameter αt and rule ht of given form which minimize Zt
- use efficiently implemented exhaustive search
"How may I help you" data:
- 7844 training examples
- 1000 test examples
- categories: AreaCode, AttService, BillingCredit, CallingCard, Collect, Competitor, DialForMe, Directory, HowToDial, PersonToPerson, Rate, ThirdNumber, Time, TimeCharge, Other.
117
Weak Classifiers
[table: weak classifiers from rounds 1-6, each testing for a term ("collect", "card", "my home", "person ? person", "code", "I") with confidence-rated predictions for each of the 15 categories shown as bars in the original]
118
More Weak Classifiers
[table: weak classifiers from rounds 7-13, testing for the terms "time", "wrong number", "how", "call", "seven", "trying to", "and"]
119
More Weak Classifiers
[table: weak classifiers from rounds 14-19, testing for the terms "third", "to", "for", "charges", "dial", "just"]
120
Finding Outliers
examples with most weight are often outliers (mislabeled and/or ambiguous)
- "I'm trying to make a credit card call" (Collect)
- "hello" (Rate)
- "yes I'd like to make a long distance collect call please" (CallingCard)
- "calling card please" (Collect)
- "yeah I'd like to use my calling card number" (Collect)
- "can I get a collect call" (CallingCard)
- "yes I would like to make a long distant telephone call and have the charges billed to another number" (CallingCard DialForMe)
- "yeah I can not stand it this morning I did oversea call is so bad" (BillingCredit)
- "yeah special offers going on for long distance" (AttService Rate)
- "mister allen please william allen" (PersonToPerson)
- "yes ma'am I I'm trying to make a long distance call to a non dialable point in san miguel philippines" (AttService Other)
121
Advanced Topics
- optimal accuracy
- optimal efficiency
- boosting in continuous time
122
Advanced Topics
- optimal accuracy
- optimal efficiency
- boosting in continuous time
123
Optimal Accuracy
[Bartlett & Traskin]
usually, impossible to get perfect accuracy due to intrinsic noise or uncertainty
"Bayes optimal error" = best possible error of any classifier (usually > 0)
can prove AdaBoost's classifier converges to Bayes optimal ("universally consistent") if:
- enough data
- run for many (but not too many) rounds
- weak classifiers sufficiently rich
related results: [Jiang], [Lugosi & Vayatis], [Zhang & Yu], . . .
means:
- AdaBoost can (theoretically) learn "optimally" even in noisy settings
- but: does not explain why works when run for very many rounds
124
Boosting and Noise
[Long & Servedio]
can construct data source on which AdaBoost fails miserably with even tiny amount of noise (say, 1%)
- Bayes optimal error = 1% (obtainable by classifier of same form as AdaBoost)
- AdaBoost provably has error ≥ 50%
holds even if:
- given unlimited training data
- use any method for minimizing exponential loss
also holds:
- for most other convex losses
- even if add regularization
- e.g. applies to SVMs, logistic regression, . . .
125
Boosting and Noise (cont.)
shows:
- consistency result can fail badly if weak classifiers not rich enough
- AdaBoost (and lots of other loss-based methods) susceptible to noise
- regularization might not help
how to handle noise?
- on real-world datasets, AdaBoost often works anyway
- various theoretical algorithms based on branching programs (e.g., [Kalai & Servedio], [Long & Servedio])
126
Advanced Topics
- optimal accuracy
- optimal efficiency
- boosting in continuous time
127
Optimal Efficiency
for AdaBoost, saw:  training error ≤ e^{−2γ²T}
is AdaBoost most efficient boosting algorithm?  no!
given T rounds and γ-weak learning assumption, boost-by-majority (BBM) algorithm [Freund] is provably exactly best possible:
  training error ≤ Σ_{j=0}^{⌊T/2⌋} (T choose j) (1/2 + γ)^j (1/2 − γ)^{T−j}
(probability of ≤ T/2 heads in T coin flips if probability of heads = 1/2 + γ)
AdaBoost's training error is like Chernoff approximation of BBM's
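A quick comparison of the two bounds in Python for illustrative values of γ and T (a sketch; the particular numbers are not from the slides):

import math

def bbm_bound(gamma, T):
    # Binomial tail: probability of at most T/2 heads with heads-probability 1/2 + gamma.
    p = 0.5 + gamma
    return sum(math.comb(T, j) * p**j * (1 - p)**(T - j) for j in range(T // 2 + 1))

def adaboost_bound(gamma, T):
    return math.exp(-2 * gamma**2 * T)

print(bbm_bound(0.1, 100), adaboost_bound(0.1, 100))  # the BBM bound is the smaller (tighter) one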
128
Weighting Functions: AdaBoost versus BBM
[plots: weighting functions of AdaBoost (left) and BBM (right) as functions of the unnormalized margin s]
both put more weight on harder examples, but BBM "gives up" on very hardest examples
- may make more robust to noise
problem: BBM not adaptive
- need to know γ and T a priori
129
Advanced Topics
- optimal accuracy
- optimal efficiency
- boosting in continuous time
130
Boosting in Continuous Time
[Freund]
idea: let γ get very small so that γ-weak learning assumption eventually satisfied
- need to make T correspondingly large
if scale time to begin at 0 and end at 1, then each boosting round takes time 1/T
in limit T → ∞, boosting is happening in "continuous time"
131
BrownBoost
algorithm has sensible limit called BrownBoost
(due to connection to Brownian motion)
harder to implement, but potentially more resistant to noise and outliers, e.g.:

  dataset    noise   AdaBoost   BrownBoost
  letter     0%      3.7        4.2
  letter     10%     10.8       7.0
  letter     20%     15.7       10.5
  satimage   0%      4.9        5.2
  satimage   10%     12.1       6.2
  satimage   20%     21.3       7.4

[Cheamanunkul, Ettinger & Freund]
132
Conclusions
from different perspectives, AdaBoost can be interpreted as:
- a method for boosting the accuracy of a weak learner
- a procedure for maximizing margins
- an algorithm for playing repeated games
- a numerical method for minimizing exponential loss
- an iterative-projection algorithm based on an information-theoretic geometry
none is entirely satisfactory by itself, but each useful in its own way
taken together, create rich theoretical understanding
- connect boosting to other learning problems and techniques
- provide foundation for versatile set of methods with many extensions, variations and applications
133
References
Robert E. Schapire and Yoav Freund.
Boosting: Foundations and Algorithms. MIT Press, 2012.
134