
An empirical study of the naive Bayes classifier
I. Rish
T.J. Watson Research Center
[email protected]
Abstract

The naive Bayes classifier greatly simplifies learning by assuming that features are independent given class. Although independence is generally a poor assumption, in practice naive Bayes often competes well with more sophisticated classifiers.

Our broad goal is to understand the data characteristics which affect the performance of naive Bayes. Our approach uses Monte Carlo simulations that allow a systematic study of classification accuracy for several classes of randomly generated problems. We analyze the impact of the distribution entropy on the classification error, showing that low-entropy feature distributions yield good performance of naive Bayes. We also demonstrate that naive Bayes works well for certain nearly functional feature dependencies, thus reaching its best performance in two opposite cases: completely independent features (as expected) and functionally dependent features (which is surprising). Another surprising result is that the accuracy of naive Bayes is not directly correlated with the degree of feature dependencies measured as the class-conditional mutual information between the features. Instead, a better predictor of naive Bayes accuracy is the amount of information about the class that is lost because of the independence assumption.

(Author's address: T.J. Watson Research Center, 30 Saw Mill River Road, Hawthorne, NY 10532. Phone +1 (914) 784-7431.)

1 Introduction

Bayesian classifiers assign the most likely class to a given example described by its feature vector. Learning such classifiers can be greatly simplified by assuming that features are independent given class, that is,

P(X | C) = ∏_{i=1}^{n} P(X_i | C),

where X = (X_1, ..., X_n) is a feature vector and C is a class. Despite this unrealistic assumption, the resulting classifier known as naive Bayes is remarkably successful in practice, often competing with much more sophisticated techniques [6; 8; 4; 2]. Naive Bayes has proven effective in many practical applications, including text classification, medical diagnosis, and systems performance management [2; 9; 5].

The success of naive Bayes in the presence of feature dependencies can be explained as follows: optimality in terms of zero-one loss (classification error) is not necessarily related to the quality of the fit to a probability distribution (i.e., the appropriateness of the independence assumption). Rather, an optimal classifier is obtained as long as both the actual and estimated distributions agree on the most probable class [2]. For example, [2] prove naive Bayes optimality for some problem classes that have a high degree of feature dependencies, such as disjunctive and conjunctive concepts.

However, this explanation is too general and therefore not very informative. Ultimately, we would like to understand the data characteristics which affect the performance of naive Bayes. While most of the work on naive Bayes compares its performance to other classifiers on particular benchmark problems (e.g., UCI benchmarks), our approach uses Monte Carlo simulations that allow a more systematic study of classification accuracy on parametric families of randomly generated problems. Also, our current analysis is focused only on the bias of the naive Bayes classifier, not on its variance. Namely, we assume an infinite amount of data (i.e., a perfect knowledge of the data distribution), which allows us to separate the approximation error (bias) of naive Bayes from the error induced by the training sample set size (variance).

We analyze the impact of the distribution entropy on the classification error, showing that certain almost-deterministic, or low-entropy, dependencies yield good performance of naive Bayes. (Note that almost-deterministic dependencies are a common characteristic in many practical problem domains, such as, for example, computer system management and error-correcting codes.) We show that the error of naive Bayes vanishes as the entropy H(P(x | c)) approaches zero. Another class of almost-deterministic dependencies generalizes functional dependencies between the features. Particularly, we show that naive Bayes works best in two cases: completely independent features (as expected) and functionally dependent features (which is less obvious), while reaching its worst performance between these extremes.

We also show that, surprisingly, the accuracy of naive Bayes is not directly correlated with the degree of feature dependencies measured as the class-conditional mutual information between the features, I(X_i; X_j | C) (X_i and X_j are features and C is the class). Instead, our experiments reveal that a better predictor of naive Bayes accuracy can be
the loss of information that features contain about the class when assuming the naive Bayes model, namely I(X; C) - I_NB(X; C), where I_NB(X; C) is the mutual information between features and class under the naive Bayes assumption.

This paper is structured as follows. In the next section we provide necessary background and definitions. Section 3 discusses naive Bayes performance for nearly-deterministic dependencies, while Section 4 demonstrates that the "information loss" criterion can be a better error predictor than the strength of feature dependencies. A summary and conclusions are given in Section 5.

2 Definitions and Background

Let X = (X_1, ..., X_n) be a vector of observed random variables, called features, where each feature takes values from its domain D_i. The set of all feature vectors (examples, or states) is denoted Ω = D_1 × ... × D_n. Let C ∈ {0, ..., m-1} be an unobserved random variable denoting the class of an example, where C can take one of m values. Capital letters, such as X_i, will denote variables, while lowercase letters, such as x_i, will denote their values; boldface letters will denote vectors.

A function g(x), where g: Ω → {0, ..., m-1}, denotes a concept to be learned. Deterministic g(x) corresponds to a concept without noise, which always assigns the same class to a given example (e.g., disjunctive and conjunctive concepts are deterministic). In general, however, a concept can be noisy, yielding a random function g(x).

A classifier is defined by a (deterministic) function h: Ω → {0, ..., m-1} (a hypothesis) that assigns a class to any given example. A common approach is to associate each class i with a discriminant function f_i(x), i = 0, ..., m-1, and let the classifier select the class with maximum discriminant function on a given example: h(x) = argmax_{i ∈ {0,...,m-1}} f_i(x).

The Bayes classifier h_BO(x) (that we also call the Bayes-optimal classifier) uses as discriminant functions the class posterior probabilities given a feature vector, i.e. f_i*(x) = P(C = i | X = x). Applying Bayes rule gives P(C = i | X = x) = P(X = x | C = i) P(C = i) / P(X = x), where P(X = x) is identical for all classes, and therefore can be ignored. This yields the Bayes discriminant functions

f_i*(x) = P(X = x | C = i) P(C = i),    (1)

where P(X = x | C = i) is called the class-conditional probability distribution (CPD). Thus, the Bayes classifier

h_BO(x) = argmax_i P(X = x | C = i) P(C = i)    (2)

finds the maximum a posteriori probability (MAP) hypothesis given example x. However, direct estimation of P(X = x | C = i) from a given set of training examples is hard when the feature space is high-dimensional. Therefore, approximations are commonly used, such as the simplifying assumption that features are independent given the class. This yields the naive Bayes classifier, defined by the discriminant functions

f_i^NB(x) = ∏_{j=1}^{n} P(X_j = x_j | C = i) P(C = i).    (3)

The probability of a classification error, or risk, of a classifier h is defined as

L(h) = E_x[ P(h(x) ≠ g(x) | x) ],

where E_x is the expectation over P(x). L(h_BO) denotes the Bayes error (Bayes risk).

We say that a classifier is optimal on a given problem if its risk coincides with the Bayes risk. Assuming there is no noise (i.e. zero Bayes risk), a concept is called separable by a set of functions f_i(x), i = 0, ..., m-1, if every example x is classified correctly when using each f_i as a discriminant function.

As a measure of dependence between two features X_i and X_j we use the class-conditional mutual information [1], which can be defined as

I(X_i; X_j | C) = H(X_i | C) + H(X_j | C) - H(X_i, X_j | C),

where H(X | C) is the class-conditional entropy of X, defined as

H(X | C) = - ∑_c P(C = c) ∑_x P(x | c) log P(x | c).

Mutual information is zero when X_i and X_j are mutually independent given class C, and increases with increasing level of dependence, reaching the maximum when one feature is a deterministic function of the other.

3 When does naive Bayes work well? Effects of some nearly-deterministic dependencies

In this section, we discuss known limitations of naive Bayes and then some conditions of its optimality and near-optimality, which include low-entropy feature distributions and nearly-functional feature dependencies.

3.1 Concepts without noise

We focus first on concepts with P(C = 0 | x) equal to 0 or 1 for any x (i.e. no noise), which therefore have zero Bayes risk. The features are assumed to have finite domains (the i-th feature has k_i values), and are often called nominal. (A nominal feature can be transformed into a numeric one by imposing an order on its domain.) Our attention will be restricted to binary classification problems where the class is either 0 or 1.

Some limitations of naive Bayes are well known: in case of binary features (k_i = 2 for all i), it can only learn linear discriminant functions [3], and thus it is always suboptimal for non-linearly separable concepts (the classical example is the XOR function; another one is m-of-n concepts [7; 2]). When k_i > 2 for some features, naive Bayes is able to learn (some) polynomial discriminant functions [3]; thus, polynomial separability is a necessary, although not sufficient, condition of naive Bayes optimality for concepts with finite-domain features.

Despite its limitations, naive Bayes was shown to be optimal for some important classes of concepts that have a high degree of feature dependencies, such as disjunctive and conjunctive concepts [2]. These results can be generalized to concepts with any nominal features (see [10] for details).
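The linear-discriminant limitation above can be checked directly on the XOR concept. Below is a minimal sketch (mine, not from the paper), assuming two binary features, a uniform P(x) over the four examples, and the infinite-data setting used throughout: the naive Bayes discriminants of Equation (3) tie on every example, so naive Bayes cannot separate XOR.

```python
from itertools import product

# XOR concept: class g(x1, x2) = x1 ^ x2, with uniform P(x) over the 4 examples.
examples = {(x1, x2): x1 ^ x2 for x1, x2 in product([0, 1], repeat=2)}

# Exact class prior and class-conditional marginals P(Xj = v | C = c),
# computed from the full distribution (perfect knowledge, zero variance).
prior = {c: sum(1 for g in examples.values() if g == c) / 4 for c in (0, 1)}
marginal = {
    (j, v, c): sum(1 for x, g in examples.items() if x[j] == v and g == c)
    / (4 * prior[c])
    for j in (0, 1) for v in (0, 1) for c in (0, 1)
}

# Naive Bayes discriminant f_c(x) = P(x1|c) P(x2|c) P(c), as in Equation (3).
def f(x, c):
    return marginal[(0, x[0], c)] * marginal[(1, x[1], c)] * prior[c]

for x in examples:
    print(x, f(x, 0), f(x, 1))  # the two discriminants are equal for every x
```

Every discriminant evaluates to 0.5 * 0.5 * 0.5 = 0.125, so whichever way ties are broken, naive Bayes misclassifies half of the examples of XOR.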
Theorem 1 [10] The naive Bayes classifier is optimal for any two-class concept with nominal features that assigns class 0 to exactly one example, and class 1 to the other examples, with probability 1.¹
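Theorem 1 is easy to verify numerically. The sketch below is my own (the choice of k and of the class-0 example is arbitrary, and this is not the authors' generator): it builds such a concept over two 10-valued features, assumes a uniform P(x), and checks that naive Bayes with exact knowledge of the distribution classifies all k² examples correctly.

```python
from itertools import product

k = 10                      # domain size of each of the two features
x_star = (3, 7)             # the single class-0 example (arbitrary choice)
states = list(product(range(k), repeat=2))

def concept(x):             # class 0 for exactly one example, class 1 otherwise
    return 0 if x == x_star else 1

# Exact prior and class-conditional marginals under uniform P(x) = 1/k^2.
prior = {c: sum(1 for x in states if concept(x) == c) / len(states)
         for c in (0, 1)}

def marginal(j, v, c):
    return sum(1 for x in states if x[j] == v and concept(x) == c) / (
        len(states) * prior[c])

def nb_class(x):            # arg max of the naive Bayes discriminants, Eq. (3)
    scores = {c: marginal(0, x[0], c) * marginal(1, x[1], c) * prior[c]
              for c in (0, 1)}
    return max(scores, key=scores.get)

errors = sum(nb_class(x) != concept(x) for x in states)
print(errors)  # 0: naive Bayes is optimal on this concept
```

The single class-0 example wins its discriminant comparison because P(X_j = x_j | C = 0) = 1 concentrates all class-0 mass on one state, which mirrors the low-entropy intuition developed below.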
The performance of naive Bayes degrades with an increasing number of class-0 examples (i.e., with increasing prior P(C = 0), also denoted P(0)), as demonstrated in Figure 1a. This figure plots the average naive Bayes error computed over 1000 problem instances generated randomly for each value of P(0). The problem generator, called ZeroBayesRisk, assumes n features (here we only consider two features), each having k values, and varies the number of class-0 examples from 1 to k²/2 (so that P(0) varies from 1/k² to 0.5; the results for P(0) > 0.5 are symmetric).² As expected, larger P(0) (equivalently, more class-0 examples) yields a wider range of problems with various dependencies among the features, which results in increased errors of naive Bayes; a closer look at the data shows no other cases of optimality besides P(0) = 1/k².

Surprisingly, the strength of inter-feature dependencies, measured as the class-conditional mutual information I(X_1; X_2 | C), is not a good predictor of naive Bayes performance: while the average naive Bayes error increases monotonically with P(0), the mutual information is non-monotone, reaching its maximum around P(0) = 0.3. This observation is consistent with previous empirical results on UCI benchmarks [2] that also revealed low correlation between the degree of feature dependence and the relative performance of naive Bayes with respect to other classifiers, such as C4.5, CN2, and PEBLS.

It turns out that the entropy of the class-conditional marginal distribution, P(X_1 | C = 0), is a better predictor of naive Bayes performance. Intuitively, low entropy of P(X_1 | C = 0) means that most of the 0s are "concentrated around" one state (in the limit, this yields the optimality condition stated by Theorem 1). Indeed, plotting the average H(P(X_1 | C = 0)) in Figure 1a demonstrates that both the average error and the average entropy increase monotonically in P(0). Further discussion of low-entropy distributions is given next in the more general context of noisy (non-zero Bayes risk) classification problems.

¹ Clearly, this also holds in case of a single example of class 1.
² Note that in all experiments perfect knowledge of the data distribution (i.e., an infinite amount of data) is assumed in order to avoid the effect of finite sample size.

Figure 1 (plots omitted): (a) results for the generator ZeroBayesRisk (k = 10, 1000 instances): average naive Bayes error (NBerr), class-conditional mutual information between the features, I(X_1; X_2 | C), and entropy of the marginal distribution, H(P(X_1 | C = 0)); the error bars correspond to the standard deviation of each measurement across the 1000 problem instances; (b) results for the generator EXTREME: average Bayes and naive Bayes errors and average I(X_1; X_2 | C); (c) results for the generator FUNC1: average difference between the naive Bayes error and the Bayes error (δ constant for all features), and scaled I(X_1; X_2 | C) (divided by 300).

3.2 Noisy concepts

Low-entropy feature distributions

Generally, concepts can be noisy, i.e. can have a non-deterministic P(C | x) and thus a non-zero Bayes risk. A natural extension of the conditions of Theorem 1 to noisy concepts yields low-entropy, or "extreme", probability distributions, having almost all the probability mass concentrated in one state. Indeed, as shown in [10], the independence assumption becomes more accurate with decreasing entropy, which yields an asymptotically optimal performance of naive Bayes. Namely:

Theorem 2 [10] Given that one of the following conditions hold:
1. a joint probability distribution P(X_1, ..., X_n | C = 0) is such that P(X_1 = x_1, ..., X_n = x_n | C = 0) ≥ 1 - δ for some state (x_1, ..., x_n), or

... for each feature, P(X_i | C = 0) and P(X_i | C = 1), for class 0 and class 1, respectively. Finally, it creates class-conditional joint feature distributions satisfying the following conditions:
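The Monte Carlo methodology of Section 3.1 can be sketched in a few lines. This is a hedged reconstruction, not the authors' ZeroBayesRisk generator: for a chosen number of class-0 examples it samples random no-noise concepts over two k-valued features, assumes a uniform P(x), and computes the exact naive Bayes error, which is pure bias here since the distribution is known and the Bayes risk is zero.

```python
import random
from itertools import product

def nb_error(k, n_zero, rng):
    """Exact naive Bayes error for one random two-feature concept with
    n_zero class-0 examples and uniform P(x) = 1/k^2 (infinite-data setting)."""
    states = list(product(range(k), repeat=2))
    zeros = set(rng.sample(states, n_zero))          # the class-0 examples
    prior = {0: n_zero / len(states), 1: 1 - n_zero / len(states)}

    def marginal(j, v, c):
        members = zeros if c == 0 else set(states) - zeros
        return sum(1 for x in members if x[j] == v) / len(members)

    def nb_class(x):                                 # arg max of Eq. (3)
        scores = {c: marginal(0, x[0], c) * marginal(1, x[1], c) * prior[c]
                  for c in (0, 1)}
        return max(scores, key=scores.get)

    wrong = sum(nb_class(x) != (0 if x in zeros else 1) for x in states)
    return wrong / len(states)

rng = random.Random(0)
k = 10
for n_zero in (1, 10, 30, 50):
    avg = sum(nb_error(k, n_zero, rng) for _ in range(100)) / 100
    print(n_zero, round(avg, 4))
```

With n_zero = 1 the error is always zero, as Theorem 1 predicts; the averaged error grows as the number of class-0 examples increases, qualitatively matching the trend reported for Figure 1a.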