mahout-user mailing list archives

From: Ted Dunning
Subject: Re: What's the best method or strategy to train a bayes classifier on a multi labeled training set ?
Date: Sat, 04 Sep 2010 06:19:03 GMT
Multi-label classification is a classic problem and raises many difficulties.
Currently Mahout has classifiers that do 1-of-n classification, which is a
useful basis for multi-label classification, but it isn't the final answer by
any means.

As a simple start, you can build multiple binary classifiers, one for each
category.  In practice, you want to do better than this for several reasons:

a) there is often a logical structure expressed as constraints on membership
in different categories.  This is especially true when one category is a
subset of another.

b) information about membership in one category is very informative about
membership in other categories.

c) learning with the constraints from (a) and (b) can make much more efficient
use of data than just learning the multiple binary classifiers independently.
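
The simple starting point described above (one binary classifier per
category) can be sketched roughly as follows.  This is plain Python rather
than Mahout code, and the tiny naive Bayes class, the function names, and the
toy data are all made up for illustration.  Each unique page is stored once
with its set of categories and is reused as a positive or negative example
for every classifier, so pages are not copied once per category.

```python
import math
from collections import Counter


class BinaryNB:
    """Tiny multinomial naive Bayes for one yes/no category (illustrative only)."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: list of bools (in category or not)
        self.counts = {True: Counter(), False: Counter()}
        self.ndocs = {True: 0, False: 0}
        for tokens, y in zip(docs, labels):
            self.counts[y].update(tokens)
            self.ndocs[y] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def log_odds(self, tokens):
        # log P(yes | tokens) - log P(no | tokens), with add-one smoothing
        total = self.ndocs[True] + self.ndocs[False]
        score = {}
        for y in (True, False):
            denom = sum(self.counts[y].values()) + len(self.vocab)
            s = math.log((self.ndocs[y] + 1) / (total + 2))
            for t in tokens:
                if t in self.vocab:  # ignore tokens never seen in training
                    s += math.log((self.counts[y][t] + 1) / denom)
            score[y] = s
        return score[True] - score[False]


def train_one_per_category(docs, label_sets):
    # One binary problem per category: positive if the page carries that label.
    categories = sorted(set().union(*label_sets))
    return {c: BinaryNB().fit(docs, [c in ls for ls in label_sets])
            for c in categories}


def predict(models, tokens, threshold=0.0):
    # A page gets every category whose classifier scores above the threshold.
    return {c for c, m in models.items() if m.log_odds(tokens) > threshold}


# Hypothetical toy data: each page appears once, with a set of labels.
DOCS = [
    ["python", "code", "tutorial"],
    ["football", "score", "league"],
    ["python", "compiler", "code"],
    ["school", "tutorial", "course"],
    ["league", "football", "coach"],
]
LABELS = [
    {"programming", "education"},
    {"sports"},
    {"programming"},
    {"education"},
    {"sports"},
]

models = train_one_per_category(DOCS, LABELS)
print(predict(models, ["football", "league"]))
```

The threshold of 0.0 is an arbitrary choice for the sketch; in practice it
would probably be tuned per category.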

As a second hack after the set of binary classifiers, you can build a second
set of binary classifiers that has all the inputs of the first as well as
the outputs of the first set of classifiers.  This will often get you very
far down the road.
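
A rough sketch of this two-stage idea, again in plain Python rather than
Mahout code: the second set of classifiers sees the original tokens plus one
pseudo-token for each category the first set fires on.  The helper names, the
"__stage1:" token prefix, and the toy data are assumptions made for the
example.  For brevity the first-stage outputs are computed on the training
data itself; using held-out predictions would likely be safer.

```python
import math
from collections import Counter


def fit_nb(docs, labels):
    """Fit a tiny binary naive Bayes; returns a log-odds scoring function."""
    counts = {True: Counter(), False: Counter()}
    ndocs = Counter(labels)
    for tokens, y in zip(docs, labels):
        counts[y].update(tokens)
    vocab = set(counts[True]) | set(counts[False])
    totals = {y: sum(counts[y].values()) + len(vocab) for y in (True, False)}

    def log_odds(tokens):
        # Smoothed prior log-odds plus per-token log-likelihood ratios.
        s = math.log((ndocs[True] + 1) / (ndocs[False] + 1))
        for t in tokens:
            if t in vocab:  # ignore tokens never seen in training
                s += math.log((counts[True][t] + 1) / totals[True])
                s -= math.log((counts[False][t] + 1) / totals[False])
        return s

    return log_odds


def fit_stage(docs, label_sets, categories):
    # One binary classifier per category.
    return {c: fit_nb(docs, [c in ls for ls in label_sets]) for c in categories}


def augment(stage1, tokens):
    # Original inputs plus one pseudo-token per category that stage 1 fires on.
    fired = ["__stage1:" + c for c, f in stage1.items() if f(tokens) > 0]
    return list(tokens) + fired


def fit_stacked(docs, label_sets):
    categories = sorted(set().union(*label_sets))
    stage1 = fit_stage(docs, label_sets, categories)
    # Simplification: stage-1 outputs are in-sample here, not held out.
    stage2 = fit_stage([augment(stage1, d) for d in docs], label_sets, categories)
    return stage1, stage2


def predict_stacked(stage1, stage2, tokens):
    x = augment(stage1, tokens)
    return {c for c, f in stage2.items() if f(x) > 0}


# Hypothetical toy data: each page appears once, with a set of labels.
DOCS = [
    ["python", "code", "tutorial"],
    ["football", "score", "league"],
    ["python", "compiler", "code"],
    ["school", "tutorial", "course"],
    ["league", "football", "coach"],
]
LABELS = [
    {"programming", "education"},
    {"sports"},
    {"programming"},
    {"education"},
    {"sports"},
]

stage1, stage2 = fit_stacked(DOCS, LABELS)
print(predict_stacked(stage1, stage2, ["football", "league"]))
```

Feeding the first-stage outputs in as extra features is what lets the second
stage pick up on relationships between categories, such as one category
usually implying another.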

As your data gets larger, these subtleties get less important and
scalability gets more important.

Let us hear what you decide and how your results turn out.

On Fri, Sep 3, 2010 at 4:19 AM, jun li <> wrote:

> for example, I want to train a webpage classifier on dmoz. (
> but many URLs in dmoz belong to multiple categories.
> How do I prepare the training set to get a better bayes classifier?
> The unique links only amount to millions of pages, but if I consider
> multiple categories, there will be hundreds of millions of pages (adding up
> every page in its corresponding categories).
> thanks.
