mahout-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Options in TrainClassifier.java
Date Mon, 20 Sep 2010 03:25:18 GMT
I am watching these efforts with interest, but have been unable to
contribute much to the process.  I would encourage Joe and others to keep
whittling this problem down so that we can understand what is causing it.

In the meantime, I think that the SGD classifiers are close to production
quality.  For problems with fewer than several million training examples, and
especially problems with many sparse features, I think that these
classifiers might be easier to get started with than the Naive Bayes
classifiers.  To make a virtue of a defect, the SGD-based classifiers do not
use Hadoop for training.  This makes deployment of a classification training
workflow easier, but limits the total size of data that can be handled.
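
As a rough illustration of why no Hadoop job is needed, single-machine SGD
training for logistic regression over sparse features can be sketched as
below. This is plain Java, not Mahout's SGD API. The class name SgdSketch,
the parallel index/value arrays, the learning rate, and the toy data are all
assumptions made for the example.

```java
/** Minimal single-machine SGD logistic regression over sparse features (illustrative only). */
public class SgdSketch {
    private final double[] weights;
    private final double learningRate;

    public SgdSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** Probability of class 1 for a sparse example given as parallel index/value arrays. */
    public double predict(int[] idx, double[] val) {
        double z = 0.0;
        for (int i = 0; i < idx.length; i++) {
            z += weights[idx[i]] * val[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** One SGD step: only the weights of the example's non-zero features are updated. */
    public void train(int label, int[] idx, double[] val) {
        double error = label - predict(idx, val);
        for (int i = 0; i < idx.length; i++) {
            weights[idx[i]] += learningRate * error * val[i];
        }
    }

    public static void main(String[] args) {
        SgdSketch model = new SgdSketch(4, 0.5);
        // Toy data: feature 0 marks class 1, feature 1 marks class 0, feature 3 acts as a bias.
        int[][] idx = {{0, 3}, {1, 3}};
        double[][] val = {{1.0, 1.0}, {1.0, 1.0}};
        int[] labels = {1, 0};
        // Examples are consumed one at a time, so memory use depends on the
        // number of features rather than the number of training examples.
        for (int pass = 0; pass < 200; pass++) {
            for (int i = 0; i < labels.length; i++) {
                model.train(labels[i], idx[i], val[i]);
            }
        }
        System.out.println("p(class 1 | feature 0) = " + model.predict(idx[0], val[0]));
        System.out.println("p(class 1 | feature 1) = " + model.predict(idx[1], val[1]));
    }
}
```

Each update touches only the non-zero features of one example, which is why
this style of training suits sparse data and runs in a single process. The
trade-off is the one noted above: all of the training happens on one machine.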

What would you guys need to get started with trying these alternative
models?

On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
<npk.gangadhar@gmail.com> wrote:

> Joe,
> I also tried reducing the number of countries in country.txt.
> That didn't help. In my case, I was monitoring the disk space, and
> at no time did it reach 0%. So I am not sure that disk space is the cause. To
> remove the dependency on the number of countries, I even tried using
> subjects.txt as the classification; that did not help either.
> I think this problem is due to the type of the data being processed,
> but what I am not sure of is what I need to change to get the data to
> be processed successfully.
>
> The experienced folks on Mahout will be able to tell us what is missing, I guess.
>
> Thank you
> Gangadhar
>
> On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <joekumar@gmail.com> wrote:
> > Gangadhar,
> >
> > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to just have
> > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create the
> > wikipediainput data set and then ran TrainClassifier and it worked. When I
> > ran TestClassifier as below, I got blank results in the output.
> >
> > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d
> >  wikipediainput  -ng 3 -type bayes -source hdfs
> >
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances          :          0         ?%
> > Incorrectly Classified Instances        :          0         ?%
> > Total Classified Instances              :          0
> >
> > =======================================================
> > Confusion Matrix
> > -------------------------------------------------------
> > a     <--Classified as
> > 0     |  0     a     = spain
> > Default Category: unknown: 1
> >
> > I am not sure if I am doing something wrong.. have to figure out why my o/p
> > is so blank.
> > I'll document these steps and mention about country.txt in the wiki.
> >
> > Question to all
> > Should we have 2 country.txt
> >
> >   1. country_full_list.txt - this is the existing list
> >   2. country_sample_list.txt - a list with 2 or 3 countries
> >
> > To get a flavor of the wikipedia bayes example, we can use
> > country_sample_list.txt. When new people want to just try out the example, they
> > can reference this txt file as a parameter.
> > To run the example in a robust scalable infrastructure, we could use
> > country_full_list.txt.
> > any thots ?
> >
> > regards
> > Joe.
> >
> > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <joekumar@gmail.com> wrote:
> >
> >> Gangadhar,
> >>
> >> After running TrainClassifier again, the map task just failed with the same
> >> exception and I am pretty sure it is an issue with disk space.
> >> As the map was progressing, I was monitoring my free disk space dropping
> >> from 81GB. It came down to 0 after almost 66% through the map task and then
> >> the exception happened. After the exception, another map task was resuming
> >> at 33% and I got close to 15GB free space (i guess the first map task freed
> >> up some space) and I am sure they would drop down to zero again and throw
> >> the same exception.
> >> I am going to modify the country.txt to just 1 country and recreate
> >> wikipediainput and run TrainClassifier. Will let you know how it goes..
> >>
> >> Do we have any benchmarks / system requirements for running this example ?
> >> Has anyone else had success running this example anytime. Would appreciate
> >> your inputs / thots.
> >>
> >> Should we look at tuning the code for handling these situations ? Any quick
> >> suggestions on where to start looking at ?
> >>
> >> regards,
> >> Joe.
> >>
> >>
> >>
> >>
> >
>
