mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Getting Started with Classification
Date Wed, 22 Jul 2009 02:41:17 GMT
I have been doing some work on classification (of Wikipedia) and am
having a hard time actually running the Test classifier.  I trained on
a couple of categories (history and science) with quite a few docs, but
the resulting model is so big that I can't load it, even with almost
3 GB of memory.  I'm wondering what people would recommend here.

One thought is that our code is heavily String/Text based.  I also
notice that the maps used to load the models start at their default
sizes, which probably means we are resizing a lot.  Should we keep
using Strings, or would it be better to have some custom Writables and
keep track of the actual terms separately, kind of like the doc
clustering does?  We could also track the size up front so we can
avoid resizing.
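
To make the question above concrete, here is a minimal sketch of the
two ideas: keep each term String once in a dictionary and refer to it
by an int id everywhere else, and size the map up front from a known
term count so it does not rehash while loading.  This is not Mahout's
actual code; TermDictionary, WeightVector, and the expectedTerms
parameter are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Mahout class): stores each distinct term once. */
final class TermDictionary {
    private final Map<String, Integer> idsByTerm;
    private final List<String> termsById;

    /** expectedTerms is an assumed, non-negative count recorded at training time. */
    TermDictionary(int expectedTerms) {
        // HashMap rehashes once its size exceeds capacity * loadFactor
        // (0.75 by default), so ask for enough capacity up front.
        int capacity = (int) (expectedTerms / 0.75f) + 1;
        idsByTerm = new HashMap<>(capacity);
        termsById = new ArrayList<>(expectedTerms);
    }

    /** Returns the existing id for a term, or assigns the next dense id. */
    int idFor(String term) {
        Integer id = idsByTerm.get(term);
        if (id == null) {
            id = termsById.size();
            idsByTerm.put(term, id);
            termsById.add(term);
        }
        return id;
    }

    String termFor(int id) {
        return termsById.get(id);
    }

    int size() {
        return termsById.size();
    }
}

/** Hypothetical per-label weight store indexed by term id instead of String. */
final class WeightVector {
    private double[] weights;

    WeightVector(int expectedTerms) {
        weights = new double[Math.max(1, expectedTerms)];
    }

    /** Adds delta to the weight for termId, growing the array only if needed. */
    void add(int termId, double delta) {
        if (termId >= weights.length) {
            weights = Arrays.copyOf(weights, Math.max(termId + 1, weights.length * 2));
        }
        weights[termId] += delta;
    }

    /** Returns the stored weight, or 0.0 for an id that was never seen. */
    double get(int termId) {
        return termId < weights.length ? weights[termId] : 0.0;
    }
}
```

With something like this, each label (history, science) would hold one
primitive double[] indexed by term id, and the term Strings would
exist only once in the shared dictionary rather than once per map
entry.  Whether that is enough to fit the model in memory would still
have to be measured.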

Also, what size of training set do people generally use for something
like Naive Bayes (or Complementary Naive Bayes)?  Or do I just suck it
up and use more memory?

Thoughts?

-Grant
