mahout-dev mailing list archives

From "Ted Dunning" <ted.dunn...@gmail.com>
Subject Re: CNB: Learning from Huge Datasets
Date Fri, 11 Jul 2008 08:28:33 GMT
It sounds to me like there is scope here for combiners, especially on the
final stage.  If they can be applied to earlier stages as well, you might be
able to collapse some of the data nicely.  If the number of unique words in
the corpus is a million, then a combiner might be able to reduce the number
of items in the intermediate map output of your last stage by up to two
orders of magnitude.
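
As a rough illustration of what a combiner buys you, here is a minimal
sketch in plain Java with no Hadoop dependency. The class CombinerSketch,
its methods, and the sample label and words are all hypothetical and not
taken from the CNB patch. It assumes the reducer simply sums the weights
for each "label,feature" key, which is what makes it safe to pre-sum on
the map side.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {

    // Hypothetical stand-in for one mapper's raw output: one
    // ("label,feature", weight) pair per word occurrence.
    static List<Map.Entry<String, Double>> mapOutput(String label, String[] words, double weight) {
        List<Map.Entry<String, Double>> out = new ArrayList<>();
        for (String w : words) {
            out.add(Map.entry(label + "," + w, weight));
        }
        return out;
    }

    // The combiner step: sum weights per key locally, before anything is
    // written to disk or shuffled. Valid only because the downstream
    // reduce is assumed to be a sum (associative and commutative).
    static Map<String, Double> combine(List<Map.Entry<String, Double>> pairs) {
        Map<String, Double> summed = new LinkedHashMap<>();
        for (Map.Entry<String, Double> p : pairs) {
            summed.merge(p.getKey(), p.getValue(), Double::sum);
        }
        return summed;
    }

    public static void main(String[] args) {
        String[] words = {"river", "capital", "river", "river", "capital"};
        List<Map.Entry<String, Double>> raw = mapOutput("France", words, 1.0);
        Map<String, Double> combined = combine(raw);
        System.out.println(raw.size() + " pairs before, " + combined.size() + " after: " + combined);
    }
}
```

The saving depends entirely on how often the same key repeats within one
map task's output. If each key is emitted only once per task, a combiner
will not help.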

Also, by my calculation, 20 x 200M = 4 x 10^9 (not 40 x 10^9).  Still large,
but not vast.


On Thu, Jul 10, 2008 at 11:15 PM, Robin Anil <robin.anil@gmail.com> wrote:

> Hi,
>   I have been experimenting with the Wikipedia data dump (17GB) and the CNB
> classifier. I used a list of countries of the world (around 229 of them) as
> the labels and then created a classification dataset from the data dump. I
> assigned a document to a label if any of the Wikipedia categories of the
> article has the country name in it. So a lot of data is pruned. The final
> dataset is around 2.2GB.
>
> Now here is the predicament. In the Complementary NB classifier you create a
> complement class for each label, where the features of the complement class
> are the features of all the other classes. This means that for all the
> 20 million odd words in Wikipedia there is a float weight for each label.
>
> In my code I generate this in the 4th Map stage. For each word I need to
> output N outputs (N is the number of labels) of the form
> <"label,feature", sum_of_weights of features>. This explodes the whole data
> in the system, so after the Map stage I am left with 200M x 20 = 40 billion
> key-value pairs. This really slows things down. It took me over 2 hours and
> a lot of disk space (over 26GB). Does anyone have any idea of doing this in
> an alternate way? One thing I am definitely doing is replacing all labels
> and features by integers. Please pour in optimisation ideas. I will submit
> this patch soon so that everyone can check it out.
>
>
> Robin
>



-- 
ted
