mahout-dev mailing list archives

From "Robin Anil" <robin.a...@gmail.com>
Subject Re: CNB: Learning from Huge Datasets
Date Fri, 11 Jul 2008 23:24:44 GMT
On Fri, Jul 11, 2008 at 1:58 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> It sounds to me like there is scope here for combiners, especially on the
> final stage.  If they can be applied to earlier stages as well, you might
> be able to collapse some of the data nicely.  If the number of unique words
> in the corpus is a million, then a combiner might be able to improve the
> number of items in the intermediate map output of your last stage by up to
> two orders of magnitude.
>
Here my key is a (label, feature) pair, so the number of uniques is ~40B.
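
For reference, the sum-combiner you suggest would be something along these
lines on my side (only a sketch against the Hadoop mapred API; the class name
is made up and this is not the code in the patch):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sum-combiner sketch: folds duplicate <"label,feature", weight> pairs
// emitted by a single map task before they hit the shuffle. It would be set
// with conf.setCombinerClass(WeightSumCombiner.class) and could double as
// the reducer, since the summation is associative.
public class WeightSumCombiner extends MapReduceBase
    implements Reducer<Text, FloatWritable, Text, FloatWritable> {

  public void reduce(Text labelFeature, Iterator<FloatWritable> weights,
      OutputCollector<Text, FloatWritable> output, Reporter reporter)
      throws IOException {
    float sum = 0.0f;
    while (weights.hasNext()) {
      sum += weights.next().get();
    }
    output.collect(labelFeature, new FloatWritable(sum));
  }
}

It only collapses duplicates that a single map task happens to emit, though,
which is why I am not sure how much it buys with 40B distinct keys.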


>
>
> Also, by my calculation, 20 x 200M = 4 x 10^9 (not 40 x 10^9).  Still
> large, but not vast.


:) That was ~200 (countries) x 200M, so the 40B figure was about right.
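
(Spelled out: ~200 x ~2 x 10^8 = ~4 x 10^10, which is where the 40 billion
figure below comes from.)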


>
> On Thu, Jul 10, 2008 at 11:15 PM, Robin Anil <robin.anil@gmail.com> wrote:
>
> > Hi,
> >   I had been experimenting with the Wikipedia data dump (17GB) with the
> > CNB classifier. I used a list of countries of the world (around 229 of
> > them) as the labels and then created a classification dataset from the
> > data dump. I assigned a document to a label if any of the article's
> > Wikipedia categories has the country name in it, so a lot of data is
> > pruned. The final dataset is around 2.2GB.
> >
> > Now here is the predicament. In the Complementary NB classifier you
> > create a complement class for each label, where the features of the
> > complement class are the features of all the other classes. This means
> > that for each of the 20 million odd words in Wikipedia there is a float
> > weight for every label.
> >
> > In my code I generate this in the 4th Map stage: for each word I need to
> > output N pairs (N is the number of labels) of the form <"label,feature",
> > sum_of_weights of the feature>. This explodes the data in the system, so
> > after the Map stage I am left with 200M x 20 = 40 billion key-value
> > pairs. This really slows things down. It took me over 2 hours and a lot
> > of disk space (over 26GB). Does anyone have an idea for doing this
> > another way? One thing I am definitely doing is replacing all labels and
> > features by integers. Please pour in optimisation ideas. I will submit
> > this patch soon so that everyone can check it out.
> >
> >
> > Robin
> >
>
>
>
> --
> ted
>
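
On the integer re-encoding mentioned in my mail above: the emit loop of that
4th Map stage would then look roughly like the sketch below (illustration
only, with made-up names; the feature ids would come from a dictionary built
in an earlier pass, and the label count would really be read from the
JobConf):

import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of the 4th-stage map after re-encoding: the input is <featureId,
// summed weight for that feature>; for each of the N labels emit one pair,
// packing labelId and featureId into a single long key instead of a
// "label,feature" string.
public class ComplementWeightMapper extends MapReduceBase
    implements Mapper<IntWritable, FloatWritable, LongWritable, FloatWritable> {

  private static final int NUM_LABELS = 229;   // placeholder; really from JobConf

  public void map(IntWritable featureId, FloatWritable weightSum,
      OutputCollector<LongWritable, FloatWritable> output, Reporter reporter)
      throws IOException {
    for (int labelId = 0; labelId < NUM_LABELS; labelId++) {
      long key = ((long) labelId << 32) | (featureId.get() & 0xffffffffL);
      output.collect(new LongWritable(key), weightSum);
    }
  }
}

An 8-byte packed key instead of a 20-30 byte "label,feature" string should
cut the intermediate bytes by a few times on its own, before any combiner
savings (the combiner above would then use the same LongWritable key type).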



-- 
Robin Anil
Senior Dual Degree Student
Department of Computer Science & Engineering
IIT Kharagpur

