mahout-dev mailing list archives

From: Robin Anil <robin.a...@gmail.com>
Subject: Re: DictionaryVectorizer meets Wikipedia.
Date: Thu, 14 Jan 2010 10:54:01 GMT
Thanks, Olivier! Could you file a JIRA issue for that? There are a couple of
places where the old API is used.

On Thu, Jan 14, 2010 at 4:09 PM, Olivier Grisel <olivier.grisel@ensta.org> wrote:

> 2010/1/13 Robin Anil <robin.anil@gmail.com>:
> > I have fired up a small EC2 instance (a single node for the moment) and
> > have been dabbling with the latest XML dump of the Wikipedia article base.
> >
> > The wiki XML is around 25GB; it was split into 128MB chunks and stored on
> > HDFS. The WikipediaToSequenceFile class runs an M/R job to convert the
> > articles (without redirects) into sequence-file format (it took 6 hours
> > over all of Wikipedia) and produced a gzip block-compressed sequence file
> > of 6GB (an output-compression sketch follows below the quoted message).
> > The bottleneck I found there is that the current XMLInputFormat checked
> > into examples reads byte by byte to search for the start and end tags
> > (see the buffered-read sketch below).
> >
> > I am currently running the word-count step of the DictionaryVectorizer
> > (over the gzip-compressed 6GB data), and I see that CPU cycles are spent
> > on one thing only: TokenStream.next(Token) in the StandardAnalyzer (see
> > the tokenization sketch below the quoted message). This job is also
> > estimated to take 6 hours on that small instance. After that, multiple
> > map/reduce jobs calculate the partial vectors, and each of those
> > iterations will take 6 hours more.
> >
> > If anyone has ideas on how to speed up both of these bottlenecks (other
> > than running more instances :P), please share some insight.
> >
> > The job status page is here:
> >
> > http://ec2-67-202-51-4.compute-1.amazonaws.com:50030/jobdetails.jsp?jobid=job_201001132019_0004&refresh=30
>
> Interesting. I plan to try similar processing using Amazon Elastic
> Cloud (to spare myself the burden of installing and tuning a Hadoop
> cluster), hence using s3fs instead of regular HDFS. Does anyone have
> metrics on such a setup (S3FS vs. HDFS w.r.t. number of nodes for a
> text tokenization task)?
>
> While we are at it, it seems that TokenStream.next(Token) is deprecated
> and that it should be updated as indicated in the following patch (see
> the tokenization sketch after the quoted message). I have no clue whether
> this has an impact on performance, but if it works I guess the Mahout
> code should be upgraded to be future-proof.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://code.oliviergrisel.name
>
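
For the gzip block-compressed sequence-file output described in the quote, here is a
minimal sketch of how block compression is typically enabled on a Hadoop 0.20 job using
the old mapred API. The job name and paths are made up for illustration; this is not the
actual WikipediaToSequenceFile code.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class BlockCompressedOutputSketch {
  public static void main(String[] args) {
    // Hypothetical job configuration; only the compression settings matter here.
    JobConf conf = new JobConf(BlockCompressedOutputSketch.class);
    conf.setJobName("wikipedia-to-sequencefile-sketch");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);

    // BLOCK compression groups many records per compressed block
    // (as opposed to RECORD, which compresses each value separately).
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(conf,
        SequenceFile.CompressionType.BLOCK);

    // Hypothetical input/output locations on HDFS.
    FileInputFormat.setInputPaths(conf, new Path("/wikipedia/chunks"));
    FileOutputFormat.setOutputPath(conf, new Path("/wikipedia/seqfiles"));
    // JobClient.runJob(conf) would submit the job; mapper/reducer setup is omitted.
  }
}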

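On the XMLInputFormat issue: if the record reader is making unbuffered single-byte
read() calls against the HDFS stream while hunting for the start/end tags, one cheap
experiment is to put a large BufferedInputStream (or an explicit byte[] buffer) between
the tag search and the stream, so the per-byte loop runs against memory. A rough sketch
of that idea follows; it is not the checked-in reader, and the tag, buffer size, and
class name are made up.

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: search for a tag byte by byte, but against a buffered stream.
public final class BufferedTagSearchSketch {

  // Scans forward until the byte pattern is seen; returns false at end of stream.
  static boolean readUntilMatch(InputStream in, byte[] match) throws IOException {
    int matched = 0;
    int b;
    while ((b = in.read()) != -1) {        // read() is served from the in-memory buffer
      if (b == match[matched]) {
        matched++;
        if (matched == match.length) {
          return true;                     // full tag found
        }
      } else {
        matched = (b == match[0]) ? 1 : 0; // simple restart of the partial match
      }
    }
    return false;                          // hit end of split/file first
  }

  public static void main(String[] args) throws IOException {
    byte[] xml = "<mediawiki><page><title>Foo</title></page></mediawiki>".getBytes("UTF-8");
    // ByteArrayInputStream stands in for FSDataInputStream here;
    // the BufferedInputStream wrapper is the point of the sketch.
    InputStream in = new BufferedInputStream(new ByteArrayInputStream(xml), 64 * 1024);
    System.out.println(readUntilMatch(in, "<page>".getBytes("UTF-8")));  // true
    System.out.println(readUntilMatch(in, "</page>".getBytes("UTF-8"))); // true
  }
}

On a real split the reader would also have to stop at the split boundary; that
bookkeeping is left out of this sketch.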

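On the tokenization hotspot and the deprecation Olivier mentions: the attached patch is
not shown above, but on Lucene 2.9/3.0 the usual change is from the deprecated
next(Token) loop to incrementToken() with a TermAttribute. Here is a minimal
before/after tokenization sketch using a plain StandardAnalyzer; the field name and
sample text are made up, and this is not the actual Mahout patch.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class TokenStreamMigrationSketch {

  // Deprecated style: the next(Token) call that shows up as the CPU hotspot.
  static void oldApi(StandardAnalyzer analyzer, String text) throws IOException {
    TokenStream stream = analyzer.tokenStream("body", new StringReader(text)); // "body" is made up
    Token reusable = new Token();
    Token token;
    while ((token = stream.next(reusable)) != null) {
      System.out.println(new String(token.termBuffer(), 0, token.termLength()));
    }
    stream.close();
  }

  // Attribute-based style introduced in Lucene 2.9 (the explicit cast is harmless if redundant).
  static void newApi(StandardAnalyzer analyzer, String text) throws IOException {
    TokenStream stream = analyzer.tokenStream("body", new StringReader(text));
    TermAttribute termAtt = (TermAttribute) stream.addAttribute(TermAttribute.class);
    while (stream.incrementToken()) {
      System.out.println(termAtt.term());
    }
    stream.close();
  }

  public static void main(String[] args) throws IOException {
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);
    String text = "DictionaryVectorizer meets Wikipedia";
    oldApi(analyzer, text);
    newApi(analyzer, text);
  }
}

Whether this changes tokenization throughput is an open question, as Olivier says; the
switch is mainly about staying off the deprecated API.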

-- 
------
Robin Anil
Blog: http://techdigger.wordpress.com
-------
Try out Swipeball for iPhone
Video: http://www.youtube.com/watch?v=3hvEbWHciwU
iTunes: http://itunes.com/apps/swipeball
