mahout-user mailing list archives

From: Grant Ingersoll <>
Subject: Re: clustering hardware requirements
Date: Tue, 22 Nov 2011 12:23:50 GMT
Here are some numbers from running locally:

Raw content size:
9.2 GB, 48K "items" -- note that most of the files are gzipped

It took 15 minutes to convert all of these to sequence files on a single-CPU i7 (4 cores with hyper-threading, 3.4 GHz) with 16 GB of RAM.
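
For anyone reproducing this, that step is SequenceFilesFromDirectory, which from the command line looks roughly like the following (paths here are placeholders, and flags can vary a bit by Mahout version):

  bin/mahout seqdirectory \
    -i /path/to/raw-content \
    -o /path/to/seqfiles \
    -c UTF-8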

After converting to sequence files:
40 GB, 659 items.

Encoded vectors (cardinality = 5,000): 11 GB for 1,300 items. This took 83 minutes to convert.

Splitting into test and train took 9 minutes for SGD. I had to kill the SGD job due to some issues I'm having on my machine with CPU temperature (SGD really cranks on the CPU, and something is messed up on my machine) that I need to track down.
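
The split itself is the SplitInput utility ("mahout split"); roughly, with placeholder paths and an illustrative 20% holdout (exact option names vary by version):

  bin/mahout split \
    --input /path/to/vectors \
    --trainingOutput /path/to/train \
    --testOutput /path/to/test \
    --randomSelectionPct 20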

For clustering, converting to sequence files took about the same time.

The seq2sparse job to convert to vectors took a while (the timing scrolled out of my terminal window); its invocation is sketched after the listings below. The resulting tfidf-vectors were 7.8 GB.

Dictionary files:

 82865442 2011-11-21 17:46 dictionary.file-0*
83269191 2011-11-21 17:46 dictionary.file-1*
10963133 2011-11-21 17:46 dictionary.file-2*

Freq files:

 37160153 2011-11-21 22:35 frequency.file-0*
 37160173 2011-11-21 22:35 frequency.file-1*
 37160173 2011-11-21 22:35 frequency.file-2*
 31407713 2011-11-21 22:35 frequency.file-3*

Total dir size for seq2sparse (du -s seq2sparse/):

  30923564	seq2sparse/
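
All of the above (tfidf-vectors, dictionary.file-*, frequency.file-*) comes out of the one standard seq2sparse job, roughly (placeholder paths; defaults assumed for the analyzer and n-grams):

  bin/mahout seq2sparse \
    -i /path/to/seqfiles \
    -o seq2sparse \
    -wt tfidf \
    -ow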

More as they become available.
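
The clustering run itself would go over these tfidf vectors; for reference, a k-means invocation on them looks something like this (k, the iteration cap, and the distance measure are purely illustrative):

  bin/mahout kmeans \
    -i seq2sparse/tfidf-vectors \
    -c /path/to/initial-clusters \
    -o /path/to/kmeans-output \
    -k 20 -x 10 \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -ow -cl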


On Nov 21, 2011, at 3:57 AM, Ioan Eugen Stan wrote:

>> I'll try in the next few days to track down the numbers from running the stuff in my recent IBM article:
>> Or, you can go run them yourself!
> I think posting some reference data for the jobs would be great. I will have something to compare against when I have something done. In the meantime I will try to get a quick and dirty implementation working, see how things move, and post my findings. This could take a while, as I depend on some modifications.
>> Otherwise, I don't know that we have any formula just yet. I suspect that once you reach a certain number of documents, your dictionary will stop growing, more or less. Then it is just a question of how many vectors you have and their sparseness. This could probably be estimated by looking at the average number of words in your email collection. Naturally, attachments may skew this if you are including them.
> I also suspect that things will level off asymptotically after a certain number of documents; it remains to be seen where that threshold is.
>> That has been my experience, too. Seq2Sparse is often the long part. I suspect one could get it done a lot faster in Lucene. SequenceFilesFromDirectory is also slow, but that is inherently sequential.
> I will be able to use a map-reduce job to create the vectors, or just create them as an indexing step, so I hope this step will not count toward the effective clustering time.
>> I haven't explored yet what it would mean to use encoded vectors in clustering, but perhaps I can call Ted to the front of the class and see if he has thoughts on whether that even makes sense, as that would give you a fixed-size Vector.
>> -Grant
> I don't know about encoded vectors yet; I hope to get some more info on them from Mahout in Action. If they do what I think they do, I will definitely try them, and probably complain on the list (Ted) if I can't interpret them right :).
> Thanks for the reply,
> --
> Ioan Eugen Stan

Grant Ingersoll
