mahout-user mailing list archives

From Dan Filimon <>
Subject Re: Odd clustering fail
Date Sun, 10 Mar 2013 16:09:50 GMT
Hi Simon,

That looks like an error from the seq2sparse job you're using to
vectorize the text.
I find it surprising to get an error during vectorization, but
others more experienced than me should probably comment. :)

The line numbers don't match what I have in my version of Mahout (a
forked version of trunk).

If I'm not mistaken there should be an "inner" exception thrown by a
mapper or reducer that tells us more. Can you please look through the
error log and see if there's anything else?
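For example (assuming a default single-node setup where the per-task logs land under `$HADOOP_HOME/logs/userlogs` -- adjust the path for your install), something like this usually surfaces the root cause:

```shell
# Search the per-task logs for the root-cause ("Caused by") exception.
# HADOOP_LOG_DIR is an assumption -- point it at your install's log directory.
HADOOP_LOG_DIR="${HADOOP_LOG_DIR:-$HADOOP_HOME/logs}"
grep -r -A 15 "Caused by" "$HADOOP_LOG_DIR/userlogs" | head -60
```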

As a side note, I'm clustering the 20 newsgroups data set (~20K
documents at ~20MB in total) and it's working fine.
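For reference, the pipeline I'm running is roughly the stock one below; the paths and the cluster count are placeholders, and the flag names are from the trunk-era CLI, so double-check them against your version:

```shell
# Convert the raw text files into a SequenceFile of <docId, text> pairs.
mahout seqdirectory -i 20news-all -o 20news-seq

# Vectorize with seq2sparse (the step your stack trace points at).
mahout seq2sparse -i 20news-seq -o 20news-vectors -wt tfidf

# Run k-means over the TF-IDF vectors; -k samples the initial centroids,
# -x caps the iterations, and -cl writes the final point assignments.
mahout kmeans -i 20news-vectors/tfidf-vectors -c 20news-initial \
  -o 20news-kmeans -k 20 -x 10 \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cl
```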


On Sat, Mar 9, 2013 at 5:44 PM,  <> wrote:
> Hi there,
> I am doing a fairly silly experiment to measure Hadoop performance. As part of this I
> have extracted emails from the Enron database and I am clustering them using a proprietary
> method for clustering short messages (i.e. tweets, emails, SMSs), benchmarking clusters
> in various configurations.
> As part of this I have been benchmarking a single processing machine (my new laptop).
> It is an HP EliteBook with 32 GB RAM, SSDs, nice processors, etc.; the point is that
> when explaining to people that we need Hadoop, I can show them that a laptop is really
> useless and likely to remain so. (I know this is obvious; come and work in a corporate and
> find out what else you have to do to earn a living! Then tell me that I am silly!)
> Anyhoo... I have seen reasonable behaviour from the algorithms I have built (i.e. for
> very small data MapReduce puts an overhead on the processing, but once you get reasonably
> large the parallelism wins), but when I try Mahout's k-means I get odd behaviour.
> When I get to ~175K individual files / ~175 MB of input data I get an exception:
> Exception in thread "main" java.lang.IllegalStateException: Job failed!
>         at org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(
>         at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(
>         at
>         at
>         at
> Is this because I am entirely inept and have missed something, or is it a limitation of
> Mahout sequence files, since they are not aimed at loads of short messages that really
> can't be clustered anyway, having no information in them?
> Simon
> ----
> Dr. Simon Thompson
