mahout-user mailing list archives

From nfantone <nfant...@gmail.com>
Subject Re: Clustering from DB
Date Fri, 24 Jul 2009 12:14:42 GMT
I've been using RandomSeedGenerator to generate initial clusters for
kMeans and while checking its code I stumbled upon this:

      while (reader.next(key, value)) {
        Cluster newCluster = new Cluster(value);  // the vector becomes the cluster's center
        newCluster.addPoint(value);               // ...and the same vector is also added as a point
        ...
      }

I can see it adds the vector to the newly created cluster, even though
the constructor already sets it as the center. Wasn't this corrected in
a past revision? I thought it was no longer necessary. I'll look into it
a bit more and see whether it has anything to do with the poor
performance I'm seeing on my dataset.
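
For reference, this is roughly what I expected the seeding loop to look
like without the extra call (just a sketch; 'chosenClusters' is a
hypothetical name, assuming the constructor alone records the vector as
the center):

      while (reader.next(key, value)) {
        // The constructor already stores the vector as the center, so
        // a separate addPoint(value) would be redundant when seeding.
        Cluster newCluster = new Cluster(value);
        chosenClusters.add(newCluster); // hypothetical list collecting the k seeds
        ...
      }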

On Thu, Jul 23, 2009 at 3:45 PM, nfantone <nfantone@gmail.com> wrote:
>>>> Perhaps a larger convergence value might help (-d, I believe).
>>>
>>> I'll try that.
>
> There was no significant change after modifying the convergence value.
> At least, none was observed during the first three iterations, which
> lasted roughly the same amount of time as before.
>
>>>> Is there any chance your data is publicly shareable?  Come to think of
>>>> it,
>>>> with the vector representations, as long as you don't publish the key
>>>> (which
>>>> terms map to which index), I would think almost all data is publicly
>>>> shareable.
>>>
>>> I'm sorry, I don't quite understand what you're asking. Publicly
>>> shareable? As in user-permissions to access/read/write the data?
>>
>> As in post a copy of the SequenceFile somewhere for download, assuming you
>> can.  Then others could presumably try it out.
>
> My bad. Of course it is:
>
> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>
> That's the ~62MB SequenceFile sample I've been using, in <Text,
> SparseVector> logical format.
>
>> That does seem like an awfully long time for 62 MB on a 6-node
>> cluster. How many iterations are running?
>
> I'm running the whole thing with a cap of 20 iterations. Every
> iteration (except the first one, which, oddly, lasted barely two
> minutes) took around three hours to complete:
>
> Hadoop job_200907221734_0001
> Finished in: 1mins, 42sec
>
> Hadoop job_200907221734_0004
> Finished in: 2hrs, 34mins, 3sec
>
> Hadoop job_200907221734_0005
> Finished in: 2hrs, 59mins, 34sec
>
>> How did you generate your initial clusters?
>
> I generate the initial clusters via the RandomSeedGenerator, setting
> a 'k' value of 200. This is what I did to start the process for the
> first time:
>
> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data input/user.data
> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data init/user.data
> ./bin/hadoop jar ~/mahout-core-0.2.jar
> org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c
> init -o output -r 32 -d 0.01 -k 200
>
>> Where are the iteration jobs spending most of their time (map vs. reduce)?
>
> I'm tempted to say map here, but the time spent in each phase is
> actually quite comparable. Reduce attempts take about an hour and a
> half to finish, on average, and so do map attempts. Here are some
> representative examples from the web UI:
>
> reduce
> attempt_200907221734_0002_r_000006_0
> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>
> map
> attempt_200907221734_0002_m_000000_0
> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>
> Perhaps there's something inconvenient in the way I create the
> SequenceFile? I could share the Java code as well, if required.
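>
> In the meantime, here's a rough sketch of the kind of writer involved
> (just an illustration; the class name, path, key and cardinality are
> placeholders, and I'm assuming the 0.2-era API where SparseVector is
> itself a Writable):
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.io.SequenceFile;
>   import org.apache.hadoop.io.Text;
>   import org.apache.mahout.matrix.SparseVector;
>
>   public class UserVectorWriter {
>     public static void main(String[] args) throws Exception {
>       Configuration conf = new Configuration();
>       FileSystem fs = FileSystem.get(conf);
>       // One <Text, SparseVector> pair is appended per user record.
>       SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
>           new Path("user.data"), Text.class, SparseVector.class);
>       try {
>         SparseVector vector = new SparseVector(10000); // placeholder cardinality
>         vector.set(42, 1.0);                           // example feature weight
>         writer.append(new Text("user-0001"), vector);
>       } finally {
>         writer.close();
>       }
>     }
>   }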
>
