mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: clusterdump + document ids?
Date Sat, 30 Oct 2010 02:07:12 GMT
Hi Matt,

K-means passes NamedVectors transparently through its processing, 
including the clustering output. Looking over the relevant code:

ClusterDumper.print() calls AbstractVector.asFormatString(v,bindings),
which evaluates
     if (v instanceof NamedVector) {
       buf.append(((NamedVector) v).getName()).append(" = ");
     }
...before printing the formatted vector. Given that the sequence files 
generated by seq2sparse and presented to kmeans contain NamedVector 
wrappers (which they appear to do), the output should look like <name> = 
[<vector>]
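
To make that check concrete, here is a minimal, self-contained sketch of
the same logic. The types Vector, DenseVec and NamedVec are hypothetical
stand-ins written for illustration, not Mahout's real classes, and the
bindings argument is left out. It only shows that the "<name> = " prefix
appears when the vector reaching the formatter is still a named wrapper,
so a missing prefix suggests the wrapper was lost somewhere upstream.

```java
import java.util.Arrays;

public class NamePrefixSketch {
    // Hypothetical stand-in for a vector type; NOT Mahout's Vector interface.
    public interface Vector {
        double[] values();
    }

    // Hypothetical plain vector that carries no name.
    public static final class DenseVec implements Vector {
        private final double[] values;

        public DenseVec(double... values) {
            this.values = values;
        }

        public double[] values() {
            return values;
        }
    }

    // Hypothetical named wrapper, modelled loosely on the idea of NamedVector.
    public static final class NamedVec implements Vector {
        private final Vector delegate;
        private final String name;

        public NamedVec(Vector delegate, String name) {
            this.delegate = delegate;
            this.name = name;
        }

        public double[] values() {
            return delegate.values();
        }

        public String getName() {
            return name;
        }
    }

    // Same shape as the check quoted above: prefix the name only for wrappers.
    public static String asFormatString(Vector v) {
        StringBuilder buf = new StringBuilder();
        if (v instanceof NamedVec) {
            buf.append(((NamedVec) v).getName()).append(" = ");
        }
        buf.append(Arrays.toString(v.values()));
        return buf.toString();
    }

    public static void main(String[] args) {
        Vector plain = new DenseVec(3.076, 4.4);
        Vector named = new NamedVec(plain, "/reut2-000.sgm-0.txt");
        System.out.println(asFormatString(plain)); // no name prefix
        System.out.println(asFormatString(named)); // name, then " = ", then the vector
    }
}
```

If the points in clusteredPoints print without a prefix, the first thing
to check would be whether the vectors being dumped are still instances of
the named wrapper at that stage.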

I don't know why you aren't seeing that. Can you please investigate?
Jeff

On 10/29/10 11:30 AM, Matt Spitz wrote:
> Hey, folks.
>
> If I run kmeans-clustering with the -cl option, I get
> <kmeans_output>/clusteredPoints.
>
> Running clusterdump with -p <kmeans_output>/clusteredPoints, I get output
> that looks like this for a given cluster (running on the reuters corpus):
>
> ...
> *        Top Terms:*
> *                said                                    => 1.421944826722092*
> *                3                                       => 0.9007495669006188*
> *                reuter                                  => 0.8924866335531932*
> ...
> *        Weight:  Point:*
> *        1.0: [srd:7.671, 20.00:8.269, bp:13.510, co:3.989, financial:5.109,
> under:3.297, activities:5.452, called:4.288, investment:3.831, owns:5.457,
> interest:3.438, market:2.977, companies:3.894, plan:4.041, 02:4.706,
> joint:4.573, both:4.004, manage:6.697, oversight:8.364, trading:3.842,
> also:2.737, venture:7.009, 15:2.866, money:4.178, borrowing:5.718,
> inc:2.282, committee:4.438, north:6.854, 26-feb-1987:5.496, 55:4.719,
> form:7.126, management:4.015, standard:10.665, subsidiary:4.040, plc:4.462,
> america:6.863, 3:1.124, which:2.445, petroleum:4.716, oil:7.377,
> operated:6.224, british:4.646, reuter:1.133, pct:2.233, said:1.330,
> unit:3.541]*
> *        1.0: [16:3.076, 35:4.400, 26-feb-1987:5.496]*
> *        1.0: [54.20:9.057, 16:3.076, 36:4.670, 26-feb-1987:5.496]*
> ...
>
> I get the bags of words that end up in a given cluster, but I don't see the
> original document ID from which that bag of words was generated (e.g.
> reut2-111.sgm-211.txt, etc)
>
> In the sequence file generated by 'seqdirectory', we get the following:
>
> *[mspitz@wowzers mahout-distribution-0.4]$ hadoop dfs -cat
> examples/bin/work/reuters-out-seqdir/chunk-0 | head*
> *SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text
> �$�]Ƥ��S@7A/reut2-020.sgm-237.txt�'20-OCT-1987
> 10:29:21.69*
> *
> *
> *HUTTON<EFH>  REITERATES STATEMENT OF SOLVENCY*
> ...
>
> In the sparse vectors (which are passed into kmeans), we get:
> *[mspitz@wowzers mahout-distribution-0.4]$ hadoop dfs -cat
> examples/bin/work/reuters-out-seqdir-sparse/tf-vectors/part-r-00000 | head*
> [binary vector data; the string /reut2-000.sgm-0.txt is readable among the bytes]
> ...
>
> It looks like the document IDs are being passed on through the data
> wrangling but then unused by kmeans and/or not reported in clusteredPoints.
> It seems to me like it'd be super useful to have them in the final
> output.  Are they easy to get at?
>
> Thanks,
> Matt

