mahout-dev mailing list archives

From "Jake Mannix (Commented) (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-845) Make cluster top terms code more reusable
Date Thu, 01 Dec 2011 03:27:41 GMT


Jake Mannix commented on MAHOUT-845:

OK, so I've thought about this a little, and the implementation that Frank posted here (which
is essentially what I had on my GitHub branch too) is probably a bad idea, for exactly the
reasons Lance mentioned here.

So instead, we modify VectorDumper and VectorHelper to add a couple of static methods and

in VectorHelper:
public static String vectorToJson(Vector vector, String[] dictionary, int maxEntries, boolean

where the "sort" option sorts by the values of the Vector entries, and maxEntries describes
the maximum number of vector entries to use.  If dictionary is supplied and not null, then
the vector indexes are replaced with their respective term entries in the dictionary.
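
To make the described behaviour concrete, here is a minimal, self-contained sketch of what such a
helper could do. It is not the code from the patch: a Map<Integer, Double> stands in for a Mahout
Vector so the example runs without Mahout on the classpath, the class name VectorJsonSketch is
hypothetical, and the name of the final boolean parameter ("sort") is assumed from the description
above. Sorting by descending absolute value is also an assumption, taken from the wording of the
sortVectors option.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Mahout.
class VectorJsonSketch {

    // "vector" maps index -> value for the non-zero entries (a stand-in for a Mahout Vector).
    static String vectorToJson(Map<Integer, Double> vector, String[] dictionary,
                               int maxEntries, boolean sort) {
        List<Map.Entry<Integer, Double>> entries = new ArrayList<>(vector.entrySet());
        if (sort) {
            // Assumption: order by absolute value, largest first.
            entries.sort(Comparator.comparingDouble(
                (Map.Entry<Integer, Double> e) -> Math.abs(e.getValue())).reversed());
        }
        // Emit at most maxEntries entries.
        int limit = Math.min(Math.max(maxEntries, 0), entries.size());
        StringBuilder json = new StringBuilder("{");
        for (int i = 0; i < limit; i++) {
            Map.Entry<Integer, Double> e = entries.get(i);
            int index = e.getKey();
            // Use the dictionary term when one is available, otherwise the raw index.
            boolean hasTerm = dictionary != null && index >= 0
                && index < dictionary.length && dictionary[index] != null;
            String key = hasTerm ? dictionary[index] : String.valueOf(index);
            if (i > 0) {
                json.append(',');
            }
            json.append('"').append(escape(key)).append("\":").append(e.getValue());
        }
        return json.append('}').toString();
    }

    // Escape backslashes and double quotes so terms stay valid JSON keys.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Map<Integer, Double> vector = new LinkedHashMap<>();
        vector.put(0, 0.5);
        vector.put(1, -2.5);
        vector.put(2, 3.0);
        String[] dictionary = {"bird", "dog", "cat"};
        // Prints {"cat":3.0,"dog":-2.5}
        System.out.println(vectorToJson(vector, dictionary, 2, true));
    }
}
```

With a null dictionary the same call would emit the numeric indexes as keys instead of terms.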

This way, VectorDumper is modified with the following options:
Option sortVectorsOpt = obuilder.withLongName("sortVectors").withRequired(false).withDescription(
            "Sort output key/value pairs of the vector entries in abs magnitude descending
Option numIndexesPerVectorOpt = obuilder.withLongName("vectorSize").withShortName("vs").withRequired(false)
         .withDescription("Truncate vectors to <vs> length when dumping (most useful when in"
                          + " conjunction with -sort").create();

So if you have clusters represented as vector centroids (or distributions over terms/features,
or any other collection of Vectors linked to a dictionary of String labels for the vector
indexes), you don't really need a "ClusterDumper", as

$MAHOUT_HOME/bin/mahout vectordump -s "path/to/vectors/part-*" --dictionary "path/to/dictionary.file-0" \
    -dt sequencefile -sort --vectorSize 100 -o local_vectors.json

writes each vector in "path/to/vectors/part-*" to local_vectors.json, one per line, in JSON format:
the keys are the terms with the highest weight for the vector, the values are the corresponding
vector values, and only the top 100 entries (by value) per vector are emitted.

I've found this modification to VectorDumper invaluable in inspecting LDA topic models, but
doing it without modifying the Vector interface is even better.
> Make cluster top terms code more reusable
> -----------------------------------------
>                 Key: MAHOUT-845
>                 URL:
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Frank Scholten
>            Priority: Minor
>             Fix For: 0.6
>         Attachments: MAHOUT-845.patch, MAHOUT-845.patch, MAHOUT-845.patch
> When working with Mahout text clustering I find that I keep writing code similar to the contents of
> public static String getTopFeatures(Cluster cluster, String[] dictionary, int numTerms)
> in ClusterDumper in order to determine cluster labels.
> I think it would be useful if (parts of) this code are added to the cluster or vector API so that you could do something like
> Cluster cluster = ... // get the cluster from seq file iterable
> String clusterLabel = cluster.getTopTerms(1, dictionary); // Do something with the label
> I think this would make it easier to export and post-process clustering results, like indexing or storing them elsewhere.
> Thoughts?
