mahout-dev mailing list archives

From "Alex Piggott (JIRA)" <>
Subject [jira] [Created] (MAHOUT-1349) Clusterdumper/loadTermDictionary crashes when highest index in (sparse) dictionary vector is larger than dictionary vector size?
Date Fri, 01 Nov 2013 13:00:19 GMT
Alex Piggott created MAHOUT-1349:

             Summary: Clusterdumper/loadTermDictionary crashes when highest index in (sparse)
dictionary vector is larger than dictionary vector size?
                 Key: MAHOUT-1349
             Project: Mahout
          Issue Type: Bug
          Components: Integration
    Affects Versions: 0.8, 0.7
         Environment: N/A
            Reporter: Alex Piggott
            Priority: Minor

I'm not sure if I'm doing something wrong here, or if ClusterDumper does
not support my (fairly simple) use case.

I had a repository of 500K documents, for which I generated the input
vectors and a dictionary using some custom code (not seq2sparse etc).

I hashed the features with max size 5M (because I didn't know how many
features were in the dataset and wanted to minimize collisions).
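To illustrate the setup described above, here is a minimal sketch (my own illustration, not the actual custom code) of why feature hashing into a 5M-slot space produces indices far larger than the number of distinct features. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class HashSketch {
    // Assumed hash space of 5M slots, as in the report above.
    static final int HASH_SPACE = 5_000_000;

    // Map a feature to a slot in [0, HASH_SPACE).
    static int featureIndex(String feature) {
        return Math.floorMod(feature.hashCode(), HASH_SPACE);
    }

    public static void main(String[] args) {
        String[] features = {"alpha", "beta", "gamma"};
        Set<Integer> indices = new HashSet<>();
        int maxIndex = 0;
        for (String f : features) {
            int idx = featureIndex(f);
            indices.add(idx);
            maxIndex = Math.max(maxIndex, idx);
        }
        // Only 3 unique features, but the largest index can be in the
        // millions, so an array sized by the unique count cannot hold it.
        System.out.println("unique=" + indices.size() + " maxIndex=" + maxIndex);
    }
}
```

With only three features the largest index is typically in the millions, which is exactly the mismatch that trips ClusterDumper.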

The k-means job ran fine and generated sensible-looking results, but when I
tried to run ClusterDumper I got the following error:

#bash> bin/mahout clusterdump -dt sequencefile -d
-i test-kmeans/clusters-19 -b 10 -n 10 -sp 10 -o ~/test-kmeans-out
Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /opt/mahout-distribution-0.7/mahout-examples-0.7-job.jar
13/05/17 08:26:41 INFO common.AbstractJob: Command line arguments:
{--endPhase=[2147483647], --input=[test-kmeans/clusters-19],
--numWords=[10], --output=[/usr/share/tomcat6/test-kmeans-out],
--outputFormat=[TEXT], --samplePoints=[10], --startPhase=[0],
--substring=[10], --tempDir=[temp]}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 698948

The error occurs when it tries to access the dictionary entry for the
feature with index 698948.

Looking at the dictionary loading code in VectorHelper (I checked 0.8 and
it hasn't changed), it looks like the dictionary array is sized for the
number of unique keywords, not for the highest index:

  OpenObjectIntHashMap<String> dict = new OpenObjectIntHashMap<String>();
  // ... dictionary entries are read from the sequence file into dict ...
  String[] dictionary = new String[dict.size()];

After I ran my custom dictionary/feature generation code I discovered I
only had 517,327 unique features, so it is not surprising that it dies on
an index >= 517,327 (though I don't understand why it didn't die when
loading the dictionary file).

Is there any reason why the VectorHelper code should not size the
dictionary array by the highest index read from the dictionary sequence
file (which can easily be tracked during the preceding loop)?
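To make the suggestion concrete, here is a minimal sketch of the proposed sizing change (my own illustration, not a patch against the actual VectorHelper code; a plain Map stands in for OpenObjectIntHashMap, and the class and method names are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DictionarySizing {
    // Build the term array sized by the highest index seen, not by
    // the number of entries, so sparse/hashed indices still fit.
    static String[] loadTermDictionary(Map<String, Integer> dict) {
        int maxIndex = -1;
        for (int idx : dict.values()) {
            maxIndex = Math.max(maxIndex, idx);
        }
        String[] dictionary = new String[maxIndex + 1];
        for (Map.Entry<String, Integer> e : dict.entrySet()) {
            dictionary[e.getValue()] = e.getKey();
        }
        return dictionary;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        dict.put("foo", 0);
        dict.put("bar", 698948); // index far beyond dict.size() == 2
        String[] d = loadTermDictionary(dict);
        // Sizing by dict.size() would give length 2 and throw
        // ArrayIndexOutOfBoundsException here; sizing by maxIndex + 1 works.
        System.out.println(d.length + " " + d[698948]);
    }
}
```

The trade-off is memory: the array becomes as large as the hash space rather than the vocabulary, with null entries in the unused slots.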

Or am I misunderstanding something?

It worked fine when I reduced the hash size to <= the total number of
features, but this is not desirable in general (for me), since I don't
know the number of features before I run the job (and if I guess too high,
ClusterDumper crashes).

Alex Piggott

This message was sent by Atlassian JIRA
