mahout-dev mailing list archives

From "Drew Farris (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAHOUT-694) IndexOutOfBoundException using build-reuters.sh
Date Sat, 21 May 2011 15:18:47 GMT

     [ https://issues.apache.org/jira/browse/MAHOUT-694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated MAHOUT-694:
-------------------------------

    Attachment: MAHOUT-694.patch

Updated build-reuters.sh to be more careful about its working directory, to detect whether
mahout will execute in local or distributed mode, and to reuse the reuters-reuters-sgm, reuters-out
and reuters-out-seqdir directories if they already exist.

Work is done in a directory called mahout-work. seqdirectory is always run locally, and in
distributed mode the results are copied up to HDFS.
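
A rough sketch of the flow described above, for readers who have not opened the patch. It is not
the patch itself: the helper names (mahout_mode, reuse_if_present) and the WORK_DIR variable are
made up for illustration, the mode check simply assumes that an unset HADOOP_HOME means local
mode, and the extraction and HDFS copy steps are only echoed rather than executed.

```shell
#!/bin/sh
# Illustrative only: names below are hypothetical and not taken from MAHOUT-694.patch.

# Working directory named in the comment above; overridable for experiments.
WORK_DIR="${WORK_DIR:-mahout-work}"

# Print "distributed" when HADOOP_HOME is set and non-empty, otherwise "local".
# Assumption: this mirrors how the script decides between the two modes.
mahout_mode() {
  if [ -n "${HADOOP_HOME:-}" ]; then
    echo distributed
  else
    echo local
  fi
}

# Succeed (and report it) when a step's output directory already exists,
# so the caller can skip regenerating it.
reuse_if_present() {
  if [ -d "$1" ]; then
    echo "reusing $1"
    return 0
  fi
  return 1
}

mkdir -p "$WORK_DIR"

# Only redo the extraction when its output directory is missing.
if ! reuse_if_present "$WORK_DIR/reuters-out"; then
  echo "would extract the Reuters data into $WORK_DIR/reuters-out"
fi

# seqdirectory output is produced locally; it is pushed to HDFS only in distributed mode.
if [ "$(mahout_mode)" = "distributed" ]; then
  echo "would copy $WORK_DIR/reuters-out-seqdir up to HDFS"
fi
```

Running it twice shows the reuse path once the directories exist; the real script would replace
the echo lines with the actual mahout and hadoop invocations.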

Also updated output of lda and kmeans so that their directory names do not overlap.

This could do a better job of cleaning up before/after it executes, but it gets the job done.

Please take a look and give this a try. I've run kmeans in local and distributed mode (forcing
local by unsetting HADOOP_HOME). I am running lda in distributed mode now, but my cluster
is small and it takes a long time to complete.
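
For anyone trying the same local/distributed comparison, one way to unset HADOOP_HOME for a single
run without disturbing the calling shell is to do it inside a subshell. The snippet below is a
minimal sketch: report_mode is a hypothetical stand-in for the real script, and the default path
is simply the one that appears in the reporter's log.

```shell
#!/bin/sh
# report_mode is a hypothetical stand-in for build-reuters.sh; it only reports
# which mode would be chosen, assuming an unset HADOOP_HOME means local mode.
report_mode() {
  if [ -n "${HADOOP_HOME:-}" ]; then
    echo "distributed (HADOOP_HOME=$HADOOP_HOME)"
  else
    echo "local"
  fi
}

# Default taken from the log quoted in this issue; adjust for your own install.
HADOOP_HOME="${HADOOP_HOME:-/usr/lib/hadoop-0.20}"
export HADOOP_HOME

# The unset happens in the command-substitution subshell, so it does not
# leak back into the calling shell.
forced=$(unset HADOOP_HOME; report_mode)

echo "inside subshell: $forced"
echo "calling shell:   $(report_mode)"
```

With the real script, the equivalent would be something like
`( unset HADOOP_HOME; ./build-reuters.sh )`.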
 

> IndexOutOfBoundException using build-reuters.sh
> -----------------------------------------------
>
>                 Key: MAHOUT-694
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-694
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.5
>         Environment: Linux Debian Lenny
> Hadoop 0.20 (Cloudera)
>            Reporter: Allan BLANCHARD
>            Assignee: Drew Farris
>             Fix For: 0.5
>
>         Attachments: MAHOUT-694.patch, MAHOUT-694.patch, MAHOUT-694.patch
>
>
> I run Hadoop-0.20 in distributed mode on 10 VMs (NameNode + JobTracker + 8 DataNodes/TaskTrackers) with Mahout trunk.
> I tried to test the kmeans example with build-reuters.sh, but I got an IndexOutOfBoundsException when it starts kmeans.
> I don't know which operation fails: ExtractReuters, seqdirectory, seq2sparse or kmeans. Maybe I forgot a configuration? I searched the web and didn't find a solution.
> ------------------------ UPDATE == 05/16 -------------------------
> NameNode:/usr/local/mahout/trunk/examples/bin# ./build-reuters.sh 
> Please select a number to choose the corresponding clustering algorithm
> 1. kmeans clustering
> 2. lda clustering
> Enter your choice : 1
> ok. You chose 1 and we'll use kmeans Clustering
> ./build-reuters.sh: line 39: cd: examples/bin/: No such file or directory
> Downloading Reuters-21578
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 7959k  100 7959k    0     0   121k      0  0:01:05  0:01:05 --:--:--   99k
> Extracting...
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf 
> 11/05/16 09:31:20 WARN driver.MahoutDriver: No org.apache.lucene.benchmark.utils.ExtractReuters.props found on classpath, will use command-line arguments only
> Deleting all files in ./examples/bin/work/reuters-out/-tmp
> 11/05/16 09:31:24 INFO driver.MahoutDriver: Program took 3471 ms
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf 
> 11/05/16 09:31:26 INFO common.AbstractJob: Command line arguments: {--charset=UTF-8, --chunkSize=5, --endPhase=2147483647, --fileFilterClass=org.apache.mahout.text.PrefixAdditionFilter, --input=./examples/bin/work/reuters-out/, --keyPrefix=, --output=./examples/bin/work/reuters-out-seqdir, --startPhase=0, --tempDir=temp}
> 11/05/16 09:31:26 INFO driver.MahoutDriver: Program took 398 ms
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf 
> 11/05/16 09:31:28 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
> 11/05/16 09:31:28 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
> 11/05/16 09:31:28 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
> 11/05/16 09:31:29 INFO input.FileInputFormat: Total input paths to process : 1
> 11/05/16 09:31:29 INFO mapred.JobClient: Running job: job_201105160929_0001
> 11/05/16 09:31:30 INFO mapred.JobClient:  map 0% reduce 0%
> 11/05/16 09:31:40 INFO mapred.JobClient:  map 100% reduce 0%
> 11/05/16 09:31:42 INFO mapred.JobClient: Job complete: job_201105160929_0001
> [...]
> 11/05/16 09:33:58 INFO common.HadoopUtil: Deleting examples/bin/work/reuters-out-seqdir-sparse/partial-vectors-0
> 11/05/16 09:33:58 INFO driver.MahoutDriver: Program took 149846 ms
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf 
> 11/05/16 09:34:00 INFO common.AbstractJob: Command line arguments: {--clusters=./examples/bin/work/clusters, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/, --maxIter=10, --method=mapreduce, --numClusters=20, --output=./examples/bin/work/reuters-kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
> 11/05/16 09:34:00 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 11/05/16 09:34:00 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 11/05/16 09:34:00 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
> 	at java.util.ArrayList.RangeCheck(ArrayList.java:547)
> 	at java.util.ArrayList.get(ArrayList.java:322)
> 	at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
> 	at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> 	at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> 	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> ------------------------------------------------------------------
> EDIT : I just tried this on Mahout 0.4 and it seems to work (I use the same VM configuration).

> PS : Sorry for my very bad english :(

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
