mahout-user mailing list archives

From Reinis Vicups <mah...@orbit-x.de>
Subject SparseVectorsFromSequenceFiles: ArrayIndexOutOfBoundsException in DictionaryVectorizer
Date Sat, 12 Jul 2014 15:38:36 GMT
Hi,

the log below shows an issue that started occurring just "recently" (I
haven't run tests with this somewhat larger dataset (320K documents) for
some time, and when I did today, I got this "all of a sudden").
I am using Mahout 0.9-cdh5.2.0-SNAPSHOT (yes, it's Cloudera, but as far
as I can tell, that's vanilla Mahout in the community edition I use).

As far as I can tell, it's happening in the middle of seq2sparse, and all
three - the input, the output and the MR job - are generated by Mahout;
none of my own code is involved.

It would be great if anyone could point me to the source of this error.

thanks and kind regards
reinis.

SETTINGS OF SEQ2SPARSE
----------------------------------------------

{"--analyzerName", "com.myproj.quantify.ticket.text.TicketTextAnalyzer",
               "--chunkSize", "200",
               "--output", finalDir,
               "--input", ticketTextsOutput.toString,
               "--minSupport", "2",
               "--minDF", "2",
               "--maxDFPercent", "85",
               "--weight", "tfidf",
               "--minLLR", "50",
               "--maxNGramSize", "3",
               "--norm", "2",
               "--namedVector", "--sequentialAccessVector", "--overwrite"}
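For reference, the option array above corresponds to the following invocation of the seq2sparse command-line driver. The input/output paths are placeholders standing in for the ticketTextsOutput and finalDir variables from the snippet; the custom analyzer class must be on the classpath:

```shell
# CLI equivalent of the option array above (paths are placeholders).
mahout seq2sparse \
  --input /path/to/ticket-texts \
  --output /path/to/final \
  --analyzerName com.myproj.quantify.ticket.text.TicketTextAnalyzer \
  --chunkSize 200 \
  --minSupport 2 \
  --minDF 2 \
  --maxDFPercent 85 \
  --weight tfidf \
  --minLLR 50 \
  --maxNGramSize 3 \
  --norm 2 \
  --namedVector --sequentialAccessVector --overwrite
```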


LOG
-----------------------------------------------------

14/07/12 16:46:16 INFO vectorizer.SparseVectorsFromSequenceFiles: Creating Term Frequency Vectors
14/07/12 16:46:16 INFO vectorizer.DictionaryVectorizer: Creating dictionary from /quantify/ticket/text/final/tokenized-documents and saving at /quantify/ticket/text/final/wordcount
14/07/12 16:46:16 INFO client.RMProxy: Connecting to ResourceManager at hadoop1
14/07/12 16:46:17 INFO input.FileInputFormat: Total input paths to process : 1
14/07/12 16:46:17 INFO mapreduce.JobSubmitter: number of splits:2
14/07/12 16:46:17 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1404888747437_0074
14/07/12 16:46:17 INFO impl.YarnClientImpl: Submitted application application_1404888747437_0074
14/07/12 16:46:17 INFO mapreduce.Job: The url to track the job: http://hadoop1:8088/proxy/application_1404888747437_0074/
14/07/12 16:46:17 INFO mapreduce.Job: Running job: job_1404888747437_0074
14/07/12 16:46:30 INFO mapreduce.Job: Job job_1404888747437_0074 running in uber mode : false
14/07/12 16:46:30 INFO mapreduce.Job:  map 0% reduce 0%
14/07/12 16:46:41 INFO mapreduce.Job:  map 6% reduce 0%
14/07/12 16:46:44 INFO mapreduce.Job:  map 10% reduce 0%
14/07/12 16:46:47 INFO mapreduce.Job:  map 11% reduce 0%
14/07/12 16:46:48 INFO mapreduce.Job:  map 14% reduce 0%
14/07/12 16:46:50 INFO mapreduce.Job:  map 15% reduce 0%
14/07/12 16:46:51 INFO mapreduce.Job:  map 19% reduce 0%
14/07/12 16:46:53 INFO mapreduce.Job:  map 20% reduce 0%
14/07/12 16:46:54 INFO mapreduce.Job:  map 23% reduce 0%
14/07/12 16:46:57 INFO mapreduce.Job:  map 26% reduce 0%
14/07/12 16:47:00 INFO mapreduce.Job:  map 29% reduce 0%
14/07/12 16:47:01 INFO mapreduce.Job: Task Id : attempt_1404888747437_0074_m_000000_0, Status : FAILED
Error: java.lang.IllegalStateException: java.io.IOException: Spill failed
         at org.apache.mahout.vectorizer.collocations.llr.CollocMapper$1.apply(CollocMapper.java:140)
         at org.apache.mahout.vectorizer.collocations.llr.CollocMapper$1.apply(CollocMapper.java:115)
         at org.apache.mahout.math.map.OpenObjectIntHashMap.forEachPair(OpenObjectIntHashMap.java:185)
         at org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:115)
         at org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
         at java.security.AccessController.doPrivileged(Native Method)
         at javax.security.auth.Subject.doAs(Subject.java:415)
         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.IOException: Spill failed
         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1535)
         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$300(MapTask.java:853)
         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1349)
         at java.io.DataOutputStream.write(DataOutputStream.java:107)
         at org.apache.mahout.vectorizer.collocations.llr.GramKey.write(GramKey.java:91)
         at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:98)
         at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:82)
         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1126)
         at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:692)
         at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
         at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
         at org.apache.mahout.vectorizer.collocations.llr.CollocMapper$1.apply(CollocMapper.java:131)
         ... 12 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1836016430
         at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:144)
         at java.io.DataInputStream.readByte(DataInputStream.java:265)
         at org.apache.mahout.math.Varint.readUnsignedVarInt(Varint.java:159)
         at org.apache.mahout.vectorizer.collocations.llr.GramKey.readFields(GramKey.java:78)
         at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:132)
         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1245)
         at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:105)
         at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:63)
         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1575)
         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:853)
         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1505)
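One detail worth noting in the trace, purely as a diagnostic reading and not a confirmed cause: the out-of-range index 1836016430 is far too large for a real buffer offset, and interpreted as four big-endian bytes it spells the ASCII text "moc.". That would be consistent with GramKey.readFields, called during the sort's key comparison, reading raw string bytes where it expected a varint length prefix. A minimal sketch to check such a value (the class name DecodeIndex is mine):

```java
import java.nio.charset.StandardCharsets;

// Diagnostic sketch: render a suspicious "array index" from an
// ArrayIndexOutOfBoundsException as four big-endian ASCII bytes, to see
// whether it is actually text data misread as a varint length.
public class DecodeIndex {
    static String decode(int value) {
        byte[] b = {
            (byte) (value >>> 24), (byte) (value >>> 16),
            (byte) (value >>> 8),  (byte) value
        };
        return new String(b, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // 1836016430 is the index reported in the stack trace above.
        System.out.println(decode(1836016430)); // prints "moc."
    }
}
```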
