mahout-user mailing list archives

From Darren Govoni <dar...@ontrenet.com>
Subject Re: Running CollocDriver, exception
Date Mon, 24 Jan 2011 04:09:19 GMT
Drew,
   Thanks for the tip. It works great now!

Darren

P.S. The sort command you suggested doesn't quite sort by LLR score:
it's only a lexical sort, so it ranks something like 8.000 above 70.000.
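
A numeric sort seems to fix that. Assuming the LLR score is still the
sixth whitespace-separated field of the seqdumper output (as in your
trigram example), something like this should order the n-grams by
descending score (untested here, so treat it as a sketch):

sort -rn -k 6,6 out > out.sorted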


On 01/23/2011 11:59 AM, Drew Farris wrote:
> Ahh, ok. Output from seqdirectory is a SequenceFile<Text,Text>, where
> the value is the un-tokenized text of each document. By default the
> CollocDriver expects tokenized text as input, but if you add the '-p'
> option to the CollocDriver command-line it will tokenize the text
> before generating the collocations, so you can use the output of
> seqdirectory as is.
>
> for example:
>
> ./bin/mahout seqdirectory \
>   -i ./examples/bin/work/reuters-out/ \
>   -o ./examples/bin/work/reuters-out-seqdir \
>   -c UTF-8 -chunk 5
>
> ./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
>    -i ./examples/bin/work/reuters-out-seqdir \
>    -o ./examples/bin/work/reuters-colloc-2 \
>    -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3 -p
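>
> If the output layout matches the earlier run, the resulting n-grams
> should land under ./examples/bin/work/reuters-colloc-2/ngrams and can
> be inspected the same way, e.g.:
>
> ./bin/mahout seqdumper -s \
>    ./examples/bin/work/reuters-colloc-2/ngrams/part-r-00000 | less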
>
> Drew
>
> On Sun, Jan 23, 2011 at 10:44 AM, Darren Govoni <darren@ontrenet.com> wrote:
>> Hi Drew,
>>   Thanks for the tips - much appreciated. See inline.
>>
>> On 01/23/2011 09:22 AM, Drew Farris wrote:
>>> Hi Darren,
>>>
>>>   From the error message you receive, it is not exactly clear what is
>>> happening here. I suppose it could be due to the format of the input
>>> sequence file, but I'm not certain.
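>>>
>>> (One quick way to check is to dump a few records from the input with
>>> seqdumper and look at the key/value classes it reports, e.g.:
>>>
>>> ./bin/mahout seqdumper -s out/chunk-0 | head
>>>
>>> If the value class is org.apache.hadoop.io.Text rather than
>>> org.apache.mahout.common.StringTuple, the input hasn't been
>>> tokenized yet.)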
>>>
>>> A couple questions that will help me answer your question:
>>>
>>> 1) What version of Mahout are you using?
>> 0.4
>>> 2) How are you generating the sequence file you are using as input to
>>> the CollocDriver?
>> bin/mahout seqdirectory --charset ascii --input textfiles/ --output out
>>
>> Then I run:
>>
>> bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
>>   -i out/chunk-0 -o phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer
>>
>> I am not running hadoop. The error is repeatable. Here is the full output.
>> -----------
>> no HADOOP_HOME set, running locally
>> Jan 23, 2011 10:42:50 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Program took 317 ms
>> [darren@cobalt mahout-distribution-0.4]$ bin/mahout
>> org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i out/chunk-0 -o
>> phrases -ng 2 -a org.apache.mahout.vectorizer.DefaultAnalyzer
>> no HADOOP_HOME set, running locally
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter warn
>> WARNING: No org.apache.mahout.vectorizer.collocations.llr.CollocDriver.props
>> found on classpath, will use command-line arguments only
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Command line arguments:
>> {--analyzerName=org.apache.mahout.vectorizer.DefaultAnalyzer,
>> --endPhase=2147483647, --input=out/chunk-0, --maxNGramSize=2, --maxRed=2,
>> --minLLR=1.0, --minSupport=2, --output=phrases, --startPhase=0,
>> --tempDir=temp}
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Maximum n-gram size is: 2
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Minimum Support value: 2
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Minimum LLR value: 1.0
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Number of pass1 reduce tasks: 2
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Input will NOT be preprocessed
>> Jan 23, 2011 10:42:56 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
>> INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
>> Jan 23, 2011 10:42:56 AM
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
>> INFO: Total input paths to process : 1
>> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.JobClient
>> monitorAndPrintJob
>> INFO: Running job: job_local_0001
>> Jan 23, 2011 10:42:56 AM
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
>> INFO: Total input paths to process : 1
>> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer
>> <init>
>> INFO: io.sort.mb = 100
>> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer
>> <init>
>> INFO: data buffer = 79691776/99614720
>> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.MapTask$MapOutputBuffer
>> <init>
>> INFO: record buffer = 262144/327680
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Max Ngram size is 2
>> Jan 23, 2011 10:42:56 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Emit Unitgrams is false
>> Jan 23, 2011 10:42:56 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
>> WARNING: job_local_0001
>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
>> org.apache.mahout.common.StringTuple
>>     at
>> org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>     at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>> Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient
>> monitorAndPrintJob
>> INFO:  map 0% reduce 0%
>> Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.JobClient
>> monitorAndPrintJob
>> INFO: Job complete: job_local_0001
>> Jan 23, 2011 10:42:57 AM org.apache.hadoop.mapred.Counters log
>> INFO: Counters: 0
>> Jan 23, 2011 10:42:57 AM org.apache.hadoop.metrics.jvm.JvmMetrics init
>> INFO: Cannot initialize JVM Metrics with processName=JobTracker, sessionId=
>> - already initialized
>> Jan 23, 2011 10:42:57 AM
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
>> INFO: Total input paths to process : 0
>> Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.JobClient
>> monitorAndPrintJob
>> INFO: Running job: job_local_0002
>> Jan 23, 2011 10:42:58 AM
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
>> INFO: Total input paths to process : 0
>> Jan 23, 2011 10:42:58 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
>> WARNING: job_local_0002
>> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>     at java.util.ArrayList.get(ArrayList.java:322)
>>     at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
>> Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient
>> monitorAndPrintJob
>> INFO:  map 0% reduce 0%
>> Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.JobClient
>> monitorAndPrintJob
>> INFO: Job complete: job_local_0002
>> Jan 23, 2011 10:42:59 AM org.apache.hadoop.mapred.Counters log
>> INFO: Counters: 0
>> Jan 23, 2011 10:42:59 AM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Program took 3064 ms
>>
>>> Using the latest code from trunk, I was able to run the following
>>> sequence of commands on the data available after running
>>> ./examples/bin/build-reuters.sh
>>>
>>> (All run from the mahout toplevel directory)
>>>
>>> ./bin/mahout seqdirectory \
>>>    -i ./examples/bin/work/reuters-out/ \
>>>    -o ./examples/bin/work/reuters-out-seqdir \
>>>    -c UTF-8 -chunk 5
>>>
>>> ./bin/mahout seq2sparse \
>>>    -i ./examples/bin/work/reuters-out-seqdir/ \
>>>    -o ./examples/bin/work/reuters-out-seqdir-sparse
>>>
>>> ./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
>>>    -i ./examples/bin/work/reuters-out-seqdir-sparse/tokenized-documents \
>>>    -o ./examples/bin/work/reuters-colloc \
>>>    -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3
>>>
>>> ./bin/mahout seqdumper -s \
>>>    ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 | less
>>>
>>> This produces output like:
>>>
>>> Input Path: examples/bin/work/reuters-colloc/ngrams/part-r-00000
>>> Key class: class org.apache.hadoop.io.Text Value Class: class
>>> org.apache.hadoop.io.DoubleWritable
>>> Key: 0 0 25: Value: 18.436118042416638
>>> Key: 0 0 zen: Value: 39.36827993847055
>>>
>>> Where the key is the trigram and the value is the LLR score.
>>>
>>> If there are multiple parts in
>>> examples/bin/work/reuters-colloc/ngrams, you'll need to concatenate
>>> them, e.g.:
>>>
>>> ./bin/mahout seqdumper -s \
>>>    ./examples/bin/work/reuters-colloc/ngrams/part-r-00000 >> out
>>> ./bin/mahout seqdumper -s \
>>>    ./examples/bin/work/reuters-colloc/ngrams/part-r-00001 >> out
>>>
>>> Running the results through 'sort -rm -k 6,6' will give you output
>>> sorted by LLR score descending.
>>>
>>> HTH,
>>>
>>> Drew
>>>
>>> On Fri, Jan 21, 2011 at 5:36 PM, Darren Govoni <darren@ontrenet.com> wrote:
>>>> Hi,
>>>>   I'm new to Mahout and tried to research this a bit before encountering
>>>> this problem.
>>>>
>>>> After I generate a sequence file for a directory of text files, I run this:
>>>>
>>>> bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
>>>>   -i out/chunk-0 -o colloc \
>>>>   -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3
>>>>
>>>> It produces a couple exceptions:
>>>> ...
>>>> WARNING: job_local_0001
>>>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
>>>> org.apache.mahout.common.StringTuple
>>>>     at
>>>>
>>>> org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
>>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>>     at
>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>> Jan 21, 2011 5:30:07 PM org.apache.hadoop.mapred.JobClient
>>>> monitorAndPrintJob
>>>> ...
>>>> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>>>>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>>>     at java.util.ArrayList.get(ArrayList.java:322)
>>>>     at
>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
>>>>
>>>> How can I make this work?
>>>>
>>>> Thanks for any tips,
>>>> Darren
>>>>
>>

