mahout-user mailing list archives

From "Divya" <di...@k2associates.com.sg>
Subject RE: rowSimilarity CLI
Date Fri, 26 Nov 2010 08:47:50 GMT

Hi,

I have created tf-idf vectors using Mahout's seq2sparse.
Its output is a sequence file with org.apache.hadoop.io.Text keys and
org.apache.mahout.math.VectorWritable values, but RowSimilarityJob expects its
input as SequenceFile<IntWritable,VectorWritable>, which is why it throws the
exception below.
How can I get rid of this error? (One idea I am considering is sketched after the log below.)


$ bin/mahout rowsimilarity -i D:/MahoutResult/Seq2Sparse_NamedVector/tfidf-vectors \
    -o D:/MahoutResult/RowSimilarity_Output -s SIMILARITY_PEARSON_CORRELATION -m 50 -r 5
Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
10/11/26 16:37:35 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --input=D:/MahoutResult/Seq2Sparse_NamedVector/tfidf-vectors, --maxSimilaritiesPerRow=50, --numberOfColumns=5, --output=D:/MahoutResult/RowSimilarity_Output, --similarityClassname=SIMILARITY_PEARSON_CORRELATION, --startPhase=0, --tempDir=temp}
10/11/26 16:37:35 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
10/11/26 16:37:37 INFO input.FileInputFormat: Total input paths to process : 1
10/11/26 16:37:42 INFO mapred.JobClient: Running job: job_local_0001
10/11/26 16:37:42 INFO input.FileInputFormat: Total input paths to process : 1
10/11/26 16:37:43 INFO mapred.MapTask: io.sort.mb = 100
10/11/26 16:37:43 INFO mapred.MapTask: data buffer = 79691776/99614720
10/11/26 16:37:43 INFO mapred.MapTask: record buffer = 262144/327680
10/11/26 16:37:43 INFO mapred.JobClient:  map 0% reduce 0%
10/11/26 16:37:43 WARN mapred.LocalJobRunner: job_local_0001
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
        at org.apache.mahout.math.hadoop.similarity.RowSimilarityJob$RowWeightMapper.map(RowSimilarityJob.java:195)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
10/11/26 16:37:44 INFO mapred.JobClient: Job complete: job_local_0001
10/11/26 16:37:44 INFO mapred.JobClient: Counters: 0
10/11/26 16:37:44 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
10/11/26 16:37:46 INFO input.FileInputFormat: Total input paths to process : 0
10/11/26 16:37:47 INFO mapred.JobClient: Running job: job_local_0002
10/11/26 16:37:47 INFO input.FileInputFormat: Total input paths to process : 0
10/11/26 16:37:47 WARN mapred.LocalJobRunner: job_local_0002
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:124)
10/11/26 16:37:48 INFO mapred.JobClient:  map 0% reduce 0%
10/11/26 16:37:48 INFO mapred.JobClient: Job complete: job_local_0002
10/11/26 16:37:48 INFO mapred.JobClient: Counters: 0
10/11/26 16:37:48 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
10/11/26 16:37:49 INFO mapred.LocalJobRunner:
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: temp/pairwiseSimilarity
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
        at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
        at org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.run(RowSimilarityJob.java:174)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.main(RowSimilarityJob.java:86)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
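
One idea I am considering (an untested sketch, not something I have run; it assumes
sequential integer row ids are acceptable as keys and that the original Text document
ids are not needed downstream) is to re-key the tfidf-vectors file myself before
running rowsimilarity:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

// Hypothetical helper: rewrites SequenceFile<Text,VectorWritable> (seq2sparse output)
// as SequenceFile<IntWritable,VectorWritable>, assigning sequential integer row ids.
public class ReKeyVectors {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path(args[0]);   // e.g. a part file under tfidf-vectors
    Path out = new Path(args[1]);  // destination for the re-keyed vectors

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, out,
        IntWritable.class, VectorWritable.class);
    try {
      Text docId = new Text();
      VectorWritable vector = new VectorWritable();
      IntWritable rowId = new IntWritable();
      int next = 0;
      while (reader.next(docId, vector)) {
        rowId.set(next++);            // the docId -> rowId mapping would need to be
        writer.append(rowId, vector); // saved separately if the names are needed later
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}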


Thanks,
Regards,
Divya 

-----Original Message-----
From: Sebastian Schelter [mailto:ssc@apache.org] 
Sent: Friday, November 26, 2010 3:39 PM
To: user@mahout.apache.org
Subject: Re: rowSimilarity CLI

Hi,

You would need to convert your documents to tf-idf vectors, remove all
stopwords, and run rowSimilarity on that with cosine as the similarity
measure. That should give you reasonable results.
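
In code, that last step would look roughly like the sketch below (untested; the
paths are placeholders and the exact cosine similarity constant may differ
between Mahout versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.hadoop.similarity.RowSimilarityJob;

// Rough sketch: run RowSimilarityJob on IntWritable-keyed tf-idf vectors with a
// cosine measure. The option names match the job's own argument listing; the
// similarity constant shown is an assumption and may vary by Mahout version.
public class RunRowSimilarity {
  public static void main(String[] args) throws Exception {
    ToolRunner.run(new Configuration(), new RowSimilarityJob(), new String[] {
        "--input", "tfidf-int-keyed-vectors",    // placeholder input path
        "--output", "row-similarity-output",     // placeholder output path
        "--numberOfColumns", "5",                // columns in the input matrix
        "--maxSimilaritiesPerRow", "50",
        "--similarityClassname", "SIMILARITY_UNCENTERED_COSINE"
    });
  }
}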

--sebastian

On 26.11.2010 06:27, Divya wrote:
> Hi,
> 
>  
> 
> I need to know what the usage of the rowSimilarity CLI is.
> 
> I know we use it to compute the pairwise row similarity.
> 
> I want to know more about it.
> 
> Where can we use it?
> 
> Can we use it to compute the similarity between two documents' contents?
> 
>  
> 
> Regards,
> 
> Divya 
> 
> 


