mahout-user mailing list archives

From nishant rathore <nishant.rathor...@gmail.com>
Subject Re: Mahout lucene UTFDataFormatException: encoded string too long:
Date Fri, 26 Apr 2013 03:29:53 GMT
Hi,

After running the command

./bin/mahout clusterdump -i ../output/fetise/fetise-fkmeans-clusters/ -o
../output/fetise/clusterdump -p ../output/fetise/fetise-fkmeans-centroids/
-d ../output/fetise/luceneDictionary -dm
org.apache.mahout.common.distance.TanimotoDistanceMeasure

my directory structure looks like this:

pacman@pacman:~/DownloadedCodes/mahout/output/fetise$ ls -lR
.:
total 3148
drwxrwxr-x 2 pacman pacman    4096 Apr 25 20:09 centroids
-rw-rw-r-- 1 pacman pacman       0 Apr 26 08:51 clusterdump
drwxrwxr-x 4 pacman pacman    4096 Apr 25 20:09 clusters
-rw-rw-r-- 1 pacman pacman  173057 Apr 25 20:09 luceneDictionary
-rwxrwxrwx 1 pacman pacman 3038677 Apr 25 20:09 luceneVector

./centroids:
total 188
-rwxrwxrwx 1 pacman pacman 191155 Apr 25 20:09 part-randomSeed

./clusters:
total 8
drwxrwxr-x 2 pacman pacman 4096 Apr 25 20:09 clusters-0
drwxrwxr-x 2 pacman pacman 4096 Apr 25 20:09 clusters-1-final

./clusters/clusters-0:
total 324
-rwxrwxrwx 1 pacman pacman  4888 Apr 25 20:09 part-00000
.......
-rwxrwxrwx 1 pacman pacman  4888 Apr 25 20:09 part-00039
-rwxrwxrwx 1 pacman pacman   207 Apr 25 20:09 _policy

./clusters/clusters-1-final:
total 7212
-rwxrwxrwx 1 pacman pacman 7377533 Apr 25 20:09 part-r-00000
-rwxrwxrwx 1 pacman pacman     207 Apr 25 20:09 _policy
-rwxrwxrwx 1 pacman pacman       0 Apr 25 20:09 _SUCCESS


So while running clusterdump, I am confused: what are the cluster points
and the cluster directory supposed to be?
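
A minimal sketch of how these inputs are conventionally wired up,
assuming this Mahout build supports the fkmeans -cl/--clustering flag
(which writes a clusteredPoints directory under the clustering output):

# re-run fkmeans with -cl so the clustered points are written out
./bin/mahout fkmeans -i ../output/fetise/luceneVector \
  -c ../output/fetise/fetise-fkmeans-centroids \
  -o ../output/fetise/fetise-fkmeans-clusters \
  -cd 1.0 -k 40 -m 2 -ow -x 10 -cl \
  -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

# then -i takes the final clusters and -p takes the clustered points,
# not the random-seed centroids (whose Text keys trigger the cast error)
./bin/mahout clusterdump \
  -i ../output/fetise/fetise-fkmeans-clusters/clusters-1-final \
  -p ../output/fetise/fetise-fkmeans-clusters/clusteredPoints \
  -o ../output/fetise/clusterdump \
  -d ../output/fetise/luceneDictionary \
  -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure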


Thanks,
Nishant



On Thu, Apr 25, 2013 at 1:37 PM, nishant rathore <
nishant.rathore12@gmail.com> wrote:

> Hi Ted,
>
> That was a stupid mistake. Thanks a lot for the quick reply and for
> pointing out the issue.
>
> I have changed the idField to the link of the document.
> ./bin/mahout lucene.vector -d
> /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index
> --idField link -o ../output/fetise/luceneVector --field text -w TFIDF
> --dictOut ../output/fetise/luceneDictionary -err 0.10
>
> and ran fkmeans clustering using the command:
> bin/mahout fkmeans -i ../output/fetise/luceneVector -c
> ../output/fetise/fetise-fkmeans-centroids -o
> ../output/fetise/fetise-fkmeans-clusters -cd 1.0 -k 40 -m 2 -ow -x 10 -dm
> org.apache.mahout.common.distance.TanimotoDistanceMeasure
>
> But when running the cluster dumper,
> ./bin/mahout clusterdump -i ../output/fetise/fetise-fkmeans-clusters/ -o
> ../output/fetise/clusterdump -p ../output/fetise/fetise-fkmeans-centroids/
> -d ../output/fetise/luceneDictionary -dm
> org.apache.mahout.common.distance.TanimotoDistanceMeasure
>
> I got the following error:
> Exception in thread "main" java.lang.ClassCastException:
> org.apache.hadoop.io.Text cannot be cast to
> org.apache.hadoop.io.IntWritable
> at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:306)
> at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:252)
> at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:155)
> at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:100)
>
>
> ./bin/mahout seqdumper -i
> ../output/fetise/fetise-fkmeans-centroids/part-randomSeed | more
> Input Path: ../output/fetise/fetise-fkmeans-centroids/part-randomSeed
> Key class: class org.apache.hadoop.io.Text Value Class: class
> org.apache.mahout.clustering.iterator.ClusterWritable
> Key: 662: Value:
> org.apache.mahout.clustering.iterator.ClusterWritable@17cc6cf
> Key: 1014: Value:
> org.apache.mahout.clustering.iterator.ClusterWritable@17cc6cf
>
> Why am I getting the keys in the centroids as Text?
>
>
> Thanks,
> Nishant
>
>
>
>
> On Thu, Apr 25, 2013 at 12:20 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
>> This sounds pretty fishy.
>>
>> What this is saying is that you have a document in your index whose name
>> is longer than 65,535 bytes when encoded.
>>
>> That doesn't sound very plausible.  Don't you have a more appropriate ID
>> column?
>>
>> The problem starts where you say "--idField text".  Pick a better field.
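>>
>> As a quick illustration of that limit, here is a minimal standalone
>> sketch (plain java.io behavior, nothing Mahout-specific):
>>
>> import java.io.ByteArrayOutputStream;
>> import java.io.DataOutputStream;
>>
>> public class WriteUtfLimit {
>>   public static void main(String[] args) throws Exception {
>>     DataOutputStream out =
>>         new DataOutputStream(new ByteArrayOutputStream());
>>     out.writeUTF("short id");  // fine: far fewer than 65,535 bytes
>>
>>     char[] big = new char[94944];  // same size as in the reported error
>>     java.util.Arrays.fill(big, 'x');
>>     // writeUTF stores the encoded length in an unsigned 16-bit field,
>>     // so anything over 65,535 bytes throws UTFDataFormatException
>>     out.writeUTF(new String(big));
>>   }
>> }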
>>
>>
>>
>> On Wed, Apr 24, 2013 at 10:34 PM, nishant rathore <
>> nishant.rathore12@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I am trying to import vectors from Lucene using the command
>> >
>> > ./bin/mahout lucene.vector -d
>> > /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index
>> > --idField text -o ../output/fetise/luceneVector --field text -w TFIDF
>> > --dictOut ../output/fetise/luceneDictionary -err 0.10
>> >
>> > But I am getting the following error:
>> > Exception in thread "main" java.io.UTFDataFormatException: encoded
>> > string too long: 94944 bytes
>> > at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
>> > at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
>> > at org.apache.mahout.math.VectorWritable.writeVector(VectorWritable.java:188)
>> > at org.apache.mahout.math.VectorWritable.write(VectorWritable.java:84)
>> > at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
>> > at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
>> > at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.append(SequenceFile.java:1190)
>> > at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1039)
>> > at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:49)
>> > at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:111)
>> > at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:252)
>> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> > at java.lang.reflect.Method.invoke(Method.java:601)
>> > at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>> > at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>> > at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>> >
>> > I understand that since it is UTF-encoded, the string cannot be larger
>> > than 64KB, but I am confused about how to deal with this. I changed
>> > Mahout to read and write using raw bytes rather than UTF, but later,
>> > while clustering, I got a byte-mismatch error.
>> >
>> > So I reverted the changes. What can I do to circumvent the UTF
>> > limitation? This seems like too obvious an issue not to be solvable
>> > inside Mahout itself.
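>> >
>> > For reference, a rough sketch of the kind of byte-based replacement
>> > described above (hypothetical helper code, not the actual change to
>> > Mahout):
>> >
>> > import java.io.*;
>> >
>> > public class LongStrings {
>> >   // hypothetical stand-ins for writeUTF/readUTF: a 32-bit length
>> >   // prefix avoids the 65,535-byte cap, but writer and reader must
>> >   // agree, which is why a one-sided change breaks clustering later
>> >   static void writeLongString(DataOutput out, String s) throws IOException {
>> >     byte[] bytes = s.getBytes("UTF-8");
>> >     out.writeInt(bytes.length);  // 32-bit length instead of 16-bit
>> >     out.write(bytes);
>> >   }
>> >
>> >   static String readLongString(DataInput in) throws IOException {
>> >     byte[] bytes = new byte[in.readInt()];
>> >     in.readFully(bytes);         // fails if the two sides disagree
>> >     return new String(bytes, "UTF-8");
>> >   }
>> > }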
>> >
>> >
>> > Thanks,
>> > Nishant
>> >
>>
>
>
