From: Lance Norskog <goksron@gmail.com>
To: user@mahout.apache.org
Date: Tue, 24 Jul 2012 23:59:15 -0700
Subject: Re: .txt to vector

You're making progress! Run "bin/mahout lucene.vector" and look at the help message:

  --maxPercentErrorDocs (-err) maxPercentErrorDocs
      The maximum percentage of docs that can have a null term vector.
      These are noise documents, which can occur if the analyzer used
      strips out all terms in the target field. This percentage is
      expressed as a value between 0 and 1. The default is 0.

You want 0.3, not 30!

On Tue, Jul 24, 2012 at 1:27 AM, Videnova, Svetlana wrote:
> I found this: http://comments.gmane.org/gmane.comp.apache.mahout.devel/16422
>
> When I run this:
> apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_bananas/ -o ./toto/lucene_vector_test/tom_indexes_output --maxPercentErrorDocs 30 --field bananas -t ./toto/lucene_vector_test/dictionnary/ -n 2
>
> I get this error:
> 12/07/24 09:25:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 12/07/24 09:25:22 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.IllegalArgumentException
>
> -----Original message-----
> From: Videnova, Svetlana [mailto:svetlana.videnova@logica.com]
> Sent: Tuesday, 24 July 2012 09:16
> To: user@mahout.apache.org
> Subject: RE: .txt to vector
>
> Hi Lance,
>
> My dir now contains _0.tvf and the others.
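The advice above in command form. The corrected invocation is shown as a comment (it needs a Mahout checkout to run); the `check_err` function below is only a local sketch of the range rule stated in the help text — an assumption about why 30 triggers the IllegalArgumentException, not the driver's actual code:

```shell
# Corrected invocation: --maxPercentErrorDocs takes a fraction in [0, 1],
# so "30 percent" is written as 0.3, not 30. (Paths from the thread.)
#   ./bin/mahout lucene.vector --dir ./toto/index_bananas/ \
#     -o ./toto/lucene_vector_test/tom_indexes_output \
#     --field bananas -t ./toto/lucene_vector_test/dictionnary/ \
#     -n 2 --maxPercentErrorDocs 0.3

# Sketch of the range check (assumption: values outside [0, 1] are rejected
# with IllegalArgumentException, per the help text above).
check_err() {
  awk -v v="$1" 'BEGIN { exit !(v >= 0 && v <= 1) }' \
    && echo "ok: $1" \
    || echo "IllegalArgumentException: $1"
}
check_err 0.3   # accepted
check_err 30    # rejected
```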
>
> With the command:
> apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_bananas/ -o ./toto/lucene_vector_test/tom_indexes_output --field bananas -t ./toto/lucene_vector_test/dictionnary/ -n 2
> the output is:
> ...
> 12/07/24 08:13:01 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for bananas
> Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for bananas
> ...
>
> I still can't understand the error...
>
> Thank you
>
> -----Original message-----
> From: Lance Norskog [mailto:goksron@gmail.com]
> Sent: Tuesday, 24 July 2012 04:28
> To: user@mahout.apache.org
> Subject: Re: .txt to vector
>
> You have to add term vectors to the field type you want to use. Then you have to reindex all of the data. You will then have another file in the index with the suffix .tvf. This holds the data that the Mahout lucene job looks for.
>
> On Mon, Jul 23, 2012 at 8:03 AM, Videnova, Svetlana wrote:
>> Hello again,
>>
>> I took my indexed files from Solr on Windows and copied them into a directory on Ubuntu.
>> They look like this:
>> ###
>> index_test$ ls
>> _4d.fdt  _4d.frq  _4d.tis  _4e.fdx  _4e.frq  _4e.prx  _4e.tis  segments.gen
>> _4d.fdx  _4d.prx  _4e.fdt  _4e.fnm  _4e.nrm  _4e.tii  segments_55
>> ###
>>
>> _4d.tis looks like:
>> ###
>> ]0 - PA – savoir où se trouve un panier workflow, statut
>> ###
>>
>> Then I'm using Mahout like this:
>> apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_test/ -o ./toto/lucene_vector_test/tom_indexes_output --field PA -t ./toto/lucene_vector_test/dictionnary/ -n 2
>> The output is:
>>
>> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> 12/07/23 15:50:09 INFO lucene.Driver: Output File: ./toto/lucene_vector_test/tom_indexes_output
>> 12/07/23 15:50:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 12/07/23 15:50:10 INFO compress.CodecPool: Got brand-new compressor
>> 12/07/23 15:50:10 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for PA
>> Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for PA
>>         at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:118)
>>         at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:41)
>>         at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>         at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>         at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:44)
>>         at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:109)
>>         at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:250)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>
>> I'm vectorizing the field "PA", which is used in a lot of files, so I don't understand why the exception tells me "too many documents that do not have a term vector for PA".
>>
>> Can somebody explain how I should use the lucene.vector command? Apparently I'm missing something...
>>
>> Thank you all!
>>
>> -----Original message-----
>> From: Videnova, Svetlana [mailto:svetlana.videnova@logica.com]
>> Sent: Monday, 23 July 2012 10:18
>> To: user@mahout.apache.org
>> Subject: RE: .txt to vector
>>
>> I'm using Mahout on Ubuntu and Solr on Windows. I guess that with a web service I can get the indexed files from Solr, and then, through a Java program in the web service, call the Mahout library to classify/cluster and categorize my database.
>> For the moment I'm just training with a directory on Ubuntu (my dir contains .xml, .txt, .csv), because I don't know where I can get the indexed files from Solr on Ubuntu...?!
>> Also, I'm using the latest version, called apache-mahout-d6d6ee8.
>>
>> When I'm using lucene.vector like:
>> $ ./bin/mahout lucene.vector -d ./toto/lucene_vector_test/ -o ./toto/lucene_vector_test/ -t ./toto/ -f content -n 2
>> Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.NIOFSDirectory@/usr/local/apache-mahout-d6d6ee8/toto/lucene_vector_test lockFactory=org.apache.lucene.store.NativeFSLockFactory@157aa53: files: []
>>
>> Thank you
>>
>> -----Original message-----
>> From: Lance Norskog [mailto:goksron@gmail.com]
>> Sent: Saturday, 21 July 2012 05:55
>> To: user@mahout.apache.org
>> Subject: Re: .txt to vector
>>
>> Solr creates Lucene index files. You can query it for content in several formats. You will have to fetch the data with a program.
>>
>> bin/mahout lucene.vector creates vector sequencefiles from a Lucene index. I have not tried this. You have to configure Solr to create term vectors for the field you want. This is in the field type declaration; see the Introduction in:
>> http://wiki.apache.org/solr/TermVectorComponent
>>
>> I don't know if lucene.vector is in the Mahout 0.5 release.
>>
>> For cluster outputs, the current cluster dumper supports 'graphml' format. Giraph is an interactive graph browser. You can look at small cluster jobs.
>>
>> On Thu, Jul 19, 2012 at 11:34 PM, Videnova, Svetlana wrote:
>>> Hi,
>>> I already have Mahout in Action, but nothing in it works with the latest Mahout version...
>>> I will look again.
>>> Does "Taming Text" cover .xml and JSON files too? My goal is to take the output of Solr (which is .xml, JSON or PHP).
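The Solr-side change Lance describes — enabling term vectors in the field declaration, then reindexing — looks roughly like this in schema.xml. This is a sketch: the field name "PA" comes from the thread, but the type name text_general and the other attribute values are assumptions about this particular schema:

```xml
<!-- schema.xml: term vectors must be generated for the field that
     bin/mahout lucene.vector will read (field type is an assumption) -->
<field name="PA" type="text_general" indexed="true" stored="true"
       termVectors="true"/>
```

After this change the index gains the .tvf term-vector file mentioned above once the data is reindexed.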
>>>
>>> Regards
>>>
>>> -----Original message-----
>>> From: Lance Norskog [mailto:goksron@gmail.com]
>>> Sent: Friday, 20 July 2012 03:16
>>> To: user@mahout.apache.org
>>> Subject: Re: .txt to vector
>>>
>>> There are two books out for Mahout and text processing. "Mahout in Action" covers all of the apps in Mahout. "Taming Text" gives a good, detailed explanation of the text processing programs in Mahout, and otherwise covers other text processing problems.
>>>
>>> Mahout in Action is very good, and can help you use most of the Mahout features.
>>>
>>> http://www.manning.com/owen
>>> http://www.manning.com/ingersoll
>>>
>>> On Thu, Jul 19, 2012 at 8:08 AM, Videnova, Svetlana wrote:
>>>> Hi again,
>>>> Just finished. This is what I did:
>>>>
>>>> Mahout .txt to seqfile, following https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
>>>>
>>>> Converting a directory of documents to SequenceFile format:
>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seqdirectory --input /usr/local/apache-mahout-d6d6ee8/toto --output /usr/local/apache-mahout-d6d6ee8/examples/output/
>>>> - This first step creates a chunk-0 file in the output path you gave.
>>>>
>>>> Creating vectors from the SequenceFile:
>>>> ./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ -maxNGramSize
>>>> * Don't forget to give ./toto/output full rights.
>>>> - This second step takes the chunk-0 created by the first step and creates the output dir where you specified it in the --output option.
>>>>
>>>> Creating vectors with kmeans:
>>>> ./bin/mahout kmeans -i ./toto/output/tfidf-vectors/ -c ./toto/centroides_kmeans/ -cl -o ./toto/cluster_kmeans/ -k 20 -ow -x 10
>>>>
>>>> Transforming vectors to human-readable form (does not work yet):
>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ -of TEXT -d ./toto/output/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir ./toto/cluster_kmeans/clusteredPoints/
>>>> * -s got changed to -i for Mahout 0.7
>>>> * This works: ./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ --pointsDir ./toto/cluster_kmeans/clusteredPoints/
>>>>
>>>> Can somebody please explain the files below? What exactly do they contain, and how are they used?
>>>> dictionary.file-0 ; tfidf-vectors ; tokenized-documents ; df-count ; frequency.file-0 ; tf-vectors ; wordcount
>>>>
>>>> What is the chunk-0 file exactly?
>>>>
>>>> What does the clusters-dump created by the clusterdump command represent?
>>>>
>>>> Thank you all!
>>>>
>>>> -----Original message-----
>>>> From: Videnova, Svetlana [mailto:svetlana.videnova@logica.com]
>>>> Sent: Thursday, 19 July 2012 15:07
>>>> To: user@mahout.apache.org
>>>> Subject: RE: .txt to vector
>>>>
>>>> The problem was that I gave seq2sparse the directory as input, and not chunk-0 directly.
>>>> Also, I hadn't given write rights for "group" and "others" on my output file.
>>>>
>>>> After running the command -> ./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ --maxNGramSize 3
>>>> I got -> 12/07/19 13:57:10 INFO driver.MahoutDriver: Program took 57093 ms (Minutes: 0.95155)
>>>>
>>>> So I went to my output and there is ->
>>>> root@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/toto/output# ls
>>>> df-count  frequency.file-0  tf-vectors  wordcount
>>>> dictionary.file-0  tfidf-vectors  tokenized-documents
>>>>
>>>> What should the vector files look like?
>>>> And can somebody please explain what each directory of the output above represents?
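The four steps described above, consolidated into one script. Paths and flags are the ones from the thread (against the Mahout 0.7/trunk CLI); treat this as a sketch, not a canonical recipe — it requires a Mahout checkout and cannot run stand-alone:

```
#!/bin/sh
# Run from the apache-mahout-d6d6ee8 checkout directory.

# 1. Directory of raw .txt/.xml/.csv files -> SequenceFile (creates chunk-0)
./bin/mahout seqdirectory --input ./toto --output ./examples/output/

# 2. SequenceFile -> sparse vectors; writes tf-vectors, tfidf-vectors,
#    dictionary.file-0, df-count, wordcount, etc. under --output
./bin/mahout seq2sparse --input ./examples/output/chunk-0 \
    --output ./toto/output/ --maxNGramSize 3

# 3. k-means over the TF-IDF vectors: k=20 clusters, at most 10 iterations;
#    -cl also classifies points into clusteredPoints/
./bin/mahout kmeans -i ./toto/output/tfidf-vectors/ \
    -c ./toto/centroides_kmeans/ -cl -o ./toto/cluster_kmeans/ \
    -k 20 -ow -x 10

# 4. Dump the final clusters to human-readable text (-s became -i in 0.7)
./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ \
    -o ./toto/clusters-dump/ \
    --pointsDir ./toto/cluster_kmeans/clusteredPoints/
```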
>>>>
>>>> Thank you
>>>>
>>>> -----Original message-----
>>>> From: Videnova, Svetlana [mailto:svetlana.videnova@logica.com]
>>>> Sent: Thursday, 19 July 2012 14:26
>>>> To: user@mahout.apache.org
>>>> Subject: RE: .txt to vector
>>>>
>>>> Yes, that is what I was saying.
>>>>
>>>> But I have no idea where in the code Mahout calls/creates the data that I don't have.
>>>> And the clusters that I have (especially clusters-8) are old and were not generated by seqdirectory or by seq2sparse...
>>>> Should I do other manipulations before the seqdirectory or seq2sparse step?
>>>>
>>>> Thank you
>>>>
>>>> -----Original message-----
>>>> From: Alexander Aristov [mailto:alexander.aristov@gmail.com]
>>>> Sent: Thursday, 19 July 2012 12:05
>>>> To: user@mahout.apache.org
>>>> Subject: Re: .txt to vector
>>>>
>>>> You've got another problem now:
>>>>
>>>> Exception in thread "main" java.io.FileNotFoundException: File file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data does not exist.
>>>>
>>>> Best Regards
>>>> Alexander Aristov
>>>>
>>>> On 19 July 2012 12:30, Videnova, Svetlana wrote:
>>>>
>>>>> Hi Lance,
>>>>>
>>>>> Thank you for your fast answer.
>>>>> I changed my:
>>>>> CLASSPATH=/opt/lucene-3.6.0/lucene-core-3.6.0.jar:/opt/lucene-3.6.0/lucene-core-3.6.0-javadoc.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0-javadoc.jar:.
>>>>>
>>>>> And put 3.6.0 in the pom.xml
>>>>>
>>>>> But:
>>>>>
>>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seq2sparse --input ./examples/output/ --output ./toto/output/
>>>>> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
>>>>> SLF4J: Class path contains multiple SLF4J bindings.
>>>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>>>>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>>>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
>>>>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
>>>>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
>>>>> 12/07/19 09:03:56 INFO input.FileInputFormat: Total input paths to process : 15
>>>>> 12/07/19 09:03:56 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-csi/mapred/staging/csi-379951768/.staging/job_local_0001
>>>>> Exception in thread "main" java.io.FileNotFoundException: File file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data does not exist.
>>>>>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
>>>>>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
>>>>>         at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>>>>>         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>>>>>         at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:919)
>>>>>         at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:936)
>>>>>         at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>>>>>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:854)
>>>>>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:807)
>>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>>         at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>>>>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:807)
>>>>>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>>>>>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:495)
>>>>>         at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:93)
>>>>>         at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:255)
>>>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>>         at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>>
>>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8$ ls
>>>>> _logs  part-r-00000  _policy  _SUCCESS
>>>>>
>>>>> There is no /usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data here!
>>>>>
>>>>> Thank you
>>>>>
>>>>> -----Original message-----
>>>>> From: Lance Norskog [mailto:goksron@gmail.com]
>>>>> Sent: Thursday, 19 July 2012 09:33
>>>>> To: user@mahout.apache.org
>>>>> Subject: Re: .txt to vector
>>>>>
>>>>> Yes, the Mahout analyzer would have to be updated for Lucene 4.0. I suggest using an earlier one. Mahout uses Lucene in a very simple way, and it is OK to use any earlier Lucene from 3.1 to 3.6.
>>>>>
>>>>> On Wed, Jul 18, 2012 at 11:50 PM, Videnova, Svetlana <svetlana.videnova@logica.com> wrote:
>>>>> > Hi Sean,
>>>>> >
>>>>> > In fact I was using Lucene version 3.6.0 (saw that in the pom.xml), but in my classpath I was using Lucene version 4.0.0.
>>>>> >
>>>>> > I changed the pom.xml to 4.0.0 => 4.0.0
>>>>> >
>>>>> > But still the same error:
>>>>> > ###
>>>>> > Exception in thread "main" java.lang.VerifyError: class org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
>>>>> > ###
>>>>> >
>>>>> > Should I change something else? Or maybe Lucene 4.0 is too recent for Mahout!?
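Lance's recommendation (any Lucene from 3.1 to 3.6) translates to pinning the Lucene dependency in the build rather than upgrading it to 4.0. A pom.xml sketch — the exact dependency layout of the real Mahout pom may differ, so treat this as illustrative:

```xml
<!-- pom.xml: stay on the 3.x Lucene line that Mahout's DefaultAnalyzer
     compiles against (3.1 to 3.6 per the thread), not 4.0 -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>3.6.0</version>
</dependency>
```

The runtime classpath must match: mixing a 3.6.0 pom with 4.0.0 jars on CLASSPATH (or vice versa) reproduces the VerifyError above.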
>>>>> >
>>>>> > Thank you
>>>>> >
>>>>> > -----Original message-----
>>>>> > From: Sean Owen [mailto:srowen@gmail.com]
>>>>> > Sent: Wednesday, 18 July 2012 22:52
>>>>> > To: user@mahout.apache.org
>>>>> > Subject: Re: .txt to vector
>>>>> >
>>>>> > This means you're using it with an incompatible version of Lucene. I think we're on 3.1. Check the version that Mahout depends upon and use at least that version or later.
>>>>> >
>>>>> > On Wed, Jul 18, 2012 at 6:04 PM, Videnova, Svetlana <svetlana.videnova@logica.com> wrote:
>>>>> >
>>>>> >> I'm working with Mahout. I'm trying to write a web service in Java myself that will take the output of Solr and give this file to Mahout.
>>>>> >> For the moment I have successfully done the recommendation part.
>>>>> >> Now I'm trying to cluster. For this I have to vectorize the output of Solr.
>>>>> >> Do you have any idea how to do it, please? I was following https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html BUT it doesn't work very well (at all...).
>>>>> >>
>>>>> >> I'm trying to find out how to transform .txt to vectors for Mahout in order to cluster and categorize my information. Is it possible?
>>>>> >> I saw that I have to use seqdirectory and seq2sparse.
>>>>> >>
>>>>> >> Seqdirectory creates a file (with some numbers and everything...); this step is OK. But then when I use seq2sparse, it gives me this error:
>>>>> >>
>>>>> >> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seq2sparse --input ./examples/output/ --output ./toto/output/
>>>>> >> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
>>>>> >> SLF4J: Class path contains multiple SLF4J bindings.
>>>>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> >> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>>>>> >> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>>>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
>>>>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
>>>>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
>>>>> >> Exception in thread "main" java.lang.VerifyError: class org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
>>>>> >>         at java.lang.ClassLoader.defineClass1(Native Method)
>>>>> >>         at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
>>>>> >>         at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
>>>>> >>         at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
>>>>> >>         at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
>>>>> >>         at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
>>>>> >>         at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
>>>>> >>         at java.security.AccessController.doPrivileged(Native Method)
>>>>> >>         at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>>>>> >>         at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>>>>> >>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>>>>> >>         at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>>>>> >>         at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:199)
>>>>> >>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>> >>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>> >>         at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
>>>>> >>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> >>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>> >>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>>> >>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>> >>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>> >>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>> >>
>>>>> >> I'm using only Lucene 4.0!
>>>>> >>
>>>>> >> CLASSPATH=/opt/lucene-4.0.0-ALPHA/demo/lucene-demo-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/core/lucene-core-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/analysis/common/lucene-analyzers-common-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/queryparser/lucene-queryparser-4.0.0-ALPHA.jar:.
>>>>> >>
>>>>> >> Please, where am I going wrong?
>>>>> >>
>>>>> >> Thank you all
>>>>> >> Regards
>>>>> >>
>>>>> >> Think green - keep it on the screen.
>>>>> >>
>>>>> >> This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.

--
Lance Norskog
goksron@gmail.com