mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: .txt to vector
Date Wed, 25 Jul 2012 06:59:15 GMT
You're making progress! Run "bin/mahout lucene.vector" and look at the
help message:
  --maxPercentErrorDocs (-err) maxPercentErrorDocs    The max percentage of
                                                      docs that can have a null
                                                      term vector. These are
                                                      noise document and can
                                                      occur if the analyzer
                                                      used strips out all terms
                                                      in the target field. This
                                                      percentage is expressed
                                                      as a value between 0 and
                                                      1. The default is 0.

You want 0.3, not 30!
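Concretely, the failing command from the quoted message only needs the flag rewritten as a fraction (paths reproduced from the message below; a sketch, not verified against this index):

```shell
# maxPercentErrorDocs takes a fraction between 0 and 1, so "30%" is written 0.3
./bin/mahout lucene.vector --dir ./toto/index_bananas/ \
  -o ./toto/lucene_vector_test/tom_indexes_output \
  --maxPercentErrorDocs 0.3 --field bananas \
  -t ./toto/lucene_vector_test/dictionnary/ -n 2
```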

On Tue, Jul 24, 2012 at 1:27 AM, Videnova, Svetlana <svetlana.videnova@logica.com> wrote:
> I found this: http://comments.gmane.org/gmane.comp.apache.mahout.devel/16422
>
> When I run this: apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_bananas/ -o ./toto/lucene_vector_test/tom_indexes_output --maxPercentErrorDocs 30 --field bananas -t ./toto/lucene_vector_test/dictionnary/ -n 2
>
> I get this error:
> 12/07/24 09:25:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 12/07/24 09:25:22 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.IllegalArgumentException
>
> -----Original Message-----
> From: Videnova, Svetlana [mailto:svetlana.videnova@logica.com]
> Sent: Tuesday, July 24, 2012 09:16
> To: user@mahout.apache.org
> Subject: RE: .txt to vector
>
> Hi Lance,
>
> My dir now contains _0.tvf and the other index files.
>
> With the command:
> apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_bananas/ -o ./toto/lucene_vector_test/tom_indexes_output --field bananas -t ./toto/lucene_vector_test/dictionnary/ -n 2
> the output is:
> ...
> 12/07/24 08:13:01 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for bananas
> Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for bananas
> ...
>
>
> I still can't understand the error...
>
> Thank you
>
>
> -----Original Message-----
> From: Lance Norskog [mailto:goksron@gmail.com]
> Sent: Tuesday, July 24, 2012 04:28
> To: user@mahout.apache.org
> Subject: Re: .txt to vector
>
> You have to add termVectors to the field type you want to use. Then you have to reindex all of the data. You will then have another file in the index with the suffix .tvf. This has the data which the Mahout lucene job looks for.
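In Solr this is done per field in schema.xml. A minimal sketch (the field name "bananas" comes from the quoted commands; the type name "text" and the optional termPositions/termOffsets attributes are assumptions about the user's schema):

```xml
<!-- schema.xml: turn on term vectors for the field the Mahout job reads -->
<field name="bananas" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

After editing the schema, restart Solr and reindex; the .tvf file then appears in the index directory.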
>
> On Mon, Jul 23, 2012 at 8:03 AM, Videnova, Svetlana <svetlana.videnova@logica.com> wrote:
>> Hello again,
>>
>> I got my indexed files from Solr on Windows and copied them into a directory on Ubuntu.
>> They look like this:
>> ###
>> index_test$ ls
>> _4d.fdt  _4d.frq  _4d.tis  _4e.fdx  _4e.frq  _4e.prx  _4e.tis  segments.gen
>> _4d.fdx  _4d.prx  _4e.fdt  _4e.fnm  _4e.nrm  _4e.tii  segments_55
>> ###
>>
>> _4d.tis looks like:
>> ###
>>              ]0 - PA – savoir où se trouve un panier        workflow, statut
>> ###
>>
>>
>> Then I'm using Mahout like this:
>> apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_test/ -o ./toto/lucene_vector_test/tom_indexes_output --field PA -t ./toto/lucene_vector_test/dictionnary/ -n 2
>> The output is:
>>
>> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> 12/07/23 15:50:09 INFO lucene.Driver: Output File: ./toto/lucene_vector_test/tom_indexes_output
>> 12/07/23 15:50:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 12/07/23 15:50:10 INFO compress.CodecPool: Got brand-new compressor
>> 12/07/23 15:50:10 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for PA
>> Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for PA
>>         at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:118)
>>         at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:41)
>>         at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>         at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>         at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:44)
>>         at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:109)
>>         at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:250)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>
>>
>>
>> I'm looking for the field "PA", which is used in a lot of files, so I don't understand why the exception tells me "too many documents that do not have a term vector for PA".
>>
>> Can somebody explain to me how to use the lucene.vector command? Apparently I'm missing something...
>>
>> Thank you all!
>>
>>
>> -----Original Message-----
>> From: Videnova, Svetlana [mailto:svetlana.videnova@logica.com]
>> Sent: Monday, July 23, 2012 10:18
>> To: user@mahout.apache.org
>> Subject: RE: .txt to vector
>>
>> I'm using Mahout on Ubuntu and Solr on Windows. I guess with a web service I can get the indexed files from Solr, and then through a Java program in the web service call the Mahout library to classify/cluster and categorize my database.
>> For the moment I'm just training with a directory on Ubuntu (my dir contains .xml, .txt, .csv files), because I don't know where I can get the indexed files from Solr on Ubuntu...?!
>> Also I'm using the latest version, called apache-mahout-d6d6ee8.
>>
>> When I'm using lucene.vector like:
>> $ ./bin/mahout lucene.vector -d ./toto/lucene_vector_test/ -o ./toto/lucene_vector_test/ -t ./toto/ -f content -n 2
>> Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.NIOFSDirectory@/usr/local/apache-mahout-d6d6ee8/toto/lucene_vector_test lockFactory=org.apache.lucene.store.NativeFSLockFactory@157aa53: files: []
>>
>>
>> Thank you
>>
>>
>>
>> -----Original Message-----
>> From: Lance Norskog [mailto:goksron@gmail.com]
>> Sent: Saturday, July 21, 2012 05:55
>> To: user@mahout.apache.org
>> Subject: Re: .txt to vector
>>
>> Solr creates Lucene index files. You can query it for content in several formats. You will have to fetch the data with a program.
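For example, a plain HTTP query against Solr's select handler returns documents in XML or JSON (a sketch; the host, port, and handler path assume a default single-core Solr install):

```shell
# Fetch the first 10 documents as XML; use wt=json for JSON output
curl 'http://localhost:8983/solr/select?q=*:*&rows=10&wt=xml' > solr-output.xml
```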
>>
>> "bin/mahout lucene.vector" creates vector SequenceFiles from a Lucene index. I have not tried this. You have to configure Solr to create term vectors for the field you want. This is in the field type declaration; see the Introduction in:
>> http://wiki.apache.org/solr/TermVectorComponent
>>
>> I don't know if lucene.vector is in the Mahout 0.5 release.
>>
>> For cluster outputs, the current cluster dumper supports 'graphml' format. Gephi is an interactive graph browser. You can look at small cluster jobs.
>>
>> On Thu, Jul 19, 2012 at 11:34 PM, Videnova, Svetlana <svetlana.videnova@logica.com> wrote:
>>> Hi,
>>> I already have Mahout in Action, but nothing works with the latest Mahout version...
>>> I will look again.
>>> For "Taming Text": does it treat .xml and JSON files too? My goal is to process the output of Solr (which is .xml, JSON or PHP).
>>>
>>>
>>>
>>> Regards
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Lance Norskog [mailto:goksron@gmail.com]
>>> Sent: Friday, July 20, 2012 03:16
>>> To: user@mahout.apache.org
>>> Subject: Re: .txt to vector
>>>
>>> There are two books out for Mahout and text processing. "Mahout in Action" covers all of the apps in Mahout. "Taming Text" gives a good detailed explanation of the text processing programs in Mahout, and otherwise covers other text processing problems.
>>>
>>> Mahout in Action is very good, and can help you use most of the Mahout features.
>>>
>>> http://www.manning.com/owen
>>> http://www.manning.com/ingersoll
>>>
>>> On Thu, Jul 19, 2012 at 8:08 AM, Videnova, Svetlana <svetlana.videnova@logica.com> wrote:
>>>> Hi again,
>>>> Just finished.
>>>> That's what I did:
>>>>
>>>> Mahout .txt to seqfile
>>>> https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
>>>> Converting a directory of documents to SequenceFile format:
>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seqdirectory --input /usr/local/apache-mahout-d6d6ee8/toto --output /usr/local/apache-mahout-d6d6ee8/examples/output/
>>>> - This first step will create a chunk-0 file in the output path that you gave.
>>>> Creating vectors from the SequenceFile:
>>>> ./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ -maxNGramSize
>>>> * Don't forget to give ./toto/output full rights.
>>>> - This second step will take the chunk-0 created by the first step and will create the output dir where you specified in the --output option.
>>>>
>>>> Creating vectors with kmeans:
>>>> ./bin/mahout kmeans -i ./toto/output/tfidf-vectors/ -c ./toto/centroides_kmeans/ -cl -o ./toto/cluster_kmeans/ -k 20 -ow -x 10
>>>>
>>>> Transforming vectors to human-readable form (does not work yet):
>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ -of TEXT -d ./toto/output/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir ./toto/cluster_kmeans/clusteredPoints/
>>>> * -s got changed to -i for Mahout 0.7
>>>> * This works: ./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ --pointsDir ./toto/cluster_kmeans/clusteredPoints/
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Can somebody please explain the files below? What exactly do they contain, and how do I use them, etc.?
>>>> dictionary.file-0 ; tfidf-vectors ; tokenized-documents ; df-count ; frequency.file-0 ; tf-vectors ; wordcount
>>>>
>>>>
>>>> What is the chunk-0 file exactly?
>>>>
>>>>
>>>> What do the clusters-dump files created at the end by the clusterdump command represent?
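A practical way to answer these questions is to dump the files: everything seqdirectory and seq2sparse write is a Hadoop SequenceFile, and Mahout ships a seqdumper utility for inspecting them (a sketch; the -i flag assumes a 0.7+ build, like the clusterdump note above):

```shell
# dictionary.file-0 maps each term to its integer id in the vectors
./bin/mahout seqdumper -i ./toto/output/dictionary.file-0 | head -n 20

# chunk-0 from seqdirectory maps each filename to its full document text
./bin/mahout seqdumper -i ./examples/output/chunk-0 | head -n 20
```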
>>>>
>>>>
>>>> Thank you all!
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Videnova, Svetlana [mailto:svetlana.videnova@logica.com]
>>>> Sent: Thursday, July 19, 2012 15:07
>>>> To: user@mahout.apache.org
>>>> Subject: RE: .txt to vector
>>>>
>>>> The problem was that I gave the directory as input to seq2sparse instead of the chunk file directly.
>>>> Also, I didn't have write rights for "group" and "others" on my output file.
>>>>
>>>> After running the command -> ./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ --maxNGramSize 3
>>>> I got -> 12/07/19 13:57:10 INFO driver.MahoutDriver: Program took 57093 ms (Minutes: 0.95155)
>>>>
>>>>
>>>> So I went to my output and there is ->
>>>> root@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/toto/output# ls
>>>> df-count           frequency.file-0  tf-vectors           wordcount
>>>> dictionary.file-0  tfidf-vectors     tokenized-documents
>>>>
>>>> What should the vector files look like?
>>>> And can somebody please explain what each directory in the output above represents?
>>>>
>>>>
>>>>
>>>> Thank you
>>>>
>>>> -----Original Message-----
>>>> From: Videnova, Svetlana [mailto:svetlana.videnova@logica.com]
>>>> Sent: Thursday, July 19, 2012 14:26
>>>> To: user@mahout.apache.org
>>>> Subject: RE: .txt to vector
>>>>
>>>> Yes, that's what I was saying.
>>>>
>>>> But I have no idea where in the code Mahout calls/creates the data that I don't have.
>>>> And the clusters that I have (especially clusters-8) are old and were not generated by seqdirectory or by seq2sparse...
>>>> Should I do other manipulations before the seqdirectory or seq2sparse step?
>>>>
>>>>
>>>> Thank you
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Alexander Aristov [mailto:alexander.aristov@gmail.com]
>>>> Sent: Thursday, July 19, 2012 12:05
>>>> To: user@mahout.apache.org
>>>> Subject: Re: .txt to vector
>>>>
>>>> you've got another problem now
>>>>
>>>> Exception in thread "main" java.io.FileNotFoundException: File file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data does not exist.
>>>>
>>>> Best Regards
>>>> Alexander Aristov
>>>>
>>>>
>>>> On 19 July 2012 12:30, Videnova, Svetlana <svetlana.videnova@logica.com> wrote:
>>>>
>>>>> Hi Lance,
>>>>>
>>>>> Thank you for your fast answer.
>>>>> I changed my:
>>>>> CLASSPATH=/opt/lucene-3.6.0/lucene-core-3.6.0.jar:/opt/lucene-3.6.0/lucene-core-3.6.0-javadoc.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0-javadoc.jar:.
>>>>>
>>>>> And put 3.6.0 in the pom.xml
>>>>>
>>>>>
>>>>> But:
>>>>>
>>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seq2sparse --input ./examples/output/ --output ./toto/output/
>>>>> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
>>>>> SLF4J: Class path contains multiple SLF4J bindings.
>>>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>>>>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>>>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
>>>>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
>>>>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
>>>>> 12/07/19 09:03:56 INFO input.FileInputFormat: Total input paths to process : 15
>>>>> 12/07/19 09:03:56 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-csi/mapred/staging/csi-379951768/.staging/job_local_0001
>>>>> Exception in thread "main" java.io.FileNotFoundException: File file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data does not exist.
>>>>>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
>>>>>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
>>>>>         at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>>>>>         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>>>>>         at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:919)
>>>>>         at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:936)
>>>>>         at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>>>>>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:854)
>>>>>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:807)
>>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>>         at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>>>>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:807)
>>>>>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>>>>>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:495)
>>>>>         at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:93)
>>>>>         at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:255)
>>>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>>         at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>>
>>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8$ ls
>>>>> _logs  part-r-00000  _policy  _SUCCESS
>>>>>
>>>>> There is no /usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data here!
>>>>>
>>>>>
>>>>> Thank you
>>>>>
>>>>> -----Original Message-----
>>>>> From: Lance Norskog [mailto:goksron@gmail.com]
>>>>> Sent: Thursday, July 19, 2012 09:33
>>>>> To: user@mahout.apache.org
>>>>> Subject: Re: .txt to vector
>>>>>
>>>>> Yes, the Mahout analyzer would have to be updated for Lucene 4.0. I suggest using an earlier one. Mahout uses Lucene in a very simple way, and it is OK to use any earlier Lucene from 3.1 to 3.6.
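To check which Lucene version a Mahout checkout actually resolves (before pointing CLASSPATH at a different one), Maven can print the dependency tree; this is a standard Maven command, run from the Mahout source root:

```shell
# List every Lucene artifact Mahout pulls in, with its resolved version
mvn dependency:tree | grep -i lucene
```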
>>>>>
>>>>> On Wed, Jul 18, 2012 at 11:50 PM, Videnova, Svetlana <svetlana.videnova@logica.com> wrote:
>>>>> > Hi Sean,
>>>>> >
>>>>> > In fact I was using Lucene version 3.6.0 (I saw that in the pom.xml), but in my classpath I was using Lucene version 4.0.0.
>>>>> >
>>>>> > I changed pom.xml to 4.0.0 => <lucene.version>4.0.0</lucene.version>
>>>>> >
>>>>> > But still the same error:
>>>>> > ###
>>>>> > Exception in thread "main" java.lang.VerifyError: class org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
>>>>> > ###
>>>>> >
>>>>> > Should I change something else? Or maybe Lucene 4.0 is too recent for Mahout!?
>>>>> >
>>>>> >
>>>>> >
>>>>> > Thank you
>>>>> >
>>>>> > -----Original Message-----
>>>>> > From: Sean Owen [mailto:srowen@gmail.com]
>>>>> > Sent: Wednesday, July 18, 2012 22:52
>>>>> > To: user@mahout.apache.org
>>>>> > Subject: Re: .txt to vector
>>>>> >
>>>>> > This means you're using it with an incompatible version of Lucene. I think we're on 3.1. Check the version that Mahout depends upon and use at least that version or later.
>>>>> >
>>>>> > On Wed, Jul 18, 2012 at 6:04 PM, Videnova, Svetlana <svetlana.videnova@logica.com> wrote:
>>>>> >
>>>>> >> I'm working with Mahout. I'm trying to build a web service in Java myself which will take the output of Solr and give this file to Mahout.
>>>>> >> For the moment I have successfully done the recommendation part.
>>>>> >> Now I'm trying to cluster. For this I have to vectorize the output of Solr.
>>>>> >> Do you have any idea how to do it, please? I was following https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html BUT it doesn't work very well (at all...).
>>>>> >>
>>>>> >> I'm trying to find out how to transform .txt files to vectors for Mahout in order to cluster and categorize my information. Is it possible?
>>>>> >> I saw that I have to use seqdirectory and seq2sparse.
>>>>> >>
>>>>> >> seqdirectory creates a file (with some numbers and everything...); this step is OK. But then when I use seq2sparse it gives me this error:
>>>>> >>
>>>>> >> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seq2sparse --input ./examples/output/ --output ./toto/output/
>>>>> >> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
>>>>> >> SLF4J: Class path contains multiple SLF4J bindings.
>>>>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> >> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>>>>> >> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>>>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
>>>>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
>>>>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
>>>>> >> Exception in thread "main" java.lang.VerifyError: class org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
>>>>> >>                 at java.lang.ClassLoader.defineClass1(Native Method)
>>>>> >>                 at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
>>>>> >>                 at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
>>>>> >>                 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
>>>>> >>                 at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
>>>>> >>                 at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
>>>>> >>                 at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
>>>>> >>                 at java.security.AccessController.doPrivileged(Native Method)
>>>>> >>                 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>>>>> >>                 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>>>>> >>                 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>>>>> >>                 at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>>>>> >>                 at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:199)
>>>>> >>                 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>> >>                 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>> >>                 at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
>>>>> >>                 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> >>                 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>> >>                 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>> >>                 at java.lang.reflect.Method.invoke(Method.java:597)
>>>>> >>                 at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>> >>                 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>> >>                 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>> >>
>>>>> >> I'm using only Lucene 4.0!
>>>>> >>
>>>>> >>
>>>>> >> CLASSPATH=/opt/lucene-4.0.0-ALPHA/demo/lucene-demo-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/core/lucene-core-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/analysis/common/lucene-analyzers-common-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/queryparser/lucene-queryparser-4.0.0-ALPHA.jar:.
>>>>> >>
>>>>> >> Please, where am I going wrong?
>>>>> >>
>>>>> >>
>>>>> >> Thank you all
>>>>> >> Regards
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> Think green - keep it on the screen.
>>>>> >>
>>>>> >> This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.
>>>>> >>
>>>>> >>
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Lance Norskog
>>>>> goksron@gmail.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>>>
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>>
>>
>>
>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
>
>
>
>



-- 
Lance Norskog
goksron@gmail.com
