mahout-user mailing list archives

From Bogdan Vatkov <bogdan.vat...@gmail.com>
Subject Re: Stopwords work for Solr but not for Mahout
Date Sat, 02 Jan 2010 16:34:53 GMT
I re-indexed, but I cannot find a way to use the VectorDumper with a
dictionary. I am using Mahout v0.2 rather than the latest trunk code, since
the latter was not compiling and I had to fall back to older code.
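
Until the VectorDumper route works, one rough way to check whether stop words made it into the Mahout output is to compare stopwords.txt against the dictionary file written by --dictOut. The sketch below is only an illustration: the function names are hypothetical, and it assumes the dictionary is a delimited text file with the term in the first column, which should be confirmed by looking at the actual file.

```python
from pathlib import Path


def load_stopwords(path):
    """Read a Solr-style stopwords.txt: one word per line, '#' starts a comment line."""
    words = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        word = line.strip()
        if word and not word.startswith("#"):
            words.add(word.lower())
    return words


def load_dictionary_terms(path, delimiter="\t"):
    """Read terms from the dictionary file.

    ASSUMPTION: each data line is delimited text with the term in the first
    column. Any header or count line simply becomes one more candidate term.
    """
    terms = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            terms.add(line.split(delimiter)[0].strip().lower())
    return terms


def leaked_stopwords(stopwords_path, dict_path):
    """Return the stop words that also appear as dictionary terms, sorted."""
    return sorted(load_stopwords(stopwords_path) & load_dictionary_terms(dict_path))
```

Calling leaked_stopwords("stopwords.txt", "dict") with the real paths returns the stop words that survived into the dictionary. It is an exact-match comparison, so it can miss stop words that were altered by later filters such as the stemmer.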

On Sat, Jan 2, 2010 at 5:54 PM, Grant Ingersoll <gsingers@apache.org> wrote:

> I assume you re-indexed and you used the VectorDumper (along with the
> dictionary) to dump out the Vectors that were converted and verified no stop
> words?
>
> On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:
>
> > this is my Solr config:
> >
> >   <field name="msg_body" type="text" termVectors="true" indexed="true" stored="true"/>
> >
> > and the type text is as configured by default:
> >
> >    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >      <analyzer type="index">
> >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >        <!-- in this example, we will only use synonyms at query time
> >        <filter class="solr.SynonymFilterFactory"
> >                synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
> >        -->
> >        <!-- Case insensitive stop word removal.
> >          add enablePositionIncrements=true in both the index and query
> >          analyzers to leave a 'gap' for more accurate phrase queries.
> >        -->
> >        <filter class="solr.StopFilterFactory"
> >                ignoreCase="true"
> >                words="stopwords.txt"
> >                enablePositionIncrements="true"
> >                />
> >        <filter class="solr.WordDelimiterFilterFactory"
> >                generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >        <filter class="solr.SnowballPorterFilterFactory"
> >                language="English" protected="protwords.txt"/>
> >      </analyzer>
> >      <analyzer type="query">
> >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >        <filter class="solr.SynonymFilterFactory"
> >                synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >        <filter class="solr.StopFilterFactory"
> >                ignoreCase="true"
> >                words="stopwords.txt"
> >                enablePositionIncrements="true"
> >                />
> >        <filter class="solr.WordDelimiterFilterFactory"
> >                generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >        <filter class="solr.SnowballPorterFilterFactory"
> >                language="English" protected="protwords.txt"/>
> >      </analyzer>
> >    </fieldType>
> >
> > and I have entered quite some stopwords in the stopwords.txt file
> >
> > my SolrToMahout.sh file:
> >
> > #!/bin/bash
> > set -x
> > cd /store/dev/inst/mahout-0.2
> > java -classpath \
> >   /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/ /:/g') \
> >   org.apache.mahout.utils.vectors.lucene.Driver \
> >   --dir /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
> >   --output /store/dev/inst/mahout-0.2/clustering-example/solr/output \
> >   --field msg_body \
> >   --dictOut /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
> >
> > Best regards,
> > Bogdan
> >
> > On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> >
> >> What do the relevant pieces of your Solr setup look like and how are you
> >> invoking the Lucene driver?
> >>
> >> -Grant
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
Bogdan Vatkov
email: bogdan.vatkov@gmail.com
