mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bogdan Vatkov <bogdan.vat...@gmail.com>
Subject Re: Stopwords work for Solr but not for Mahout
Date Sat, 02 Jan 2010 14:03:57 GMT
this is my Solr config:

   <field name="msg_body" type="text" termVectors="true" indexed="true"
stored="true"/>

and the type text is as configured by default:

    <fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
          add enablePositionIncrements=true in both the index and query
          analyzers to leave a 'gap' for more accurate phrase queries.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
protected="protwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
protected="protwords.txt"/>
      </analyzer>
    </fieldType>

and I have entered quite some stopwords in the stopwords.txt file

my SolrToMahout.sh file:

#!/bin/bash
set -x
cd /store/dev/inst/mahout-0.2
java -classpath
/store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo
/store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/ /:/g')
org.apache.mahout.utils.vectors.lucene.Driver --dir
/store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
   --output /store/dev/inst/mahout-0.2/clustering-example/solr/output
--field msg_body --dictOut
/store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict

Best regards,
Bogdan

On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <gsingers@apache.org> wrote:

> What do the relevant pieces of your Solr setup look like and how are you
> invoking the Lucene driver?
>
> -Grant

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message