mahout-user mailing list archives

From Bogdan Vatkov <bogdan.vat...@gmail.com>
Subject Re: Stopwords not working as expected
Date Sun, 03 Jan 2010 14:08:20 GMT
Yesterday I had trouble mapping cluster results to dictionary entries: it
turned out I was using a different dictionary, so the resulting clusters
looked really strange.
Once I fixed all the commands, input/output files, etc., I got very good
results from a clustering point of view (the clusters are quite correct
given the input documents). Unfortunately, the clusters consist mostly of
words that I want to stop, words that I placed in stopwords.txt in Solr
(and then re-indexed, restarted Solr, etc.).

Where do you suggest I debug the vector creation? It seems Solr respects the
stopwords, but the vector creation (and therefore the clustering) does not.
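
To make the stop-then-stem ordering from the quoted reply concrete, here is a
minimal sketch. It is only an illustration under stated assumptions: the
stopword set is made up, toy_stem is a hypothetical stand-in (not the Snowball
stemmer), and the optional second stop pass is one possible workaround, not
something Solr or Mahout does by default.

```python
# Toy analysis chain: stop filter first, then a stemmer, in the same order
# as the index analyzer in the quoted config. Everything here is
# illustrative; none of it is Solr or Mahout code.

STOPWORDS = {"regard", "the"}  # made-up entries, not the real stopwords.txt


def toy_stem(token):
    """Strip a trailing 's' (hypothetical stand-in for a real stemmer)."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def analyze(text, stop_after_stem=False):
    # Lowercasing up front approximates ignoreCase="true" on the stop filter.
    tokens = text.lower().split()
    # The stop filter sees the unstemmed tokens.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Stemming runs afterwards and may produce a form that is a stopword.
    tokens = [toy_stem(t) for t in tokens]
    if stop_after_stem:
        # Optional second pass: drop stems that match a stopword.
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


print(analyze("Best regards the cluster"))
print(analyze("Best regards the cluster", stop_after_stem=True))
```

In this sketch "regards" is not in the stopword set, so it survives the stop
filter and is then stemmed to "regard", which is a stopword. Comparing the
terms that end up in the vectors against the stemmed forms of the stopword
list would be one way to check whether this is what is happening here.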

On Sun, Jan 3, 2010 at 4:02 PM, Grant Ingersoll <gsingers@apache.org> wrote:

>
> On Jan 3, 2010, at 8:58 AM, Bogdan Vatkov wrote:
>
> > I have a stopwords.txt file with 1200+ words. I did not understand the
> > point about stemming - do you mean my stopwords are somehow ignored
> > because of stemming?
>
> No, stopword removal happens before stemming so it is possible that a word
> that was not stopped was then stemmed to a stopword.
>
> I thought you said yesterday you got it straightened out.
>
> >
> > On Sun, Jan 3, 2010 at 3:53 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> >
> >> Are you sure you have stopwords and it is not the result of stemming
> >> some other word?
> >>
> >> On Jan 3, 2010, at 7:57 AM, Bogdan Vatkov wrote:
> >>
> >>> my Solr config is like the default one:
> >>>
> >>>  <field name="msg_body" type="text" termVectors="true" indexed="true"
> >>> stored="true"/>
> >>>
> >>>  <fieldType name="text" class="solr.TextField"
> >>>             positionIncrementGap="100">
> >>>     <analyzer type="index">
> >>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>       <filter class="solr.StopFilterFactory"
> >>>               ignoreCase="true"
> >>>               words="stopwords.txt"
> >>>               enablePositionIncrements="true"
> >>>               />
> >>>       <filter class="solr.WordDelimiterFilterFactory"
> >>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>>       <filter class="solr.LowerCaseFilterFactory"/>
> >>>       <filter class="solr.SnowballPorterFilterFactory"
> >>>               language="English"
> >>>               protected="protwords.txt"/>
> >>>     </analyzer>
> >>>     <analyzer type="query">
> >>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>> ignoreCase="true" expand="true"/>
> >>>       <filter class="solr.StopFilterFactory"
> >>>               ignoreCase="true"
> >>>               words="stopwords.txt"
> >>>               enablePositionIncrements="true"
> >>>               />
> >>>       <filter class="solr.WordDelimiterFilterFactory"
> >>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >>>       <filter class="solr.LowerCaseFilterFactory"/>
> >>>       <filter class="solr.SnowballPorterFilterFactory"
> >>>               language="English"
> >>>               protected="protwords.txt"/>
> >>>     </analyzer>
> >>>   </fieldType>
> >>
> >>
> >
> >
> > --
> > Best regards,
> > Bogdan
>
>


-- 
Best regards,
Bogdan
