lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Problem using stop words
Date Mon, 22 Aug 2011 16:37:46 GMT
Ahh, you're right. I was waaaay off base there....

So I guess the question is: how do you know the words aren't being removed? A common
mistake is to look at *stored* fields, which always return the original text, rather
than at what's actually in the inverted index.
The TermsComponent can help here:
http://wiki.apache.org/solr/TermsComponent
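
A minimal sketch of such a check, assuming a `/terms` request handler is registered in solrconfig.xml and Solr is reachable at a local URL. The base URL and the field name `text_es` are hypothetical placeholders, not values from this thread. It only builds the request URL; if the response lists the stop word as an indexed term, the filter did not remove it.

```python
from urllib.parse import urlencode


def build_terms_url(base_url, field, prefix, limit=10):
    """Build a TermsComponent request listing indexed terms that start with `prefix`."""
    params = {
        "terms": "true",         # enable the TermsComponent
        "terms.fl": field,       # field whose indexed terms are listed
        "terms.prefix": prefix,  # only return terms starting with this string
        "terms.limit": limit,    # maximum number of terms returned
        "wt": "json",            # response format
    }
    return f"{base_url}/terms?{urlencode(params)}"


# Hypothetical host and field name; substitute your own.
url = build_terms_url("http://localhost:8983/solr", "text_es", "de")
print(url)
```

Open the printed URL in a browser or fetch it with any HTTP client.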

Erick

On Mon, Aug 22, 2011 at 11:28 AM, Alexei Martchenko
<alexei@superdownloads.com.br> wrote:
> That very txt said "A Spanish stop word list. Comments begin with vertical
> bar. Each stop word is at the start of a line."
>
> Solr's comments are #s not pipes.
>
> Brazilian stopwords file is kinda raw...
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/br/stopwords.txt
>
> 2011/8/22 Alexei Martchenko <alexei@superdownloads.com.br>
>
>> Funny thing is that stopwords files in the examples shown in
>> http://wiki.apache.org/solr/LanguageAnalysis#Spanish are actually using
>> pipe and other terms. See the spanish one in
>> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/spanish_stop.txt
>>
>> I never saw this format before.
>>
>> Lucas, try to use only one word per line, with no pipes and no trailing spaces.
>> You can use all the Spanish accents too. Don't forget to save the file encoded as
>> UTF-8; you can do that in Eclipse, and even Word on Windows can open and save
>> text files in UTF-8.
>>
>>
>>
>> 2011/8/22 Erick Erickson <erickerickson@gmail.com>
>>
>>> What does the admin/analysis page show? And if you're really
>>> putting the pipe symbol (|) in your stopwords file, I have no clue what
>>> Solr will make of it. The stopwords file format is usually just one
>>> word per line.
>>>
>>> I'm assuming your name of "string" for the field type is just a placeholder,
>>> or you've replaced the example "string" fieldType, right?
>>>
>>>
>>> Best
>>> Erick
>>>
>>> On Mon, Aug 22, 2011 at 6:24 AM, Lucas Miguez <lucas.miguez@gmail.com>
>>> wrote:
>>> > Hi,
>>> >
>>> > I am trying to use Spanish stop words, but the stop words are not working:
>>> >
>>> > Part of the schema.xml file:
>>> >
>>> > <fieldtype name="string" class="solr.TextField"
>>> >            positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>> >   <analyzer type="index">
>>> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>> >     <filter class="solr.LowerCaseFilterFactory" />
>>> >     <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
>>> >     <filter class="solr.StopFilterFactory" words="spanish_stop.txt"
>>> >             enablePositionIncrements="true" ignoreCase="true" />
>>> >   </analyzer>
>>> >   <analyzer type="query">
>>> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>> >     <filter class="solr.LowerCaseFilterFactory" />
>>> >     <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
>>> >     <filter class="solr.StopFilterFactory" words="spanish_stop.txt"
>>> >             enablePositionIncrements="true" ignoreCase="true" />
>>> >   </analyzer>
>>> > </fieldtype>
>>> >
>>> > ___________________________________________________________________________
>>> >
>>> > A piece of the stopwords file:
>>> >
>>> > de             |  from, of
>>> > la             |  the, her
>>> > que            |  who, that
>>> > el             |  the
>>> > en             |  in
>>> > y              |  and
>>> > a              |  to
>>> > los            |  the, them
>>> > del            |  de + el
>>> > se             |  himself, from him etc
>>> > las            |  the, them
>>> > por            |  for, by, etc
>>> > un             |  a
>>> > para           |  for
>>> > con            |  with
>>> > no             |  no
>>> > una            |  a
>>> > su             |  his, her
>>> > al             |  a + el
>>> >  | es         from SER
>>> > lo             |  him
>>> >
>>> >
>>> > Any idea? Thanks!
>>> >
>>>
>>
>>
>>
>> --
>>
>> *Alexei Martchenko* | *CEO* | Superdownloads
>> alexei@superdownloads.com.br | alexei@martchenko.com.br | (11)
>> 5083.1018/5080.3535/5080.3533
>>
>>
>
>
> --
>
> *Alexei Martchenko* | *CEO* | Superdownloads
> alexei@superdownloads.com.br | alexei@martchenko.com.br | (11)
> 5083.1018/5080.3535/5080.3533
>
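
The two stopword file layouts discussed in this thread differ in how a line is split into words and comments. The sketch below is an illustration only, not Solr's own loader code: `parse_plain` and `parse_snowball` are hypothetical helpers that mimic the "one word per line, `#` comments" layout and the Snowball "`|` starts a comment" layout. It shows why a Snowball-style file read as a plain list produces entries that never match a token. Solr's StopFilterFactory documents a `format="snowball"` attribute for this layout; whether it is available depends on the Solr version in use.

```python
def parse_plain(text):
    """One entry per line; lines starting with '#' are treated as comments."""
    words = set()
    for line in text.splitlines():
        entry = line.strip()
        if entry and not entry.startswith("#"):
            words.add(entry)
    return words


def parse_snowball(text):
    """Everything after '|' is a comment; what remains is whitespace-separated words."""
    words = set()
    for line in text.splitlines():
        content = line.split("|", 1)[0]
        words.update(content.split())
    return words


# Three lines in the layout quoted in the original question.
SAMPLE = (
    "de             |  from, of\n"
    "la             |  the, her\n"
    "               | es         from SER\n"
)

plain = parse_plain(SAMPLE)
snowball = parse_snowball(SAMPLE)

print("de" in plain)     # False: the whole line, comment included, became one entry
print(sorted(snowball))  # ['de', 'la']
```

Read as a plain list, each line (comment included) becomes a single "word", so the token `de` is never matched and never removed.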
