lucenenet-user mailing list archives

From Simon Svensson <si...@devhost.se>
Subject Re: Spanish analyzer in ravendb
Date Thu, 14 Jun 2012 17:44:37 GMT
It's easy to write analyzers; you basically chain together a few 
TokenFilters and call it a day. To back up that statement, here is 
an example Spanish analyzer, written by someone who basically threw his 
complete Spanish vocabulary into the stop word list. DictionaryLoader 
(not included in this message) is a class that loads your hunspell 
dictionaries (.aff and .dic files) from your storage (filesystem, 
embedded resources, etc.). There is some further development that could 
be done, like overriding ReusableTokenStream and verifying that the 
filters are in the correct order.
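
DictionaryLoader is not shown in this message. A minimal sketch of such a helper might look like the following. It assumes the dictionaries are stored as <name>.aff and <name>.dic in a local directory (the directory name "dictionaries" is hypothetical) and that HunspellDictionary has a constructor taking an affix stream and a dictionary stream and reads both eagerly. Check both assumptions against the version of the library you use.

```csharp
using System.IO;
using Lucene.Net.Analysis.Hunspell;

// Hypothetical helper; the original DictionaryLoader is not part of this message.
public static class DictionaryLoader {
    // Assumed location of the .aff/.dic files, relative to the working directory.
    private const string BaseDirectory = "dictionaries";

    // Loads <name>.aff and <name>.dic from BaseDirectory, e.g. Load("es_ES").
    public static HunspellDictionary Load(string name) {
        var affixPath = Path.Combine(BaseDirectory, name + ".aff");
        var dictionaryPath = Path.Combine(BaseDirectory, name + ".dic");

        // Assumption: the constructor parses both streams before returning,
        // so it is safe to dispose them when the using blocks end.
        using (var affix = File.OpenRead(affixPath))
        using (var dictionary = File.OpenRead(dictionaryPath)) {
            return new HunspellDictionary(affix, dictionary);
        }
    }
}
```

Loading from embedded resources would follow the same shape, with the two streams obtained from the assembly instead of the filesystem.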

using System;
using System.Collections;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Hunspell;
using Lucene.Net.Analysis.Standard;
using Version = Lucene.Net.Util.Version;

public class SpanishHunspellAnalyzer : Analyzer {
     private static readonly HunspellDictionary Dictionary = DictionaryLoader.Load(@"es_ES");
     private static readonly Hashtable Stopwords = new Hashtable {
         { "Me", null }, { "no", null }, { "habla", null }, { "español", null }
     };

     public override TokenStream TokenStream(String fieldName, TextReader reader) {
         // Split the text into word tokens.
         var stream = new StandardTokenizer(Version.LUCENE_29, reader);

         // Lower-case, stem with the hunspell dictionary, then drop stop words.
         TokenFilter filter = new LowerCaseFilter(stream);
         filter = new HunspellStemFilter(filter, Dictionary);
         filter = new StopFilter(true, filter, Stopwords, true);
         return filter;
     }
}
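
The ReusableTokenStream override mentioned above could be sketched roughly as follows. This is an illustration under assumptions, not the author's code: the class name is hypothetical, the dictionary and stop words are passed in through the constructor to keep the sketch self-contained, and GetPreviousTokenStream/SetPreviousTokenStream are used as found in Lucene.Net 2.9.x (later versions expose a PreviousTokenStream property instead).

```csharp
using System;
using System.Collections;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Hunspell;
using Lucene.Net.Analysis.Standard;
using Version = Lucene.Net.Util.Version;

// Hypothetical name; a sketch of the same filter chain with stream reuse.
public class ReusableSpanishHunspellAnalyzer : Analyzer {
    private readonly HunspellDictionary _dictionary;
    private readonly Hashtable _stopwords;

    // The tokenizer (which gets the new reader) and the end of the filter chain.
    private sealed class SavedStreams {
        public Tokenizer Source;
        public TokenStream Result;
    }

    public ReusableSpanishHunspellAnalyzer(HunspellDictionary dictionary, Hashtable stopwords) {
        _dictionary = dictionary;
        _stopwords = stopwords;
    }

    public override TokenStream TokenStream(String fieldName, TextReader reader) {
        return CreateStreams(reader).Result;
    }

    public override TokenStream ReusableTokenStream(String fieldName, TextReader reader) {
        // Assumption: Lucene.Net 2.9.x API, which stores the previous stream per thread.
        var streams = (SavedStreams) GetPreviousTokenStream();
        if (streams == null) {
            // First call on this thread: build the chain once and remember it.
            streams = CreateStreams(reader);
            SetPreviousTokenStream(streams);
        } else {
            // Later calls: point the existing tokenizer at the new reader.
            streams.Source.Reset(reader);
        }
        return streams.Result;
    }

    private SavedStreams CreateStreams(TextReader reader) {
        var source = new StandardTokenizer(Version.LUCENE_29, reader);
        TokenStream result = new LowerCaseFilter(source);
        result = new HunspellStemFilter(result, _dictionary);
        result = new StopFilter(true, result, _stopwords, true);
        return new SavedStreams { Source = source, Result = result };
    }
}
```

The filter order is unchanged from the analyzer above, so the stop word list is still matched against stemmed tokens. Whether that is the order you want is the other point left to verify.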

// Simon

On 2012-06-14 18:44, vicente garcia wrote:
> Thank you, Simon. You can specify a
> "Raven.Database.Indexing.Collation.Cultures.EsCollationAnalyzer,
> Raven.Database", but you can't perform full text search queries because
> this index doesn't tokenize the content.
> http://ravendb.net/docs/client-api/querying/static-indexes/customizing-results-order
>
> I saw that there is no SpanishAnalyzer; we only have a
> SpanishStemmer, but I don't need a stemmer, I need a Spanish analyzer
> with its stop words, etc.
>
> Does anyone have another idea on how to index Spanish content?
>
> Thank you very much :)
>
> On Thu, Jun 14, 2012 at 4:59 PM, Simon Svensson <sisve@devhost.se> wrote:
>> Welcome,
>>
>> See Configuring index options[1] to specify a custom analyzer that can
>> handle Spanish content.
>>
>> A quick check shows that Contrib.Analyzers does not contain a Spanish
>> analyzer. There is a SpanishStemmer available in the Snowball contrib. You
>> could also use a Spanish hunspell dictionary for stemming[2].
>>
>> // Simon
>>
>> [1]
>> http://ravendb.net/docs/client-api/querying/static-indexes/configuring-index-options
>> [2] https://github.com/sisve/Lucene.Net.Analysis.Hunspell
>>
>>
>> On 2012-06-14 16:49, vicente garcia wrote:
>>> Hi all, this is my first mail to this list :)
>>>
>>> I'd like to index Spanish content in RavenDB. I have been searching a
>>> lot, but I don't know how I can do it.
>>>
>>> Could someone help me, please?
>>>
>>> Thanks :)
>>>
>
>
