lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gerhard Schwarz <gerhard.schw...@fpg.de>
Subject Re: czech support
Date Thu, 05 Sep 2002 15:39:59 GMT
Leos Literak wrote:
> But I need some enhancements, that are related to
> my language.
[...]
> but we have 7 shapes for each of singular
> and plural:
> 
> pes,psa,psovi,psu,pse,psovi,psem
> psi,psu,psum,psy,psi,psech,psy
> 
> I'd like to be able to search for all of this
> variants in first shape: pes
> 
> eg. whenever I encounter "psa" index it as "pes"

How are those suffixes related to each other? It makes most sense to
reduce every plural to it's singular form first. After that you can
strip unnecessary suffixes.

> 2\ another issue is with diacritics.
> 
> for example lávka (la'vka)
> 
> some people use it, some write words without it.
> 
> so I'd like to be able to look up both
> lavka and lávka. easy way is to index words without
> diacritics, because it is common denominator.

If you eleminate the diacritics for indexing and searching
you have the advantage that everyone can search even if he
is not familiar with czech diacritics. Or maybe is not able
to use diacritics (wrong codepage, font, whatever).
 
> so my question is: what kind of interfaces/classes
> shall I implement/overwrite? i have no idea of
> relations between classes in Lucene and their
> purpose. so it would take me lot of time to find that.

You should first look at the classes Analyzer and Filter. You need
an Analyzer that pipes the content of a document trough a suitable
Tokenizer (StandardTokenizer should fit for czech) and then trough
the needed filters. One of those filters is yours that makes the
needed grammatically changes.


Greets, Gerhard

--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message