lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Muir <rcm...@gmail.com>
Subject Re: LowerCaseFilter fails one letter (I) of Turkish alphabet
Date Mon, 30 Nov 2009 20:14:46 GMT
On Mon, Nov 30, 2009 at 2:53 PM, Shai Erera <serera@gmail.com> wrote:

> Robert, what if I need to do additional filtering after CollationKeyFilter,
> like stopwords removal, abbreviations handling, stemming etc? Will that be
> possible if I use CollationKeyFilter?
>
>
Shai, great point. This won't work with Collation, because it makes binary
keys. This is why I think unicode case folding is still useful. I have a
patch to do this located in LUCENE-1488 (and it looks now like i should add
back support for alternate mapping)

Because currently, it looks to me like theres no real way to do turkish
stemming in lucene. The snowball turkish stemmer expects "properly"
lowercased terms, and won't modify uppercase terms.

  public void testNounGenitive() throws IOException {
    // AĞACIN in turkish lowercases to ağacın, but with lowercase filter
ağacin.
    // this happens to work, even though its lowercased wrong before
stemming.
    // this is because the stemmer also recognizes and removes -in
    assertAnalyzesTo(analyzer, "ağacın", new String[] { "ağaç" });
    assertAnalyzesTo(analyzer, "AĞACIN", new String[] { "ağaç" });
    assertAnalyzesTo(analyzer, "ağacin", new String[] { "ağaç" });
  }

  public void testNounAccusative() throws IOException {
    // AĞACI in turkish lowercases to ağacı, but with lowercase filter
ağaci.
    // this fails due to wrong casing, because the stemmer
    // will only remove -ı, not -i
    assertAnalyzesTo(analyzer, "ağacı", new String[] { "ağaç" });
    assertAnalyzesTo(analyzer, "AĞACI", new String[] { "ağaci" });
  }


-- 
Robert Muir
rcmuir@gmail.com

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message