lucene-java-user mailing list archives

From Hugo Lafayette
Subject Re: Bad behaviors of FrenchAnalyzer
Date Tue, 11 Oct 2005 14:52:31 GMT
Erik Hatcher wrote:

> Rather than changing StandardAnalyzer, you could create a custom  
> Analyzer that is something along the lines of StandardTokenizer  ->  
> custom apostrophe splitting filter -> ISOLatinFilter. 

Why not include that in the FrenchStemFilter "next()" method itself?
Would that be bad design?

I'm also quite concerned about performance, but it seems to me that
your solution would only affect tokens of type "APOSTROPHE", so the
overhead would be nonexistent, right?

> You get a special type for words with interior apostrophes from 
> StandardTokenizer (look at StandardFilter to see how that works). You
> could create a simple TokenFilter that splits apostrophe'd tokens 
> into two.

I'm not sure how to do that efficiently. Is it something like
this?


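private Stack subTokens; // previously initialized

public final Token next() throws IOException {
  Token t = null;
  if (subTokens != null && !subTokens.empty()) {
    // a sub-token from an earlier split is pending: emit it first
    t = (Token) subTokens.pop();
  } else {
    t = input.next(); // "input" is the wrapped TokenStream of the TokenFilter
    if (t != null) {
      String type = t.type();
      if (APOSTROPHE_TYPE.equals(type)) {
        // push the pieces; they are returned by later calls to next()
        tokenizeApostrophe(t, subTokens);
      }
    }
  }
  return t; // note: the unsplit token itself is still returned here
}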


with "tokenizeApostrophe(Token, Stack)" splitting the token, under some
conditions, into two others and pushing them onto the stack.
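
To make the splitting step concrete, here is a rough, Lucene-independent
sketch. Plain strings stand in for Token objects, and the names
ApostropheSplit and split are made up for illustration; this is one possible
rule (split at the first interior apostrophe), not necessarily the one a real
filter should use.

```java
import java.util.ArrayList;
import java.util.List;

final class ApostropheSplit {
    // Splits a term at its first apostrophe, e.g. "l'avion" -> ["l", "avion"].
    // A term with no apostrophe, or with one only at the very start or end,
    // is returned unchanged as a single-element list.
    static List<String> split(String term) {
        List<String> parts = new ArrayList<>();
        int pos = term.indexOf('\'');
        if (pos <= 0 || pos == term.length() - 1) {
            parts.add(term);
            return parts;
        }
        parts.add(term.substring(0, pos));   // text before the apostrophe
        parts.add(term.substring(pos + 1));  // text after the apostrophe
        return parts;
    }
}
```

With a rule this naive, "aujourd'hui" would also be cut in two, which is why
the split probably has to depend on some condition, for instance on the part
before the apostrophe being a known elided word.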

> Maybe it's simple enough also to expand "j" and "l" into "je" and
> "le" in the same step too?

It would be simple, but I'm not sure yet that I want to expand them.
Maybe it will be useful to index the "j" token after all.
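
If the expansion is kept optional, it could be a small lookup applied to the
part before the apostrophe. The sketch below is only an assumption about how
that might look: the class ElisionExpander, the method expand and the boolean
switch are hypothetical, and the table holds only the two pairs mentioned in
this thread.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class ElisionExpander {
    // Only the two pairs discussed in the thread; a real table would need more.
    private static final Map<String, String> EXPANSIONS = new HashMap<>();
    static {
        EXPANSIONS.put("j", "je");
        EXPANSIONS.put("l", "le");
    }

    // Returns the expanded form when expansion is enabled and the prefix is
    // known; otherwise returns the prefix as it was given.
    static String expand(String prefix, boolean enabled) {
        if (!enabled) {
            return prefix;
        }
        String full = EXPANSIONS.get(prefix.toLowerCase(Locale.ROOT));
        return full != null ? full : prefix;
    }
}
```

Passing false keeps the bare "j" token for indexing, so the choice can be
made later without touching the splitting code.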

Anyway, thanks for your quick answer,

