lucene-java-user mailing list archives

From Hugo Lafayette <hugo.lafaye...@temis-group.com>
Subject Re: Bad behaviors of FrenchAnalyzer
Date Tue, 11 Oct 2005 14:52:31 GMT
Erik Hatcher wrote:

> Rather than changing StandardAnalyzer, you could create a custom  
> Analyzer that is something along the lines of StandardTokenizer  ->  
> custom apostrophe splitting filter -> ISOLatinFilter. 

Why not include that in the FrenchStemFilter "next()" method itself?
Would that be bad design?

I'm also quite concerned about performance, but it seems to me that
your solution will only affect tokens of the "APOSTROPHE" type, so the
overhead should be negligible, right?

> You get a special type for words with interior apostrophes from 
> StandardTokenizer (look at StandardFilter to see how that works). You
> could create a simple TokenFilter that splits apostrophe'd tokens 
> into two.

I'm not sure how to do that efficiently. Is it something like
this?

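<code>

private Stack subTokens = new Stack(); // sub-tokens waiting to be returned

public final Token next() throws IOException {
  // Drain pending sub-tokens before reading more input.
  if (!subTokens.empty()) {
    return (Token) subTokens.pop();
  }
  Token t = input.next();
  // equals() rather than ==, so this does not rely on string interning.
  if (t != null && APOSTROPHE_TYPE.equals(t.type())) {
    // Must push the parts in reverse order: pop() returns the last one pushed.
    tokenizeApostrophe(t, subTokens);
    // If the token was split, return its first part instead of the original.
    if (!subTokens.empty()) {
      t = (Token) subTokens.pop();
    }
  }
  return t;
}

</code>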

where "tokenizeApostrophe(Token, Stack)" conditionally splits the
token into two others and pushes them onto the stack.
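
As a rough sketch of the conditional split itself, here is a version that
works on plain strings rather than Lucene Tokens, so it stays
self-contained. The class name ApostropheSplitter and the list of elided
prefixes are assumptions for illustration only; they are not part of
Lucene, and the prefix list is certainly not exhaustive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: splits an elided French term such as "l'avion". */
final class ApostropheSplitter {

    // Assumed set of elided prefixes; adjust to the corpus being indexed.
    private static final Set<String> ELIDED = new HashSet<String>(
            Arrays.asList("l", "d", "j", "m", "n", "s", "t", "c", "qu"));

    private ApostropheSplitter() {}

    /**
     * Returns the parts of the term in reading order. The term is split at
     * its first apostrophe only when the text before it is a known elided
     * prefix; otherwise it comes back unchanged as a single-element list.
     */
    static List<String> split(String term) {
        List<String> parts = new ArrayList<String>();
        int pos = term.indexOf('\'');
        // Require text on both sides of the apostrophe.
        boolean interior = pos > 0 && pos < term.length() - 1;
        if (interior
                && ELIDED.contains(term.substring(0, pos).toLowerCase(Locale.ROOT))) {
            parts.add(term.substring(0, pos));
            parts.add(term.substring(pos + 1));
        } else {
            parts.add(term);
        }
        return parts;
    }
}
```

With this condition "l'avion" is split into "l" and "avion", while
"aujourd'hui" is left alone. Inside a TokenFilter, each part would still
need to be wrapped in a new Token with adjusted offsets, and pushed in
reverse order if a Stack is used.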

> Maybe it's simple enough also to expand "j" and "l" into "je" and
> "le" in the same step too?

It would be simple, but I'm not yet sure I want to expand them back.
Maybe it will be useful to index the "j" token after all.
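
For reference, the expansion could be a simple optional lookup, sketched
below on plain strings. ElisionExpander and its flag are hypothetical
names, and the table holds only the two mappings mentioned above; note
that "l" can also stand for "la", so a fixed mapping is lossy.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: optionally expands elided forms such as "j" to "je". */
final class ElisionExpander {

    // Only the two mappings discussed in this thread; an assumption, not a full list.
    private static final Map<String, String> EXPANSIONS = new HashMap<String, String>();
    static {
        EXPANSIONS.put("j", "je");
        EXPANSIONS.put("l", "le");
    }

    private ElisionExpander() {}

    /**
     * Returns the expanded form when expansion is enabled and the term is a
     * known elided form; otherwise returns the term unchanged, so the raw
     * "j" token can still be indexed as-is.
     */
    static String expand(String term, boolean enabled) {
        if (!enabled) {
            return term;
        }
        String expanded = EXPANSIONS.get(term);
        return expanded != null ? expanded : term;
    }
}
```

Keeping the switch in one place makes it easy to try both indexing
strategies without touching the splitting code.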

Anyway, thanks for your quick answer,

-- 
Hugo

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

