lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Leo Galambos <>
Subject Re: derive tokens from single token
Date Mon, 29 Sep 2003 16:48:24 GMT
what about bi/tri-grams + some sort of hit filtering? It will do the 
job. I just saw some ineffective implementation of 1-grams for CJK on 
lucene-dev@. It could be a good starting point for full n-gram 
support... Just a thought.


Erik Hatcher wrote:

> On Monday, September 29, 2003, at 11:01  AM, Hackl, Rene wrote:
>>> except that you'll be indexing a ton of
>>> terms I'd guess.  If there is some other way to split these words by
>>> separating by prefix ("hexa", "hepta") and suffix ("alene", "alin") it
>>> would likely be better.  But maybe its not practical to do so.
>> There'll be at least two indexes, one "normal" one and another 
>> bloated one.
>> Dan suggested splitting, too, but, unfortunately, if users search for 
>> e.g.
>> "9-Oxabicyclo[3.3.1]nona-2,6-diene"
>> they don't want anything else than that substance, as opposed to
>> "*-Oxabicyclo[3.3.1]nona*"
>> where they'd be interested in substances from that family - whatever the
>> numbers are.
> But consider the same type of thing like a phrase query.  If two 
> documents are indexed with a field containing "a b c" and "x b y"
> If searching for "b" is done, both documents are returned.  If 
> searching for "a b" is done, then only the first document is 
> returned.  So I think with a domain aware analyzer, you might be able 
> to split things up into separate terms and then on the querying side 
> of things the same type of analysis would be done  Certainly not a 
> trivial thing, and maybe not even the right approach, but it seems 
> that intelligent analysis can make things a lot easier on users and 
> performance for searching.  Maybe?  Just food for thought.
>> If you're interested, once I've some hard performance results at hand, I
>> could post them around.
> Definitely interested!
>     Erik
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

View raw message