lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <>
Subject Re: derive tokens from single token
Date Mon, 29 Sep 2003 15:46:58 GMT
On Monday, September 29, 2003, at 11:01  AM, Hackl, Rene wrote:
>> except that you'll be indexing a ton of
>> terms I'd guess.  If there is some other way to split these words by
>> separating by prefix ("hexa", "hepta") and suffix ("alene", "alin") it
>> would likely be better.  But maybe its not practical to do so.
> There'll be at least two indexes, one "normal" one and another bloated 
> one.
> Dan suggested splitting, too, but, unfortunately, if users search for 
> e.g.
> "9-Oxabicyclo[3.3.1]nona-2,6-diene"
> they don't want anything else than that substance, as opposed to
> "*-Oxabicyclo[3.3.1]nona*"
> where they'd be interested in substances from that family - whatever 
> the
> numbers are.

But consider the same type of thing like a phrase query.  If two 
documents are indexed with a field containing "a b c" and "x b y"

If searching for "b" is done, both documents are returned.  If 
searching for "a b" is done, then only the first document is returned.  
So I think with a domain aware analyzer, you might be able to split 
things up into separate terms and then on the querying side of things 
the same type of analysis would be done  Certainly not a trivial thing, 
and maybe not even the right approach, but it seems that intelligent 
analysis can make things a lot easier on users and performance for 
searching.  Maybe?  Just food for thought.

> If you're interested, once I've some hard performance results at hand, 
> I
> could post them around.

Definitely interested!


View raw message