lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Itamar Syn-Hershko" <ita...@divrei-tora.com>
Subject RE: Contrib Highlighter and Phrase search
Date Wed, 19 Mar 2008 22:52:29 GMT
I'm building this for my application which uses (will at least) query
inflation - no stemming, just basic tokenizing.

I'm using Radix and letter based lookup (which is how Radix works) since I
want to execute the highlighting on large documents too, possibly with a lot
of terms (since I'm inflating the query). Does this make sense?

Itamar.

-----Original Message-----
From: Daniel Noll [mailto:daniel@nuix.com] 
Sent: Thursday, March 20, 2008 12:44 AM
To: java-user@lucene.apache.org
Subject: Re: Contrib Highlighter and Phrase search

On Wednesday 19 March 2008 18:28:15 Itamar Syn-Hershko wrote:
> 1. Build a Radix tree (PATRICIA) and populate it with all search terms.
> Phrase queries will be considered as one big string, regardless their 
> spaces.
>
> 2. Iterate through your text ignoring spaces and punctuation marks, 
> and for each word start a Radix lookup by letter, e.g. for the word 
> John you will initialize a "tavel" upon the tree with j, then o, h, 
> and so on, so if "John Kennedy" was your term it will not ignore the 
> space (I'd say you'd want to keep only one instance of it and ignore 
> the rest). On a dead end (no relevant branch on the tree) just skip to 
> the next space or punctuation mark and restart the process.
>
> 3. If an iteration has touched base - completed a lookup successfully, 
> the MyHighlighter object will boldify or surround the text with a 
> span. It is then possible to give the span different IDs based on the 
> term (each term a different ID) so each term will be highlighted with a
different color.
>
> This allows for fast and exact highlighting of large texts as well as 
> smaller ones. I would love to hear any comments on the above.

I guess this would work, on the assumption that no analysis was performed on
the text.  Otherwise stemming would break it instantly.

But why not perform the same logic but with terms instead of letters?  Or
would that be what the current highlighter is already doing?

Daniel

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message