lucene-java-user mailing list archives

From "Larry Ogrodnek" <>
Subject RE: Lucene and SIPs
Date Thu, 22 Jun 2006 14:49:02 GMT
I didn't make too much progress, and kind of ended up dropping it.


One thing that I played with was creating multiple phrase indexes, one
each for 2, 3, 4, and 5 words.  I wrote a tokenizer that would batch up
the words, so, for the input string:


The quick brown fox jumps over the slow lazy dog.


The tokenizer for 3 words would return:


The quick brown
quick brown fox
brown fox jumps
fox jumps over
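
A minimal sketch of that kind of word-batching tokenizer, written in plain Python rather than as a Lucene analyzer. The function name `word_ngrams` and the regular expression used to split words are assumptions made for illustration, not the code described in this message. It emits every run of n consecutive words, so for the sentence above it continues past the four phrases listed:

```python
import re

def word_ngrams(text, n):
    """Return every run of n consecutive words in text, in order."""
    # Assumption: a "word" is a run of letters, digits, or apostrophes;
    # punctuation such as the trailing period is dropped.
    words = re.findall(r"[A-Za-z0-9']+", text)
    # A text with fewer than n words yields an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

trigrams = word_ngrams("The quick brown fox jumps over the slow lazy dog.", 3)
for phrase in trigrams:
    print(phrase)
```

Calling the same function with n set to 2, 4, and 5 would give the token streams for the other phrase indexes.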



This seemed like a reasonable start, but the problem is resolving the
overlap for display and figuring out which words are the most
important. For example, if the above sentence itself was pretty rare and
you're looking at the 3-word phrase index, each one of its sub-phrases
would end up being significant. Which one do you show? Or do you combine
them into a longer phrase? If so, where do you stop?
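
To make the "combine them" option concrete, here is a rough sketch of one possible merge, assuming each significant n-word phrase is identified by the index of its first word. The function `merge_overlapping` and its inputs are hypothetical, and nothing here decides which phrases are significant in the first place:

```python
def merge_overlapping(words, starts, n):
    """Merge n-word phrases whose word spans overlap into longer phrases.

    words  -- the words of the sentence, in order
    starts -- start indexes of the phrases judged significant (assumed given)
    n      -- number of words per phrase
    """
    spans = []  # list of [start, end) word ranges
    for s in sorted(starts):
        end = s + n
        if spans and s < spans[-1][1]:
            # Shares at least one word with the previous span: extend it.
            spans[-1][1] = max(spans[-1][1], end)
        else:
            spans.append([s, end])
    return [" ".join(words[a:b]) for a, b in spans]

words = "The quick brown fox jumps over the slow lazy dog".split()
merged = merge_overlapping(words, [0, 1, 2, 7], 3)
print(merged)
```

This also illustrates the "where do you stop?" problem: if every 3-word phrase in a rare sentence is significant, the merge returns the entire sentence as a single phrase.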


It seemed like an easy first approach to try out, but I'm not sure it's
even in the right direction...






From: Nader Akhnoukh
Sent: Wednesday, June 21, 2006 8:14 PM
To: Larry Ogrodnek
Subject: Lucene and SIPs


Hi Lawrence, I saw a posting to the Lucene group you made in February
concerning using Lucene to find SIPs.

Did you make any progress with this?  I'm able to find significant
single terms, but am stumped by phrases. 

