lucene-java-user mailing list archives

From Ruslan Sivak
Subject Re: Advice regarding fuzzy phrase searching
Date Tue, 11 Dec 2007 22:43:59 GMT
Look into SpanNearQuery.  It takes a slop parameter that lets you say how 
close together you want the terms to be.  If you are going to be running 
a lot of these searches against a single document, I recommend using a 
MemoryIndex.
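
To make the idea concrete, here is a minimal plain-Java sketch of a 
proximity search over the token positions of one document.  It is not 
Lucene code and not how SpanNearQuery is implemented: the class 
ProximitySketch, its methods, and the maxSpan parameter are all made up 
for illustration, and maxSpan (distance between the first and last 
matched token) is only a rough stand-in for Lucene's slop.  It also 
assumes the query terms are distinct.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a tiny positional index for ONE document (hypothetical, not Lucene).
final class ProximitySketch {

    // Maps each lower-cased token to the token positions at which it occurs.
    static Map<String, List<Integer>> indexPositions(String text) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue; // split() can yield a leading empty string
            positions.computeIfAbsent(token, k -> new ArrayList<>()).add(pos++);
        }
        return positions;
    }

    // Returns {start, end} token positions of each window that contains every
    // term, in any order, with end - start <= maxSpan. One window is reported
    // per leftmost position. Assumes the terms are distinct.
    static List<int[]> findNear(Map<String, List<Integer>> index, int maxSpan, String... terms) {
        List<int[]> hits = new ArrayList<>();
        List<List<Integer>> lists = new ArrayList<>();
        for (String term : terms) {
            List<Integer> p = index.get(term.toLowerCase(Locale.ROOT));
            if (p == null) return hits; // a term that never occurs rules out any match
            lists.add(p);
        }
        if (lists.isEmpty()) return hits;
        int[] cursor = new int[lists.size()];
        while (true) {
            int minList = 0, min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
            for (int i = 0; i < lists.size(); i++) {
                int p = lists.get(i).get(cursor[i]);
                if (p < min) { min = p; minList = i; }
                if (p > max) { max = p; }
            }
            if (max - min <= maxSpan) hits.add(new int[] {min, max});
            // Advance the list that holds the leftmost position; stop when it runs out.
            if (++cursor[minList] == lists.get(minList).size()) return hits;
        }
    }

    public static void main(String[] args) {
        String doc = "Arterial oxygen levels fell. Later the oxygen began to "
                   + "saturate arterial blood again.";
        Map<String, List<Integer>> index = indexPositions(doc);
        for (int[] hit : findNear(index, 4, "arterial", "oxygen", "saturate")) {
            System.out.println(hit[0] + ".." + hit[1]); // prints 6..10
        }
    }
}
```

The sketch reports token positions.  To edit the original document you 
would also need to record each token's character offsets while 
tokenizing.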


Jose Luna wrote:
> Hello,
> I am looking for some advice regarding which tools I might use to 
> solve my problem.  I apologize ahead of time for the long explanation.
> Problem Description:  I would like to index a set of very large HTML 
> documents.  I would then like to be able to run two different kinds of 
> queries: proximity queries and fuzzy phrase queries.  I would like 
> to get the exact positions of the matching results from the query (I 
> need to modify the original documents at these positions.)  I will 
> only need to search one document at a time, i.e., I already know which 
> document I'll be looking in, so what's important is finding the 
> positions of the hits within that document.
> For example,  for a fuzzy search, I may want to search for "arterial 
> oxygen saturation".   I would want this to match "arterial oxygen 
> saturate", and I would want to get the position of where it matches.  
> I would also like to do proximity searches, with these broken into 
> separate terms.  So, I may be searching for "arterial", "oxygen", and 
> "saturate" all within 10 terms of each other, and get the positions of 
> the cases that match.
> To the best of my understanding, Lucene is not a good choice to solve 
> this problem (please correct me if I'm wrong).   As far as I can tell, 
> Lucene breaks up a document into a set of terms, and indexes these in 
> some sort of structure.  My guess is a B+ tree, but I'm curious to 
> learn more about it -- I couldn't find much in the documentation about 
> the underlying index structure.   Anyway, this means that the 
> key->pointer pairs in the index are basically term->documentID pairs.  
> So this isn't very suitable for my problem.  I already know which 
> document I want to search; I'm interested in the positions of hits.  
> If I were to search for the phrase "arterial oxygen saturation", this 
> would be broken into terms and I could iterate through all of the 
> TermPositions for a given term in the document, and try to find out 
> where these terms are adjacent in the document.  Considering that my 
> document is very large, the phrases can be 10+ terms, and I need to do 
> this hundreds of times, this doesn't sound like a very good solution.  
> If we introduce the idea of fuzzy matches and proximity searches, it 
> seems like this task of iterating through TermPositions becomes very 
> complicated.
> I've spent time reading the docs, creating a test program, and reading 
> the mailing list.  As far as I can tell, Lucene is geared towards 
> document-based queries, and isn't the ideal tool for my problem.  I 
> think an index based on a suffix tree (or a variation of one) would better 
> meet my needs, but I'm not sure how well these perform with fuzzy and 
> proximity searches.  I've looked around, and I can't seem to find a 
> good open-source indexing framework like Lucene that's based on a 
> suffix tree.  Are there any suggestions for tools that would help with 
> this problem?  Does anyone have any suggestions on how I might bend 
> Lucene to meet my needs?
> Thanks in advance,
> JLuna
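
For the fuzzy phrase case in the quoted message ("arterial oxygen 
saturation" matching "arterial oxygen saturate"), one simple approach 
is to slide the phrase across the document's tokens and accept a window 
when every token is within some edit distance of its counterpart.  The 
plain-Java sketch below assumes that approach.  It is not how Lucene's 
fuzzy matching works, and the class FuzzyPhraseSketch, its methods, and 
the maxEdits threshold are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: brute-force fuzzy phrase matching over one document (not Lucene).
final class FuzzyPhraseSketch {

    // Splits text into lower-cased alphanumeric tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Levenshtein distance computed with two rolling rows of the DP table.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the starting token position of every window in which each document
    // token is within maxEdits edits of the phrase token at the same offset.
    static List<Integer> findFuzzyPhrase(List<String> doc, List<String> phrase, int maxEdits) {
        List<Integer> starts = new ArrayList<>();
        int n = phrase.size();
        if (n == 0) return starts;
        for (int start = 0; start + n <= doc.size(); start++) {
            boolean ok = true;
            for (int k = 0; k < n && ok; k++) {
                ok = editDistance(doc.get(start + k), phrase.get(k)) <= maxEdits;
            }
            if (ok) starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        List<String> doc = tokenize("Measure the arterial oxygen saturate level, "
                                  + "then the arterial oxygen saturation.");
        List<String> phrase = tokenize("arterial oxygen saturation");
        System.out.println(findFuzzyPhrase(doc, phrase, 3)); // prints [2, 8]
    }
}
```

This scans every window, so the cost grows with document length times 
phrase length.  For hundreds of queries against a very large document, a 
positional index that narrows the candidate windows first would likely 
be needed.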
