lucene-java-user mailing list archives

From Mile Rosu
Subject Re: Searching for a phrase which spans on 2 pages
Date Wed, 12 Jul 2006 09:09:32 GMT
Hello Erick,

I have been trying some scenarios on Google Books and apparently found a 
Google bug ...
It looks like they use approach number 2, as this query illustrates.

The phrase returns 2 hits, but if you look at the documents, the phrase 
is visible only in the first one.

Anyway, it makes it possible to find something like:
The returned page is the first of the pages the phrase spans (but the 
highlighting is gone).

It seems we are really close to a good solution; we are now looking for 
a way to implement it in terms of index structure.
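
To make the bookkeeping for approach 2 concrete, here is a minimal sketch 
in plain Python rather than Lucene. Everything in it is an assumption made 
for illustration: the names build_page_docs and find_phrase, the overlap 
size, and the rule of keeping only matches that begin in a page's own 
words (which is one way to report a spanning phrase once, on the page 
where it starts). A real index would store the own_start/own_end 
boundaries alongside each page document and compare them with the match 
position.

```python
OVERLAP = 10  # words borrowed from each neighbouring page (assumed value)


def build_page_docs(pages, overlap=OVERLAP):
    """Build one 'document' per page, padded with words from its neighbours.

    pages is a list of pages, each page a list of words.
    """
    docs = []
    for i, words in enumerate(pages):
        # Last `overlap` words of the previous page, if there is one.
        prev_tail = []
        if i > 0:
            prev = pages[i - 1]
            prev_tail = prev[max(0, len(prev) - overlap):]
        # First `overlap` words of the next page, if there is one.
        next_head = pages[i + 1][:overlap] if i + 1 < len(pages) else []
        docs.append({
            "page": i + 1,
            "tokens": prev_tail + words + next_head,
            "own_start": len(prev_tail),              # first own word
            "own_end": len(prev_tail) + len(words),   # one past last own word
        })
    return docs


def find_phrase(docs, phrase):
    """Return (page, spans_next_page) for each match starting on that page."""
    hits = []
    n = len(phrase)
    if n == 0:
        return hits
    for doc in docs:
        tokens = doc["tokens"]
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] != phrase:
                continue
            # Skip matches that begin in the borrowed words: they either lie
            # entirely in a neighbour's text or are reported by that neighbour.
            if not (doc["own_start"] <= start < doc["own_end"]):
                continue
            hits.append((doc["page"], start + n > doc["own_end"]))
    return hits
```

With this rule a phrase that straddles pages n and n + 1 is returned once, 
for page n, with the second element set to True so the caller knows to 
show the following page as well.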

Thanks again,
Mile Rosu

Erick Erickson wrote:
> I can think of several approaches, but the experts will no doubt show
> me up <G>..
> 1> index the entire book as a single document. Also, index the
> beginning and ending offset of each page in separate "documents".
> Assuming you can find the offset in the big doc of each matching
> phrase, you can also find out what pages each match starts on and ends
> on, and if they are different you'd know to display two pages. Not sure
> what this does to relevancy.......
> 2> Index, say, the 10 words on the previous page and 10 words on the
> next page with the current page. You'd have to make sure your match
> wasn't entirely within the 10 words you prepended or appended to the
> "match" page (again by match position) when you returned data.
> 3> Have a series of "joiner" "documents". One for the 9 words of page n,
> and 9 words of page n + 1 (along with the page number). Another set for
> 8 before and 8 after. etc. down to 1. If your phrase was 10 words, you'd
> search your normal pages, and the 9 word "joiner" pages. Any match in
> the joiners would be a page spanner. Again, what does that do to
> relevancy?
> Note that there is no requirement that every document have the same
> fields, so your searches can be disjoint. Also, I'm assuming that you
> can reasonably decide that, say, 10 word phrases are the max you'll
> respect, which may not be true.
> I have no idea whether these are reasonable approaches given your problem
> domain....
> Best
> Erick
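
For approach 1 in the quoted message, the page lookup could be sketched 
roughly as follows. This is plain Python, not Lucene, and the names 
build_book, page_of and pages_for_phrase are hypothetical. It assumes the 
match position in the whole-book document is available as a word offset, 
and it keeps the page start offsets in a sorted list instead of in 
separate "documents".

```python
from bisect import bisect_right


def build_book(pages):
    """Concatenate all pages into one token list and record page starts.

    Returns (tokens, page_starts) where page_starts[i] is the word offset
    at which page i + 1 begins in the whole-book token list.
    """
    tokens = []
    page_starts = []
    for words in pages:
        page_starts.append(len(tokens))
        tokens.extend(words)
    return tokens, page_starts


def page_of(page_starts, pos):
    """Return the 1-based page number containing word offset pos."""
    # bisect_right counts how many pages start at or before pos.
    return bisect_right(page_starts, pos)


def pages_for_phrase(tokens, page_starts, phrase):
    """Return (first_page, last_page) for every occurrence of phrase."""
    n = len(phrase)
    hits = []
    if n == 0:
        return hits
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == phrase:
            first = page_of(page_starts, start)
            last = page_of(page_starts, start + n - 1)
            hits.append((first, last))
    return hits
```

When first_page and last_page differ, the match spans pages and both 
should be displayed.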
