lucene-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Document proximity
Date Wed, 30 Mar 2005 13:02:53 GMT
DM Smith wrote:
> Hi,
> 
> I hope I am posting to the right list.

Yes.

> 
> We (sword and jsword at crosswire.org) are indexing bibles with each 
> verse becoming a document, with the verse text being indexed and the 
> verse reference being stored. This way we can search the text and get 
> which verses have hits.
> 
> The problem is that a verse is an artificial document boundary.

You could "smear" the document boundary by adding a number of tokens 
from adjacent verses, directly preceding or following a given verse. 
Perhaps even adding a full verse from each side.
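A minimal sketch of the smearing step, done before the text is handed to the indexer. It uses only plain JDK classes and no Lucene calls. The class name `VerseSmearing`, the method `smear` and the `overlap` parameter are hypothetical names chosen for illustration, and whitespace splitting stands in for whatever analyzer you really use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: builds the text to index for each verse.
class VerseSmearing {

    /**
     * For each verse, returns its own text preceded by the last `overlap`
     * tokens of the previous verse and followed by the first `overlap`
     * tokens of the next verse.
     */
    static List<String> smear(List<String> verses, int overlap) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < verses.size(); i++) {
            StringBuilder sb = new StringBuilder();
            if (i > 0) {
                // Borrow the tail of the preceding verse.
                String[] prev = tokens(verses.get(i - 1));
                int from = Math.max(0, prev.length - overlap);
                for (int j = from; j < prev.length; j++) {
                    sb.append(prev[j]).append(' ');
                }
            }
            sb.append(verses.get(i).trim());
            if (i + 1 < verses.size()) {
                // Borrow the head of the following verse.
                String[] next = tokens(verses.get(i + 1));
                int to = Math.min(overlap, next.length);
                for (int j = 0; j < to; j++) {
                    sb.append(' ').append(next[j]);
                }
            }
            out.add(sb.toString());
        }
        return out;
    }

    // Naive whitespace tokenizer (an assumption, not a Lucene analyzer).
    private static String[] tokens(String s) {
        String t = s.trim();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }
}
```

The stored verse reference stays that of the middle verse, so a hit still points at one verse even though its indexed text overlaps its neighbours. Passing the largest verse length as `overlap` gives the "full verse from each side" variant.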

If you wish, you could also artificially lower their score by adding 
gaps (token.setPositionIncrement()). Exact phrase matches would then not 
work across boundaries, so you would have to add a phrase query with 
some slop to your main query.
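To see why a gap breaks exact matches but not sloppy ones, here is a toy model of token positions in plain Java. It is not Lucene code: `PositionGapModel`, `positions` and `phraseMatches` are hypothetical names, and the slop test is a simplification that only covers two terms appearing in order. In Lucene terms, a gap of `gap` positions would roughly correspond to a position increment of `gap + 1` on the first token after the boundary.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of token positions; an illustration, not Lucene's implementation.
class PositionGapModel {

    /**
     * Assigns a position to every token of every segment, in order.
     * The first token of each segment after the first is pushed `gap`
     * extra positions ahead, which models a verse boundary.
     */
    static List<Integer> positions(List<String[]> segments, int gap) {
        List<Integer> pos = new ArrayList<>();
        int next = 0;
        for (int s = 0; s < segments.size(); s++) {
            if (s > 0) {
                next += gap; // leave a hole at the boundary
            }
            for (int t = 0; t < segments.get(s).length; t++) {
                pos.add(next++);
            }
        }
        return pos;
    }

    /**
     * Simplified two-term, in-order phrase test: slop 0 requires the
     * second term to sit directly after the first, and each unit of
     * slop tolerates one more intervening position.
     */
    static boolean phraseMatches(int posFirst, int posSecond, int slop) {
        int distance = posSecond - posFirst - 1;
        return distance >= 0 && distance <= slop;
    }
}
```

With a gap of 3, a phrase whose two words straddle the boundary fails at slop 0 and matches at slop 3, so the slop on the added phrase query needs to be at least as large as the gap.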

> 
> Frequently, verses cut a paragraph into parts, a poem into stanzas, ... 
> and the significant parts are across verses. (But we usually don't have 
> these in our markup)
> 
> Is there any thought of adding a NEAR operator that will work across 
> documents?
>
> Specifically, find x NEAR y, where the distance given to near is not
> understood as words but documents.
>

I assume that you also add fields for books and chapters. While the 
chapter boundary is sometimes disputed, the book boundaries are pretty 
accurate ;-). You could create an equivalent of the "near" operator by 
limiting your search within a single book (by adding a required clause), 
and then from the list of hits (which should be pretty small in that 
case) you could programmatically select verses that match your proximity 
criteria.
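The post-filtering step could look roughly like the sketch below. It assumes that each verse is stored with a running ordinal within its book (a hypothetical stored field, not something Lucene provides), and that you have already run one query per term, restricted to the same book, and collected the ordinals of the hits. `VerseProximity` and `near` are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-filter applied to hit lists; no Lucene calls involved.
class VerseProximity {

    /**
     * Returns every pair {x, y} where x is a verse ordinal from hitsX,
     * y is a verse ordinal from hitsY, and the two verses are at most
     * maxDistance verses apart. Both lists are assumed to come from
     * the same book.
     */
    static List<int[]> near(List<Integer> hitsX, List<Integer> hitsY, int maxDistance) {
        List<int[]> pairs = new ArrayList<>();
        // The hit lists are expected to be small, so a nested loop is enough.
        for (int x : hitsX) {
            for (int y : hitsY) {
                if (Math.abs(x - y) <= maxDistance) {
                    pairs.add(new int[] {x, y});
                }
            }
        }
        return pairs;
    }
}
```

A `maxDistance` of 0 means both terms must occur in the same verse, 1 allows adjacent verses, and so on.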

> It would also be good to have the ability to have search automatically 
> consider that adjacent documents are flowing unless some token in the 
> document interrupts the flow. In this case, search would return a 
> compound document as a hit.

Lucene doesn't have a notion of compound documents; it's up to the 
application to build them. However, it's easy to retrieve documents that 
precede or follow a given document. It's also easy to retrieve documents 
that contain a given term (similar to a primary key), let's say "John 
1:12". You could also add a field to flag a given document as the "end 
of chapter", or "end of book".
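As an illustration of how the application might assemble such a compound hit, the sketch below widens a single hit into a passage without crossing a chapter end. It assumes the verses of a book are available in order and that the "end of chapter" flag has been read into a boolean array. `PassageExpansion`, `expand` and `window` are hypothetical names, and nothing here calls Lucene.

```java
// Hypothetical application-side helper for building a compound hit.
class PassageExpansion {

    /**
     * Returns {first, last}, the inclusive index range of the passage
     * around `hit`. The range grows by at most `window` verses on each
     * side and never crosses a chapter boundary.
     *
     * endOfChapter[i] is true when verse i is the last verse of its chapter.
     */
    static int[] expand(boolean[] endOfChapter, int hit, int window) {
        int first = hit;
        // Step backwards while the previous verse is in the same chapter.
        while (first > 0 && hit - first < window && !endOfChapter[first - 1]) {
            first--;
        }
        int last = hit;
        // Step forwards until the current verse closes the chapter.
        while (last + 1 < endOfChapter.length && last - hit < window && !endOfChapter[last]) {
            last++;
        }
        return new int[] {first, last};
    }
}
```

The verses in the returned range can then be fetched by their reference term and shown together as one result. An "end of book" flag could be handled the same way.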

I would be more than happy to help you find a good solution - I'm a 
born-again Christian, and I use the Sword application from time to time...

-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



