lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: Document proximity
Date Wed, 30 Mar 2005 18:25:39 GMT
We already have a solution, and it is external to Lucene. We look for
hits on the terms that are supposed to be near each other, get each
hit's "canonical" reference, and then compare the distances between
them. While this works well, I was hoping for a solution within Lucene.

This does not give us the ability to look for phrases across verse boundaries.
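
A minimal sketch of that external comparison, assuming each canonical
reference has already been mapped to an integer ordinal (the verse's
position in canonical order). The class and method names are
hypothetical and are not taken from JSword or Lucene:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: pairs up hits for two terms that fall within
// a given number of verses of each other.
final class VerseProximity {
    /**
     * hitsX and hitsY hold verse ordinals for the two search terms.
     * Returns every {x, y} pair whose ordinals differ by at most maxDistance.
     */
    static List<int[]> near(int[] hitsX, int[] hitsY, int maxDistance) {
        List<int[]> pairs = new ArrayList<int[]>();
        for (int x : hitsX) {
            for (int y : hitsY) {
                if (Math.abs(x - y) <= maxDistance) {
                    pairs.add(new int[] {x, y});
                }
            }
        }
        return pairs;
    }
}
```

The nested loop is quadratic in the number of hits, which is acceptable
only while the hit lists stay small.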

As to storing book or chapter in the index, we don't do that; we store
just the whole reference. This is worth looking into, as it would help
with range-restricted searches. Today, we do the restriction after the
search.
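
For illustration, restricting after the search amounts to something
like the sketch below. It assumes hits are represented as verse
ordinals and that the range is given as an inclusive pair of ordinals;
the names are made up for this example:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-search filter: keeps only the hits that fall
// inside an inclusive range of verse ordinals.
final class RangeRestriction {
    static List<Integer> restrict(List<Integer> hitOrdinals, int first, int last) {
        List<Integer> kept = new ArrayList<Integer>();
        for (int ordinal : hitOrdinals) {
            if (ordinal >= first && ordinal <= last) {
                kept.add(ordinal);
            }
        }
        return kept;
    }
}
```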


On Wed, 30 Mar 2005 15:02:53 +0200, Andrzej Bialecki <ab@getopt.org> wrote:
> DM Smith wrote:
> > Hi,
> >
> > I hope I am posting to the right list.
> 
> Yes.
> 
> >
> > We (sword and jsword at crosswire.org) are indexing bibles with each
> > verse becoming a document, with the verse text being indexed and the
> > verse reference being stored. This way we can search the text and get
> > which verses have hits.
> >
> > The problem is that a verse is an artificial document boundary.
> 
> You could "smear" the document boundary by adding a number of tokens
> from adjacent verses, directly preceding or following a given verse.
> Perhaps even adding a full verse from each side.
> 
> If you wish, you could also artificially lower their score by adding
> gaps (token.setPositionIncrement()), but then exact matches would not
> work across boundaries; in that case you would have to add a phrase
> query with a slop to your main query.
> 
> >
> > Frequently, verses cut a paragraph into parts, a poem into stanzas, ...
> > and the significant parts are across verses. (But we usually don't have
> > these in our markup)
> >
> > Is there any thought of adding a NEAR operator that will work across
> > documents?
> >
> > Specifically, find x NEAR y, where the distance given to near is not
> > understood as words but documents.
> >
> 
> I assume that you also add fields for books and chapters. While the
> chapter boundary is sometimes disputed, the book boundaries are pretty
> accurate ;-). You could create an equivalent of the "near" operator by
> limiting your search within a single book (by adding a required clause),
> and then from the list of hits (which should be pretty small in that
> case) you could programmatically select verses that match your proximity
> criteria.
> 
> > It would also be good to have the ability to have search automatically
> > consider that adjacent documents are flowing unless some token in the
> > document interrupts the flow. In this case, search would return a
> > compound document as a hit.
> 
> Lucene doesn't have a notion of compound documents, it's up to the
> application to do that. However, it's easy to retrieve documents that
> precede or follow a given document. It's also easy to retrieve documents
> that contain a given term (similar to a primary key), let's say "John
> 1:12". You could also add a field to flag a given document as the "end
> of chapter", or "end of book".
> 
> I would be more than happy to help you find a good solution - I'm a
> born-again Christian, and I use the Sword application from time to time...
> 
> --
> Best regards,
> Andrzej Bialecki
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
> 
>
