lucene-java-user mailing list archives

From Ype Kingma <ykin...@xs4all.nl>
Subject Re: indexing documents that arrive in pieces
Date Sun, 13 Oct 2002 12:44:29 GMT
On Sunday 13 October 2002 04:18, you wrote:
> What is the cleanest way in Lucene to add documents to
> an index, if the entire document is not readily
> available at one time?
>
> E.g., I want to index the text as well as the
> anchor-text of a stream of html pages, where the
> anchor-text terms get associated with the page _being
> pointed to_.  For a document d_i, I don't know all the
> terms that should be added to its "anchor" field,
> until I've seen all documents d_j that link to d_i.

Mr. Codd would normalize the anchor text as an attribute of the link
from the pointing page to the pointed-to page.
So a clean way is to store these links in a separate (Lucene) db,
putting the anchor text in a stored field so it can be retrieved when
needed.
Since Lucene is a free-format database, you can store these links
in any Lucene db. Which place is best depends on the circumstances
(i.e. which operation is most time critical).
It also depends on how you want to search your anchor fields: should
they be in the same Lucene document as the pointed-to page, or could
you just allow searching for anchors in a separate 'links' db?
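
A minimal sketch of the normalized-links idea, in plain Java rather than
against the Lucene API. The class AnchorLinks and its methods are
hypothetical names, and the in-memory list only stands in for the separate
'links' db; in Lucene each link would presumably be its own document with
the anchor text in a stored field. It shows links being recorded while
pages stream in, and the "anchor" text for a page being assembled only
after all pointing pages have been seen.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the separate 'links' db.
class AnchorLinks {

    // One normalized link: the anchor text is an attribute of the
    // (pointing page, pointed-to page) pair, not of either page.
    static final class Link {
        final String fromUrl;
        final String toUrl;
        final String anchorText;

        Link(String fromUrl, String toUrl, String anchorText) {
            this.fromUrl = fromUrl;
            this.toUrl = toUrl;
            this.anchorText = anchorText;
        }
    }

    private final List<Link> links = new ArrayList<>();

    // Called while streaming pages: record each outgoing link as it is seen.
    void addLink(String fromUrl, String toUrl, String anchorText) {
        links.add(new Link(fromUrl, toUrl, anchorText));
    }

    // Called once all pages have been seen: group anchor texts by the
    // page they point to, preserving the order in which links arrived.
    Map<String, List<String>> anchorsByTarget() {
        Map<String, List<String>> byTarget = new LinkedHashMap<>();
        for (Link link : links) {
            byTarget.computeIfAbsent(link.toUrl, k -> new ArrayList<>())
                    .add(link.anchorText);
        }
        return byTarget;
    }

    // Text for the "anchor" field of one pointed-to page; empty if
    // no recorded link points to it.
    String anchorFieldFor(String toUrl) {
        List<String> texts =
                anchorsByTarget().getOrDefault(toUrl, new ArrayList<>());
        return String.join(" ", texts);
    }

    public static void main(String[] args) {
        AnchorLinks db = new AnchorLinks();
        db.addLink("a.html", "c.html", "lucene faq");
        db.addLink("b.html", "c.html", "search engine docs");
        System.out.println(db.anchorFieldFor("c.html"));
    }
}
```

Whether you then copy the assembled text into the pointed-to page's own
document, or leave it searchable only in the links store, is the trade-off
described above.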

Have fun,
Ype


