lucene-general mailing list archives

From Steven Rowe <sar...@syr.edu>
Subject Re: Investigating Lucene for Applicability to [Unusual?] Use Case
Date Wed, 13 Jun 2007 19:19:04 GMT
Hi Brad,

Brad Harper wrote:
> The use case involves so-called print streams. Imagine 20,000 statements
> concatenated into one large file suitable for delivery to a print system.
> The document formats vary, but include AFP (an IBM printer format), PCL (an
> HP format), Postscript, PDF, and even "plain-text".
> 
> The indexing application must track the total page count of the embedded
> statements. On a hit, the search application must extract and return the
> [possibly multi-page] statement embedded within the larger print-stream
> file.
> 
> How would the search application know (be informed by the Lucene/indexer)
> the extent of the internal document(s)?

You'll get faster/better responses to questions like this if you direct
them to the java-user list.

One solution is to use a Lucene stored field (call it "source")
containing the name of the print stream file (stored, I assume,
externally to the indexer), along with the document's extent within that
file, maybe in a format like "filename:beg:end".  Of course, you could
also use three separate fields, one for each piece of information.

Then when the search app gets a hit, the "source" field can be retrieved
and consulted for the information you want.
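As a rough sketch of the retrieval side, the helper below decodes a "filename:beg:end" value and reads that slice out of the print stream file. The class name StatementExtent is made up for illustration, and it assumes that beg and end are byte offsets with end exclusive; nothing above fixes those details. It uses only the JDK, so the Lucene lookup itself is left out.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical helper: decodes a stored "source" value of the form "filename:beg:end". */
final class StatementExtent {
    final String fileName;
    final long beg; // assumed: byte offset of the statement's first byte
    final long end; // assumed: byte offset one past the statement's last byte

    StatementExtent(String fileName, long beg, long end) {
        if (beg < 0 || end < beg) {
            throw new IllegalArgumentException("bad extent " + beg + ".." + end);
        }
        this.fileName = fileName;
        this.beg = beg;
        this.end = end;
    }

    /** Splits from the right, so a file name that itself contains ':' still parses. */
    static StatementExtent parse(String source) {
        int endSep = source.lastIndexOf(':');
        int begSep = source.lastIndexOf(':', endSep - 1);
        if (begSep < 1) {
            throw new IllegalArgumentException("expected filename:beg:end, got " + source);
        }
        return new StatementExtent(
                source.substring(0, begSep),
                Long.parseLong(source.substring(begSep + 1, endSep)),
                Long.parseLong(source.substring(endSep + 1)));
    }

    /** Reads the bytes [beg, end) of the embedded statement from the print stream file. */
    byte[] read() throws IOException {
        byte[] buf = new byte[Math.toIntExact(end - beg)];
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            raf.seek(beg);
            raf.readFully(buf); // throws EOFException if the file is shorter than end
        }
        return buf;
    }
}
```

In the search app, the string passed to parse() would be whatever the hit's stored "source" field returns, for example from Document.get("source"). With three separate fields instead, the constructor could be called directly and the parsing step skipped.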

Steve

-- 
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
