lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From eks dev <eks...@yahoo.co.uk>
Subject Re: Investigating Lucene's Applicability to [Unusual?] Use Case
Date Wed, 13 Jun 2007 20:12:38 GMT
sounds easy (I said sounds :),
e.g. 
your Statement becomes Document in Lucene lingo, you make it with 3-4 Lucene fields, 
CONTENT (Tokenized, not stored)
OFFSET(not indexed, stored) - offset in file of the first byte of your statement
DOC_LENGTH(not indexed, stored) - if you have no END-OF-Statement to identify end of it
PAGE_COUNT(not indexed, stored) - you have to know it before add(Document), but that is your
domain. Lucene has no idea  what your "pages" are

you can search on CONTENT and than fetch OFFSET/LENGTH for hits, so you can easily identify
your statement

as an optimization, you could pack offset/length/page count to one binary field... but that
would be to early :)

have fun and be surprised how fast this can be 
e.

----- Original Message ----
From: Brad Harper <brad.harper@epsiia.com>
To: java-user@lucene.apache.org
Sent: Wednesday, 13 June, 2007 9:39:02 PM
Subject: Investigating Lucene's Applicability to [Unusual?] Use Case


Hello: 

[This is not an intentional cross-posting. I posted this same question to
the 'general' lucene list; replies there suggested that I'd have
better/quicker responses using this list instead.]

I'm investigating Lucene as a replacement for a special-purpose search
technology that was developed long before Lucene (or any of the current IR
libraries) became available. 

The use case involves so-called print streams. Imagine 20,000 statements
concatenated into one large file suitable for delivery to a print system.
The document formats vary, but include AFP (an IBM printer format), PCL (an
HP format), Postscript, PDF, and even "plain-text". 

The indexing application must track the total page count of the embedded
statements. On a hit, the search application must extract and return the
[possibly multi-page] statement embedded within the larger print-stream
file. 

How would the search application know (be informed by the Lucene/indexer)
the extent of the internal document(s)? 

I'm not seeing this scenario discussed in forums or books. Does anyone have
comments or thoughts on Lucene's applicability as a solution? 

Thanks. 

Brad
-- 
View this message in context: http://www.nabble.com/Investigating-Lucene%27s-Applicability-to--Unusual---Use-Case-tf3917320.html#a11107323
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






      ___________________________________________________________
Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
now.
http://uk.answers.yahoo.com/ 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message