There is a proposal to extend indexing (item #11 in the API Changes
section):
http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard
An excerpt:
11. (Hard) Make indexing more flexible, so that one could
e.g., not store positions or even frequencies, or alternately,
to store extra information with each position, or to even use
different posting compression algorithms.
I'm pretty sure that an implementation of this proposal would allow you
to store the positioning information with each position/token.
Doug Cutting posted recently about this (at the bottom of the message):
http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200505.mbox/%3c4291FDB7.30500@apache.org%3e
Steve
Corey Keith wrote:
> I am involved in a project which is trying to provide searching and
> hit highlighting on the scanned image of historical newspapers. We
> have an XML based OCR format. A sample is below. We need to index the
> CONTENT attribute of the String element which is the easy part. We
> would like to be able find the "hits" within this XML document in
> order to use the positioning information to draw the highlight boxes
> on the image. It doesn't make a lot of sense to just extract the
> CONTENT and index that because we loose the positioning information.
> My second thought was to make a custom analyzer which dropped
> everything except for the content element and then used the
> highlighting class in the sandbox to reanalyze the XML document and
> mark the hits. With the marked hits in the XML we could find the
> position information and draw on the image. Has anyone else worked
> with OCR information and lucene. What was your approach? Does this
> approach seem sound? Any recommendations?
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
|