lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Corey Keith" <>
Subject Re: Indexing and Hit Highlighting OCR Data
Date Fri, 03 Jun 2005 12:50:26 GMT
With this approach all work is done at the word level.  When we have a phrase query the results
will contain pages with the entire phrase but when we go to highlight the document _all_ words
in the phrase regardless of being in the phrase will be highlighted.  Is that correct?  It
would also be difficult to get the best fragment in a similar way to the current highlighter?

>>> 06/02/05 9:02 PM >>>

This is a pretty interesting problem.  I envy you.

I would avoid the existing highlighter for your purposes -- highlighting
in token space is a very differnet problem from "highlihgting" in 2D

based on the XML sample you provided, it looks like your XML files
are allready a "tokenized" form of the orriginal OCR data -- by which i
mean the page has allready been tokenized into words who position is

I would parse these XML docs to generate two things:
    1) a stream of words for analysis/filtering (ie: stop words, stemming,
    2) a datastructure mapping words to lists of positions (ie: if the
       same word apears in multiple places, list the word once, followed
       by each set of coordinates)

use #1 in the usual way, and add a serialized form of #2 to your index as
a Stored Keyword -- at query time, the words from your initial query can
be looked up in that data strucutre to find the regions to "highlight"

: I am involved in a project which is trying to provide searching and hit highlighting on
the scanned image of historical newspapers.  We have an XML based OCR format.  A sample is
below.  We need to index the CONTENT attribute of the String element which is the easy part.
 We would like to be able find the "hits" within this XML document in order to use the positioning
information to draw the highlight boxes on the image.  It doesn't make a lot of sense to just
extract the CONTENT and index that because we loose the positioning information.  My second
thought was to make a custom analyzer which dropped everything except for the content element
and then used the highlighting class in the sandbox to reanalyze the XML document and mark
the hits.  With the marked hits in the XML we could find the position information and draw
on the image.  Has anyone else worked with OCR information and lucene.  What was your approach?
 Does this approach seem sound?  Any recommendations?
: Thanks, Corey
:      <TextLine HEIGHT="2307.0" WIDTH="2284.0" HPOS="1316.0" VPOS="123644.0">
:       <String STYLEREFS="ID4" HEIGHT="1922.0" WIDTH="244.0" HPOS="1316.0" VPOS="123644.0"
CONTENT="The" WC="1.0"/>
:       <SP WIDTH="-244.0" HPOS="1560.0" VPOS="123644.0"/>
:       <String STYLEREFS="ID4" HEIGHT="1914.0" WIDTH="424.0" HPOS="1664.0" VPOS="123711.0"
CONTENT="female" WC="1.0"/>
:       <SP WIDTH="184.0" HPOS="1480.0" VPOS="123644.0"/>
:       <String STYLEREFS="ID4" HEIGHT="2174.0" WIDTH="240.0" HPOS="2192.0" VPOS="123711.0"
CONTENT="lays" WC="1.0"/>
:       <SP WIDTH="104.0" HPOS="2088.0" VPOS="123711.0"/>
:       <String STYLEREFS="ID4" HEIGHT="1981.0" WIDTH="360.0" HPOS="2528.0" VPOS="123711.0"
CONTENT="about" WC="1.0"/>
:       <SP WIDTH="236.0" HPOS="2292.0" VPOS="123711.0"/>
:       <String STYLEREFS="ID4" HEIGHT="1855.0" WIDTH="216.0" HPOS="3000.0" VPOS="123770.0"
CONTENT="140" WC="1.0"/>
:       <SP WIDTH="112.0" HPOS="2888.0" VPOS="123711.0"/>
:       <String STYLEREFS="ID4" HEIGHT="1729.0" WIDTH="284.0" HPOS="3316.0" VPOS="124223.0"
CONTENT="eggs" WC="1.0"/>
:       <SP WIDTH="100.0" HPOS="3216.0" VPOS="123770.0"/>
:      </TextLine>


To unsubscribe, e-mail:
For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message