lucene-java-user mailing list archives

From "Corey Keith" <>
Subject Indexing and Hit Highlighting OCR Data
Date Thu, 02 Jun 2005 18:39:22 GMT
I am involved in a project that is trying to provide searching and hit highlighting on the
scanned images of historical newspapers.  We have an XML-based OCR format; a sample is below.
We need to index the CONTENT attribute of the String element, which is the easy part.  We
would also like to be able to find the "hits" within this XML document so that we can use the
positioning information to draw highlight boxes on the image.  It doesn't make much sense to
just extract the CONTENT and index that, because we lose the positioning information.  My second
thought was to write a custom analyzer that drops everything except the CONTENT attribute,
and then use the highlighting class in the sandbox to reanalyze the XML document and mark
the hits.  With the hits marked in the XML we could find the position information and draw
on the image.  Has anyone else worked with OCR information and Lucene?  What was your approach?
Does this approach seem sound?  Any recommendations?
Thanks, Corey
     <TextLine HEIGHT="2307.0" WIDTH="2284.0" HPOS="1316.0" VPOS="123644.0">
      <String STYLEREFS="ID4" HEIGHT="1922.0" WIDTH="244.0" HPOS="1316.0" VPOS="123644.0" CONTENT="The" WC="1.0"/>
      <SP WIDTH="-244.0" HPOS="1560.0" VPOS="123644.0"/>
      <String STYLEREFS="ID4" HEIGHT="1914.0" WIDTH="424.0" HPOS="1664.0" VPOS="123711.0" CONTENT="female" WC="1.0"/>
      <SP WIDTH="184.0" HPOS="1480.0" VPOS="123644.0"/>
      <String STYLEREFS="ID4" HEIGHT="2174.0" WIDTH="240.0" HPOS="2192.0" VPOS="123711.0" CONTENT="lays" WC="1.0"/>
      <SP WIDTH="104.0" HPOS="2088.0" VPOS="123711.0"/>
      <String STYLEREFS="ID4" HEIGHT="1981.0" WIDTH="360.0" HPOS="2528.0" VPOS="123711.0" CONTENT="about" WC="1.0"/>
      <SP WIDTH="236.0" HPOS="2292.0" VPOS="123711.0"/>
      <String STYLEREFS="ID4" HEIGHT="1855.0" WIDTH="216.0" HPOS="3000.0" VPOS="123770.0" CONTENT="140" WC="1.0"/>
      <SP WIDTH="112.0" HPOS="2888.0" VPOS="123711.0"/>
      <String STYLEREFS="ID4" HEIGHT="1729.0" WIDTH="284.0" HPOS="3316.0" VPOS="124223.0" CONTENT="eggs" WC="1.0"/>
      <SP WIDTH="100.0" HPOS="3216.0" VPOS="123770.0"/>
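
To make the last step concrete (going from matched words back to their position information), here is a minimal sketch in Python. It is not Lucene code and not necessarily how the highlighter would be wired in; it only shows that once the hit terms are known, the boxes can be read straight from the String elements. The function name boxes_for_terms, the case-insensitive matching, and the closing </TextLine> added to a shortened copy of the sample are all assumptions made for illustration. The coordinates are passed through unchanged, since their units are not stated above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample above, wrapped in a closed TextLine so it parses.
SAMPLE = """
<TextLine HEIGHT="2307.0" WIDTH="2284.0" HPOS="1316.0" VPOS="123644.0">
  <String STYLEREFS="ID4" HEIGHT="1922.0" WIDTH="244.0" HPOS="1316.0" VPOS="123644.0" CONTENT="The" WC="1.0"/>
  <SP WIDTH="-244.0" HPOS="1560.0" VPOS="123644.0"/>
  <String STYLEREFS="ID4" HEIGHT="1914.0" WIDTH="424.0" HPOS="1664.0" VPOS="123711.0" CONTENT="female" WC="1.0"/>
  <SP WIDTH="184.0" HPOS="1480.0" VPOS="123644.0"/>
  <String STYLEREFS="ID4" HEIGHT="1729.0" WIDTH="284.0" HPOS="3316.0" VPOS="124223.0" CONTENT="eggs" WC="1.0"/>
</TextLine>
"""


def boxes_for_terms(xml_text, terms):
    """Return (word, hpos, vpos, width, height) for each String whose CONTENT matches a term.

    Hypothetical helper: matching is case-insensitive and exact on the whole word,
    which is an assumption, not something the OCR format prescribes.
    """
    wanted = {t.lower() for t in terms}
    root = ET.fromstring(xml_text)
    boxes = []
    # iter() walks the tree in document order, so boxes come back in reading order.
    for el in root.iter("String"):
        word = el.get("CONTENT", "")
        if word.lower() in wanted:
            boxes.append((
                word,
                float(el.get("HPOS")),
                float(el.get("VPOS")),
                float(el.get("WIDTH")),
                float(el.get("HEIGHT")),
            ))
    return boxes


if __name__ == "__main__":
    for box in boxes_for_terms(SAMPLE, ["eggs", "female"]):
        print(box)
```

In this sketch the list of hit terms would come from the search side (for example, the terms of the query that matched), and each returned tuple is enough to draw one highlight rectangle on the page image.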
