lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Indexing and Hit Highlighting OCR Data
Date Fri, 03 Jun 2005 09:16:59 GMT

On Jun 2, 2005, at 9:02 PM, Chris Hostetter wrote:
> This is a pretty interesting problem.  I envy you.
>
> I would avoid the existing highlighter for your purposes --  
> highlighting
> in token space is a very differnet problem from "highlihgting" in 2D
> space.
>
> based on the XML sample you provided, it looks like your XML files
> are allready a "tokenized" form of the orriginal OCR data -- by  
> which i
> mean the page has allready been tokenized into words who position is
> recorded.
>
> I would parse these XML docs to generate two things:
>     1) a stream of words for analysis/filtering (ie: stop words,  
> stemming,
>        synonyms)
>     2) a datastructure mapping words to lists of positions (ie: if the
>        same word apears in multiple places, list the word once,  
> followed
>        by each set of coordinates)
>
> use #1 in the usual way, and add a serialized form of #2 to your  
> index as
> a Stored Keyword -- at query time, the words from your initial  
> query can
> be looked up in that data strucutre to find the regions to "highlight"

Chris - that is great recommendation.  I second it.  The only minor  
thing I'll add is that you probably should use an unindexed field for  
#2 rather than literally a Field.Keyword - no point in indexing it as  
you would never search on that data structure.

     Erik

>
>
>
> : I am involved in a project which is trying to provide searching  
> and hit highlighting on the scanned image of historical  
> newspapers.  We have an XML based OCR format.  A sample is below.   
> We need to index the CONTENT attribute of the String element which  
> is the easy part.  We would like to be able find the "hits" within  
> this XML document in order to use the positioning information to  
> draw the highlight boxes on the image.  It doesn't make a lot of  
> sense to just extract the CONTENT and index that because we loose  
> the positioning information.  My second thought was to make a  
> custom analyzer which dropped everything except for the content  
> element and then used the highlighting class in the sandbox to  
> reanalyze the XML document and mark the hits.  With the marked hits  
> in the XML we could find the position information and draw on the  
> image.  Has anyone else worked with OCR information and lucene.   
> What was your approach?  Does this approach seem sound?  Any  
> recommendations?
> :
> : Thanks, Corey
> :
> :      <TextLine HEIGHT="2307.0" WIDTH="2284.0" HPOS="1316.0"  
> VPOS="123644.0">
> :       <String STYLEREFS="ID4" HEIGHT="1922.0" WIDTH="244.0"  
> HPOS="1316.0" VPOS="123644.0" CONTENT="The" WC="1.0"/>
> :       <SP WIDTH="-244.0" HPOS="1560.0" VPOS="123644.0"/>
> :       <String STYLEREFS="ID4" HEIGHT="1914.0" WIDTH="424.0"  
> HPOS="1664.0" VPOS="123711.0" CONTENT="female" WC="1.0"/>
> :       <SP WIDTH="184.0" HPOS="1480.0" VPOS="123644.0"/>
> :       <String STYLEREFS="ID4" HEIGHT="2174.0" WIDTH="240.0"  
> HPOS="2192.0" VPOS="123711.0" CONTENT="lays" WC="1.0"/>
> :       <SP WIDTH="104.0" HPOS="2088.0" VPOS="123711.0"/>
> :       <String STYLEREFS="ID4" HEIGHT="1981.0" WIDTH="360.0"  
> HPOS="2528.0" VPOS="123711.0" CONTENT="about" WC="1.0"/>
> :       <SP WIDTH="236.0" HPOS="2292.0" VPOS="123711.0"/>
> :       <String STYLEREFS="ID4" HEIGHT="1855.0" WIDTH="216.0"  
> HPOS="3000.0" VPOS="123770.0" CONTENT="140" WC="1.0"/>
> :       <SP WIDTH="112.0" HPOS="2888.0" VPOS="123711.0"/>
> :       <String STYLEREFS="ID4" HEIGHT="1729.0" WIDTH="284.0"  
> HPOS="3316.0" VPOS="124223.0" CONTENT="eggs" WC="1.0"/>
> :       <SP WIDTH="100.0" HPOS="3216.0" VPOS="123770.0"/>
> :      </TextLine>
> :
> :
> :
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message