lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: Use of tika for parsing, offsets questions
Date Thu, 03 Sep 2009 13:51:28 GMT

On Sep 2, 2009, at 5:40 AM, David Causse wrote:

> Hi,
>
> If I use tika for parsing HTML code and inject parsed String to a  
> lucene
> analyzer. What about the offset information for KWIC and return to  
> text
> (like the google cache view)? how can I keep track of the offsets
> between tika parser and lucene analyzer?
>
> What are the solutions/ideas to do a sort of google cache view with
> tika and lucene analyzer API?
>
> With the provided API I can't keep the original content as a cache, I
> need to cache the tika output and result in degraded cache view. I
> didn't look too closely at tika but there is maybe a way with SAX
> Locators? Build an associative array of tika parsed string offsets vs
> actual offsets and use a sort of token filter to rectify
> OffsetAttribute?
>

Hmm, maybe you could implement the ContentHandler for Tika that  
instead of creating a string for the Document, creates a TokenStream.   
Then, you can have it add the offsets as payloads so that you then  
have those offsets later when rendering your view.


> -- 
> David Causse
> Spotter
> http://www.spotter.com/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message