lucene-java-user mailing list archives

From David Causse
Subject Use of tika for parsing, offsets questions
Date Wed, 02 Sep 2009 12:40:06 GMT

If I use Tika to parse HTML and feed the extracted String to a Lucene
analyzer, what happens to the offset information needed for KWIC display
and for mapping hits back to the original text (like the Google cache
view)? How can I keep track of the offsets between the Tika parser and
the Lucene analyzer?

What are the solutions or ideas for building a Google-cache-style view
with Tika and the Lucene analyzer API?

With the provided API I can't keep the original content as the cache; I
have to cache the Tika output instead, which results in a degraded cache
view. I didn't look too closely at Tika, but maybe there is a way with
SAX Locators? One idea: build an associative array mapping offsets in
the Tika-parsed string to offsets in the original document, and use some
kind of token filter to correct the token offsets.
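
To make the associative-array idea concrete, here is a minimal sketch in
plain Java. It uses neither Tika nor Lucene: OffsetMap, addRun, toOriginal
and stripTags are hypothetical names invented for this illustration, and
the naive tag stripper only stands in for a real parser that would report
where each run of text starts in the source. The map stores one entry per
contiguous run of extracted text, and an offset is translated by finding
the run that contains it.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of Tika or Lucene): maps offsets in extracted text back to the original. */
final class OffsetMap {
    // Key: offset where a run of text starts in the extracted string.
    // Value: offset of that same character in the original document.
    private final TreeMap<Integer, Integer> runs = new TreeMap<>();

    /** Records that the run starting at extractedStart begins at originalStart in the source. */
    void addRun(int extractedStart, int originalStart) {
        runs.put(extractedStart, originalStart);
    }

    /** Translates an offset in the extracted text to the matching offset in the original. */
    int toOriginal(int extractedOffset) {
        Map.Entry<Integer, Integer> run = runs.floorEntry(extractedOffset);
        if (run == null) {
            throw new IllegalArgumentException("offset before first run: " + extractedOffset);
        }
        return run.getValue() + (extractedOffset - run.getKey());
    }

    /** Naive tag stripper standing in for the real parser; it fills the map as it goes. */
    static String stripTags(String html, OffsetMap map) {
        StringBuilder text = new StringBuilder();
        boolean inTag = false;
        boolean runOpen = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') {
                    inTag = false;
                }
            } else if (c == '<') {
                inTag = true;
                runOpen = false; // the next text character starts a new run
            } else {
                if (!runOpen) {
                    map.addRun(text.length(), i);
                    runOpen = true;
                }
                text.append(c);
            }
        }
        return text.toString();
    }
}
```

A token filter could then rewrite each token's start offset with
toOriginal(start). For an exclusive end offset it is probably safer to use
toOriginal(end - 1) + 1, so that a token ending exactly where markup begins
does not pick up the start of the following run.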

David Causse
