lucene-java-user mailing list archives

From: David Causse
Subject: Re: Use of tika for parsing, offsets questions
Date: Fri, 04 Sep 2009 08:29:32 GMT

On Thu, Sep 03, 2009 at 03:07:18PM +0200, Jukka Zitting wrote:
> Hi,
> On Wed, Sep 2, 2009 at 2:40 PM, David Causse wrote:
> > If I use tika for parsing HTML code and inject parsed String to a lucene
> > analyzer. What about the offset information for KWIC and return to text
> > (like the google cache view)? how can I keep track of the offsets
> > between tika parser and lucene analyzer?
> Currently Tika doesn't expose that information but the Tika Parser API
> was designed for such use in mind, so it will be possible to add the
> offset information. Please file a Tika feature request [1] for this.

I created TIKA-272. The idea behind it is to be able to use unmodified
Lucene analyzers with Tika while keeping the offsets correct, so that
token offsets can still be mapped back to the original document.
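
To illustrate the kind of mapping this would need, here is a minimal
sketch in plain Java. It does not use any Tika or Lucene API: the class
OffsetMappedText and its methods are hypothetical names invented for this
example, and the tag stripping is deliberately naive (a real HTML parser
is needed in practice). It only shows the idea of recording, for each
extracted character, its position in the original input, so that offsets
reported by an analyzer on the extracted text can be translated back.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of Tika or Lucene. */
class OffsetMappedText {
    /** Text with markup removed; this is what an analyzer would see. */
    final String text;
    /** sourceOffsets[i] is the index in the original input of text.charAt(i). */
    private final int[] sourceOffsets;
    private final int sourceLength;

    private OffsetMappedText(String text, int[] sourceOffsets, int sourceLength) {
        this.text = text;
        this.sourceOffsets = sourceOffsets;
        this.sourceLength = sourceLength;
    }

    /** Naive stripper: drops everything between '<' and '>' and records offsets. */
    static OffsetMappedText strip(String html) {
        StringBuilder out = new StringBuilder();
        int[] map = new int[html.length()];
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                map[out.length()] = i; // remember where this character came from
                out.append(c);
            }
        }
        return new OffsetMappedText(
                out.toString(), Arrays.copyOf(map, out.length()), html.length());
    }

    /** Maps a start offset in the extracted text to the original input. */
    int sourceStart(int offset) {
        return offset < sourceOffsets.length ? sourceOffsets[offset] : sourceLength;
    }

    /** Maps an exclusive end offset: one past the source index of the last character. */
    int sourceEnd(int endOffset) {
        return endOffset == 0 ? 0 : sourceOffsets[endOffset - 1] + 1;
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>Tika</b></p>";
        OffsetMappedText mapped = strip(html);
        // Stand-in for the start/end offsets an analyzer would report for a token.
        int start = mapped.text.indexOf("Tika");
        int end = start + "Tika".length();
        // Highlight the token in the original markup using the mapped offsets.
        System.out.println(
                html.substring(mapped.sourceStart(start), mapped.sourceEnd(end)));
    }
}
```

With such a mapping, the analyzer itself needs no changes: it works on
the extracted text, and only the code that builds the KWIC snippet or the
cached view translates the offsets afterwards.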

Thank you.

David Causse
