lucene-java-user mailing list archives

From "Uwe Schindler"
Subject RE: Use of tika for parsing, offsets questions
Date Thu, 03 Sep 2009 13:26:59 GMT
Another good solution for Lucene (from 2.9 on) would be to create a
special Tika analyzer that can be used to directly add Tika-parseable
content and metadata to the TokenStream as Attributes (using the new API), or
only text and offset data (using the old Lucene TokenStream API).
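
To make the "text and offset data" variant concrete, here is a minimal sketch in plain Java. It is not Tika or Lucene code: the class OffsetTrackingStripper, its OffsetToken type and the naive tag stripper are all hypothetical stand-ins for a real parser. The point is only that the extraction step remembers, for every character it keeps, where that character sat in the original markup, so that token offsets (the kind of values Lucene's OffsetAttribute carries) can point back into the source.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: strip markup but remember where each kept character came from. */
class OffsetTrackingStripper {

    /** A term whose start/end offsets point into the ORIGINAL markup, not the extracted text. */
    static final class OffsetToken {
        final String term;
        final int start; // inclusive offset in the original markup
        final int end;   // exclusive offset in the original markup

        OffsetToken(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    static List<OffsetToken> tokenize(String markup) {
        StringBuilder text = new StringBuilder();
        // sourceOffset[i] = position in markup of the i-th extracted character
        int[] sourceOffset = new int[markup.length()];
        boolean inTag = false;
        for (int i = 0; i < markup.length(); i++) {
            char c = markup.charAt(i);
            if (c == '<') {
                inTag = true;                 // naive: everything up to '>' is a tag
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                sourceOffset[text.length()] = i;
                text.append(c);
            }
        }

        // Whitespace tokenization over the extracted text, offsets mapped back to the markup.
        List<OffsetToken> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            int begin = pos;
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            if (pos > begin) {
                tokens.add(new OffsetToken(text.substring(begin, pos),
                        sourceOffset[begin], sourceOffset[pos - 1] + 1));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (OffsetToken t : tokenize("<p>Hello <b>world</b></p>")) {
            System.out.println(t.term + " [" + t.start + "," + t.end + ")");
        }
    }
}
```

The stripper ignores entities, comments and scripts, so it is only good enough to show the bookkeeping; a real parser would have to report the same mapping itself.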

I wrote something similar for XML files that added the current XML element
path as an additional Token Attribute. It also set the SAX parser's current
position as the offset. This attribute could then be used later to derive
additional indexing settings (in our case, the field name to index into).
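
A rough reconstruction of that idea, using only the JDK's SAX API, might look like the sketch below. It is not the original code: XmlPathTokens and PathToken are hypothetical names, the tokens are plain objects rather than Lucene attributes, and the JDK's Locator reports line and column numbers rather than a character offset, so that is what gets recorded here.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: whitespace tokens tagged with the XML element path they occurred in. */
class XmlPathTokens extends DefaultHandler {

    /** One term plus the metadata a custom token attribute could carry. */
    static final class PathToken {
        final String term;
        final String path; // e.g. "/doc/title"; could later select the field to index into
        final int line;    // parser position when the text run was first reported, or -1
        final int column;

        PathToken(String term, String path, int line, int column) {
            this.term = term;
            this.path = path;
            this.line = line;
            this.column = column;
        }
    }

    private final Deque<String> openElements = new ArrayDeque<>();
    private final List<PathToken> tokens = new ArrayList<>();
    private final StringBuilder pending = new StringBuilder();
    private Locator locator;
    private int pendingLine = -1;
    private int pendingColumn = -1;

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flush();
        openElements.addLast(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();
        openElements.removeLast();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text run in several chunks, so buffer until the next tag.
        if (pending.length() == 0 && locator != null) {
            pendingLine = locator.getLineNumber();
            pendingColumn = locator.getColumnNumber();
        }
        pending.append(ch, start, length);
    }

    /** Turns the buffered text into tokens that carry the current element path. */
    private void flush() {
        String text = pending.toString().trim();
        pending.setLength(0);
        if (text.isEmpty()) {
            return;
        }
        String path = "/" + String.join("/", openElements);
        for (String term : text.split("\\s+")) {
            tokens.add(new PathToken(term, path, pendingLine, pendingColumn));
        }
    }

    static List<PathToken> tokenize(String xml) {
        XmlPathTokens handler = new XmlPathTokens();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.tokens;
    }

    public static void main(String[] args) {
        String xml = "<doc><title>Hello World</title><body>some text</body></doc>";
        for (PathToken t : tokenize(xml)) {
            System.out.println(t.term + " " + t.path + " @" + t.line + ":" + t.column);
        }
    }
}
```

In a Lucene 2.9 TokenStream the path would live in a custom Attribute next to the term and offset attributes, and the consumer would read it to choose the target field.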

Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen

> -----Original Message-----
> From: Jukka Zitting
> Sent: Thursday, September 03, 2009 3:07 PM
> To:; David Causse
> Subject: Re: Use of tika for parsing, offsets questions
> Hi,
> On Wed, Sep 2, 2009 at 2:40 PM, David Causse wrote:
> > If I use Tika to parse HTML and pass the parsed String to a Lucene
> > analyzer, what happens to the offset information needed for KWIC and for
> > returning to the text (like the Google cache view)? How can I keep track
> > of the offsets between the Tika parser and the Lucene analyzer?
> Currently Tika doesn't expose that information, but the Tika Parser API
> was designed with such use in mind, so it will be possible to add the
> offset information. Please file a Tika feature request [1] for this.
> [1]
> BR,
> Jukka Zitting
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
