lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Causse <dcau...@spotter.com>
Subject Re: Use of tika for parsing, offsets questions
Date Fri, 04 Sep 2009 08:29:32 GMT
On Thu, Sep 03, 2009 at 03:07:18PM +0200, Jukka Zitting wrote:
> Hi,
> 
> On Wed, Sep 2, 2009 at 2:40 PM, David Causse<dcausse@spotter.com> wrote:
> > If I use tika for parsing HTML code and inject parsed String to a lucene
> > analyzer. What about the offset information for KWIC and return to text
> > (like the google cache view)? how can I keep track of the offsets
> > between tika parser and lucene analyzer?
> 
> Currently Tika doesn't expose that information but the Tika Parser API
> was designed for such use in mind, so it will be possible to add the
> offset information. Please file a Tika feature request [1] for this.

I created TIKA-272, the idea behind is to be able to use unmodified
lucene analyzers with tika and keep offset correctness.

Thank you.

-- 
David Causse
Spotter
http://www.spotter.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message