lucene-general mailing list archives

From Simon Willnauer <>
Subject Re: Indexing with Lucene
Date Wed, 20 Jul 2011 21:32:32 GMT
On Wed, Jul 20, 2011 at 3:17 PM, raphael812 <> wrote:
> Hello everyone,
> I am quite new to Lucene and am using the book Lucene in Action to learn.
> I need help extracting the body content of an HTML page using Tika. The
> implementation from the book extracts only the HTML's metadata, not the main
> body content, which is what I need. Is it possible to extract body content
> from HTML pages and PDFs, and if so, how?
> Thanks for your help.

This seems to be a Tika / extraction-specific question. You should
try asking it on the Tika list; I bet you'll get a quick
response there!

> Raphael
