lucene-general mailing list archives

From Simon Willnauer <simon.willna...@googlemail.com>
Subject Re: Indexing with Lucene
Date Wed, 20 Jul 2011 21:32:32 GMT
On Wed, Jul 20, 2011 at 3:17 PM, raphael812 <oro30@eecs.qmul.ac.uk> wrote:
> Hello everyone,
>
> I am quite new to Lucene and I am using the book Lucene in Action to learn.
> I need help extracting the body content of an HTML page using Tika. The
> implementation from the book extracts only the HTML's metadata, not the main
> body content, which I need. Is it possible to extract body content from HTML
> pages and PDFs, and if so, how?
> Thanks for your help.

Hey,
this seems to be a Tika / extraction-specific question. You should
try asking it on the Tika list; I bet you'll get a quick
response there!

simon
>
> Raphael
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Indexing-with-Lucene-tp3185409p3185409.html
> Sent from the Lucene - General mailing list archive at Nabble.com.
>
