lucene-general mailing list archives

From raphael812 <or...@eecs.qmul.ac.uk>
Subject Indexing with Lucene
Date Wed, 20 Jul 2011 13:17:21 GMT
Hello everyone,

I am quite new to Lucene and am using the book Lucene in Action to learn it.
I need help extracting the body content of an HTML page using Tika. The
implementation from the book extracts only the HTML's metadata, not the main
body content, which is what I need. Is it possible to extract the body
content from HTML and PDF files, and if so, how?
Thanks for your help.

Raphael

--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-with-Lucene-tp3185409p3185409.html
