lucene-general mailing list archives

From raphael812 <>
Subject Indexing with Lucene
Date Wed, 20 Jul 2011 13:17:21 GMT
Hello everyone,

I am quite new to Lucene and am using the book Lucene in Action to learn it.
I need help extracting the body content of an HTML page using Tika. The
implementation from the book extracts only the HTML's metadata, not the main
body content, which is what I need. Is it possible to extract body content
from HTML and PDF files, and if so, how?
Thanks for your help.

