lucene-java-user mailing list archives

From mark harwood <markharw...@yahoo.co.uk>
Subject Re: PDF documents with "MoreLikeThis" class
Date Thu, 20 Jul 2006 11:33:27 GMT
>> Do I have to extract text from PDF file and then pass an InputStream with the text
>> inside?
Yes.
Technically you could pass the content unparsed, but it would contain a lot of unintelligible
garbage in the form of markup and image data.

All Lucene classes deliberately avoid the mucky business of parsing specific document types.
This keeps the core engine tightly focused on indexing and searching, without having to
deal with the ever-changing range of document formats.
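
As a rough illustration of the order of operations (extract the text first, then hand plain
text to like()), here is a minimal sketch. The two interfaces are hypothetical stand-ins, not
real PDFBox or Lucene API: PdfTextExtractor represents whatever extractor you already use at
indexing time, and LikeQueryBuilder represents the text-accepting like() call on MoreLikeThis.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class PdfMoreLikeThisSketch {

    /** Hypothetical stand-in for a PDF text extractor (for example, a thin wrapper around PDFBox). */
    interface PdfTextExtractor {
        String extractText(String pdfPath) throws IOException;
    }

    /** Hypothetical stand-in for the MoreLikeThis call that accepts plain text. */
    interface LikeQueryBuilder<Q> {
        Q like(Reader plainText) throws IOException;
    }

    /**
     * Extracts the text from the PDF first, then passes only that plain text
     * to the query builder. The raw PDF bytes never reach the search engine.
     */
    static <Q> Q similarTo(String pdfPath,
                           PdfTextExtractor extractor,
                           LikeQueryBuilder<Q> queryBuilder) throws IOException {
        String text = extractor.extractText(pdfPath);
        try (Reader reader = new StringReader(text)) {
            return queryBuilder.like(reader);
        }
    }
}
```

In real code the extractor would wrap the same PDFBox logic used when the index was built, so
that the text fed to like() is produced the same way as the indexed text.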



----- Original Message ----
From: Davide <davidin81@libero.it>
To: java-user@lucene.apache.org
Sent: Thursday, 20 July, 2006 10:41:03 AM
Subject: PDF documents with "MoreLikeThis" class

Hi,
I'm using the MoreLikeThis class to find similar documents, but I'm not
sure whether it is correct to pass a PDF file as the argument to the
*MoreLikeThis.like()* method.

To be clearer:

1) I add some PDF files to my Lucene index (I use PDFBox to extract the
text and add fields to the index).
2) Now I want to find documents similar to a specific PDF file, for which
I have the file name (C:\\Example.pdf).


*My question is: What is the correct way to call the like() method when I
want to find similar PDF files?*

I use:
-------------------------------------------------------
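// indexReader is an IndexReader already opened on the index
MoreLikeThis mlt = new MoreLikeThis(indexReader);

// passes the PDF file itself, not its extracted text
Query query = mlt.like(new File("C:\\Example.pdf"));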
-------------------------------------------------------

I'm not sure this is the correct way, because I think that if I pass a file
to the like() method, it expects a text file and not a PDF file, where the
text is not directly readable...

Do I have to extract the text from the PDF file and then pass an InputStream
with the text inside? Or is my way OK?

Thanks for any suggestion,
Davide.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




