lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kalani Ruwanpathirana" <kala...@gmail.com>
Subject Re: Pdf in Lucene?
Date Thu, 04 Dec 2008 13:47:01 GMT
Hi Tiziano,

What is the error you got? I think you can get the text easily using the
code shown below.


FileInputStream fi = new FileInputStream(new File("sample.pdf"));

PDFParser parser = new PDFParser(fi);
parser.parse();
COSDocument cd = parser.getDocument();
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(new PDDocument(cd));
cd.close();

After getting the value for text you can simply create the Lucene document.

Document doc = new Document();
doc.add(new Field("content", text,Field.Store.NO <http://field.store.no/>,
Field.Index.TOKENIZED));
>
>
>
>
> On Thu, Dec 4, 2008 at 6:20 PM, tiziano bernardi <dk1982@hotmail.it>wrote:
>
>>
>> Thanks very kind ...
>> But I've tried that code but I do not work ...
>> You could send me a simple working class that uses it please?
>> Thanks> Date: Thu, 4 Dec 2008 15:19:26 +0530> From: kalanir@gmail.com>
>> To: java-user@lucene.apache.org> Subject: Re: Pdf in Lucene?> > Hi,>
> In
>> my case I used PDFBox, just to extract the text from PDF document and> then
>> I created the Lucene document giving the extracted text. (I didn't use> the
>> PDFBox built in Lucene search engine). So I didn't get any> incompatibility
>> problems.> > This blog post shows the way.>
>> http://kalanir.blogspot.com/2008/08/indexing-pdf-documents-with-lucene.html>
>> > It worked perfect for me.> > Thanks.
>> _________________________________________________________________
>> Ci sai fare con l'italiano? Scoprilo con Typectionary!
>> http://typectionary.it.msn.com/
>>
>
>
>
> --
> Kalani Ruwanpathirana
> Department of Computer Science & Engineering
> University of Moratuwa
>



-- 
Kalani Ruwanpathirana
Department of Computer Science & Engineering
University of Moratuwa

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message