lucene-java-user mailing list archives

From "Kalani Ruwanpathirana" <kala...@gmail.com>
Subject Re: Pdf in Lucene?
Date Thu, 04 Dec 2008 13:45:39 GMT
Hi Tiziano,

What is the error you got? I think you can get the text easily using the
code shown below.


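// Open the PDF and parse it into PDFBox's low-level COS document.
FileInputStream fi = new FileInputStream(new File("sample.pdf"));
PDFParser parser = new PDFParser(fi);
parser.parse();
COSDocument cd = parser.getDocument();

// Extract the plain text of the document.
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(new PDDocument(cd));

// Release the parsed document and the input stream.
cd.close();
fi.close();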

Once you have the value of text, you can simply create the Lucene document.

Document doc = new Document();
// Store the id so that it can be read back from search hits.
doc.add(new Field("id", "2", Field.Store.YES, Field.Index.TOKENIZED));
// Index the extracted text for searching, but do not store it.
doc.add(new Field("content", text, Field.Store.NO, Field.Index.TOKENIZED));
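
In case it helps to see what the Store.YES / Store.NO choice above means for searching, here is a toy model in plain Java. It is only a sketch and not how Lucene works internally: ToyIndex, newDocument, addField and search are made-up names for illustration, and the tokenizer is a crude split on non-word characters. The point it shows is that a field can be searchable without being retrievable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy model only: hypothetical class, NOT Lucene's API or implementation. */
final class ToyIndex {
    // "field:term" -> document numbers containing that term (the indexed side)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document number -> stored field values (the stored side)
    private final List<Map<String, String>> stored = new ArrayList<>();

    /** Starts a new document and returns its internal number. */
    int newDocument() {
        stored.add(new HashMap<>());
        return stored.size() - 1;
    }

    /** Adds a tokenized field; 'store' plays the role of Field.Store.YES / NO. */
    void addField(int doc, String name, String value, boolean store) {
        if (store) {
            stored.get(doc).put(name, value);
        }
        for (String token : value.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(name + ":" + token, k -> new LinkedHashSet<>()).add(doc);
            }
        }
    }

    /** For each document whose 'field' contains 'term', returns its stored 'returnField'. */
    List<String> search(String field, String term, String returnField) {
        List<String> hits = new ArrayList<>();
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        for (int doc : postings.getOrDefault(key, Collections.<Integer>emptySet())) {
            hits.add(stored.get(doc).get(returnField)); // null if the field was not stored
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int doc = index.newDocument();
        index.addField(doc, "id", "2", true);                                   // stored
        index.addField(doc, "content", "Text extracted from sample.pdf", false); // not stored
        System.out.println(index.search("content", "extracted", "id"));      // prints [2]
        System.out.println(index.search("content", "extracted", "content")); // prints [null]
    }
}
```

So with the two fields as I set them up, a search on "content" can find the document and give you back its "id", but not the PDF text itself. If you want to show the text in the results, you would store "content" as well, at the cost of a larger index.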


On Thu, Dec 4, 2008 at 6:20 PM, tiziano bernardi <dk1982@hotmail.it> wrote:

>
> Thanks, that is very kind.
> I tried that code, but it does not work for me.
> Could you please send me a simple working class that uses it?
> Thanks
>
> > Date: Thu, 4 Dec 2008 15:19:26 +0530
> > From: kalanir@gmail.com
> > To: java-user@lucene.apache.org
> > Subject: Re: Pdf in Lucene?
> >
> > Hi,
> >
> > In my case I used PDFBox just to extract the text from the PDF document, and
> > then I created the Lucene document from the extracted text. (I didn't use
> > the Lucene search engine built into PDFBox.) So I didn't get any
> > incompatibility problems.
> >
> > This blog post shows the way.
> > http://kalanir.blogspot.com/2008/08/indexing-pdf-documents-with-lucene.html
> >
> > It worked perfectly for me.
> >
> > Thanks.
>



-- 
Kalani Ruwanpathirana
Department of Computer Science & Engineering
University of Moratuwa
