lucene-java-user mailing list archives

From Michael Wechner <>
Subject Re: xpdf parser usage for lucene
Date Tue, 25 Feb 2003 22:11:50 GMT
Pinky Iyer wrote:

>Hi!
>   I am trying to use xpdf as a PDF parser. The problem is that when I
>encounter a file with a .pdf extension, I call the pdftotext script to
>convert it to text, which in turn uses the file system and leaves the
>same file with a .txt extension in the same directory. How can I get
>the text as a stream and not use the file system at all? Also, how do
>I access the summary and title info?

xpdf has an option to turn the PDF into HTML instead of plain text,
which allows you to use an HTMLParser for populating the fields.
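
On the question of avoiding the file system: pdftotext writes to stdout
when the output file name is given as "-", so the converted output can be
read directly from the process. Below is a minimal Java sketch of that
idea. The class and method names are hypothetical, and the -htmlmeta flag
(presumably the HTML option meant above) is an assumption: it is not
present in every pdftotext build, so check the man page of your version.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of xpdf or Lucene.
final class PdfToTextStream {
    private PdfToTextStream() {}

    // Builds the pdftotext command line. "-" as the output file name
    // sends the result to stdout instead of a .txt file on disk.
    static List<String> command(String pdfPath, boolean htmlMeta) {
        List<String> cmd = new ArrayList<>();
        cmd.add("pdftotext");
        cmd.add("-enc");
        cmd.add("UTF-8");           // ask for UTF-8 output
        if (htmlMeta) {
            cmd.add("-htmlmeta");   // assumption: supported by your build
        }
        cmd.add(pdfPath);
        cmd.add("-");
        return cmd;
    }

    // Starts pdftotext; its stdout is available via getInputStream().
    static Process start(String pdfPath, boolean htmlMeta) throws IOException {
        return new ProcessBuilder(command(pdfPath, htmlMeta))
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
    }

    // Drains a stream into a String, decoding it as UTF-8.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: PdfToTextStream <file.pdf>");
            return;
        }
        Process process = start(args[0], true);
        String output = readAll(process.getInputStream());
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        System.out.println(output);
    }
}
```

Instead of collecting everything into a String, the stream returned by
getInputStream() could also be wrapped in a Reader and handed to whatever
parser fills the Lucene fields.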

Concerning the extension: when you create your Lucene document, you
could replace the .txt extension with the .pdf extension in the value
of the "uri" field.
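
A small sketch of that extension swap in Java, in case it helps. The
class and method names are hypothetical; the returned string would be
what you store in the "uri" field when building the Lucene document.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene.
final class UriField {
    private UriField() {}

    // Maps the path of a converted .txt file back to the original .pdf
    // path. Paths that do not end in .txt are returned unchanged.
    static String toPdfUri(String convertedPath) {
        String lower = convertedPath.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".txt")) {
            int cut = convertedPath.length() - ".txt".length();
            return convertedPath.substring(0, cut) + ".pdf";
        }
        return convertedPath;
    }

    public static void main(String[] args) {
        System.out.println(toPdfUri("docs/report.txt")); // prints docs/report.pdf
    }
}
```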



> Anybody who has done this before, please help!
>Pinky Iyer
