lucene-java-user mailing list archives

From Pinky Iyer <>
Subject Re: xpdf parser usage for lucene
Date Tue, 25 Feb 2003 22:26:23 GMT

This means that I have to use the HTML parser again on the converted document. Is that right?
Also, is there a way to use these tools without touching the filesystem, for example by way of streams?
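
On the stream question, one possible approach is sketched below. It relies on pdftotext accepting "-" as the output file name, which sends the extracted text to stdout instead of writing a .txt file next to the PDF. The class and method names (PdfToTextStream, buildCommand, readAll, extractText) are made up for illustration, the code assumes pdftotext is on the PATH and a Java 9+ runtime, and the input PDF still has to exist on disk. Treat it as a rough sketch, not a tested integration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class PdfToTextStream {

    // Builds the command line. The trailing "-" asks pdftotext to write
    // the extracted text to stdout rather than to a .txt file.
    static List<String> buildCommand(String pdfPath) {
        return List.of("pdftotext", "-enc", "UTF-8", pdfPath, "-");
    }

    // Drains a stream fully and decodes it as UTF-8.
    static String readAll(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }

    // Runs pdftotext and returns its stdout as a string (hypothetical helper).
    static String extractText(String pdfPath) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pdfPath));
        // Discard stderr so an unread error stream cannot block the child process.
        pb.redirectError(ProcessBuilder.Redirect.DISCARD);
        Process process = pb.start();
        String text;
        try (InputStream stdout = process.getInputStream()) {
            text = readAll(stdout);
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("pdftotext exited with status " + exit);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: PdfToTextStream <file.pdf>");
            return;
        }
        System.out.println(extractText(args[0]));
    }
}
```

The returned string can then be handed to whatever code builds the Lucene document, so no intermediate .txt file is left in the directory.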
Michael Wechner <> wrote:
Pinky Iyer wrote:

>Hi!
> I am trying to use xpdf as a PDF parser. The problem is that when I encounter a
> file with a .pdf extension, I call the pdftotext script to convert it to text, which in turn
> uses the file system and leaves the same file with a .txt extension in the same directory.
> How can I get this as a stream and not use the file system at all? Also, how do I access
> the summary and title

xpdf has an option to turn the PDF into HTML instead of txt, which
allows you to use an HTMLParser for populating the fields.

Concerning the extension: when you create your Lucene document, you
could replace the txt extension with the pdf extension in the "uri" field.
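
A minimal sketch of that extension swap, assuming the converted file name simply ends in ".txt". The class and method names (UriField, toPdfUri) are hypothetical, and the Lucene field creation itself is left out.

```java
import java.util.Locale;

public class UriField {

    // Maps the converted file's path back to the original PDF's path,
    // so the "uri" field points at the PDF rather than the .txt copy.
    static String toPdfUri(String convertedPath) {
        String lower = convertedPath.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".txt")) {
            // Drop the 4-character ".txt" suffix and append ".pdf".
            return convertedPath.substring(0, convertedPath.length() - 4) + ".pdf";
        }
        // Anything else is returned unchanged.
        return convertedPath;
    }

    public static void main(String[] args) {
        System.out.println(toPdfUri("docs/report.txt"));
    }
}
```

The value returned here would be what gets stored in the "uri" field when the document is created.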



> Anybody who has done this before, please help!
>Pinky Iyer
