lucene-java-user mailing list archives

From Pinky Iyer <pinkyi...@yahoo.com>
Subject Re: xpdf parser usage for lucene
Date Tue, 25 Feb 2003 22:26:23 GMT

This means that I have to use the HTML parser again on the converted document. Is that right?
Also, is there a way to use these tools without touching the file system, for example via streams?
Michael Wechner <michael.wechner@wyona.org> wrote:
Pinky Iyer wrote:

>Hi!
> I am trying to use xpdf as a PDF parser. When I encounter a file with a .pdf extension,
> I call the pdftotext script to convert it to text. The script uses the file system and
> leaves a file with the same name and a .txt extension in the same directory. How can I get
> the text as a stream and not use the file system at all? Also, how do I access the summary
> and title info?
>

xpdf has an option to turn the PDF into HTML instead of txt, which 
allows you to use an HTMLParser
for populating the fields.
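
On the question of getting the text as a stream: a minimal sketch, assuming
the installed pdftotext accepts "-" as the output file name (meaning "write to
stdout") and the -enc UTF-8 option. Both should be checked against the local
man page. The class and method names below are made up for illustration. The
PDF itself is still read from disk; only the .txt output file is avoided.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene or xpdf.
public class PdfTextStream {

    // Build the command line. The trailing "-" is assumed to tell pdftotext
    // to write the extracted text to standard output instead of a .txt file.
    static List<String> pdftotextCommand(String pdfPath) {
        return Arrays.asList("pdftotext", "-enc", "UTF-8", pdfPath, "-");
    }

    // Start the converter and return a Reader over its standard output.
    // The caller is responsible for reading to the end and closing the Reader.
    static Reader openText(String pdfPath) throws IOException {
        ProcessBuilder pb = new ProcessBuilder(pdftotextCommand(pdfPath));
        // Pass the tool's error messages through so they do not fill a pipe.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();
        return new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8));
    }
}
```

The Reader returned by openText could then be handed to whatever builds the
Lucene document, so no intermediate file is left in the directory.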

Concerning the extension: when you create your Lucene document, you 
could replace the txt extension
with the pdf extension in the value of the "uri" field.
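
The extension swap can be sketched as plain string handling. This assumes the
converted file keeps the original name with only the extension changed, as
described in the question. The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper for computing the value stored in the "uri" field.
public class UriField {

    // Map the path of the converted .txt file back to the original .pdf path.
    // Paths that do not end in .txt are returned unchanged.
    static String originalPdfUri(String convertedPath) {
        if (convertedPath.toLowerCase(Locale.ROOT).endsWith(".txt")) {
            // Drop the four characters of ".txt" and append ".pdf".
            return convertedPath.substring(0, convertedPath.length() - 4) + ".pdf";
        }
        return convertedPath;
    }
}
```

The returned string is what would go into the "uri" field, so that search
results point at the PDF rather than at the converted text.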

HTH

Michael

> Anybody who has done this before, please help!
>Thanks!
>Pinky Iyer



