lucene-java-user mailing list archives

From Michael Wechner <michael.wech...@wyona.org>
Subject Re: xpdf parser usage for lucene
Date Tue, 25 Feb 2003 22:11:50 GMT
Pinky Iyer wrote:

>Hi!
>   I am trying to use xpdf as a PDF parser. When I encounter a file with a
>.pdf extension, I call the pdftotext script to convert it to text, which in
>turn uses the file system and leaves a file with the same name and a .txt
>extension in the same directory. How can I get the text as a stream and not
>use the file system at all? Also, how do I access the summary and title info?
>

xpdf has an option to turn the PDF into HTML instead of plain text, which
allows you to use an HTMLParser to populate the fields.
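
As for getting a stream without touching the file system, here is a rough
sketch in Java, under a few assumptions: pdftotext writes to standard output
when the output file name is "-", and its -htmlmeta option wraps the text in
simple HTML with the document info (title and so on) as meta headers. The
class and method names are made up for illustration, and "-enc UTF-8" assumes
your pdftotext build knows that encoding name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of xpdf or Lucene.
final class PdfToTextStream {

    // Builds the pdftotext command line. The trailing "-" asks pdftotext to
    // write to standard output instead of creating a .txt file next to the PDF.
    static List<String> command(String pdfPath, boolean htmlMeta) {
        List<String> cmd = new ArrayList<>();
        cmd.add("pdftotext");
        cmd.add("-enc");      // assumption: this build supports the UTF-8 encoding name
        cmd.add("UTF-8");
        if (htmlMeta) {
            cmd.add("-htmlmeta"); // simple HTML output with meta headers
        }
        cmd.add(pdfPath);
        cmd.add("-");
        return cmd;
    }

    // Runs the command and returns everything it printed on standard output.
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // let error messages pass through
                .start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("command exited with status " + exit);
        }
        return out.toString();
    }
}
```

Calling PdfToTextStream.run(PdfToTextStream.command("some.pdf", true)) should
then give you the HTML as a string, which you can hand to whatever HTML parser
you already use. For large files you would probably want to pass the process's
input stream along directly instead of collecting it into a string.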

Concerning the extension: when you create your Lucene document, you
could replace the .txt extension with the .pdf extension in the value
of the "uri" field.
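
The extension swap could look roughly like this. It is only a sketch: the
class and method names are hypothetical, and it assumes the text file sits
next to the PDF under the same base name.

```java
import java.util.Locale;

// Hypothetical helper for computing the value of the "uri" field.
final class PdfUri {

    // Maps the path of the generated text file back to the original PDF.
    // Paths that do not end in .txt are returned unchanged.
    static String toPdfUri(String textPath) {
        if (textPath.toLowerCase(Locale.ROOT).endsWith(".txt")) {
            return textPath.substring(0, textPath.length() - ".txt".length()) + ".pdf";
        }
        return textPath;
    }
}
```

The result is what you would store in the "uri" field, so that search hits
point at the PDF rather than at the converted text.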

HTH

Michael

> Anybody who has done this before, please help!
>Thanks!
>Pinky Iyer



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

