lucene-java-user mailing list archives

From Bruce Ritchie
Subject Re: xpdf parser usage for lucene
Date Tue, 25 Feb 2003 22:31:46 GMT

If you had actually read the documentation that came with pdftotext, you
would know that if you pass in a - (dash) as the output filename it will
stream the text to stdout. This is exactly what the code Matt Tucker showed
you before does; it is copied below. It's all there in his message.

As for summary and title info, you'll probably have to use a PDF parsing
library to gain access to that from the PDF.

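import java.io.BufferedInputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;

// The trailing "-" tells pdftotext to write the text to stdout
String[] cmd = new String[] {
         PATH_TO_XPDF, "-enc", "UTF-8", "-q", PDF_FILE_TO_PARSE, "-"};
Process p = Runtime.getRuntime().exec(cmd);
BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
StringWriter out = new StringWriter();
char[] buf = new char[512];
int len;
// Copy the process output into the StringWriter until end of stream
while ((len = reader.read(buf)) >= 0) {
     out.write(buf, 0, len);
}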

You should of course wrap this in a try/catch block, etc.
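
For what it's worth, here is a rough sketch of what that wrapping might look like. It is not Matt's code, just one way to do it. The class name PdfTextExtractor and the methods readFully and extractText are made up for illustration, and it assumes a newer JDK than we had in 2003 (try-with-resources, StandardCharsets). The pdftotext arguments are the same ones used in this message.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names here are illustrative only.
class PdfTextExtractor {

    // Reads the whole stream as UTF-8 text with a simple read loop,
    // closing the reader (and the underlying stream) when done.
    static String readFully(InputStream in) throws IOException {
        StringWriter out = new StringWriter();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[512];
            int len;
            while ((len = reader.read(buf)) >= 0) {
                out.write(buf, 0, len);
            }
        }
        return out.toString();
    }

    // Runs pdftotext with "-" as the output file so the text arrives on
    // stdout, then returns that text. "-q" suppresses pdftotext's messages,
    // so stderr is assumed to stay quiet and is not drained here.
    static String extractText(String pathToPdftotext, String pdfFile)
            throws IOException, InterruptedException {
        String[] cmd = {pathToPdftotext, "-enc", "UTF-8", "-q", pdfFile, "-"};
        Process p = Runtime.getRuntime().exec(cmd);
        try {
            String text = readFully(p.getInputStream());
            int exit = p.waitFor();
            if (exit != 0) {
                throw new IOException("pdftotext exited with code " + exit);
            }
            return text;
        } finally {
            // Make sure the child process does not linger if reading failed.
            p.destroy();
        }
    }
}
```

The caller would then put the call to extractText inside its own try/catch for IOException and InterruptedException and decide there whether to skip the file or abort the indexing run.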


Bruce Ritchie

Pinky Iyer wrote:
> Hi!
> I am trying to use xpdf as a PDF parser. The problem I encounter is that
> when I find a file with a .pdf extension, I call the pdftotext script to
> convert it to text, which in turn uses the file system and leaves the same
> file with a .txt extension in the same dir. How can I get this as a stream
> and not use the file system at all? Also, how do I access the summary and
> title info?
> Anybody who has done this before, please help!
> Thanks!
> Pinky Iyer

AOL - bruceritchie101
ICQ - 9929791
