lucene-java-user mailing list archives

From Kelvin Tan <kelvin-li...@relevanz.com>
Subject Re: PDF Text extraction
Date Fri, 27 Dec 2002 07:02:26 GMT
Try this?

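InputStream input = new FileInputStream(file);
// parseDocument() is not shown here; it stands for whatever parses the stream into a COSDocument
COSDocument document = parseDocument(input);
PDFTextStripper stripper = new PDFTextStripper();
StringWriter output = new StringWriter();
// Write the extracted text into the StringWriter, then print it
stripper.writeText(document, output);
System.out.println(output.toString());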

errmmm...the code may not be 100% correct, but you get the idea.
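
As for the java.io.InputStreamReader@127dc0 in the quoted output: that is just the default toString() of a Reader, which prints the class name and a hash code without reading anything. To see the characters, the Reader has to be read to the end. Here is a minimal sketch of that step using only java.io; the class name ReaderDump, the helper readAll, and the sample string are made up for illustration and are not part of PDFBox or Lucene.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class ReaderDump {
    // Drain a Reader into a String by copying it through a StringWriter.
    static String readAll(Reader reader) throws IOException {
        StringWriter out = new StringWriter();
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for extracted PDF text; a real Reader would come from the parser.
        byte[] bytes = "extracted text".getBytes(StandardCharsets.UTF_8);
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);

        // Printing the Reader itself only shows its class name and hash code.
        System.out.println(reader);
        // Reading it to the end yields the actual characters.
        System.out.println(readAll(reader));
    }
}
```

Keep in mind that a Reader can only be consumed once, so a Reader that has been drained for printing has nothing left for the indexer. If both are needed, extract the text into a String first and build whatever else is needed from that.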

Regards,
Kelvin

--------
The book giving manifesto     - http://how.to/sharethisbook


On Fri, 27 Dec 2002 12:04:11 +0530, Suhas Indra said:
>Hello List
>
>I am using PDFBox to index some of the PDF documents. The parser
>works fine and I can read the summary. But the contents are
>displayed as java.io.InputStream.
>
>When I try the following (where doc is the Document object):
>
>System.out.println(doc.getField("contents"));
>
>The result will be:
>
>Text<contents:java.io.InputStreamReader@127dc0>
>
>I want to print the extracted data.
>
>Can anyone please let me know how to extract the contents?
>
>Regards
>
>Suhas
>
>
>
>--------------------------------------------------------------
>Robosoft Technologies - Partners in Product Development
>
>--
>To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
>For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>



