lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ben Litchfield <...@csh.rit.edu>
Subject Re: PDF Text extraction
Date Fri, 27 Dec 2002 14:03:47 GMT

You need to do something like

//first get the document field
Field contentsField = doc.getField( "contents" );

//Then get the reader from the field
BufferedReader contentsReader =
    new BufferedReader( contentsField.readerValue() );

//finally dump the contents of the reader to System.out
String line = null;
while( (line = contentsReader.readLine() ) != null )
{
    System.out.println( line );
}

I have not tested if this compiles but it should be pretty close.

Ben Litchfield


On Fri, 27 Dec 2002, Suhas Indra wrote:

> Hello List
>
> I am using PDFBox to index some of the PDF documents. The parser works fine
> and I can read the summary. But the contents are displayed as
> java.io.InputStream.
>
> When I try the following:
> System.out.println(doc.getField("contents")) (where doc is the Document
> object)
>
> The result will be:
>
> Text<contents:java.io.InputStreamReader@127dc0>
>
> I want to print the extracted data.
>
> Can anyone please let me know how to extract the contents?
>
> Regards
>
> Suhas
>
>
>
> --------------------------------------------------------------
> Robosoft Technologies - Partners in Product Development
>
>
>
>
>
>
>
>
>
> --
> To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
>

-- 


--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message