lucene-java-user mailing list archives

From Vinod Bhagat <>
Subject RE: Extracting Complete Text from PDF using Lucene and JPEDAL!!!!
Date Tue, 15 Oct 2002 09:56:53 GMT

  I am trying to read multiple pages from a PDF. To do that, I changed the
start and end parameters in the ExtractTextObjects class, but after
successfully reading the text from the first page it gives the following
error:

Processing content from page 2
Reading resources object 2 0 R
Reading fonts
        at org.jpedal.fonts.PdfFontsData.putWidth(
        at org.jpedal.PdfObjects.readFonts(
        at org.jpedal.PdfObjects.readResources(
        at org.jpedal.PdfDecoder.decodePage(
Exception java.lang.NullPointerException reading font

 It reads the first page without any problem, but when it iterates over the
subsequent pages it fails with the NullPointerException. Has anyone
encountered something like this? Am I missing something? At the moment I am
hard-coding the start and end as
start = 1
end = 10

 for the number of pages. But it still gives the error. I tried to use the
getPageCount() method declared in , but this method always returns 0 as the
count. I am using the following code:
		try
		{
			//decode_pdf = new PdfDecoder( false );
			decode_pdf = new PdfDecoder( true );
			pageCount = decode_pdf.getPageCount();
			if (pageCount > start)
			{
				end = pageCount;
			}
			System.out.println( "TOTAL PAGE COUNT IS =================== :" + pageCount );
			/**
			 * open the file (and read metadata including pages in file)
			 */
			System.out.println( "Opening NEW file :" + file_name );
			decode_pdf.openPdfFile( file_name );
		}
		catch( Exception e )
		{
			System.err.println( "Exception " + e + " in pdf code" );
			System.exit( 1 );
		}

I flush each page object at the end 
				decode_pdf.flushObjectValues( true );

 I would appreciate a positive and quick reply.

 Best Regards.

-----Original Message-----
From: Mikael Söderman []
Sent: Monday, October 14, 2002 12:37 PM
To: Lucene Users List
Subject: Re: Extracting Complete Text from PDF using Lucene and

Hi Vin!

With JPedal you process one page at a time by calling the method decodePage
and supplying the number of the page you want to process as the argument.

In the example ExtractTextObjects the total number of pages is hard-coded to
1 (the variable end is set to 1 in the constructor). Try setting the number
of pages with the getPageCount method instead.

Best regards

Mikael Söderman

PS. Don't forget to always call flushObjectValues when done with a page.
This will make JPedal reuse memory.
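
A rough sketch of the loop described above, under stated assumptions.
PageDecoder and FakeDecoder are hypothetical stand-ins so the example runs
without JPedal. Only the four method names (openPdfFile, getPageCount,
decodePage, flushObjectValues) come from this thread, and their real
signatures may differ. The sketch also assumes that the page count is only
known once the file has been opened, which the comment in the posted code
("read metadata including pages in file") suggests, and that pages are
numbered from 1.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the decoder calls named in this thread; not the real JPedal API. */
interface PageDecoder {
    void openPdfFile(String fileName) throws Exception;
    int getPageCount();
    void decodePage(int page) throws Exception;
    // The meaning of the boolean is not stated in the thread; the posted code passes true.
    void flushObjectValues(boolean flag);
}

/** Fake decoder so the sketch runs on its own. ASSUMPTION: the count is 0 until a file is opened. */
class FakeDecoder implements PageDecoder {
    private final int pagesInFile;
    private int pageCount = 0;
    final List<String> log = new ArrayList<>();

    FakeDecoder(int pagesInFile) {
        this.pagesInFile = pagesInFile;
    }

    public void openPdfFile(String fileName) {
        pageCount = pagesInFile; // the count becomes available only here
        log.add("open");
    }

    public int getPageCount() {
        return pageCount;
    }

    public void decodePage(int page) {
        if (page < 1 || page > pageCount) {
            throw new IllegalArgumentException("no such page: " + page);
        }
        log.add("decode " + page);
    }

    public void flushObjectValues(boolean flag) {
        log.add("flush");
    }
}

class PageLoopSketch {
    /** Opens the file first, then asks for the page count, then decodes pages 1..count. */
    static int decodeAllPages(PageDecoder decoder, String fileName) throws Exception {
        decoder.openPdfFile(fileName);
        int pageCount = decoder.getPageCount();
        for (int page = 1; page <= pageCount; page++) {
            try {
                decoder.decodePage(page);
            } finally {
                // Flush after every page, even if decoding that page failed.
                decoder.flushObjectValues(true);
            }
        }
        return pageCount;
    }

    public static void main(String[] args) throws Exception {
        FakeDecoder decoder = new FakeDecoder(3);
        System.out.println("count before open: " + decoder.getPageCount());
        System.out.println("pages decoded: " + decodeAllPages(decoder, "example.pdf"));
        System.out.println(decoder.log);
    }
}
```

If the same assumption holds for the real library, asking for the page count
before the file has been opened would explain a count of 0.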

----- Original Message -----
From: "Vinod Bhagat" <>
To: "'Lucene Users List'" <>
Sent: Monday, October 14, 2002 11:26 AM
Subject: Extracting Complete Text from PDF using Lucene and JPEDAL!!!!

> Dear People
>   I am using Lucene, and one of the requirements is to index PDFs. I am using
> JPEDAL's API to extract text from PDF. So far I have managed to get the text
> of the first page. I am using the class to do the
> above, but I want to extract the complete text of the PDF file. Has anyone
> done this, and could you possibly guide me towards it?
>  I would appreciate a positive and quick reply.
>  Cheers
> Vin.
> --
> To unsubscribe, e-mail:
> For additional commands, e-mail:
