lucene-java-user mailing list archives

From Eric Anderson <Eric.Ander...@LanRx.com>
Subject Re: [ANN] PDFBox 0.6.0
Date Thu, 06 Mar 2003 14:28:34 GMT
When it throws the exception, the indexer fails, so I cannot continue indexing.

It appears to be related only to certain files: when I remove some of them,
indexing continues past that point, but whenever it encounters one of these
files, the index fails.
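
One common way to keep a single unreadable file from aborting a whole run is
to catch the exception per file, record the failure, and move on. The sketch
below is only an illustration under stated assumptions: it does not call
PDFBox or Lucene at all. The names extractText, indexAll, and the file names
are hypothetical stand-ins, and a deliberately truncated zlib stream is used
to provoke an error from java.util.zip of the kind reported in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

class SkipBadFiles {

    /** Compresses text with zlib; used here only to build sample input. */
    static byte[] deflate(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream z = new DeflaterOutputStream(out)) {
            z.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Hypothetical stand-in for text extraction: inflates a zlib stream. */
    static String extractText(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[256];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /**
     * Processes every entry; a failure on one entry is recorded in
     * 'failed' and the loop continues with the next entry.
     */
    static List<String> indexAll(Map<String, byte[]> files, List<String> failed) {
        List<String> indexed = new ArrayList<>();
        for (Map.Entry<String, byte[]> entry : files.entrySet()) {
            try {
                indexed.add(extractText(entry.getValue()));
            } catch (IOException ex) {
                // EOFException is a subclass of IOException, so it lands here.
                failed.add(entry.getKey() + ": " + ex.getMessage());
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("good-1.pdf", deflate("first document"));
        byte[] whole = deflate("second document");
        // Cut the stream in half to simulate a damaged compressed stream.
        files.put("damaged.pdf", Arrays.copyOf(whole, whole.length / 2));
        files.put("good-2.pdf", deflate("third document"));

        List<String> failed = new ArrayList<>();
        List<String> indexed = indexAll(files, failed);
        System.out.println("indexed: " + indexed);
        System.out.println("skipped: " + failed);
    }
}
```

In a real indexer the same try/catch would wrap whatever call converts one
file into a document, so the files that fail can be collected and attached to
a bug report while the rest of the repository is still indexed.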

Eric Anderson
LanRx Network Solutions
815-505-6132


Quoting Ben Litchfield <ben@csh.rit.edu>:

> In this release I have changed how I parsed the document, which may have
> introduced this bug.  I have received another report of this and will have
> it fixed for the next point release.
> 
> You said you tried with a reasonably sized PDF repository.  Did you stop
> indexing at this error or did you continue?  If you continued, is this the
> only error that you got?
> 
> -Ben
> 
> 
> 
> 
> -- 
> 
> On Thu, 6 Mar 2003, Eric Anderson wrote:
> 
> > Ben-
> > In attempting to use the PDFBox-0.6.0, I rec'd the following error when
> > attempting to scan a reasonably sized PDF repository.
> >
> > Any thoughts?
> >
> >
> >  caught a class java.io.EOFException
> >  with message: Unexpected end of ZLIB input stream
> >
> >
> > Eric Anderson
> > LanRx Network Solutions
> >
> >
> > Quoting Ben Litchfield <ben@csh.rit.edu>:
> >
> > > I would like to announce the next release of PDFBox.  PDFBox allows for
> > > PDF documents to be indexed using lucene through a simple interface.
> > > Please take a look at org.pdfbox.searchengine.lucene.LucenePDFDocument,
> > > which will extract all text and PDF document summary properties as
> > > lucene fields.
> > >
> > > You can obtain the latest release from http://www.pdfbox.org
> > >
> > > Please send all bug reports to me and attach the PDF document when
> > > possible.
> > >
> > > RELEASE 0.6.0
> > > -Massive improvements to memory footprint.
> > > -Must call close() on the COSDocument (LucenePDFDocument does this for
> > >  you)
> > > -Really fixed the bug where small documents were not being indexed.
> > > -Fixed bug where no whitespace existed between obj and start of object.
> > >     Exception in thread "main" java.io.IOException: expected='obj'
> > >     actual='obj<</Pro
> > > -Fixed issue with spacing where textLineMatrix was not being copied
> > >  properly
> > > -Fixed 'bug' where parsing would fail with some pdfs with double endobj
> > >  definitions
> > > -Added PDF document summary fields to the lucene document
> > >
> > >
> > > Thank you,
> > > Ben Litchfield
> > > http://www.pdfbox.org
> > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> > >
> >
> > LanRx Network Solutions, Inc.
> > Providing Enterprise Level Solutions...On A Small Business Budget
> 
