lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bill Janssen <jans...@parc.com>
Subject Re: Indexing is hung or doesn't complete
Date Wed, 13 Oct 2010 22:32:32 GMT
Ching <zching01@gmail.com> wrote:

> I use PDFBox version 1.1.0; I did find a workaround now. Just wondering
> which tools do you use to extract text from pdf? Thanks.

Ching, in UpLib I use a patched version of xpdf which reports the
bounding box and font information for each word (as well as the Unicode
characters of the word :-).  It works fine with the Long_9.pdf file
that's attached to the issue you pointed out.  You might want to try
that.  (The same wordbox info is available from PDFBox, but you have to
write some code to access it.)

To build the patched xpdf:

Download and unpack xpdf-3.02pl4 from http://www.foolabs.com/xpdf/download.html.

Download UpLib from
http://uplib.parc.com/uplib/releases/1.7.10/uplib-1.7.10.tgz and untar
it.  You'll find a file called "patches/xpdf-3.02-PATCH".  Patch the
xpdf sources with that patch:

  cd xpdf-3.02
  patch -p0 < /tmp/uplib-1.7.10/patches/xpdf-3.02-PATCH

Then configure and build xpdf.  You can now extract words and wordbox info
from the PDF file with "pdftotext".  

There's still the issue that neither PDFBox or xpdf or the built-in
Windows, OS X, or Linux utilities for extracting text from PDFs does
layout analysis properly, so the words you get will often be in the
wrong order.  But that's usually only a problem if you are doing things
with multi-word phrases.

Bill

> On Wed, Oct 13, 2010 at 11:36 AM, Fabiano Nunes <fabiano@nunes.me> wrote:
> 
> > What version of PDFBox are you running?
> > PDFBox 0.72 does not work properly with some pdf documents. See more in
> > https://issues.apache.org/jira/browse/PDFBOX-361.
> > So, I wrote a extractor (a copy of the original, in fact) based on trunk
> > version (1.2.1, actually). Furthermore, this version is faster e more
> > efficient than 0.72, preventing from OutOfMemory. Nowadays, I don't use
> > PDFBox to extract text. I prefer poppler tools.
> >
> > On Wed, Oct 13, 2010 at 2:22 PM, Ching <zching01@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Thank you for your suggestions. I found the reason which is that PDFBox
> > > seems having problem parsing large document (20MB), I have a few of them
> > > within those 2000 docs, those are the ones throwing OutOfMemory errors.
> > The
> > > app does exit, and JVM died. I am running on 32bit machine.
> > >
> > > -- Ching
> > >
> > > On Tue, Oct 12, 2010 at 9:42 PM, Anshum <anshumg@gmail.com> wrote:
> > >
> > > > Hi Ching,
> > > > Does the app exit or hang and stay there? as in does the JVM stay alive
> > > and
> > > > idle?
> > > > Also, can you make sure that its not the pdfbox? as in, try commenting
> > > the
> > > > indexwriter part and just read the pdfs, does that work fine.
> > > > Can you also post info on your environment?
> > > > Index Size? Lucene Version? Machine  and JVM (32/64 bit)?
> > > > This most probably seems like a code level issue rather than lucene,
> > but
> > > I
> > > > may be wrong.
> > > >
> > > > --
> > > > Anshum Gupta
> > > > http://ai-cafe.blogspot.com
> > > >
> > > >
> > > > On Wed, Oct 13, 2010 at 8:08 AM, Ching <zching01@gmail.com> wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > Can anyone help with this issue? I have about 2000 pdf files that
I
> > use
> > > > > PDFBox to extract its text, then index them using for loop. The
> > > indexing
> > > > > stopped after the fdt file reaches at 7,061 KB in size. There is
no
> > > > error,
> > > > > the indexing simply stopped.  Thanks in advance for any help.
> > > > >
> > > > > Ching
> > > > >
> > > >
> > >
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message