lucene-java-user mailing list archives

From Ching <zchin...@gmail.com>
Subject Re: Indexing is hung or doesn't complete
Date Thu, 14 Oct 2010 02:24:15 GMT
I use PDFBox version 1.1.0, and I have found a workaround now. Just wondering:
which tools do you use to extract text from PDFs? Thanks.
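
For reference, the reply quoted below mentions the poppler tools. A minimal sketch of calling poppler's `pdftotext` command from Java follows. It assumes `pdftotext` is installed and on the PATH; the class and method names (`PdfToTextRunner`, `buildCommand`, `extract`) are hypothetical and not from the thread. Because the extraction runs in a separate process, a very large PDF should not consume the indexing JVM's heap, though this has not been verified against the documents discussed here.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class PdfToTextRunner {

    /** Builds the command line: pdftotext <input.pdf> <output.txt>. */
    public static List<String> buildCommand(File pdf, File txt) {
        return Arrays.asList("pdftotext", pdf.getPath(), txt.getPath());
    }

    /**
     * Runs pdftotext as an external process and returns its exit code.
     * Assumes the pdftotext binary is available on the PATH.
     */
    public static int extract(File pdf, File txt) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(pdf, txt))
                .inheritIO() // forward the tool's stdout/stderr to this process
                .start();
        return p.waitFor();
    }
}
```

A non-zero exit code can be treated as a failed extraction, and the file skipped or logged before indexing continues.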

On Wed, Oct 13, 2010 at 11:36 AM, Fabiano Nunes <fabiano@nunes.me> wrote:

> What version of PDFBox are you running?
> PDFBox 0.72 does not work properly with some pdf documents. See more in
> https://issues.apache.org/jira/browse/PDFBOX-361.
> So, I wrote an extractor (a copy of the original, in fact) based on the trunk
> version (1.2.1, actually). Furthermore, this version is faster and more
> efficient than 0.72, which prevents OutOfMemory errors. Nowadays, I don't use
> PDFBox to extract text. I prefer the poppler tools.
>
> On Wed, Oct 13, 2010 at 2:22 PM, Ching <zching01@gmail.com> wrote:
>
> > Hi,
> >
> > Thank you for your suggestions. I found the cause: PDFBox seems to have
> > problems parsing large documents (20MB). There are a few of them among
> > those 2000 docs, and those are the ones throwing OutOfMemory errors. The
> > app does exit, and the JVM dies. I am running on a 32-bit machine.
> >
> > -- Ching
> >
> > On Tue, Oct 12, 2010 at 9:42 PM, Anshum <anshumg@gmail.com> wrote:
> >
> > > Hi Ching,
> > > Does the app exit, or does it hang and stay there? That is, does the JVM
> > > stay alive and idle?
> > > Also, can you make sure that it's not PDFBox? That is, try commenting out
> > > the IndexWriter part and just read the PDFs; does that work fine?
> > > Can you also post info on your environment?
> > > Index size? Lucene version? Machine and JVM (32/64 bit)?
> > > This most probably seems like a code-level issue rather than a Lucene one,
> > > but I may be wrong.
> > >
> > > --
> > > Anshum Gupta
> > > http://ai-cafe.blogspot.com
> > >
> > >
> > > On Wed, Oct 13, 2010 at 8:08 AM, Ching <zching01@gmail.com> wrote:
> > >
> > > > Hi All,
> > > >
> > > > Can anyone help with this issue? I have about 2000 PDF files; I use
> > > > PDFBox to extract their text, then index them in a for loop. The
> > > > indexing stopped after the .fdt file reached 7,061 KB in size. There is
> > > > no error; the indexing simply stopped. Thanks in advance for any help.
> > > >
> > > > Ching
> > > >
> > >
> >
>
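
The isolation step suggested in the thread (read the PDFs without the IndexWriter and see which ones fail) can be sketched roughly as below. This is an illustration under stated assumptions, not the code anyone in the thread actually ran: `TextExtractor` is a hypothetical stand-in for whatever PDFBox call the application uses, and `maxBytes` is an arbitrary size cutoff for skipping very large files such as the 20MB documents mentioned above. Catching `OutOfMemoryError` is best-effort only; the JVM may not recover reliably after one.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ExtractionCheck {

    /** Hypothetical stand-in for the application's PDF text extraction call. */
    public interface TextExtractor {
        String extract(File pdf) throws Exception;
    }

    /**
     * Runs text extraction only (no indexing) over the given files and
     * returns the ones that were skipped for size or failed to extract.
     */
    public static List<File> findFailures(List<File> pdfs, TextExtractor extractor, long maxBytes) {
        List<File> failed = new ArrayList<>();
        for (File pdf : pdfs) {
            // Skip files above the assumed size cutoff instead of risking the heap.
            if (pdf.length() > maxBytes) {
                System.err.println("Skipping large file: " + pdf + " (" + pdf.length() + " bytes)");
                failed.add(pdf);
                continue;
            }
            try {
                String text = extractor.extract(pdf);
                System.out.println(pdf.getName() + ": " + (text == null ? 0 : text.length()) + " chars");
            } catch (Exception | OutOfMemoryError e) {
                // Best-effort: record the file and move on to the next one.
                System.err.println("Extraction failed for " + pdf + ": " + e);
                failed.add(pdf);
            }
        }
        return failed;
    }
}
```

If this loop completes over all 2000 files once the failures are set aside, the extraction step rather than Lucene is the likely culprit, which matches what was reported in the thread.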
