lucene-solr-user mailing list archives

From Brandon Waterloo <Brandon.Water...@matrix.msu.edu>
Subject RE: Problems indexing very large set of documents
Date Mon, 11 Apr 2011 16:59:20 GMT
I found a simpler command-line method to update the PDF files.  On some documents it works
perfectly: the result is a pixel-for-pixel match and none of the OCR text (which is what all
these PDFs are, newspaper articles that have been passed through OCR) is lost.  However, on
other documents the result is considerably blurrier and some of the OCR text is lost.

We've decided to skip any documents that Tika cannot index for now.

As Lance stated, it's not specifically the PDF version that causes the problem but rather
quirks introduced by different PDF writers; a few tests have confirmed this, so we can't use
the version to determine which documents should be skipped.  I'm examining the XML responses
from the queries, but I cannot figure out how to tell from the response whether or not a
document was successfully indexed.  The status value seems to be 0 regardless of whether
indexing succeeded or failed.

So my question is, how can I tell from the response whether or not indexing was actually successful?
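
(For concreteness, a minimal SolrJ sketch of one way to detect a failed extract request.
The core URL, the filename, and the literal.id value below are only placeholders, and it
assumes the 1.4.1 ExtractingRequestHandler mapped to /update/extract.  Since the status
field stays 0 on success and failures come back as HTTP 500 responses, catching the
exception thrown by the request seems to be the practical test.)

    import java.io.File;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class ExtractOnePdf {
        public static void main(String[] args) throws Exception {
            // Placeholder URL for the example Solr 1.4.1 instance.
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            ContentStreamUpdateRequest req =
                    new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("32-130-A08-84-al.sff.document.nusa197102.pdf"));
            req.setParam("literal.id", "nusa197102");  // placeholder unique key

            try {
                // On success the responseHeader status is 0; a Tika parsing failure
                // does not show up as a non-zero status but surfaces here as an
                // exception carrying the HTTP 500 from the server.
                server.request(req);
                System.out.println("indexed OK");
            } catch (Exception e) {
                // Treat any exception as "not indexed" and add the file to the skip list.
                System.err.println("indexing failed: " + e.getMessage());
            }
        }
    }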

~Brandon Waterloo

________________________________________
From: Lance Norskog [goksron@gmail.com]
Sent: Sunday, April 10, 2011 5:22 PM
To: solr-user@lucene.apache.org
Subject: Re: Problems indexing very large set of documents

There is a library called iText.  It parses and writes PDFs very well, and a simple program
will let you do a batch conversion.  PDFs are made by a wide range of programs, not just Adobe
code.  Many of these do weird things and make small mistakes that Tika does not know how to
handle.  In other words, there is "dirty PDF" just like "dirty HTML".
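
A minimal sketch of such a batch conversion, assuming iText 2.x (the com.lowagie packages)
and placeholder filenames -- reading each PDF and writing it straight back out lets iText
rebuild the file structure, which is often enough to fix what the original writer got wrong:

    import java.io.FileOutputStream;

    import com.lowagie.text.pdf.PdfReader;
    import com.lowagie.text.pdf.PdfStamper;

    public class RewritePdf {
        public static void main(String[] args) throws Exception {
            // Read the possibly "dirty" PDF and write it back out unchanged;
            // iText regenerates the PDF structure as it writes.
            PdfReader reader = new PdfReader("dirty.pdf");
            PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("clean.pdf"));
            stamper.close();
            reader.close();
        }
    }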

A percentage of PDFs will fail and that's life. One site that gets
press releases from zillions of sites (and thus a wide range of PDF
generators) has a 15% failure rate with Tika.

Lance

On Fri, Apr 8, 2011 at 9:44 AM, Brandon Waterloo
<Brandon.Waterloo@matrix.msu.edu> wrote:
> I think I've finally found the problem.  The files that work are PDF version 1.6.  The
> files that do NOT work are PDF version 1.4.  I'll look into updating all the old documents
> to PDF 1.6.
>
> Thanks everyone!
>
> ~Brandon Waterloo
> ________________________________
> From: Ezequiel Calderara [ezechico@gmail.com]
> Sent: Friday, April 08, 2011 11:35 AM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> Maybe those files were created with a different Adobe format version...
>
> See this: http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <Brandon.Waterloo@matrix.msu.edu> wrote:
> A second test has revealed that it has something to do with the contents, and not the
> literal filenames, of the second set of files.  I renamed one of the second-format files
> and tested it, and Solr still failed.  However, the problem still only applies to files
> with the second naming format.
> ________________________________________
> From: Brandon Waterloo [Brandon.Waterloo@matrix.msu.edu]
> Sent: Friday, April 08, 2011 10:40 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> I had some time to do some research into the problems.  From what I can tell, it appears
> Solr is tripping up over the filename.  These are strictly examples, but Solr handles this
> filename fine:
>
> 32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf
>
> However, it fails with either a parsing error or an EOF exception on this filename:
>
> 32-130-A08-84-al.sff.document.nusa197102.pdf
>
> The only significant difference is that the second filename contains multiple periods.
> As there are about 1700 files whose filenames follow the second format, it is simply not
> possible to change their filenames.  In addition, they are being used by other applications.
>
> Is there something I can change in the Solr configs to fix this issue, or am I simply SOL
> until the Solr dev team can work on this? (assuming I put in a ticket)
>
> Thanks again everyone,
>
> ~Brandon Waterloo
>
>
> ________________________________________
> From: Chris Hostetter [hossman_lucene@fucit.org]
> Sent: Tuesday, April 05, 2011 3:03 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> : It wasn't just a single file, it was dozens of files all having problems
> : toward the end just before I killed the process.
>       ...
> : That is by no means all the errors, that is just a sample of a few.
> : You can see they all threw HTTP 500 errors.  What is strange is, nearly
> : every file succeeded before about the 2200-files-mark, and nearly every
> : file after that failed.
>
> ...the root question is: do those files *only* fail if you have already
> indexed ~2200 files, or do they fail if you start up your server and index
> them first?
>
> there may be a resource issue (if it only happens after indexing ~2200 files) or
> it may just be a problem with a large number of your PDFs that your
> iteration code just happens to get to at that point.
>
> If it's the former, then there may be something buggy about how Solr is
> using Tika that causes the problem -- if it's the latter, then it's a straight
> Tika parsing issue.
>
> : > now, commit is set to false to speed up the indexing, and I'm assuming that
> : > Solr should be auto-committing as necessary.  I'm using the default
> : > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once
>
> Solr does no autocommitting by default; you need to check your
> solrconfig.xml.
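>
> A minimal illustration (the values are placeholders, not recommendations) of what
> enabling autocommit looks like inside <updateHandler> in solrconfig.xml:
>
>     <autoCommit>
>       <maxDocs>1000</maxDocs>   <!-- commit after this many added docs -->
>       <maxTime>60000</maxTime>  <!-- or after this many milliseconds -->
>     </autoCommit>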
>
>
> -Hoss
>
>
>
> --
> ______
> Ezequiel.
>
> Http://www.ironicnet.com
>



--
Lance Norskog
goksron@gmail.com
