pdfbox-dev mailing list archives

From Roger Håkansson (JIRA) <j...@apache.org>
Subject [jira] [Updated] (PDFBOX-1305) Text extraction takes huge amount of time on some files
Date Wed, 09 May 2012 15:01:50 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roger Håkansson updated PDFBOX-1305:
------------------------------------

    Attachment: 20020101ab3x012a.pdf
    
> Text extraction takes huge amount of time on some files
> -------------------------------------------------------
>
>                 Key: PDFBOX-1305
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1305
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>         Environment: Same phenomenon on Windows 7, Solaris 10 and CentOS 5.7. Same result with JDK 7u4 and JDK 6u32
>            Reporter: Roger Håkansson
>         Attachments: 20020101ab3x012a.pdf
>
>
> I've got 1.2M single-page PDF files which I'm indexing using Solr (which uses Tika, which uses PDFBox), and some of them take between 20 minutes and an hour to index.
> This is a huge problem for me: in 48 hours I've indexed about 45k files, and 19 hours of that time was spent on just 279 files.
> I've traced it to PDFBox taking a lot of time extracting the text from the documents.
> I've tested extracting the text using pdfbox-app's ExtractText with the same result: the text is extracted, but it takes forever...
> The attached file took about 23 minutes (using ExtractText), and in the result I can see a lot of "rubbish text" which I don't see in the text extracted from files that take a normal amount of time (up to a few seconds per file) to parse.
> When running truss (on Solaris; strace on Linux) on the java process, I can see a lot of SEGV due to FLTBOUNDS. I don't know if it's related to this problem, but I just want to mention it.
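
To put the indexing figures quoted above in perspective, here is a rough back-of-the-envelope calculation in Java. It uses only the numbers from the report (48 hours, about 45k files, 19 hours on 279 files) and assumes the files were processed one after another; the class and method names are made up for illustration and are not part of PDFBox, Tika or Solr.

```java
public class IndexingStats {

    /** Average seconds spent per file, given total hours and a file count. */
    public static double avgSecondsPerFile(double hours, long files) {
        return hours * 3600.0 / files;
    }

    public static void main(String[] args) {
        // Figures from the report; "about 45k" is approximate, so treat
        // the results as estimates rather than measurements.
        double totalHours = 48.0;
        long totalFiles = 45_000;
        double slowHours = 19.0;
        long slowFiles = 279;

        // Assumption: sequential processing, so wall-clock time adds up per file.
        double slowAvg = avgSecondsPerFile(slowHours, slowFiles);
        double otherAvg = avgSecondsPerFile(totalHours - slowHours,
                                            totalFiles - slowFiles);

        System.out.printf("slow files:  %.1f s/file (%.1f min)%n",
                slowAvg, slowAvg / 60.0);
        System.out.printf("other files: %.1f s/file%n", otherAvg);
        System.out.printf("slow share:  %.2f%% of files, %.1f%% of time%n",
                100.0 * slowFiles / totalFiles,
                100.0 * slowHours / totalHours);
    }
}
```

Under these assumptions the 279 slow files average roughly four minutes each, while the remaining files average a little over two seconds each, which is in line with the "up to a few seconds per file" mentioned for normal files. Put differently, well under 1% of the files account for about 40% of the total indexing time.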

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
