pdfbox-users mailing list archives

From Pulkit Kapur <pka...@seas.upenn.edu>
Subject Re: Fwd: Trouble reading IEEE pdf
Date Thu, 02 Feb 2017 19:26:57 GMT
Thanks. That's what I would expect to read.
Also, thanks for pointing to the latest version. I pointed to the
pdfbox-app-2.0.4.jar and the fontbox-2.0.4.jar files.

Since I want to read over 1000 PDF documents programmatically in Matlab, I
am not using the command line, but the Java library from within Matlab.
I am not sure why I am still *not* getting the text using getText():
{code}
pdfdoc = org.pdfbox.pdmodel.PDDocument;
pdfdoc.close;
reader = org.pdfbox.util.PDFTextStripper;

% list all the pdf files in the current folder
% listing = dir('**/*.pdf');
listing = dir('*.pdf');

% process each listed PDF
for i = 1:numel(listing)
    pdfdoc = pdfdoc.load(fullfile(listing(i).folder,listing(i).name));
    pdfdoc.isEncrypted;

    %% text, with plenty of padding
    pdfstr = reader.getText(pdfdoc);                 %#ok
    pdfdoc.close
end
{code}
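
Note that the `org.pdfbox.*` class names in the snippet above are the PDFBox 0.7.3 package names. In PDFBox 2.0.x the classes are `org.apache.pdfbox.pdmodel.PDDocument` and `org.apache.pdfbox.text.PDFTextStripper`, and `PDDocument.load` is a static method that takes a `java.io.File` rather than a file name string. If the old names still resolve in Matlab, the 0.7.3 jar is probably still on the Java class path. A minimal sketch of the same loop against 2.0.4 might look like the following. It assumes that pdfbox-app-2.0.4.jar sits in the current folder and that the 0.7.3 jars have been removed from the class path. The jar location and the variable names `stripper` and `pdfFile` are illustrative, not taken from the original code.

```matlab
% Assumption: pdfbox-app-2.0.4.jar is in the current folder and the old
% PDFBox-0.7.3.jar / FontBox-0.1.0.jar are no longer on the Java class path.
javaaddpath(fullfile(pwd, 'pdfbox-app-2.0.4.jar'));

% list all the pdf files in the current folder
listing = dir('*.pdf');

% PDFBox 2.0.x class name (was org.pdfbox.util.PDFTextStripper in 0.7.3)
stripper = org.apache.pdfbox.text.PDFTextStripper();

for i = 1:numel(listing)
    % In 2.0.x, PDDocument.load is static and expects a java.io.File
    pdfFile = java.io.File(fullfile(listing(i).folder, listing(i).name));
    pdfdoc = org.apache.pdfbox.pdmodel.PDDocument.load(pdfFile);

    % Convert the returned java.lang.String to a Matlab char array
    pdfstr = char(stripper.getText(pdfdoc));

    pdfdoc.close();
end
```

Running `javaclasspath` before and after shows which PDFBox jars Matlab has actually loaded.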



On Thu, Feb 2, 2017 at 2:08 PM, Tilman Hausherr <THausherr@t-online.de>
wrote:

> On 02.02.2017 at 19:59, Pulkit Kapur wrote:
>
>> My apologies. This was very careless of me. I did not realize scribd would
>> want you to register to download.
>>
>> I have uploaded the document here:
>> http://www.filedropper.com/0024iros2016
>>
>> My code is in Matlab (not the command-line interface) and I am using
>> *PDFBox-0.7.3.jar* and *FontBox-0.1.0.jar*
>>
>
>
> Thank you, now I could read the file.
>
> The current PDFBox version is 2.0.4.  Your version 0.7.3 is from 2006,
> thus over 10 years old. Many bugs (probably over 2000) have been fixed
> since then.
>
> I ran the ExtractText command line utility with your file and the result
> is 40KB large. It starts with this:
>
> A Reactive Stepping Algorithm Based on Preview Controller with Observer
> for Biped Robots
> Oliver Urbann, Matthias Hofmann
> Abstract— Reactive stepping is an important utility to regain
> balance when bipedal walking motions are disturbed. This
> paper sheds light on the reasons for humanoid robots to
> fall down. It presents a method to calculate modifications
> of predefined foot placements with the objective to minimize
> deviations of the Zero Moment Point from a reference without
> interrupting the walk. The calculation is in closed-form, and
> is embedded into a well-evaluated preview controller with
> observer based on the 3D Linear Inverted Pendulum Mode
> (3D-LIPM). Experiments in simulation and on a physical robot
> prove the benefit of the proposed system.
>
> (...)
>
> Tilman
>
>
>> I am using the *getText* function and using the Java library with Matlab.
>>
>> Here is my matlab code snippet:
>> //-------------------------------------------------------------------
>> pdfdoc = pdfdoc.load(fullfile(listing(i).folder,listing(i).name));
>> pdfdoc.isEncrypted;
>>
>> %% text, with plenty of padding
>> pdfstr = reader.getText(pdfdoc);                 %#ok
>> pdfdoc.close
>> //-------------------------------------------------------------------
>> The text I do get is:
>> "2016 IEEE International Conference on Robotics and Automation (ICRA)
>> Stockholm, Sweden, May 16-21, 2016
>> 978-1-4673-8026-3/16/$31.00 ©2016 IEEE 1366
>> 1367
>> 1368
>> 1369
>> 1370
>> 1371
>> 1372
>> 1373
>> "
>> Which is the header on the first page and the footer on all the pages
>> (page numbers).
>>
>> Thanks,
>>
>> Pulkit
>>
>> On Thu, Feb 2, 2017 at 11:03 AM, Tilman Hausherr <THausherr@t-online.de>
>> wrote:
>>
>>> On 02.02.2017 at 16:10, Pulkit Kapur wrote:
>>>
>>>> Hi
>>>>
>>>> I have uploaded the PDF here:
>>>> https://www.scribd.com/document/338221804/0024-iros-2016
>>>>
>>> Hello Pulkit,
>>>
>>> This site requires registration. This is a "don't" from the list:
>>> https://pdfbox.apache.org/support.html
>>>
>>> I don't want to register.
>>>
>>> Please find a sharehoster that doesn't require registration to download.
>>>
>>> If the XObject that Karl Heinz Kremer mentioned is a form then text
>>> extraction should work, especially if it was possible to extract with
>>> Adobe Reader. If it is an image then it won't. Apache Tika might help.
>>>
>>> Please mention what you did to get the text with PDFBox, and what version
>>> you were using.
>>>
>>> You wrote "using readText function from the pdfbox library". There is no
>>> "readText" method in PDFBox. Could it be that you used a different product?
>>>
>>> Tilman
>>>
>>>
>>>> I did some more diagnosis last night and it seems that there are two
>>>> layers in the PDF: one with the content and the other with headers and
>>>> footers. PDFBox is only reading the headers and footers.
>>>> I suspect this must be common with all conference proceedings.
>>>>
>>>> Thanks,
>>>>
>>>> Pulkit
>>>>
>>>> On Thu, Feb 2, 2017 at 1:21 AM, Tilman Hausherr <THausherr@t-online.de>
>>>> wrote:
>>>>
>>>>> On 02.02.2017 at 05:55, Pulkit Kapur wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I am trying to read some past years' IEEE conference proceedings I have.
>>>>>> I can read the PDF using Acrobat and select the text.
>>>>>>
>>>>>> But when I try to read the text using the readText function from the
>>>>>> PDFBox library, I only get the headers and footers in the PDF.
>>>>>>
>>>>>> I did check that the document is not encrypted.
>>>>>> Also, my code works on other PDF documents, but all IEEE proceedings
>>>>>> that are downloaded from IEEE fail to work.
>>>>>>
>>>>>> I have attached the PDF document with this message.
>>>>>>
>>>>> Please upload the PDF somewhere; PDF attachments are not allowed here.
>>>>>
>>>>> Tilman
>>>>>
>>>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>
>>>
>>>
>
