pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Fwd: Trouble reading IEEE pdf
Date Thu, 02 Feb 2017 19:08:49 GMT
On 02.02.2017 at 19:59, Pulkit Kapur wrote:
> My apologies. This was very careless of me. I did not realize scribd would
> want you to register to download.
>
> I have uploaded the document here: http://www.filedropper.com/0024iros2016
>
> My code is in MATLAB (not the command-line interface), and I am using
> *PDFBox-0.7.3.jar* and *FontBox-0.1.0.jar*


Thank you, I can read the file now.

The current PDFBox version is 2.0.4.  Your version 0.7.3 is from 2006, 
thus over 10 years old. Many bugs (probably over 2000) have been fixed 
since then.
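
As a rough illustration of what moving to 2.0.x could look like on the MATLAB side, here is a minimal sketch of the same extraction against the 2.0 API. The jar location and the PDF file name are assumptions, so adjust both. In 2.0.x, PDDocument.load is a static method that takes a java.io.File, and PDFTextStripper lives in the org.apache.pdfbox.text package.

```matlab
% Sketch only: assumes pdfbox-app-2.0.4.jar sits in the current folder
% and that the old 0.7.3 jars are no longer on the Java class path.
javaaddpath(fullfile(pwd, 'pdfbox-app-2.0.4.jar'));

% Hypothetical file name; use the path of your own PDF here.
pdfFile = java.io.File(fullfile(pwd, '0024iros2016.pdf'));

% In 2.0.x, load is static and takes a java.io.File.
pdfdoc = javaMethod('load', 'org.apache.pdfbox.pdmodel.PDDocument', pdfFile);
closer = onCleanup(@() pdfdoc.close());   % close the document even on error

% PDFTextStripper moved to the org.apache.pdfbox.text package in 2.0.
stripper = javaObject('org.apache.pdfbox.text.PDFTextStripper');
pdfstr = char(stripper.getText(pdfdoc));  % convert to a MATLAB char array
```

javaMethod and javaObject are used instead of direct class references so that the classes are looked up at run time, after javaaddpath has added the jar.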

I ran the ExtractText command-line utility on your file, and the result 
is 40 KB of text. It starts with this:

A Reactive Stepping Algorithm Based on Preview Controller with Observer
for Biped Robots
Oliver Urbann, Matthias Hofmann
Abstract— Reactive stepping is an important utility to regain
balance when bipedal walking motions are disturbed. This
paper sheds light on the reasons for humanoid robots to
fall down. It presents a method to calculate modifications
of predefined foot placements with the objective to minimize
deviations of the Zero Moment Point from a reference without
interrupting the walk. The calculation is in closed-form, and
is embedded into a well-evaluated preview controller with
observer based on the 3D Linear Inverted Pendulum Mode
(3D-LIPM). Experiments in simulation and on a physical robot
prove the benefit of the proposed system.

(...)

Tilman


> I am using the *getText* function and using the Java library with MATLAB.
> Here is my MATLAB code snippet:
> //-------------------------------------------------------------------
> pdfdoc = pdfdoc.load(fullfile(listing(i).folder,listing(i).name));
>      pdfdoc.isEncrypted;
>
>      %% text, with plenty of padding
>      pdfstr = reader.getText(pdfdoc);                 %#ok
>      pdfdoc.close
> //-------------------------------------------------------------------------------------
> The text I do get is:
> "2016 IEEE International Conference on Robotics and Automation (ICRA)
> Stockholm, Sweden, May 16-21, 2016
> 978-1-4673-8026-3/16/$31.00 ©2016 IEEE 1366
> 1367
> 1368
> 1369
> 1370
> 1371
> 1372
> 1373
> "
> Which is the header on the first page and the footer on all the pages (page
> numbers).
>
> Thanks,
>
> Pulkit
>
> On Thu, Feb 2, 2017 at 11:03 AM, Tilman Hausherr <THausherr@t-online.de>
> wrote:
>
>> On 02.02.2017 at 16:10, Pulkit Kapur wrote:
>>
>>> Hi
>>>
>>> I have uploaded the pdf here:
>>> https://www.scribd.com/document/338221804/0024-iros-2016
>>>
>> Hello Pulkit,
>>
>> This site requires registration. This is a "don't" from the list:
>> https://pdfbox.apache.org/support.html
>>
>> I don't want to register.
>>
>> Please find a sharehoster that doesn't require registration to download.
>>
>> If the XObject that Karl Heinz Kremer mentioned is a form then text
>> extraction should work, especially if it was possible to extract with Adobe
>> Reader. If it is an image then it won't. Apache Tika might help.
>>
>> Please mention what you did to get the text with PDFBox, and what version
>> you were using.
>>
>> You wrote "using readText function from the pdfbox library". There is no
>> "readText" method in PDFBox. Could it be that you used a different product?
>>
>> Tilman
>>
>>
>>> I did some more diagnosis last night, and it seems that there are two
>>> layers in the PDF: one with the content and the other with the headers
>>> and footers. PDFBox is only reading the headers and footers.
>>> I suspect this is common to all conference proceedings.
>>>
>>> Thanks,
>>>
>>> Pulkit
>>>
>>> On Thu, Feb 2, 2017 at 1:21 AM, Tilman Hausherr <THausherr@t-online.de>
>>> wrote:
>>>
>>>> On 02.02.2017 at 05:55, Pulkit Kapur wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I am trying to read some past years' IEEE conference proceedings I have.
>>>>> I can read the PDF using Acrobat and select the text.
>>>>>
>>>>> But when I try to read the text using readText function from the pdfbox
>>>>> library, I only get the headers and footers in the PDF.
>>>>>
>>>>> I did check that the document is not encrypted.
>>>>> Also, my code works on other PDF documents, but all IEEE proceedings that
>>>>> are downloaded from IEEE fail to work.
>>>>>
>>>>> I have attached the PDF document with this message.
>>>>>
>>>> Please upload the PDF somewhere; PDF attachments are not allowed here.
>>>>
>>>>
>>>> Tilman
>>>>
>>>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>
>>



