pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Bad text extraction result
Date Tue, 01 Mar 2016 21:15:54 GMT
On 01.03.2016 at 21:53, Francisco Andrés Fernández wrote:
> I'm sorry. That was only the case when you use pdftotext to extract text.
> My apologies.

No problem... now I understand what this /ActualText thing is about. This

     /Span << /ActualText (\376\377\000\255) >> BDC
       ( ) Tj
     EMC

means that the space is to be replaced with \376\377\000\255. \376\377 
is 0xFEFF, the UTF-16 big-endian byte order mark that flags the string 
as Unicode, and \000\255 is 0x00AD, i.e. U+00AD, the soft hyphen.
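
The decoding described above can be checked with a short Python sketch. It is not PDFBox code; the function name decode_pdf_literal is made up for illustration, and it assumes the string body contains only octal escapes (no nested parentheses or other escape sequences). Strings without the FEFF prefix are treated as Latin-1 here, which is a simplification of PDFDocEncoding.

```python
import re


def decode_pdf_literal(body: str) -> str:
    """Decode the body of a PDF literal string that uses only octal escapes.

    Hypothetical helper for illustration; a real parser must also handle
    escapes such as \\n, \\( and \\), plus balanced parentheses.
    """
    # Turn each \ddd octal escape into the single byte it denotes.
    raw = re.sub(
        r"\\([0-7]{1,3})",
        lambda m: chr(int(m.group(1), 8) & 0xFF),
        body,
    ).encode("latin-1")

    # A leading FE FF marks the remaining bytes as UTF-16 big-endian text.
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")

    # Simplification: treat everything else as Latin-1.
    return raw.decode("latin-1")


actual_text = decode_pdf_literal(r"\376\377\000\255")
print(f"U+{ord(actual_text):04X}")  # prints "U+00AD"
```

So the /ActualText entry asks the extractor to emit a soft hyphen in place of the space that is actually drawn.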

Tilman


>
> Francisco
>
> On Tue, Mar 1, 2016 at 16:56, Francisco Andrés Fernández <
> franaf@gmail.com> wrote:
>
>> Hi Tilman, regarding this issue, I've found a workaround that does not
>> solve the PDFBox problem but might help.
>> I filtered my documents by replacing matches of the regex '[\xAD]' (0xAD
>> is the hex code for the soft hyphen), as that seems to be the character
>> that gets inserted between normal characters.
>> After that, the text appears as required.
>> Regards,
>>
>> Francisco
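
The filtering workaround quoted above could look roughly like this in Python. This is only a sketch of the idea, not what Francisco actually ran: the helper name strip_soft_hyphens and the sample string are made up, and it assumes the extracted text is already available as a Unicode string and that matches are replaced with nothing.

```python
import re

# U+00AD is the soft hyphen; the character class mirrors the '[\xAD]' regex.
SOFT_HYPHEN = re.compile("[\u00ad]")


def strip_soft_hyphens(text: str) -> str:
    """Remove every soft hyphen from already-extracted text."""
    return SOFT_HYPHEN.sub("", text)


# Made-up sample: soft hyphens sitting between normal characters.
extracted = "ex\u00adtrac\u00adtion"
print(strip_soft_hyphens(extracted))  # prints "extraction"
```

Note that this also removes soft hyphens that were legitimately part of the document, so it only makes sense as a post-processing step when those are not needed.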
>>
>> On Thu, Feb 25, 2016 at 14:14, Francisco Andrés Fernández <
>> franaf@gmail.com> wrote:
>>
>>> Thanks Tilman.
>>>
>>> On Thu, Feb 25, 2016 at 14:08, Tilman Hausherr <
>>> THausherr@t-online.de> wrote:
>>>
>>>> Thanks. The issue is here:
>>>> https://issues.apache.org/jira/browse/PDFBOX-3248
>>>>
>>>> Tilman
>>>>
>>>> On 25.02.2016 at 12:44, Francisco Andrés Fernández wrote:
>>>>> As additional information, I've found 2 related posts (about other
>>>>> tools) on StackOverflow:
>>>>>
>>>> http://stackoverflow.com/questions/34579824/itext-how-to-tweak-text-extraction
>>>> http://stackoverflow.com/questions/22671974/itext-reading-pdf-1s-as-up-arrows-error/22688775#22688775
>>>>> Regards
>>>>>
>>>>> On Wed, Feb 24, 2016 at 22:50, Francisco Andrés Fernández <
>>>>> franaf@gmail.com> wrote:
>>>>>
>>>>>> Many thanks Tilman.
>>>>>> I'll try to find a workaround in the meantime.
>>>>>> Cheers,
>>>>>>
>>>>>> Francisco
>>>>>>
>>>>>> On Wed, Feb 24, 2016 at 17:47, Tilman Hausherr <
>>>>>> THausherr@t-online.de> wrote:
>>>>>>
>>>>>>> I'll create an issue in JIRA later or tomorrow, but don't expect
>>>>>>> that this will be fixed quickly (unless I missed something
>>>>>>> obvious). We want to release 2.0 before doing any "big" work on
>>>>>>> text extraction.
>>>>>>>
>>>>>>> Tilman
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>>>
>>>>>>>
>>>>



