pdfbox-users mailing list archives

From Francisco Andrés Fernández <fra...@gmail.com>
Subject Re: Bad text extraction result
Date Tue, 01 Mar 2016 20:53:46 GMT
I'm sorry. That was only the case when you use pdftotext to extract text.
My apologies.

Francisco

On Tue, Mar 1, 2016 at 16:56, Francisco Andrés Fernández <franaf@gmail.com> wrote:

> Hi Tilman, regarding this issue, I've found a workaround that does not
> solve the PDFBox problem but might help.
> I've filtered my documents by replacing the regex '[\xAD]', which is hex for
> 'soft hyphen', as that seems to be the symbol that gets inserted between
> normal characters.
> After that, the text appears as required.
> Regards,
>
> Francisco
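
The soft-hyphen filter described in this message could be sketched in Java roughly as follows. This is only an illustration under assumptions: the class and method names are hypothetical, and the code post-processes a string that has already been extracted; it does not call PDFBox or pdftotext. Per the follow-up at the top of the thread, the stray soft hyphens were only seen in pdftotext output.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: removes SOFT HYPHEN (code point 0xAD)
// characters from text that was already extracted from a PDF.
class SoftHyphenFilter {
    // Same character class as in the message: '[\xAD]' (hex AD = soft hyphen).
    private static final Pattern SOFT_HYPHEN = Pattern.compile("[\\xAD]");

    // Returns the input with every soft hyphen removed; other characters,
    // including ordinary hyphens, are left untouched.
    static String stripSoftHyphens(String extractedText) {
        return SOFT_HYPHEN.matcher(extractedText).replaceAll("");
    }
}
```

Calling `SoftHyphenFilter.stripSoftHyphens(text)` on each extracted page or document string would apply the filter before any further processing. Precompiling the pattern avoids recompiling the regex on every call when many documents are filtered.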
>
> On Thu, Feb 25, 2016 at 14:14, Francisco Andrés Fernández <franaf@gmail.com> wrote:
>
>> Thanks Tilman.
>>
>> On Thu, Feb 25, 2016 at 14:08, Tilman Hausherr <THausherr@t-online.de> wrote:
>>
>>> Thanks. The issue is here:
>>> https://issues.apache.org/jira/browse/PDFBOX-3248
>>>
>>> Tilman
>>>
>>> On 25.02.2016 at 12:44, Francisco Andrés Fernández wrote:
>>> > As additional information, I've found 2 related posts (about other tools)
>>> > on StackOverflow:
>>> > http://stackoverflow.com/questions/34579824/itext-how-to-tweak-text-extraction
>>> > http://stackoverflow.com/questions/22671974/itext-reading-pdf-1s-as-up-arrows-error/22688775#22688775
>>> > Regards
>>> >
>>> > On Wed, Feb 24, 2016 at 22:50, Francisco Andrés Fernández <franaf@gmail.com> wrote:
>>> >
>>> >> Many thanks Tilman.
>>> >> I'll try to find a workaround in the meantime.
>>> >> Cheers,
>>> >>
>>> >> Francisco
>>> >>
>>> >> On Wed, Feb 24, 2016 at 17:47, Tilman Hausherr <THausherr@t-online.de> wrote:
>>> >>
>>> >>> I'll create an issue in JIRA later or tomorrow, but don't expect that
>>> >>> this will be fixed quickly (unless I missed something obvious). We want
>>> >>> to release 2.0 before doing any "big" work on text extraction.
>>> >>>
>>> >>> Tilman
>>> >>>
>>> >>> ---------------------------------------------------------------------
>>> >>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> >>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>> >>>
>>> >>>
