pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Bad text extraction result
Date Wed, 24 Feb 2016 20:36:13 GMT
I tried all the settings without success. I was unable to extract
"Cada frasco ampolla", which looked pretty obvious; it always came out as
"Ca da fras co ampo lla".

Then I looked into the content stream and found this:

     6 0 1.058 6 122.0924 312.51 Tm
     (Ca) Tj
     /Span << /ActualText (\376\377\000\255) >> BDC
       ( ) Tj
     EMC
     [ (da ) -301 (fras) ] TJ
     /Span << /ActualText (\376\377\000\255) >> BDC
       ( ) Tj
     EMC
     [ (co ) -301 (ampo) ] TJ
     /Span << /ActualText (\376\377\000\255) >> BDC
       ( ) Tj
     EMC
     [ (lla ) -301 (con) ] TJ

So the content stream really does contain space glyphs at those points,
and we keep them. Adobe is smarter and ignores them, because they are
overwritten thanks to the "-301" you see (a positioning adjustment inside
the TJ array).
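
For reference, the /ActualText value in those spans can be decoded by hand.
The following is a minimal sketch in plain Java (no PDFBox calls; the class
and method names are made up for illustration). It assumes the string only
uses \ddd octal escapes and that a leading FE FF marks UTF-16BE text, as the
PDF specification defines for text strings.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
class ActualTextDecode {

    // Turn the \ddd octal escapes of a PDF literal string into raw bytes.
    // Other escapes (\n, \(, ...) are passed through as the escaped character.
    static byte[] unescapeOctal(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '\\') {
                out.write(c);
                i++;
                continue;
            }
            int j = i + 1, value = 0, digits = 0;
            while (j < s.length() && digits < 3
                    && s.charAt(j) >= '0' && s.charAt(j) <= '7') {
                value = value * 8 + (s.charAt(j) - '0');
                j++;
                digits++;
            }
            if (digits > 0) {
                out.write(value & 0xFF);   // octal escape: one byte
                i = j;
            } else if (j < s.length()) {
                out.write(s.charAt(j));    // simplification for non-octal escapes
                i = j + 1;
            } else {
                i = j;                     // trailing backslash: ignore
            }
        }
        return out.toByteArray();
    }

    // A text string starting with the bytes FE FF is UTF-16BE.
    static String decodeTextString(byte[] b) {
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return new String(b, 2, b.length - 2, StandardCharsets.UTF_16BE);
        }
        // Simplification: PDFDocEncoding is approximated by Latin-1 here.
        return new String(b, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String decoded = decodeTextString(unescapeOctal("\\376\\377\\000\\255"));
        System.out.printf("U+%04X%n", (int) decoded.charAt(0)); // prints U+00AD
    }
}
```

The value decodes to U+00AD, the soft hyphen, so each of those spans tags
its space glyph as a soft hyphen rather than as a real space.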

The /ActualText entry might be of some help, but I don't think we
process it.
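
To illustrate how it could help, here is a small simulation in plain Java.
It is only a sketch under assumptions: it does not use or describe PDFBox's
API, the Run class and extract method are hypothetical, the TJ position
adjustments are ignored, and dropping soft hyphens (U+00AD) from the result
is my assumption about what a text extractor would want to do.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the shown strings in the content stream above.
class ActualTextSketch {

    // One shown string; actualText is null outside a /Span with /ActualText.
    static final class Run {
        final String glyphs;
        final String actualText;

        Run(String glyphs, String actualText) {
            this.glyphs = glyphs;
            this.actualText = actualText;
        }
    }

    static final String SOFT_HYPHEN = "\u00AD";

    // The strings from the content stream, in order.
    static List<Run> sampleRuns() {
        return Arrays.asList(
            new Run("Ca", null),
            new Run(" ", SOFT_HYPHEN),
            new Run("da ", null),
            new Run("fras", null),
            new Run(" ", SOFT_HYPHEN),
            new Run("co ", null),
            new Run("ampo", null),
            new Run(" ", SOFT_HYPHEN),
            new Run("lla ", null),
            new Run("con", null));
    }

    // Concatenate the runs, optionally replacing glyphs by their ActualText,
    // then drop soft hyphens from the result.
    static String extract(List<Run> runs, boolean honorActualText) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            boolean replace = honorActualText && r.actualText != null;
            sb.append(replace ? r.actualText : r.glyphs);
        }
        return sb.toString().replace(SOFT_HYPHEN, "");
    }

    public static void main(String[] args) {
        System.out.println(extract(sampleRuns(), false)); // Ca da fras co ampo lla con
        System.out.println(extract(sampleRuns(), true));  // Cada frasco ampolla con
    }
}
```

Without the replacement the simulation reproduces the broken output; with
it, the expected words come out.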

Tilman


On 24.02.2016 at 20:47, Francisco Andrés Fernández wrote:
> Hi Tilman, many thanks for your answer.
> I didn't find any configuration file to tweak this.
> Here is the link to the PDF file, so you can see whether you can figure
> out what the problem is.
> https://drive.google.com/file/d/0B0PMZsHkpcJRSEpBSWhtQndKZTg/view?usp=sharing
> Many thanks in advance,
>
> Francisco
>
> On Wed, Feb 24, 2016 at 16:29, Tilman Hausherr <
> THausherr@t-online.de> wrote:
>
>> On 24.02.2016 at 20:17, Francisco Andrés Fernández wrote:
>>> Hi all,
>>> I'm extracting some text from PDF files, through Tika in Solr. As a
>>> result, some important words end up with spaces between their characters.
>>> For example, I could have the word "Subtitle" that I want to detect,
>>> written like "S u b t i t l e".
>> You could try to modify spacingTolerance or averageCharTolerance in
>> PDFTextStripper (find out if TIKA supports this), but it is likely that
>> if spaces are ignored, they would also be ignored in other places where
>> you don't want that.
>>
>> If possible, please upload your file somewhere.
>>
>> Tilman
>>
>>> How could I make PdfBox detect this type of word occurrence?
>>> Many thanks,
>>>
>>> Francisco
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>
>>



