pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Bad text extraction result
Date Wed, 24 Feb 2016 19:29:25 GMT
On 24.02.2016 at 20:17, Francisco Andrés Fernández wrote:
> Hi all,
> I'm extracting some text from PDFs, through Tika in Solr. As a result, some
> important words end up with spaces between their characters.
> For example, the word "Subtitle", which I want to detect, might be
> written as "S u b t i t l e".

You could try adjusting spacingTolerance or averageCharTolerance in
PDFTextStripper (check whether Tika exposes these settings). However,
if you loosen them enough that these spaces are ignored, spaces will
probably also be dropped in other places where you want to keep them.
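
To illustrate the trade-off, here is a rough, self-contained sketch. It
is a deliberately simplified model and not PDFBox's actual algorithm:
the class name, the glyph gaps and the space width are all made-up
values for illustration. It inserts a space whenever the gap before a
glyph exceeds tolerance * spaceWidth. In PDFBox itself the two values
are set with setSpacingTolerance(float) and
setAverageCharTolerance(float) on PDFTextStripper, and the real
word-break logic is more involved than this.

```java
// Simplified model only; this is NOT how PDFTextStripper is implemented.
final class SpacingToleranceSketch {

    /**
     * Rebuilds a line of text from glyphs, inserting a space wherever the
     * horizontal gap before a glyph is larger than tolerance * spaceWidth.
     */
    static String join(char[] glyphs, float[] gapBefore,
                       float spaceWidth, float tolerance) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            if (i > 0 && gapBefore[i] > tolerance * spaceWidth) {
                sb.append(' ');
            }
            sb.append(glyphs[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical line: a letter-spaced "Sub" (gaps of 3.0 between
        // letters), followed by the two one-letter words "a" and "b",
        // which are separated by a genuine but narrow space of 2.8.
        char[] glyphs = {'S', 'u', 'b', 'a', 'b'};
        float[] gaps  = {0f, 3.0f, 3.0f, 4.0f, 2.8f};
        float spaceWidth = 5f; // assumed width of a space in this font

        // Strict tolerance: the letter-spaced word is split up.
        System.out.println(join(glyphs, gaps, spaceWidth, 0.5f)); // S u b a b

        // Looser tolerance: "Sub" is repaired, but "a b" collapses to "ab".
        System.out.println(join(glyphs, gaps, spaceWidth, 0.7f)); // Sub ab
    }
}
```

With the looser value the letter-spaced word comes out whole, but the
genuine space between "a" and "b" is lost as well, which is the side
effect described above.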

If possible, please upload your file somewhere.


> How could I make PDFBox detect this type of word occurrence?
> Many thanks,
> Francisco
