pdfbox-users mailing list archives

From Francisco Andrés Fernández <fra...@gmail.com>
Subject Re: Bad text extraction result
Date Wed, 24 Feb 2016 19:47:56 GMT
Hi Tilman, many thanks for your answer.
I couldn't find any configuration file to tweak this.
Here is a link to the PDF file, in case it helps you figure out
what the problem is.
https://drive.google.com/file/d/0B0PMZsHkpcJRSEpBSWhtQndKZTg/view?usp=sharing
Many thanks in advance,

Francisco

On Wed, 24 Feb 2016 at 16:29, Tilman Hausherr <THausherr@t-online.de> wrote:

> On 24.02.2016 at 20:17, Francisco Andrés Fernández wrote:
> > Hi all,
> > I'm extracting some text from PDF files through Tika in Solr. As a result,
> > some important words end up with spaces between their characters.
> > For example, I could have the word "Subtitle" that I want to detect,
> > written like "S u b t i t l e".
>
> You could try to modify spacingTolerance or averageCharTolerance in
> PDFTextStripper (find out if Tika supports this), but it is likely that
> if spaces are ignored here, they would also be ignored in other places
> where you don't want that.
>
> If possible, please upload your file somewhere.
>
> Tilman
>
> > How could I make PDFBox detect this type of word occurrence?
> > Many thanks,
> >
> > Francisco
> >

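If spacingTolerance and averageCharTolerance turn out not to be reachable through Tika, one possible workaround is to post-process the extracted text instead of changing the extraction itself. The sketch below is not part of PDFBox or Tika: the class name `SpacedWordJoiner` and the method `collapse` are hypothetical, and it uses only the standard `java.util.regex` package. It assumes that a run of three or more single letters separated by single spaces is a letter-spaced word such as "S u b t i t l e". That assumption can misfire on genuine sequences of one-letter tokens (for example "a b c" in an enumeration), so treat it as a heuristic.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a PDFBox or Tika API.
class SpacedWordJoiner {

    // A run of at least three single letters separated by single spaces,
    // not directly attached to another letter or digit on either side.
    private static final Pattern SPACED_RUN = Pattern.compile(
            "(?<![\\p{L}\\p{N}])\\p{L}(?: \\p{L}){2,}(?![\\p{L}\\p{N}])");

    // Removes the inner spaces from every letter-spaced run in the text.
    static String collapse(String text) {
        Matcher m = SPACED_RUN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String joined = m.group().replace(" ", "");
            m.appendReplacement(out, Matcher.quoteReplacement(joined));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapse("The S u b t i t l e is here"));
    }
}
```

Because this runs after extraction, it leaves normal word spacing elsewhere in the document untouched, which avoids the side effect Tilman warns about. The minimum run length of three is an arbitrary choice and may need adjusting for the documents at hand.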