pdfbox-users mailing list archives

From jorgeeflorez <jorgeeduardoflo...@gmail.com>
Subject Re: Text extraction example
Date Tue, 13 Nov 2018 04:50:56 GMT
Hi. OK, I understand. Never mind :)
Thanks.

On Mon, Nov 12, 2018 at 11:16 PM, Tilman Hausherr <
THausherr@t-online.de> wrote:

> On 12.11.2018 at 19:56, jorgeeflorez wrote:
> > Hi all,
> >
> > first, I want to thank Tilman for his effort in getting the text from a
> > page regardless of its rotation.
> > (https://issues.apache.org/jira/browse/PDFBOX-4371).
> >
> > second, I want to share with you a small application I created using C#.
> > It uses the iTextSharp library and a custom text extraction strategy to
> > get the text.
> >
> > Application:
> > https://drive.google.com/file/d/1CmKvkib_ONTytwaoIrrmMdVyICXO1IPd/view?usp=sharing
> >
> > Class that processes the text:
> > https://drive.google.com/file/d/1u3VykdQR8Eh9ooRiqxc4q2_20w3lw8gw/view?usp=sharing
> >
> > Sample PDF files:
> > https://drive.google.com/file/d/1KdpQEIEbIl5ZETq33C2X8JVM5qfMXlDg/view?usp=sharing
> >
> > I was trying to port the code to Java and make it work using PDFBox
> > objects, but so far I have not been able to.
> >
> > Basically, the magic occurs in the method RenderText (based on other code
> > I found in a web page I don't remember :( ). It uses vectors (the origin
> > is the lower-left corner of the page) to determine things such as whether
> > there is a line break or whether whitespace must be put between glyphs.
> >
> > I just hope this code sheds some light on how to adjust or improve (if
> > you consider it necessary) text extraction.
>
>
> Hi, thanks, but sorry, there are several reasons why I can't use it:
> 1) I don't know itext, 2) I can't use code "found in a web page I don't
> remember" (license!), 3) I don't run exe files.
>
> I think our TextStripper code is similar in that it uses some algorithms
> to decide where to insert blanks, and whether glyphs are on a line or not.
>
> Tilman
>
>
> >
> > That's it.
> >
> > Thank you.
> > Best Regards.
> >
> > Jorge Eduardo Flórez
> >
>
>

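For readers following the thread: the kind of vector-based heuristic described in the quoted message (deciding from glyph positions whether a line break occurred and whether a blank belongs between two runs) could look roughly like the sketch below. This is not the code from the linked C# class, nor PDFBox's stripper. Every name here (GlyphTextAssembler, Run, SAME_LINE_TOLERANCE) is hypothetical, the threshold values are assumptions, and the input is assumed to arrive already in reading order with baseline start and end points in page units.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from PDFBox or iTextSharp.
class GlyphTextAssembler {

    /** One run of text plus the start and end points of its baseline. */
    static final class Run {
        final String text;
        final double startX, startY, endX, endY;
        final double spaceWidth; // width of a single space in this run's font

        Run(String text, double startX, double startY,
            double endX, double endY, double spaceWidth) {
            this.text = text;
            this.startX = startX;
            this.startY = startY;
            this.endX = endX;
            this.endY = endY;
            this.spaceWidth = spaceWidth;
        }
    }

    // Assumed tolerance (page units) for "still on the same baseline".
    static final double SAME_LINE_TOLERANCE = 1.0;

    /** Joins runs, inserting '\n' on a baseline change and ' ' on a wide gap. */
    static String assemble(List<Run> runs) {
        StringBuilder out = new StringBuilder();
        Run prev = null;
        for (Run cur : runs) {
            if (prev != null) {
                if (!onSameLine(prev, cur)) {
                    out.append('\n');
                } else if (needsSpace(prev, cur)) {
                    out.append(' ');
                }
            }
            out.append(cur.text);
            prev = cur;
        }
        return out.toString();
    }

    /**
     * Perpendicular distance from cur's start point to the line through
     * prev's baseline. Using the baseline vector instead of comparing y
     * values keeps the test independent of the text rotation.
     */
    static boolean onSameLine(Run prev, Run cur) {
        double bx = prev.endX - prev.startX;
        double by = prev.endY - prev.startY;
        double dx = cur.startX - prev.startX;
        double dy = cur.startY - prev.startY;
        double len = Math.hypot(bx, by);
        if (len == 0) {
            // Degenerate baseline: fall back to the point-to-point distance.
            return Math.hypot(dx, dy) <= SAME_LINE_TOLERANCE;
        }
        double cross = bx * dy - by * dx;
        return Math.abs(cross) / len <= SAME_LINE_TOLERANCE;
    }

    /** A gap wider than half a space (an assumed ratio) becomes a blank. */
    static boolean needsSpace(Run prev, Run cur) {
        if (prev.text.endsWith(" ") || cur.text.startsWith(" ")) {
            return false; // a blank is already present in the text itself
        }
        double gap = Math.hypot(cur.startX - prev.endX, cur.startY - prev.endY);
        return gap > cur.spaceWidth / 2.0;
    }

    public static void main(String[] args) {
        List<Run> runs = new ArrayList<>();
        runs.add(new Run("Hello", 10, 700, 40, 700, 3));
        runs.add(new Run("world", 43, 700, 70, 700, 3));
        runs.add(new Run("Next", 10, 686, 30, 686, 3));
        System.out.println(assemble(runs));
    }
}
```

Because the same-line test measures distance from the previous baseline rather than comparing y coordinates, the same logic applies to rotated text. A real extractor would also need to sort the runs and handle overlapping or right-to-left text, which this sketch leaves out.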