pdfbox-users mailing list archives

From Gilad Denneboom <gilad.denneb...@gmail.com>
Subject Re: Problems with Java PDFBox
Date Sun, 09 Sep 2012 11:01:08 GMT
I believe it's because that text is set in a non-standard font called
"TTE1890348t00", which is only partially embedded in the file (a subset).
You can see this for yourself: open the file in Acrobat and try to copy
that text with the text selection tool. The result is just a string of
unreadable Unicode symbols. The other text in the file uses Arial or other
standard fonts, and can therefore be extracted without trouble.
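
If it helps, here is a rough sketch of how you could flag this symptom in
the string returned by getText(). It is only a heuristic and uses nothing
from PDFBox: it counts code points that are rarely part of readable text
(private-use, unassigned, lone surrogates, control characters other than
line breaks and tabs, and U+FFFD). The class name, method names and the 0.3
threshold are my own assumptions for illustration.

```java
public class GarbledTextCheck {

    /** Fraction of code points in the text that are unlikely to be readable. */
    public static double suspiciousRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            int type = Character.getType(cp);
            // Line breaks and tabs are control characters but normal in text.
            boolean ordinaryWhitespace = cp == '\n' || cp == '\r' || cp == '\t';
            if (cp == 0xFFFD
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE
                    || (type == Character.CONTROL && !ordinaryWhitespace)) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /** True when the suspicious fraction exceeds the given threshold (0..1). */
    public static boolean looksGarbled(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }

    public static void main(String[] args) {
        // Hypothetical inputs: one readable line, one made of private-use characters.
        String readable = "Semana del 3 al 7 de Septiembre de 2012";
        String garbled = "\uE000\uE001\uE002\uE003";
        System.out.println(looksGarbled(readable, 0.3));
        System.out.println(looksGarbled(garbled, 0.3));
    }
}
```

You could run it on the extracted text line by line to find the passages
that need another approach. Note that it only catches output that lands on
unusual code points; a font whose glyphs map to ordinary but wrong letters
would pass this check unnoticed.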

On Sun, Sep 9, 2012 at 11:13 AM, Natalia Gómez García <
natalia.gmz.garcia@gmail.com> wrote:

> Hello,
>
> I am a computer science student and I'm using your library PDFBox in Java
> to extract text data from some pdf files.
>
> In this project, I am having difficulties extracting the text from this
> pdf: http://www.escet.urjc.es/alumnos/horarios/GR_Biologia_2012-13.pdf.
> Specifically, I can't extract the text "Semana del 3 al 7 de
> Septiembre de 2012".
>
> Why might this be happening? Could you please give me some directions on
> how to extract this data?
>
> The code I'm using right now is the following:
> pdfDoc = PDDocument.load(url);
> pdfStripper = new PDFTextStripper();
> texto = pdfStripper.getText(pdfDoc);
> pdfDoc.close();
>
> Thanks for your attention
> Natalia
>
