lucene-java-user mailing list archives

From "Cam Bazz" <>
Subject text extraction from pdf
Date Wed, 14 May 2008 09:31:55 GMT
Hello All,

Any suggestions for extracting text from PDF? I have tried PDFBox, and it
works nicely; however, if the PDF is structured (for example, laid out in
columns), it won't provide good results. For example, consider this PDF:

P1 Lorem Ipsum Bla bla                                      P3 Lorem2 Ipsum2
P1 bla bla

P2 bla bla bla
P2 bla bla bla

Above, P1, P2 and P3 are meaningful paragraphs or fields. PDFBox will output:

P1 Lorem Ipsum Bla bla P3 Lorem2 Ipsum2
P1 bla bla

which is not useful to me.

The Unix program pdf2text can convert while keeping the text positions, but I
wanted to ask whether you know of something better.
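
To make the goal concrete, here is a minimal sketch of the kind of
post-processing that layout-preserving output would allow. It assumes the
extracted text keeps wide runs of spaces between columns, as in the example
above, and that there are only two columns. The names ColumnSplitter,
splitColumns and minGap are made up for illustration; none of this is PDFBox
or pdf2text API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnSplitter {

    /**
     * Splits layout-preserved lines into two columns.
     * Returns two lists: index 0 holds the left column, index 1 the right.
     * minGap (an assumed tuning knob) is the number of consecutive spaces
     * that counts as a column boundary rather than ordinary word spacing.
     */
    public static List<List<String>> splitColumns(List<String> lines, int minGap) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();

        // Build the separator: a run of minGap spaces.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < minGap; i++) {
            sb.append(' ');
        }
        String gap = sb.toString();

        for (String line : lines) {
            if (line.trim().isEmpty()) {
                left.add("");              // keep blank lines so paragraph breaks survive
                continue;
            }
            if (line.startsWith(gap)) {
                right.add(line.trim());    // heavily indented line: right column only
                continue;
            }
            int cut = line.indexOf(gap);
            if (cut < 0) {
                left.add(line.trim());     // no wide gap: left column only
            } else {
                left.add(line.substring(0, cut).trim());
                String rest = line.substring(cut).trim();
                if (!rest.isEmpty()) {
                    right.add(rest);
                }
            }
        }

        List<List<String>> columns = new ArrayList<>();
        columns.add(left);
        columns.add(right);
        return columns;
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList(
            "P1 Lorem Ipsum Bla bla                                      P3 Lorem2 Ipsum2",
            "P1 bla bla",
            "",
            "P2 bla bla bla",
            "P2 bla bla bla");
        List<List<String>> columns = splitColumns(page, 4);
        for (String s : columns.get(0)) {
            System.out.println(s);
        }
        System.out.println("----");
        for (String s : columns.get(1)) {
            System.out.println(s);
        }
    }
}
```

With the sample page above, the P1 and P2 lines end up in the left list and
the P3 line in the right list, which is the separation I am after. A real
document would need more than a fixed space threshold (more than two columns,
ragged gaps, tables), so this only illustrates the idea.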

