lucene-java-user mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: text extraction from pdf
Date Wed, 14 May 2008 11:34:33 GMT
Cam Bazz wrote:
> Hello All,
> 
> Any suggestions for extracting text from PDF? I have tried PDFBox, and it
> works nicely; however, if the PDF is structured, it won't give good results.
> For example consider the pdf:
> 
> 
> P1 Lorem Ipsum Bla bla                                      P3 Lorem2 Ipsum2
> P1 bla bla
> 
> P2 bla bla bla
> P2 bla bla bla
> 
> 
> 
> Above, P1, P2 and P3 are meaningful paragraphs or fields. PDFBox will
> convert this to:
> 
> P1 Lorem Ipsum Bla bla P3 Lorem2 Ipsum2
> P1 bla bla
> 
> which is not useful to me.
> 
> The Unix program pdf2text can convert while keeping the text positions, but
> I wanted to ask whether you know of something better.

AFAIK, PDFBox has a lower-level API that allows you to get hold of text 
positions.
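
To illustrate what you could do once you have the positions, here is a
minimal sketch, not tested against PDFBox itself. It assumes you have already
collected one fragment per printed line per column, each with a left edge, a
top edge (y growing down the page) and a width. The Fragment class,
splitColumns and the minGap parameter are hypothetical names of mine, not
part of PDFBox; Fragment only stands in for the coordinates that a PDFBox
TextPosition reports via getX() and getY().

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ColumnSplitter {

    /** Hypothetical positioned fragment (NOT a PDFBox class): left edge, top edge, width, text. */
    static final class Fragment {
        final float x, y, width;
        final String text;

        Fragment(float x, float y, float width, String text) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.text = text;
        }
    }

    /**
     * Groups fragments into columns separated by a horizontal gap wider than
     * minGap, then returns the text of each column, left to right, with the
     * lines of a column in top-to-bottom order.
     */
    static List<String> splitColumns(List<Fragment> fragments, float minGap) {
        List<Fragment> byX = new ArrayList<>(fragments);
        byX.sort(Comparator.<Fragment>comparingDouble(f -> f.x));

        // Sweep left to right, tracking the right edge of the current column band.
        List<List<Fragment>> columns = new ArrayList<>();
        float rightEdge = Float.NEGATIVE_INFINITY;
        for (Fragment f : byX) {
            if (columns.isEmpty() || f.x - rightEdge > minGap) {
                columns.add(new ArrayList<>());   // gap is wide enough: start a new column
                rightEdge = f.x + f.width;
            } else {
                rightEdge = Math.max(rightEdge, f.x + f.width);
            }
            columns.get(columns.size() - 1).add(f);
        }

        // Within each column, restore reading order (assumes y grows downward).
        List<String> result = new ArrayList<>();
        for (List<Fragment> column : columns) {
            column.sort(Comparator.<Fragment>comparingDouble(f -> f.y)
                                  .thenComparingDouble(f -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : column) {
                if (sb.length() > 0) {
                    sb.append('\n');
                }
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

Fed the layout from your example, with the P1/P2 lines at x=50 and the P3
line at x=400, this should hand back the P1/P2 text as one column and the P3
text as another instead of interleaving them. How you collect the fragments
depends on the PDFBox version you use, and word-level fragments would need an
extra step to join pieces that share a line.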



-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

