pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: Getting the PDF box to work
Date Mon, 07 Mar 2016 15:56:35 GMT
This is very ambitious. (We do this at contentmine.org and use PDFBox for
PDF input.) It will depend very much on the quality of your input PDFs.

PDFs do not contain words, only characters (and those are not always clearly
identified codepoints). Creating words, phrases, sentences and paragraphs
requires heuristics which are almost always lossy and error-prone. If the
documents contain multiple columns, boxes, tables, etc., the problem is even
harder. Even page numbers can be a problem.
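
To make the point about heuristics concrete, here is a minimal sketch in plain
Java of one such rule: joining positioned characters on a single line into
words by looking at the horizontal gap between them. The Glyph class, the
groupIntoWords method and the gapFactor threshold are all hypothetical names
invented for this illustration; they are not PDFBox's API or the method used
at contentmine.org.

```java
import java.util.ArrayList;
import java.util.List;

class GlyphWordGrouper {
    /** One positioned character, roughly as a PDF text extractor might report it (hypothetical type). */
    static final class Glyph {
        final char ch;
        final double x;      // left edge of the character
        final double width;  // advance width of the character

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins the glyphs of one text line (assumed sorted left to right) into words.
     * A new word starts when the gap to the previous glyph exceeds
     * gapFactor * (previous glyph width). The threshold is a guess, not a rule.
     */
    static List<String> groupIntoWords(List<Glyph> line, double gapFactor) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > gapFactor * prev.width && current.length() > 0) {
                    words.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(g.ch);
            prev = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // "PDF" and "box" separated by a 4-unit gap; every glyph is 6 units wide.
        List<Glyph> line = List.of(
                new Glyph('P', 0, 6), new Glyph('D', 6, 6), new Glyph('F', 12, 6),
                new Glyph('b', 22, 6), new Glyph('o', 28, 6), new Glyph('x', 34, 6));
        System.out.println(groupIntoWords(line, 0.3)); // prints [PDF, box]
    }
}
```

The same input with a larger gapFactor (say 1.0) comes back as the single
word "PDFbox", which is the lossy behaviour described above: letter-spaced
headings, justified text and tight kerning all defeat a fixed threshold.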

P.


On Mon, Mar 7, 2016 at 3:50 PM, Gopinath Chandroth <
gopinath.chandroth@gmail.com> wrote:

> Thanks. I will study the links.
> For info, I am trying to create a system which can search for a few
> words/phrases in thousands of documents and copy all the found text (one
> page before and one page after) and paste into a Word or some other
> document.
>
> Regards
> Gopi
>
> On Mon, Mar 7, 2016 at 2:54 PM, Hartmann Toël <Toel.Hartmann@elanders.com>
> wrote:
>
> > Hi,
> >
> > That would depend on what you are trying to do with pdfbox.
> >
> > Please check
> > https://pdfbox.apache.org/2.0/getting-started.html
> >
> > The code examples are in
> >
> > https://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/
> >
> > Hello world example on how to create a simple pdf:
> >
> > https://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/HelloWorld.java?revision=1703254&view=markup
> >
> >
> > Best regards
> > Toël
> >
> > On 7 Mar 2016, at 14:21, Gopinath Chandroth
> > <gopinath.chandroth@gmail.com> wrote:
> >
> > Hello,
> > Can someone point me to a step-by-step guide to using this, please?
> > I have made it available under a project in Eclipse, but can't see any
> > code.
> > Regards
> > Gopi
> >
> >
>
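
For the use case described in the quoted message (find a phrase, then copy the
page before and the page after), the page-selection step could be sketched as
below. This is plain Java and assumes the text of each page has already been
extracted into a list of strings, for example by running PDFBox's
PDFTextStripper once per page with setStartPage and setEndPage. The class
PageContextSearch and the method pagesWithContext are hypothetical names made
up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

class PageContextSearch {
    /**
     * Returns the 0-based indices of every page whose text contains the phrase
     * (case-insensitive), plus one page on either side, sorted and without
     * duplicates. Pages before the first or after the last are simply skipped.
     */
    static List<Integer> pagesWithContext(List<String> pageTexts, String phrase) {
        String needle = phrase.toLowerCase(Locale.ROOT);
        TreeSet<Integer> selected = new TreeSet<>();
        int lastPage = pageTexts.size() - 1;
        for (int i = 0; i <= lastPage; i++) {
            if (pageTexts.get(i).toLowerCase(Locale.ROOT).contains(needle)) {
                for (int j = Math.max(0, i - 1); j <= Math.min(lastPage, i + 1); j++) {
                    selected.add(j);
                }
            }
        }
        return new ArrayList<>(selected);
    }

    public static void main(String[] args) {
        List<String> pages = List.of(
                "Title page", "Mentions text mining here", "Methods", "Results", "References");
        System.out.println(pagesWithContext(pages, "Text Mining")); // prints [0, 1, 2]
    }
}
```

A plain substring match like this will miss a phrase that the extractor has
split across a line break, a hyphenation or a column boundary, so in practice
the extracted text usually needs whitespace normalisation before searching.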



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
