pdfbox-users mailing list archives

From Duane Nickull <du...@technoracle-systems.com>
Subject Re: ANN: AMI2-PDF2SVG conversion of PDF to semantic characters and graphics
Date Sat, 17 Nov 2012 19:10:05 GMT
Very cool project!  I did not see any EULA on this declaring a GPL or
similar style license.  What license are you using?  I would like to
introduce this work to some people.

Thank you for sharing!

Duane Nickull
***********************************
Technoracle Advanced Systems Inc.
Consulting and Contracting; Proven Results!
i.  Neo4J, PDF, Java, LiveCycle ES, Flex, AIR, CQ5 & Mobile
b. http://technoracle.blogspot.com
t.  @duanechaos
"Don't fear the Graph!  Embrace Neo4J"

On 2012-11-16 4:42 PM, "Peter Murray-Rust" <pm286@cam.ac.uk> wrote:

>I am pleased to announce an Open Source project (AMI2) based partly on
>PDFBox which aims to create semantic documents from PDFs. It's in three
>parts of which the first (PDF2SVG) uses a lot of the functionality of
>PDFBox to finally create SVG. The result is a flat document with no further
>reliance on the original PDF structure and dictionaries.
>
>First, I'd like to thank members of this list for their help and
>congratulate the developers on a fine product.
>
>The overview is at: https://bitbucket.org/petermr/pdf2svg/overview. I also
>blog on this (look for ami2 in the title, e.g.
>https://blogs.ch.cam.ac.uk/pmr/2012/11/16/ami2-opencontentmining-ami-analyses-more-pdfs-and-gets-useful-help-from-stackoverflow-and-shapecatcher/
>- other blog posts may or may not be of interest).
>
>In essence PDF2SVG tries to:
>* normalize all x,y coordinates to a display page/screen
>* identify all characters with x,y and Unicode codepoint. These are
>converted to <svg:text x="" y="">text</svg:text>
>* identify all paths and convert to <svg:path d="M d d L d d C d d d ..."/>
>(i.e. move/line/cubic/quad/close)
>* extract bitmaps.
>* carry out some (but not total) character equivalencing where glyphs are
>essentially interchangeable. Also expand ligatures.
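
[Editor's sketch, not part of the original message.] The list above can be illustrated with a minimal Java example that turns one positioned character into an svg:text element and one path string into an svg:path element. It is a sketch under stated assumptions, not AMI2's actual code: the class and method names are hypothetical, and ligature expansion is done here with the JDK's NFKC normalization, which also folds other compatibility characters and so may be more aggressive than PDF2SVG's own equivalencing.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper; names are illustrative and do not come from PDF2SVG.
public class SvgEmitSketch {

    /** Escape the characters that are special inside XML text content. */
    public static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Expand compatibility ligatures (e.g. U+FB01 becomes "fi") using NFKC. */
    public static String expandLigatures(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /** One positioned character becomes one svg:text element. */
    public static String toSvgText(double x, double y, String ch) {
        return String.format(Locale.ROOT,
                "<svg:text x=\"%.2f\" y=\"%.2f\">%s</svg:text>",
                x, y, escapeXml(expandLigatures(ch)));
    }

    /** Wrap already-built path data (M/L/C/Q/Z commands) in an svg:path element. */
    public static String toSvgPath(String d) {
        return "<svg:path d=\"" + d + "\"/>";
    }

    public static void main(String[] args) {
        System.out.println(toSvgText(72.0, 100.5, "\uFB01"));
        System.out.println(toSvgPath("M 10 10 L 50 10 L 50 40 Z"));
    }
}
```

The coordinates passed in are assumed to be already normalized to page/screen space, as described in the first bullet.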
>
>The aim is to be able to turn STM documents (Scientific Technical Medical)
>into semantic objects. These documents are widely found in scholarly
>publications and reports and patents. The two subsequent modules do not
>directly use PDFBox but may be of interest:
>* AMI2-SVGPlus converts isolated characters (output of PDF2SVG) into
>running text, with super- and subscripts, and paths into higher order
>primitives (svg:rect, svg:circle, svg:polyline, etc.). It includes a general
>tool for extracting vectors into graphical plots (e.g. x-y plots with
>curves and points)
>* AMI2-SVG2XML converts the results of SVGPlus into scientific objects such
>as chemical reactions, phylogenetic trees, genomes, etc.
>These last two have been written and are being refactored.
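
[Editor's sketch, not part of the original message.] The step from isolated characters to running text can be pictured as grouping glyphs that share a baseline and then ordering each group left to right. The Java below is only a guess at the general idea, assuming SVG coordinates (y grows downward) and a hypothetical `yTolerance` parameter; it does not detect super- or subscripts, and nothing here is taken from AMI2-SVGPlus itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; class, field and parameter names are assumptions.
public class RunningTextSketch {

    /** A single character with its position, as PDF2SVG-style output might provide. */
    public static final class Glyph {
        final double x;
        final double y;
        final String text;

        public Glyph(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Group glyphs whose y lies within yTolerance of the first glyph in the line. */
    public static List<String> toLines(List<Glyph> glyphs, double yTolerance) {
        List<Glyph> byY = new ArrayList<>(glyphs);
        byY.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        List<String> lines = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        for (Glyph g : byY) {
            if (!current.isEmpty() && Math.abs(g.y - current.get(0).y) > yTolerance) {
                lines.add(join(current));
                current = new ArrayList<>();
            }
            current.add(g);
        }
        if (!current.isEmpty()) {
            lines.add(join(current));
        }
        return lines;
    }

    /** Order one line's glyphs left to right and concatenate their text. */
    private static String join(List<Glyph> line) {
        line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : line) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

A real implementation would also need word-spacing detection and font-size-aware handling of raised or lowered glyphs.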
>
>The main problem we face (and which will be of interest to PDFBoxers) is
>the extraction of reliable Unicode codepoints. In favourable cases the PDF
>document uses PDF-approved fonts (e.g. Helvetica) and Unicode points (BTW I
>think all science and maths can be done with Unicode). Unfortunately many
>of the typesetters use non-standard approaches and these include:
>* sets such as Mathematical-Pi which have no public mapping to Unicode (see
>my recent question:
>http://stackoverflow.com/questions/13188587/conversion-of-mathematicalpi-symbol-names-to-unicode).
>There appear to be 2 main others: Symbol, which maps ASCII characters to
>Greek letters, for example; and one whose symbols are of the form Cd(dd),
>thus C3 is asterisk and C6 is plus-minus. Any idea on where this came from
>would be valuable!
>* PDFonts without fontDescriptors
>* and even PDFonts without fontNames (only basefont).
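
[Editor's sketch, not part of the original message.] One way to picture the non-standard font problem is as a per-font lookup table from glyph name (or character) to Unicode codepoint. The Java below is a minimal sketch under assumptions: the class name, the key format and the font label "CdFont" are hypothetical; the Symbol entries follow the usual Symbol-font convention (a, b, p for alpha, beta, pi); and the C3/C6 entries simply restate the examples in the message, where "asterisk" is assumed to mean U+002A rather than U+2217.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup table; a real one would be loaded from external files.
public class GlyphMapSketch {

    // Key format (an assumption): font family + "/" + glyph name or character.
    private static final Map<String, Integer> TABLE = new HashMap<>();

    static {
        TABLE.put("Symbol/a", 0x03B1);  // Greek small alpha
        TABLE.put("Symbol/b", 0x03B2);  // Greek small beta
        TABLE.put("Symbol/p", 0x03C0);  // Greek small pi
        TABLE.put("CdFont/C3", 0x002A); // asterisk, per the message (assumed U+002A)
        TABLE.put("CdFont/C6", 0x00B1); // plus-minus sign
    }

    /** Return the mapped Unicode string, or null when no mapping is known. */
    public static String toUnicode(String font, String glyph) {
        Integer codepoint = TABLE.get(font + "/" + glyph);
        return codepoint == null ? null : new String(Character.toChars(codepoint));
    }

    public static void main(String[] args) {
        System.out.println(toUnicode("Symbol", "a"));
        System.out.println(toUnicode("AdvP4C4E74", "x")); // unmapped font
    }
}
```

A null result marks the cases where only the glyph outline is available and shape-based heuristics would have to take over.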
>
>The naming of some fonts is also obscure (e.g. AdvP4C4E74). I suspect these
>are specific to various typesetting companies but some may be generated on
>the fly. In the worst case we have only the outline glyphs which we have to
>translate to Unicode (this can be done by heuristics but it is not fun;
>as it's all Open it might be crowdsourceable). So all in all it can be
>difficult to interpret characters and there can be ambiguity. (Would I be
>right in thinking that it will be difficult for a machine reader, e.g. for
>unsighted humans, to understand PDFs which had no FontDescriptor and no
>FontName?)
>
>This is an Open collaborative project and we'd be delighted for members of
>this list to use AMI2 and contribute if they wish. We've set up an issue
>tracker for comments. I am sure some of you will have faced the same
>problems and any (even partial) solutions will be useful.
>
>PDF2SVG is beta; the others are being refactored to alpha.
>
>PDF2SVG may, of course, be of use in other disciplines - character
>processing is configurable through external files.
>
>Enjoy
>-- 
>Peter Murray-Rust
>Reader in Molecular Informatics
>Unilever Centre, Dep. Of Chemistry
>University of Cambridge
>CB2 1EW, UK
>+44-1223-763069


