poi-dev mailing list archives

From bugzi...@apache.org
Subject [Bug 60570] Add rudimentary EMF read-only capability
Date Thu, 19 Jan 2017 16:27:36 GMT
https://bz.apache.org/bugzilla/show_bug.cgi?id=60570

Tim Allison <tallison@mitre.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #4 from Tim Allison <tallison@mitre.org> ---
r1779493

This patch adds a rudimentary parser for EMF and EMFPlus records. The
goals are to extract embedded PDFs (and other binary files) as well as
embedded WMFs.
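
For readers unfamiliar with the format, here is a minimal sketch of what a
rudimentary record-level parse can look like. It is not the code from
r1779493 and does not use POI's API; the class and method names
(EmfRecordWalker, RecordInfo, walk, sample) are made up for illustration. It
relies only on the EMF record layout: each record starts with a 4-byte
little-endian type followed by a 4-byte little-endian size that counts the
whole record, including those 8 bytes. Embedded payloads would be looked for
inside comment records (type 70), which this sketch only locates and does
not decode.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of POI: lists the records in an EMF byte stream.
class EmfRecordWalker {
    static final int EMR_HEADER = 1;   // first record in a metafile
    static final int EMR_EOF = 14;     // last record in a metafile
    static final int EMR_COMMENT = 70; // comment record; may wrap EMF+ or other binary data

    // Position and extent of one record; size includes the 8-byte type/size prefix.
    static final class RecordInfo {
        final int type;
        final int offset;
        final int size;

        RecordInfo(int type, int offset, int size) {
            this.type = type;
            this.offset = offset;
            this.size = size;
        }
    }

    // Walks the stream record by record, stopping at EMR_EOF or end of data.
    static List<RecordInfo> walk(byte[] emf) {
        ByteBuffer buf = ByteBuffer.wrap(emf).order(ByteOrder.LITTLE_ENDIAN);
        List<RecordInfo> records = new ArrayList<>();
        int offset = 0;
        while (offset + 8 <= emf.length) {
            int type = buf.getInt(offset);
            int size = buf.getInt(offset + 4);
            // Reject sizes that are too small, unaligned, or run past the data.
            if (size < 8 || size % 4 != 0 || size > emf.length - offset) {
                throw new IllegalArgumentException(
                        "Bad record size " + size + " at offset " + offset);
            }
            records.add(new RecordInfo(type, offset, size));
            if (type == EMR_EOF) {
                break;
            }
            offset += size;
        }
        return records;
    }

    // Synthetic three-record stream for demonstration only; a real EMR_HEADER is larger.
    static byte[] sample() {
        ByteBuffer buf = ByteBuffer.allocate(28).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(EMR_HEADER).putInt(8);
        buf.putInt(EMR_COMMENT).putInt(12).putInt(0); // 4 bytes of dummy payload
        buf.putInt(EMR_EOF).putInt(8);
        return buf.array();
    }

    public static void main(String[] args) {
        for (RecordInfo r : walk(sample())) {
            System.out.println("type=" + r.type + " offset=" + r.offset + " size=" + r.size);
        }
    }
}
```

A real extractor would go one step further and inspect the payload of each
comment record for an embedded file, but the walking loop above is the part
that everything else hangs off.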

This is a start towards text extraction, although more work remains,
including:
1) parsing and tracking the fonts to handle EXTTEXTOUTA and POLYTEXTOUTA
2) implementing the POLYTEXTOUT records (I couldn't find examples)

I developed this code with EMFs and WMFs extracted from Common Crawl and
govdocs1.  I only included unit tests for EMFs/WMFs that I could extract from
POI's test files and/or Tika's test files.

If we're OK with adding Common Crawl and/or govdocs1 docs to our unit test
suite, I can add more unit tests.
