poi-dev mailing list archives

From Mark Murphy <jmarkmur...@gmail.com>
Subject Re: got docx?
Date Tue, 13 Dec 2016 02:22:19 GMT
Yes, there is no glossary support, and I don't think templates are
supported very well either, if at all. I tried once to read a template and
save it as a document to another file, and things didn't go well. I'm sure
this just scratches the surface. Of course you are looking at things from
an extraction point of view, and I am looking at things from a document
creation point of view. The two are likely very different.

On Mon, Dec 12, 2016 at 7:36 PM, Allison, Timothy B. <tallison@mitre.org>
wrote:

> This is very helpful, Mark.  Thank you.  Yes, I'd add handling of the
> glossary document, as well.
>
> As I was working on the SAX parser for Tika, it "feels" more robust from
> an extraction standpoint because it extracts every "w:t", with a few
> exceptions (delText, moveFrom, AlternateContent, etc.).  It still needs more
> work, but it sounds from the list you've compiled that the new parser might
> not be a bad idea...if the sole goal is extraction.
>
>
>
> -----Original Message-----
> From: Murphy, Mark [mailto:murphymdev@metalexmfg.com]
> Sent: Monday, December 12, 2016 3:56 PM
> To: 'POI Developers List' <dev@poi.apache.org>
> Subject: RE: got docx?
>
> Lol, just from looking through the code and the standard, there are a
> number of things that I know are not handled, or not handled properly, in
> XWPF. A quick subset off the top of my head includes:
> * Pictures that are not inlined in the main document, header, or footer
> parts.
> * Sections
> * SDT content
> * Alternate content
> * Many of the shared portions of the spec
> * Tables have problems
> * Versions - This is a tag that gets added to every node telling which
> save (version) it was created for.
> * Revisions - This is the stuff that tells what was changed and how. Which
> nodes were inserted, or changed, or deleted, or moved, and when, and by
> whom.
>
> There are thousands of hours left just to get it to version 1 of the spec.
>
> But yes, thanks Dominik for providing this batch of test documents. It
> should help prioritize fixes.
>
> -----Original Message-----
> From: Allison, Timothy B. [mailto:tallison@mitre.org]
> Sent: Monday, December 12, 2016 9:58 AM
> To: POI Developers List <dev@poi.apache.org>
> Cc: dev@tika.apache.org
> Subject: RE: got docx?
>
> To close the loop and share my gratitude publicly...
>
> Thank you, Dominik, for transferring 41k docx/dotx files (5 GB) to our
> regression corpus!
>
> I’ve already found a number of “areas for improvement” in Tika's
> experimental docx SAX parser, and a few areas for improvement in POI's
> XWPFDocument/DOM parser…all thanks to your documents and your common crawl
> code.
>
> Thank you!
>
>
> Cheers,
>
>         Tim
>
>
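
The extraction approach Tim describes above (collect the text of every "w:t" element, except inside a few wrappers) can be sketched with the JDK's own SAX parser. This is only an illustration under stated assumptions, not Tika's or POI's actual code: the class name WtTextSketch, the skip list, and the choice to drop mc:Fallback (so alternate content is not extracted twice) are all hypothetical. Note that "w:delText" is a separate element from "w:t", so a handler that only looks for "w:t" ignores deleted text without any special case.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class WtTextSketch {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String MC = "http://schemas.openxmlformats.org/markup-compatibility/2006";

    // Assumption: a tiny sample standing in for word/document.xml.
    static final String SAMPLE =
        "<w:document xmlns:w=\"" + W + "\" xmlns:mc=\"" + MC + "\"><w:body><w:p>"
        + "<w:r><w:t>Kept </w:t></w:r>"
        + "<w:del><w:r><w:delText>deleted </w:delText></w:r></w:del>"
        + "<w:moveFrom><w:r><w:t>moved away </w:t></w:r></w:moveFrom>"
        + "<w:r><w:t>text</w:t></w:r>"
        + "</w:p></w:body></w:document>";

    // Hypothetical skip list: subtrees whose w:t content is not extracted.
    static boolean isSkipped(String uri, String local) {
        return (W.equals(uri) && "moveFrom".equals(local))
            || (MC.equals(uri) && "Fallback".equals(local));
    }

    static class Handler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        int skipDepth = 0;      // > 0 while inside a skipped subtree
        boolean inText = false; // true between <w:t> and </w:t>

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (skipDepth > 0 || isSkipped(uri, local)) {
                skipDepth++;
                return;
            }
            if (W.equals(uri) && "t".equals(local)) {
                inText = true;
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (skipDepth > 0) {
                skipDepth--;
                return;
            }
            if (W.equals(uri) && "t".equals(local)) {
                inText = false;
            } else if (W.equals(uri) && "p".equals(local)) {
                out.append('\n'); // one output line per paragraph
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inText) {
                out.append(ch, start, length);
            }
        }
    }

    // Parses the XML of a main document part and returns the extracted text.
    static String extract(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match on namespace URI + local name
            Handler handler = new Handler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document XML", e);
        }
    }

    public static void main(String[] args) {
        // Prints "Kept text" followed by a newline.
        System.out.print(extract(SAMPLE));
    }
}
```

A real extractor would first have to open the .docx zip package and locate the main document part, headers, footers, and so on; the sketch covers only the per-part text pass.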
