tika-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-956) Embedded docs in Word doc are not inlined (text is always added to the end)
Date Sun, 05 Aug 2012 22:36:03 GMT

    [ https://issues.apache.org/jira/browse/TIKA-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428923#comment-13428923 ]

Michael McCandless commented on TIKA-956:
-----------------------------------------

bq. Instead of the non-standard embedded attribute, it would be better to use a construct
like <div class="embedded" id="_NNNNNN"> for this.

Thanks Jukka, I'll change to that.
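For illustration only, here is a rough sketch of what emitting that construct as SAX events could look like. It uses only the JDK's org.xml.sax classes, not the actual Tika or POI code; the class name, the RecordingHandler helper, and the id value "_123456" are all made up for the example.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class EmbeddedDivSketch {

    /** Emits an empty div with class="embedded" and the given id as SAX events. */
    static void emitEmbeddedPlaceholder(ContentHandler handler, String embeddedId)
            throws SAXException {
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "class", "class", "CDATA", "embedded");
        attrs.addAttribute("", "id", "id", "CDATA", embeddedId);
        handler.startElement("", "div", "div", attrs);
        handler.endElement("", "div", "div");
    }

    /** Hypothetical demo handler: records events as markup (no escaping). */
    static class RecordingHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            out.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                out.append(' ').append(atts.getQName(i))
                   .append("=\"").append(atts.getValue(i)).append('"');
            }
            out.append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }
    }

    /** Renders the placeholder for one (made-up) embedded id as a string. */
    static String render(String embeddedId) {
        RecordingHandler handler = new RecordingHandler();
        try {
            emitEmbeddedPlaceholder(handler, embeddedId);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("_123456"));
    }
}
```

A consumer could then look for div elements with class "embedded" and use the id to match the placeholder to the embedded doc, without relying on a non-standard attribute.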

bq. An even better approach would be to use something like <img src="embedded:..." alt="...">
or <object data="embedded:..." type="...">...</object>. See the XWPFWordExtractorDecorator
class for an example of how embedded images are handled in OOXML Word documents.

I would love to associate the embedded id with the corresponding
thumbnail image, and/or to get the type of the embedded object, but I
don't yet see how to do that with the POI APIs.  It must be possible,
though, since Word obviously knows how to associate each thumbnail with
the right embedded doc... but I think we can improve this later on.

Separately, it would be nice to include Picture.getDescription()
as the image alt text, but... it looks like that was only recently
added to POI (i.e. we'd have to upgrade first).

                
> Embedded docs in Word doc are not inlined (text is always added to the end)
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-956
>                 URL: https://issues.apache.org/jira/browse/TIKA-956
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: TIKA-956.patch, TIKA-956.patch
>
>
> You can see this with the recently added testWORD_embedded_pdf.doc
> (for TIKA-948): the "Bye Bye" text comes before the "Wer
> wjelrwoierj..." text from the embedded PDF, opposite of what you see
> when you open the doc with Word.
> Yet, the thumbnail images do seem to be extracted at the right place
> (inlined).
> This is because WordExtractor.java has a separate pass at the end to
> visit the embedded docs.
> Would it be possible to recurse into an embedded doc at the point when
> it's first encountered instead...?  Or maybe somehow correlate the
> images with their corresponding attachment (right now they are just
> named image1, image2, ...)?
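To make the ordering problem in the description concrete, here is a toy sketch in plain Java. It is not WordExtractor's code: the Item type, the id "_1", and the two-item body are hypothetical stand-ins, and only the "Bye Bye" and "Wer wjelrwoierj..." strings come from the issue. It contrasts a separate pass at the end with recursing at the point where the embedded doc is encountered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class EmbeddedOrderSketch {

    /** Hypothetical body item: plain text, or a reference to an embedded doc. */
    static final class Item {
        final String text;        // non-null for plain text
        final String embeddedId;  // non-null for an embedded doc reference

        private Item(String text, String embeddedId) {
            this.text = text;
            this.embeddedId = embeddedId;
        }

        static Item text(String t) { return new Item(t, null); }
        static Item embedded(String id) { return new Item(null, id); }
    }

    /** Separate pass at the end: embedded text always lands after the body text. */
    static List<String> extractDeferred(List<Item> body, Map<String, String> embeddedText) {
        List<String> out = new ArrayList<>();
        List<String> pending = new ArrayList<>();
        for (Item item : body) {
            if (item.embeddedId != null) {
                pending.add(item.embeddedId);   // remember it, handle it later
            } else {
                out.add(item.text);
            }
        }
        for (String id : pending) {
            out.add(embeddedText.get(id));
        }
        return out;
    }

    /** Inline handling: embedded text is emitted where the reference appears. */
    static List<String> extractInline(List<Item> body, Map<String, String> embeddedText) {
        List<String> out = new ArrayList<>();
        for (Item item : body) {
            out.add(item.embeddedId != null ? embeddedText.get(item.embeddedId) : item.text);
        }
        return out;
    }

    /** Simplified stand-in for the test doc: embedded PDF first, then "Bye Bye". */
    static List<Item> sampleBody() {
        return Arrays.asList(Item.embedded("_1"), Item.text("Bye Bye"));
    }

    static Map<String, String> sampleEmbedded() {
        return Collections.singletonMap("_1", "Wer wjelrwoierj...");
    }

    public static void main(String[] args) {
        System.out.println(extractDeferred(sampleBody(), sampleEmbedded()));
        System.out.println(extractInline(sampleBody(), sampleEmbedded()));
    }
}
```

With the sample body, the deferred variant puts "Bye Bye" before the embedded text, which is the ordering reported here, while the inline variant keeps the order in which the items appear in the body.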


        
