tika-dev mailing list archives

From "Curtis Warner (JIRA)" <j...@apache.org>
Subject [jira] Updated: (TIKA-405) Problems handling Hyperlinks and Tables in Word 97 Docs
Date Mon, 12 Apr 2010 21:12:50 GMT

     [ https://issues.apache.org/jira/browse/TIKA-405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Curtis Warner updated TIKA-405:
-------------------------------

    Attachment: WordDocWithLinksAndTable.doc
                expected.txt
                actual.txt

> Problems handling Hyperlinks and Tables in Word 97 Docs
> -------------------------------------------------------
>
>                 Key: TIKA-405
>                 URL: https://issues.apache.org/jira/browse/TIKA-405
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.7
>         Environment: 32-bit Ubuntu Linux
>            Reporter: Curtis Warner
>         Attachments: actual.txt, expected.txt, WordDocWithLinksAndTable.doc
>
>
> I discovered some odd behavior while running a three-way comparison test between Tika,
> Aperture, and Autonomy KeyView. The input file was a test Word 97 document (attached)
> containing a paragraph peppered with hyperlinks and a table filled with dummy text.
> KeyView generated the full text, as I expected. Aperture and Tika produced identical
> results to one another (barring one lost whitespace character), but their outputs
> yielded significantly fewer tokens than KeyView's did. I've attached the output text
> from KeyView and Tika for reference.
> I noticed two distinct problems in Tika's text output:
> 1) Hyperlinks from the Word document aren't included in the output text. They appear
> to have been skipped completely.
> 2) The values in the Word document's table are run together into a single blob rather
> than being emitted separately, which ruins any attempt at tokenizing the table's contents.
> Since both Tika and Aperture had exactly the same issues with this test file, my guess
> is that the problem lies in the shared POI library. I thought it would be worth noting,
> though, in case there's an easy fix on the Tika end of things.
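
To illustrate why problem 2 matters for tokenization, here is a minimal sketch. It does not call Tika or POI; the table contents and the tokenize helper are hypothetical placeholders, not taken from the attached document or from either library. It only contrasts cell values emitted with delimiters against the same values run together.

```python
import re

# Hypothetical 2x2 table. These cell values are placeholders,
# not the contents of the attached Word document.
rows = [["alpha", "beta"], ["gamma", "delta"]]


def tokenize(text):
    """Split text into word tokens on runs of non-word characters."""
    return re.findall(r"\w+", text)


# Cells emitted separately: a tab between cells, a newline between rows.
separated = "\n".join("\t".join(row) for row in rows)

# Cells run together with no delimiter, similar to the reported behavior.
blob = "".join(cell for row in rows for cell in row)

print(tokenize(separated))  # ['alpha', 'beta', 'gamma', 'delta']
print(tokenize(blob))       # ['alphabetagammadelta']
```

With delimiters, a simple tokenizer recovers one token per cell; without them, the whole table collapses into a single token, which would account for a lower token count.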

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
