tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-965) Text Detection Fails on Mostly Non-ASCII UTF-8 Files
Date Wed, 01 Aug 2012 12:15:03 GMT

    [ https://issues.apache.org/jira/browse/TIKA-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426550#comment-13426550 ]

Jukka Zitting commented on TIKA-965:
------------------------------------

I see where you're going, but it's a really tricky path. I tried doing something like that
earlier on, but I found no easy way to keep down the number of false positives.

The ICU4J classes are written with the assumption that the data you're working on is always
text and they just figure out which character encoding is most likely. They fail to take into
account the possibility of the document being in some unknown binary format.

That's why we currently run the full ICU4J encoding detection (using the {{o.a.t.parser.txt.Icu4jEncodingDetector}}
and {{o.a.t.detect.AutoDetectReader}} classes, see TIKA-322 and TIKA-471) only once we already
know by other means that we're dealing with textual data.
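
The ordering described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in Tika or ICU4J: the class {{TextGateSketch}}, its method names, and the 2% control-byte threshold are all hypothetical, and the UTF-8-or-ISO-8859-1 guess is a crude stand-in for a real statistical detector. Like such a detector, the stand-in always returns some charset, which is why the binary check has to run first.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical sketch: decide "is this text at all?" before asking
// "which encoding is it?". Not Tika's TextDetector or ICU4J's detector.
final class TextGateSketch {

    // Step 1: cheap binary check. Bytes >= 0x80 are NOT counted against the
    // input, so mostly non-ASCII UTF-8 text still passes.
    static boolean looksLikeText(byte[] data) {
        if (data.length == 0) {
            return false;
        }
        int control = 0;
        for (byte b : data) {
            int c = b & 0xFF;
            if (c == 0) {
                return false; // a NUL byte is treated as a sign of binary data
            }
            boolean whitespace = c == '\t' || c == '\n' || c == '\r' || c == '\f';
            if (c < 0x20 && !whitespace) {
                control++;
            }
        }
        // Assumed threshold: tolerate at most 2% other control bytes.
        return control * 50 <= data.length;
    }

    // Step 2: encoding guess. It assumes the input is text and always
    // answers with a charset, even for binary garbage.
    static Charset guessCharset(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8; // strict decode succeeded
        } catch (CharacterCodingException e) {
            return StandardCharsets.ISO_8859_1; // fallback accepts any bytes
        }
    }

    // Run the encoding guess only once the binary check has passed.
    static Optional<String> decodeIfText(byte[] data) {
        if (!looksLikeText(data)) {
            return Optional.empty();
        }
        return Optional.of(new String(data, guessCharset(data)));
    }
}
```

Calling {{guessCharset}} directly on bytes such as {{00 01 02 FF}} still yields a charset, which is the false positive in question; {{decodeIfText}} returns an empty result for the same bytes because the gate rejects them first.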
                
> Text Detection Fails on Mostly Non-ASCII UTF-8 Files
> ----------------------------------------------------
>
>                 Key: TIKA-965
>                 URL: https://issues.apache.org/jira/browse/TIKA-965
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.2
>            Reporter: Ray Gauss II
>         Attachments: 0001-TIKA-965-Text-Detection-Fails-on-Mostly-Non-ASCII-UT.patch
>
>
> If a file contains relatively few ASCII characters and more 8 bit UTF-8 characters the
> TextDetector and TextStatistics classes fail to detect it as text.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
