tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-965) Text Detection Fails on Mostly Non-ASCII UTF-8 Files
Date Wed, 01 Aug 2012 10:14:03 GMT

    [ https://issues.apache.org/jira/browse/TIKA-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426490#comment-13426490 ]

Jukka Zitting commented on TIKA-965:
------------------------------------

I'm not too big a fan of the {{Charset}} classes in {{o.a.t.parser.txt}}. We borrowed them
from ICU4J, and though they cover a lot of exotic corner cases, they're pretty slow and cumbersome
to use with the vast majority of text out there.

An alternative that should work fairly well is to leverage the existing {{TextStatistics}}
class in {{tika-core}} for a quick check of whether there are as many UTF-8 continuation bytes
in the text as there should be. Something like the following might be a good approximation:

{code}
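public boolean looksLikeUTF8() {
    int control = count(0, 0x20);       // C0 control bytes
    int utf8 = count(0x20, 0x80);       // printable ASCII (plus DEL)
    int safe = countSafeControl();      // harmless controls such as tab and newline

    // Lead bytes of 2-, 3- and 4-byte sequences; leading[i] implies
    // (i + 1) continuation bytes per occurrence.
    int expectedContinuation = 0;
    int[] leading = new int[] {
            count(0xc0, 0xe0), count(0xe0, 0xf0), count(0xf0, 0xf8) };
    for (int i = 0; i < leading.length; i++) {
        utf8 += leading[i];
        expectedContinuation += (i + 1) * leading[i];
    }

    int continuation = count(0x80, 0xc0);
    return utf8 > 0
            && continuation <= expectedContinuation
            // allow up to 3 missing continuation bytes, e.g. a truncated sequence
            && continuation >= expectedContinuation - 3
            // bytes 0xf8..0xff never occur in UTF-8
            && count(0xf8, 0x100) == 0
            // unsafe control bytes must stay below 2% of the text bytes
            && (control - safe) * 100 < utf8 * 2;
}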
{code}
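
To try the idea outside of Tika, here is a minimal self-contained sketch of the same check. The class name {{Utf8Heuristic}}, the {{addData}} method and the static convenience wrapper are hypothetical stand-ins, not Tika's API, and the set of control bytes treated as "safe" (tab, newline, carriage return, form feed, escape) is an assumption made for this sketch.

```java
/** Hypothetical stand-alone sketch; not the actual TextStatistics class. */
final class Utf8Heuristic {

    /** Histogram of byte values seen so far. */
    private final int[] counts = new int[256];

    /** Adds the given bytes to the histogram. */
    void addData(byte[] buffer, int offset, int length) {
        for (int i = 0; i < length; i++) {
            counts[buffer[offset + i] & 0xff]++;
        }
    }

    /** Counts bytes with values in the half-open range [from, to). */
    private int count(int from, int to) {
        int sum = 0;
        for (int i = from; i < to; i++) {
            sum += counts[i];
        }
        return sum;
    }

    /** Assumed set of harmless control bytes: TAB, LF, CR, FF, ESC. */
    private int countSafeControl() {
        return counts['\t'] + counts['\n'] + counts['\r']
                + counts[0x0c] + counts[0x1b];
    }

    boolean looksLikeUTF8() {
        int control = count(0, 0x20);
        int utf8 = count(0x20, 0x80);
        int safe = countSafeControl();

        // Lead bytes of 2-, 3- and 4-byte sequences.
        int expectedContinuation = 0;
        int[] leading = new int[] {
                count(0xc0, 0xe0), count(0xe0, 0xf0), count(0xf0, 0xf8) };
        for (int i = 0; i < leading.length; i++) {
            utf8 += leading[i];
            expectedContinuation += (i + 1) * leading[i];
        }

        int continuation = count(0x80, 0xc0);
        return utf8 > 0
                && continuation <= expectedContinuation
                && continuation >= expectedContinuation - 3
                && count(0xf8, 0x100) == 0
                && (control - safe) * 100 < utf8 * 2;
    }

    /** Convenience wrapper: runs the check over a whole byte array. */
    static boolean looksLikeUTF8(byte[] data) {
        Utf8Heuristic stats = new Utf8Heuristic();
        stats.addData(data, 0, data.length);
        return stats.looksLikeUTF8();
    }
}
```

With this sketch, {{Utf8Heuristic.looksLikeUTF8(bytes)}} accepts the UTF-8 encoding of a short all-Japanese string, and rejects ISO-8859-1 text containing several accented letters, because those bytes fall in the lead-byte ranges but no continuation bytes follow them. Note that the check only looks at the byte histogram, not at byte order, so it is cheap but can be fooled by data whose counts happen to balance.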
                
> Text Detection Fails on Mostly Non-ASCII UTF-8 Files
> ----------------------------------------------------
>
>                 Key: TIKA-965
>                 URL: https://issues.apache.org/jira/browse/TIKA-965
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.2
>            Reporter: Ray Gauss II
>
> If a file contains relatively few ASCII characters and more 8-bit UTF-8 characters, the
> TextDetector and TextStatistics classes fail to detect it as text.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
