any23-dev mailing list archives

From "Hans Brende (JIRA)" <>
Subject [jira] [Updated] (ANY23-418) Take another look at encoding detection
Date Mon, 04 Feb 2019 00:47:00 GMT


Hans Brende updated ANY23-418:
    Fix Version/s:     (was: 2.4)

> Take another look at encoding detection
> ---------------------------------------
>                 Key: ANY23-418
>                 URL:
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: encoding
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Priority: Major
>             Fix For: 2.3
> In order to address various shortcomings of Tika encoding detection, I've had to modify
the TikaEncodingDetector several times. Cf. ANY23-385 and ANY23-411. In the former, I placed
a much greater weight on detected charsets declared in HTML meta elements and XML declarations.
In the latter, I placed a much greater weight on charsets returned from HTTP Content-Type
headers.
> However, after taking a look at TIKA-539, I'm thinking I should reduce this added weight
(at least for HTML meta elements), and perhaps ignore it altogether unless the declared
charset is UTF-8, since it seems that incorrect declarations usually declare something
*other than* UTF-8 when the correct charset is UTF-8.
> Something like 90% or more of all webpages use UTF-8 encoding, and all of our encoding
detection errors to date have involved *something other than UTF-8* being detected when the
correct encoding was actually UTF-8, not the other way around.
> Therefore, what I propose is the following: 
> (1) In the absence of a Content-Type header, any declared hints that the charset is UTF-8
should add to the weight for UTF-8, while any declared hints that the charset is not UTF-8
should be ignored. 
> (2) In the presence of a Content-Type header, any other declared hints should be ignored,
unless they match UTF-8 and do not match the Content-Type header, in which case all hints,
including the Content-Type header, should be ignored.
> EDIT: The two points above are a simplification of what I've actually implemented
(specifically, I don't necessarily ignore non-UTF-8 hints). See PR 131 for details.
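
For illustration only, the two simplified points above could be sketched roughly as follows. This is not the code from PR 131 and not the TikaEncodingDetector itself: the function names (normalize, hints_to_honor) and the list-of-charset-names interface are hypothetical, and Python's codec registry stands in for real charset alias handling.

```python
import codecs
from typing import List, Optional

UTF8 = "utf-8"


def normalize(name: str) -> str:
    """Map a charset label to a canonical name (hypothetical helper).

    Uses Python's codec registry for alias resolution; labels it does not
    know are simply lower-cased.
    """
    label = name.strip()
    try:
        return codecs.lookup(label).name
    except LookupError:
        return label.lower()


def hints_to_honor(header_charset: Optional[str],
                   declared_charsets: List[str]) -> List[str]:
    """Return the charset hints that survive the two simplified rules.

    header_charset:    charset from the HTTP Content-Type header, or None.
    declared_charsets: charsets declared in the document itself
                       (e.g. HTML meta elements, XML declarations).
    """
    declared = [normalize(c) for c in declared_charsets]
    utf8_declared = [c for c in declared if c == UTF8]

    if header_charset is None:
        # Point (1): without a header, only UTF-8 declarations count.
        return utf8_declared

    header = normalize(header_charset)
    if utf8_declared and header != UTF8:
        # Point (2), exception: a UTF-8 declaration contradicts the
        # header, so every hint (header included) is dropped.
        return []

    # Point (2), general case: the header wins, other hints are ignored.
    return [header]
```

Under these assumptions, hints_to_honor(None, ["ISO-8859-1", "utf8"]) keeps only the UTF-8 hint, while hints_to_honor("ISO-8859-1", ["UTF-8"]) returns an empty list, leaving the decision entirely to statistical detection.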

This message was sent by Atlassian JIRA
