any23-dev mailing list archives

From "Hudson (JIRA)" <>
Subject [jira] [Commented] (ANY23-418) Take another look at encoding detection
Date Thu, 07 Feb 2019 06:46:00 GMT


Hudson commented on ANY23-418:

SUCCESS: Integrated in Jenkins build Any23-trunk #1654 (See [])
ANY23-418 improve TikaEncodingDetector (hans: rev d64dac9dfe0752c45d3ff9fbca37bbe447e5c55b)
* (edit) encoding/src/main/java/org/apache/any23/encoding/
ANY23-418 add additional unit tests (hans: rev e9f11b4979f491d395f76ad22f11869220099be2)
* (edit) encoding/src/test/java/org/apache/any23/encoding/
* (edit) encoding/src/main/java/org/apache/any23/encoding/
ANY23-418 update f8 artifact, cleanup (hans: rev dce3c098e8a4c0662e663d847d345a67a978e343)
* (edit) encoding/src/test/java/org/apache/any23/encoding/
* (edit) encoding/pom.xml
* (edit) encoding/src/main/java/org/apache/any23/encoding/
ANY23-418 update NOTICE.txt (hans: rev e9c001ffa7bcfb7914a91649d2a190857569d054)
* (edit) NOTICE.txt

> Take another look at encoding detection
> ---------------------------------------
>                 Key: ANY23-418
>                 URL:
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: encoding
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Assignee: Hans Brende
>            Priority: Major
>             Fix For: 2.3
> In order to address various shortcomings of Tika encoding detection, I've had to modify
> the TikaEncodingDetector several times. Cf. ANY23-385 and ANY23-411. In the former, I placed
> a much greater weight on detected charsets declared in HTML meta elements and XML declarations.
> In the latter, I placed a much greater weight on charsets returned from HTTP Content-Type
> headers.
> However, after taking a look at TIKA-539, I'm thinking I should reduce this added weight
> (for at least HTML meta elements), and perhaps ignore it altogether (unless it happens to
> match UTF-8, since it seems that incorrect declarations usually declare something *other than*
> UTF-8, when the correct charset should be UTF-8).
> Roughly 90% or more of all webpages use UTF-8 encoding, and all of our encoding detection
> errors to date have revolved around *something other than UTF-8* being detected when the correct
> encoding was actually UTF-8, not the other way around.
> Therefore, what I propose is the following:
> (1) In the absence of a Content-Type header, any declared hints that the charset is UTF-8
> should add to the weight for UTF-8, while any declared hints that the charset is not UTF-8
> should be ignored.
> (2) In the presence of a Content-Type header, any other declared hints should be ignored,
> unless they match UTF-8 and do not match the Content-Type header, in which case all hints,
> including the Content-Type header, should be ignored.
> EDIT: The above 2 points are a simplification of what I've actually implemented (specifically,
> I don't necessarily ignore non-UTF-8 hints). See PR 131 for details.
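
For readers following the thread, the two simplified rules in the issue description could be sketched roughly as below. This is only an illustration of points (1) and (2) as worded in the issue, not the code committed to TikaEncodingDetector; as the EDIT note says, the real implementation differs. The class name HintFilter, the method usableHints, and the idea of returning a list of "still usable" hints are all assumptions made for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Any23 or Tika.
class HintFilter {

    /**
     * Returns the declared charsets that should still influence detection
     * under the two simplified rules from the issue description.
     *
     * @param contentTypeCharset charset from the HTTP Content-Type header,
     *                           or null if no such header was present
     * @param declaredHints      charsets declared in the document itself
     *                           (e.g. HTML meta elements, XML declarations)
     */
    static List<Charset> usableHints(Charset contentTypeCharset,
                                     List<Charset> declaredHints) {
        List<Charset> usable = new ArrayList<>();

        if (contentTypeCharset == null) {
            // Rule (1): without a header, only UTF-8 declarations count;
            // declarations of any other charset are ignored.
            for (Charset hint : declaredHints) {
                if (StandardCharsets.UTF_8.equals(hint)) {
                    usable.add(hint);
                }
            }
            return usable;
        }

        // Rule (2): with a header, in-document declarations are ignored,
        // unless one of them says UTF-8 while the header says otherwise.
        // In that conflicting case every hint, header included, is dropped.
        boolean headerIsUtf8 = StandardCharsets.UTF_8.equals(contentTypeCharset);
        for (Charset hint : declaredHints) {
            if (StandardCharsets.UTF_8.equals(hint) && !headerIsUtf8) {
                return usable; // still empty: nothing is trusted
            }
        }
        usable.add(contentTypeCharset);
        return usable;
    }
}
```

In this sketch an empty result means that no declaration is trusted, so the decision would fall back entirely on statistical detection of the byte content.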

This message was sent by Atlassian JIRA
