any23-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (ANY23-418) Take another look at encoding detection
Date Thu, 08 Nov 2018 17:56:00 GMT


ASF GitHub Bot commented on ANY23-418:

Github user lewismc commented on the issue:
    You've brought up an excellent topic for conversation. Tika currently has a batch regression
job which essentially enables them to run over loads of documents and analyze the output.
The result is that they know how changes to the source affect Tika's ability to
do what it claims to do over time. We do not have that in Any23, but I think we should
make an effort to build bridges with the Tika community in this regard, with the aim of
sharing resources (both available computing to run large batch parse jobs, and dataset(s)
we can use to run Any23 over).
    I have been thinking for the longest time now about implementing a ```tika.triplify```
API which would encapsulate Any23 and run it on the Tika data streams, but I just never got
around to it. Maybe now is a better time to bring that idea back to life.
    I was thinking we could possibly use Common Crawl, but AFAIK they do not publish the raw
data; it is the Nutch segments or some alternative, e.g. the WebArchive files.

> Take another look at encoding detection
> ---------------------------------------
>                 Key: ANY23-418
>                 URL:
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: encoding
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Priority: Major
>             Fix For: 2.3
> In order to address various shortcomings of Tika encoding detection, I've had to modify
the TikaEncodingDetector several times. Cf. ANY23-385 and ANY23-411. In the former, I placed
a much greater weight on detected charsets declared in html meta elements & xml declarations.
In the latter, I placed a much greater weight on charsets returned in HTTP Content-Type headers.
> However, after taking a look at TIKA-539, I'm thinking I should reduce this added weight
(at least for html meta elements), and perhaps ignore it altogether (unless it happens to
match UTF-8, since it seems that incorrect declarations usually declare something *other than*
UTF-8 when the correct charset should be UTF-8).
> Something like > 90% of all webpages use UTF-8 encoding, and all of our encoding detection
errors to date have revolved around *something other than UTF-8* being detected when the correct
encoding was actually UTF-8, not the other way around.
> Therefore, what I propose is the following: 
> (1) In the absence of a Content-Type header, any declared hints that the charset is UTF-8
should add to the weight for UTF-8, while any declared hints that the charset is not UTF-8
should be ignored. 
> (2) In the presence of a Content-Type header, any other declared hints should be ignored,
unless they match UTF-8 and do not match the Content-Type header, in which case all hints,
including the Content-Type header, should be ignored.
>  EDIT: The above 2 points are a simplification of what I've actually implemented (specifically,
I don't necessarily ignore non-UTF-8 hints). See PR 131 for details.
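> The two simplified rules above could be sketched roughly as follows. This is an illustrative
sketch only, not the actual TikaEncodingDetector code; the class and method names here are
hypothetical, and (per the EDIT above) the real implementation in PR 131 handles non-UTF-8
hints with more nuance than simply ignoring them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the proposed hint-weighting rules for charset detection.
public class CharsetHintPolicy {

    /**
     * Decides which charset hint (if any) to trust, given an optional charset from the
     * HTTP Content-Type header and an optional in-document declaration (html meta
     * element or xml declaration). Returns null when the statistical detector should
     * decide on its own, with no trusted hint.
     */
    static Charset resolve(Charset contentTypeHeader, Charset declaredHint) {
        if (contentTypeHeader == null) {
            // Rule (1): no Content-Type header. A UTF-8 hint adds weight for UTF-8;
            // any non-UTF-8 hint is ignored.
            return StandardCharsets.UTF_8.equals(declaredHint)
                    ? StandardCharsets.UTF_8 : null;
        }
        // Rule (2): Content-Type header present. Other declared hints are ignored,
        // unless a hint says UTF-8 and contradicts the header -- in that case all
        // hints, including the header itself, are ignored.
        if (StandardCharsets.UTF_8.equals(declaredHint)
                && !StandardCharsets.UTF_8.equals(contentTypeHeader)) {
            return null; // conflicting UTF-8 hint: fall back to pure detection
        }
        return contentTypeHeader;
    }
}
```

> Note the asymmetry: a UTF-8 hint can override a non-UTF-8 header into "detect
statistically", but a non-UTF-8 hint can never override anything, which matches the
observation that detection errors to date have all been non-UTF-8 false positives.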

This message was sent by Atlassian JIRA
