lucene-general mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: [VOTE] Apache Tika 0.4 Release Candidate 2
Date Wed, 15 Jul 2009 18:32:37 GMT
OK, I change my vote to +1.  I'll update Solr as needed.

On Jul 15, 2009, at 9:30 AM, Jukka Zitting wrote:

> Hi,
>
> On Wed, Jul 15, 2009 at 3:00 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>> 3. Did something change such that CONTENT_LANGUAGE is now not being set for
>> HTML? We have a test in Solr that looks for that attribute, and it was
>> passing with 0.3 but is now not passing in 0.4.
>
> This is because of TIKA-208.
>
> We used to use the ICU4J charset detection mechanism to automatically
> detect the encoding of HTML files. ICU4J would also guess the content
> language based on the detected encoding (e.g. a document encoded in
> KOI8-R is most likely written in Russian).
>
> However, this mechanism wasn't as accurate as the encoding detection
> already present in NekoHtml, and language detection based on just the
> encoding is often incorrect.
>
> See TIKA-209 for some ideas on how to make the language detection more
> generic and accurate. For now I think it's better to ship Tika 0.4
> without the earlier flawed CONTENT_LANGUAGE implementation for HTML.
>
> BR,
>
> Jukka Zitting
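
For readers following the thread, the encoding-based language guess described above can be sketched roughly as below. This is only an illustration of the idea, not Tika's or ICU4J's actual code: the class name EncodingLanguageGuess, the method guessLanguage, and the small lookup table are all hypothetical, chosen to show why a guess from the charset alone says nothing for widely shared encodings.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public class EncodingLanguageGuess {
    // Hypothetical, deliberately tiny table (an assumption for illustration,
    // not taken from ICU4J or Tika): encodings tied closely to one language.
    private static final Map<String, String> LANGUAGE_BY_CHARSET = Map.of(
        "KOI8-R", "ru",
        "SHIFT_JIS", "ja",
        "EUC-KR", "ko",
        "BIG5", "zh"
    );

    // Returns a language code only when the charset name is in the table.
    public static Optional<String> guessLanguage(String charsetName) {
        if (charsetName == null) {
            return Optional.empty();
        }
        String key = charsetName.toUpperCase(Locale.ROOT);
        return Optional.ofNullable(LANGUAGE_BY_CHARSET.get(key));
    }

    public static void main(String[] args) {
        // UTF-8 and ISO-8859-1 are shared by many languages, so no guess is made.
        for (String cs : new String[] {"KOI8-R", "UTF-8", "ISO-8859-1"}) {
            System.out.println(cs + " -> " + guessLanguage(cs).orElse("(unknown)"));
        }
    }
}
```

A lookup like this can only answer for encodings that are effectively single-language; for anything else the content itself has to be examined, which appears to be the direction TIKA-209 points at.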


