incubator-lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject [lucy-dev] UTF-8 validity checking
Date Tue, 06 Sep 2011 05:20:44 GMT
Greets,

I've opened up a new issue dealing with UTF-8:

    https://issues.apache.org/jira/browse/LUCY-179

    Tighten UTF-8 validity checks.

Technically, committing the supplied patch is a backwards compatibility break,
though the likelihood of encountering problems seems remote.  To trigger a new
exception, an existing index would have to contain UTF-8-encoded UTF-16
surrogates, or code points above 0x10FFFF.  These would not ordinarily make it
through the usual parsing mechanisms supplied by the Perl core's Encode module.

However, it's also worth noting that the new implementation still passes
"noncharacter" code points such as 0xFFFF.  This is consistent with Lucy's
identity as a library rather than an application.  A conforming UTF-8 decoder
*must* handle noncharacter code points, but an application has the option of
handling them differently.
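
For illustration, here is a rough sketch of the kind of check described above.
It is written in Python for brevity and is not the code from the LUCY-179
patch; the function name `utf8_valid` is hypothetical.  It assumes the strict
rules of the Unicode standard: reject surrogates, reject anything above
0x10FFFF, reject every non-shortest form, and let noncharacters such as 0xFFFF
through.  The patch itself may differ in the details.

```python
def utf8_valid(data: bytes) -> bool:
    """Return True if `data` is well-formed UTF-8 (hypothetical helper)."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            # Single-byte ASCII.
            i += 1
            continue
        elif 0xC2 <= lead <= 0xDF:
            length, cp, minimum = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            length, cp, minimum = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF4:
            length, cp, minimum = 4, lead & 0x07, 0x10000
        else:
            # Stray continuation byte, 0xC0/0xC1 (always non-shortest form),
            # or 0xF5..0xFF (always beyond 0x10FFFF).
            return False
        if i + length > n:
            return False  # truncated sequence
        for byte in data[i + 1:i + length]:
            if byte & 0xC0 != 0x80:
                return False  # not a continuation byte
            cp = (cp << 6) | (byte & 0x3F)
        if cp < minimum:
            return False  # non-shortest form
        if 0xD800 <= cp <= 0xDFFF:
            return False  # UTF-8-encoded UTF-16 surrogate
        if cp > 0x10FFFF:
            return False  # outside the Unicode code space
        # Noncharacters (0xFFFF, 0xFDD0, ...) are deliberately not rejected.
        i += length
    return True
```

With this sketch, `utf8_valid(b"\xEF\xBF\xBF")` (0xFFFF) passes, while
`utf8_valid(b"\xED\xA0\x80")` (the surrogate 0xD800) and
`utf8_valid(b"\xF4\x90\x80\x80")` (0x110000) both fail.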

To craft a malicious index file which exploits an app's handling of
noncharacter code points would require very special circumstances.  In
contrast, allowing non-shortest-form UTF-8 is a classic security risk[1][2].
Non-shortest-form ASCII code points are forbidden both before and after the
application of this patch, and other non-shortest-form handling is improved by
the new policy of blocking surrogates.
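
To make the non-shortest-form risk concrete, here is a small Python sketch of
the classic case covered in [1] and [2].  The `lenient_decode_2byte` function
is a hypothetical stand-in for a sloppy decoder and is not Lucy code;
`strict_decode_ok` just wraps Python's own strict decoder.  The bytes 0xC0
0xAF contain no literal slash, yet a decoder which skips the shortest-form
check turns them into "/" after any byte-level filter has already run.

```python
# Overlong (non-shortest-form) two-byte encoding of "/" (0x2F).
OVERLONG_SLASH = bytes([0xC0, 0xAF])


def lenient_decode_2byte(lead: int, trail: int) -> int:
    """Decode a two-byte sequence WITHOUT a shortest-form check (unsafe)."""
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def strict_decode_ok(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts `data`."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A filter which scans the raw bytes for "/" finds nothing to object to...
filter_sees_slash = b"/" in OVERLONG_SLASH

# ...but the lenient decoder produces the code point for "/" anyway.
decoded = lenient_decode_2byte(OVERLONG_SLASH[0], OVERLONG_SLASH[1])

# A strict decoder refuses the sequence outright.
accepted = strict_decode_ok(OVERLONG_SLASH)
```

Here `filter_sees_slash` is False, `decoded` is 0x2F, and `accepted` is False,
which is why rejecting such sequences at validation time closes the hole.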

I plan to commit in a day or so if there are no objections.

In addition... Like the Lemon-based JSON module, this patch takes
functionality which we had previously relied on the host language to provide
and replaces it with a core implementation which can be shared among all host
languages.  Slowly but surely, the effort that it takes to write a host
language binding is being reduced.

Marvin Humphrey

[1] http://lab.gsi.dit.upm.es/semanticwiki/index.php/Using_UTF-8_Encoding_to_Bypass_Validation_Logic

[2] http://unicode.org/reports/tr36/#UTF-8_Exploit

