lucy-dev mailing list archives

From Marvin Humphrey <>
Subject Re: [lucy-dev] StandardTokenizer has landed
Date Thu, 08 Dec 2011 01:59:27 GMT
On Wed, Dec 07, 2011 at 03:44:13PM +0100, Nick Wellnhofer wrote:
> I can see only two bad things that can happen with invalid UTF-8:
> 1. The tokenizer doesn't detect invalid UTF-8, so it will pass it to  
> other analyzers, possibly creating even more invalid UTF-8.

I think this is OK.  For efficiency reasons, we do not want to require UTF-8
validity at the granularity of the Token during analysis.  How about we
establish these rules for the analysis phase?

  * Input to the analysis chain must be valid UTF-8.
  * Analyzers must be prepared to encounter broken UTF-8 but may either throw
    an exception or produce junk.
  * Broken UTF-8 emitted by an analysis chain should be detected prior to
    Indexer commit.
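To make the last rule concrete, a whole-buffer validity check could look like the sketch below. This is a minimal illustration, not Lucy's actual checker; the function name `utf8_is_valid` is hypothetical. It rejects truncated sequences, stray continuation bytes, overlong encodings, surrogates, and code points above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return true if the first `len` bytes of `buf`
 * form valid UTF-8.  A minimal sketch, not Lucy's internal check. */
static bool
utf8_is_valid(const uint8_t *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t  b = buf[i];
        size_t   need;
        uint32_t cp, min;
        if      (b < 0x80) { i++; continue; }              /* ASCII fast path */
        else if (b < 0xC2) { return false; }               /* continuation or overlong lead */
        else if (b < 0xE0) { need = 1; cp = b & 0x1F; min = 0x80;     }
        else if (b < 0xF0) { need = 2; cp = b & 0x0F; min = 0x800;    }
        else if (b < 0xF5) { need = 3; cp = b & 0x07; min = 0x10000;  }
        else               { return false; }               /* lead byte out of range */
        if (len - i - 1 < need) { return false; }          /* truncated sequence */
        for (size_t j = 1; j <= need; j++) {
            uint8_t c = buf[i + j];
            if ((c & 0xC0) != 0x80) { return false; }      /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min)                     { return false; } /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) { return false; } /* UTF-16 surrogate */
        if (cp > 0x10FFFF)                { return false; } /* beyond Unicode range */
        i += need + 1;
    }
    return true;
}
```

Running a check like this once, just before commit, keeps the analyzers themselves free of per-token validation overhead.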

For the record, we currently perform a UTF-8 validity check on individual
terms within PostingPool.c (during the CB_Mimic_Str() invocations, which
perform internal UTF-8 sanity checking).  This is the right phase for the
check, IMO -- it's after the terms have been sorted and uniqued, so we perform
the validity check once per unique term rather than, say, once per Token, as we
would if we enforced UTF-8 validity at the end of the analysis chain.
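The efficiency argument can be sketched as follows. All names here are hypothetical stand-ins, not Lucy's actual PostingPool code: after sorting, duplicate terms are adjacent, so we only validate a term when it differs from its predecessor.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a real UTF-8 sanity check: here we only reject the
 * bytes 0xC0/0xC1, which can never appear in well-formed UTF-8. */
static bool
term_is_valid_utf8(const char *term) {
    for (const unsigned char *p = (const unsigned char *)term; *p; p++) {
        if (*p == 0xC0 || *p == 0xC1) { return false; }
    }
    return true;
}

static int
cmp_terms(const void *a, const void *b) {
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort the terms, then validate each distinct term exactly once.
 * Returns the number of validity checks performed; invalid terms
 * are tallied through `n_invalid`. */
static size_t
check_unique_terms(const char **terms, size_t n_terms, size_t *n_invalid) {
    size_t checks = 0;
    *n_invalid = 0;
    qsort(terms, n_terms, sizeof(*terms), cmp_terms);
    for (size_t i = 0; i < n_terms; i++) {
        if (i > 0 && strcmp(terms[i], terms[i - 1]) == 0) {
            continue;   /* duplicate of previous term: already checked */
        }
        checks++;
        if (!term_is_valid_utf8(terms[i])) {
            (*n_invalid)++;   /* in Lucy this would surface before commit */
        }
    }
    return checks;
}
```

With a Zipf-like term distribution, the number of unique terms is typically far smaller than the number of tokens, which is where the savings come from.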

> 2. If there's invalid UTF-8 near the end of the input buffer, we might  
> read up to three bytes past the end of the buffer.

I think this is OK, too.  First, this is only a problem for broken analysis
chains.  Second, the typical outcome will be a token with a small amount of
random bogus content, and the Indexer will probably throw an exception prior
to commit anyway rather than leak the content into the index.
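The over-read Nick describes comes from trusting the sequence length advertised by a lead byte near the end of the buffer. If it ever needed hardening, the defensive pattern is to clamp consumption to the bytes that actually remain; a minimal sketch (hypothetical names, not StandardTokenizer's actual code):

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes a UTF-8 lead byte claims its sequence occupies. */
static size_t
utf8_claimed_len(uint8_t lead) {
    if (lead < 0x80) { return 1; }
    if (lead < 0xE0) { return 2; }   /* also covers invalid leads 0x80-0xC1 */
    if (lead < 0xF0) { return 3; }
    return 4;
}

/* Advance past one code point without ever reading beyond buf + len.
 * On malformed or truncated input we still make progress, but only
 * over bytes that actually exist.  Returns the bytes consumed. */
static size_t
utf8_safe_advance(const uint8_t *buf, size_t len) {
    if (len == 0) { return 0; }
    size_t claimed = utf8_claimed_len(buf[0]);
    size_t n = 1;
    /* Consume continuation bytes, stopping at the buffer end even if
     * the lead byte promised more. */
    while (n < claimed && n < len && (buf[n] & 0xC0) == 0x80) {
        n++;
    }
    return n;
}
```

The cost is one extra bounds comparison per continuation byte, which is why deferring validation and tolerating junk on already-broken input is a defensible trade-off here.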

Marvin Humphrey
