lucy-dev mailing list archives

From Nick Wellnhofer <wellnho...@aevum.de>
Subject Re: [lucy-dev] StandardTokenizer has landed
Date Thu, 08 Dec 2011 21:37:28 GMT
On 08/12/11 02:59, Marvin Humphrey wrote:
> I think this is OK.  For efficiency reasons, we do not want to require UTF-8
> validity at the granularity of the Token during analysis.  How about we
> establish these rules for the analysis phase?
>
>    * Input to the analysis chain must be valid UTF-8.
>    * Analyzers must be prepared to encounter broken UTF-8 but may either throw
>      an exception or produce junk.
>    * Broken UTF-8 emitted by an analysis chain should be detected prior to
>      Indexer commit.

Sounds reasonable.

>> 2. If there's invalid UTF-8 near the end of the input buffer, we might
>> read up to three bytes past the end of the buffer.
>
> I think this is OK, too.  First, this is only a problem for broken analysis
> chains.  Second, the typical outcome will be a token with a small amount of
> random bogus content, and the Indexer will probably throw an exception prior
> to commit anyway rather than leak the content into the index.

But reading past the end of the buffer might cause a segfault. So if we 
want to follow the rules above, we should guard against that.
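A minimal sketch of such a guard (a hypothetical helper, not Lucy's actual code): before reading any continuation bytes, check that the full sequence implied by the lead byte fits inside the buffer, and treat anything that doesn't as invalid rather than reading past the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the byte length of the UTF-8 sequence
 * starting at `ptr`, or 0 if the lead byte is invalid or the sequence
 * would run past `end`.  This only sketches the bounds check discussed
 * above; it does not do full validity checking (e.g. overlong E0/F0
 * forms or surrogates). */
static size_t
utf8_sequence_len(const uint8_t *ptr, const uint8_t *end) {
    if (ptr >= end) { return 0; }
    uint8_t head = *ptr;
    size_t  len;
    if      (head < 0x80) { len = 1; }            /* ASCII            */
    else if (head < 0xC2) { return 0; }           /* stray cont./overlong */
    else if (head < 0xE0) { len = 2; }
    else if (head < 0xF0) { len = 3; }
    else if (head < 0xF5) { len = 4; }
    else                  { return 0; }           /* 0xF5..0xFF invalid */
    /* The guard: never read continuation bytes past the buffer end. */
    if ((size_t)(end - ptr) < len) { return 0; }
    for (size_t i = 1; i < len; i++) {
        if ((ptr[i] & 0xC0) != 0x80) { return 0; }
    }
    return len;
}
```

With a check like this, a truncated multi-byte sequence at the end of the input produces an invalid-sequence result instead of a read past the buffer, so a broken analysis chain can at worst emit junk, never segfault.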

Nick


