lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-2384) Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.
Date Thu, 08 Apr 2010 11:26:36 GMT


Robert Muir commented on LUCENE-2384:

bq. For JFlex this does not help as the JFlex-generated code always needs a Reader.

This can be fixed. Currently all I/O in all tokenizers is broken and buggy, and does not correctly
handle special cases around their 'buffering'.

The only one that is correct is CharTokenizer, but at what cost? It has so much complexity
because of this Reader issue.

We should stop pretending like we can really stream docs with Reader.
We should stop pretending like 8GB documents or something exist, where we can't just analyze
the whole doc at once and make things simple.
And then we can fix the lucene tokenizers to be correct.
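
To illustrate the "analyze the whole doc at once" idea, here is a minimal sketch, not Lucene code. It assumes the document fits in memory; the class and method names (WholeDocSketch, readFully, letterTokens) are hypothetical. Once the text is a String, token boundaries are plain offsets and there is no refill or partial-buffer case to handle.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: read the whole document first, then tokenize it. */
class WholeDocSketch {

    /** Drains the Reader into a String (assumes the document fits in memory). */
    static String readFully(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[8192];
        int n;
        while ((n = reader.read(chunk)) != -1) {
            sb.append(chunk, 0, n);
        }
        return sb.toString();
    }

    /** Splits on non-letters, walking code points so surrogate pairs stay intact. */
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // offset of the current token, or -1 if outside a token
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                tokens.add(text.substring(start, i));
                start = -1;
            }
            i += Character.charCount(cp);
        }
        if (start >= 0) {
            tokens.add(text.substring(start)); // token running to end of input
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        String doc = readFully(new StringReader("foo bar, baz"));
        System.out.println(letterTokens(doc));
    }
}
```

The trade-off is memory: the whole document is held at once, which is exactly the assumption the comment above argues is acceptable.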

> Reset zzBuffer in StandardTokenizerImpl* when lexer is reset.
> -------------------------------------------------------------
>                 Key: LUCENE-2384
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: Analysis
>    Affects Versions: 3.0.1
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>         Attachments: reset.diff
> When indexing large documents, the lexer buffer may stay large forever. This sub-issue
> resets the lexer buffer back to the default on reset(Reader).
> This is done on the enclosing issue.
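
The behaviour described in the issue can be sketched roughly as follows. This is a hypothetical stand-in and not the attached reset.diff or the real StandardTokenizerImpl: the class SketchLexer, its fields, and the default size of 16384 chars are assumptions made for illustration. The buffer grows while reading a large input, and reset(Reader) drops the oversized array so it does not stay large forever.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;

/** Hypothetical stand-in for a JFlex-style lexer with a growable buffer. */
class SketchLexer {
    // Assumed default size for this sketch.
    static final int DEFAULT_BUFFER_SIZE = 16384;

    private Reader in;
    private char[] buffer = new char[DEFAULT_BUFFER_SIZE];
    private int length; // number of valid chars currently in the buffer

    SketchLexer(Reader in) {
        this.in = in;
    }

    /** Reads all remaining input, doubling the buffer whenever it fills up. */
    void fill() throws IOException {
        int n;
        while ((n = in.read(buffer, length, buffer.length - length)) != -1) {
            length += n;
            if (length == buffer.length) {
                buffer = Arrays.copyOf(buffer, buffer.length * 2);
            }
        }
    }

    /** Switches to a new input and shrinks the buffer back to the default size. */
    void reset(Reader newIn) {
        in = newIn;
        length = 0;
        if (buffer.length > DEFAULT_BUFFER_SIZE) {
            buffer = new char[DEFAULT_BUFFER_SIZE]; // let the large array be collected
        }
    }

    int capacity() {
        return buffer.length;
    }

    public static void main(String[] args) throws IOException {
        SketchLexer lexer = new SketchLexer(new StringReader(new String(new char[100_000])));
        lexer.fill();
        System.out.println("after large doc: " + lexer.capacity());
        lexer.reset(new StringReader("small"));
        System.out.println("after reset:     " + lexer.capacity());
    }
}
```

Without the size check in reset, a single very large document would leave every later, smaller document paying for the enlarged buffer.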

