lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hoss Man (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4489) improve LimitTokenCountFilter and/or it's tests
Date Wed, 17 Oct 2012 23:34:05 GMT


Hoss Man commented on LUCENE-4489:

In SOLR-3961 rmuir and i were discussing the tests for LimitTokenCountFilter.  it seems to
me like there are at least 2 things that should be improved in the tests, and one possible
improvement to the code itself...

A) fix TestLimitTokenCountAnalyzer to use MockTokenizer...

bq. 3) the closest thing i see to a test of LimitTokenCountFilter is TestLimitTokenCountAnalyzer
- i realize now the reason it's testLimitTokenCountAnalyzer doesn't get the same failure is
because it's wrapping WhitespaceAnalyzer, StandardAnalyzer - should those be changed to use

I think we should always do this!

B) testLimitTokenCountAnalyzer.testLimitTokenCountIndexWriter smells fishy to me, still discussing
with rmuir...

bq. 4) TestLimitTokenCountAnalyzer also has a testLimitTokenCountIndexWriter that uses MockAnalyzer
w/o calling setEnableChecks(false) which seems like it should trigger the same failure i got
since it uses MockTokenizer, but in general that test looks suspicious, as it seems to add
the exact number of tokens that the limit is configured for, and then asserts that the last
token is in the index - but never actaully triggers the limiting logic since exactly the allowed
umber of tokens are used.

Then thats fine, because when LimitTokenCountFilter consumes the whole stream, its a "good
consumer". its only when it actually truncates that it breaks the tokenstream contract.

My concern wasn't that i thought that test was fishy just because it didn't call setEnableChecks(false)
-- my concern was that i don't see the point of the test at all as it's currently written,
because it doesn't actually trigger the token limiting behavior (if it did it would _have_
to call setEnableChecks(false) ... so what exactly is it suppose to be testing?

C) going back to a comment rmuir made earlier in SOLR-3961...

bq. Thats because LimitTokenCountFilter violates the workflow of the standard tokenstream
API ... it cuts off its consumer and calls end() even while that consumer still has more tokens:
this could easily cause unexpected side effects and bugs. ... But we knew about this anyway

This makes me wonder if it would be worth while to add an option to  LimitTokenCountFilter
to make it always consume all the tokens from the stream it wraps -- it could be off by default
so it would be havle exactly like today, but if you are wrapping a TokenStream where it's
really important to you that every token is consumed before end() is called, you could set
it appropriately.

would that make sense?
> improve LimitTokenCountFilter and/or it's tests
> -----------------------------------------------
>                 Key: LUCENE-4489
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Hoss Man
> spinning off a discussion about LimitTokenCountFilter  and it's tests from SOLR-3961
(which was about a specific bug in the LimitTokenCountFilterFactory)

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message