lucene-java-user mailing list archives

From "Jack Krupansky" <>
Subject Re: WhiteSpaceTokenizer
Date Fri, 15 Aug 2014 12:24:56 GMT
Sure, that should be a configurable option.

Oh, and I neglected to mention a workaround: use the pattern tokenizer, 
which doesn't have a limit (yet.) But it might be slower.
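
For illustration only, here is a minimal sketch of the idea behind that workaround. It uses plain java.util.regex rather than Lucene's own pattern tokenizer, and the class and method names are hypothetical. The point is that splitting on a whitespace pattern places no cap on the length of a token:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a pattern-based tokenizer; not a Lucene class.
public class RegexWhitespaceSplit {
    // One or more whitespace characters act as the token delimiter.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : WHITESPACE.split(text)) {
            // split() yields an empty leading piece when the text starts with whitespace.
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        StringBuilder longToken = new StringBuilder();
        for (int i = 0; i < 300; i++) {
            longToken.append('x');
        }
        // The 300-character token is kept whole; no length limit is applied.
        for (String token : tokenize("short " + longToken + " tail")) {
            System.out.println(token.length());
        }
    }
}
```

A regex split like this is usually slower than a character-by-character scan, which matches the caveat above.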

-- Jack Krupansky

-----Original Message----- 
From: Sheng
Sent: Friday, August 15, 2014 8:13 AM
Subject: Re: WhiteSpaceTokenizer

Thanks, Jack. I haven't added myself to the contributor list yet; I will do
that, then log in and comment on that ticket. One quick comment:
wouldn't it be more reasonable to throw an exception if a token's length is more
than 255, if relaxing that limit is still debatable? This way the user would
know immediately that something is wrong.
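
A rough sketch of that fail-fast idea, in plain Java rather than as a Lucene Tokenizer subclass. The class name, method name, and choice of exception type are assumptions made for illustration; 255 is the limit discussed in this thread, passed as a parameter so it stays configurable:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical fail-fast whitespace splitter; not part of Lucene.
public class StrictWhitespaceSplit {
    // The limit mentioned in this thread, used here as a default.
    public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;

    public static List<String> tokenize(String text, int maxTokenLength) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // index where the current token began, or -1 between tokens
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!boundary) {
                if (start < 0) {
                    start = i;
                }
                // Fail as soon as the token grows past the limit instead of continuing silently.
                if (i - start + 1 > maxTokenLength) {
                    throw new IllegalArgumentException("Token starting at offset " + start
                            + " is longer than " + maxTokenLength + " characters");
                }
            } else if (start >= 0) {
                tokens.add(text.substring(start, i));
                start = -1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello wide world", DEFAULT_MAX_TOKEN_LENGTH));
    }
}
```

With this shape, a token of exactly 255 characters is accepted and one of 256 characters raises the exception, so the caller finds out at indexing time.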

On Friday, August 15, 2014, Jack Krupansky <> wrote:

> Yeah, it should be documented better, and configurable.
> Some discussion of related issues here:
> I actually filed a Jira for this already. No action so far, but PLEASE
> feel free to comment on it:
> -- Jack Krupansky
> -----Original Message-----
> From: Sheng
> Sent: Thursday, August 14, 2014 11:38 PM
> To:
> Subject: WhiteSpaceTokenizer
> The length of a token has to be less than 255; otherwise this tokenizer
> behaves unpredictably. I see that 255 is set as a
> private final in the source code, but there is no documentation that explicitly
> addresses it. Can we either make that number configurable (if that is not an
> option, I'd like to know why), or add a note to its Javadoc? I had a
> hard time figuring that out...
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail: