lucene-java-user mailing list archives

From Steve Rowe <>
Subject Re: Extending StandardTokenizer Jflex to not split on '/'
Date Fri, 14 Feb 2014 20:55:57 GMT
Welcome Diego,

I think you’re right about MidLetter - adding a char to it should disable splitting on that
char, as long as there is a letter on each side of it.  (If you’d like that behavior
to be extended to numeric digits, you should use MidNumLet instead.)
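
To make the rule concrete, here is a rough sketch in plain Java. It is a toy model of the
letter-on-both-sides idea only, not the JFlex-generated scanner and not Lucene API; the
class MidLetterSketch and its tokenize() helper are hypothetical names, and it ignores
the rest of the word-break rules:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the MidLetter idea; NOT the JFlex-generated StandardTokenizerImpl.
final class MidLetterSketch {

    // Hypothetical helper: splits text into runs of letters/digits, keeping a
    // character from midLetters inside a token only when a letter sits directly
    // on both sides of it.
    static List<String> tokenize(String text, String midLetters) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean joins = midLetters.indexOf(c) >= 0
                    && i > 0 && Character.isLetter(text.charAt(i - 1))
                    && i + 1 < text.length() && Character.isLetter(text.charAt(i + 1));
            if (Character.isLetterOrDigit(c) || joins) {
                current.append(c);
            } else if (current.length() > 0) {
                // Any other character ends the current token and is dropped.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("/one/two/three/ four", "/")); // [one/two/three, four]
        System.out.println(tokenize("1/two/3", "/"));              // [1, two, 3]
    }
}
```

In this sketch the slash in "1/two/3" still splits because a digit, not a letter, sits on
one side of it, which is the case MidNumLet is meant to cover in the real grammar.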

I tested this by adding “/” to MidLetter in StandardTokenizerImpl.jflex (compressed whitespace
diff below):

    -MidLetter = (\p{WB:MidLetter}    | {MidLetterSupp})
    +MidLetter = ([/\p{WB:MidLetter}] | {MidLetterSupp})

then running ‘ant jflex’ under lucene/analysis/common/, and the following text was split
as indicated (I tested by adding the method below to

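  public void testMidLetterSlash() throws Exception {
    // '/' between letters no longer splits; leading and trailing '/' are dropped
    BaseTokenStreamTestCase.assertAnalyzesTo(a, "/one/two/three/ four",
                                  new String[] { "one/two/three", "four" });
    // '/' next to a digit still splits, since MidLetter only applies between letters
    BaseTokenStreamTestCase.assertAnalyzesTo(a, "1/two/3",
                                  new String[] { "1", "two", "3" });
  }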

So it works for me - are you regenerating the scanner (‘ant jflex’)?

FYI, I found a bug when I was testing the above: “” is left intact when
“/” is added to MidLetter, but it shouldn’t be; although ‘:’ and ‘/’ are in
[/\p{WB:MidLetter}], the letter-on-both-sides requirement should instead result in “”
being split into “http” and “”.  Further testing indicates that this is
a problem for MidLetter, MidNumLet and MidNum.  I’ve filed an issue: <>.


On Feb 14, 2014, at 1:42 PM, Diego Fernandez <> wrote:

> Hi guys, this is my first time posting on the Lucene list, so hello everyone.
> I really like the way that the StandardTokenizer works, however I'd like for it to not
> split tokens on / (forward slash).  I've been looking at
> to try to understand the rules, but I'm either misunderstanding or missing something.  If
> I understand correctly, the symbols in MidLetter keep it from splitting a token as long as
> there's alpha chars on either side.  I tried adding the forward slash to the MidLetter and
> MidLetterSupp rules (tried different combinations), but it still seems like it's splitting
> on it.
> Does anyone have any tips or ideas?
> Thanks
> Diego Fernandez - 爱国
> Software Engineer
> US GSS Supportability - Diagnostics
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
