lucene-java-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: Extending StandardTokenizer Jflex to not split on '/'
Date Fri, 14 Feb 2014 20:55:57 GMT
Welcome Diego,

I think you’re right about MidLetter: adding a char to it should disable splitting on that
char, as long as there is a letter on both sides of it.  (If you’d like that behavior
to be extended to numeric digits, you should use MidNumLet instead.)
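
To make the rule concrete, here is a minimal toy sketch in plain Java. It is not Lucene's StandardTokenizer and not a full UAX#29 implementation; the class name MidLetterSketch and the tokenize method are hypothetical, and it only models the assumption that a MidLetter char stays inside a token when a letter sits directly on both sides of it.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only: not Lucene's StandardTokenizer, not full UAX#29. */
class MidLetterSketch {

    /**
     * Splits text into runs of letters and digits. A char from midLetters is
     * kept inside a token only when a letter sits directly on both sides of it.
     */
    static List<String> tokenize(String text, String midLetters) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Assumed rule: letter, mid-letter char, letter does not break.
            boolean joins = midLetters.indexOf(c) >= 0
                    && i > 0 && Character.isLetter(text.charAt(i - 1))
                    && i + 1 < text.length() && Character.isLetter(text.charAt(i + 1));
            if (Character.isLetterOrDigit(c) || joins) {
                current.append(c);
            } else if (current.length() > 0) {
                // Any other char ends the current token and is dropped.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("/one/two/three/ four", "/")); // [one/two/three, four]
        System.out.println(tokenize("1/two/3", "/"));              // [1, two, 3]
    }
}
```

With "/" passed as the only mid-letter char, the slash survives between letters but is dropped next to a digit, a space, or the start of the text.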

I tested this by adding "/" to MidLetter in StandardTokenizerImpl.jflex (whitespace-compressed
diff below):

    -MidLetter = (\p{WB:MidLetter}    | {MidLetterSupp})
    +MidLetter = ([/\p{WB:MidLetter}] | {MidLetterSupp})

I then ran ‘ant jflex’ under lucene/analysis/common/, and the following text was split
as indicated (I tested by adding the method below to TestStandardAnalyzer.java):

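  public void testMidLetterSlash() throws Exception {
    // Slashes between letters are kept; leading and trailing slashes are dropped.
    BaseTokenStreamTestCase.assertAnalyzesTo(a, "/one/two/three/ four",
                                  new String[] { "one/two/three", "four" });
    // A slash with a digit on either side still splits.
    BaseTokenStreamTestCase.assertAnalyzesTo(a, "1/two/3",
                                  new String[] { "1", "two", "3" });
  }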

So it works for me - are you regenerating the scanner (‘ant jflex’)?

FYI, I found a bug when I was testing the above: "http://example.com" is left intact when
"/" is added to MidLetter, but it shouldn’t be; although ':' and '/' are in
[/\p{WB:MidLetter}], the letter-on-both-sides requirement should instead result in
"http://example.com" being split into "http" and "example.com".  Further testing indicates
that this is a problem for MidLetter, MidNumLet and MidNum.  I’ve filed an issue:
<https://issues.apache.org/jira/browse/LUCENE-5447>.
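
As a rough illustration of the split I would expect here, the sketch below uses a plain Java regex rather than the generated scanner. The class name ExpectedUrlSplit and its tokens method are hypothetical, the pattern covers letters only, and '.' is included as a stand-in for MidNumLet; it models only the letter-on-both-sides requirement, not the actual grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy illustration only: models the letter-on-both-sides requirement with a regex. */
class ExpectedUrlSplit {

    // A run of letters, optionally extended by one of ':', '/', '.' but only
    // when that char is immediately followed by more letters.
    private static final Pattern TOKEN = Pattern.compile("\\p{L}+(?:[:/.]\\p{L}+)*");

    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("http://example.com")); // [http, example.com]
    }
}
```

The ':' after "http" is followed by '/', not a letter, so the token ends there; the two slashes have no letter on their left, so they are dropped.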

Steve

On Feb 14, 2014, at 1:42 PM, Diego Fernandez <difernan@redhat.com> wrote:

> Hi guys, this is my first time posting on the Lucene list, so hello everyone.
> 
> I really like the way that the StandardTokenizer works, however I'd like for it to not
> split tokens on / (forward slash).  I've been looking at http://unicode.org/reports/tr29/#Default_Word_Boundaries
> to try to understand the rules, but I'm either misunderstanding or missing something.  If
> I understand correctly, the symbols in MidLetter keep it from splitting a token as long as
> there's alpha chars on either side.  I tried adding the forward slash to the MidLetter and
> MidLetterSupp rules (tried different combinations), but it still seems like it's splitting
> on it.
> 
> Does anyone have any tips or ideas?
> 
> Thanks
> 
> Diego Fernandez - 爱国
> Software Engineer
> US GSS Supportability - Diagnostics
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 

