lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Miller <markrmil...@gmail.com>
Subject Re: Phrase search using quotes -- special Tokenizer
Date Fri, 01 Sep 2006 14:09:14 GMT
So this will recognize anything in quotes as a single token and '_' and 
'-' will not break up words. There may be some repercussions for the NUM 
token but nothing I'd worry about. maybe you want to use Unicode for '-' 
and '_' as well...I wouldn't worry about it myself.

- Mark


TOKEN : {                      // token patterns

  // basic word: a sequence of digits & letters
  <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >

| <QUOTED:     "\"" (~["\""])+ "\"">

  // internal apostrophes: O'Reilly, you're, O'Reilly's
  // use a post-filter to remove possesives
| <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >

  // acronyms: U.S.A., I.B.M., etc.
  // use a post-filter to remove dots
| <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >

  // company names like AT&T and Excite@Home.
| <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >

  // email addresses
| <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM> 
(("."|"-") <ALPHANUM>)+ >

  // hostname
| <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >

  // floating point, serial, model numbers, ip addresses, etc.
  // every other segment must have at least one digit
| <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
       | <HAS_DIGIT> <P> <ALPHANUM>
       | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
       | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
       | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P>
<HAS_DIGIT>)+
       | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P>
<ALPHANUM>)+
        )
  >
| <#P: ("_"|"-"|"/"|"."|",") >
| <#HAS_DIGIT:                      // at least one digit
    (<LETTER>|<DIGIT>)*
    <DIGIT>
    (<LETTER>|<DIGIT>)*
  >

| < #ALPHA: (<LETTER>)+>
| < #LETTER:                      // unicode letters
      [
       "\u0041"-"\u005a",
       "\u0061"-"\u007a",
       "\u00c0"-"\u00d6",
       "\u00d8"-"\u00f6",
       "\u00f8"-"\u00ff",
       "\u0100"-"\u1fff",
       "-", "_"
      ]
  >

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message