lucene-java-user mailing list archives

From "Lixin Meng" <li...@fulldegree.com>
Subject RE: search item with '-' in it
Date Wed, 04 Jun 2003 22:04:02 GMT
Thanks for the tip. The analyzer does tokenize "SG-XRRH-C1M0-A" into 'SG'
and 'XRRH-C1M0-A'.

Accepting hyphenations 'when digits are included on one side or the other'
is indeed a 'heuristic' :).

I might take your suggestion of using a keyword field. However, in the more
general case, where a block of text contains hyphenated words, the keyword
workaround doesn't apply.

Therefore, it would be preferable to treat all hyphens the same way: either
as a delimiter or as part of the word (maybe with a flag in the API).
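
To make the request above concrete, here is a minimal sketch of what a uniform
hyphen policy could look like. It is plain Java and not part of Lucene: the
class name UniformHyphenSketch and the hyphenIsDelimiter flag are hypothetical
names chosen for illustration, and the sketch only splits on whitespace, which
is far simpler than what a real analyzer does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not a Lucene class or API.
class UniformHyphenSketch {

    // Splits text on whitespace. When hyphenIsDelimiter is true, every
    // hyphen also separates tokens; when false, every hyphen stays inside
    // the token. Either way, all hyphens get the same treatment.
    static List<String> tokenize(String text, boolean hyphenIsDelimiter) {
        String pattern = hyphenIsDelimiter ? "[\\s-]+" : "\\s+";
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split(pattern)) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

With the flag set, "part SG-XRRH-C1M0-A" is split at every hyphen; with the
flag cleared, the part number stays together as one token.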

Again, thanks for all the help.

Regards,
Lixin

-----Original Message-----
From: Doug Cutting [mailto:cutting@lucene.com]
Sent: Wednesday, June 04, 2003 9:59 AM
To: Lucene Users List
Subject: Re: search item with '-' in it


You should look at the output of your analyzer.  Just write a simple
test program, something like:

   public static void main(String[] args) throws Exception {
     System.out.println("Tokenizing " + args[0]);
     Analyzer analyzer = new MyAnalyzer(...);
     TokenStream ts = analyzer.tokenStream(new StringReader(args[0]));
     Token token;
     while ((token = ts.next()) != null) {
       System.out.println("Token: " + token.termText());
     }
   }

StandardAnalyzer will accept hyphenations when digits are included on
one side or the other.  This is a heuristic used to index things like
part numbers (which contain digits) as a single word but not index
things like "long-hyphenated-phrase" as a single word.  It may not be
appropriate for your application.
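
The heuristic described above can be approximated in a few lines of plain
Java. This is a simplified sketch, not StandardAnalyzer's actual grammar: it
assumes a hyphen is kept whenever the segment on either side of it contains a
digit, and it skips lowercasing, stop words and stemming. The class and method
names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified approximation of the "digits on one side or the other"
// heuristic; this is not Lucene code.
class HyphenHeuristicSketch {

    // True if the segment contains at least one digit.
    static boolean hasDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Splits one whitespace-free word on '-', then re-joins neighbouring
    // segments whenever either side of the hyphen contains a digit.
    static List<String> tokenize(String word) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String previous = null;
        for (String segment : word.split("-")) {
            if (segment.isEmpty()) {
                continue; // ignore empty pieces from doubled hyphens
            }
            if (previous != null && (hasDigit(previous) || hasDigit(segment))) {
                current.append('-').append(segment); // keep the hyphen
            } else {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                }
                current = new StringBuilder(segment); // start a new token
            }
            previous = segment;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("SG-XRRH-C1M0-A"));         // [SG, XRRH-C1M0-A]
        System.out.println(tokenize("long-hyphenated-phrase")); // [long, hyphenated, phrase]
    }
}
```

Under this approximation "SG-XRRH-C1M0-A" comes out as 'SG' and 'XRRH-C1M0-A',
because neither "SG" nor "XRRH" contains a digit while "C1M0" does.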

Also, a part number field might better be indexed as a keyword field...

Doug

Lixin Meng wrote:
> I have a field, 'PartNumber', that has '-' in its value (e.g.
> SG-XRRH-C1M0-A).
>
> After indexing, I can perform certain queries. However, I am having
> trouble explaining the behavior.
>
> - if searching for
> 	PartNumber:"SG"
>   it will return multiple hits. I assume the analyzer might take out '-'.
>
> - if searching for
> 	PartNumber:"XRRH"
>   it will return no hits. So, the above assumption doesn't hold. :)
>
> - if searching for
> 	PartNumber:"SG-XRRH-C1M0-A"
>   it will return one hit
>
> - if searching for
>       PartNumber:"sg-xrrh-c1m0-a*"
>   it will return one hit. So far so good
>
> - if searching for
>       PartNumber:sg-xrrh-c1m0-a*
>   it will return multiple hits which even include things like
> "SG-XSWBRO...". Why?
>
> - if searching for
>       PartNumber:"sg-xrrh-c1m0*"
>   no hit. Why?
>
> Any comments?
>
> Regards,
> Lixin
>
> P.S. I used the following filters:
>
>     result = new StandardFilter(result);
>     result = new LowerCaseFilter(result);
>     result = new StopFilter(result, m_StopWordTable);
>     result = new PorterStemFilter(result);
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>

