lucene-java-user mailing list archives

From "Dino Korah" <dcko...@gmail.com>
Subject RE: Case Sensitivity
Date Fri, 22 Aug 2008 09:50:02 GMT
That is very clever. With that, the text we index will go through the
analyser but will not get tokenized. It will hit the analyser the same way
when we search, again untokenized.

Brilliant!!


-----Original Message-----
From: Andre Rubin [mailto:andre.rubin@gmail.com] 
Sent: 21 August 2008 08:21
To: java-user@lucene.apache.org
Subject: Re: Case Sensitivity

Just to add to that: as I said before, in my case I found it more useful not
to use UN_TOKENIZED. Instead, I used TOKENIZED with a custom analyzer that
chains the KeywordTokenizer (which emits the entire input as a single token)
with the LowerCaseFilter. This way I get the best of both worlds.

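import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

// Treats the whole field value as a single token and lowercases it.
public class KeywordLowerAnalyzer extends Analyzer {

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // KeywordTokenizer emits the entire input as one token.
        TokenStream result = new KeywordTokenizer(reader);
        // LowerCaseFilter downcases that token for case-insensitive matching.
        result = new LowerCaseFilter(result);
        return result;
    }

}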
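
To make the effect of that chain concrete, here is a minimal sketch that models it with the Java standard library only. It is not Lucene code: the class name KeywordLowerModel and its methods are hypothetical stand-ins, and the whitespace tokenizer is included purely for contrast.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Stdlib-only model of "keyword tokenizer, then lowercase filter". Not Lucene code. */
final class KeywordLowerModel {

    /** Keyword tokenization: the whole input becomes exactly one token. */
    static List<String> keywordTokenize(String input) {
        return Collections.singletonList(input);
    }

    /** Whitespace tokenization, shown only for contrast (an assumption of this sketch). */
    static List<String> whitespaceTokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : input.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    /** Lowercase filter: downcases every token in the list. */
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** The modeled chain: keyword tokenizer followed by the lowercase filter. */
    static List<String> analyze(String input) {
        return lowerCase(keywordTokenize(input));
    }
}
```

In this model, analyze("Dino Korah") yields a single token that still contains the space, whereas whitespaceTokenize("Dino Korah") yields two tokens. Because the same chain runs at index time and at query time, the value matches as a whole regardless of case.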

On Wed, Aug 20, 2008 at 10:21 AM, Dino Korah <dckorah@gmail.com> wrote:
> Hi Steve,
>
> Thanks a lot for that.
>
> I have a question on TokenStreams and email addresses, but I will post it
> on a separate thread.
>
> Many thanks,
> Dino
>
>
> -----Original Message-----
> From: Steven A Rowe [mailto:sarowe@syr.edu]
> Sent: 19 August 2008 17:43
> To: java-user@lucene.apache.org
> Subject: RE: Case Sensitivity
>
> Hi Dino,
>
> I think you'd benefit from reading some FAQ answers, like:
>
> "Why is it important to use the same analyzer type during indexing and 
> search?"
> <http://wiki.apache.org/lucene-java/LuceneFAQ#head-0f374b0fe1483c90fe7d6f2c44472d10961ba63c>
>
> Also, have a look at the AnalysisParalysis wiki page for some hints:
> <http://wiki.apache.org/lucene-java/AnalysisParalysis>
>
> On 08/19/2008 at 8:57 AM, Dino Korah wrote:
>> From the discussion here what I could understand was, if I am using 
>> StandardAnalyzer on TOKENIZED fields, for both Indexing and Querying, 
>> I shouldn't have any problems with cases.
>
> If by "shouldn't have problems with cases" you mean "can match 
> case-insensitively", then this is true.
>
>> But if I have any UN_TOKENIZED fields there will be problems if I do 
>> not case-normalize them myself before adding them as a field to the 
>> document.
>
> Again, assuming that by "case-normalize" you mean "downcase", and that 
> you want case-insensitive matching, and that you use the 
> StandardAnalyzer (or some other downcasing analyzer) at query time, then
> this is true.
>
>> In my case I have a mixed scenario. I am indexing emails and the 
>> email addresses are indexed UN_TOKENIZED. I also have a second set of 
>> custom tokenized fields, which keep the tokens in individual fields 
>> with the same name.
> [...]
>> Does it mean that wherever I use UN_TOKENIZED, the values do not go 
>> through the StandardAnalyzer before getting indexed, but they do when 
>> they are searched on?
>
> This is true.
>
>> If that is the case, do I need to normalise them before adding them to 
>> the document?
>
> If you want case-insensitive matching, then yes, you do need to 
> normalize them before adding them to the document.
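
The mismatch described above can be illustrated with a small stdlib-only sketch. It is a toy model, not Lucene code: the class UntokenizedFieldModel and its methods are hypothetical, the "index" is just a set of terms, and the example address is made up.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy model of an UN_TOKENIZED field searched through a downcasing analyzer. Not Lucene code. */
final class UntokenizedFieldModel {

    private final Set<String> terms = new HashSet<>();

    /** Models an UN_TOKENIZED field: the value is indexed as one term, exactly as given. */
    void addVerbatim(String value) {
        terms.add(value);
    }

    /** Hypothetical helper: downcase a value before it is added to the document. */
    static String normalize(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    /** Models a query term that has passed through a downcasing analyzer. */
    boolean matchesAnalyzedQuery(String queryText) {
        return terms.contains(queryText.toLowerCase(Locale.ROOT));
    }
}
```

In this model, a value added verbatim as "Dino@Example.COM" is not found by a query for the same address, because the query side is downcased and the indexed term is not. Adding normalize("Dino@Example.COM") instead makes the lookup succeed whatever case the query uses.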
>
>> I also would like to know if it is better to employ an EmailAnalyzer 
>> that makes a TokenStream out of the given email address, rather than 
>> using a simplistic function that gives me a list of string pieces and 
>> adding them one by one. With searches, would both the approaches give 
>> same result?
>
> Yes, both approaches give the same result.  When you add string pieces 
> one-by-one, you are adding multiple same-named fields. By contrast, 
> the EmailAnalyzer approach would add a single field, and would allow 
> you to control positions (via Token.setPositionIncrement():
> <http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/Token.html#setPositionIncrement(int)>),
> e.g. to improve phrase handling.
> Also, if you make up an EmailAnalyzer, you can use it to search 
> against your tokenized email field, along with other analyzer(s) on 
> other field(s), using the PerFieldAnalyzerWrapper
> <http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html>.
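
The per-field idea can be sketched with the standard library alone. This is a model of the dispatch, not the Lucene class itself: PerFieldAnalysisModel is a hypothetical name, analyzers are represented as plain functions from text to tokens, the whitespace-based default is only a stand-in for a general-purpose analyzer, and the way emailPieces splits an address is an assumption, since the thread does not say how the pieces are produced.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Stdlib-only model of per-field analyzer dispatch. Not Lucene code. */
final class PerFieldAnalysisModel {

    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldAnalysisModel(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    /** Registers a dedicated analyzer for one field name. */
    void addAnalyzer(String fieldName, Function<String, List<String>> analyzer) {
        perField.put(fieldName, analyzer);
    }

    /** Uses the field's own analyzer if one is registered, otherwise the default. */
    List<String> analyze(String fieldName, String text) {
        return perField.getOrDefault(fieldName, defaultAnalyzer).apply(text);
    }

    /** Stand-in for a general analyzer: split on whitespace, then downcase. */
    static List<String> wordsLower(String text) {
        return splitLower(text, "\\s+");
    }

    /** Hypothetical email analysis: split on '@' and '.', then downcase. */
    static List<String> emailPieces(String text) {
        return splitLower(text, "[@.]+");
    }

    private static List<String> splitLower(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split(regex)) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```

With the model built on wordsLower and emailPieces registered for a field named "email", text for that field is broken into address pieces while every other field falls back to the default. Using the same wrapper for indexing and for query parsing keeps both sides consistent per field.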
>
> Steve
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

