lucene-java-user mailing list archives

From "Dino Korah" <>
Subject RE: Case Sensitivity
Date Tue, 26 Aug 2008 13:17:49 GMT
I think I should rephrase my question.

Context: using the out-of-the-box StandardAnalyzer for both indexing and searching.

Is it right to say that a field, if either UN_TOKENIZED or with norms omitted
( field.setOmitNorms(true) ), doesn't get analyzed while indexing?
That would mean that at search time the query text still goes through the
analyzer, so we need to handle such fields differently in the analyzer we use
for searching?
And does setOmitNorms(true) on a field also mean it doesn't get tokenized?

What is the best solution if one were to add a set of fields UN_TOKENIZED and
others TOKENIZED, with a few of the latter set using setOmitNorms(true) (the
index writer is plain StandardAnalyzer based)? A per-field analyzer at query
time?
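For the mixed setup described above, one common answer is indeed a per-field analyzer at query time. A minimal sketch, assuming Lucene 2.x class names (PerFieldAnalyzerWrapper, KeywordAnalyzer) and a hypothetical "email" field; the Lucene wiring is kept in comments so the helper itself stays self-contained:

```java
public class QueryAnalyzerChoice {
    // Lucene 2.x wiring this helper mirrors (assumed API, shown as comments):
    //
    //   PerFieldAnalyzerWrapper wrapper =
    //       new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    //   wrapper.addAnalyzer("email", new KeywordAnalyzer());
    //   Query q = new QueryParser("body", wrapper).parse(userInput);
    //
    // UN_TOKENIZED fields bypass the analyzer at index time, so they are
    // paired with KeywordAnalyzer (whole value as a single token) at query
    // time; every other field keeps StandardAnalyzer.
    static String chooseAnalyzer(String fieldName) {
        if ("email".equals(fieldName)) {   // hypothetical UN_TOKENIZED field
            return "KeywordAnalyzer";
        }
        return "StandardAnalyzer";         // default for TOKENIZED fields
    }
}
```

Norms are a separate concern: setOmitNorms(true) only drops length/boost normalization data, so it does not by itself change which analyzer a field needs.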

Many thanks,

-----Original Message-----
From: Dino Korah [] 
Sent: 26 August 2008 12:12
To: ''
Subject: RE: Case Sensitivity

A few more case-sensitivity questions.

Based on the discussion on and on this thread, is it right to say that a
field, if either UN_TOKENIZED or with norms omitted, doesn't get analyzed
while indexing? That would mean we need to case-normalize (downcase) those
fields beforehand?

Does it mean that, if I can afford them, I should use norms?

Many thanks,

-----Original Message-----
From: Steven A Rowe []
Sent: 19 August 2008 17:43
Subject: RE: Case Sensitivity

Hi Dino,

I think you'd benefit from reading some FAQ answers, like:

"Why is it important to use the same analyzer type during indexing and searching?"

Also, have a look at the AnalysisParalysis wiki page for some hints:

On 08/19/2008 at 8:57 AM, Dino Korah wrote:
> From the discussion here what I could understand was, if I am using 
> StandardAnalyzer on TOKENIZED fields, for both Indexing and Querying, 
> I shouldn't have any problems with cases.

If by "shouldn't have problems with cases" you mean "can match
case-insensitively", then this is true.

> But if I have any UN_TOKENIZED fields there will be problems if I do 
> not case-normalize them myself before adding them as a field to the 
> document.

Again, assuming that by "case-normalize" you mean "downcase", and that you
want case-insensitive matching, and that you use the StandardAnalyzer (or
some other downcasing analyzer) at query-time, then this is true.

> In my case I have a mixed scenario. I am indexing emails and the email 
> addresses are indexed UN_TOKENIZED. I do have a second set of
> custom-tokenized fields, which keep the tokens in individual fields with
> the same name.
> Does it mean that where ever I use UN_TOKENIZED, they do not get 
> through the StandardAnalyzer before getting Indexed, but they do when 
> they are searched on?

This is true.

> If that is the case, Do I need to normalise them before adding to 
> document?

If you want case-insensitive matching, then yes, you do need to normalize
them before adding them to the document.
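A minimal index-time sketch of that normalization (the "email" field name is illustrative; Field.Store/Field.Index are the Lucene 2.x constants, shown in comments so the helper stays self-contained):

```java
import java.util.Locale;

public class IndexNormalize {
    // UN_TOKENIZED fields skip the analyzer entirely, so downcase the value
    // yourself before it goes into the document, e.g. (Lucene 2.x):
    //
    //   doc.add(new Field("email", normalize(address),
    //                     Field.Store.YES, Field.Index.UN_TOKENIZED));
    //
    static String normalize(String raw) {
        // A fixed locale avoids surprises such as the Turkish dotless i.
        return raw.toLowerCase(Locale.ENGLISH);
    }
}
```

The same normalize() should then be applied to query terms aimed at that field, since KeywordAnalyzer-style handling will not lowercase them either.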

> I also would like to know if it is better to employ an EmailAnalyzer 
> that makes a TokenStream out of the given email address, rather than 
> using a simplistic function that gives me a list of string pieces and 
> adding them one by one. With searches, would both the approaches give 
> same result?

Yes, both approaches give the same result.  When you add string pieces
one-by-one, you are adding multiple same-named fields. By contrast, the
EmailAnalyzer approach would add a single field, and would allow you to
control positions (via Token.setPositionIncrement()), e.g. to improve phrase
handling.  Also, if you write an EmailAnalyzer, you can use it to search
against your tokenized email field, along with other analyzer(s) on other
field(s), using the PerFieldAnalyzerWrapper.
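As a rough illustration of the "list of string pieces" route, here is one way such a helper might look; the splitting rule (break on '@' and '.') is only an assumption about what the simplistic function does:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class EmailPieces {
    // Split an address into lowercased pieces; each piece would then be added
    // as a separate same-named Field. An EmailAnalyzer emitting these same
    // tokens would match identically, while additionally letting you set
    // position increments and plug into PerFieldAnalyzerWrapper.
    static List<String> splitEmail(String address) {
        List<String> pieces = new ArrayList<String>();
        for (String part : address.toLowerCase(Locale.ENGLISH).split("[@.]")) {
            if (part.length() > 0) {
                pieces.add(part);
            }
        }
        return pieces;
    }
}
```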


