lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dino Korah" <dcko...@gmail.com>
Subject FW: Case Sensitivity
Date Thu, 28 Aug 2008 08:58:40 GMT
Looks like my question got unnoticed among the more important Jira
discussion. :(

On the same topic, what would be the effect of the following code.

Document doc = new Document();
Field f = new Field("body", bodyText, Field.Store.NO,
Field.Index.TOKENIZED);
f.setOmitNorms(true);

Would that be equivalent to

Document doc = new Document();
Field f = new Field("body", bodyText, Field.Store.NO ,Field.Index.NO_NORMS);

And Field.Index.TOKENIZED has no effect after f.setOmitNorms(true); ?

Many thanks,
Dino



-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com]
Sent: 27 August 2008 10:37
To: java-user@lucene.apache.org
Subject: Re: Case Sensitivity


Actually, as confusing as it is, Field.Index.NO_NORMS means
Field.Index.UN_TOKENIZED plus field.setOmitNorms(true).

Probably we should rename it to Field.Index.UN_TOKENiZED_NO_NORMS?

Mike

Otis Gospodnetic wrote:

> Dino, you lost me half-way through your email :(
>
> NO_NORMS does not mean the field is not tokenized.
> UN_TOKENIZED does mean the field is not tokenized.
>
>
> Otis--
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
>> From: Dino Korah <dckorah@gmail.com>
>> To: java-user@lucene.apache.org
>> Sent: Tuesday, August 26, 2008 9:17:49 AM
>> Subject: RE: Case Sensitivity
>>
>> I think I should rephrase my question.
>>
>> [ Context: Using out of the box StandardAnalyzer for indexing and
>> searching.
>> ]
>>
>> Is it right to say that a field, if either UN_TOKENIZED or NO_NORMS-
>> ized (
>> field.setOmitNorms(true) ), it doesn't get analyzed while indexing?
>> Which means that when we search, it gets thru the analyzer and we
>> need to analyze them differently in the analyzer we use for
>> searching?
>> Doesn't it mean that a setOmitNorms(true) field also doesn't get
>> tokenized?
>>
>> What is the best solution if one was to add a set of fields
>> UN_TOKENIZED and others TOKENIZED, of the later set a few with
>> setOmitNorms(true) (the index writer is plain StandardAnalyzer
>> based)? A per field analyzer at query time ?!
>>
>> Many thanks,
>> Dino
>>
>>
>> -----Original Message-----
>> From: Dino Korah [mailto:dckorah@gmail.com]
>> Sent: 26 August 2008 12:12
>> To: 'java-user@lucene.apache.org'
>> Subject: RE: Case Sensitivity
>>
>> A little more case sensitivity questions.
>>
>> Based on the discussion on
>> http://markmail.org/message/q7dqr4r7o6t6dgo5
>>  and
>> on this thread, is it right to say that a field, if either
>> UN_TOKENIZED or NO_NORMS-ized, it doesn't get analyzed while
>> indexing? Which means we need to case-normalize (down-case) those
>> fields before hand?
>>
>> Doest it mean that if I can afford, I should use norms.
>>
>> Many thanks,
>> Dino
>>
>>
>>
>> -----Original Message-----
>> From: Steven A Rowe [mailto:sarowe@syr.edu]
>> Sent: 19 August 2008 17:43
>> To: java-user@lucene.apache.org
>> Subject: RE: Case Sensitivity
>>
>> Hi Dino,
>>
>> I think you'd benefit from reading some FAQ answers, like:
>>
>> "Why is it important to use the same analyzer type during indexing
>> and search?"
>>
>> 4472d10961ba63c>
>>
>> Also, have a look at the AnalysisParalysis wiki page for some hints:
>>
>>
>> On 08/19/2008 at 8:57 AM, Dino Korah wrote:
>>> From the discussion here what I could understand was, if I am using
>>> StandardAnalyzer on TOKENIZED fields, for both Indexing and
>>> Querying, I shouldn't have any problems with cases.
>>
>> If by "shouldn't have problems with cases" you mean "can match
>> case-insensitively", then this is true.
>>
>>> But if I have any UN_TOKENIZED fields there will be problems if I do
>>> not case-normalize them myself before adding them as a field to the
>>> document.
>>
>> Again, assuming that by "case-normalize" you mean "downcase", and
>> that you want case-insensitive matching, and that you use the
>> StandardAnalyzer (or some other downcasing analyzer) at query-time,
>> then this is true.
>>
>>> In my case I have a mixed scenario. I am indexing emails and the
>>> email addresses are indexed UN_TOKENIZED. I do have a second set of
>>> custom tokenized field, which keep the tokens in individual fields
>>> with same name.
>> [...]
>>> Does it mean that where ever I use UN_TOKENIZED, they do not get
>>> through the StandardAnalyzer before getting Indexed, but they do
>>> when they are searched on?
>>
>> This is true.
>>
>>> If that is the case, Do I need to normalise them before adding to
>>> document?
>>
>> If you want case-insensitive matching, then yes, you do need to
>> normalize them before adding them to the document.
>>
>>> I also would like to know if it is better to employ an EmailAnalyzer
>>> that makes a TokenStream out of the given email address, rather than
>>> using a simplistic function that gives me a list of string pieces
>>> and adding them one by one. With searches, would both the approaches
>>> give same result?
>>
>> Yes, both approaches give the same result.  When you add string
>> pieces one-by-one, you are adding multiple same-named fields. By
>> contrast, the EmailAnalyzer approach would add a single field, and
>> would allow you to control positions (via
>> Token.setPositionIncrement():
>>
>> ml#setPositionIncrement(int)>), e.g. to improve phrase handling.  
>> Also, if
>> you make up an EmailAnalyzer, you can use it to search against your
>> tokenized email field, along with other analyzer(s) on other
>> field(s), using the PerFieldAnalyzerWrapper
>>
>> AnalyzerWrapper.html>.
>>
>> Steve
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message