lucene-java-user mailing list archives

From "Andre Rubin" <andre.ru...@gmail.com>
Subject Re: Case Sensitivity
Date Thu, 21 Aug 2008 07:21:03 GMT
Just to add to that: as I said before, in my case I found it more useful
not to use UN_TOKENIZED. Instead, I used TOKENIZED with a custom analyzer
that combines the KeywordTokenizer (which emits the entire input as a single
token) with the LowerCaseFilter. This way I get the best of both worlds: the
value is indexed as one term, yet it matches case-insensitively.

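// Package names as of Lucene 2.3.x.
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

// Emits the whole field value as one lowercased token.
public class KeywordLowerAnalyzer extends Analyzer {

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // KeywordTokenizer returns the entire input as a single token.
        TokenStream result = new KeywordTokenizer(reader);
        // LowerCaseFilter downcases that token.
        result = new LowerCaseFilter(result);
        return result;
    }

}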
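
To make the "one lowercased token" behaviour concrete, here is a minimal
sketch in plain Java. It is a toy model, not Lucene code: the class
ToyAnalysis and both of its methods are hypothetical names, and splitLower
is only a stand-in for some generic splitting analyzer, not a model of
StandardAnalyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model in plain Java; these are NOT Lucene classes.
class ToyAnalysis {

    // Approximates KeywordTokenizer + LowerCaseFilter:
    // the whole input becomes a single lowercased token.
    static List<String> keywordLower(String input) {
        List<String> tokens = new ArrayList<>();
        tokens.add(input.toLowerCase(Locale.ROOT));
        return tokens;
    }

    // A hypothetical splitting analyzer for contrast:
    // breaks the input on non-alphanumeric characters, then lowercases.
    static List<String> splitLower(String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : input.split("[^A-Za-z0-9]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```

For the input "User@Example.COM", keywordLower yields the single token
"user@example.com", while splitLower yields "user", "example" and "com".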

On Wed, Aug 20, 2008 at 10:21 AM, Dino Korah <dckorah@gmail.com> wrote:
> Hi Steve,
>
> Thanks a lot for that.
>
> I have a question on TokenStreams and email addresses, but I will post
> them on a separate thread.
>
> Many thanks,
> Dino
>
>
> -----Original Message-----
> From: Steven A Rowe [mailto:sarowe@syr.edu]
> Sent: 19 August 2008 17:43
> To: java-user@lucene.apache.org
> Subject: RE: Case Sensitivity
>
> Hi Dino,
>
> I think you'd benefit from reading some FAQ answers, like:
>
> "Why is it important to use the same analyzer type during indexing and
> search?"
> <http://wiki.apache.org/lucene-java/LuceneFAQ#head-0f374b0fe1483c90fe7d6f2c44472d10961ba63c>
>
> Also, have a look at the AnalysisParalysis wiki page for some hints:
> <http://wiki.apache.org/lucene-java/AnalysisParalysis>
>
> On 08/19/2008 at 8:57 AM, Dino Korah wrote:
>> From the discussion here what I could understand was, if I am using
>> StandardAnalyzer on TOKENIZED fields, for both Indexing and Querying,
>> I shouldn't have any problems with cases.
>
> If by "shouldn't have problems with cases" you mean "can match
> case-insensitively", then this is true.
>
>> But if I have any UN_TOKENIZED fields there will be problems if I do
>> not case-normalize them myself before adding them as a field to the
>> document.
>
> Again, assuming that by "case-normalize" you mean "downcase", and that you
> want case-insensitive matching, and that you use the StandardAnalyzer (or
> some other downcasing analyzer) at query-time, then this is true.
>
>> In my case I have a mixed scenario. I am indexing emails and the email
>> addresses are indexed UN_TOKENIZED. I do have a second set of custom
>> tokenized field, which keep the tokens in individual fields with same
>> name.
> [...]
>> Does it mean that where ever I use UN_TOKENIZED, they do not get
>> through the StandardAnalyzer before getting Indexed, but they do when
>> they are searched on?
>
> This is true.
>
>> If that is the case, Do I need to normalise them before adding to
>> document?
>
> If you want case-insensitive matching, then yes, you do need to normalize
> them before adding them to the document.
>
>> I also would like to know if it is better to employ an EmailAnalyzer
>> that makes a TokenStream out of the given email address, rather than
>> using a simplistic function that gives me a list of string pieces and
>> adding them one by one. With searches, would both the approaches give
>> same result?
>
> Yes, both approaches give the same result.  When you add string pieces
> one-by-one, you are adding multiple same-named fields. By contrast, the
> EmailAnalyzer approach would add a single field, and would allow you to
> control positions (via Token.setPositionIncrement():
> <http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/Token.html#setPositionIncrement(int)>),
> e.g. to improve phrase handling.  Also, if you make up an EmailAnalyzer,
> you can use it to search against your tokenized email field, along with
> other analyzer(s) on other field(s), using the PerFieldAnalyzerWrapper
> <http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html>.
>
> Steve
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

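The point in the quoted reply, that un-analyzed values must be normalized
before indexing if queries pass through a downcasing analyzer, can be
sketched with a toy exact-match index. This is plain Java, not Lucene: the
class ToyIndex and its methods are hypothetical and only model the
term-matching behaviour.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy exact-match index in plain Java; NOT Lucene.
class ToyIndex {
    // Maps each stored term to the ids of the documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Stores the value verbatim, as an un-analyzed field would be.
    void addVerbatim(int docId, String value) {
        postings.computeIfAbsent(value, k -> new HashSet<>()).add(docId);
    }

    // Lowercases first, i.e. "normalize before adding to the document".
    void addNormalized(int docId, String value) {
        addVerbatim(docId, value.toLowerCase(Locale.ROOT));
    }

    // Models a query term that went through a downcasing analyzer.
    Set<Integer> searchLowercased(String query) {
        return postings.getOrDefault(query.toLowerCase(Locale.ROOT), new HashSet<>());
    }
}
```

If document 1 is added with addVerbatim(1, "Alice@Example.com") and
document 2 with addNormalized(2, "Bob@Example.com"), then a lowercased
search for "Alice@Example.com" finds nothing, because the stored term kept
its capitals, while a lowercased search for "BOB@example.com" finds
document 2.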