From: "Andre Rubin" <andre.rubin@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Case Sensitivity
Date: Thu, 14 Aug 2008 15:16:27 -0700

Sergey,

Based on a recent discussion I started:
http://www.nabble.com/Searching-Tokenized-x-Un_tokenized-td18882569.html
you cannot use UN_TOKENIZED, because no analyzer is ever run over such a
field. My suggestion: use a tokenized field and a custom-made Analyzer. I
haven't figured out all the details for you, but I think it's possible.

Andre

On Thu, Aug 14, 2008 at 8:17 AM, Erick Erickson wrote:

> Be aware that StandardAnalyzer lowercases all the input,
> both at index and query time. Field.Store.YES will store
> the original text without any transformations, so doc.get()
> will return the original text. However, no matter what the
> Field.Store value, the *indexed* tokens (with
> Field.Index.TOKENIZED, as in your example)
> are passed through the analyzer.
>
> For instance, indexing "MIXed CasE TEXT" in a
> field called "myfield" with Field.Store.YES and
> Field.Index.TOKENIZED would index the
> following tokens (with StandardAnalyzer):
> mixed
> case
> text
>
> and searches (with StandardAnalyzer) would match
> any case in the query terms (e.g. MIXED would hit,
> as would mixed, as would CaSE).
>
> However, doc.get("myfield") would return
> "MIXed CasE TEXT".
>
> As Doron said, though, a few use cases would
> help us provide better answers.
>
> Best
> Erick
>
> On Thu, Aug 14, 2008 at 10:31 AM, Sergey Kabashnyuk wrote:
>
>> Thanks for your reply, Erick.
>>
>>> About the only way to do this that I know of is to
>>> index the data three times: once without any case
>>> changes, once uppercased, and once lowercased.
>>> You'll have to watch your analyzer, probably making
>>> up your own (easily done; see the synonym analyzer
>>> in Lucene in Action).
>>>
>>> Your example doesn't tell us anything, since the critical
>>> information is the *analyzer* you use, both at query and
>>> at index time. The analyzer is responsible for any
>>> transformations, like case folding, tokenizing, etc.
>>
>> In the example I wanted to show that I stored the field with
>> Field.Index.NO_NORMS.
>>
>> As I understand it, this means the field contains the original string,
>> regardless of which analyzer I chose (StandardAnalyzer by default).
>>
>> I build all queries myself, without using parsers.
>> For example: new TermQuery(new Term("field", "MaMa"));
>>
>> I agree with you about the possible implementation,
>> but it increases the size of the index several times over.
>>
>> Are there other possibilities, such as a custom query, perhaps
>> similar to RegexQuery/RegexTermEnum, that would compare terms
>> at its own discretion?
>>
>>> But what is your use case for needing both upper- and
>>> lowercase comparisons? I have a hard time coming
>>> up with a reason to do both that wouldn't be satisfied
>>> by just a caseless search.
>>>
>>> Best
>>> Erick
>>>
>>> On Thu, Aug 14, 2008 at 4:47 AM, Sergey Kabashnyuk wrote:
>>>
>>>> Hello.
>>>>
>>>> I have a similar question.
>>>>
>>>> I need to implement:
>>>> 1. Case-sensitive search.
>>>> 2.
Lowercase search for a specific field.
>>>> 3. Uppercase search for a specific field.
>>>>
>>>> For now I use
>>>>
>>>> new Field("PROPERTIES",
>>>>           content,
>>>>           Field.Store.NO,
>>>>           Field.Index.NO_NORMS,
>>>>           Field.TermVector.NO)
>>>>
>>>> for the original string, which gives me the case-sensitive search.
>>>>
>>>> But does anyone have an idea of how to implement the second and
>>>> third types of search?
>>>>
>>>> Thanks
>>>>
>>>>> Hi All,
>>>>>
>>>>> Once I index a bunch of documents with a StandardAnalyzer (and if
>>>>> the effort I need to put in to reindex the documents is not worth
>>>>> it), is there a way to search the index without case sensitivity?
>>>>> I do not use any sophisticated Analyzer that makes use of
>>>>> LowerCaseTokenizer.
>>>>> Please let me know if there is a solution to circumvent this case
>>>>> sensitivity problem.
>>>>>
>>>>> Many thanks
>>>>> Dino
>>>>
>>>> --
>>>> Sergey Kabashnyuk
>>>> eXo Platform SAS
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
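[Editor's note] Erick's suggestion above (index the same value more than once, once as-is for exact matching and once case-folded for caseless matching) can be sketched without Lucene at all. The class below is a deliberately tiny, stdlib-only stand-in: the `CaseFields` name, the map-based "index", and the `PROPERTIES`/`PROPERTIES_LC` field names are all invented for illustration and are not real Lucene API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Stdlib-only sketch of the "index the value twice" idea from the thread:
// an exact-case posting map supports case-sensitive TermQuery-style lookups,
// while a lowercased shadow map supports caseless lookups. In real Lucene
// these would be two fields on the same Document (e.g. one NO_NORMS field
// with the original term, one field fed through a lowercasing analyzer).
public class CaseFields {
    private final Map<String, Set<String>> exact = new HashMap<>();   // "PROPERTIES"
    private final Map<String, Set<String>> folded = new HashMap<>();  // "PROPERTIES_LC"

    // Add one term of one document to both "fields".
    public void index(String docId, String term) {
        exact.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
        folded.computeIfAbsent(term.toLowerCase(Locale.ROOT),
                               k -> new HashSet<>()).add(docId);
    }

    // Case-sensitive search: the query term is used verbatim.
    public Set<String> searchExact(String term) {
        return exact.getOrDefault(term, Set.of());
    }

    // Caseless search: the query term is folded the same way the
    // shadow field was folded at index time.
    public Set<String> searchCaseless(String term) {
        return folded.getOrDefault(term.toLowerCase(Locale.ROOT), Set.of());
    }
}
```

The key invariant, as in real Lucene, is that the query term must receive the same transformation the field received at index time: `searchCaseless` folds its argument exactly as `index` did, while `searchExact` applies none. The cost is the one Sergey objects to in the thread: each extra variant stored grows the index.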