Date: Wed, 27 Aug 2014 10:55:56 -0400
Subject: Re: Why does this search fail?
From: Milind
To: java-user@lucene.apache.org

Thanks Jack. I'll try this out. I'll have to see if that creates other side
effects :-(. Tokenization is already causing a great deal of confusion. I
want to make it as intuitive as possible.


On Wed, Aug 27, 2014 at 10:45 AM, Jack Krupansky wrote:

> Yes, the white space tokenizer will preserve all punctuation, but... then
> the query for DevNm00* will fail. A "smarter" set of filters is probably
> needed here... start with white space tokenization, keep that overall
> token, then trim external punctuation and keep that token as well, and then
> use word delimiter filter to split out the embedded words, like DevNm00,
> and add them.
>
> The word delimiter filter will do most of that, but not the part of
> trimming out external punctuation. But depending on your use case, it may
> be close enough.
>
> See:
> http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
>
> -- Jack Krupansky
>
> -----Original Message----- From: Michael Sokolov
> Sent: Wednesday, August 27, 2014 10:26 AM
> To: java-user@lucene.apache.org
> Subject: Re: Why does this search fail?
>
>
> Tokenization is tricky.
> You might consider using whitespace tokenizer
> followed by word delimiter filter (instead of standard tokenizer); it
> does a kind of secondary tokenization pass that can preserve the
> original token in addition to its component parts. There are some weird
> side effects to do with term frequencies and phrase-like queries, but it
> would make all these wildcard queries work I think.
>
> -Mike
>
> On 08/27/2014 09:54 AM, Milind wrote:
>
>> I see. This is going to be extremely difficult to explain to end users.
>> It doesn't work as they would expect. Some of the tokenizing rules are
>> already somewhat confusing. Their expectation is that it should work the
>> way their searches work in Google.
>>
>> It's difficult enough to recognize that because the period is surrounded
>> by a digit and a letter (as opposed to 2 digits or 2 letters), it gets
>> tokenized. So I'd have expected that C0001.DevNm00* would effectively
>> become a search for C0001 OR DevNm00*. But now, because of the presence
>> of the wildcard, it's considered as 1 term and the period is not a
>> tokenizer. That's actually good, but now the fact that it's still
>> considered as 2 terms for wildcard searches makes it very unintuitive.
>> I don't suppose that I can do anything about making wildcard search use
>> multiple terms if joined together with a tokenizer. But is there any way
>> that I can force it to go through an analyzer prior to doing the search?
>>
>>
>> On Tue, Aug 26, 2014 at 4:21 PM, Jack Krupansky wrote:
>>
>>> Sorry, but you can only use a wildcard on a single term. "C0001.DevNm001"
>>> gets indexed as two terms, "c0001" and "devnm001", so your wildcard won't
>>> match any term (at least in this case.)
>>>
>>> Also, if your query term includes a wildcard, it will not be fully analyzed.
>>> Some filters such as lower case are defined as "multi-term", so
>>> they will be performed, but the standard tokenizer is not being called,
>>> so the dot remains and this whole term is treated as one term, unlike the
>>> index analysis.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Milind
>>> Sent: Tuesday, August 26, 2014 12:24 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Why does this search fail?
>>>
>>>
>>> I have a field with the value C0001.DevNm001. If I search for
>>>
>>> C0001.DevNm001 --> Get Hit
>>> DevNm00*       --> Get Hit
>>> C0001.DevNm00* --> Get No Hit
>>>
>>> The field gets tokenized on the period since it's surrounded by a letter
>>> and a number. The query gets evaluated as a prefix query. I'd have
>>> thought that this should have found the document. Any clues on why this
>>> doesn't work?
>>>
>>> The full code is below.
>>>
>>> // Index a single document with StandardAnalyzer.
>>> Directory theDirectory = new RAMDirectory();
>>> Version theVersion = Version.LUCENE_47;
>>> Analyzer theAnalyzer = new StandardAnalyzer(theVersion);
>>> IndexWriterConfig theConfig =
>>>     new IndexWriterConfig(theVersion, theAnalyzer);
>>> IndexWriter theWriter = new IndexWriter(theDirectory, theConfig);
>>>
>>> String theFieldName = "Name";
>>> String theFieldValue = "C0001.DevNm001";
>>> Document theDocument = new Document();
>>> theDocument.add(new TextField(theFieldName, theFieldValue,
>>>     Field.Store.YES));
>>> theWriter.addDocument(theDocument);
>>> theWriter.close();
>>>
>>> // Parse the wildcard query with the same analyzer and run the search.
>>> String theQueryStr = theFieldName + ":C0001.DevNm00*";
>>> Query theQuery =
>>>     new QueryParser(theVersion, theFieldName,
>>>         theAnalyzer).parse(theQueryStr);
>>> System.out.println(theQuery.getClass() + ", " + theQuery);
>>> IndexReader theIndexReader = DirectoryReader.open(theDirectory);
>>> IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
>>> TopScoreDocCollector collector = TopScoreDocCollector.create(10,
>>>     true);
>>> theSearcher.search(theQuery, collector);
>>> ScoreDoc[] theHits =
>>>     collector.topDocs().scoreDocs;
>>> System.out.println("Hits found: " + theHits.length);
>>>
>>> Output:
>>>
>>> class org.apache.lucene.search.PrefixQuery, Name:c0001.devnm00*
>>> Hits found: 0
>>>
>>>
>>> --
>>> Regards
>>> Milind
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>

--
Regards
Milind
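
The single-term point in the quoted replies can be illustrated with a toy model. The sketch below does not call Lucene at all; the class and method names (PrefixMatchSketch, splitOnDotTerms, preserveOriginalTerms, prefixMatches) are made up for illustration, and the splitting rules are a deliberate simplification. It only assumes what the thread states: the value was indexed as the two lowercased terms "c0001" and "devnm001", and a prefix query is compared against one term at a time. The real WordDelimiterFilter can split further (for example on case changes or letter/digit boundaries) depending on its flags, which this sketch ignores.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Toy model only: no Lucene classes are used. It roughly imitates how the
// set of indexed terms differs between the two analysis chains discussed.
class PrefixMatchSketch {

    // Stand-in for the index-time result reported in the thread:
    // "C0001.DevNm001" becomes the terms "c0001" and "devnm001".
    static Set<String> splitOnDotTerms(String value) {
        Set<String> terms = new LinkedHashSet<>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[\\s.]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    // Stand-in for whitespace tokenization followed by a word-delimiter step
    // that keeps the original token and also adds its component parts.
    static Set<String> preserveOriginalTerms(String value) {
        Set<String> terms = new LinkedHashSet<>();
        for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            terms.add(token);                     // whole token, dot included
            terms.addAll(splitOnDotTerms(token)); // parts on either side of the dot
        }
        return terms;
    }

    // A prefix query is checked against each indexed term on its own; the
    // prefix is only lowercased, never split, mirroring the parsed query
    // Name:c0001.devnm00* shown in the original post.
    static boolean prefixMatches(Set<String> terms, String prefix) {
        String lowered = prefix.toLowerCase(Locale.ROOT);
        for (String term : terms) {
            if (term.startsWith(lowered)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> split = splitOnDotTerms("C0001.DevNm001");
        Set<String> preserved = preserveOriginalTerms("C0001.DevNm001");

        System.out.println(prefixMatches(split, "C0001.DevNm00"));     // false
        System.out.println(prefixMatches(split, "DevNm00"));           // true
        System.out.println(prefixMatches(preserved, "C0001.DevNm00")); // true
    }
}
```

With the original token kept alongside its parts, both DevNm00* and C0001.DevNm00* find a term to match in this model. The side effects on term frequencies and phrase-like queries mentioned in the thread are not modelled here.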