Date: Wed, 27 Aug 2014 13:15:02 -0400
Subject: Re: Why does this search fail?
From: Milind
To: java-user@lucene.apache.org

Thanks for the Google link. I wasn't aware of it. Most of it is very
intuitive. And most importantly consistent.

On Wed, Aug 27, 2014 at 11:07 AM, Jack Krupansky wrote:

> It's not documented, but Google does seem to support trailing wildcard,
> but only if the prefix has at least six characters. For shorter prefixes,
> it seems to just drop the wildcard.
>
> Google also uses "*" in quoted phrases to mean a placeholder for any
> single term. That's documented.
>
> See:
> https://support.google.com/websearch/answer/136861?hl=en
>
> It also seems to support "**" in a quoted phrase to mean one or more
> arbitrary terms. This isn't documented, but seems to work.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Milind
> Sent: Wednesday, August 27, 2014 10:51 AM
> To: java-user@lucene.apache.org
> Subject: Re: Why does this search fail?
>
> Yes. If you search for alphare on google and alphare*, you get 2 different
> results. Sorry for the contrived example. I just tried searching for
> alpharetta and went backwards deleting characters.
>
> On Wed, Aug 27, 2014 at 10:01 AM, Benson Margulies wrote:
>
>> Does google actually support "*"?
>>
>> On Wed, Aug 27, 2014 at 9:54 AM, Milind wrote:
>>
>> > I see. This is going to be extremely difficult to explain to end users.
>> > It doesn't work as they would expect. Some of the tokenizing rules are
>> > already somewhat confusing. Their expectation is that it should work the
>> > way their searches work in Google.
>> >
>> > It's difficult enough to recognize that because the period is surrounded by
>> > a digit and alphabet (as opposed to 2 digits or 2 alphabets), it gets
>> > tokenized. So I'd have expected that C0001.DevNm00* would effectively
>> > become a search for C0001 OR DevNm00*. But now, because of the presence of
>> > the wildcard, it's considered as 1 term and the period is not a tokenizer.
>> > That's actually good, but now the fact that it's still considered as 2
>> > terms for wildcard searches makes it very unintuitive. I don't suppose
>> > that I can do anything about making wildcard search use multiple terms if
>> > joined together with a tokenizer. But is there any way that I can force it
>> > to go through an analyzer prior to doing the search?
>> >
>> > On Tue, Aug 26, 2014 at 4:21 PM, Jack Krupansky <jack@basetechnology.com>
>> > wrote:
>> >
>> > > Sorry, but you can only use a wildcard on a single term. "C0001.DevNm001"
>> > > gets indexed as two terms, "c0001" and "devnm001", so your wildcard won't
>> > > match any term (at least in this case.)
>> > >
>> > > Also, if your query term includes a wildcard, it will not be fully
>> > > analyzed. Some filters such as lower case are defined as "multi-term", so
>> > > they will be performed, but the standard tokenizer is not being called, so
>> > > the dot remains and this whole term is treated as one term, unlike the
>> > > index analysis.
>> > >
>> > > -- Jack Krupansky
>> > >
>> > > -----Original Message-----
>> > > From: Milind
>> > > Sent: Tuesday, August 26, 2014 12:24 PM
>> > > To: java-user@lucene.apache.org
>> > > Subject: Why does this search fail?
>> > >
>> > > I have a field with the value C0001.DevNm001. If I search for
>> > >
>> > > C0001.DevNm001 --> Get Hit
>> > > DevNm00* --> Get Hit
>> > > C0001.DevNm00* --> Get No Hit
>> > >
>> > > The field gets tokenized on the period since it's surrounded by a letter
>> > > and a number. The query gets evaluated as a prefix query. I'd have
>> > > thought that this should have found the document. Any clues on why this
>> > > doesn't work?
>> > >
>> > > The full code is below.
>> > >
>> > > Directory theDirectory = new RAMDirectory();
>> > > Version theVersion = Version.LUCENE_47;
>> > > Analyzer theAnalyzer = new StandardAnalyzer(theVersion);
>> > > IndexWriterConfig theConfig =
>> > >     new IndexWriterConfig(theVersion, theAnalyzer);
>> > > IndexWriter theWriter = new IndexWriter(theDirectory, theConfig);
>> > >
>> > > String theFieldName = "Name";
>> > > String theFieldValue = "C0001.DevNm001";
>> > > Document theDocument = new Document();
>> > > theDocument.add(new TextField(theFieldName, theFieldValue,
>> > >     Field.Store.YES));
>> > > theWriter.addDocument(theDocument);
>> > > theWriter.close();
>> > >
>> > > String theQueryStr = theFieldName + ":C0001.DevNm00*";
>> > > Query theQuery =
>> > >     new QueryParser(theVersion, theFieldName,
>> > >         theAnalyzer).parse(theQueryStr);
>> > > System.out.println(theQuery.getClass() + ", " + theQuery);
>> > > IndexReader theIndexReader = DirectoryReader.open(theDirectory);
>> > > IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
>> > > TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
>> > > theSearcher.search(theQuery, collector);
>> > > ScoreDoc[] theHits = collector.topDocs().scoreDocs;
>> > > System.out.println("Hits found: " + theHits.length);
>> > >
>> > > Output:
>> > >
>> > > class org.apache.lucene.search.PrefixQuery, Name:c0001.devnm00*
>> > > Hits found: 0
>> > >
>> > > --
>> > > Regards
>> > > Milind
>> > >
>> > > ---------------------------------------------------------------------
>> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> > --
>> > Regards
>> > Milind
>
> --
> Regards
> Milind

--
Regards
Milind
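
The mismatch described in the thread (the indexed value is split into two terms, while the wildcard query is lowercased but kept as one term) can be sketched in plain Java without any Lucene dependency. This is only a toy model under stated assumptions, not Lucene's code: the class PrefixMismatchDemo and the helpers indexTerms, joins, prefixOf and prefixMatches are hypothetical names, and the splitting rule is a simplification of the behaviour reported above (a period splits unless it sits between two letters or two digits).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model of the thread's explanation. Assumption: none of this is Lucene API.
class PrefixMismatchDemo {

    // Simplified stand-in for index-time analysis: lowercase the value and
    // split on a period unless it sits between two letters or two digits.
    static List<String> indexTerms(String value) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String text = value.toLowerCase(Locale.ROOT);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '.' && !joins(text, i)) {
                // The period acts as a separator: close the current term.
                if (current.length() > 0) {
                    terms.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    // True when the period at index i has letters on both sides or digits on both sides.
    static boolean joins(String text, int i) {
        if (i == 0 || i == text.length() - 1) {
            return false;
        }
        char before = text.charAt(i - 1);
        char after = text.charAt(i + 1);
        return (Character.isLetter(before) && Character.isLetter(after))
                || (Character.isDigit(before) && Character.isDigit(after));
    }

    // Simplified stand-in for query-time handling of a trailing wildcard:
    // the text is lowercased but NOT tokenized, so any period stays in the prefix.
    static String prefixOf(String wildcardQuery) {
        String q = wildcardQuery.toLowerCase(Locale.ROOT);
        return q.endsWith("*") ? q.substring(0, q.length() - 1) : q;
    }

    // A prefix query hits only if some single indexed term starts with the prefix.
    static boolean prefixMatches(String fieldValue, String wildcardQuery) {
        String prefix = prefixOf(wildcardQuery);
        for (String term : indexTerms(fieldValue)) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String value = "C0001.DevNm001";
        System.out.println(indexTerms(value));                       // [c0001, devnm001]
        System.out.println(prefixMatches(value, "DevNm00*"));        // true
        System.out.println(prefixMatches(value, "C0001.DevNm00*"));  // false
    }
}
```

In this model the prefix "c0001.devnm00" still contains the period, so it cannot be the start of either indexed term, which mirrors the "Hits found: 0" output reported for the PrefixQuery in the original message.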