From: Robert Muir
Date: Tue, 17 Nov 2009 12:24:42 -0500
Message-ID: <8f0ad1f30911170924y707923b0ra4803e63437eb719@mail.gmail.com>
References: <26389750.post@talk.nabble.com>
Subject: Re: Lucene Not Throwing Matches Without Spaces
To: general@lucene.apache.org

Solr's WordDelimiterFilter has an option, splitOnCaseChange. I think that
might work for your SaddamHussain example.

If you want to use Ted's first approach with Lucene, you could try the
compounds package in Lucene's analysis contrib and give it an English
word list (or create a very refined custom list of your own, as he
suggested).

On Tue, Nov 17, 2009 at 12:14 PM, Ted Dunning wrote:

> That is what is going on.
>
> To fix the problem you generally need to do a bit of statistics on your
> corpus to discover word pairs that appear both with and without a space.
> Once you have that, you have two approaches that will work.
>
> The first approach is to index your text in an ambiguous fashion. Where
> your "mighty duck" text would previously have been indexed, as Simon
> says, as two terms ["mighty"@0, "duck"@1], with the pair lexicon you
> would index the text as ["mighty duck"@0, "mighty"@0, "duck"@1]. At this
> point, either query will work.
>
> Another approach, which is easier if you don't want to mess with the
> indexer and analyzer chain, is to do the same transformation at query
> time. If the user types the query [mightyduck], you would rewrite this
> to be [mightyduck OR phrase(mighty duck)]. Similarly, if the user types
> [mighty duck], you would rewrite the query to be [mightyduck OR
> phrase(mighty duck) OR mighty OR duck].
>
> On Tue, Nov 17, 2009 at 8:09 AM, Simon Willnauer
> <simon.willnauer@googlemail.com> wrote:
>
> > Nishu,
> >
> > first, you should send this question to java-users, not to general :)
> > When you index a doc with the content "mighty duck", your TokenStream
> > most likely builds two tokens, t1:"mighty" t2:"duck".
> > The same happens (most likely) when you search for "mighty duck" with
> > the QueryParser, so the query will be a boolean TermQuery("mighty") OR
> > TermQuery("duck"). This will retrieve your document. If you search for
> > "mightyduck", the query will have only one boolean clause (actually
> > none, it's just a term query) with TermQuery("mightyduck"). Lucene will
> > not find any matches, as this term is not in the index.
> >
> > Hope that helps for understanding what is going on.
> >
> > simon
> >
> > On Tue, Nov 17, 2009 at 2:16 PM, Nishu Soni wrote:
> > >
> > > Lucene is not throwing matches when the search string is without a
> > > space and the data in my index file is with a space. E.g., if the
> > > text "Saddam Hussain" is in the index file and I am searching for
> > > "SaddamHussain", I am not getting any matches. I am using Boolean
> > > Query for scanning.
> > >
> > > Any help will be highly appreciated.
> > > --
> > > View this message in context:
> > > http://old.nabble.com/Lucene-Not-Throwing-Matches-Without-Spaces-tp26389750p26389750.html
> > > Sent from the Lucene - General mailing list archive at Nabble.com.
>
> --
> Ted Dunning, CTO
> DeepDyve

--
Robert Muir
rcmuir@gmail.com
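
For illustration only, the query-time rewrite Ted describes in the quoted
text could be sketched roughly as below. This is not code from anyone in
the thread. It uses no Lucene API and only builds a query string, assuming
the classic query-parser syntax (OR, double-quoted phrases). PAIR_LEXICON
and rewrite_query are hypothetical names, the lexicon contents are made up,
and the sketch assumes the analyzer lowercases terms.

```python
# Hypothetical pair lexicon: word pairs seen in the corpus both with and
# without a space. In practice it would come from corpus statistics.
PAIR_LEXICON = {("mighty", "duck"), ("saddam", "hussain")}

# Map each joined form back to its pair, e.g. "mightyduck" -> ("mighty", "duck").
JOINED = {first + second: (first, second) for first, second in PAIR_LEXICON}


def rewrite_query(user_query: str) -> str:
    """Rewrite a one- or two-word query so both spellings can match.

    Returns a string in (assumed) Lucene classic query-parser syntax.
    Queries that do not involve a known pair are returned unchanged.
    """
    # Assumption: the index was built with a lowercasing analyzer.
    terms = user_query.lower().split()

    # Case 1: the user typed the joined form, e.g. [mightyduck].
    if len(terms) == 1 and terms[0] in JOINED:
        first, second = JOINED[terms[0]]
        return f'{terms[0]} OR "{first} {second}"'

    # Case 2: the user typed the spaced form, e.g. [mighty duck].
    if len(terms) == 2 and (terms[0], terms[1]) in PAIR_LEXICON:
        first, second = terms
        return f'{first}{second} OR "{first} {second}" OR {first} OR {second}'

    # Anything else is left alone.
    return user_query


if __name__ == "__main__":
    print(rewrite_query("mightyduck"))
    print(rewrite_query("mighty duck"))
    print(rewrite_query("SaddamHussain"))
```

The resulting string would then be handed to whatever query parser the
application already uses. Doing the rewrite at query time keeps the index
and analyzer chain untouched, which is the trade-off Ted points out.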