From: Karl Wettin
To: java-user@lucene.apache.org
Subject: Re: Partial / starts with searching
Date: Fri, 13 Feb 2009 10:52:08 +0100

If you attach an NGramTokenFilter to your analyzer at index and query
time, you should be able to query for parts of a word.

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/ngram/NGramTokenFilter.html
http://lucene.apache.org/java/2_4_0/api/index.html?org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html

The classes are available in the contrib/analyzers module. You might
want to boost edges a bit more than inner parts; start by trying
something like 3- to 5-grams.

Be aware that this will produce a rather large index.


karl

On 13 Feb 2009, at 10:43, d-fader wrote:

> Karl,
>
> As a matter of fact I more or less did. I'm not really into NGrams,
> but I read some articles about this technique and I eventually ended
> up at the 'Did you mean: Lucene?'
> article written by Tom White. To make a long story short, this solved
> my problem partially. I now have 2 indexes, and I've written code
> that extracts all terms a user entered, puts them through the
> suggestion engine and tries to be clever about which suggestion to
> use: stop words are ignored; when the entered term already occurs
> more than x times in the index it's probably good (and thus no
> suggestion is needed); and when there are suggestions available, the
> suggestion with the most occurrences in the index is presented. After
> that the original query is built up again, preserving all command
> codes (like ", ( ), AND, OR, etc.).
> As said, this system works pretty well, and when there's a
> suggestion available it's mostly quite accurate, so thanks for
> this.
>
> Still, it doesn't solve my problem fully. But I think I now know why
> Lucene can't search 'truly' partially. To find a document fast, all
> terms are stored with a list of documents which contain the term, and
> when a user searches, Lucene can identify the documents by comparing
> the terms entered to the terms on that list, right? If so, it's
> understandable that a true partial search will never work, but then
> I just don't understand how Google manages to do this :)
>
> Jori.
>
>
> Karl Wettin wrote:
>> Hi again Jori,
>>
>> did you try N-grams as suggested in the reply on -dev?
>>
>>
>> karl
>>
>> On 13 Feb 2009, at 09:05, d-fader wrote:
>>
>>> Hi,
>>>
>>> I've actually posted this message in the dev mailing list earlier,
>>> because I thought my 'issue' is a limitation of the functionality of
>>> Lucene, but they redirected me to this mailing list, so I hope one
>>> of you guys can help me out :)
>>>
>>> Maybe the 'issue' I'm addressing now has been discussed thoroughly
>>> already; in that case I think I need some redirection to the sources
>>> of those discussions :) Anyway, here's the thing.
>>> For all I know it's impossible to search partial words with Lucene
>>> (except the asterisk method with e.g. the StandardAnalyzer ->
>>> ambul* to find ambulance). My problem with that method is that my
>>> index consists of quite a few terms. This means that if a user
>>> searches for 'ambu amster' (ambulance amsterdam), there will be so
>>> many terms to search that the waiting time is just unacceptable.
>>> Now I started thinking about why it's impossible to search only a
>>> 'part' of a term or even only the 'start' of a term, and the only
>>> reason I could think of was that the index terms are stored
>>> tokenized (in that way you (of course) can't find partial terms,
>>> since the index doesn't actually contain the literal terms, but
>>> tokens instead). But Lucene can also store all terms untokenized,
>>> so in that case, in my humble opinion, a partial search would be
>>> possible, since all terms would be stored 'literally'.
>>>
>>> Maybe my thinking is wrong; I only have a black box view of Lucene,
>>> so I don't know much about the indexing algorithm and all, but I
>>> just want to know if this could be done, or else why not :) You
>>> see, the users of my index want to know why they can't search parts
>>> of the words they enter and I still can't give them a really good
>>> answer, except the 'it would result in too many OR operators in the
>>> query' statement :) . I've tried using a Dutch stemmer (most of the
>>> data I'm indexing is Dutch) but that didn't work out very well.
>>> Furthermore, users sometimes search for a certain 'filename' and
>>> mostly they just enter a part of the name and thus don't find
>>> anything.
>>>
>>> I hope someone can enlighten me :) Thanks in advance!
>>>
>>> Jori
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
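
A note on the n-gram suggestion in this thread: the reason it helps is that
every substring of a word (within the chosen length range) becomes a term in
its own right, so a "partial" query turns into ordinary exact term lookups.
The sketch below illustrates that idea in plain Java. It does not use the
Lucene API; the class NGramSketch, its methods, and the toy postings map are
all hypothetical names made up for this illustration, and the 3 to 5 range is
simply the starting point mentioned in the reply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Plain-Java illustration only; hypothetical names, not the Lucene API. */
final class NGramSketch {

    /** All substrings of term with a length between min and max (inner n-grams). */
    static List<String> ngrams(String term, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int n = min; n <= max; n++) {
            for (int start = 0; start + n <= term.length(); start++) {
                grams.add(term.substring(start, start + n));
            }
        }
        return grams;
    }

    /** Only the leading substrings (edge n-grams), enough for "starts with". */
    static List<String> edgeNgrams(String term, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int n = min; n <= Math.min(max, term.length()); n++) {
            grams.add(term.substring(0, n));
        }
        return grams;
    }

    /** Toy inverted index: gram -> sorted set of ids of documents containing it. */
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    private final int min;
    private final int max;

    NGramSketch(int min, int max) {
        this.min = min;
        this.max = max;
    }

    /** Index time: split on whitespace, lowercase, and store every gram. */
    void add(int docId, String text) {
        for (String word : text.toLowerCase().split("\\s+")) {
            for (String gram : ngrams(word, min, max)) {
                postings.computeIfAbsent(gram, g -> new TreeSet<>()).add(docId);
            }
        }
    }

    /**
     * Query time: break the fragment into the same grams and intersect the
     * posting lists. Each lookup is an exact match on a stored gram. Note that
     * grams of a long fragment may match in different words of a document, so
     * this toy version can return false positives.
     */
    Set<Integer> search(String fragment) {
        Set<Integer> result = null;
        for (String gram : ngrams(fragment.toLowerCase(), min, max)) {
            Set<Integer> docs = postings.getOrDefault(gram, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        // Fragments shorter than min produce no grams and therefore no hits.
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        NGramSketch index = new NGramSketch(3, 5);
        index.add(1, "ambulance amsterdam");
        index.add(2, "fire brigade rotterdam");
        System.out.println(index.search("ambu"));
        System.out.println(index.search("dam"));
        System.out.println(edgeNgrams("ambulance", 3, 5));
    }
}
```

The size warning in the reply is visible here too: a 9-letter word such as
"ambulance" is stored as 18 separate grams in the 3 to 5 range instead of one
term. Keeping only edgeNgrams at index time would shrink that to 3 grams, at
the cost of supporting prefix matches only.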