Message-ID: <499540BB.2030606@gmail.com>
Date: Fri, 13 Feb 2009 10:43:23 +0100
From: d-fader
To: java-user@lucene.apache.org
Subject: Re: Partial / starts with searching
References: <499529D4.2040101@gmail.com> <6B73A2E5-0464-40C5-AAD6-0CB781230D0E@gmail.com>
In-Reply-To: <6B73A2E5-0464-40C5-AAD6-0CB781230D0E@gmail.com>

Karl,

As a matter of fact, I more or less did. I'm not really into n-grams,
but I read some articles about the technique and eventually ended up at
the 'Did you mean: Lucene?' article written by Tom White. To make a long
story short, this partially solved my problem. I now have two indexes,
and I've written code that extracts all terms a user entered, puts them
through the suggestion engine and tries to be clever about which
suggestion to use. Stop words are ignored; when the entered term already
occurs more than x times in the index, it's probably good (so no
suggestion is needed); and when suggestions are available, the one with
the most occurrences in the index is presented. After that, the original
query is built up again, preserving all command codes (like ", ( ), AND,
OR, etc.).

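A minimal sketch of the selection rules described above, not the actual
code. Everything named here is an assumption made for illustration: the
class SuggestionPicker, the stop-word list, the docFreq map (term to
number of occurrences in the index), the candidates list (whatever the
suggestion engine returned) and the threshold minFreq, which stands in
for "x".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; none of these names come from Lucene.
final class SuggestionPicker {
    // Assumed stop-word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("de", "het", "een", "the", "a"));

    /**
     * Returns the term to use in the rebuilt query: the original term
     * when it is a stop word or already frequent enough, otherwise the
     * candidate that occurs most often in the index.
     */
    static String pick(String term, Map<String, Integer> docFreq,
                       List<String> candidates, int minFreq) {
        String t = term.toLowerCase(Locale.ROOT);
        if (STOP_WORDS.contains(t)) {
            return term;                      // stop words are ignored
        }
        if (docFreq.getOrDefault(t, 0) > minFreq) {
            return term;                      // frequent enough: probably fine
        }
        String best = term;                   // no candidates: keep the term
        int bestFreq = -1;
        for (String c : candidates) {
            int f = docFreq.getOrDefault(c, 0);
            if (f > bestFreq) {               // first candidate wins ties
                best = c;
                bestFreq = f;
            }
        }
        return best;
    }
}
```

Operators and quotes would be kept outside this step, so that only plain
terms are passed to pick() and the query can be reassembled afterwards.
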
As said, this system works pretty well, and when a suggestion is
available it's mostly quite accurate, so thanks for this. Still, it
doesn't fully solve my problem. But I think I now know why Lucene can't
search 'truly' partially. To find a document fast, every term is stored
with a list of the documents that contain it, and when a user searches,
Lucene identifies the documents by comparing the terms entered to the
terms on that list, right? If so, it's understandable that a true
partial search will never work, but then I just don't understand how
Google manages to do this :)

Jori.

Karl Wettin wrote:
> Hi again Jori,
>
> did you try N-grams as suggested in the reply on -dev?
>
>
>          karl
>
> On 13 Feb 2009, at 09:05, d-fader wrote:
>
>> Hi,
>>
>> I actually posted this message to the dev mailing list earlier,
>> because I thought my 'issue' was a limitation of the functionality of
>> Lucene, but they redirected me to this mailing list, so I hope one of
>> you guys can help me out :)
>>
>> Maybe the 'issue' I'm raising has been discussed thoroughly already;
>> in that case I think I need some redirection to the sources of those
>> discussions :) Anyway, here's the thing.
>> As far as I know, it's impossible to search for partial words with
>> Lucene (except with the asterisk method, e.g. with the
>> StandardAnalyzer -> ambul* to find ambulance). My problem with that
>> method is that my index consists of quite a few terms. This means that
>> if a user searches for 'ambu amster' (ambulance amsterdam), there will
>> be so many terms to search that the waiting time is just unacceptable.
>> Now I started thinking about why it's impossible to search for only a
>> 'part' of a term, or even only the 'start' of a term, and the only
>> reason I could think of was that the index terms are stored tokenized
>> (in that case you of course can't find partial terms, since the index
>> doesn't actually contain the literal terms, but tokens instead).
>> But Lucene can also store all terms untokenized, so in that case, in
>> my humble opinion, a partial search would be possible, since all terms
>> would be stored 'literally'.
>>
>> Maybe my thinking is wrong; I only have a black-box view of Lucene, so
>> I don't know much about the indexing algorithm and all, but I just
>> want to know if this could be done, or else why not :) You see, the
>> users of my index want to know why they can't search for parts of the
>> words they enter, and I still can't give them a really good answer,
>> except the 'it would result in too many OR operators in the query'
>> statement :) . I've tried using a Dutch stemmer (most of the data I'm
>> indexing is Dutch), but that didn't work out very well. Furthermore,
>> users sometimes search for a certain 'filename', and mostly they just
>> enter a part of the name and thus don't find anything.
>>
>> I hope someone can enlighten me :) Thanks in advance!
>>
>> Jori
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
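
The term-to-documents lookup discussed in this thread can be sketched
roughly as follows. This is a toy model under stated assumptions, not
Lucene's implementation: the class TinyInvertedIndex and its methods are
hypothetical, tokenization is a plain split on non-word characters, and
the sorted term dictionary is a java.util.TreeMap. It is only meant to
show why an exact term is one lookup, a 'starts with' search has to walk
every term sharing the prefix (each one becoming another OR'ed term),
and a 'contains' search has to look at every term in the dictionary.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index; hypothetical names, not Lucene's API.
final class TinyInvertedIndex {
    // Sorted term dictionary: term -> ids of the documents containing it.
    private final TreeMap<String, Set<Integer>> postings = new TreeMap<>();

    void add(int docId, String text) {
        for (String tok : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!tok.isEmpty()) {
                postings.computeIfAbsent(tok, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Exact term: a single dictionary lookup.
    Set<Integer> exact(String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptySet());
    }

    // "Starts with": terms sharing a prefix are adjacent in sorted order,
    // so start at the first term >= prefix and stop at the first mismatch.
    Set<Integer> prefix(String p) {
        Set<Integer> hits = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> e : postings.tailMap(p, true).entrySet()) {
            if (!e.getKey().startsWith(p)) {
                break;
            }
            hits.addAll(e.getValue());
        }
        return hits;
    }

    // "Contains": sorting does not help, every term must be inspected.
    Set<Integer> infix(String part) {
        Set<Integer> hits = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> e : postings.entrySet()) {
            if (e.getKey().contains(part)) {
                hits.addAll(e.getValue());
            }
        }
        return hits;
    }
}
```

With documents such as "Ambulance Amsterdam" and "Rotterdam harbour",
exact("ambu") finds nothing because "ambu" is not a stored term, while
prefix("ambu") finds the first document and infix("dam") finds both. The
n-gram approach suggested earlier in the thread trades index size for
speed by storing the fragments themselves as terms, so that a partial
match becomes an exact lookup again.
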