Date: Fri, 21 May 2010 09:18:54 -0400
Subject: Re: Stemming and Wildcard Queries
From: Erick Erickson
To: java-user@lucene.apache.org

Another approach to stemming at index time but still providing exact
matches when requested is to index the stemmed version AND the original
version at the same position (think synonyms). But here's the trick, index
the original token with a special character. For instance, indexing
"running" would look like indexing "run" and "running$". Now, whenever you
want the exact match, just add the "$" to the end of the token.

With this approach, you have to watch that your analyzers don't strip
the '$'...

Of course, each approach has its trade-offs, and the characteristics of your
particular problem may determine which is preferable...

FWIW
Erick

On Thu, May 20, 2010 at 4:48 PM, Herbert Roitblat wrote:

> At a general level, we have found that stemming during indexing is not
> advisable. Sometimes users want the exact form and if you have removed the
> exact form during indexing, obviously, you cannot provide that. Rather, we
> have found that stemming during search is more useful, or maybe it should be
> called anti-stemming. For any given input for which the user wants to stem,
> we could derive the variations during the query processing.
> E.g., plan can be expanded to include plans, planning, planned, etc.
>
> In our application we provide a feature that is sometimes called a word
> wheel. When someone enters plan in this tool, we show all of the words in
> the index that start with plan. Here are some of the related words:
> plan
> plane
> planes
> planet
> planificaci
> planned
> plannedoutages.xls
> planner
> planners
>
> Just a thought.
> Herb
>
> ----- Original Message -----
> From: "Ivan Provalov"
> Sent: Thursday, May 20, 2010 1:16 PM
> Subject: Stemming and Wildcard Queries
>
>> Is there a good way to combine the wildcard queries and stemming?
>>
>> As is, the field which is stemmed at index time, won't work with some
>> wildcard queries.
>>
>> We were thinking to create two separate index fields - one stemmed, one
>> non-stemmed, but we are having issues with our SpanNear queries (they
>> require the same field).
>>
>> We thought to try combining the stemmed and non-stemmed terms in the same
>> field, but we are concerned about the stats being skewed as a result of this
>> (especially for the TermVector stats). Can overloading the non-stemmed
>> field with stemmed terms cause any issues with the TermVector?
>>
>> Any suggestions?
>>
>> Ivan Provalov
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
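
For anyone who wants to see the "run" / "running$" idea from the reply at the
top of this message in miniature, here is a rough sketch in plain Python. It
is a toy in-memory positional index, not Lucene code: `toy_stem`,
`index_document`, `search` and `EXACT_MARKER` are names made up for this
illustration, and the stemmer is a crude stand-in for a real one. It only
shows the stemmed term and the marked original term landing at the same
position in a single field.

```python
from collections import defaultdict

# Assumption: "$" is a character the analysis chain never strips or splits on.
EXACT_MARKER = "$"


def toy_stem(token):
    """Hypothetical stand-in for a real stemmer; strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    # Collapse a doubled final consonant left after stripping ("runn" -> "run").
    if len(token) >= 2 and token[-1] == token[-2] and token[-1] not in "aeiou":
        token = token[:-1]
    return token


def index_document(index, doc_id, text):
    """Add the stem and the marked original form at the SAME position."""
    for position, token in enumerate(text.lower().split()):
        index[toy_stem(token)].append((doc_id, position))
        index[token + EXACT_MARKER].append((doc_id, position))


def search(index, word, exact=False):
    """Return sorted ids of documents containing the word (stemmed or exact)."""
    word = word.lower()
    term = word + EXACT_MARKER if exact else toy_stem(word)
    return sorted({doc_id for doc_id, _ in index.get(term, [])})


index = defaultdict(list)
index_document(index, 1, "she was running fast")
index_document(index, 2, "they run daily")

stemmed_hits = search(index, "running")            # looks up the stem "run"
exact_hits = search(index, "running", exact=True)  # looks up "running$"
```

Because both forms share one field and one position, position-based queries
can still line up against either form. The cost is that every position now
carries two terms, which is the kind of statistics skew the original question
was worried about.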