From: Dmitry Kan
To: solr-user@lucene.apache.org
Date: Thu, 8 Nov 2012 21:07:16 +0200
Subject: Re: searching camel cased terms with phrase queries

Thanks, Jack.

This filter should help with user input that lacks clear lexical
boundaries, i.e. by breaking would-be compound words into sub-words on the
query side. It still requires mining a dictionary, but that is doable with
some "simple" camel case term frequency analysis.

But would it really help to match the indexed data? I tried with Solr
4.0.0-BETA (hopefully not too different from the stable 4.0 release in this
respect).

Text field in the schema: a slightly modified "text_general" type, with WDF
(WordDelimiterFilterFactory) and DCWTF
(DictionaryCompoundWordTokenFilterFactory) added and LCF
(LowerCaseFilterFactory) placed in between them. english-common-nouns.txt is
from
http://www.typo3-media.com/fileadmin/files/wordlists/english-common-nouns.txt
with the word 'rice' removed so that the example below makes more sense.

index: product for PricewaterhouseCoopers company is this!
query: "product for Pricewaterhousecoopers company is this!"

I believe there is no match here, judging by the terms and their positions
on the analysis page. Some misconfiguration?

I included DCWTF on the query side as well, as opposed to e.g. the approach
described at http://www.typo3-media.com/blog/solr-noun-expansion.html, in
order to account for compound words that users type without lexical
boundaries.
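A field type along the lines described above might look roughly like the sketch below. It is an illustration only, not the exact schema used in the test: the type name text_compound, the WDF attribute values, the DCWTF size limits, and the omission of the stop word and synonym filters that the stock text_general carries are all assumptions. Only the filter order (WDF, then LCF, then DCWTF) and the dictionary file name come from the description.

```xml
<!-- Sketch only: type name, WDF settings and DCWTF size limits are assumed -->
<fieldType name="text_compound" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer element is used at both index and query time,
       so DCWTF runs on the query side as well -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- WDF: split on case change and keep the original token -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            catenateWords="1"
            preserveOriginal="1"
            splitOnCaseChange="1"/>
    <!-- LCF sits between WDF and DCWTF, as described -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- DCWTF: adds dictionary sub-words found inside longer tokens -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="english-common-nouns.txt"
            minWordSize="5"
            minSubwordSize="2"
            maxSubwordSize="15"
            onlyLongestMatch="true"/>
  </analyzer>
</fieldType>
```

With a chain like this, the analysis page shows for each filter stage which sub-words are emitted and at which positions, which is what decides whether the phrase query can line up with the indexed terms.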
-- Dmitry

On Thu, Nov 8, 2012 at 5:04 PM, Jack Krupansky wrote:

> I forgot to mention DictionaryCompoundWordTokenFilterFactory. It does
> require you to create a dictionary of terms, as opposed to using the terms
> that have been encountered in the index.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Jack Krupansky
> Sent: Wednesday, November 07, 2012 8:14 AM
> To: solr-user@lucene.apache.org
> Subject: Re: searching camel cased terms with phrase queries
>
> This is one of those areas of Solr where you can refine and make
> improvements, as you have done, but never actually reach 100% satisfaction.
> And, in some cases, as here, you have a choice of settings and no single
> combination covers all cases.
>
> In this case, you really need compound-term recognition: detecting that
> two or more terms have been juxtaposed with no lexical boundary. Google
> has it, and I'm sure some Solr users have implemented it on their own,
> but it isn't in Solr proper, yet.
>
> WDF provides a partial approximation, by generating extra, compound terms
> at index time. That works well when ALL of the terms are written together,
> but not when only a subset are written together without lexical
> boundaries, as in your final example.
>
> So, you COULD go the full Google route with a lot of additional effort, or
> accept that you offer only a reasonable approximation. Your choice.
>
> So, pick the approximation which seems "best" and accept that it doesn't
> handle the other cases.
>
> BTW, the proper name is "PricewaterhouseCoopers".
>
> -- Jack Krupansky
>
> -----Original Message----- From: Dmitry Kan
> Sent: Wednesday, November 07, 2012 1:58 AM
> To: solr-user@lucene.apache.org
> Subject: searching camel cased terms with phrase queries
>
> Hello list,
>
> There have been a number of threads about handling camel cased words in
> the past
> (http://search-lucene.com/?q=camel+case&fc_project=Lucene&fc_project=Solr).
> Our case is somewhat different from them.
>
> ===================
> Configuration & example
> ===================
>
> To illustrate the issue, let me give you a real example from our data.
> Suppose there is a term in the original text: SmartTV.
>
> If a user types "SmartTV" or "smart tv", we want both to hit the
> original term SmartTV. To achieve this, the following filter is
> used in our Solr 3.4 schema:
>
> index side:
>
> <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1"
> generateNumberParts="0"
> catenateWords="0"
> catenateNumbers="0"
> catenateAll="0"
> preserveOriginal="1"
> spiltOnCaseChange="1"
> />
>
> query side:
>
> <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1"
> generateNumberParts="0"
> catenateWords="0"
> catenateNumbers="0"
> catenateAll="0"
> preserveOriginal="1"
> spiltOnCaseChange="1"
> />
>
> (no differences)
>
> Copying from the analysis page, the index will contain the following terms
> and their positions:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset
> 1         SmartTV    0            7
> 1         Smart      0            5
> 2         TV         5            7
>
> (StandardTokenizerFactory and StandardFilterFactory precede this filter,
> but since they have no effect in this case, their output is omitted.)
>
> On the query side, the query "smart tv" is processed like this:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset
> 1         smart      0            5
> 2         tv         6            8
>
> So there is a match (LowerCaseFilterFactory is of course configured to
> follow WordDelimiterFilterFactory to unify case for matching), and the
> user can happily fire the queries 'smart tv', 'smarttv' and 'SmartTV'.
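Putting the pieces of this section together, the complete analyzer chain might look like the sketch below. The field type name text_camel and the positionIncrementGap value are hypothetical. The filter order follows the description above: StandardTokenizerFactory, StandardFilterFactory, WordDelimiterFilterFactory, then LowerCaseFilterFactory. The sketch uses the attribute spelling splitOnCaseChange, which is the name WordDelimiterFilterFactory reads. As far as I know, splitting on case change is also the factory's default, which would explain why the quoted configuration still splits SmartTV.

```xml
<!-- Sketch only: "text_camel" and positionIncrementGap are assumed -->
<fieldType name="text_camel" class="solr.TextField" positionIncrementGap="100">
  <!-- Index and query sides are identical here, so one analyzer element
       covers both -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <!-- Emits SmartTV (original) and Smart at position 1, TV at position 2 -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="0"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            preserveOriginal="1"
            splitOnCaseChange="1"/>
    <!-- Lower-casing after WDF, so that case changes are still visible
         to WDF when it splits -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The order matters: if LowerCaseFilterFactory ran before WordDelimiterFilterFactory, there would be no case change left to split on.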
>
> ===================================================
> More complex example that doesn't work with the above configuration
> ===================================================
>
> Problems start to occur if a user types "smarttv for me" against the text
> "SmartTV for me". Here are the index and query analysis excerpts:
>
> index:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset
> 1         SmartTV    0            7
> 1         Smart      0            5
> 2         TV         5            7
> 3         for        8            11
> 4         me         12           14
>
> query:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset
> 1         smarttv    0            7
> 2         for        8            11
> 3         me         12           14
>
> Since smarttv was written in lower case in the user query, no split on
> case change is triggered, and we believe there is no match because the
> term positions differ: 'for' is at the 3rd position in the index but at
> the 2nd position in the query, so 'smarttv' and 'for' are not adjacent in
> the index, as the phrase query requires.
>
> =========================
> Config change to fix the problem
> =========================
>
> But here catenateWords=1 on the indexing side comes to the rescue.
> It changes things to:
>
> index:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=1,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset
> 1         SmartTV    0            7
> 1         Smart      0            5
> 2         TV         5            7
> 2         SmartTV    0            7
> 3         for        8            11
> 4         me         12           14
>
> query (copied again for comparison):
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset
> 1         smarttv    0            7
> 2         for        8            11
> 3         me         12           14
>
> Now there should be a match, because the terms 'smarttv', 'for' and 'me'
> are adjacent in the index (ignoring the case differences, which
> LowerCaseFilterFactory unifies for us), and that is what the phrase query
> "smarttv for me" requires.
>
> ====================
> Problem we couldn't solve
> ====================
>
> As we saw above, catenateWords merges the maximal run of compound term
> parts into one term and aligns the resulting concatenated term with the
> last term part.
> Illustration with an artificial camel casing:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=1,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text               startOffset  endOffset
> 1         PriceWaterHouseCoopers  0            22
> 1         Price                   0            5
> 2         Water                   5            10
> 3         House                   10           15
> 4         Coopers                 15           22
> 4         PriceWaterHouseCoopers  0            22
>
> The following text and query will not match each other:
> text='product for PriceWaterHouseCoopers company',
> query="product for PricewaterHouseCoopers company":
>
> index:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=1,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text               startOffset  endOffset
> 1         product                 0            7
> 2         for                     8            11
> 3         PriceWaterHouseCoopers  12           34
> 3         Price                   12           17
> 4         Water                   17           22
> 5         House                   22           27
> 6         Coopers                 27           34
> 6         PriceWaterHouseCoopers  12           34
> 7         company                 35           42
>
> query:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text               startOffset  endOffset
> 1         product                 1            8
> 2         for                     9            12
> 3         PricewaterHouseCoopers  13           35
> 3         Pricewater              13           23
> 4         House                   23           28
> 5         Coopers                 28           35
> 6         company                 36           43
>
> Is there any way to make them match?
>
> Thanks for reading this far.
>
> -dmitry

-- 
Regards,

Dmitry Kan