Message-ID: <925632.29452.qm@web82104.mail.mud.yahoo.com>
Date: Fri, 3 Sep 2010 22:25:29 -0700 (PDT)
From: Dennis Gearon
Subject: Re: shingles work in analyzer but not real data
To: solr-user@lucene.apache.org

Thank you mucho much, Lance.


Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life,
 otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/3/10, Lance Norskog wrote:

> From: Lance Norskog
> Subject: Re: shingles work in analyzer but not real data
> To: solr-user@lucene.apache.org
> Date: Friday, September 3, 2010, 9:55 PM
> http://en.wikipedia.org/wiki/W-shingling
>
> On Fri, Sep 3, 2010 at 6:19 AM, Steven A Rowe wrote:
> > Hi Dennis,
> >
> > I took a stab at answering this question in the following java-user
> > mailing list post:
> >
> > http://www.lucidimagination.com/search/document/6cb7b54cce6872b3/lucene_indexes
> >
> > Steve
> >
> >> -----Original Message-----
> >> From: Dennis Gearon [mailto:gearond@sbcglobal.net]
> >> Sent: Friday, September 03, 2010 5:06 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: shingles work in analyzer but not real data
> >>
> >> Anyone got a definitive, authoritative link to the definition of a
> >> 'shingle' in search engine results/technology?
> >>
> >>
> >> Dennis Gearon
> >>
> >> Signature Warning
> >> ----------------
> >> EARTH has a Right To Life,
> >>   otherwise we all die.
> >>
> >> Read 'Hot, Flat, and Crowded'
> >> Laugh at http://www.yert.com/film.php
> >>
> >>
> >> --- On Fri, 9/3/10, Jeff Rose wrote:
> >>
> >> > From: Jeff Rose
> >> > Subject: Re: shingles work in analyzer but not real data
> >> > To: solr-user@lucene.apache.org
> >> > Date: Friday, September 3, 2010, 1:48 AM
> >> > Thanks Steven and Jonathan, we got it working by using a combination
> >> > of quoting and the PositionFilterFactory, like is shown below. The
> >> > documentation for the position filter doesn't make much sense without
> >> > understanding more about how positioning of tokens is taken into
> >> > account, but it appears to do the trick. Does anyone know why
> >> > position would matter here? It seems like tokens would be emitted by
> >> > a tokenizer, filtered, joined into pairwise tokens by the shingler,
> >> > and then matched against the index. If position information is also
> >> > important it seems odd that this is not discussed in the
> >> > documentation.. (Same for the pre-tokenizing done by the query
> >> > parser, before handing phrases to the tokenizer...)
> >> >
> >> > Anyway, here is our final schema that works as long as we put search
> >> > phrases in double quotes. Thanks for all the help!
> >> >
> >> > -Jeff
> >> >
> >> >   class="solr.TextField" positionIncrementGap="100">
> >> >
> >> >         class="solr.PatternTokenizerFactory" pattern=";"/>
> >> >         class="solr.LowerCaseFilterFactory"/>
> >> >         class="solr.TrimFilterFactory" />
> >> >         class="solr.LowerCaseFilterFactory"/>
> >> >         class="solr.ShingleFilterFactory" outputUnigrams="true"
> >> >             outputUnigramIfNoNgram="true" maxShingleSize="2"/>
> >> > -->
> >> >
> >> >
> >> >         class="solr.PatternTokenizerFactory" pattern="[.,?;: !]"/>
> >> >         class="solr.LowerCaseFilterFactory"/>
> >> >         class="solr.TrimFilterFactory" />
> >> >         class="solr.ShingleFilterFactory"/>
> >> >         class="solr.PositionFilterFactory"/>
> >> >
> >> >
> >> >
> >> >
> >> > On Thu, Sep 2, 2010 at 11:47 PM, Jonathan Rochkind wrote:
> >> >
> >> > > I've run into this before too.
> >> > > Both the dismax and solr-lucene _query parsers_ will tokenize a
> >> > > query on whitespace _before_ they pass the query to any field
> >> > > analyzers. There are some reasons for this, lots of things wouldn't
> >> > > work if they didn't do this.
> >> > >
> >> > > But it makes your approach kind of hard. Try doing your search as a
> >> > > phrase search with double quotes, "apple pie", I bet it'll work
> >> > > then -- because both dismax and solr-lucene will respect the phrase
> >> > > quotes and NOT tokenize the stuff inside there before it gets to
> >> > > the field analyzers.
> >> > >
> >> > > So if non-tokenized fields like this are all that are included in
> >> > > your search, and if you can get your client application to just
> >> > > force phrase quoting of everything before sending to Solr, that
> >> > > might work. Otherwise....
> >> > > I don't know of a good solution.
> >> > > If you figure one out, let me know.
> >> > >
> >> > > Jonathan
> >> > >
> >> > >
> >> > > Jeff Rose wrote:
> >> > >
> >> > >> Hi,
> >> > >>  We are using SOLR to match query strings with a keyword database,
> >> > >> where some of the keywords are actually more than one word. For
> >> > >> example a keyword might be "apple pie" and we only want it to
> >> > >> match for a query containing that word pair, but not one only
> >> > >> containing "apple". Here is the relevant piece of the schema.xml,
> >> > >> defining the index and query pipelines:
> >> > >>
> >> > >>   class="solr.TextField" positionIncrementGap="100">
> >> > >>     type="index">
> >> > >>       class="solr.PatternTokenizerFactory" pattern=";"/>
> >> > >>       class="solr.LowerCaseFilterFactory"/>
> >> > >>       class="solr.TrimFilterFactory" />
> >> > >>
> >> > >>     type="query">
> >> > >>       class="solr.WhitespaceTokenizerFactory"/>
> >> > >>       class="solr.LowerCaseFilterFactory"/>
> >> > >>       class="solr.TrimFilterFactory" />
> >> > >>       class="solr.ShingleFilterFactory" />
> >> > >>
> >> > >>
> >> > >>
> >> > >> In the analysis tool this schema looks like it works correctly.
> >> > >> Our multi-word keywords are indexed as a single entry, and then
> >> > >> when a search phrase contains one of these multi-word keywords it
> >> > >> is shingled and matched.
> >> > >>  Unfortunately, when we do the same queries on top of the actual
> >> > >> index it responds with zero matches. I can see in the index
> >> > >> histogram that the terms
> >> > >> are correctly indexed from our mysql datasource containing the
> >> > >> keywords, but somehow the shingling doesn't appear to work on this
> >> > >> live data. Does anyone have experience with shingling that might
> >> > >> have some tips for us, or otherwise advice for debugging the
> >> > >> issue?
> >> > >>
> >> > >> Thanks,
> >> > >> Jeff
> >> > >>
> >> > >>
> >> > >>
> >> > >
> >> >
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
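
For anyone reading this thread in the archive: the behaviour Jonathan describes (the query parser splitting on whitespace before the field analyzer runs, unless the text is inside double quotes) can be illustrated with a small toy model. This is a sketch in Python, not Solr or Lucene code. The function names (shingles, analyze, parse_query, query_terms) are made up for illustration, and the shingle settings used here (pairs of adjacent words, unigrams also emitted) are an assumption about the filter's defaults rather than something stated in the thread.

```python
import re


def shingles(tokens, max_size=2, output_unigrams=True):
    """Return word n-grams (shingles) built from adjacent tokens."""
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for n in range(2, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


def analyze(text):
    """Toy query-time analyzer: whitespace tokenize, lowercase, shingle."""
    return shingles(text.lower().split())


def parse_query(query):
    """Toy query parser: split on whitespace first, but keep a
    double-quoted phrase together as a single chunk."""
    chunks = re.findall(r'"([^"]*)"|(\S+)', query)
    return [phrase or word for phrase, word in chunks]


def query_terms(query):
    """Each chunk from the parser is handed to the analyzer on its own."""
    terms = []
    for chunk in parse_query(query):
        terms.extend(analyze(chunk))
    return terms


if __name__ == "__main__":
    indexed = {"apple pie"}  # multi-word keyword stored as one term
    for q in ['apple pie', '"apple pie"']:
        terms = query_terms(q)
        print(q, "->", terms, "match:", bool(indexed & set(terms)))
```

In this model the unquoted query reaches the analyzer one word at a time, so the shingler never sees two adjacent tokens and no "apple pie" term is produced. The quoted query arrives as one chunk and the pair is formed. The analysis tool presumably feeds the whole string straight to the analyzer, which would explain why it looked fine there.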