To: java-user@lucene.apache.org
From: Mck
Subject: Replacing FAST functionality at sesam.no - ShingleFilter+ exact matching
Date: Tue, 09 Sep 2008 09:31:56 +0200

-- original post was on solr's user list.
-- I've reposted here as it's centered on the ShingleFilter, which comes from lucene

*ShortVersion*

Is there a way to make the ShingleFilter perform exact matching by inserting ^ $ begin/end markers?

*LongVersion*

At sesam.no we want to replace a FAST (fast.no) Query Matching Server with a Solr index.

The index we are trying to replace is not a regular index, but is specially configured to perform phrase (and sub-phrase) matches against several large lists (like an index with only a 'title' field).

I'm not sure of a correct, or logical, name for the behaviour we are after, but it is like a combination of shingles and exact matching.

Our test list has 9 entries:
"abcd efgh ijkl", "abcd efgh", "efgh ijkl", "abcd", "efgh", "ijkl", "ijkl efgh", "efgh abcd", and "ijkl efgh abcd".

The query behaviour we are looking for is like this (I've included ^ $ to denote the exact matching):

Original Query --> Filtered Query
abcd --> ^abcd$
"abcd efgh" --> (^abcd$ ^"abcd efgh"$ ^efgh$)
"abcd efgh ijkl" --> (^abcd$ ^"abcd efgh"$ ^"abcd efgh ijkl"$ ^efgh$ ^"efgh ijkl"$ ^ijkl$)

I'm using a trunk build of Solr, with example/solr as the solr home. I'm using trunk builds of the lucene libraries as well.
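To make the Original Query --> Filtered Query listing above concrete, here is a minimal sketch of that expansion in plain Python. It is not Lucene or Solr code and says nothing about how ShingleFilter works internally. The names `expand_query`, `exact_matches` and `TEST_LIST` are made up for this illustration, the maximum shingle size of 3 is an assumption taken from the longest example query, and the matching rule (an entry matches only when it equals a whole sub-phrase) is one reading of what the ^ $ markers are meant to express.

```python
# Illustrative only: hypothetical helper names, no Lucene/Solr involved.

# The 9-entry test list from the message.
TEST_LIST = [
    "abcd efgh ijkl", "abcd efgh", "efgh ijkl",
    "abcd", "efgh", "ijkl",
    "ijkl efgh", "efgh abcd", "ijkl efgh abcd",
]


def expand_query(query, max_shingle_size=3):
    """Return every contiguous sub-phrase of `query`, wrapped in ^ and $."""
    tokens = query.split()
    anchored = []
    for start in range(len(tokens)):
        # Grow the phrase one token at a time, up to max_shingle_size tokens.
        longest = min(start + max_shingle_size, len(tokens))
        for end in range(start + 1, longest + 1):
            words = tokens[start:end]
            phrase = " ".join(words)
            # Multi-word phrases are quoted, as in the listing above.
            if len(words) > 1:
                phrase = '"' + phrase + '"'
            anchored.append("^" + phrase + "$")
    return anchored


def exact_matches(query, entries, max_shingle_size=3):
    """Return the entries that are equal to some sub-phrase of `query`."""
    # Strip the ^ and $ anchors, then any surrounding quotes.
    wanted = {term[1:-1].strip('"')
              for term in expand_query(query, max_shingle_size)}
    return [entry for entry in entries if entry in wanted]
```

Under these assumptions, `expand_query("abcd efgh")` yields `^abcd$`, `^"abcd efgh"$` and `^efgh$`, and `exact_matches("abcd efgh ijkl", TEST_LIST)` returns the first six entries of the list while leaving out "ijkl efgh", "efgh abcd" and "ijkl efgh abcd".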
Editing schema.xml to put these entries in as type="string" and using defaultOperator="OR" gives the expected exact matching functionality, provided queries are quoted, e.g. /solr/select/?q="abcd efgh ijkl"
(I've noticed that this exact matching can also be achieved with TextField and a KeywordTokenizer at index time.)

So then I change type="string" to type="shingleString" along with

[fieldType definition missing from the archived message]

I never get any hits with quoted queries. Without quotes I only get the unigrams.
I get the same outcomes using fieldType@class="solr.TextField" and, in the index analyzer, tokenizer@class="solr.KeywordTokenizerFactory".

Debugging ShingleFilter, I see that (with the quotes) the shingles array fills up with the expected shingles. And the Query (in fact a MultiPhraseQuery) returned from SolrQueryParser.getFieldQuery() looks like

list_entry_shingle:"(abcd abcd efgh abcd efgh ijkl) (efgh efgh ijkl) ijkl"

I'm struggling to make sense of this. How can the shingles be matched if they aren't quoted? I would be expecting a Query instead like:

abcd "abcd efgh" "abcd efgh ijkl" efgh "efgh ijkl" ijkl

(This, with the ShingleFilter disabled, does indeed work perfectly.)

Am I barking up the wrong tree?
Is there a way to get the shingles phrased?
Or, better yet, is there a way to get the shingles surrounded with ^ $ begin/end markers for exact matching?

~mck