From: Ian Lea
Date: Tue, 11 Oct 2011 20:51:27 +0100
Subject: Re: Shingles Filter problems
To: java-user@lucene.apache.org

Something does appear dodgy here. Using 3.4.0, the following very simple code, with no custom classes,

  ShingleAnalyzerWrapper saw = new ShingleAnalyzerWrapper(LUCENE_34);
  QueryParser qp = new QueryParser(LUCENE_34, "t", saw);
  String s = "simple sentences rule";
  Query q = qp.parse(s);
  System.out.printf("%s parsed to %s\n", s, q);

produces

  simple sentences rule parsed to t:simple t:sentences t:rule

Like you, I would have expected there to be some shingles in there.
Are we both missing something?

--
Ian.

On Tue, Oct 11, 2011 at 3:25 PM, Peyman Faratin wrote:
> Hi
>
> I have the following shingle filter (Lucene 3.2)
>
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         StandardTokenizer first = new StandardTokenizer(Version.LUCENE_32, reader);
>         StandardFilter second = new StandardFilter(Version.LUCENE_32, first);
>         LowerCaseFilter third = new LowerCaseFilter(Version.LUCENE_32, second);
>         StopFilter fourth = new StopFilter(Version.LUCENE_32, third, Stopwords);
>         PositionFilter fifth = new PositionFilter(fourth);
>         ShingleFilter filter = new ShingleFilter(fifth, shingleSize);
>         return filter;
>     }
>
> that produces the following token stream given the sentence
>
> "please parse this sentence into a shingle of size 2. I'll pay $2 for it"
>
> 1: [_ parse:7->12:shingle]
> 2: [parse:7->12:] [parse sentence:7->26:shingle]
> 3: [sentence:18->26:] [sentence shingle:18->41:shingle]
> 4: [shingle:34->41:] [shingle size:34->49:shingle]
> 5: [size:45->49:] [size 2:45->51:shingle]
> 6: [2:50->51:] [2 pay:50->61:shingle]
> 7: [pay:58->61:] [pay 2:58->64:shingle]
> 8: [2:63->64:]
>
> The query analyzer produces the following analyzed query for the field "titleShingled" for the above sentence:
>
> ...... analyzed query:titleShingled:parse titleShingled:sentence titleShingled:shingle titleShingled:size titleShingled:2 titleShingled:pay titleShingled:2
>
> As you can see, there are no bigram shingles in the query. I tried removing the unigrams from the token stream (using filter.setOutputUnigrams(false) in the above shingle filter), but even though the shingles seem to be fine, the query is empty
>
> 1: [_ parse:7->12:shingle]
> 2: [parse sentence:7->26:shingle]
> 3: [sentence shingle:18->41:shingle]
> 4: [shingle size:34->49:shingle]
> 5: [size 2:45->51:shingle]
> 6: [2 pay:50->61:shingle]
> 7: [pay 2:58->64:shingle]
>
> ...... analyzed query:
>
> My goal is to index both unigrams and bigrams but first try to search on bigrams. I think it is the query parser that is parsing the shingles in a manner that I am not understanding properly.
>
>     QueryParser parser = new QueryParser(Version.LUCENE_32, "titleShingled", new ShinglesAnalyzer(2, Stopwords));
>
> Any help would be very much appreciated
>
> Peyman
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
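
A possible explanation for the output in this thread, offered as an assumption and not checked against 3.2 or 3.4.0 here: the classic QueryParser may split unquoted query text on whitespace and hand each piece to the analyzer separately, in which case a shingle filter never sees two adjacent words. The sketch below uses plain Java only, with no Lucene classes. ShingleSketch, shingles, analyzeWhole and analyzePerTerm are hypothetical names made up for illustration, and stop word handling and filler tokens ("_") are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; none of this is Lucene API.
class ShingleSketch {

    // Emits each token followed by the bigram that starts at it
    // (unigrams plus shingles of size 2, in stream order).
    static List<String> shingles(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i));
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    // The whole text goes through the shingler in one pass,
    // as it would when a document field is analyzed.
    static List<String> analyzeWhole(String text) {
        return shingles(Arrays.asList(text.trim().split("\\s+")));
    }

    // ASSUMPTION: mimics a parser that splits on whitespace first and
    // analyzes every piece on its own, so no piece has a neighbour.
    static List<String> analyzePerTerm(String text) {
        List<String> out = new ArrayList<>();
        for (String term : text.trim().split("\\s+")) {
            out.addAll(shingles(Arrays.asList(term)));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "simple sentences rule";
        System.out.println(analyzeWhole(s));   // unigrams and bigrams
        System.out.println(analyzePerTerm(s)); // unigrams only
    }
}
```

If that assumption holds, the per-term result matches the "t:simple t:sentences t:rule" output reported above, and passing the text as a quoted phrase, so that it reaches the analyzer in one piece, would be one thing to try.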