In-Reply-To: <BE149AFB-0448-4C0E-B89F-4A8DD4E077B2@robustlinks.com>
References: <BE149AFB-0448-4C0E-B89F-4A8DD4E077B2@robustlinks.com>
From: Ian Lea <ian.lea@gmail.com>
Date: Tue, 11 Oct 2011 20:51:27 +0100
Message-ID: 
 <CAEY5pxVtkc4g0BO_1eQM-T4pEyDow7jK_Pcp+Rjz9Ukh9OncXg@mail.gmail.com>
Subject: Re: Shingles Filter problems
To: java-user@lucene.apache.org

Something does appear dodgy here.  Using 3.4.0 the following very
simple code, with no custom classes

	ShingleAnalyzerWrapper saw = new ShingleAnalyzerWrapper(LUCENE_34);
	QueryParser qp = new QueryParser(LUCENE_34, "t", saw);
	String s = "simple sentences rule";
	Query q = qp.parse(s);
	System.out.printf("%s parsed to %s\n", s, q);

produces

simple sentences rule parsed to t:simple t:sentences t:rule

Like you, I would have expected there to be some shingles in there.
Are we both missing something?
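
To be concrete about what I expected, here is a plain-Java sketch that
builds the unigrams and bigrams for that input. It does not use Lucene
at all: the class ShingleSketch and the method shingles() are made-up
names for illustration, and it only approximates what I assume a
shingle filter with unigram output enabled would emit.

```java
import java.util.ArrayList;
import java.util.List;

public class ShingleSketch {

    // Hypothetical helper, not Lucene's ShingleFilter: for each word, emit
    // the word itself, then every shingle of 2..maxSize words starting there.
    static List<String> shingles(String text, int maxSize) {
        String[] words = text.trim().split("\\s+");
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < words.length; i++) {
            StringBuilder sb = new StringBuilder(words[i]);
            out.add(sb.toString());                      // unigram
            for (int n = 2; n <= maxSize && i + n <= words.length; n++) {
                sb.append(' ').append(words[i + n - 1]); // extend by one word
                out.add(sb.toString());                  // shingle of n words
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Print each expected term with the field name used above.
        for (String term : shingles("simple sentences rule", 2)) {
            System.out.println("t:" + term);
        }
    }
}
```

So alongside t:simple, t:sentences and t:rule I would have expected to
see terms for "simple sentences" and "sentences rule" in the parsed
query.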


--
Ian.


On Tue, Oct 11, 2011 at 3:25 PM, Peyman Faratin <peyman@robustlinks.com> wrote:
> Hi
>
> I have the following shinglefilter (Lucene 3.2)
>
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         StandardTokenizer first = new StandardTokenizer(Version.LUCENE_32, reader);
>         StandardFilter second = new StandardFilter(Version.LUCENE_32, first);
>         LowerCaseFilter third = new LowerCaseFilter(Version.LUCENE_32, second);
>         StopFilter fourth = new StopFilter(Version.LUCENE_32, third, Stopwords);
>         PositionFilter fifth = new PositionFilter(fourth);
>         ShingleFilter filter = new ShingleFilter(fifth, shingleSize);
>         return filter;
>     }
>
> that produces the following token stream for the sentence
>
> "please parse this sentence into a shingle of size 2. I'll pay $2 for it"
>
> 1: [_ parse:7->12:shingle]
> 2: [parse:7->12:<ALPHANUM>] [parse sentence:7->26:shingle]
> 3: [sentence:18->26:<ALPHANUM>] [sentence shingle:18->41:shingle]
> 4: [shingle:34->41:<ALPHANUM>] [shingle size:34->49:shingle]
> 5: [size:45->49:<ALPHANUM>] [size 2:45->51:shingle]
> 6: [2:50->51:<NUM>] [2 pay:50->61:shingle]
> 7: [pay:58->61:<ALPHANUM>] [pay 2:58->64:shingle]
> 8: [2:63->64:<NUM>]
>
> The query analyzer produces the following analyzed query for the field
> "titleShingled" for the above sentence:
>
> ...... analyzed query:titleShingled:parse titleShingled:sentence titleShingled:shingle titleShingled:size titleShingled:2 titleShingled:pay titleShingled:2
>
> As you can see, there are no bigram shingles in the query. I tried
> removing the unigrams from the token stream (using
> filter.setOutputUnigrams(false) in the above shingle filter), but even
> though the shingles seem to be fine, the query is empty:
>
>
> 1: [_ parse:7->12:shingle]
> 2: [parse sentence:7->26:shingle]
> 3: [sentence shingle:18->41:shingle]
> 4: [shingle size:34->49:shingle]
> 5: [size 2:45->51:shingle]
> 6: [2 pay:50->61:shingle]
> 7: [pay 2:58->64:shingle]
>
> ...... analyzed query:
>
> My goal is to index both unigrams and bigrams but first try to search
> on bigrams. I think it is the QueryParser that is parsing the shingles
> in a manner that I am not understanding properly.
>
>     QueryParser parser = new QueryParser(Version.LUCENE_32, "titleShingled", new ShinglesAnalyzer(2, Stopwords));
>
> Any help would be very much appreciated
>
> Peyman
>
>
