lucene-java-user mailing list archives

From Peyman Faratin <pey...@robustlinks.com>
Subject Shingles Filter problems
Date Tue, 11 Oct 2011 14:25:29 GMT
Hi

I have the following analyzer built around ShingleFilter (Lucene 3.2):

	public TokenStream tokenStream(String fieldName, Reader reader) {
		StandardTokenizer first = new StandardTokenizer(Version.LUCENE_32, reader);
		StandardFilter second = new StandardFilter(Version.LUCENE_32, first);
		LowerCaseFilter third = new LowerCaseFilter(Version.LUCENE_32, second);
		// Stopwords and shingleSize are fields of the analyzer
		StopFilter fourth = new StopFilter(Version.LUCENE_32, third, Stopwords);
		// default PositionFilter: every token after the first gets position increment 0
		PositionFilter fifth = new PositionFilter(fourth);
		ShingleFilter filter = new ShingleFilter(fifth, shingleSize);
		return filter;
	}

which produces the following token stream for the sentence

"please parse this sentence into a shingle of size 2. I'll pay $2 for it"

1: [_ parse:7->12:shingle] 
2: [parse:7->12:<ALPHANUM>] [parse sentence:7->26:shingle] 
3: [sentence:18->26:<ALPHANUM>] [sentence shingle:18->41:shingle] 
4: [shingle:34->41:<ALPHANUM>] [shingle size:34->49:shingle] 
5: [size:45->49:<ALPHANUM>] [size 2:45->51:shingle] 
6: [2:50->51:<NUM>] [2 pay:50->61:shingle] 
7: [pay:58->61:<ALPHANUM>] [pay 2:58->64:shingle] 
8: [2:63->64:<NUM>] 
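
For reference, here is a minimal plain-Java sketch of the unigram-plus-bigram expansion shown above. It does not use Lucene: `ShingleSketch` and `shingles` are hypothetical names, the input is assumed to be the token list after lowercasing and stop word removal, and the "_" filler for a removed leading token is not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: mimics the output order above
// (each unigram followed by the bigram that starts at it).
class ShingleSketch {

    static List<String> shingles(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i));                        // unigram
            if (i + 1 < tokens.size()) {
                // bigram built from this token and the next one
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("parse", "sentence", "shingle", "size");
        // prints [parse, parse sentence, sentence, sentence shingle, shingle, shingle size, size]
        System.out.println(shingles(tokens));
    }
}
```

The last token has no successor, so it appears only as a unigram, which matches line 8 of the stream above.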

For the sentence above, the query analyzer produces the following analyzed query on the
field "titleShingled":

...... analyzed query:titleShingled:parse titleShingled:sentence titleShingled:shingle titleShingled:size titleShingled:2 titleShingled:pay titleShingled:2

As you can see, there are no bigram shingles in the query. I tried removing the unigrams from
the token stream (using filter.setOutputUnigrams(false) in the shingle filter above), but even
though the shingles look fine, the analyzed query is empty:


1: [_ parse:7->12:shingle] 
2: [parse sentence:7->26:shingle] 
3: [sentence shingle:18->41:shingle] 
4: [shingle size:34->49:shingle] 
5: [size 2:45->51:shingle] 
6: [2 pay:50->61:shingle] 
7: [pay 2:58->64:shingle] 

...... analyzed query: 

My goal is to index both unigrams and bigrams, but to search on bigrams first. I suspect
the QueryParser is handling the shingles in a way that I do not understand. This is how I
construct the parser:

		  QueryParser parser = new QueryParser(Version.LUCENE_32,"titleShingled",new ShinglesAnalyzer(2,Stopwords));
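
One explanation I have not ruled out: as far as I understand, the classic QueryParser splits the query string on whitespace and hands each chunk to the analyzer separately, so the ShingleFilter never sees two adjacent tokens. The plain-Java sketch below simulates that assumption. It does not call Lucene, `PerChunkAnalysisSketch`, `shingle` and `analyzePerChunk` are hypothetical names, and stop word removal is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation; no Lucene classes are used.
class PerChunkAnalysisSketch {

    // Bigram shingles over one token sequence, optionally with unigrams.
    static List<String> shingle(List<String> tokens, boolean outputUnigrams) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));
            }
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    // Assumption: the parser splits on whitespace first, then analyzes
    // every chunk on its own, so each analysis sees exactly one token.
    static List<String> analyzePerChunk(String query, boolean outputUnigrams) {
        List<String> terms = new ArrayList<String>();
        for (String chunk : query.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // an empty query yields a single empty chunk
            }
            List<String> single = new ArrayList<String>();
            single.add(chunk);
            terms.addAll(shingle(single, outputUnigrams));
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [parse, sentence, shingle]
        System.out.println(analyzePerChunk("parse sentence shingle", true));
        // prints []
        System.out.println(analyzePerChunk("parse sentence shingle", false));
    }
}
```

Under that assumption the simulation reproduces both results above: unigrams only when unigram output is on, and an empty query when it is off. If this is indeed the cause, passing the text to the analyzer in a single call (for example as a quoted phrase) would be worth testing.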

Any help would be very much appreciated

Peyman

