lucene-java-user mailing list archives

From Peyman Faratin <pey...@robustlinks.com>
Subject ShinglesAnalyzer Question
Date Sun, 09 Oct 2011 16:11:58 GMT
Hi

I am trying to understand why I cannot retrieve documents that I indexed with a custom ShinglesAnalyzer.
The setup is as follows:


During indexing I do the following:

		PerFieldAnalyzerWrapper wrapper = DocFieldAnalyzerWrapper.getDocFieldAnalyzerWrapper(Stopwords);

		writer = new IndexWriter(_lucenedir,
				new IndexWriterConfig(Version.LUCENE_32,wrapper));

where DocFieldAnalyzerWrapper.getDocFieldAnalyzerWrapper returns a PerFieldAnalyzerWrapper configured as follows:

		public static PerFieldAnalyzerWrapper getDocFieldAnalyzerWrapper(HashSet<String> Stopwords){
			PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new KeywordAnalyzer());
			wrapper.addAnalyzer("title",new KeywordAnalyzer());
			wrapper.addAnalyzer("titleSynonyms",new KeywordAnalyzer());
			wrapper.addAnalyzer("date",new KeywordAnalyzer());
			wrapper.addAnalyzer("about",new KeywordAnalyzer());

			wrapper.addAnalyzer("titleAnalyzed",new StandardAnalyzer(Version.LUCENE_32,Stopwords));
			wrapper.addAnalyzer("content",new LimitTokenCountAnalyzer(
										new StandardAnalyzer(Version.LUCENE_32,Stopwords),
											Integer.MAX_VALUE));
			wrapper.addAnalyzer("contentForSpelling",new ShinglesAnalyzer(2,Stopwords));
			return wrapper;
		}

where the custom ShinglesAnalyzer is defined as follows: 

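	 public class ShinglesAnalyzer extends Analyzer {
	  private HashSet<String> Stopwords;
	  private Integer shingleSize;

	  // Constructor not shown in the original listing; assumed to simply store
	  // the arguments passed in calls such as new ShinglesAnalyzer(2,Stopwords)
	  public ShinglesAnalyzer(Integer shingleSize, HashSet<String> Stopwords) {
		  this.shingleSize = shingleSize;
		  this.Stopwords = Stopwords;
	  }

	  @Override
	  public TokenStream tokenStream(String fieldName, Reader reader) {
		  // Chain: StandardTokenizer -> StandardFilter -> LowerCaseFilter -> StopFilter -> ShingleFilter
		  TokenStream filter = new ShingleFilter(
						new StopFilter(Version.LUCENE_32,
		    				new LowerCaseFilter(Version.LUCENE_32,
		    				new StandardFilter(Version.LUCENE_32,
		    				new StandardTokenizer(Version.LUCENE_32, reader))),
	    					Stopwords),
		    				shingleSize);
		   return filter;
		}
	}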

Then I index as follows (note: every field is marked ANALYZED; the fields that should not
be tokenized are mapped to KeywordAnalyzer in the wrapper above):

				doc.add(new Field("title",title,Field.Store.YES, Field.Index.ANALYZED));
				doc.add(new Field("titleAnalyzed",title,Field.Store.YES,Field.Index.ANALYZED,Field.TermVector.WITH_POSITIONS_OFFSETS));
				doc.add(new Field("titleSynonyms",pageSynonmy.toString(),Field.Store.YES, Field.Index.ANALYZED));
				doc.add(new Field("about",article.getAbout().toString(),Field.Store.YES, Field.Index.ANALYZED));
				doc.add(new Field("date", article.getDateCreated(),Field.Store.NO, Field.Index.ANALYZED));
				
				String content = article.getCleanContent();
				Field contentField = new Field("content",
						content, Field.Store.NO,
						Field.Index.ANALYZED,
						Field.TermVector.WITH_POSITIONS_OFFSETS);
				doc.add(contentField);
				
				Field contentSpellingField = new Field("contentForSpelling",
						content, Field.Store.YES,
						Field.Index.ANALYZED,
						Field.TermVector.WITH_POSITIONS_OFFSETS);
				doc.add(contentSpellingField);

Looking at the index with Luke, the field "contentForSpelling" contains both unigrams and
bigrams (the shingle size is set to 2).
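
For illustration, the unigram-plus-bigram term list described above can be modeled in plain Java without Lucene. This is a simplified sketch, not ShingleFilter itself: the class name ShingleSketch and the method shingles are hypothetical, and the model ignores the "_" placeholders that mark removed stopwords.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of bigram shingling with unigrams kept.
// Hypothetical helper for illustration only; not Lucene's ShingleFilter.
class ShingleSketch {

    // Emits each token, followed by the bigram that starts at that token (if one exists).
    static List<String> shingles(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i));
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [season, season package, package, package dallas, dallas]
        System.out.println(shingles(Arrays.asList("season", "package", "dallas")));
    }
}
```

A single-token input yields only that token, since there is no neighbour to pair it with.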

At search time, given a query q (a sentence provided by the user), I do the
following:

		  Analyzer analyzer = new ShinglesAnalyzer(2,Stopwords);
		  QueryParser parser = new QueryParser(Version.LUCENE_32, "contentForSpelling",analyzer);
		  Query query = parser.parse(q);
		  // search(Query, int) returns TopDocs; the limit of 10 hits is an arbitrary placeholder
		  TopDocs hits = searcher.search(query, 10);


This is the output for a sample query, showing the tokens ShinglesAnalyzer emits at each position:

query: $13 for any of season package at Dallas

ShinglesAnalyzer:
    
1: [13:1->3:<NUM>] [13 _:1->15:shingle] 
2: [_ season:15->21:shingle] 
3: [season:15->21:<ALPHANUM>] [season package:15->29:shingle] 
4: [package:22->29:<ALPHANUM>] [package _:22->33:shingle] 
5: [_ dallas:33->39:shingle] 
6: [dallas:33->39:<ALPHANUM>] 

But when I print the parsed query (query.toString()), it looks like this:

analyzed query: contentForSpelling:13 contentForSpelling:season contentForSpelling:package contentForSpelling:dallas

The query looks wrong to me: it contains only the unigrams and none of the shingles that the analyzer emits.
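
A possible cause, stated here as an assumption rather than a confirmed diagnosis: the classic QueryParser splits the query text on whitespace before it calls the analyzer, so the analyzer only ever sees one word at a time and has nothing to combine into a shingle. The plain-Java sketch below models that behaviour without Lucene. The class PerTermAnalysisSketch, its methods, and the four-word stopword list are hypothetical, and the toy analyzer ignores the "_" placeholders shown in the real output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of "split on whitespace first, analyze each chunk separately".
// Hypothetical illustration only; it does not call or reproduce Lucene code.
class PerTermAnalysisSketch {

    // Assumed stopword list, chosen to match the sample query.
    static final Set<String> STOPWORDS =
            new HashSet<String>(Arrays.asList("for", "any", "of", "at"));

    // Toy analyzer: lowercase, split on non-alphanumerics, drop stopwords,
    // then emit every remaining token plus the bigram that starts at it.
    static List<String> analyze(String text) {
        List<String> kept = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                kept.add(t);
            }
        }
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < kept.size(); i++) {
            out.add(kept.get(i));
            if (i + 1 < kept.size()) {
                out.add(kept.get(i) + " " + kept.get(i + 1));
            }
        }
        return out;
    }

    // Models a parser that cuts the query at whitespace and analyzes each chunk alone.
    static List<String> parseThenAnalyze(String query) {
        List<String> out = new ArrayList<String>();
        for (String chunk : query.trim().split("\\s+")) {
            out.addAll(analyze(chunk));
        }
        return out;
    }

    public static void main(String[] args) {
        String q = "$13 for any of season package at Dallas";
        // prints [13, season, package, dallas]
        System.out.println(parseThenAnalyze(q));
        // prints [13, 13 season, season, season package, package, package dallas, dallas]
        System.out.println(analyze(q));
    }
}
```

If this is indeed the cause, the shingles would only appear when the whole sentence reaches the analyzer in a single call, for example by running the analyzer over q directly and building the query from the resulting tokens.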

Thank you,

Peyman

