lucene-java-user mailing list archives

From "Rahul D Thakare" <rahul_thak...@rediffmail.com>
Subject Wild card and multiple keyword search
Date Wed, 13 Jul 2005 12:18:44 GMT
  
Hi, 

 We are using doc.add(Field.Text("keywords", keywords)); to add the keywords to the document,
where keywords is a comma-separated string of keywords.
Lucene seems to tokenize multi-word keywords such as MAIN BOARD into separate tokens (i.e.
MAIN and BOARD); tokenization appears to split on both commas and spaces. So if we search for
"MAIN BOARD", documents having keywords like "MAIN LOGIC", "MAIN PARTS", etc. also show up.
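
To illustrate what we are seeing, here is a minimal plain-Java sketch. It makes no Lucene calls, the class and method names are hypothetical, and the split rule is only our assumption about how the tokenized field behaves. It shows how "MAIN BOARD" and "MAIN LOGIC" end up sharing the token MAIN:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Lucene code.
class TokenSplitSketch {
    // Splits a keyword string on commas and whitespace, which is how
    // the tokenized field appears to behave for us (an assumption).
    static List<String> tokens(String keywords) {
        List<String> out = new ArrayList<>();
        for (String t : keywords.split("[,\\s]+")) {
            if (!t.isEmpty()) {
                out.add(t.toUpperCase());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("MAIN BOARD, POWER SUPPLY")); // [MAIN, BOARD, POWER, SUPPLY]
        System.out.println(tokens("MAIN LOGIC"));               // [MAIN, LOGIC]
    }
}
```

Once the keywords are broken up this way, a search that matches on the token MAIN can no longer tell "MAIN BOARD" apart from "MAIN LOGIC".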

If someone searches for "MAIN BOARD", we want to get only the documents that have the keyword
"MAIN BOARD". How can we do this?

To achieve this we used doc.add(Field.Keyword("keywords", keywords)); for indexing. While
searching we cannot use the standard analyzer, as it splits the query whenever the keyword
contains a space, so we wrote a KeywordAnalyzer (it simply returns the entire input as one
single token), as given below.

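import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

/**
 * Tokenizes the entire stream as a single, upper-cased token.
 */
public class KeywordAnalyzer extends Analyzer
{
	public TokenStream tokenStream(String fieldName, final Reader reader)
	{
		return new TokenStream() {
			// Set once the single token has been handed out.
			private boolean done;
			// Scratch space for reading the input in chunks.
			private final char[] chunk = new char[1024];

			public Token next() throws IOException
			{
				if (done)
				{
					return null; // no more tokens after the first one
				}
				done = true;

				// Read the whole input into one string.
				StringBuffer text = new StringBuffer();
				int length;
				while ((length = reader.read(chunk)) != -1)
				{
					text.append(chunk, 0, length);
				}

				// Emit the full text, upper-cased, as the only token.
				String value = text.toString();
				return new Token(value.toUpperCase(), 0, value.length());
			}
		};
	}
}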

This solves the problem described above, but now we are not able to do wildcard searches
like MAIN*, etc.

We need both pieces of functionality, i.e.
1. If a user searches for MAIN BOARD, they should get only documents that contain MAIN BOARD,
and not MAIN LOGIC, MAIN, MAIN PART, etc.
2. The user should be able to do wildcard searches like MAIN*, etc. and get the desired documents.
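
To make the two requirements concrete, here is a minimal plain-Java sketch of the matching behaviour we are after. It is not Lucene code and not a proposed solution; the class and method names are hypothetical, and it only handles a trailing * as the wildcard:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the desired matching; not Lucene code.
class KeywordMatchSketch {
    // Splits on commas only, so "MAIN BOARD" stays one keyword.
    static List<String> keywords(String raw) {
        List<String> out = new ArrayList<>();
        for (String k : raw.split(",")) {
            String trimmed = k.trim().toUpperCase();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    // Matches a whole keyword exactly, or by prefix when the query ends with '*'.
    static boolean matches(String raw, String query) {
        String q = query.trim().toUpperCase();
        boolean prefix = q.endsWith("*");
        if (prefix) {
            q = q.substring(0, q.length() - 1);
        }
        for (String k : keywords(raw)) {
            if (prefix ? k.startsWith(q) : k.equals(q)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches("MAIN BOARD, POWER SUPPLY", "MAIN BOARD")); // true
        System.out.println(matches("MAIN LOGIC, POWER SUPPLY", "MAIN BOARD")); // false
        System.out.println(matches("MAIN LOGIC, POWER SUPPLY", "MAIN*"));      // true
    }
}
```

The first two calls correspond to requirement 1 and the third to requirement 2.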

Please let us know how we should do the indexing, and which analyzer we should use for
searching.

thanks
Rahul...