lucene-java-user mailing list archives

From Venkateshprasanna <prasanna...@yahoo.co.in>
Subject Indexing bigrams and trigrams in Lucene
Date Mon, 04 Sep 2006 02:58:57 GMT

I need to index bigrams and trigrams in a document. Here is an example:

Text:
This is a text document written by someone. Read this and post your comments

Words that must be indexed:
text
document
written
someone
read
post
your
comments
text document
document written
post your
your comments
text document written
post your comments
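
To make the intended output precise, here is a minimal plain-Java sketch (no Lucene classes) of the rule the list above seems to follow: lowercase the words, drop stop words, and build bigrams and trigrams only from words that are adjacent within a sentence with no stop word between them. The class name NGramSketch and the stop-word list are assumptions inferred from the example, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class NGramSketch {
    // Assumed stop-word list, inferred from the example text above.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("this", "is", "a", "by", "and"));

    /** Returns all 1..maxN-grams built from runs of adjacent non-stop words. */
    static List<String> terms(String text, int maxN) {
        // A "run" is a maximal sequence of non-stop words inside one sentence.
        List<List<String>> runs = new ArrayList<>();
        for (String sentence : text.split("[.!?]+")) {
            List<String> run = new ArrayList<>();
            for (String word : sentence.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (word.isEmpty() || STOP_WORDS.contains(word)) {
                    // A stop word (or empty split piece) ends the current run.
                    if (!run.isEmpty()) {
                        runs.add(run);
                        run = new ArrayList<>();
                    }
                } else {
                    run.add(word);
                }
            }
            if (!run.isEmpty()) {
                runs.add(run); // the sentence boundary also ends the run
            }
        }
        // Emit unigrams first, then bigrams, then trigrams.
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (List<String> run : runs) {
                for (int i = 0; i + n <= run.size(); i++) {
                    out.add(String.join(" ", run.subList(i, i + n)));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "This is a text document written by someone. "
                    + "Read this and post your comments";
        for (String term : terms(text, 3)) {
            System.out.println(term);
        }
    }
}
```

Run on the example text, this prints the fourteen entries listed above, in the same order.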

So, I made changes to StandardAnalyzer.java and StandardTokenizer.jj to try
and achieve this.

I increased the LOOKAHEAD option value to 4:

options {
  LOOKAHEAD = 4;
  FORCE_LA_CHECK = true;
  .
  .
}


I made the following changes to StandardTokenizer.jj:

org.apache.lucene.analysis.Token next() throws IOException :
  :
  :
    {
      if (token.kind == EOF) {
        return null;
      }
      else if (token.kind == ALPHANUM) {
        Token nextToken = token.next;
        if (token.next.kind == ALPHANUM) {
          return
            new org.apache.lucene.analysis.Token(token.image + " " + nextToken.image,
                                                  token.beginColumn, nextToken.endColumn,
                                                  tokenImage[token.kind]);
        }
      }
      else {
        return
          new org.apache.lucene.analysis.Token(token.image,
                                                token.beginColumn, token.endColumn,
                                                tokenImage[token.kind]);
      }
    }


That is, I am using token.next to get information about the next token, but it
is returning null. What is the reason, and is there a better way of doing this?
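
One alternative that might avoid touching the grammar at all is to combine tokens after tokenization, using a small look-ahead buffer. The following is only a sketch of that idea: plain Java iterators stand in for the tokenizer output, the class ShingleIterator is hypothetical (not a Lucene API), and stop-word and sentence-boundary handling are left out.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical helper: for each input word, emits the word and the n-grams that start at it. */
public class ShingleIterator implements Iterator<String> {
    private final Iterator<String> input;
    private final int maxN;
    private final Deque<String> window = new ArrayDeque<>(); // look-ahead buffer
    private int emitted = 0; // how many n-grams were already emitted for the window head

    public ShingleIterator(Iterator<String> input, int maxN) {
        this.input = input;
        this.maxN = maxN;
    }

    // Pull words from the input until the buffer holds maxN words or the input ends.
    private void fill() {
        while (window.size() < maxN && input.hasNext()) {
            window.addLast(input.next());
        }
    }

    @Override
    public boolean hasNext() {
        fill();
        if (!window.isEmpty() && emitted >= window.size()) {
            // Every n-gram starting at the head is done; slide the window by one word.
            window.removeFirst();
            emitted = 0;
            fill();
        }
        return !window.isEmpty();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        emitted++;
        // Join the first 'emitted' words of the window with single spaces.
        StringBuilder sb = new StringBuilder();
        int i = 0;
        for (String word : window) {
            if (i++ == emitted) {
                break;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(word);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Iterator<String> words = Arrays.asList("post", "your", "comments").iterator();
        Iterator<String> shingles = new ShingleIterator(words, 3);
        while (shingles.hasNext()) {
            System.out.println(shingles.next());
        }
    }
}
```

For the input post, your, comments with a maximum size of 3, this prints: post, post your, post your comments, your, your comments, comments. Because the buffer owns the look-ahead, nothing here depends on token.next being populated by the parser.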



