lucene-java-user mailing list archives

From Mansour Al Akeel <>
Subject StandardTokenizer and split tokens
Date Fri, 22 Jun 2012 22:26:18 GMT
Hello all,

I am trying to write a simple autosuggest functionality. I was looking
at some autosuggest code, and came across this post.
I have been stuck with some strange words, trying to see how they
are generated. Here's the Analyzer:

public class AutoCompleteAnalyzer extends Analyzer {
	public TokenStream tokenStream(String fieldName, Reader reader) {
		TokenStream result = new StandardTokenizer(Version.LUCENE_36, reader);
		result = new EdgeNGramTokenFilter(result,
				EdgeNGramTokenFilter.Side.FRONT, 1, 20);
		return result;
	}
}
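For reference, the per-token expansion that EdgeNGramTokenFilter(Side.FRONT, 1, 20) performs can be sketched in plain Java without a Lucene dependency (a sketch only; the real filter operates on the token stream and attributes):

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGramSketch {
	// Emit the FRONT edge n-grams of a single token, lengths minGram..maxGram,
	// mirroring what EdgeNGramTokenFilter(Side.FRONT, 1, 20) emits per token.
	static List<String> frontEdgeNGrams(String token, int minGram, int maxGram) {
		List<String> grams = new ArrayList<String>();
		int max = Math.min(maxGram, token.length());
		for (int len = minGram; len <= max; len++) {
			grams.add(token.substring(0, len));
		}
		return grams;
	}

	public static void main(String[] args) {
		System.out.println(frontEdgeNGrams("lord", 1, 20)); // [l, lo, lor, lord]
	}
}
```

So every gram is a strict prefix of the original token; none of them can be longer than the token itself.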

And this is the relevant method that does the indexing. It's being
called with reindexOn("title");

private void reindexOn(String keyword) throws CorruptIndexException,
		IOException {
	// (log call garbled in the original: "indexing on " + keyword)
	Analyzer analyzer = new AutoCompleteAnalyzer();
	IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
	IndexWriter analyticalWriter = new IndexWriter(suggestIndexDirectory, config);
	analyticalWriter.commit(); // needed to create the initial index
	IndexReader indexReader = IndexReader.open(indexDirectory); // source index; directory name assumed
	Map<String, Integer> wordsMap = new HashMap<String, Integer>();
	LuceneDictionary dict = new LuceneDictionary(indexReader, keyword);
	BytesRefIterator iter = dict.getWordsIterator();
	BytesRef ref = null;
	while ((ref = iter.next()) != null) {
		String word = new String(ref.bytes);
		int len = word.length();
		if (len < 3) {
			continue; // skip very short words
		}
		if (wordsMap.containsKey(word)) {
			String msg = "Word " + word + " Already Exists";
			throw new IllegalStateException(msg);
		}
		wordsMap.put(word, indexReader.docFreq(new Term(keyword, word)));
	}

	for (String word : wordsMap.keySet()) {
		Document doc = new Document();
		Field field = new Field(SOURCE_WORD_FIELD, word, Field.Store.YES,
				Field.Index.NOT_ANALYZED);
		doc.add(field);
		field = new Field(GRAMMED_WORDS_FIELD, word,
				Field.Store.YES, Field.Index.ANALYZED);
		doc.add(field);
		String count = Integer.toString(wordsMap.get(word));
		field = new Field(COUNT_FIELD, count, Field.Store.NO,
				Field.Index.NOT_ANALYZED); // count
		doc.add(field);
		analyticalWriter.addDocument(doc);
	}
	analyticalWriter.commit();
	analyticalWriter.close();
	indexReader.close();
}

	private static final String GRAMMED_WORDS_FIELD = "words";
	private static final String SOURCE_WORD_FIELD = "sourceWord";
	private static final String COUNT_FIELD = "count";
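One detail worth noting about the `new String(ref.bytes)` line above: in Lucene, a BytesRef carries offset and length fields because its backing byte array may be reused and can be larger than the current value, so converting the whole array can pick up stale bytes left over from a previous, longer value (which is why `ref.utf8ToString()` exists). A minimal plain-Java sketch of the effect, with hypothetical buffer contents not taken from Lucene:

```java
public class BufferReuseSketch {
	public static void main(String[] args) {
		// A reusable byte buffer, as an iterator might maintain internally.
		byte[] buf = new byte[20];

		// First value written into the buffer: "lucene" (6 bytes).
		byte[] first = "lucene".getBytes();
		System.arraycopy(first, 0, buf, 0, first.length);
		int len = first.length;
		System.out.println(new String(buf, 0, len)); // lucene

		// Next value, "lord" (4 bytes), overwrites only the first 4 bytes.
		byte[] second = "lord".getBytes();
		System.arraycopy(second, 0, buf, 0, second.length);
		len = second.length;

		// Honoring the valid length gives the real word...
		System.out.println(new String(buf, 0, len)); // lord
		// ...but reading past it keeps the stale tail of the previous value.
		System.out.println(new String(buf, 0, 6)); // lordne
	}
}
```

Converting the full array, as `new String(ref.bytes)` does, behaves like the last line (plus trailing NULs), so the resulting "word" can be longer than any token the analyzer produced.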

And now, my unit test:

	public static void setUp() throws CorruptIndexException, IOException {
		String idxFileName = "myIndexDirectory";
		Indexer indexer = new Indexer(idxFileName);
		indexer.addDoc("Apache Lucene in Action");
		indexer.addDoc("Lord of the Rings");
		indexer.addDoc("Apache Solr in Action");
		indexer.addDoc("apples and Oranges");
		indexer.addDoc("apple iphone");
		search = new SearchEngine(idxFileName);
	}

The strange part is that, looking inside the index, I found
sourceWords like (lordne, applee, solres). I understand that the ngram
filter will produce prefixes of each word, e.g. for "lord":

l, lo, lor, lord

All of these go into one field, but where do "lorden" and "solres"
come from?? I checked the docs for this, and looked into Jira, but didn't find
relevant info.
Is there something I am missing??

I understand there could be easier ways to create this functionality,
but I'd like to resolve this issue and understand whether I am doing
something wrong.

Thank you in advance.
