lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: StandardTokenizer and split tokens
Date Sat, 23 Jun 2012 07:06:52 GMT
Don't ever do this:

String word = new String(ref.bytes);

This has the following problems:
- it ignores the character set! (In general: never, ever use new String(byte[])
  without the second charset parameter. byte[] != String, and depending on the
  default charset of your computer this returns garbage.)
- it ignores the length
- it ignores the offset

The last two points are exactly where your strange sourceWords come from: the
BytesRef's backing array is reused and may be larger than the current term, so
the bytes after ref.length are leftovers from a previous, longer term ("lord"
written over the remains of a longer term comes out as "lordne").
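
To make the length and offset problems concrete, here is a small hedged
illustration (the class name and buffer contents are made up; the buffer
simulates a reused BytesRef backing array):

import java.io.UnsupportedEncodingException;

public class BytesRefDecodeDemo {
	public static void main(String[] args) throws UnsupportedEncodingException {
		// Simulated reused buffer: the current term is "lord" (4 bytes);
		// the trailing "ne" is stale data left over from a longer term.
		byte[] buffer = "lordne".getBytes("UTF-8");
		int offset = 0, length = 4;

		// Wrong: decodes the whole array with the platform default charset.
		String bad = new String(buffer); // "lordne"
		// Right: decodes only length bytes from offset, as UTF-8.
		String good = new String(buffer, offset, length, "UTF-8"); // "lord"

		System.out.println(bad + " vs " + good);
	}
}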

Use the following code to convert a UTF-8 encoded BytesRef to a String:

String word = ref.utf8ToString();
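
Applied to the loop in the quoted message below, the collection step would look
roughly like this (a sketch of a method meant to live in the same indexing
class; collectWords is a made-up name):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefIterator;

// Collects the terms of a field together with their document frequency.
static Map<String, Integer> collectWords(IndexReader reader, String field)
		throws IOException {
	Map<String, Integer> words = new HashMap<String, Integer>();
	BytesRefIterator iter = new LuceneDictionary(reader, field).getWordsIterator();
	BytesRef ref;
	while ((ref = iter.next()) != null) {
		// Decodes exactly ref.length bytes starting at ref.offset as UTF-8.
		String word = ref.utf8ToString();
		if (word.length() >= 3) {
			words.put(word, reader.docFreq(new Term(field, word)));
		}
	}
	return words;
}

Since the terms enumeration yields each distinct term exactly once, the
duplicate-word check from the original loop becomes unnecessary once the
decoding is fixed.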

Thanks :-)

P.S.: I posted this to the list because I want to prevent anybody else from
picking up the code you posted.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Mansour Al Akeel [mailto:mansour.alakeel@gmail.com]
> Sent: Saturday, June 23, 2012 12:26 AM
> To: java-user@lucene.apache.org
> Subject: StandardTokenizer and split tokens
> 
> Hello all,
> 
> I am trying to write a simple autosuggest functionality. I was looking at
> some autosuggest code and came across this post:
> http://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene
> I have been stuck on some strange words, trying to see how they are
> generated. Here's the Analyzer:
> 
> public class AutoCompleteAnalyzer extends Analyzer {
> 	public TokenStream tokenStream(String fieldName, Reader reader) {
> 		TokenStream result = new StandardTokenizer(Version.LUCENE_36, reader);
> 		result = new EdgeNGramTokenFilter(result,
> 				EdgeNGramTokenFilter.Side.FRONT, 1, 20);
> 		return result;
> 	}
> }
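> 
> To check which tokens this analyzer actually emits, the stream can be
> dumped directly (a minimal sketch, assuming Lucene 3.6; the class name and
> sample text are made up):
> 
> import java.io.StringReader;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> 
> public class DumpTokens {
> 	public static void main(String[] args) throws Exception {
> 		Analyzer analyzer = new AutoCompleteAnalyzer();
> 		TokenStream ts = analyzer.tokenStream("title",
> 				new StringReader("lord of the rings"));
> 		CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
> 		ts.reset();
> 		while (ts.incrementToken()) {
> 			// prints l, lo, lor, lord, o, of, t, th, the, ...
> 			System.out.println(term.toString());
> 		}
> 		ts.end();
> 		ts.close();
> 	}
> }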
> 
> And this is the relevant method that does the indexing. It's being called
> with reindexOn("title");
> 
> 	private void reindexOn(String keyword) throws CorruptIndexException,
> 			IOException {
> 		log.info("indexing on " + keyword);
> 		Analyzer analyzer = new AutoCompleteAnalyzer();
> 		IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36,
> 				analyzer);
> 		IndexWriter analyticalWriter = new IndexWriter(suggestIndexDirectory,
> 				config);
> 		analyticalWriter.commit(); // needed to create the initial index
> 		IndexReader indexReader = IndexReader.open(productsIndexDirectory);
> 		Map<String, Integer> wordsMap = new HashMap<String, Integer>();
> 		LuceneDictionary dict = new LuceneDictionary(indexReader, keyword);
> 		BytesRefIterator iter = dict.getWordsIterator();
> 		BytesRef ref = null;
> 		while ((ref = iter.next()) != null) {
> 			String word = new String(ref.bytes);
> 			int len = word.length();
> 			if (len < 3) {
> 				continue;
> 			}
> 			if (wordsMap.containsKey(word)) {
> 				String msg = "Word " + word + " Already Exists";
> 				throw new IllegalStateException(msg);
> 			}
> 			wordsMap.put(word, indexReader.docFreq(new Term(keyword, word)));
> 		}
> 
> 		for (String word : wordsMap.keySet()) {
> 			Document doc = new Document();
> 			Field field = new Field(SOURCE_WORD_FIELD, word,
> 					Field.Store.YES, Field.Index.NOT_ANALYZED);
> 			doc.add(field);
> 			field = new Field(GRAMMED_WORDS_FIELD, word,
> 					Field.Store.YES, Field.Index.ANALYZED);
> 			doc.add(field);
> 			String count = Integer.toString(wordsMap.get(word));
> 			field = new Field(COUNT_FIELD, count,
> 					Field.Store.NO, Field.Index.NOT_ANALYZED); // count
> 			doc.add(field);
> 			analyticalWriter.addDocument(doc);
> 		}
> 		analyticalWriter.commit();
> 		analyticalWriter.close();
> 		indexReader.close();
> 	}
> 
> 	private static final String GRAMMED_WORDS_FIELD = "words";
> 	private static final String SOURCE_WORD_FIELD = "sourceWord";
> 	private static final String COUNT_FIELD = "count";
> 
> And now, my unit test:
> 
> 	@BeforeClass
> 	public static void setUp() throws CorruptIndexException, IOException {
> 		String idxFileName = "myIndexDirectory";
> 		Indexer indexer = new Indexer(idxFileName);
> 		indexer.addDoc("Apache Lucene in Action");
> 		indexer.addDoc("Lord of the Rings");
> 		indexer.addDoc("Apache Solr in Action");
> 		indexer.addDoc("apples and Oranges");
> 		indexer.addDoc("apple iphone");
> 		indexer.reindexKeywords();
> 		search = new SearchEngine(idxFileName);
> 	}
> 
> The strange part: looking inside the index, I found sourceWords such as
> "lordne", "applee", and "solres". I understand that the ngram filter will
> produce the prefixes of each word, e.g.:
> 
> l
> lo
> lor
> lord
> 
> All of these go into one field, but where do "lordne" and "solres" come
> from? I checked the docs for this and looked into Jira, but didn't find
> relevant info. Is there something I am missing?
> 
> I understand there may be easier ways to build this functionality
> (http://wiki.apache.org/lucene-java/SpellChecker), but I would like to
> resolve this issue and understand whether I am doing something wrong.
> 
> Thank you in advance.
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

