lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: "Umlaute" getting lost
Date Sun, 24 Apr 2011 06:30:23 GMT

On Apr 21, 2011, at 5:02 PM, Clemens Wyss wrote:

> I keep my search terms in a dedicated RAMDirectory (the termIndex).
> In there I place all the terms of my real index. When putting the terms into the
> termIndex I can still see [using the debugger] the Umlaute (äöü). Unfortunately, when
> searching the termIndex, the documents no longer contain these Umlaute.
> 
> Populating the termIndex:
> termIndex = new RAMDirectory();
> IndexWriterConfig config = new IndexWriterConfig( Version.LUCENE_31, new TermAnalyzer( locale ) );
> termIndexWriter = new IndexWriter( termIndex, config );
> TermEnum tEnum = realIndexReader.terms();
> while ( tEnum.next() )
> {
> 	Term t = tEnum.term();
> 	String termText = t.text();
> 	Document termDocument = new Document();
> 	Field field = new Field( FIELDNAME_TERM, termText, Field.Store.YES, Field.Index.ANALYZED );
> 	termDocument.add( field );
> 	// and add term into the index
> 	termIndexWriter.addDocument( termDocument );
> }
> termIndexWriter.commit();
> termIndexWriter.optimize();
> termIndexWriter.close();
> 
> termIndexReader = IndexReader.open( termIndex, true );
> ---------- searching terms
> Query q = fuzzy ? new FuzzyQuery( new Term( FIELDNAME_TERM, termFilter.toLowerCase() ) ) :
> 					new WildcardQuery( new Term( FIELDNAME_TERM, "*" + termFilter.toLowerCase() + "*" ) );
> TopDocs topDocs = new IndexSearcher( getTermIndexReader() ).search( q, 100 );				
> for ( ScoreDoc hit : topDocs.scoreDocs )
> {
> 	Document doc = getTermIndexReader().document( hit.doc );
> 	String indexTerm = doc.get( FIELDNAME_TERM );
> 	if ( !returnValue.contains( indexTerm  ) )
> 	{
> 		returnValue.add( indexTerm );
> 	}
> }
> ----------
> The TermAnalyzer is the same analyzer as the main index analyzer, except that a LowerCaseFilter is applied.

What is the Analyzer for the main index?  Which tokenizer and token filters does it use?
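
The filter chain matters because some token filters fold or strip diacritics, so a term can go in as "müller" and come out as "muller". The sketch below is not Lucene code and makes no claim about what your analyzer does. It only uses java.text.Normalizer to imitate the effect an accent-folding filter would have, and the names FoldingDemo and foldDiacritics are made up for illustration. Whether any such filter is in your chain is exactly the open question.

```java
import java.text.Normalizer;

public class FoldingDemo {

    // Hypothetical helper, not part of Lucene: decomposes each character
    // into base letter plus combining marks (NFD), then drops the marks.
    // This roughly imitates what an accent-folding token filter does.
    static String foldDiacritics(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Compare each term before and after folding.
        String[] terms = { "äöü", "müller", "haus" };
        for (String term : terms) {
            System.out.println(term + " -> " + foldDiacritics(term));
        }
    }
}
```

If the terms coming back from the termIndex look like the right-hand side of that output, a folding or stemming filter in the analyzer chain is a likely suspect.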

Out of curiosity, what is the problem you are trying to solve?

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

