From: Clemens Wyss <clemensdev@mysign.ch>
To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
Date: Thu, 21 Apr 2011 17:02:45 +0200
Subject: "Umlaute" getting lost

I keep my search terms in a dedicated RAMDirectory (the termIndex).
In there I place all the terms of my real index. When putting the terms into the
termIndex I can still see [using the debugger] the Umlaute (äöü). Unfortunately, when searching the
termIndex, the returned documents no longer contain these Umlaute.

Populating the termIndex:
termIndex = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig( Version.LUCENE_31, new TermAnalyzer( locale ) );
termIndexWriter = new IndexWriter( termIndex, config );
TermEnum tEnum = realIndexReader.terms();
while ( tEnum.next() )
{
	Term t = tEnum.term();
	String termText = t.text();
	Document termDocument = new Document();
	Field field = new Field( FIELDNAME_TERM, termText, Field.Store.YES, Field.Index.ANALYZED );
	termDocument.add( field );
	// and add term into the index
	termIndexWriter.addDocument( termDocument );
}
termIndexWriter.commit();
termIndexWriter.optimize();
termIndexWriter.close();

termIndexReader = IndexReader.open( termIndex, true );
---------- searching terms
Query q = fuzzy ? new FuzzyQuery( new Term( FIELDNAME_TERM, termFilter.toLowerCase() ) ) :
					new WildcardQuery( new Term( FIELDNAME_TERM, "*" + termFilter.toLowerCase() + "*" ) );
TopDocs topDocs = new IndexSearcher( getTermIndexReader() ).search( q, 100 );
for ( ScoreDoc hit : topDocs.scoreDocs )
{
	Document doc = getTermIndexReader().document( hit.doc );
	String indexTerm = doc.get( FIELDNAME_TERM );
	if ( !returnValue.contains( indexTerm  ) )
	{
		returnValue.add( indexTerm );
	}
}
----------
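The search side lowercases termFilter with the no-argument String.toLowerCase(), which uses the JVM default locale. As far as I can tell this should keep Umlaute intact, but to rule it out the lowercasing can be checked in isolation. The following is only a sketch: the class LowerCaseCheck and the method normalize are hypothetical names, not part of my app, and Locale.ROOT is an assumption about which locale is appropriate.

```java
import java.util.Locale;

// Hypothetical helper, not part of the application above.
final class LowerCaseCheck {

    // Lowercases the filter text with a fixed locale, so the result
    // does not depend on the JVM default locale.
    static String normalize(String termFilter) {
        return termFilter.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the check independent of the source file encoding.
        String upper = "\u00c4\u00d6\u00dc"; // upper-case A, O, U with diaeresis
        String lower = "\u00e4\u00f6\u00fc"; // lower-case a, o, u with diaeresis
        System.out.println(normalize(upper).equals(lower));
    }
}
```

If this check holds in the real app as well, the lowercasing step is probably not where the Umlaute get lost.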
The TermAnalyzer is the same analyzer as the main index analyzer, with the
exception that a LowerCaseFilter is applied.
I have unit tests for my Umlaute which work as expected.
Unfortunately this is not the case when I debug my real app...
What could possibly cause the "loss"?
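
To pin down what "lost" means exactly, it might help to dump the code points of termText before indexing and of indexTerm after retrieval, rather than trusting how the debugger or console renders them. Below is a minimal sketch, assuming Java 8 or later; TermDebug and codePoints are hypothetical names, not part of my app. The main method also simulates one possible failure (UTF-8 bytes decoded as ISO-8859-1) purely for illustration; I do not know whether that is what happens in my case.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical debugging helper, not part of the application above.
final class TermDebug {

    // Renders every code point of the text as U+XXXX, separated by spaces.
    static String codePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep the sample independent of the source file encoding.
        String original = "\u00e4\u00f6\u00fc";
        System.out.println(codePoints(original)); // U+00E4 U+00F6 U+00FC

        // Illustration only: a charset mismatch turns each Umlaut into two characters.
        String mangled = new String(
                original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(codePoints(mangled)); // U+00C3 U+00A4 U+00C3 U+00B6 U+00C3 U+00BC
    }
}
```

Comparing the two dumps would show whether the Umlaute arrive as plain a/o/u (which would point to some folding or stemming step), as two-character sequences (a charset mismatch), or as replacement characters.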
