lucene-java-user mailing list archives

From Clemens Wyss <clemens...@mysign.ch>
Subject Re: "Umlaute" getting lost
Date Tue, 26 Apr 2011 06:11:51 GMT
TermAnalyzer#tokenStream( final String fieldName, final Reader reader )
------------------------------------------------------------------------------------------
TokenStream t = new WhitespaceAnalyzer( Version.LUCENE_31 ).tokenStream( fieldName, cf);
t = new StopFilter( Version.LUCENE_31, t, stopWordSet, true );
t = new ShingleAnalyzerWrapper( t, 4 ).tokenStream( fieldName, reader );
t = new LowerCaseFilter( Version.LUCENE_31, t );
return t;
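
For comparison, here is a plain-Java sketch of what this chain is meant to produce: whitespace split, case-insensitive stop-word removal, shingles of up to 4 tokens, then lowercasing. It uses no Lucene classes; the class name ChainSketch, the method analyze and the sample stop words are made up for illustration, and the real filters may emit tokens in a different order. Nothing in these steps should drop an umlaut, so if the output of the real analyzer differs for the same input, the loss probably happens elsewhere.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ChainSketch {

    // Hypothetical stand-in for the analyzer chain above (not Lucene code):
    // whitespace split -> stop-word removal (ignoring case) -> shingles -> lowercase.
    static List<String> analyze(String text, Set<String> stopWords, int maxShingleSize) {
        // 1. Split on whitespace and drop stop words, comparing in lower case.
        List<String> tokens = new ArrayList<>();
        for (String tok : text.trim().split("\\s+")) {
            if (!tok.isEmpty() && !stopWords.contains(tok.toLowerCase(Locale.ROOT))) {
                tokens.add(tok);
            }
        }
        // 2. Emit every run of 1..maxShingleSize adjacent tokens, lowercased.
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder shingle = new StringBuilder();
            for (int n = 0; n < maxShingleSize && start + n < tokens.size(); n++) {
                if (n > 0) {
                    shingle.append(' ');
                }
                shingle.append(tokens.get(start + n));
                out.add(shingle.toString().toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source independent of the file encoding.
        Set<String> stop = new HashSet<>(Arrays.asList("der", "die", "das"));
        System.out.println(analyze("Die \u00dcber Gr\u00f6\u00dfe", stop, 4));
    }
}
```

For the input "Die Über Größe" the returned list holds "über", "über größe" and "größe", with the umlauts and the ß intact.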

Thx
Clemens

> -----Original Message-----
> From: Simon Willnauer [mailto:simon.willnauer@googlemail.com]
> Sent: Monday, 25 April 2011 12:13
> To: java-user@lucene.apache.org
> Subject: Re: "Umlaute" getting lost
> 
> On Sun, Apr 24, 2011 at 8:30 AM, Grant Ingersoll <gsingers@apache.org> wrote:
> >
> > On Apr 21, 2011, at 5:02 PM, Clemens Wyss wrote:
> >
> >> I keep my search terms in a dedicated RAMDirectory (the termIndex).
> >> In there I place all the terms of my real index. When putting the
> >> terms into the termIndex I can still see [using the debugger] the
> >> Umlaute (äöü). Unfortunately, when searching the termIndex, the
> >> documents no longer contain these Umlaute.
> >>
> >> Populating the termIndex:
> >> termIndex = new RAMDirectory();
> >> IndexWriterConfig config = new IndexWriterConfig( Version.LUCENE_31, new TermAnalyzer( locale ) );
> >> termIndexWriter = new IndexWriter( termIndex, config );
> >> TermEnum tEnum = realIndexReader.terms();
> >> while ( tEnum.next() ) {
> >>       Term t = tEnum.term();
> >>       String termText = t.text();
> >>       Document termDocument = new Document();
> >>       Field field = new Field( FIELDNAME_TERM, termText, Field.Store.YES, Field.Index.ANALYZED );
> >>       termDocument.add( field );
> >>       // and add term into the index
> >>       termIndexWriter.addDocument( termDocument );
> >> }
> >> termIndexWriter.commit();
> >> termIndexWriter.optimize();
> >> termIndexWriter.close();
> >>
> >> termIndexReader = IndexReader.open( termIndex, true );
> >> ---------- searching terms
> >> Query q = fuzzy
> >>       ? new FuzzyQuery( new Term( FIELDNAME_TERM, termFilter.toLowerCase() ) )
> >>       : new WildcardQuery( new Term( FIELDNAME_TERM, "*" + termFilter.toLowerCase() + "*" ) );
> >> TopDocs topDocs = new IndexSearcher( getTermIndexReader() ).search( q, 100 );
> >> for ( ScoreDoc hit : topDocs.scoreDocs ) {
> >>       Document doc = getTermIndexReader().document( hit.doc );
> >>       String indexTerm = doc.get( FIELDNAME_TERM );
> >>       if ( !returnValue.contains( indexTerm  ) )
> >>       {
> >>               returnValue.add( indexTerm );
> >>       }
> >> }
> >> ----------
> >> The TermAnalyzer is the same analyzer as the main index analyzer, with
> >> the exception that a LowerCaseFilter is applied.
> >
> > What is the Analyzer for the main index? Which tokenizer and token
> > filters are used?
> 
> In other words, can you show what TermAnalyzer is composed of?
> 
> 
> simon
> >
> > Out of curiosity, what is the problem you are trying to solve?
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> 
