lucene-java-user mailing list archives

From Ilya Zavorin <izavo...@caci.com>
Subject how to preserve whitespaces etc when tokenizing stream?
Date Fri, 13 Jan 2012 16:44:31 GMT
I am trying to perform a "translation" of sorts on a stream of text. More specifically, I need
to tokenize the input stream, look up every term in a specialized dictionary, and output the
corresponding "translation" of the token. However, I also want to preserve all the original
whitespaces, stopwords, etc. from the input, so that the output is formatted in the same way
as the input instead of ending up as a stream of translations. So if my input is

<term1>: <term2> <stopword>! <term3>

<term4>

then I want the output to look like

<term1'>: <term2'> <stopword>! <term3'>

<term4'>

(where <termi'> is the translation of <termi>) instead of

<term1'> <term2'> <term3'> <term4'>

Currently I am doing the following:

// whitespace tokenization, no lowercasing, with a stopword list
PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
                                         PatternAnalyzer.WHITESPACE_PATTERN,
                                         false,
                                         WordlistLoader.getWordSet(new File(stopWordFilePath)));
TokenStream ts = pa.tokenStream(null, in);
CharTermAttribute charTermAttribute = ts.getAttribute(CharTermAttribute.class);

while (ts.incrementToken()) { // loop over tokens
    String termIn = charTermAttribute.toString();
    ...
}

but this, of course, loses all the whitespaces etc. How can I modify it so that I can re-insert
them into the output?
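
To make the question more concrete, here is a rough, untested sketch of the kind of thing I am
imagining: use the OffsetAttribute to find where each token sits in the original text, and copy
the untouched gaps (whitespace, punctuation, stopwords) straight from the input. This assumes I
keep the whole input in a String rather than just a Reader, and translate() is only a placeholder
for my dictionary lookup. Is something along these lines the right way to go, or is there a
better mechanism for this?

String text = ...; // the full input as a String (assumption: I can hold it all in memory)

TokenStream ts = pa.tokenStream(null, new StringReader(text));
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);

StringBuilder out = new StringBuilder();
int lastEnd = 0; // end offset of the previous token in the original text
ts.reset();
while (ts.incrementToken()) {
    int start = offsetAtt.startOffset();
    out.append(text, lastEnd, start);            // copy whitespace, punctuation, stopwords verbatim
    out.append(translate(termAtt.toString()));   // translate() = my dictionary lookup (placeholder)
    lastEnd = offsetAtt.endOffset();
}
out.append(text.substring(lastEnd));             // whatever follows the last token
ts.end();
ts.close();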


Thanks,
Ilya
