lucene-java-user mailing list archives

From Ilya Zavorin <izavo...@caci.com>
Subject RE: how to preserve whitespaces etc when tokenizing stream?
Date Mon, 16 Jan 2012 22:08:34 GMT
Yes, I ended up doing essentially that. There was no need to tokenize: I split the input string
into a sequence of alternating "word" and "nonword" tokens based on Character.isLetter() and
then looked up the words.
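
A minimal sketch of that splitting approach, not the exact code used here. The class name PreservingTranslator, the translate() helper, and the Map-based dictionary are hypothetical stand-ins for the real lookup. It assumes lookups are case-sensitive and that words missing from the dictionary (stopwords, for instance) are copied through unchanged. Note that Character.isLetter(char) works on single UTF-16 units, so text containing supplementary characters would need the code point variant.

```java
import java.util.HashMap;
import java.util.Map;

class PreservingTranslator {

    /**
     * Hypothetical helper: walks the input once, cutting it into alternating
     * runs of letters ("words") and non-letters ("nonwords"). Words are
     * looked up in the dictionary; everything else is copied verbatim, so
     * whitespace and punctuation survive in their original positions.
     */
    static String translate(String input, Map<String, String> dictionary) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            boolean isWord = Character.isLetter(input.charAt(i));
            int start = i;
            // Extend the run while the letter/non-letter class stays the same.
            while (i < input.length()
                    && Character.isLetter(input.charAt(i)) == isWord) {
                i++;
            }
            String chunk = input.substring(start, i);
            if (isWord) {
                // Assumption: unknown words pass through untranslated.
                String translated = dictionary.get(chunk);
                out.append(translated != null ? translated : chunk);
            } else {
                out.append(chunk);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> dict = new HashMap<>();
        dict.put("hello", "bonjour");
        dict.put("world", "monde");
        System.out.println(translate("hello:  world the!\n\nhello", dict));
    }
}
```

Because every character of the input lands in exactly one run, the output keeps the input's layout without any offset bookkeeping.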


Ilya



-----Original Message-----
From: Danil ŢORIN [mailto:torindan@gmail.com]
Sent: Monday, January 16, 2012 5:50 AM
To: java-user@lucene.apache.org
Subject: Re: how to preserve whitespaces etc when tokenizing stream?

Maybe you could simply use String.replace()?
Or the text actually needs to be tokenized?

On Fri, Jan 13, 2012 at 18:44, Ilya Zavorin <izavorin@caci.com> wrote:

> I am trying to perform a "translation" of sorts of a stream of text. More
> specifically, I need to tokenize the input stream, look up every term in a
> specialized dictionary and output the corresponding "translation" of the
> token. However, I also want to preserve all the original whitespace,
> stopwords etc. from the input so that the output is formatted in the same
> way as the input instead of ending up as a bare stream of translations. So if
> my input is
>
>
>
> <term1>: <term2> <stopword>! <term3>
>
> <term4>
>
>
>
> then I want the output to look like
>
>
>
> <term1'>: <term2'> <stopword>! <term3'>
>
> <term4'>
>
>
>
> (where <termi'> is the translation of <termi>) instead of
>
>
>
> <term1'> <term2'> <term3'> <term4'>
>
>
>
> Currently I am doing the following:
>
>
>
> PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
>         PatternAnalyzer.WHITESPACE_PATTERN,
>         false,
>         WordlistLoader.getWordSet(new File(stopWordFilePath)));
>
> TokenStream ts = pa.tokenStream(null, in);
>
> CharTermAttribute charTermAttribute =
>         ts.getAttribute(CharTermAttribute.class);
>
> while (ts.incrementToken()) { // loop over tokens
>       String termIn = charTermAttribute.toString();
>       ...
> }
>
>
>
> but this, of course, loses all the whitespace etc. How can I modify this
> to be able to re-insert it into the output? Thanks much!
>
>
> Thanks,
>
> Ilya
>