From Johannes Neubarth <>
Subject Aligning text analyses, with and without stopwords
Date Thu, 26 Jul 2012 16:16:54 GMT
I want to align the output of two different analysis pipelines, but I
don't know how.
We are using Lucene for text analysis. First, every input text is
normalized using StandardTokenizer, StandardFilter and LowerCaseFilter.
This yields a list of tokens (list1). Second, the same input text is
also stemmed and stopwords are removed, yielding list2:

list1: [this text contains stopwords i need to align them]
list2: [---- text contain  stopword -- need -- align ----]

If I want to align both lists, I need to know which tokens were removed
by the StopFilter. The following code works, but not for the last token:

while (tokenStream.incrementToken()) {
    // number of tokens the StopFilter removed directly before this one
    int skippedTokens =
        tokenStream.getAttribute(PositionIncrementAttribute.class)
          .getPositionIncrement() - 1;
    // process the current token, e.g. we know that "need" is the 6th
    // element in the list because the previous token was removed
}

For stopwords that are at the end of the tokenStream (e.g. "them"), the
positionIncrement is not updated: after leaving the while loop, the
computed skippedTokens is 0. My workaround is to append a unique number to every
input text, so that every text ends with a non-stopword. Can you think
of a more reasonable approach?
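
To make the problem concrete, here is a minimal sketch of the alignment
arithmetic without any Lucene classes. The class AlignSketch, its two
methods, and the hard-coded increments (2, 1, 1, 2, 2) are illustrative
assumptions derived from the example lists above, not actual Lucene
output. The sentinel "12345" stands in for the unique number of my
workaround.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AlignSketch {

    // Places every surviving token at its original position.
    // Positions of removed tokens are filled with null.
    static List<String> align(List<String> tokens, List<Integer> increments) {
        List<String> aligned = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            int skippedTokens = increments.get(i) - 1;
            for (int k = 0; k < skippedTokens; k++) {
                aligned.add(null);
            }
            aligned.add(tokens.get(i));
        }
        return aligned;
    }

    // Workaround: the last token is a sentinel (never a stopword), so its
    // increment reveals trailing removals. The sentinel itself is dropped.
    static List<String> alignDroppingSentinel(List<String> tokens,
                                              List<Integer> increments) {
        List<String> aligned = align(tokens, increments);
        return new ArrayList<>(aligned.subList(0, aligned.size() - 1));
    }

    public static void main(String[] args) {
        // list2 from the example, with assumed position increments
        List<String> tokens =
            Arrays.asList("text", "contain", "stopword", "need", "align");
        List<Integer> increments = Arrays.asList(2, 1, 1, 2, 2);

        // Only 8 slots: the trailing stopword "them" is lost.
        System.out.println(align(tokens, increments));

        // With a sentinel appended to the input text, all 9 slots appear.
        List<String> withSentinel = Arrays.asList(
            "text", "contain", "stopword", "need", "align", "12345");
        List<Integer> incrementsWithSentinel = Arrays.asList(2, 1, 1, 2, 2, 2);
        System.out.println(
            alignDroppingSentinel(withSentinel, incrementsWithSentinel));
    }
}
```

The first call yields 8 entries although list1 has 9, which is exactly
the missing trailing position. The second call yields 9 entries, but
only because the input text was modified beforehand.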

Thank you,

