lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: Token implementation
Date Fri, 11 Jul 2008 19:42:50 GMT
Michael McCandless wrote:
>
> DM Smith wrote:
>
>>  Shouldn't Term have constructors that take a Token?
>
> I think that makes sense, though normally Token appears during 
> analysis and Term during searching (I think?) -- how often would you 
> need to make a Term from a Token?
>
The problem I'm addressing is that tokens are used in contexts that need 
String and not char[].
The call to the deprecated
   String termText = token.termText();
needs to be replaced with:
   String termText = new String(token.termBuffer(), 0, token.termLength());

There are over 170 calls to token.termText(), and each of them has to 
be modified. In some, perhaps many, of these cases it may be possible 
to use char[] directly to get a performance gain.
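
As a rough illustration of the char[] route, here is a minimal sketch 
that compares a term buffer against a known word without building a 
String. TermBufferSketch and bufferEquals are hypothetical names, not 
part of Lucene; the buffer and length stand in for what 
token.termBuffer() and token.termLength() return.

```java
public class TermBufferSketch {
  // Hypothetical helper: true if the first `length` chars of `buffer`
  // spell `word`. No String is allocated for the comparison.
  static boolean bufferEquals(char[] buffer, int length, String word) {
    if (length != word.length()) {
      return false;
    }
    for (int i = 0; i < length; i++) {
      if (buffer[i] != word.charAt(i)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // The buffer may be longer than the term; only the first `length`
    // chars are valid, which is why the length travels with it.
    char[] buffer = {'t', 'h', 'i', 's', 'x', 'x', 'x', 'x'};
    int length = 4;
    System.out.println(bufferEquals(buffer, length, "this"));
    // The String route copies just the valid prefix.
    System.out.println(new String(buffer, 0, length));
  }
}
```

Whether this is actually faster at a given call site would need 
measuring; the sketch only shows that the comparison can be done 
without the intermediate String.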

Changing Term to work with (char[] buffer, int start, int length) does 
not seem quite right; I think the ripple would keep getting bigger. 
But logically, a Term's text is the text of a Token.

To me it makes sense to have a method that returns the token as a 
String, but that method is deprecated and the suggested replacement is 
to directly use the buffer. So this leads to the above construct. 
Perhaps it would be good to add a new method and document that as one of 
two replacements.
public String term() {
  // Inside Token, so the buffer accessors are called on this token.
  return termText != null ? termText
                          : new String(termBuffer(), 0, termLength());
}
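
To make the proposal concrete, here is a self-contained sketch of how 
such a method might sit inside a token class. SketchToken is a 
hypothetical stand-in, not Lucene's Token; it models only the three 
pieces of state the method touches and makes no attempt to mirror how 
the real class manages its buffer.

```java
public class SketchToken {
  private String termText;    // set only when built from a String
  private char[] termBuffer;  // may be longer than the term itself
  private int termLength;     // number of valid chars in termBuffer

  public SketchToken(String text) {
    this.termText = text;
  }

  public SketchToken(char[] buffer, int length) {
    this.termBuffer = buffer;
    this.termLength = length;
  }

  /** Proposed accessor: the term as a String, whichever form is held. */
  public String term() {
    return termText != null
        ? termText
        : new String(termBuffer, 0, termLength);
  }

  public static void main(String[] args) {
    char[] buffer = {'w', 'o', 'r', 'l', 'd', '?', '?'};
    System.out.println(new SketchToken("hello").term());
    System.out.println(new SketchToken(buffer, 5).term());
  }
}
```

Callers that need a String would then write token.term() and never 
have to know which representation the token holds.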

Here is an example from QueryParser that has 5 instances, each calling 
the deprecated t.termText() method. In this example, there is the 
construction of a query from a token stream.
Each of the problem lines follows the pattern:
   TermQuery currentQuery = new TermQuery(new Term(field, t.termText()));

To remove the deprecated call to t.termText(), the Token's buffer needs 
to be marshalled with something like:
   String termText = new String(t.termBuffer(), 0, t.termLength());
   TermQuery currentQuery = new TermQuery(new Term(field, termText));
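
One way to keep that marshalling in a single place, rather than 
repeating it at every call site, would be a small helper. The sketch 
below is only illustrative: SketchTerm is a hypothetical stand-in for 
Term (just a field and its text), and newTerm is a made-up name, not 
an existing Lucene method.

```java
public class TermMarshalSketch {
  /** Hypothetical stand-in for Term: a field name and its text. */
  static final class SketchTerm {
    final String field;
    final String text;

    SketchTerm(String field, String text) {
      this.field = field;
      this.text = text;
    }
  }

  /** Hypothetical helper: build a term from the valid part of a buffer. */
  static SketchTerm newTerm(String field, char[] buffer, int length) {
    return new SketchTerm(field, new String(buffer, 0, length));
  }

  public static void main(String[] args) {
    // Stands in for t.termBuffer() and t.termLength().
    char[] buffer = {'s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '_', '_'};
    SketchTerm term = newTerm("content", buffer, 8);
    System.out.println(term.field + ":" + term.text);
  }
}
```

With something like this, each of the affected lines would shrink back 
to a single call, and the String conversion could later be changed in 
one spot.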

  /**
   * @exception ParseException throw in overridden method to disallow
   */
  protected Query getFieldQuery(String field, String queryText)
      throws ParseException {
    // Use the analyzer to get all the tokens, and then build a TermQuery,
    // PhraseQuery, or nothing based on the term count

    TokenStream source =
        analyzer.tokenStream(field, new StringReader(queryText));
    Vector v = new Vector();
    org.apache.lucene.analysis.Token t;
    int positionCount = 0;
    boolean severalTokensAtSamePosition = false;

    while (true) {
      try {
        t = source.next();
      }
      catch (IOException e) {
        t = null;
      }
      if (t == null)
        break;
      v.addElement(t);
      if (t.getPositionIncrement() != 0)
        positionCount += t.getPositionIncrement();
      else
        severalTokensAtSamePosition = true;
    }
    try {
      source.close();
    }
    catch (IOException e) {
      // ignore
    }

    if (v.size() == 0)
      return null;
    else if (v.size() == 1) {
      t = (org.apache.lucene.analysis.Token) v.elementAt(0);
      return new TermQuery(new Term(field, t.termText()));
    } else {
      if (severalTokensAtSamePosition) {
        if (positionCount == 1) {
          // no phrase query:
          BooleanQuery q = new BooleanQuery(true);
          for (int i = 0; i < v.size(); i++) {
            t = (org.apache.lucene.analysis.Token) v.elementAt(i);
            TermQuery currentQuery = new TermQuery(
                new Term(field, t.termText()));
            q.add(currentQuery, BooleanClause.Occur.SHOULD);
          }
          return q;
        }
        else {
          // phrase query:
          MultiPhraseQuery mpq = new MultiPhraseQuery();
          mpq.setSlop(phraseSlop);
          List multiTerms = new ArrayList();
          int position = -1;
          for (int i = 0; i < v.size(); i++) {
            t = (org.apache.lucene.analysis.Token) v.elementAt(i);
            if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
              if (enablePositionIncrements) {
                mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
              } else {
                mpq.add((Term[])multiTerms.toArray(new Term[0]));
              }
              multiTerms.clear();
            }
            position += t.getPositionIncrement();
            multiTerms.add(new Term(field, t.termText()));
          }
          if (enablePositionIncrements) {
            mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
          } else {
            mpq.add((Term[])multiTerms.toArray(new Term[0]));
          }
          return mpq;
        }
      }
      else {
        PhraseQuery pq = new PhraseQuery();
        pq.setSlop(phraseSlop);
        int position = -1;
        for (int i = 0; i < v.size(); i++) {
          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
          if (enablePositionIncrements) {
            position += t.getPositionIncrement();
            pq.add(new Term(field, t.termText()),position);
          } else {
            pq.add(new Term(field, t.termText()));
          }
        }
        return pq;
      }
    }
  }


Here is an example that works around the deprecated code:
  public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {
    Analyzer analyzer =
        new ShingleAnalyzerWrapper(new WhitespaceAnalyzer(), 2);
    searcher = setUpSearcher(analyzer);

    PhraseQuery q = new PhraseQuery();

    TokenStream ts = analyzer.tokenStream("content",
                                          new StringReader("this sentence"));
    Token token;
    int j = -1;
    while ((token = ts.next()) != null) {
      j += token.getPositionIncrement();
      String termText =
          new String(token.termBuffer(), 0, token.termLength());
      q.add(new Term("content", termText), j);
    }

    Hits hits = searcher.search(q);
    int[] ranks = new int[] { 0 };
    compareRanks(hits, ranks);
  }

-- DM


