lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: Token implementation
Date Sat, 12 Jul 2008 03:13:11 GMT

On Jul 11, 2008, at 9:42 PM, Hiroaki Kawai wrote:

> Another suggestion from me:
> How about making the token object a singleton?

Would that work for a multi-threaded application?
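
As a rough illustration of the concern, here is a minimal sketch of the usual alternative to a single shared instance: one reusable token per thread. ReusableToken and PerThreadTokenSketch are hypothetical stand-ins written for this example and are not Lucene classes. Each thread fills its own token, waits until the other thread has also written, and only then reads back. With one shared singleton, the second write could overwrite the first before it is read.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class PerThreadTokenSketch {

    /** Hypothetical stand-in for a reusable token; NOT Lucene's Token class. */
    static final class ReusableToken {
        private char[] buffer = new char[16];
        private int length;

        /** Copies the text into the reusable buffer, growing it if needed. */
        void setTerm(String text) {
            if (text.length() > buffer.length) {
                buffer = new char[text.length()];
            }
            text.getChars(0, text.length(), buffer, 0);
            length = text.length();
        }

        String term() {
            return new String(buffer, 0, length);
        }
    }

    // One reusable instance per thread instead of one instance per JVM.
    private static final ThreadLocal<ReusableToken> PER_THREAD =
        ThreadLocal.withInitial(ReusableToken::new);

    /** Two threads each write their own token, then read it back after both have written. */
    static String[] fillAndReadOnTwoThreads(String first, String second) {
        String[] inputs = { first, second };
        String[] results = new String[2];
        CyclicBarrier bothWritten = new CyclicBarrier(2);
        Thread[] workers = new Thread[2];
        for (int i = 0; i < 2; i++) {
            final int slot = i;
            workers[i] = new Thread(() -> {
                ReusableToken token = PER_THREAD.get();
                token.setTerm(inputs[slot]);
                try {
                    bothWritten.await(); // neither thread reads until both have written
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
                results[slot] = token.term();
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // join makes the workers' writes to results visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        String[] seen = fillAndReadOnTwoThreads("alpha", "beta");
        System.out.println(seen[0] + " " + seen[1]); // prints "alpha beta"
    }
}
```

Whether per-thread reuse would fit Lucene's analysis chain is a separate question; the sketch only shows why a single JVM-wide instance is risky when several threads analyze text at once.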

>
>
>
>> Maybe we should un-deprecate the termText() method but add javadocs
>> explaining that for better performance you should use the char[]
>> reuse methods instead?
>>
>> Mike
>>
>> DM Smith wrote:
>>
>>> Michael McCandless wrote:
>>>>
>>>> DM Smith wrote:
>>>>
>>>>> Shouldn't Term have constructors that take a Token?
>>>>
>>>> I think that makes sense, though normally Token appears during
>>>> analysis and Term during searching (I think?) -- how often would
>>>> you need to make a Term from a Token?
>>>>
>>> The problem I'm addressing is that tokens are used in contexts that
>>> need String and not char[].
>>> The call to the deprecated
>>> String termText = token.termText();
>>> needs to be replaced with:
>>> String termText = new String(token.termBuffer(), 0,
>>> token.termLength());
>>>
>>> There are over 170 calls to token.termText(), and each of these
>>> places has to be modified. In some, perhaps many, of these cases it
>>> may be possible to use char[] directly to get a performance gain.
>>>
>>> In the case of Term, changing it to work with char[] buffer, int
>>> start, int length does not seem quite right. I think the ripple
>>> would keep getting bigger. But logically, the Term's text is the
>>> text of a Token.
>>>
>>> To me it makes sense to have a method that returns the token as a
>>> String, but that method is deprecated and the suggested replacement
>>> is to directly use the buffer. So this leads to the above construct.
>>> Perhaps it would be good to add a new method and document that as
>>> one of two replacements.
>>> public String term() {
>>> return termText != null ? termText : new String(termBuffer(),
>>> 0, termLength());
>>> }
>>>
>>> Here is an example from QueryParser that has 5 instances, each
>>> calling the deprecated t.termText() method. In this example, there
>>> is the construction of a query from a token stream.
>>> Each of the problem lines follows the pattern:
>>> TermQuery currentQuery = new TermQuery(new Term(field, t.termText()));
>>>
>>> To remove the deprecated call to t.termText(), the Token's buffer
>>> needs to be marshalled with something like:
>>> String termText = new String(token.termBuffer(), 0,
>>> token.termLength());
>>> TermQuery currentQuery = new TermQuery(new Term(field, termText));
>>>
>>> /**
>>> * @exception ParseException throw in overridden method to disallow
>>> */
>>> protected Query getFieldQuery(String field, String queryText)
>>> throws ParseException {
>>>  // Use the analyzer to get all the tokens, and then build a TermQuery,
>>>  // PhraseQuery, or nothing based on the term count
>>>
>>>  TokenStream source = analyzer.tokenStream(field,
>>>      new StringReader(queryText));
>>>  Vector v = new Vector();
>>>  org.apache.lucene.analysis.Token t;
>>>  int positionCount = 0;
>>>  boolean severalTokensAtSamePosition = false;
>>>
>>>  while (true) {
>>>    try {
>>>      t = source.next();
>>>    }
>>>    catch (IOException e) {
>>>      t = null;
>>>    }
>>>    if (t == null)
>>>      break;
>>>    v.addElement(t);
>>>    if (t.getPositionIncrement() != 0)
>>>      positionCount += t.getPositionIncrement();
>>>    else
>>>      severalTokensAtSamePosition = true;
>>>  }
>>>  try {
>>>    source.close();
>>>  }
>>>  catch (IOException e) {
>>>    // ignore
>>>  }
>>>
>>>  if (v.size() == 0)
>>>    return null;
>>>  else if (v.size() == 1) {
>>>    t = (org.apache.lucene.analysis.Token) v.elementAt(0);
>>>    return new TermQuery(new Term(field, t.termText()));
>>>  } else {
>>>    if (severalTokensAtSamePosition) {
>>>      if (positionCount == 1) {
>>>        // no phrase query:
>>>        BooleanQuery q = new BooleanQuery(true);
>>>        for (int i = 0; i < v.size(); i++) {
>>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>          TermQuery currentQuery = new TermQuery(
>>>              new Term(field, t.termText()));
>>>          q.add(currentQuery, BooleanClause.Occur.SHOULD);
>>>        }
>>>        return q;
>>>      }
>>>      else {
>>>        // phrase query:
>>>        MultiPhraseQuery mpq = new MultiPhraseQuery();
>>>        mpq.setSlop(phraseSlop);
>>>        List multiTerms = new ArrayList();
>>>        int position = -1;
>>>        for (int i = 0; i < v.size(); i++) {
>>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>          if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
>>>            if (enablePositionIncrements) {
>>>              mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>>>            } else {
>>>              mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>>            }
>>>            multiTerms.clear();
>>>          }
>>>          position += t.getPositionIncrement();
>>>          multiTerms.add(new Term(field, t.termText()));
>>>        }
>>>        if (enablePositionIncrements) {
>>>          mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>>>        } else {
>>>          mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>>        }
>>>        return mpq;
>>>      }
>>>    }
>>>    else {
>>>      PhraseQuery pq = new PhraseQuery();
>>>      pq.setSlop(phraseSlop);
>>>      int position = -1;
>>>      for (int i = 0; i < v.size(); i++) {
>>>        t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>        if (enablePositionIncrements) {
>>>          position += t.getPositionIncrement();
>>>          pq.add(new Term(field, t.termText()),position);
>>>        } else {
>>>          pq.add(new Term(field, t.termText()));
>>>        }
>>>      }
>>>      return pq;
>>>    }
>>>  }
>>> }
>>>
>>>
>>> Here is an example that works around the deprecated code:
>>> public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {
>>>  Analyzer analyzer = new ShingleAnalyzerWrapper(
>>>      new WhitespaceAnalyzer(), 2);
>>>  searcher = setUpSearcher(analyzer);
>>>
>>>  PhraseQuery q = new PhraseQuery();
>>>
>>>  TokenStream ts = analyzer.tokenStream("content",
>>>                                        new StringReader("this sentence"));
>>>  Token token;
>>>  int j = -1;
>>>  while ((token = ts.next()) != null) {
>>>    j += token.getPositionIncrement();
>>>    String termText = new String(token.termBuffer(), 0,
>>> token.termLength());
>>>    q.add(new Term("content", termText), j);
>>>  }
>>>
>>>  Hits hits = searcher.search(q);
>>>  int[] ranks = new int[] { 0 };
>>>  compareRanks(hits, ranks);
>>> }
>>>
>>> -- DM
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>
>>
>
>
>
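
For reference, the term() accessor proposed in the quoted message can be sketched in isolation as below. SimpleToken is a hypothetical, heavily simplified class that only borrows the method names discussed above (termBuffer(), termLength(), term()). It is not Lucene's Token, and the constructors are assumptions made so the example is self-contained. The accessor returns the cached String when one exists and otherwise builds one from the used portion of the char[] buffer.

```java
public class TermAccessorSketch {

    /** Hypothetical, simplified token; it only mirrors the method names quoted above. */
    static final class SimpleToken {
        private String termText;                 // set only on the legacy String-based path
        private char[] termBuffer = new char[0]; // reusable character storage
        private int termLength;                  // number of valid chars in termBuffer

        /** Legacy-style construction from a String (assumed for illustration). */
        SimpleToken(String text) {
            this.termText = text;
        }

        /** Buffer-style construction: copies length chars starting at start. */
        SimpleToken(char[] buffer, int start, int length) {
            termBuffer = new char[length];
            System.arraycopy(buffer, start, termBuffer, 0, length);
            termLength = length;
        }

        char[] termBuffer() {
            return termBuffer;
        }

        int termLength() {
            return termLength;
        }

        /** Returns the term as a String, allocating only when no String is cached. */
        String term() {
            return termText != null
                ? termText
                : new String(termBuffer, 0, termLength);
        }
    }

    public static void main(String[] args) {
        SimpleToken fromString = new SimpleToken("this");
        SimpleToken fromBuffer = new SimpleToken("xxsentencexx".toCharArray(), 2, 8);
        System.out.println(fromString.term()); // prints "this"
        System.out.println(fromBuffer.term()); // prints "sentence"
    }
}
```

With such an accessor, a call site like new Term(field, t.termText()) would become new Term(field, t.term()), and the buffer-to-String conversion would live in one place instead of being repeated at every call site.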



