lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: Token implementation
Date Sat, 12 Jul 2008 12:19:58 GMT
This would be better, especially when we get to Java 5, with covariant 
returns. Then we could also rename termBuffer() to term() (or just leave 
it :).
DM
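
A minimal sketch of the term() idea discussed in this thread. It assumes a stand-in class rather than the real org.apache.lucene.analysis.Token: SketchToken, its 16-char starting buffer, and the setTermBuffer shown here are illustrative names only. The point is that term() allocates a new String on every call, which is the cost the javadocs would need to mention.

```java
public class TermAccessorSketch {

    /** Hypothetical stand-in for a token that keeps its text in a reusable char[]. */
    static final class SketchToken {
        private char[] termBuffer = new char[16]; // assumed starting capacity
        private int termLength;

        /** Copies the given characters into the internal buffer, growing it if needed. */
        void setTermBuffer(char[] source, int offset, int length) {
            if (length > termBuffer.length) {
                termBuffer = new char[length];
            }
            System.arraycopy(source, offset, termBuffer, 0, length);
            termLength = length;
        }

        /** Re-use style access: the raw buffer, valid only up to termLength(). */
        char[] termBuffer() {
            return termBuffer;
        }

        int termLength() {
            return termLength;
        }

        /** Convenience access: builds a new String from the buffer on every call. */
        String term() {
            return new String(termBuffer, 0, termLength);
        }
    }

    public static void main(String[] args) {
        SketchToken token = new SketchToken();
        char[] text = "sentence".toCharArray();
        token.setTermBuffer(text, 0, text.length);
        System.out.println(token.term());
    }
}
```

Callers that only need a String can use term(), while hot paths can keep reading termBuffer() and termLength() directly and skip the allocation.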

Michael McCandless wrote:
>
> Or we could leave termText() deprecated, add term() which does the 
> same thing sub-optimally (ie, always creates a new String from the 
> char[]), and in the javadocs for termText() state that you can migrate 
> to either term() (if you really want a String and you understand the 
> performance cost of doing so) or the re-use APIs?
>
> Mike
>
> DM Smith wrote:
>
>> Michael McCandless wrote:
>>>
>>> Maybe we should un-deprecate the termText() method but add javadocs 
>>> explaining that for better performance you should use the char[] 
>>> reuse methods instead?
>> I think so, too. Should we leave it as deprecated until 3.0? We could add 
>> the performance note and the encouragement to go for re-use, along with 
>> a note that it is the current implementation that is deprecated, not the 
>> interface.
>>
>> That's not quite what deprecated means. My thought on this is that it 
>> will give everyone a heads up that the current implementation is 
>> going away and that the replacement is sub-optimal.
>>
>> (I use Eclipse and have it set to flag all deprecated uses. This 
>> helps me look for places to change.)
>>
>> I think that this will make migration to 3.0 much easier.
>>
>> With this, changing Term to add Term(String, Token) won't be necessary.
>>
>> -- DM
>>>
>>> Mike
>>>
>>> DM Smith wrote:
>>>
>>>> Michael McCandless wrote:
>>>>>
>>>>> DM Smith wrote:
>>>>>
>>>>>> Shouldn't Term have constructors that take a Token?
>>>>>
>>>>> I think that makes sense, though normally Token appears during 
>>>>> analysis and Term during searching (I think?) -- how often would 
>>>>> you need to make a Term from a Token?
>>>>>
>>>> The problem I'm addressing is that tokens are used in contexts that 
>>>> need String and not char[].
>>>> The call to the deprecated
>>>> String termText = token.termText();
>>>> needs to be replaced with:
>>>> String termText = new String(token.termBuffer(), 0, 
>>>> token.termLength());
>>>>
>>>> There are over 170 calls to token.termText(), and each of these places 
>>>> has to be modified. In some, perhaps many, of these cases it may 
>>>> be possible to use char[] directly to get a performance gain.
>>>>
>>>> In the case of Term, changing it to work with char[] buffer, int 
>>>> start, int length does not seem quite right. I think the ripple 
>>>> would keep getting bigger. But logically, the Term's text is the 
>>>> text of a Token.
>>>>
>>>> To me it makes sense to have a method that returns the token as a 
>>>> String, but that method is deprecated and the suggested replacement 
>>>> is to directly use the buffer. So this leads to the above 
>>>> construct. Perhaps it would be good to add a new method and 
>>>> document that as one of two replacements.
>>>> public String term() {
>>>>   // Inside Token: fall back to the buffer when no String is cached
>>>>   return termText != null
>>>>       ? termText
>>>>       : new String(termBuffer(), 0, termLength());
>>>> }
>>>>
>>>> Here is an example from QueryParser that has 5 instances, each 
>>>> calling the deprecated t.termText() method. In this example, a 
>>>> query is constructed from a token stream.
>>>> Each of the problem lines is of the pattern:
>>>> TermQuery currentQuery = new TermQuery(new Term(field, t.termText()));
>>>>
>>>> To remove the deprecated call to t.termText(), the Token's buffer 
>>>> needs to be marshalled with something like:
>>>> String termText = new String(t.termBuffer(), 0, t.termLength());
>>>> TermQuery currentQuery = new TermQuery(new Term(field, termText));
>>>>
>>>> /**
>>>> * @exception ParseException throw in overridden method to disallow
>>>> */
>>>> protected Query getFieldQuery(String field, String queryText)  
>>>> throws ParseException {
>>>>  // Use the analyzer to get all the tokens, and then build a TermQuery,
>>>>  // PhraseQuery, or nothing based on the term count
>>>>
>>>>  TokenStream source = analyzer.tokenStream(field,
>>>>      new StringReader(queryText));
>>>>  Vector v = new Vector();
>>>>  org.apache.lucene.analysis.Token t;
>>>>  int positionCount = 0;
>>>>  boolean severalTokensAtSamePosition = false;
>>>>
>>>>  while (true) {
>>>>    try {
>>>>      t = source.next();
>>>>    }
>>>>    catch (IOException e) {
>>>>      t = null;
>>>>    }
>>>>    if (t == null)
>>>>      break;
>>>>    v.addElement(t);
>>>>    if (t.getPositionIncrement() != 0)
>>>>      positionCount += t.getPositionIncrement();
>>>>    else
>>>>      severalTokensAtSamePosition = true;
>>>>  }
>>>>  try {
>>>>    source.close();
>>>>  }
>>>>  catch (IOException e) {
>>>>    // ignore
>>>>  }
>>>>
>>>>  if (v.size() == 0)
>>>>    return null;
>>>>  else if (v.size() == 1) {
>>>>    t = (org.apache.lucene.analysis.Token) v.elementAt(0);
>>>>    return new TermQuery(new Term(field, t.termText()));
>>>>  } else {
>>>>    if (severalTokensAtSamePosition) {
>>>>      if (positionCount == 1) {
>>>>        // no phrase query:
>>>>        BooleanQuery q = new BooleanQuery(true);
>>>>        for (int i = 0; i < v.size(); i++) {
>>>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>>          TermQuery currentQuery = new TermQuery(
>>>>              new Term(field, t.termText()));
>>>>          q.add(currentQuery, BooleanClause.Occur.SHOULD);
>>>>        }
>>>>        return q;
>>>>      }
>>>>      else {
>>>>        // phrase query:
>>>>        MultiPhraseQuery mpq = new MultiPhraseQuery();
>>>>        mpq.setSlop(phraseSlop);
>>>>        List multiTerms = new ArrayList();
>>>>        int position = -1;
>>>>        for (int i = 0; i < v.size(); i++) {
>>>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>>          if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
>>>>            if (enablePositionIncrements) {
>>>>              mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>>>>            } else {
>>>>              mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>>>            }
>>>>            multiTerms.clear();
>>>>          }
>>>>          position += t.getPositionIncrement();
>>>>          multiTerms.add(new Term(field, t.termText()));
>>>>        }
>>>>        if (enablePositionIncrements) {
>>>>          mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>>>>        } else {
>>>>          mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>>>        }
>>>>        return mpq;
>>>>      }
>>>>    }
>>>>    else {
>>>>      PhraseQuery pq = new PhraseQuery();
>>>>      pq.setSlop(phraseSlop);
>>>>      int position = -1;
>>>>      for (int i = 0; i < v.size(); i++) {
>>>>        t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>>        if (enablePositionIncrements) {
>>>>          position += t.getPositionIncrement();
>>>>          pq.add(new Term(field, t.termText()),position);
>>>>        } else {
>>>>          pq.add(new Term(field, t.termText()));
>>>>        }
>>>>      }
>>>>      return pq;
>>>>    }
>>>>  }
>>>> }
>>>>
>>>>
>>>> Here is an example that works around the deprecated code:
>>>> public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {
>>>>  Analyzer analyzer = new ShingleAnalyzerWrapper(
>>>>      new WhitespaceAnalyzer(), 2);
>>>>  searcher = setUpSearcher(analyzer);
>>>>
>>>>  PhraseQuery q = new PhraseQuery();
>>>>
>>>>  TokenStream ts = analyzer.tokenStream("content",
>>>>                                        new StringReader("this sentence"));
>>>>  Token token;
>>>>  int j = -1;
>>>>  while ((token = ts.next()) != null) {
>>>>    j += token.getPositionIncrement();
>>>>    String termText = new String(token.termBuffer(), 0, token.termLength());
>>>>    q.add(new Term("content", termText), j);
>>>>  }
>>>>
>>>>  Hits hits = searcher.search(q);
>>>>  int[] ranks = new int[] { 0 };
>>>>  compareRanks(hits, ranks);
>>>> }
>>>>
>>>> -- DM
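
The replacement construct quoted above always passes a length to the String constructor, and the reason is worth spelling out: a reused buffer is usually longer than the current term and may still hold characters from an earlier one. The sketch below illustrates this without Lucene on the classpath. ReusedBufferSketch, termString, and collectTerms are hypothetical names, and the fixed 16-char buffer is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

public class ReusedBufferSketch {

    /** Hypothetical helper: copies exactly `length` chars of a reused buffer into a String. */
    static String termString(char[] buffer, int length) {
        return new String(buffer, 0, length);
    }

    /**
     * Simulates a token stream that refills one shared buffer for each term.
     * Assumes every word fits in the fixed 16-char buffer.
     */
    static List<String> collectTerms(String[] words) {
        char[] buffer = new char[16];
        List<String> terms = new ArrayList<String>();
        for (String word : words) {
            // Overwrite the start of the buffer; chars left over from a
            // longer earlier term stay in place after the new term.
            word.getChars(0, word.length(), buffer, 0);
            // Copy only the valid prefix, as termBuffer()/termLength() require.
            terms.add(termString(buffer, word.length()));
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> terms = collectTerms(new String[] {"sentence", "this"});
        System.out.println(terms);
    }
}
```

Calling new String(buffer) without the offset and length would pick up the stale tail of the buffer, which is why each of the call sites needs the three-argument form or a helper like the one sketched here.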

