lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Token implementation
Date Sat, 12 Jul 2008 11:55:11 GMT

Or we could leave termText() deprecated, add term(), which does the
same thing sub-optimally (i.e., always creates a new String from the
char[]), and in the javadocs for termText() state that you can migrate
either to term() (if you really want a String and you understand the
performance cost of doing so) or to the re-use APIs?
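
As a rough sketch of this proposal, assuming a simplified stand-in class
(SketchToken is hypothetical and is not the real Token; only the method
names termText(), term(), termBuffer() and termLength() come from this
thread, and the buffer handling is an assumption made for illustration):

```java
/** Hypothetical stand-in for Token, for illustration only. */
final class SketchToken {
    // Internal buffer that is re-used across tokens (assumed layout).
    private char[] buffer = new char[16];
    private int length;

    /** Re-use API: copy the term into the internal buffer, growing it if needed. */
    void setTermBuffer(char[] src, int offset, int len) {
        if (len > buffer.length) {
            buffer = new char[len];
        }
        System.arraycopy(src, offset, buffer, 0, len);
        length = len;
    }

    /** Re-use API: the raw buffer; only the first termLength() chars are valid. */
    char[] termBuffer() {
        return buffer;
    }

    int termLength() {
        return length;
    }

    /** Convenience accessor: allocates a new String on every call. */
    String term() {
        return new String(buffer, 0, length);
    }

    /**
     * @deprecated Use term() if a String is really needed and the allocation
     *             cost is acceptable, otherwise termBuffer()/termLength().
     */
    @Deprecated
    String termText() {
        return term();
    }
}
```

Callers that need a String would switch from termText() to term(), while
hot paths would read termBuffer() and termLength() directly and skip the
allocation.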

Mike

DM Smith wrote:

> Michael McCandless wrote:
>>
>> Maybe we should un-deprecate the termText() method but add javadocs  
>> explaining that for better performance you should use the char[]  
>> reuse methods instead?
> I think so, too. Should we leave it deprecated until 3.0, with
> the performance note and the encouragement to go for re-use, but
> also with a note that it is the current implementation that is
> deprecated, not the interface?
>
> That's not quite what deprecated means. My thought on this is that  
> it will give everyone a heads up that the current implementation is  
> going away and that the replacement is sub-optimal.
>
> (I use Eclipse and have it set to flag all deprecated uses. This  
> helps me look for places to change.)
>
> I think that this will make migration to 3.0 much easier.
>
> With this, changing Term to add Term(String, Token) won't be necessary.
>
> -- DM
>>
>> Mike
>>
>> DM Smith wrote:
>>
>>> Michael McCandless wrote:
>>>>
>>>> DM Smith wrote:
>>>>
>>>>> Shouldn't Term have constructors that take a Token?
>>>>
>>>> I think that makes sense, though normally Token appears during  
>>>> analysis and Term during searching (I think?) -- how often would  
>>>> you need to make a Term from a Token?
>>>>
>>> The problem I'm addressing is that tokens are used in contexts  
>>> that need String and not char[].
>>> The call to the deprecated
>>> String termText = token.termText();
>>> needs to be replaced with:
>>> String termText = new String(token.termBuffer(), 0, token.termLength());
>>>
>>> There are over 170 calls to token.termText(), and each of these
>>> places has to be modified. In some, perhaps many, of these cases it
>>> may be possible to use char[] directly to get a performance gain.
>>>
>>> In the case of Term, changing it to work with char[] buffer, int
>>> start, int length does not seem quite right. I think the ripple
>>> would keep getting bigger. But logically, the Term's text is the
>>> text of a Token.
>>>
>>> To me it makes sense to have a method that returns the token as a  
>>> String, but that method is deprecated and the suggested  
>>> replacement is to directly use the buffer. So this leads to the  
>>> above construct. Perhaps it would be good to add a new method and  
>>> document that as one of two replacements.
>>> public String term() {
>>>   // Inside Token itself, so call the accessors directly.
>>>   return termText != null ? termText
>>>       : new String(termBuffer(), 0, termLength());
>>> }
>>>
>>> Here is an example from QueryParser that has 5 instances, each  
>>> calling the deprecated t.termText() method. In this example, there  
>>> is the construction of a query from a token stream.
>>> Each of the problem lines is of the pattern:
>>> TermQuery currentQuery = new TermQuery(new Term(field, t.termText()));
>>>
>>> To remove the deprecated call to t.termText(), the Token's buffer  
>>> needs to be marshalled with something like:
>>> String termText = new String(token.termBuffer(), 0, token.termLength());
>>> TermQuery currentQuery = new TermQuery(new Term(field, termText));
>>>
>>> /**
>>> * @exception ParseException throw in overridden method to disallow
>>> */
>>> protected Query getFieldQuery(String field, String queryText)
>>>     throws ParseException {
>>>  // Use the analyzer to get all the tokens, and then build a TermQuery,
>>>  // PhraseQuery, or nothing based on the term count
>>>
>>>  TokenStream source = analyzer.tokenStream(field,
>>>      new StringReader(queryText));
>>>  Vector v = new Vector();
>>>  org.apache.lucene.analysis.Token t;
>>>  int positionCount = 0;
>>>  boolean severalTokensAtSamePosition = false;
>>>
>>>  while (true) {
>>>    try {
>>>      t = source.next();
>>>    }
>>>    catch (IOException e) {
>>>      t = null;
>>>    }
>>>    if (t == null)
>>>      break;
>>>    v.addElement(t);
>>>    if (t.getPositionIncrement() != 0)
>>>      positionCount += t.getPositionIncrement();
>>>    else
>>>      severalTokensAtSamePosition = true;
>>>  }
>>>  try {
>>>    source.close();
>>>  }
>>>  catch (IOException e) {
>>>    // ignore
>>>  }
>>>
>>>  if (v.size() == 0)
>>>    return null;
>>>  else if (v.size() == 1) {
>>>    t = (org.apache.lucene.analysis.Token) v.elementAt(0);
>>>    return new TermQuery(new Term(field, t.termText()));
>>>  } else {
>>>    if (severalTokensAtSamePosition) {
>>>      if (positionCount == 1) {
>>>        // no phrase query:
>>>        BooleanQuery q = new BooleanQuery(true);
>>>        for (int i = 0; i < v.size(); i++) {
>>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>          TermQuery currentQuery = new TermQuery(
>>>              new Term(field, t.termText()));
>>>          q.add(currentQuery, BooleanClause.Occur.SHOULD);
>>>        }
>>>        return q;
>>>      }
>>>      else {
>>>        // phrase query:
>>>        MultiPhraseQuery mpq = new MultiPhraseQuery();
>>>        mpq.setSlop(phraseSlop);
>>>        List multiTerms = new ArrayList();
>>>        int position = -1;
>>>        for (int i = 0; i < v.size(); i++) {
>>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>          if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
>>>            if (enablePositionIncrements) {
>>>              mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>>>            } else {
>>>              mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>>            }
>>>            multiTerms.clear();
>>>          }
>>>          position += t.getPositionIncrement();
>>>          multiTerms.add(new Term(field, t.termText()));
>>>        }
>>>        if (enablePositionIncrements) {
>>>          mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>>>        } else {
>>>          mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>>        }
>>>        return mpq;
>>>      }
>>>    }
>>>    else {
>>>      PhraseQuery pq = new PhraseQuery();
>>>      pq.setSlop(phraseSlop);
>>>      int position = -1;
>>>      for (int i = 0; i < v.size(); i++) {
>>>        t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>        if (enablePositionIncrements) {
>>>          position += t.getPositionIncrement();
>>>          pq.add(new Term(field, t.termText()),position);
>>>        } else {
>>>          pq.add(new Term(field, t.termText()));
>>>        }
>>>      }
>>>      return pq;
>>>    }
>>>  }
>>> }
>>>
>>>
>>> Here is an example that works around the deprecated code:
>>> public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {
>>>  Analyzer analyzer = new ShingleAnalyzerWrapper(
>>>      new WhitespaceAnalyzer(), 2);
>>>  searcher = setUpSearcher(analyzer);
>>>
>>>  PhraseQuery q = new PhraseQuery();
>>>
>>>  TokenStream ts = analyzer.tokenStream("content",
>>>                                        new StringReader("this sentence"));
>>>  Token token;
>>>  int j = -1;
>>>  while ((token = ts.next()) != null) {
>>>    j += token.getPositionIncrement();
>>>    String termText = new String(token.termBuffer(), 0,
>>>        token.termLength());
>>>    q.add(new Term("content", termText), j);
>>>  }
>>>
>>>  Hits hits = searcher.search(q);
>>>  int[] ranks = new int[] { 0 };
>>>  compareRanks(hits, ranks);
>>> }
>>>
>>> -- DM
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>