lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: Token implementation
Date Sat, 12 Jul 2008 00:42:03 GMT
Michael McCandless wrote:
>
> Maybe we should un-deprecate the termText() method but add javadocs 
> explaining that for better performance you should use the char[] reuse 
> methods instead?
I think so, too. Should we leave it deprecated until 3.0, though? We 
could add the performance note and the encouragement to go for re-use, 
but also a note that it is the current implementation that is 
deprecated, not the interface.

I know that's not quite what deprecated means. My thought is that it 
will give everyone a heads-up that the current implementation is going 
away and that the replacement is sub-optimal compared with reusing the 
char[] buffer.

(I use Eclipse and have it set to flag all deprecated uses. This helps 
me look for places to change.)

I think that this will make migration to 3.0 much easier.

With this, changing Term to add Term(String, Token) won't be necessary.
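
To make that concrete, here is a rough sketch of the idea. It does not use the real Lucene classes: SketchToken, SketchTerm and TermTextSketch are hypothetical stand-ins I made up for illustration, and the javadoc wording is only a suggestion. It assumes the String accessor simply copies out of the reusable char[] buffer.

```java
/** Hypothetical stand-in for a token with a reusable char[] buffer (not the real Token). */
class SketchToken {
    private char[] termBuffer = new char[16];
    private int termLength;

    /** Copies the given characters into the reusable buffer, growing it if needed. */
    void setTermBuffer(char[] source, int offset, int length) {
        if (termBuffer.length < length) {
            termBuffer = new char[length];
        }
        System.arraycopy(source, offset, termBuffer, 0, length);
        termLength = length;
    }

    char[] termBuffer() { return termBuffer; }

    int termLength() { return termLength; }

    /**
     * Returns the term text as a String.
     * <p>
     * Performance note (suggested wording): this allocates a new String on
     * every call. For better performance, work with termBuffer() and
     * termLength() directly and reuse the token.
     */
    String termText() {
        return new String(termBuffer, 0, termLength);
    }
}

/** Hypothetical stand-in for a term: just a field name plus text. */
class SketchTerm {
    private final String field;
    private final String text;

    SketchTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    String field() { return field; }

    String text() { return text; }
}

class TermTextSketch {
    public static void main(String[] args) {
        char[] chars = "sentence".toCharArray();
        SketchToken token = new SketchToken();
        token.setTermBuffer(chars, 0, chars.length);

        // The existing (String, String) constructor is enough; no
        // (String, Token) constructor is needed on the term class.
        SketchTerm term = new SketchTerm("content", token.termText());
        System.out.println(term.field() + ":" + term.text());
    }
}
```

Callers that only need a String keep a one-line call, while hot paths can still read the buffer directly.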

-- DM
>
> Mike
>
> DM Smith wrote:
>
>> Michael McCandless wrote:
>>>
>>> DM Smith wrote:
>>>
>>>> Shouldn't Term have constructors that take a Token?
>>>
>>> I think that makes sense, though normally Token appears during 
>>> analysis and Term during searching (I think?) -- how often would you 
>>> need to make a Term from a Token?
>>>
>> The problem I'm addressing is that tokens are used in contexts that 
>> need String and not char[].
>> The call to the deprecated
>>  String termText = token.termText();
>> needs to be replaced with:
>>  String termText = new String(token.termBuffer(), 0, token.termLength());
>>
>> There are over 170 calls to token.termText(), and each of these 
>> places has to be modified. In some, perhaps many, of these cases it 
>> may be possible to use char[] directly to get a performance gain.
>>
>> In the case of Term, changing it to work with (char[] buffer, int 
>> start, int length) does not seem quite right. I think the ripple 
>> would keep getting bigger. But logically, the Term's text is the text 
>> of a Token.
>>
>> To me it makes sense to have a method that returns the token as a 
>> String, but that method is deprecated and the suggested replacement 
>> is to directly use the buffer. So this leads to the above construct. 
>> Perhaps it would be good to add a new method and document that as one 
>> of two replacements.
>> public String term() {
>>   // Inside Token, so call the accessors directly rather than via "token."
>>   return termText != null ? termText
>>                           : new String(termBuffer(), 0, termLength());
>> }
>>
>> Here is an example from QueryParser that has 5 instances, each 
>> calling the deprecated t.termText() method. In this example, there is 
>> the construction of a query from a token stream.
>> Each of the problem lines follows the pattern:
>>  TermQuery currentQuery = new TermQuery(new Term(field, t.termText()));
>>
>> To remove the deprecated call to t.termText(), the Token's buffer 
>> needs to be marshalled with something like:
>>  String termText = new String(token.termBuffer(), 0, token.termLength());
>>  TermQuery currentQuery = new TermQuery(new Term(field, termText));
>>
>> /**
>>  * @exception ParseException throw in overridden method to disallow
>>  */
>> protected Query getFieldQuery(String field, String queryText)
>>     throws ParseException {
>>   // Use the analyzer to get all the tokens, and then build a TermQuery,
>>   // PhraseQuery, or nothing based on the term count
>>
>>   TokenStream source = analyzer.tokenStream(field,
>>                                             new StringReader(queryText));
>>   Vector v = new Vector();
>>   org.apache.lucene.analysis.Token t;
>>   int positionCount = 0;
>>   boolean severalTokensAtSamePosition = false;
>>
>>   while (true) {
>>     try {
>>       t = source.next();
>>     }
>>     catch (IOException e) {
>>       t = null;
>>     }
>>     if (t == null)
>>       break;
>>     v.addElement(t);
>>     if (t.getPositionIncrement() != 0)
>>       positionCount += t.getPositionIncrement();
>>     else
>>       severalTokensAtSamePosition = true;
>>   }
>>   try {
>>     source.close();
>>   }
>>   catch (IOException e) {
>>     // ignore
>>   }
>>
>>   if (v.size() == 0)
>>     return null;
>>   else if (v.size() == 1) {
>>     t = (org.apache.lucene.analysis.Token) v.elementAt(0);
>>     return new TermQuery(new Term(field, t.termText()));
>>   } else {
>>     if (severalTokensAtSamePosition) {
>>       if (positionCount == 1) {
>>         // no phrase query:
>>         BooleanQuery q = new BooleanQuery(true);
>>         for (int i = 0; i < v.size(); i++) {
>>           t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>           TermQuery currentQuery = new TermQuery(
>>               new Term(field, t.termText()));
>>           q.add(currentQuery, BooleanClause.Occur.SHOULD);
>>         }
>>         return q;
>>       }
>>       else {
>>         // phrase query:
>>         MultiPhraseQuery mpq = new MultiPhraseQuery();
>>         mpq.setSlop(phraseSlop);
>>         List multiTerms = new ArrayList();
>>         int position = -1;
>>         for (int i = 0; i < v.size(); i++) {
>>           t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>           if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
>>             if (enablePositionIncrements) {
>>               mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>>             } else {
>>               mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>             }
>>             multiTerms.clear();
>>           }
>>           position += t.getPositionIncrement();
>>           multiTerms.add(new Term(field, t.termText()));
>>         }
>>         if (enablePositionIncrements) {
>>           mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>>         } else {
>>           mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>         }
>>         return mpq;
>>       }
>>     }
>>     else {
>>       PhraseQuery pq = new PhraseQuery();
>>       pq.setSlop(phraseSlop);
>>       int position = -1;
>>       for (int i = 0; i < v.size(); i++) {
>>         t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>         if (enablePositionIncrements) {
>>           position += t.getPositionIncrement();
>>           pq.add(new Term(field, t.termText()),position);
>>         } else {
>>           pq.add(new Term(field, t.termText()));
>>         }
>>       }
>>       return pq;
>>     }
>>   }
>> }
>>
>>
>> Here is an example that works around the deprecated code:
>> public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {
>>   Analyzer analyzer = new ShingleAnalyzerWrapper(new WhitespaceAnalyzer(), 2);
>>   searcher = setUpSearcher(analyzer);
>>
>>   PhraseQuery q = new PhraseQuery();
>>
>>   TokenStream ts = analyzer.tokenStream("content",
>>                                         new StringReader("this sentence"));
>>   Token token;
>>   int j = -1;
>>   while ((token = ts.next()) != null) {
>>     j += token.getPositionIncrement();
>>     String termText = new String(token.termBuffer(), 0, token.termLength());
>>     q.add(new Term("content", termText), j);
>>   }
>>
>>   Hits hits = searcher.search(q);
>>   int[] ranks = new int[] { 0 };
>>   compareRanks(hits, ranks);
>> }
>>
>> -- DM
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>