lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Token implementation
Date Fri, 11 Jul 2008 23:59:18 GMT

Maybe we should un-deprecate the termText() method but add javadocs  
explaining that for better performance you should use the char[] reuse  
methods instead?
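To make that trade-off concrete, here is a minimal sketch of what such javadocs might look like. SketchToken and TokenAccessDemo are hypothetical stand-ins written for this illustration, not the real Lucene Token; only the method names termText(), termBuffer() and termLength() are taken from this thread.

```java
// Hypothetical stand-in for a Lucene-style Token (assumption, not the real class).
final class SketchToken {
    private char[] termBuffer = new char[16];
    private int termLength = 0;

    /** Copies the characters into the reusable buffer, growing it only when needed. */
    void setTermBuffer(char[] src, int offset, int length) {
        if (termBuffer.length < length) {
            termBuffer = new char[length];
        }
        System.arraycopy(src, offset, termBuffer, 0, length);
        termLength = length;
    }

    /** Returns the shared buffer; only the first termLength() chars are valid. */
    char[] termBuffer() {
        return termBuffer;
    }

    /** Returns the number of valid chars in termBuffer(). */
    int termLength() {
        return termLength;
    }

    /**
     * Convenience accessor for callers that need a String.
     * It allocates a new String on every call, so for better performance
     * in hot loops prefer termBuffer() and termLength().
     */
    String termText() {
        return new String(termBuffer, 0, termLength);
    }
}

// Hypothetical consumer that reads the buffer directly.
final class TokenAccessDemo {
    /** Counts upper-case chars straight from the buffer, without building a String. */
    static int countUpperCase(SketchToken token) {
        char[] buf = token.termBuffer();
        int count = 0;
        for (int i = 0; i < token.termLength(); i++) {
            if (Character.isUpperCase(buf[i])) {
                count++;
            }
        }
        return count;
    }
}
```

Callers that need a String keep a one-line call, while code on a hot path can read the buffer and skip the allocation.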

Mike

DM Smith wrote:

> Michael McCandless wrote:
>>
>> DM Smith wrote:
>>
>>> Shouldn't Term have constructors that take a Token?
>>
>> I think that makes sense, though normally Token appears during  
>> analysis and Term during searching (I think?) -- how often would  
>> you need to make a Term from a Token?
>>
> The problem I'm addressing is that tokens are used in contexts that  
> need String and not char[].
> The call to the deprecated
>  String termText = token.termText();
> needs to be replaced with:
>  String termText = new String(token.termBuffer(), 0, token.termLength());
>
> There are over 170 calls to token.termText(), and each of these call
> sites has to be modified. In some, perhaps many, of these cases it may
> be possible to use char[] directly to get a performance gain.
>
> Changing Term to work with char[] buffer, int start, int length does
> not seem quite right; I think the ripple would keep getting bigger.
> But logically, the Term's text is the text of a Token.
>
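As a rough sketch of the constructor idea, a Term could copy the valid part of the Token's buffer once, so callers never touch char[] themselves. TokenStub and TermStub are hypothetical stand-ins for illustration only; the Token-taking constructor is an assumption and not an existing Lucene API.

```java
// Hypothetical stand-in for Token: the buffer is deliberately larger than the
// term so that ignoring termLength() would produce a wrong String.
final class TokenStub {
    private final char[] buffer;
    private final int length;

    TokenStub(String text) {
        length = text.length();
        buffer = new char[length + 8];
        text.getChars(0, length, buffer, 0);
    }

    char[] termBuffer() {
        return buffer;
    }

    int termLength() {
        return length;
    }
}

// Hypothetical stand-in for Term with the proposed convenience constructor.
final class TermStub {
    private final String field;
    private final String text;

    TermStub(String field, String text) {
        this.field = field;
        this.text = text;
    }

    /** Assumed constructor: copies only the first termLength() chars of the buffer. */
    TermStub(String field, TokenStub token) {
        this(field, new String(token.termBuffer(), 0, token.termLength()));
    }

    String field() {
        return field;
    }

    String text() {
        return text;
    }
}
```

With a constructor like this, a call site would shrink to something like new TermStub(field, t), and the String conversion would live in one place.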
> To me it makes sense to have a method that returns the token as a  
> String, but that method is deprecated and the suggested replacement  
> is to directly use the buffer. So this leads to the above construct.  
> Perhaps it would be good to add a new method and document that as  
> one of two replacements.
> public String term() {
>   // Inside Token, so the buffer accessors are called on this instance.
>   return termText != null
>       ? termText
>       : new String(termBuffer(), 0, termLength());
> }
>
> Here is an example from QueryParser that has 5 instances, each
> calling the deprecated t.termText() method. The example builds a
> query from a token stream.
> Each of the problem lines follows the pattern:
>  TermQuery currentQuery = new TermQuery(new Term(field, t.termText()));
>
> To remove the deprecated call to t.termText(), the Token's buffer
> needs to be converted to a String with something like:
>  String termText = new String(t.termBuffer(), 0, t.termLength());
>  TermQuery currentQuery = new TermQuery(new Term(field, termText));
>
> /**
>  * @exception ParseException throw in overridden method to disallow
>  */
> protected Query getFieldQuery(String field, String queryText)
>     throws ParseException {
>   // Use the analyzer to get all the tokens, and then build a TermQuery,
>   // PhraseQuery, or nothing based on the term count
>
>   TokenStream source =
>       analyzer.tokenStream(field, new StringReader(queryText));
>   Vector v = new Vector();
>   org.apache.lucene.analysis.Token t;
>   int positionCount = 0;
>   boolean severalTokensAtSamePosition = false;
>
>   while (true) {
>     try {
>       t = source.next();
>     }
>     catch (IOException e) {
>       t = null;
>     }
>     if (t == null)
>       break;
>     v.addElement(t);
>     if (t.getPositionIncrement() != 0)
>       positionCount += t.getPositionIncrement();
>     else
>       severalTokensAtSamePosition = true;
>   }
>   try {
>     source.close();
>   }
>   catch (IOException e) {
>     // ignore
>   }
>
>   if (v.size() == 0)
>     return null;
>   else if (v.size() == 1) {
>     t = (org.apache.lucene.analysis.Token) v.elementAt(0);
>     return new TermQuery(new Term(field, t.termText()));
>   } else {
>     if (severalTokensAtSamePosition) {
>       if (positionCount == 1) {
>         // no phrase query:
>         BooleanQuery q = new BooleanQuery(true);
>         for (int i = 0; i < v.size(); i++) {
>           t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>           TermQuery currentQuery = new TermQuery(
>               new Term(field, t.termText()));
>           q.add(currentQuery, BooleanClause.Occur.SHOULD);
>         }
>         return q;
>       }
>       else {
>         // phrase query:
>         MultiPhraseQuery mpq = new MultiPhraseQuery();
>         mpq.setSlop(phraseSlop);
>         List multiTerms = new ArrayList();
>         int position = -1;
>         for (int i = 0; i < v.size(); i++) {
>           t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>           if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
>             if (enablePositionIncrements) {
>               mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>             } else {
>               mpq.add((Term[])multiTerms.toArray(new Term[0]));
>             }
>             multiTerms.clear();
>           }
>           position += t.getPositionIncrement();
>           multiTerms.add(new Term(field, t.termText()));
>         }
>         if (enablePositionIncrements) {
>           mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>         } else {
>           mpq.add((Term[])multiTerms.toArray(new Term[0]));
>         }
>         return mpq;
>       }
>     }
>     else {
>       PhraseQuery pq = new PhraseQuery();
>       pq.setSlop(phraseSlop);
>       int position = -1;
>       for (int i = 0; i < v.size(); i++) {
>         t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>         if (enablePositionIncrements) {
>           position += t.getPositionIncrement();
>           pq.add(new Term(field, t.termText()),position);
>         } else {
>           pq.add(new Term(field, t.termText()));
>         }
>       }
>       return pq;
>     }
>   }
> }
>
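The branching in getFieldQuery above depends only on the tokens' position increments, which can be easier to see in isolation. The sketch below is a simplification under stated assumptions: QueryShapeSketch, Shape and choose are hypothetical names, and it returns a label instead of building real Lucene queries.

```java
// Hypothetical summary of the getFieldQuery decision logic (illustration only).
final class QueryShapeSketch {
    enum Shape { NONE, TERM, BOOLEAN_SHOULD, MULTI_PHRASE, PHRASE }

    /** Picks the query shape from the position increment of each token, in order. */
    static Shape choose(int[] positionIncrements) {
        int positionCount = 0;
        boolean severalTokensAtSamePosition = false;
        for (int increment : positionIncrements) {
            if (increment != 0) {
                positionCount += increment;
            } else {
                // An increment of 0 stacks this token on the previous position.
                severalTokensAtSamePosition = true;
            }
        }

        if (positionIncrements.length == 0) {
            return Shape.NONE;            // no tokens: the method returns null
        }
        if (positionIncrements.length == 1) {
            return Shape.TERM;            // single token: TermQuery
        }
        if (severalTokensAtSamePosition) {
            // All tokens at one position: OR them; otherwise MultiPhraseQuery.
            return positionCount == 1 ? Shape.BOOLEAN_SHOULD : Shape.MULTI_PHRASE;
        }
        return Shape.PHRASE;              // distinct positions: PhraseQuery
    }
}
```

Every branch except the empty case ends up creating a Term from a token's text, which is why the single method contains five termText() calls.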
>
> Here is an example that works around the deprecated code:
> public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {
>   Analyzer analyzer =
>       new ShingleAnalyzerWrapper(new WhitespaceAnalyzer(), 2);
>   searcher = setUpSearcher(analyzer);
>
>   PhraseQuery q = new PhraseQuery();
>
>   TokenStream ts = analyzer.tokenStream("content",
>                                         new StringReader("this sentence"));
>   Token token;
>   int j = -1;
>   while ((token = ts.next()) != null) {
>     j += token.getPositionIncrement();
>     String termText =
>         new String(token.termBuffer(), 0, token.termLength());
>     q.add(new Term("content", termText), j);
>   }
>
>   Hits hits = searcher.search(q);
>   int[] ranks = new int[] { 0 };
>   compareRanks(hits, ranks);
> }
>
> -- DM
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>



