lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: Token implementation
Date Mon, 14 Jul 2008 15:50:01 GMT
Hiroaki Kawai wrote:
> DM Smith <dmsmith555@gmail.com> wrote:
>   
>> On Jul 11, 2008, at 9:42 PM, Hiroaki Kawai wrote:
>>
>>     
>>> Another suggestion from me:
>>> How about making the token object a singleton?
>>>       
>> Would that work for a multi-threaded application?
>>     
>
> Of course. We should make that a thread-local singleton.
>   

In core and contrib, there are times where more than one token is used 
at a time. In a few places they are put into collections.

So a singleton wouldn't work.
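
A minimal sketch of why, assuming a hypothetical ReusableToken class (a
stand-in for illustration, not Lucene's Token) and a thread-local shared
instance as suggested above. Once the shared instance is put into a
collection, every element aliases the same object, so only the last term
survives:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a reusable token; NOT Lucene's Token class.
final class ReusableToken {
    private char[] buffer = new char[16];
    private int length;

    // Copy the text into the internal buffer, growing it if needed.
    void setTerm(String text) {
        if (text.length() > buffer.length) {
            buffer = new char[text.length()];
        }
        text.getChars(0, text.length(), buffer, 0);
        length = text.length();
    }

    String term() {
        return new String(buffer, 0, length);
    }
}

final class SingletonTokenDemo {
    // One shared token per thread (the "thread local singleton" idea).
    private static final ThreadLocal<ReusableToken> SHARED =
        ThreadLocal.withInitial(ReusableToken::new);

    // Collects tokens first and reads them later, but every list element
    // is the same per-thread instance.
    static List<String> collectWithSingleton(String[] words) {
        List<ReusableToken> tokens = new ArrayList<>();
        for (String w : words) {
            ReusableToken t = SHARED.get();
            t.setTerm(w);   // overwrites whatever was stored before
            tokens.add(t);  // adds another reference to the same object
        }
        List<String> out = new ArrayList<>();
        for (ReusableToken t : tokens) {
            out.add(t.term());
        }
        return out;
    }

    // Same loop, but with a distinct token per term.
    static List<String> collectWithFreshTokens(String[] words) {
        List<ReusableToken> tokens = new ArrayList<>();
        for (String w : words) {
            ReusableToken t = new ReusableToken();
            t.setTerm(w);
            tokens.add(t);
        }
        List<String> out = new ArrayList<>();
        for (ReusableToken t : tokens) {
            out.add(t.term());
        }
        return out;
    }

    public static void main(String[] args) {
        String[] words = {"this", "sentence"};
        System.out.println(collectWithSingleton(words));   // [sentence, sentence]
        System.out.println(collectWithFreshTokens(words)); // [this, sentence]
    }
}
```

Being thread-local does not help here: the aliasing happens within a
single thread.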

>
>   
>>>
>>>       
>>>> Maybe we should un-deprecate the termText() method but add javadocs
>>>> explaining that for better performance you should use the char[]
>>>> reuse methods instead?
>>>>
>>>> Mike
>>>>
>>>> DM Smith wrote:
>>>>
>>>>         
>>>>> Michael McCandless wrote:
>>>>>           
>>>>>> DM Smith wrote:
>>>>>>
>>>>>>             
>>>>>>> Shouldn't Term have constructors that take a Token?
>>>>>>>               
>>>>>> I think that makes sense, though normally Token appears during
>>>>>> analysis and Term during searching (I think?) -- how often would
>>>>>> you need to make a Term from a Token?
>>>>>>
>>>>>>             
>>>>> The problem I'm addressing is that tokens are used in contexts that
>>>>> need String and not char[].
>>>>> The call to the deprecated
>>>>> String termText = token.termText();
>>>>> needs to be replaced with:
>>>>> String termText = new String(token.termBuffer(), 0,
>>>>> token.termLength());
>>>>>
>>>>> There are over 170 calls to token.termText(), and each of these
>>>>> places has to be modified. In some, perhaps many, of these cases it
>>>>> may be possible to use char[] directly to get a performance gain.
>>>>>
>>>>> In the case of Term changing it to work with char[] buffer, int
>>>>> start, int length, does not seem quite right. I think the ripple
>>>>> would keep getting bigger. But logically, the Term's text is the
>>>>> text of a Token.
>>>>>
>>>>> To me it makes sense to have a method that returns the token as a
>>>>> String, but that method is deprecated and the suggested replacement
>>>>> is to directly use the buffer. So this leads to the above construct.
>>>>> Perhaps it would be good to add a new method and document that as
>>>>> one of two replacements.
>>>>> public String term() {
>>>>> return termText != null ? termText : new String(termBuffer(),
>>>>> 0, termLength());
>>>>> }
>>>>>
>>>>> Here is an example from QueryParser that has 5 instances, each
>>>>> calling the deprecated t.termText() method. In this example, there
>>>>> is the construction of a query from a token stream.
>>>>> Each of the problem lines follows the pattern:
>>>>> TermQuery currentQuery = new TermQuery(new Term(field,
>>>>> t.termText()));
>>>>>
>>>>> To remove the deprecated call to t.termText(), the Token's buffer
>>>>> needs to be marshalled with something like:
>>>>> String termText = new String(token.termBuffer(), 0,
>>>>> token.termLength());
>>>>> TermQuery currentQuery = new TermQuery(new Term(field, termText));
>>>>>
>>>>> /**
>>>>> * @exception ParseException throw in overridden method to disallow
>>>>> */
>>>>> protected Query getFieldQuery(String field, String queryText)
>>>>> throws ParseException {
>>>>>  // Use the analyzer to get all the tokens, and then build a
>>>>>  // TermQuery, PhraseQuery, or nothing based on the term count
>>>>>
>>>>>  TokenStream source = analyzer.tokenStream(field, new
>>>>> StringReader(queryText));
>>>>>  Vector v = new Vector();
>>>>>  org.apache.lucene.analysis.Token t;
>>>>>  int positionCount = 0;
>>>>>  boolean severalTokensAtSamePosition = false;
>>>>>
>>>>>  while (true) {
>>>>>    try {
>>>>>      t = source.next();
>>>>>    }
>>>>>    catch (IOException e) {
>>>>>      t = null;
>>>>>    }
>>>>>    if (t == null)
>>>>>      break;
>>>>>    v.addElement(t);
>>>>>    if (t.getPositionIncrement() != 0)
>>>>>      positionCount += t.getPositionIncrement();
>>>>>    else
>>>>>      severalTokensAtSamePosition = true;
>>>>>  }
>>>>>  try {
>>>>>    source.close();
>>>>>  }
>>>>>  catch (IOException e) {
>>>>>    // ignore
>>>>>  }
>>>>>
>>>>>  if (v.size() == 0)
>>>>>    return null;
>>>>>  else if (v.size() == 1) {
>>>>>    t = (org.apache.lucene.analysis.Token) v.elementAt(0);
>>>>>    return new TermQuery(new Term(field, t.termText()));
>>>>>  } else {
>>>>>    if (severalTokensAtSamePosition) {
>>>>>      if (positionCount == 1) {
>>>>>        // no phrase query:
>>>>>        BooleanQuery q = new BooleanQuery(true);
>>>>>        for (int i = 0; i < v.size(); i++) {
>>>>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>>>          TermQuery currentQuery = new TermQuery(
>>>>>              new Term(field, t.termText()));
>>>>>          q.add(currentQuery, BooleanClause.Occur.SHOULD);
>>>>>        }
>>>>>        return q;
>>>>>      }
>>>>>      else {
>>>>>        // phrase query:
>>>>>        MultiPhraseQuery mpq = new MultiPhraseQuery();
>>>>>        mpq.setSlop(phraseSlop);
>>>>>        List multiTerms = new ArrayList();
>>>>>        int position = -1;
>>>>>        for (int i = 0; i < v.size(); i++) {
>>>>>          t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>>>          if (t.getPositionIncrement() > 0
>>>>>              && multiTerms.size() > 0) {
>>>>>            if (enablePositionIncrements) {
>>>>>              mpq.add((Term[])multiTerms.toArray(new
>>>>> Term[0]),position);
>>>>>            } else {
>>>>>              mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>>>>            }
>>>>>            multiTerms.clear();
>>>>>          }
>>>>>          position += t.getPositionIncrement();
>>>>>          multiTerms.add(new Term(field, t.termText()));
>>>>>        }
>>>>>        if (enablePositionIncrements) {
>>>>>          mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
>>>>>        } else {
>>>>>          mpq.add((Term[])multiTerms.toArray(new Term[0]));
>>>>>        }
>>>>>        return mpq;
>>>>>      }
>>>>>    }
>>>>>    else {
>>>>>      PhraseQuery pq = new PhraseQuery();
>>>>>      pq.setSlop(phraseSlop);
>>>>>      int position = -1;
>>>>>      for (int i = 0; i < v.size(); i++) {
>>>>>        t = (org.apache.lucene.analysis.Token) v.elementAt(i);
>>>>>        if (enablePositionIncrements) {
>>>>>          position += t.getPositionIncrement();
>>>>>          pq.add(new Term(field, t.termText()),position);
>>>>>        } else {
>>>>>          pq.add(new Term(field, t.termText()));
>>>>>        }
>>>>>      }
>>>>>      return pq;
>>>>>    }
>>>>>  }
>>>>> }
>>>>>
>>>>>
>>>>> Here is an example that works around the deprecated code:
>>>>> public void testShingleAnalyzerWrapperPhraseQuery() throws  
>>>>> Exception {
>>>>>  Analyzer analyzer = new ShingleAnalyzerWrapper(new
>>>>> WhitespaceAnalyzer(), 2);
>>>>>  searcher = setUpSearcher(analyzer);
>>>>>
>>>>>  PhraseQuery q = new PhraseQuery();
>>>>>
>>>>>  TokenStream ts = analyzer.tokenStream("content",
>>>>>      new StringReader("this sentence"));
>>>>>  Token token;
>>>>>  int j = -1;
>>>>>  while ((token = ts.next()) != null) {
>>>>>    j += token.getPositionIncrement();
>>>>>    String termText = new String(token.termBuffer(), 0,
>>>>> token.termLength());
>>>>>    q.add(new Term("content", termText), j);
>>>>>  }
>>>>>
>>>>>  Hits hits = searcher.search(q);
>>>>>  int[] ranks = new int[] { 0 };
>>>>>  compareRanks(hits, ranks);
>>>>> }
>>>>>
>>>>> -- DM
>>>>>
>>>>>           
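
For reference, the term() helper proposed in the quoted message could be
sketched as below. BufferToken is a hypothetical stand-in written for this
example, not Lucene's Token; only the accessor names termBuffer() and
termLength() are taken from the thread. The point is that the buffer may
hold stale characters past termLength(), so the String must be built from
the (buffer, 0, length) range rather than from the whole array:

```java
// Hypothetical stand-in for a token with a reusable char[] buffer;
// NOT Lucene's Token class.
final class BufferToken {
    private char[] termBuffer = new char[8];
    private int termLength;

    // Copy a slice of src into the reusable buffer, growing it if needed.
    void setTermBuffer(char[] src, int offset, int length) {
        if (length > termBuffer.length) {
            termBuffer = new char[length];
        }
        System.arraycopy(src, offset, termBuffer, 0, length);
        termLength = length;
    }

    char[] termBuffer() {
        return termBuffer;
    }

    int termLength() {
        return termLength;
    }

    // Convenience accessor for callers that need a String.
    // Allocates a new String on every call.
    String term() {
        return new String(termBuffer, 0, termLength);
    }
}

final class TermHelperDemo {
    public static void main(String[] args) {
        BufferToken t = new BufferToken();
        t.setTermBuffer("sentence".toCharArray(), 0, 8);
        t.setTermBuffer("this".toCharArray(), 0, 4); // reuses the buffer

        System.out.println(t.term());                   // this
        System.out.println(new String(t.termBuffer())); // thisence (stale tail)
    }
}
```

Callers that can work on char[] directly would still skip the allocation;
the helper only covers the places that genuinely need a String.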


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

