lucene-dev mailing list archives

From "Trejkaz (JIRA)" <j...@apache.org>
Subject [jira] Created: (LUCENE-1181) Token reuse is not ideal for avoiding array copies
Date Tue, 19 Feb 2008 01:22:34 GMT
Token reuse is not ideal for avoiding array copies
--------------------------------------------------

                 Key: LUCENE-1181
                 URL: https://issues.apache.org/jira/browse/LUCENE-1181
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
    Affects Versions: 2.3
            Reporter: Trejkaz


The way the Token API is currently written results in two unnecessary array copies which could
be avoided by changing the way it works.

1. setTermBuffer(char[],int,int) calls resizeTermBuffer(int) which copies the original term
text even though it's about to be overwritten.

#1 should be trivially fixable by introducing a private resizeTermBuffer(int,boolean) where
the new boolean parameter specifies whether the existing term data gets copied over or not.
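
A minimal sketch of the fix for #1. TermBufferSketch is a hypothetical stand-in for the
term-buffer part of Token, not Lucene's actual source; its field names, initial capacity and
growth policy are assumptions made for illustration.

```java
// Hypothetical stand-in for the term-buffer handling in Token (not Lucene's code).
final class TermBufferSketch {
    private char[] termBuffer = new char[10]; // assumed initial capacity
    private int termLength;

    // Proposed private variant: grows the buffer to at least newSize chars and
    // copies the old term text across only when preserveContents is true.
    private void resizeTermBuffer(int newSize, boolean preserveContents) {
        if (termBuffer.length >= newSize) {
            return; // already big enough, nothing to allocate or copy
        }
        char[] grown = new char[newSize];
        if (preserveContents) {
            System.arraycopy(termBuffer, 0, grown, 0, termLength);
        }
        termBuffer = grown;
    }

    // Existing single-argument form keeps its copying behaviour for callers
    // that rely on the old contents surviving a resize.
    char[] resizeTermBuffer(int newSize) {
        resizeTermBuffer(newSize, true);
        return termBuffer;
    }

    // The old contents are about to be overwritten, so skip the copy on resize.
    void setTermBuffer(char[] buffer, int offset, int length) {
        resizeTermBuffer(length, false);
        System.arraycopy(buffer, offset, termBuffer, 0, length);
        termLength = length;
    }

    String term() {
        return new String(termBuffer, 0, termLength);
    }

    int capacity() {
        return termBuffer.length;
    }
}
```

With this shape, setTermBuffer performs one copy (the incoming text) instead of two whenever the
buffer has to grow.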

2. setTermBuffer(char[],int,int) copies what you pass in, instead of actually setting the
term buffer.

Setting aside the fact that the setTermBuffer method is misleadingly named, consider a token
filter which performs Unicode normalisation on each token.

How it has to be implemented at present:
  once:
    - create a reusable char[] for storing the normalisation result
  every token:
    - use getTermBuffer() and getTermLength() to get the buffer and relevant length
    - normalise the original string into our temporary buffer (if it isn't big enough, grow the temp buffer first)
    - setTermBuffer(char[],int,int) - this does an extra copy.

The following sequence would be much better:
  once:
    - create a reusable char[] for storing the normalisation result
  every token:
    - use getTermBuffer() and getTermLength() to get the buffer and relevant length
    - normalise the original string into our temporary buffer (if it isn't big enough, grow the temp buffer first)
    - setTermBuffer(char[],int,int) installs our buffer in the Token by reference, without copying
    - keep the char[] which used to be in the Token as our new temp buffer

The latter sequence results in no copying with the exception of the normalisation itself,
which is unavoidable.
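
The latter sequence might look roughly like the sketch below. SwapToken and its
swapTermBuffer method are hypothetical names invented here (Token in 2.3 has no by-reference
setter), and the per-character lower-casing is only a placeholder for real Unicode
normalisation, which can also change the length of the text.

```java
// Hypothetical token with a by-reference setter (an assumption, not Lucene's API).
final class SwapToken {
    private char[] termBuffer;
    private int termLength;

    SwapToken(String text) {
        termBuffer = text.toCharArray();
        termLength = termBuffer.length;
    }

    char[] termBuffer() {
        return termBuffer;
    }

    int termLength() {
        return termLength;
    }

    // Adopts newBuffer without copying and hands back the buffer it replaces.
    char[] swapTermBuffer(char[] newBuffer, int newLength) {
        char[] old = termBuffer;
        termBuffer = newBuffer;
        termLength = newLength;
        return old;
    }
}

// Hypothetical filter body showing the "once" and "every token" steps.
final class NormalisingFilterSketch {
    private char[] temp = new char[16]; // once: reusable result buffer

    void process(SwapToken token) {
        char[] source = token.termBuffer();
        int length = token.termLength();
        if (temp.length < length) {
            temp = new char[length]; // grow; the old temp contents are not needed
        }
        // Placeholder for normalisation: writes the result into temp.
        for (int i = 0; i < length; i++) {
            temp[i] = Character.toLowerCase(source[i]);
        }
        // Install temp in the token by reference; the token's old buffer
        // becomes the temp buffer for the next token.
        temp = token.swapTermBuffer(temp, length);
    }
}
```

The two arrays simply trade places on every token, so the only per-token work is the
normalisation itself.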



