lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (LUCENE-1181) Token reuse is not ideal for avoiding array copies
Date Wed, 23 Apr 2008 13:23:21 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1181.
----------------------------------------

    Resolution: Won't Fix

> Token reuse is not ideal for avoiding array copies
> --------------------------------------------------
>
>                 Key: LUCENE-1181
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1181
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3
>            Reporter: Trejkaz
>
> The way the Token API is currently written results in two unnecessary array
> copies which could be avoided by changing the way it works.
> 1. setTermBuffer(char[],int,int) calls resizeTermBuffer(int) which copies the
> original term text even though it's about to be overwritten.
> #1 should be trivially fixable by introducing a private
> resizeTermBuffer(int,boolean) where the new boolean parameter specifies
> whether the existing term data gets copied over or not.
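
A rough illustration of fix #1, written against a simplified stand-in rather
than Lucene's actual Token class. TermStoreSketch, resize, set, append and the
copiesOfOldText counter are all hypothetical names invented for this sketch;
only the idea of a boolean "preserve" flag comes from the description above.

```java
/** Simplified stand-in for a token's term storage. NOT Lucene's Token class. */
final class TermStoreSketch {
    private char[] buf = new char[4];
    private int len;
    int copiesOfOldText; // instrumentation for this sketch only

    // Hypothetical two-argument resize: 'preserve' says whether the existing
    // term text must survive the reallocation.
    private void resize(int newSize, boolean preserve) {
        if (newSize <= buf.length) {
            return; // already big enough, nothing to do
        }
        char[] bigger = new char[newSize];
        if (preserve) {
            System.arraycopy(buf, 0, bigger, 0, len);
            copiesOfOldText++;
        }
        buf = bigger;
    }

    // Appending needs the old text, so it asks for it to be preserved.
    void append(char c) {
        resize(len + 1, true);
        buf[len++] = c;
    }

    // Overwriting does not need the old text, so the copy is skipped.
    void set(char[] src, int offset, int length) {
        resize(length, false);
        System.arraycopy(src, offset, buf, 0, length);
        len = length;
    }

    String text() {
        return new String(buf, 0, len);
    }

    public static void main(String[] args) {
        TermStoreSketch t = new TermStoreSketch();
        t.set("ab".toCharArray(), 0, 2);
        t.set("normalised".toCharArray(), 0, 10); // grows, old text not copied
        System.out.println(t.text() + " copies=" + t.copiesOfOldText);
    }
}
```

In this sketch only append() pays for carrying the old contents across a
reallocation; set() still performs the one copy of the incoming data, which is
the subject of point 2.
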
> 2. setTermBuffer(char[],int,int) copies what you pass in, instead of actually
> setting the term buffer.
> Setting aside the fact that the setTermBuffer method is misleadingly named,
> consider a token filter which performs Unicode normalisation on each token.
> How it has to be implemented at present:
>   once:
>     - create a reusable char[] for storing the normalisation result
>   every token:
>     - use getTermBuffer() and getTermLength() to get the buffer and relevant
>       length
>     - normalise the original string into our temporary buffer (if it isn't
>       big enough, grow the temp buffer size.)
>     - setTermBuffer(char[],int,int) - this does an extra copy.
> The following sequence would be much better:
>   once:
>     - create a reusable char[] for storing the normalisation result
>   every token:
>     - use getTermBuffer() and getTermLength() to get the buffer and relevant
>       length
>     - normalise the original string into our temporary buffer (if it isn't
>       big enough, grow the temp buffer size.)
>     - setTermBuffer(char[],int,int) stores our buffer by reference instead
>       of copying it
>     - take the term buffer which used to be in the Token and keep it as our
>       new temp buffer.
> The latter sequence results in no copying with the exception of the
> normalisation itself, which is unavoidable.
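
The proposed by-reference sequence could look roughly like the following. This
is a sketch against a simplified stand-in, not Lucene's Token or TokenFilter:
SwapToken, swapBuffer and UpperCaseFilterSketch are hypothetical names, and
upper-casing stands in for Unicode normalisation (which, unlike this stand-in,
can change the length of the text).

```java
/** Simplified stand-in for a token. NOT Lucene's Token class. */
final class SwapToken {
    private char[] buf;
    private int len;

    SwapToken(String s) {
        buf = s.toCharArray();
        len = buf.length;
    }

    char[] buffer() { return buf; }
    int length() { return len; }

    // Hypothetical by-reference setter: adopts 'next' without copying and
    // hands the previous buffer back so the caller can reuse it.
    char[] swapBuffer(char[] next, int nextLen) {
        char[] old = buf;
        buf = next;
        len = nextLen;
        return old;
    }
}

/** Stand-in filter; upper-casing plays the role of normalisation. */
final class UpperCaseFilterSketch {
    private char[] scratch = new char[16]; // created once, reused per token

    void apply(SwapToken t) {
        int n = t.length();
        if (scratch.length < n) {
            scratch = new char[n]; // grow the temp buffer if it is too small
        }
        char[] src = t.buffer();
        for (int i = 0; i < n; i++) {
            scratch[i] = Character.toUpperCase(src[i]);
        }
        // Hand our buffer to the token and keep its old one as the new scratch.
        scratch = t.swapBuffer(scratch, n);
    }

    public static void main(String[] args) {
        SwapToken t = new SwapToken("lucene");
        new UpperCaseFilterSketch().apply(t);
        System.out.println(new String(t.buffer(), 0, t.length()));
    }
}
```

The two arrays simply trade places on every token, so the only per-token work
in this sketch is the transformation loop itself. The trade-off is that the
token no longer owns its storage exclusively at the moment of the swap, so the
caller must not keep writing to the array it just handed over.
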

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

