lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-1799) Unicode compression
Date Thu, 19 Nov 2009 00:16:49 GMT
btw, does anyone have a guess at how expensive this
ByteBuffer/CharBuffer.wrap() is?
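
As a rough data point rather than a measurement: on the stock JDK, `CharBuffer.wrap(char[])` does not copy. It allocates one small buffer object that points at the caller's array. Wrapping a String directly is different: `CharBuffer.wrap(CharSequence)` returns a read-only buffer with no accessible backing array. The sketch below only demonstrates those two behaviours; the class and method names (`WrapSketch`, `sharesArray`, `stringWrapHasArray`) are made up for illustration, and it says nothing about allocation cost under load.

```java
import java.nio.CharBuffer;

class WrapSketch {
    /** True if a write through the wrapping buffer lands in the original array. */
    static boolean sharesArray() {
        char[] chars = {'a', 'b', 'c'};
        CharBuffer cb = CharBuffer.wrap(chars);   // no copy, just a view object
        cb.put(0, 'z');                           // absolute put through the buffer
        return chars[0] == 'z' && cb.hasArray() && cb.array() == chars;
    }

    /** Wrapping a String yields a read-only buffer; hasArray() reports false. */
    static boolean stringWrapHasArray() {
        return CharBuffer.wrap("abc").hasArray();
    }

    public static void main(String[] args) {
        System.out.println("array wrap shares storage: " + sharesArray());
        System.out.println("String wrap has accessible array: " + stringWrapHasArray());
    }
}
```

So the per-call cost of `wrap()` on an array is one short-lived object, not a copy of the data. Whether that matters per term would need an actual benchmark.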

Looking at the collation support, we could maybe improve
IndexableBinaryStringTools by adding methods that take char[]/byte[] with
offset and length. The existing ByteBuffer/CharBuffer methods could stay
(they are consistent with the Charset api and are not wrong imo), but they
would defer to the new char[]/byte[] ones. The current buffer-based ones
require the buffer to have a backing array anyway, or they will throw an
exception.
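
A minimal sketch of that layering, assuming nothing about the real IndexableBinaryStringTools signatures: the class `ArrayFirstSketch`, both `encode` overloads, and the one-char-per-byte "encoding" are placeholders chosen only to show the buffer-based method unpacking its arguments and deferring to the array-based one.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.util.Arrays;

class ArrayFirstSketch {
    /**
     * Array-based core. Placeholder encoding: one char per input byte.
     * Returns the number of chars written. No capacity check on out, for brevity.
     */
    static int encode(byte[] in, int inOff, int inLen, char[] out, int outOff) {
        for (int i = 0; i < inLen; i++) {
            out[outOff + i] = (char) (in[inOff + i] & 0xFF);
        }
        return inLen;
    }

    /** Buffer-based overload kept for existing callers; it only unpacks and delegates. */
    static void encode(ByteBuffer in, CharBuffer out) {
        if (!in.hasArray() || !out.hasArray()) {
            throw new IllegalArgumentException("both buffers need an accessible backing array");
        }
        int written = encode(in.array(), in.arrayOffset() + in.position(), in.remaining(),
                             out.array(), out.arrayOffset() + out.position());
        in.position(in.limit());                  // all input consumed
        out.position(out.position() + written);   // advance past what was written
    }

    public static void main(String[] args) {
        byte[] bytes = {0, 1, 2, 3, (byte) 0xFF};

        char[] direct = new char[3];
        encode(bytes, 1, 3, direct, 0);           // array path: no buffer objects

        CharBuffer viaBuffer = CharBuffer.allocate(3);
        encode(ByteBuffer.wrap(bytes, 1, 3), viaBuffer);

        System.out.println(Arrays.equals(direct, viaBuffer.array()));
    }
}
```

Callers that already hold a char[]/byte[] plus offset and length would skip the wrap entirely, while the buffer overload keeps its current contract of rejecting buffers without a backing array.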

On Wed, Nov 18, 2009 at 2:12 PM, Earwin Burrfoot (JIRA) <jira@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779602#action_12779602]
>
> Earwin Burrfoot commented on LUCENE-1799:
> -----------------------------------------
>
> bq. as far as the encoding itself, BOCU-1 is available in the ICU library
> ICU's API requires ByteBuffer and CharBuffer for input/output. And
> even if I missed some nice method, the encoder/decoder operates internally on
> said buffers. Thus, a wrap/unwrap for each String is inevitable.
>
> > Unicode compression
> > -------------------
> >
> >                 Key: LUCENE-1799
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-1799
> >             Project: Lucene - Java
> >          Issue Type: New Feature
> >          Components: Store
> >    Affects Versions: 2.4.1
> >            Reporter: DM Smith
> >            Priority: Minor
> >
> > In LUCENE-1793, there is the off-topic suggestion to provide compression
> of Unicode data. The motivation was a custom encoding in a Russian analyzer.
> The original supposition was that it provided a more compact index.
> > This led to the comment that a different or compressed encoding would be
> a generally useful feature.
> > BOCU-1 was suggested as a possibility. This is a patented algorithm by
> IBM with an implementation in ICU. If Lucene provides its own implementation,
> a freely available, royalty-free license would need to be obtained.
> > SCSU is another Unicode compression algorithm that could be used.
> > An advantage of these methods is that they work on the whole of Unicode.
> If that is not needed, an encoding such as ISO-8859-1 (or whatever covers the
> input) could be used.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>


-- 
Robert Muir
rcmuir@gmail.com
