lucene-dev mailing list archives

From "DM Smith (JIRA)" <>
Subject [jira] Created: (LUCENE-1799) Unicode compression
Date Tue, 11 Aug 2009 03:54:15 GMT
Unicode compression

                 Key: LUCENE-1799
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Store
    Affects Versions: 2.4.1
            Reporter: DM Smith
            Priority: Minor

In LUCENE-1793 there was an off-topic suggestion to provide compression of Unicode data.
The motivation was a custom encoding in a Russian analyzer, which was originally assumed
to produce a more compact index.

This led to the comment that a different or compressed encoding would be a generally useful
feature.

BOCU-1 was suggested as a possibility. This is an algorithm patented by IBM, with an implementation
in ICU. If Lucene provides its own implementation, a freely available, royalty-free license
would need to be obtained.

SCSU is another Unicode compression algorithm that could be used. 

An advantage of these methods is that they work on the whole of Unicode. If that is not needed,
an encoding such as ISO-8859-1 (or whatever covers the input) could be used.
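
To make the trade-off concrete, here is a minimal sketch using only the JDK's standard charsets. It is not Lucene code and does not implement BOCU-1 or SCSU; the class and method names (EncodingSizes, encodedLength, covers) are hypothetical and exist only for this illustration. It compares how many bytes the same text needs in a single-byte encoding versus UTF-8 and UTF-16, and shows that a single-byte encoding is only an option when it covers every character of the input.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Lucene.
public class EncodingSizes {

    // Number of bytes needed to store the text in the given charset.
    // Note: String.getBytes silently substitutes unmappable characters,
    // so call covers() first if the charset might not fit the input.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // True if every character of the text can be represented in the charset.
    static boolean covers(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String latin = "caf\u00e9";                               // "cafe" with e-acute
        String russian = "\u043f\u0440\u0438\u0432\u0435\u0442";  // Russian "privet"

        Charset[] charsets = {
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16BE
        };

        for (String text : new String[] {latin, russian}) {
            for (Charset cs : charsets) {
                if (covers(text, cs)) {
                    System.out.println(cs.name() + ": " + encodedLength(text, cs) + " bytes");
                } else {
                    System.out.println(cs.name() + ": cannot represent this text");
                }
            }
        }
    }
}
```

For the Latin sample, ISO-8859-1 needs one byte per character, while UTF-8 spends an extra byte on the accented letter and UTF-16 uses two bytes throughout. The Russian sample cannot be represented in ISO-8859-1 at all, which is the case where a Cyrillic-specific encoding or a whole-Unicode scheme such as BOCU-1 or SCSU would come into play.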

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
