lucene-dev mailing list archives

From "Robert Muir" <rcm...@gmail.com>
Subject stored fields / unicode compression
Date Sat, 27 Dec 2008 00:00:11 GMT
Have there been any thoughts about using SCSU or BOCU-1 instead of UTF-8 for
stored fields?
Personally I don't put huge amounts of text in stored fields, but these
compression encodings work extremely well on short strings like titles, etc.
Removing the UTF-8 size penalty for non-Latin text (i.e. cutting it roughly
in half) is nothing to sneeze at, since with lots of docs my stored fields
still become pretty huge, the biggest part of the index.
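
To make the size penalty concrete, here is a minimal sketch that measures a short Cyrillic title. The title string is an illustrative assumption, not data from any real index. Python's standard library has no SCSU or BOCU-1 codec, so ISO-8859-5 (one byte per Cyrillic character) is used only as a rough stand-in for the size those schemes are reported to approach on single-script text; it is not SCSU itself.

```python
# Hypothetical short title; any short Cyrillic string shows the same effect.
title = "Война и мир"

# UTF-8 needs two bytes for each Cyrillic letter and one for each space.
utf8_size = len(title.encode("utf-8"))

# ISO-8859-5 is a legacy one-byte-per-character Cyrillic encoding.
# It is a stand-in here, NOT an SCSU or BOCU-1 implementation.
single_byte_size = len(title.encode("iso8859_5"))

print("characters:       ", len(title))
print("UTF-8 bytes:      ", utf8_size)
print("single-byte bytes:", single_byte_size)
```

For this string the UTF-8 form is close to twice the size of the one-byte-per-character form, which is the "cut in half" effect described above. A real SCSU encoder would normally add a small amount of overhead for switching into the right window.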

I know I could use one of these schemes right now and store everything as
bytes... but I'm thinking it might be something of more general use. The
GZIP compression that is currently supported isn't very useful here, as it
typically makes short snippets bigger...
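
The point about general-purpose compression on short snippets can be checked with a small sketch. It uses Python's `gzip` and `zlib` modules as stand-ins for whatever compressor Lucene applies to stored fields, which is an assumption, and the snippet is a hypothetical title rather than real index data.

```python
import gzip
import zlib

# Hypothetical short stored-field value, encoded as UTF-8.
snippet = "Война и мир".encode("utf-8")

# gzip adds a fixed header and trailer around the deflate stream.
gzipped = gzip.compress(snippet)

# The zlib container has a smaller wrapper, but still adds overhead.
deflated = zlib.compress(snippet)

print("raw UTF-8 bytes:", len(snippet))
print("gzip bytes:     ", len(gzipped))
print("zlib bytes:     ", len(deflated))
```

On input this short there is almost no redundancy for the compressor to exploit, so the container overhead dominates and both outputs come out larger than the input.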

Performance figures compared to UTF-8 are here... it seems like a general win
to me (but maybe I am missing something):
http://unicode.org/notes/tn6/#Performance

-- 
Robert Muir
rcmuir@gmail.com
