ant-dev mailing list archives

From Stefan Bodewig <bode...@apache.org>
Subject encoding in ZIP package
Date Thu, 26 Feb 2009 14:16:26 GMT
Hi all,

over the past two weeks commons-compress has been adding stuff for
more advanced ZIP features and I've merged the changes over to our zip
package.  The changes bring two new options with them and I'd like to
get some feedback as to which defaults our tasks should use wrt these
options.

First some background:

Traditionally file names are encoded using IBM code page 437 (CP437)
inside ZIP archives.  This is insufficient for many characters and thus
people have chosen multiple incompatible ways to use different
encodings.  jar uses UTF-8.  Ant's tasks provide options to set the
encoding when reading/writing archives and default to the platform's
default encoding for zip/unzip or UTF-8 for jar/unjar.

Now the new stuff.

Language Encoding Flag
----------------------

PKWARE, as the definer of the ZIP standard, has designated a bit inside
the "general purpose bits" part of the entry's metadata to say "my
file name is in UTF-8".  This flag is recognized by more modern PKWARE
archivers, 7ZIP and very recent InfoZIP tools (if compiled using the
correct options).  7ZIP creates archives using that flag.

WinZIP and Windows' "compressed folders" completely ignore the flag.

The ZipOutputStream code right now sets the flag if encoding is UTF-8
(i.e. we are writing JARs) which makes those who understand it
immediately pick up the correct file names.  Those who don't know the
flag are no better off than before - java.util.zip seems to be happy
with and without the flag.
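
To make the bit concrete: the ZIP application note assigns the
language encoding flag to bit 11 (mask 0x0800) of the general purpose
bit field.  The following is only a sketch of the write-side rule
described above, not the actual ZipOutputStream code; the class name,
method name and the useFlag parameter are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustration only; all names here are hypothetical, not Ant's API. */
class LanguageEncodingFlagSketch {

    /** Bit 11 of the general purpose bit field: name and comment are UTF-8. */
    static final int UTF8_NAMES_FLAG = 1 << 11; // 0x0800

    /**
     * Returns the general purpose bits to write for an entry.
     * The flag is only added when the archive encoding really is UTF-8
     * and the caller has not switched the feature off.
     */
    static int generalPurposeBits(int otherBits, Charset encoding, boolean useFlag) {
        boolean utf8 = StandardCharsets.UTF_8.equals(encoding);
        return (utf8 && useFlag) ? (otherBits | UTF8_NAMES_FLAG) : otherBits;
    }
}
```

With a non-UTF-8 encoding the other bits pass through unchanged, which
is why a switch for the flag would have no effect in that case.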

The ZipFile code right now recognizes the flag and ignores any
explicitly specified encoding if the flag is set - and uses UTF-8
instead, assuming the archiver knew what it was doing.
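
The read-side rule could be sketched roughly like this.  It assumes
the flag is bit 11 (0x0800) as in PKWARE's application note; the class
and method names are hypothetical and this is not the real ZipFile
code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustration only; all names here are hypothetical, not Ant's API. */
class EntryNameDecodingSketch {

    /** Bit 11 of the general purpose bit field: name and comment are UTF-8. */
    static final int UTF8_NAMES_FLAG = 1 << 11; // 0x0800

    /**
     * Decodes the raw file name bytes of an entry.  If the entry carries
     * the language encoding flag, the explicitly requested encoding is
     * ignored and UTF-8 is used instead.
     */
    static String decodeName(byte[] rawName, int generalPurposeBits, Charset requested) {
        boolean flagged = (generalPurposeBits & UTF8_NAMES_FLAG) != 0;
        Charset effective = flagged ? StandardCharsets.UTF_8 : requested;
        return new String(rawName, effective);
    }
}
```

For an entry whose name bytes are the UTF-8 form of "\u00e4.txt", a
reader asked to use ISO-8859-1 still gets the right name when the flag
is set, and gets the usual two-character garbage for the umlaut when
it is not.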

I think both are fine defaults and I'm not even sure we need to make
them user configurable on the reading side.  We may add an option on
the writing side if there is some rare archiver that chokes on an
unknown bit in the general purpose bit area.

InfoZIP Unicode Extra Fields
----------------------------

The InfoZIP folks have defined new ZIP extra fields that store UTF-8
versions of file names and comments in the entry's metadata - no
matter what the encoding of the normal name and comment fields may be.
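
For reference, the Unicode path extra field as I understand the
published layout: header ID 0x7075, a two-byte data size, a one-byte
version (currently 1), the CRC32 of the normal name field, and then
the UTF-8 name, all little-endian.  The comment field uses ID 0x6375
with the same shape.  A sketch that builds the path field follows; the
class and method names are hypothetical and this is not the code in
our zip package.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Illustration only; all names here are hypothetical, not Ant's API. */
class UnicodePathFieldSketch {

    /** Header ID of the InfoZIP Unicode path extra field ("up"). */
    static final int UNICODE_PATH_ID = 0x7075;

    /** Builds the extra field for an entry name stored in archiveEncoding. */
    static byte[] build(String name, Charset archiveEncoding) {
        byte[] normalName = name.getBytes(archiveEncoding);
        byte[] utf8Name = name.getBytes(StandardCharsets.UTF_8);

        // The CRC covers the normal name field, so a reader can detect
        // that the name was changed by a tool unaware of this field.
        CRC32 crc = new CRC32();
        crc.update(normalName);

        int dataSize = 1 + 4 + utf8Name.length; // version + CRC + name
        ByteBuffer buf = ByteBuffer.allocate(4 + dataSize)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort((short) UNICODE_PATH_ID);
        buf.putShort((short) dataSize);
        buf.put((byte) 1);                      // version
        buf.putInt((int) crc.getValue());
        buf.put(utf8Name);
        return buf.array();
    }
}
```

Each such field costs nine bytes plus the UTF-8 name on top of the
normal name.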

PKWARE and WinZIP recognize these extra fields, 7ZIP and Windows'
"compressed folders" ignore them.  WinZIP creates archives using them
(but we won't benefit from that unless we fix 
<https://issues.apache.org/bugzilla/show_bug.cgi?id=46637>).

For maximum interop it may be a good idea to write the extra fields,
but it will make the archives bigger.  That's why the current
ZipOutputStream doesn't write them by default - but it can be told to
do so.

ZipFile currently ignores the extra fields by default but can be told
to look for them.  It will ignore them if the language encoding flag
has been set.  It may be a good idea to look for the extra fields by
default since it really doesn't cost too much.
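
A rough sketch of what looking for the field involves, to show why it
is cheap: one pass over the entry's extra data and one CRC32 over the
name.  It assumes the field layout from InfoZIP's description (ID
0x7075, size, version, CRC32 of the normal name, UTF-8 name) and bit
11 for the language encoding flag.  All names are hypothetical and
this is not the real ZipFile code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Illustration only; all names here are hypothetical, not Ant's API. */
class UnicodeExtraFieldLookupSketch {

    static final int UTF8_NAMES_FLAG = 1 << 11; // language encoding flag
    static final int UNICODE_PATH_ID = 0x7075;  // InfoZIP Unicode path field

    /** Returns the UTF-8 name from the extra data, or null if none applies. */
    static String unicodeName(byte[] extra, byte[] rawName, int generalPurposeBits) {
        if ((generalPurposeBits & UTF8_NAMES_FLAG) != 0) {
            return null; // the normal name is already UTF-8
        }
        ByteBuffer buf = ByteBuffer.wrap(extra).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 4) {
            int id = buf.getShort() & 0xFFFF;
            int size = buf.getShort() & 0xFFFF;
            if (size > buf.remaining()) {
                return null; // truncated extra data, give up
            }
            if (id == UNICODE_PATH_ID && size >= 5) {
                int version = buf.get();
                long storedCrc = buf.getInt() & 0xFFFFFFFFL;
                byte[] utf8Name = new byte[size - 5];
                buf.get(utf8Name);
                CRC32 crc = new CRC32();
                crc.update(rawName);
                // A CRC mismatch means the normal name was changed later,
                // so the stored UTF-8 name is stale and must not be used.
                if (version == 1 && crc.getValue() == storedCrc) {
                    return new String(utf8Name, StandardCharsets.UTF_8);
                }
            } else {
                buf.position(buf.position() + size); // skip unrelated field
            }
        }
        return null;
    }
}
```

A null result means the caller falls back to decoding the normal name
field with whatever encoding applies.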

Defaults?
---------

I want to add new flags to <zip> and <unzip> (and thus the
subclasses).

<zip>:

* setLanguageEncodingFlag - doesn't do anything if the encoding is not
  UTF-8.  Controls whether ZipOutputStream sets the flag.

  I'd make that default to true.

* createUnicodeExtraFields

  Controls whether ZipOutputStream writes Unicode extra fields.

  I'd make that default to false.

<unzip>:

* parseUnicodeExtraFields

  Controls whether ZipFile searches for Unicode extra fields.

  I'm uncertain as to what the default should be.

Stefan
