commons-issues mailing list archives

From "Wolfgang Glas (JIRA)" <>
Subject [jira] Updated: (SANDBOX-176) Enable creation of tool-readable ZIP archives with file names containing non-ASCII characters
Date Sat, 07 Feb 2009 19:20:59 GMT


Wolfgang Glas updated SANDBOX-176:


Hi all,

  As Stefan has just merged ant's ZipOutputStream implementation and I have developed a
patch to improve Unicode filename handling in parallel, I am replicating here the information
from an e-mail I sent to Stefan. (I didn't find this issue before...)

 We came across the ZIP filename encoding problems in one of our production web
applications. Because I knew commons-compress from its bzip2 code, I found there
a version of ZipOutputStream (now: ZipArchiveOutputStream) that makes it
possible to control the encoding used for filenames.

  I wrote several test cases, read the ZIP specification, installed 7-zip and
WinZip and came to the following conclusions:

1) Version 6.3 of the spec introduced the language encoding flag (EFS), bit 11,
which switches the encoding from CP437 to UTF-8.

2) Version 6.3.2 of the spec introduced Unicode information extra fields for
filenames and comments, which make it possible to leave the filename in
the directory entry in CP437 encoding while adding a UTF-8 filename for newer

3) Windows XP's built-in zip engine ignores both approaches.

4) 7-zip supports the EFS flag, but ignores Unicode extra fields. 7-zip
consistently writes the EFS flag for entries that cannot be encoded using CP437.

5) WinZip honors both approaches and writes files using Unicode extra fields. It
never sets the EFS flag.

6) Linux programs do no decoding of filenames; they merely write the filename as
encoded in the file system (nowadays, mostly UTF-8). These programs do not set
the EFS flag when the filesystem is encoded as UTF-8.
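
As an illustration of conclusion 1), here is a minimal sketch of how a reader could pick the filename charset from the general purpose bit flag. The class and method names are made up for this example and are not part of commons-compress or of the patch; it also assumes the JDK ships the "IBM437" charset (its name for CP437).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: choose the filename charset from the EFS flag. */
final class EfsNameDecoder {
    /** Bit 11 of the general purpose bit flag (language encoding flag, EFS). */
    static final int EFS_FLAG = 1 << 11;

    /** Decodes raw entry-name bytes as UTF-8 if EFS is set, otherwise as CP437. */
    static String decodeName(byte[] rawName, int generalPurposeFlags) {
        boolean utf8 = (generalPurposeFlags & EFS_FLAG) != 0;
        // Assumption: "IBM437" is available in the running JDK.
        Charset cs = utf8 ? StandardCharsets.UTF_8 : Charset.forName("IBM437");
        return new String(rawName, cs);
    }

    public static void main(String[] args) {
        // 0x84 is the CP437 code for U+00E4 (a with diaeresis).
        byte[] cp437 = {(byte) 0x84, '.', 't', 'x', 't'};
        System.out.println(decodeName(cp437, 0));

        byte[] utf8 = "\u00e4.txt".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeName(utf8, EFS_FLAG));
    }
}
```

Tools that ignore the flag (points 3 and 6 above) effectively skip the check and always apply one fixed charset.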

  I wrote the attached patch for commons-compress rev. 741897, which introduces
the following features:

1) Correctly set the EFS flag, if ZipArchiveOutputStream.setEncoding("UTF-8") is

2) Add support for Unicode extra fields.

3) Encode unencodable characters in the form %Uxxx, avoiding headaches if
Windows users click on their ZIP files.

4) An interoperability test case, which currently fails because
java.util.zip.ZipInputStream (openjdk-6 in my case) always tries to interpret the
filenames as UTF-8, instead of as CP437 if the EFS flag is zero and as UTF-8 if
the EFS flag is set.
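
Feature 3) can be sketched roughly as follows. This is not the code from the attached patch: the class name is hypothetical, and the sketch assumes four uppercase hex digits per UTF-16 code unit after "%U", whereas the text above only says "%Uxxx".

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

/** Hypothetical helper: escape characters the target charset cannot encode. */
final class PercentUEscaper {
    /**
     * Returns the name with every char that cannot be encoded in the target
     * charset replaced by %U followed by its UTF-16 code unit in hex.
     * Surrogate halves are escaped one by one.
     */
    static String escapeUnencodable(String name, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (encoder.canEncode(c)) {
                sb.append(c);
            } else {
                // Assumed format: %U + 4 hex digits.
                sb.append(String.format("%%U%04X", (int) c));
            }
        }
        return sb.toString();
    }
}
```

For example, escaping the name "a\u20acb" (with a euro sign in the middle) against US-ASCII would yield a%U20ACb, which every tool can at least display.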

  Here are my thoughts on further enhancements toward full Unicode support in the
commons-compress ZIP archiver.

1) We need our own ZipInputStream implementation in order to have full control over
the decoding of filenames and comments. Only such an implementation will get us
ready for full interoperability with 7-zip and WinZip.

2) Maybe we should replace setEncoding() with a method setUtf8Mode(boolean),
because the specification definitely allows for CP437 or UTF-8 encodings only,
based on the value of the EFS flag.

3) The output stream could be adapted in such a way that only entries which are
not encodable in CP437 are encoded using the EFS flag and UTF-8.
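
Idea 3) could look roughly like the sketch below: try the legacy charset first and fall back to UTF-8 plus the EFS flag only when the name does not fit. All names here are invented for illustration; the legacy charset is passed in as a parameter, and CP437 would be Charset.forName("IBM437") if the JDK provides it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: choose the name encoding per entry. */
final class EntryEncodingChooser {
    /** Bit 11 of the general purpose bit flag (EFS). */
    static final int EFS_FLAG = 1 << 11;

    /** Encoded name bytes plus the flag bits to write for the entry. */
    static final class EncodedName {
        final byte[] bytes;
        final int flags;

        EncodedName(byte[] bytes, int flags) {
            this.bytes = bytes;
            this.flags = flags;
        }
    }

    /** Uses the legacy charset when it can represent the name, else UTF-8 + EFS. */
    static EncodedName encode(String name, Charset legacy) {
        if (legacy.newEncoder().canEncode(name)) {
            return new EncodedName(name.getBytes(legacy), 0);
        }
        return new EncodedName(name.getBytes(StandardCharsets.UTF_8), EFS_FLAG);
    }
}
```

With this approach, archives containing only CP437-encodable names stay readable by tools that ignore the EFS flag.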

  So I kindly ask you to review my patch and to integrate it into
commons-compress. That way we can make substantial progress on the Unicode filename
issue, and users will finally receive a standard solution in the form of a
robust commons-compress-1.0 release.

  Best regards,


> Enable creation of tool-readable ZIP archives with file names containing non-ASCII characters
> ---------------------------------------------------------------------------------------------
>                 Key: SANDBOX-176
>                 URL:
>             Project: Commons Sandbox
>          Issue Type: Improvement
>          Components: Compress
>         Environment: Any / All
>            Reporter: Christian Gosch
>            Assignee: Stefan Bodewig
>         Attachments: commons-compress-utf8-creation-svn741897.patch,,
> Currently it is not possible to generate externally readable ZIP archives with*
or org.apache.commons.compress.* when entries to include shall have names with characters
outside US-ASCII. This should be changed to enable at least org.apache.commons.compress.*
to produce ZIP archives in international context which are readable by usual ZIP archiver
tools like pkzip, gzip, WinZIP, PowerArchiver, WinRAR / rar, StuffIt...
> For* this is due to a really old flaw on handling entry names: They are
just always rendered as UTF-8, which is kind of Java specific, and not as Cp437, which is
expected and written by most ZIP archiver tools (or eventually all). For more details see:
> For* the "compress & save" operation can
be easily improved by extending ZipArchive:
> // Add member:
>     protected String m_encoding = null;
> // Add constructor:
>     public ZipArchive(String encoding) {
>         m_encoding = encoding;
>     }
> // Extend doSave(FileOutputStream):
> // ...
>         // Pack-Operation
>         ZipOutputStream out = null;
>         try {
>             out = new ZipOutputStream(new BufferedOutputStream(output));
>             if (m_encoding != null) {   // added
>                 out.setEncoding(m_encoding);   // added
>             }  // added
>             while (iterator.hasNext()) {
> // ...
> Now it is possible to instantiate a ZipArchive with "Cp437" as encoding, and external
tools can figure out the original entry names even if they contain non-ASCII characters. (On
the other hand, Java cannot read back & inflate such an archive since it expects UTF-8!)
> The "read & inflate" operation for ZipArchive is more difficult to extend since it
currently relies completely on* . The other reason is that ZIP archives do
not contain any hint on the character encoding used for file names etc. It seems that the
usual tools simply use Cp437 and Java simply uses UTF-8, without any stated reason.
Thus an extractor has to guess.
> For TarArchive the problem is unclear. Here the commons-compress implementation does
not rely on third-party code as far as I can see, and TAR is no Java-bound file type (like
JAR, which is Java-bound). Thus chances are, that everything works well, even when entry names
with non-ASCII characters come into play.
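
The Cp437 round trip described in the quoted issue can be sketched with plain JDK classes. Note the assumptions: this uses the java.util.zip constructors that accept a Charset, which were only added in Java 7 (after this message was written), it is not the ZipArchive code from the issue, and it assumes the "IBM437" charset is available. The class name is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical demo: write and read entry names with an explicit charset. */
final class ZipCharsetRoundTrip {
    /** Writes a one-entry archive whose entry name is encoded with cs. */
    static byte[] writeArchive(String entryName, Charset cs) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer, cs)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write(new byte[] {'x'});
            out.closeEntry();
        }
        return buffer.toByteArray();
    }

    /** Reads the first entry name, decoding it with cs. */
    static String readFirstName(byte[] archive, Charset cs) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(archive), cs)) {
            ZipEntry entry = in.getNextEntry();
            return entry == null ? null : entry.getName();
        }
    }

    public static void main(String[] args) throws IOException {
        Charset cp437 = Charset.forName("IBM437"); // assumed to be installed
        byte[] archive = writeArchive("\u00e4.txt", cp437);
        System.out.println(readFirstName(archive, cp437));
    }
}
```

Since the archive itself carries no charset hint in this case, the reader only recovers the name because it is told to use the same charset as the writer.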

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
