commons-issues mailing list archives

From "Wolfgang Glas (JIRA)" <>
Subject [jira] Commented: (SANDBOX-176) Enable creation of tool-readable ZIP archives with file names containing non-ASCII characters
Date Wed, 11 Feb 2009 10:37:00 GMT


Wolfgang Glas commented on SANDBOX-176:

Hi Stefan,

  Right now, I've got some more time to study your comments and to review the ZIP appnote
as well as

1) I did not find any hint in the appnote indicating which "version needed to extract"
should be set when the EFS flag is set. I've already experimented with setting "version needed
to extract" to 6.3 when the EFS flag is set. However, all ZIP programs seem to ignore the
"version needed to extract".

2) Nevertheless, using the EFS seems to be discouraged, because there are too many programs
out there which do not cope with EFS/UTF-8. At least WinZip seems to stick to this policy, as
it writes Unicode extra fields for entries not encodable by CP437.

3) If I understand you right, we should keep the 'setEncoding(String)' interface, as too many
Ant API users are used to it.

4) I think we should introduce a parameter 'setFallbackToEFS(boolean)', which uses UTF-8
and the EFS flag for entries not encodable by the encoding set through setEncoding(String).

5) Additionally, we may add a parameter 'setFallbackToUnicodeExtras(boolean)', which triggers
the creation of UnicodePath/UnicodeComment extra fields for names not encodable by the encoding
set through setEncoding(String).

6) We might consider adding a method tuneForUnicodeCompatibility(), which arranges the
default settings in a way that is compatible with most decompressors, to the best knowledge of
the implementors. This method may arrange for a different setting of parameters as decoders
adopt new standards in the future.

7) IMHO there are many situations where someone might decode a ZIP stream instead of a ZIP
file: resources which are read from a jar file rather than a classes folder, servlet input
streams, etc. Therefore I'd like to see a Unicode-enabled version of ZipArchiveInputStream
in commons-compress. ZipArchiveInputStream code in of openjdk-6 is about 600
LoC (including the base class InflaterInputStream), so I think it should be possible to provide
a reimplementation.

8) Do you know whether it is possible to take openjdk-6 code and import it into commons-compress?
Are there license issues with such an import?

9) How about JDK/JRE compliance? My implementation of ZipEncodingHelper uses java.nio.charset.Charset,
which is part of JRE 1.4. Does commons-compress still target JRE 1.3, or is it OK to use 1.4?
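
To make points 4), 5) and 9) more concrete, here is a minimal sketch of how the fallback
decision and a Unicode path extra field might be built on top of java.nio.charset. It is only
an illustration under assumptions: the class name ZipNameFallbackSketch and all of its methods
are hypothetical and not part of commons-compress or of the attached patch. The field layout
(header ID 0x7075, version byte, CRC-32 of the header name, UTF-8 name) and the EFS flag as
general purpose bit 11 follow my reading of the appnote. The sketch uses StandardCharsets for
brevity, which needs a newer JRE than the 1.4 baseline discussed above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical helper; not an existing commons-compress class. */
public class ZipNameFallbackSketch {

    /** General purpose bit 11: language encoding flag (EFS). */
    static final int EFS_FLAG = 1 << 11;

    /** Header ID of the Info-ZIP Unicode Path extra field. */
    static final int UNICODE_PATH_ID = 0x7075;

    /** Returns true if every character of the name is encodable in the charset. */
    static boolean canEncode(Charset charset, String name) {
        return charset.newEncoder().canEncode(name);
    }

    /**
     * Returns the general purpose flags for an entry name: the EFS bit is set
     * only when the configured charset cannot encode the name and the
     * (hypothetical) fallbackToEFS option is enabled.
     */
    static int generalPurposeFlags(Charset charset, String name, boolean fallbackToEFS) {
        return (!canEncode(charset, name) && fallbackToEFS) ? EFS_FLAG : 0;
    }

    /**
     * Builds a complete Unicode Path extra field (header plus data).
     *
     * @param rawHeaderName the name bytes as written to the entry header
     * @param name          the full Unicode entry name
     */
    static byte[] unicodePathExtra(byte[] rawHeaderName, String name) {
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);

        // The CRC lets a reader detect that the header name was changed
        // without the extra field being updated.
        CRC32 crc = new CRC32();
        crc.update(rawHeaderName);

        int dataSize = 1 + 4 + utf8.length; // version + CRC-32 + UTF-8 name
        ByteBuffer buf = ByteBuffer.allocate(4 + dataSize).order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort((short) UNICODE_PATH_ID); // header ID
        buf.putShort((short) dataSize);        // size of the data that follows
        buf.put((byte) 1);                     // version of this extra field
        buf.putInt((int) crc.getValue());      // CRC-32 of the header name
        buf.put(utf8);                         // UTF-8 version of the name
        return buf.array();
    }
}
```

A writer could call canEncode() with the charset given to setEncoding(String), for example
Charset.forName("Cp437") where the JRE provides that charset, and then choose between the EFS
flag and the extra field depending on which fallback option is enabled.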

  Best regards,


> Enable creation of tool-readable ZIP archives with file names containing non-ASCII characters
> ---------------------------------------------------------------------------------------------
>                 Key: SANDBOX-176
>                 URL:
>             Project: Commons Sandbox
>          Issue Type: Improvement
>          Components: Compress
>         Environment: Any / All
>            Reporter: Christian Gosch
>            Assignee: Stefan Bodewig
>         Attachments: commons-compress-utf8-creation-svn741897.patch,,
> Currently it is not possible to generate externally readable ZIP archives with*
or org.apache.commons.compress.* when entries to include shall have names with characters
outside US-ASCII. This should be changed to enable at least org.apache.commons.compress.*
to produce ZIP archives in international context which are readable by usual ZIP archiver
tools like pkzip, gzip, WinZIP, PowerArchiver, WinRAR / rar, StuffIt...
> For* this is due to a really old flaw on handling entry names: They are
just always rendered as UTF-8, which is kind of Java specific, and not as Cp437, which is
expected and written by most ZIP archiver tools (or eventually all). For more details see:
> For* the "compress & save" operation can
be easily improved by extending ZipArchive:
> // Add member:
>     protected String m_encoding = null;
> // Add constructor:
>     public ZipArchive(String encoding) {
>         m_encoding = encoding;
>     }
> // Extend doSave(FileOutputStream):
> // ...
>         // Pack-Operation
>         ZipOutputStream out = null;
>         try {
>             out = new ZipOutputStream(new BufferedOutputStream(output));
>             if (m_encoding != null) {   // added
>                 out.setEncoding(m_encoding);   // added
>             }  // added
>             while(iterator.hasNext()) {
> // ...
> Now it is possible to instantiate a ZipArchive with "Cp437" as encoding, and external
tools can figure out the original entry names even if they contain non-ASCII characters. (On
the other hand, Java cannot read back & inflate such an archive since it expects UTF-8!)
> The "read & inflate" operation for ZipArchive is more difficult to extend since it
currently relies completely on* . The other reason is that ZIP archives do
not contain any hint on the character encoding used for file names etc. It seems that the
usual tools simply use Cp437 and Java simply uses UTF-8 -- without any declaration of reasons.
Thus an inflater has to guess.
> For TarArchive the problem is unclear. Here the commons-compress implementation does
not rely on third-party code as far as I can see, and TAR is no Java-bound file type (like
JAR, which is Java-bound). Thus chances are, that everything works well, even when entry names
with non-ASCII characters come into play.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
