commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Wolfgang Glas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SANDBOX-176) Enable creation of tool-readable ZIP archives with file names containing non-ASCII characters
Date Thu, 12 Feb 2009 10:57:59 GMT

    [ https://issues.apache.org/jira/browse/SANDBOX-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672946#action_12672946
] 

Wolfgang Glas commented on SANDBOX-176:
---------------------------------------

Hi Christian,

  AFAIKS, you agree with my suggestion to introduce what I would call 'tuning" methods. Maybe
we should call them

tuneForToolsCompatibility()
tuneForJavaCompatibility()

>From the technical point of view we have to introduce two or three detailed settings ("expert
options") as mentioned in my earlier post, but as I understand you, that seems to be alright.

The point why I use the java.nio.charset API in my patch is, that 'String.getBytes(String
encoding)' returns question marks ('?') for unencodable characters.  A question mark is an
invalid character for a Windows file name, so Windows skips questionable entries. At this
point the CharsetDecoder CharsetEncoder classes enter the stage. With help of them you can
implement a fine-grained file name encoding strategy, which replaces unencodable characters
by some ASCII-construct (my patch introduces "%Uxxxx", but that's not normative...), which
is not pretty but does not lead to "hidden" entries.

If you tell me how to implement such a strategy by JRE-1.3 means, I will try to stick to your
suggestion ;-)

   Wolfgang

> Enable creation of tool-readable ZIP archives with file names containing non-ASCII characters
> ---------------------------------------------------------------------------------------------
>
>                 Key: SANDBOX-176
>                 URL: https://issues.apache.org/jira/browse/SANDBOX-176
>             Project: Commons Sandbox
>          Issue Type: Improvement
>          Components: Compress
>         Environment: Any / All
>            Reporter: Christian Gosch
>            Assignee: Stefan Bodewig
>         Attachments: commons-compress-utf8-creation-svn741897.patch, utf8-7zip-test.zip,
utf8-winzip-test.zip
>
>
> Currently it is not possible to generate externally readable ZIP archives with java.util.zip.*
or org.apache.commons.compress.* when entries to include shall have names with characters
outside US-ASCII. This should be changed to enable at least org.apache.commons.compress.*
to produce ZIP archives in international context which are readable by usual ZIP archiver
tools like pkzip, gzip, WinZIP, PowerArchiver, WinRAR / rar, StuffIt...
> For java.util.zip.* this is due to a really old flaw on handling entry names: They are
just always rendered as UTF-8, which is kind of Java specific, and not as Cp437, which is
expected and written by most ZIP archiver tools (or eventually all). For more details see:
> http://bugs.sun.com/bugdatabase/view_bug.do;:YfiG?bug_id=4244499
> http://bugs.sun.com/bugdatabase/view_bug.do;:YfiG?bug_id=4820807
> For org.apache.commons.compress.archivers.zip.* the "compress & save" operation can
be easily improved by extending ZipArchive:
> // Add member:
>     protected String m_encoding = null;
> // Add constructor:
>     public ZipArchive(String encoding) {
>         m_encoding = encoding;
>     }
> // Extend doSave(FileOutputStream):
> // ...
> 		// Pack-Operation
> 		ZipOutputStream out = null;
> 		try {
> 			out = new ZipOutputStream(new BufferedOutputStream(output));
>             if (m_encoding != null) {   // added
>                 out.setEncoding(m_encoding);   // added
>             }  // added
> 			while(iterator.hasNext()) {
> // ...
> Now it is possible to instantiate a ZipArchive with "Cp437" as encoding, and external
tools can figure out the original entry names even if they contain non-ASCII characters. (On
the other hand, Java cannot read back & deflate such an archive since it expects UTF-8!)
> The "read & deflate" operation for ZipArchive is more difficult to extend since it
currently relies completely on java.util.zip.* . The other reason is, that ZIP archives do
not contain any hint on the character encoding used for file names etc. It seems that the
usual tools simply use Cp437 and Java simply uses UTF-8 -- without any declaration of reasons.
Thus a deflater has to try.
> For TarArchive the problem is unclear. Here the commons-compress implementation does
not rely on third-party code as far as I can see, and TAR is no Java-bound file type (like
JAR, which is Java-bound). Thus chances are, that everything works well, even when entry names
with non-ASCII characters come into play.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message