commons-dev mailing list archives

From Wolfgang Glas <wolfgang.g...@ev-i.at>
Subject Re: [compress] [PATCH] Refactoring of zip encoding support.
Date Tue, 03 Mar 2009 08:12:08 GMT
Stefan Bodewig wrote:
> On 2009-03-02, Wolfgang Glas <wolfgang.glas@ev-i.at> wrote:
> 
>> Stefan Bodewig wrote:
>>> On 2009-03-01, Wolfgang Glas <wolfgang.glas@ev-i.at> wrote:
> 
>>>> 1) Unicode extra fields are written for all ZIP entries, not only
>>>> for entries that cannot be encoded with the encoding set on
>>>> ZipArchiveOutputStream.
> 
>>> Maybe room for yet another flag?  Or an enum-like option
> 
>>> setCreateUnicodeExtraFields(NEVER | ALWAYS | NOT_ENCODABLE)
> 
> Consider the WinZIP case: WinZIP wouldn't recognize the EFS.  If you
> set the encoding to UTF-8 and use your code and only add extra fields
> for non-encodable paths, WinZIP will never see the correct path.

According to my tests, WinZip recognizes the EFS flag when reading. When writing,
WinZip uses extra fields and encodes filenames as Cp437, which is really the
most useful variant these days.

Secondly, if you set the encoding to UTF-8, there's no need for unicode extra
fields anyway. But as mentioned above, the most portable, tool-readable variant,
as requested by the reporter of the original SANDBOX-176 issue, is writing Cp437
and adding unicode extra fields. EFS support in the wild is not really
widespread, probably due to a mid-air collision between the writing of the
specification and the implementation of widespread ZIP tools.
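
To make the "not encodable" case concrete, here is a minimal sketch of how
the check could look. The class and method names are hypothetical and not
part of commons-compress; the only JDK calls used are Charset.newEncoder()
and CharsetEncoder.canEncode(CharSequence). Cp437 is looked up under its JDK
name "IBM437" and guarded, since it lives in the extended charsets and may
be missing from a stripped-down runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of commons-compress. */
final class UnicodeExtraFieldCheck {

    /** Returns true if every character of the name can be written in the given charset. */
    static boolean isEncodable(String name, Charset charset) {
        // A fresh encoder per call, because CharsetEncoder instances are not thread-safe.
        return charset.newEncoder().canEncode(name);
    }

    /** "Not encodable" policy: a unicode extra field is only needed if the name does not fit. */
    static boolean needsUnicodeExtraField(String name, Charset charset) {
        return !isEncodable(name, charset);
    }

    public static void main(String[] args) {
        // Assumption: fall back to US-ASCII if this runtime does not ship IBM437 (Cp437).
        Charset cs = Charset.isSupported("IBM437")
                ? Charset.forName("IBM437")
                : StandardCharsets.US_ASCII;
        String[] names = {"readme.txt", "\u20ac_price.txt"};
        for (String name : names) {
            System.out.println(name + " -> extra field needed: "
                    + needsUnicodeExtraField(name, cs));
        }
    }
}
```

With such a check, a writer that uses Cp437 for the regular name field would
only add the unicode extra field for names like the second one, which
contains a euro sign that Cp437 cannot represent.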

>> I like the idea of a unicode policy flag ;-)
> 
> May be a better approach, agreed.  But only if we manage to cover all
> border cases.
> 
>> My suggestion is
> 
>> setUnicodePolicy(
>>   SURROGATES   | /* no extra fields, no utf-8 fallback, only %Uxxxx surrogates*/
>>   EXTRA_FIELDS | /* extra fields for unencodable entries, no utf-8 fallback   */
>>   EXTRA_FIELDS_ALWAYS | /* extra fields for all entries, no utf-8 fallback    */
>>   UTF8_FALLBACK| /* fall back to utf-8 plus EFS flag for unencodable entries. */
>>   UTF8_FALLBACK_EXTRA_FIELDS| /* fall back to utf-8 plus EFS flag plus extra
>>                                  fields for unencodable */
>>   UTF8_FALLBACK_EXTRA_FIELDS_ALWAYS /* fall back to utf-8 plus EFS flag for
>>                                        unencodable entries, extra fields for all
>>                                        entries. */
>> )
> 
>> We might drop the last two options and we might choose a better
>> wording, however the direction should IMHO be as mentioned above...
> 
> This covers all permutations, agreed.
> 
> Names, names, I'm really bad at them.
> 
> EXTRA_FIELDS                      => ADD_EXTRA_FIELDS_FOR_UNENCODABLE
> EXTRA_FIELDS_ALWAYS               => ADD_EXTRA_FIELDS
> UTF8_FALLBACK                     => FALL_BACK_TO_UTF8
> UTF8_FALLBACK_EXTRA_FIELDS        => FALL_BACK_TO_UTF8_PLUS_EXTRA_FIELD
> UTF8_FALLBACK_EXTRA_FIELDS_ALWAYS => FALL_BACK_TO_UTF8_ADD_EXTRA_FIELDS
> 
> but looking at the names we may be better off with two independent
> options.  Hmm, yes, right now I prefer two flags because they seem to
> be orthogonal.

I think you should choose whichever approach better fits your needs in Ant ;-)
After all, you have to write the XML parsing for these settings and the
documentation, so you might pick the approach that can be explained in a few words.

I can live very well with two options ;-)
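
For what it's worth, here is a rough sketch of how the two-option variant
could model the six policies from my list. All names are hypothetical: the
enum values are borrowed from the NEVER | ALWAYS | NOT_ENCODABLE suggestion,
and the boolean stands for the UTF-8 fallback. I am assuming that an
unencodable name without UTF-8 fallback ends up with %Uxxxx surrogates in
the regular name field.

```java
/** Hypothetical model of the two independent options; not an existing API. */
final class UnicodePolicySketch {

    /** When to add unicode extra fields. */
    enum ExtraFields { NEVER, ALWAYS, NOT_ENCODABLE }

    /** What the writer would do for one entry. */
    static final class Decision {
        final boolean useUtf8WithEfs;  // write the name as UTF-8 and set the EFS flag
        final boolean addExtraField;   // add a unicode extra field

        Decision(boolean useUtf8WithEfs, boolean addExtraField) {
            this.useUtf8WithEfs = useUtf8WithEfs;
            this.addExtraField = addExtraField;
        }
    }

    /**
     * Combines the two options for a single entry.
     *
     * @param encodable    whether the name fits into the configured encoding
     * @param utf8Fallback option 1: fall back to UTF-8 plus EFS for unencodable names
     * @param extraFields  option 2: when to add unicode extra fields
     */
    static Decision decide(boolean encodable, boolean utf8Fallback, ExtraFields extraFields) {
        boolean useUtf8 = utf8Fallback && !encodable;
        boolean addField = extraFields == ExtraFields.ALWAYS
                || (extraFields == ExtraFields.NOT_ENCODABLE && !encodable);
        return new Decision(useUtf8, addField);
    }

    public static void main(String[] args) {
        // Corresponds to UTF8_FALLBACK_EXTRA_FIELDS applied to an unencodable name.
        Decision d = decide(false, true, ExtraFields.NOT_ENCODABLE);
        System.out.println("utf-8 + EFS: " + d.useUtf8WithEfs
                + ", extra field: " + d.addExtraField);
    }
}
```

The two settings give 2 x 3 = 6 combinations, which line up with the six
policies: no fallback with NEVER is SURROGATES, no fallback with
NOT_ENCODABLE is EXTRA_FIELDS, and so on up to fallback with ALWAYS for
UTF8_FALLBACK_EXTRA_FIELDS_ALWAYS.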

  Wolfgang
