commons-dev mailing list archives

From Damjan Jovanovic <damjan....@gmail.com>
Subject Re: [compress] Detecting LZMA standalone files
Date Mon, 10 Jun 2013 07:53:53 GMT
On Mon, Jun 10, 2013 at 6:25 AM, Stefan Bodewig <bodewig@apache.org> wrote:
> Hi,
>
> when I added support for decompressing .lzma files I left out matches()
> and you can only get an LZMACompressorInputStream from
> CompressorStreamFactory if you use the version that explicitly specifies
> the format.
>
> The reason is that the old .lzma format doesn't have any sort of
> signature at all.  I've been told that if you try to "unlzma" a plain
> text file the most likely outcome is an OutOfMemoryError.
>
> The native XZ Utils code which is used for xz as well as lzma contains some
> heuristic that allows the xz command to guess the input format.  It
> first checks whether the input is xz and if not whether the settings
> that would make up the start of an LZMA stream don't look too strange.
>
> We could do something similar by placing the LZMA check at the end in
> the CompressorStreamFactory's autodetect method and perform the same
> plausibility checks on the input.  This would still run the risk of
> false positives and - maybe less likely - false negatives.  Do we want
> to do something like this?
>
> Stefan
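
To make the plausibility check described above concrete, here is a rough sketch in Java. It assumes the usual 13-byte .lzma header layout (one properties byte, a 4-byte little-endian dictionary size, an 8-byte little-endian uncompressed size). The class name, method names and the 256 GiB size limit are made up for illustration; this is not existing Commons Compress API and not a transcription of what XZ Utils does.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of Commons Compress.
class LzmaHeaderCheck {
    static final int HEADER_SIZE = 13;
    // Assumed upper bound for a "sane" uncompressed size (256 GiB).
    static final long MAX_PLAUSIBLE_SIZE = 1L << 38;

    /** Returns true if the first 13 bytes could plausibly start an .lzma file. */
    static boolean looksLikeLzma(byte[] header) {
        if (header == null || header.length < HEADER_SIZE) {
            return false;
        }
        // Properties byte encodes (pb * 5 + lp) * 9 + lc, so it must be below 225.
        int props = header[0] & 0xFF;
        if (props >= 9 * 5 * 5) {
            return false;
        }
        ByteBuffer buf = ByteBuffer.wrap(header, 1, 12).order(ByteOrder.LITTLE_ENDIAN);
        long dictSize = buf.getInt() & 0xFFFFFFFFL; // treat as unsigned 32 bit
        long uncompressedSize = buf.getLong();      // -1 means "unknown"
        if (!isPlausibleDictSize(dictSize)) {
            return false;
        }
        return uncompressedSize == -1L
                || (uncompressedSize >= 0 && uncompressedSize < MAX_PLAUSIBLE_SIZE);
    }

    /** Accepts 2^n, 2^n + 2^(n-1), or the all-ones maximum value. */
    static boolean isPlausibleDictSize(long d) {
        if (d == 0xFFFFFFFFL) {
            return true;
        }
        if (d == 0) {
            return false;
        }
        if ((d & (d - 1)) == 0) {
            return true; // exact power of two
        }
        // 2^n + 2^(n-1) equals 3 * 2^(n-1)
        return d % 3 == 0 && ((d / 3) & (d / 3 - 1)) == 0;
    }
}
```

A check like this rejects most plain text because the four bytes after the first one rarely form such a round dictionary size, but it can still yield false positives on arbitrary binary data, which is why it would have to run after all signature-based checks.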

The problem is not unique to LZMA, and since LZMA can contain almost
any bytes at the beginning, it could also be misdetected as another
compression format.

If we can't autodetect all compression formats from the file contents,
then shouldn't we only try to autodetect them from the file extension
or MIME type? Or not do autodetection at all?
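
As a sketch of what extension-based selection could look like: the helper below maps a file name to a format name. The class and method are hypothetical; the returned strings are meant to match the format names that CompressorStreamFactory accepts when the format is given explicitly, but treat that mapping as an assumption to verify. Map.of needs Java 9 or later.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Commons Compress.
class ExtensionFormatGuess {
    // Extension (lower case, without dot) -> assumed format name.
    private static final Map<String, String> FORMAT_BY_EXTENSION = Map.of(
            "gz", "gz",
            "bz2", "bzip2",
            "xz", "xz",
            "lzma", "lzma");

    /** Guesses the compression format from the last extension of the file name. */
    static Optional<String> guessFormat(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(FORMAT_BY_EXTENSION.get(ext));
    }
}
```

The caller would pass the guessed name to the factory method that takes an explicit format, and fall back to content sniffing (or fail) when the result is empty.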

Damjan
