commons-issues mailing list archives

From "Stefan Bodewig (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (COMPRESS-403) Block and Record Size issues in TarArchiveOutputStream
Date Sat, 10 Jun 2017 16:51:20 GMT

     [ https://issues.apache.org/jira/browse/COMPRESS-403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefan Bodewig updated COMPRESS-403:
------------------------------------
    Description: 
According to the pax spec 
 [§4.100.13.01| http://pubs.opengroup.org/onlinepubs/009695399/utilities/pax.html#tag_04_100_13_01]

bq. A pax archive tape or file produced in the -x pax format shall contain a series of blocks.
The physical layout of the archive shall be identical to the ustar format

[§ 4.100.13.06| http://pubs.opengroup.org/onlinepubs/009695399/utilities/pax.html#tag_04_100_13_06]


bq. A ustar archive tape or file shall contain a series of logical records. Each logical record
shall be a fixed-size logical record of 512 octets.
...
bq. The logical records *may* be grouped for physical I/O operations, as described under the
-b blocksize and -x ustar options. Each group of logical records *may* be written with a single
operation equivalent to the write() function. On magnetic tape, the result of this write *shall*
be a single tape physical block. The last physical block *shall* always be the full size,
so logical records after the two zero logical records *may* contain undefined data.

bq. pax. The default blocksize for this format for character special archive files *shall*
be 5120. Implementations *shall* support all blocksize values less than or equal to 32256
that are multiples of 512.

bq. ustar. The default blocksize for this format for character special archive files *shall*
be 10240. Implementations *shall* support all blocksize values less than or equal to 32256
that are multiples of 512.

bq. Implementations are permitted to modify the block-size value based on the archive format
or the device to which the archive is being written. This is to provide implementations with
the opportunity to take advantage of special types of devices, and it should not be used without
a great deal of consideration as it almost certainly decreases archive portability.

The current implementation of TarArchiveOutputStream
# Allows the logical record size to be altered
# Has a default block size of 10240
# Has two separate logical-record-sized buffers, and frequently double-buffers in order to
write to the wrapped output stream in units of a logical record rather than a physical block.
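For reference, both sizes are chosen when the stream is constructed. A minimal sketch against the 1.14 API (assuming the (blockSize, recordSize) constructor overload), showing where the values described above come from:

{code:java}
import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;

public class TarSizesExample {
    public static void main(String[] args) throws Exception {
        try (OutputStream out = new FileOutputStream("example.tar");
             // 10240 is the default block size, 512 the default record size;
             // the current API lets callers override both
             TarArchiveOutputStream tos = new TarArchiveOutputStream(out, 10240, 512)) {
            // entry data is buffered into 512-byte records, which are then
            // re-buffered into 10240-byte blocks before reaching 'out'
            tos.finish();
        }
    }
}
{code}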

I would hazard a guess that very few users of commons-compress are writing directly to a tape
drive, where the block size is of great import. It is also not possible to guarantee that
a subordinate output stream won't buffer in chunks of a different size (5120 and 10240 bytes
aren't ideal for modern hard drives with 4096-byte sectors, or filesystems like ZFS with a
default recordsize of 128K).

The main effect the record and block sizes have is the extra padding they require. For the
purposes of the Java output device, the optimal block size to switch to is probably just
a single record; since all implementations must handle 512-byte blocks, and must detect the
block size on input (or simulate doing so), this cannot affect compatibility.
Fixed-length blocking in multiples of 512 bytes can be supported by wrapping the destination
output stream in a modified BufferedOutputStream that does not permit flushing of partial
blocks, and pads on close. This would only be used as necessary.
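One possible shape for that wrapper (a minimal sketch; the class is hypothetical, not existing Commons Compress API): it only ever passes complete blocks to the wrapped stream, ignores flushes of partial blocks, and zero-pads the final block on close.

{code:java}
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

/** Hypothetical fixed-block wrapper: whole blocks only, padded on close. */
class FixedBlockOutputStream extends FilterOutputStream {
    private final byte[] block;
    private int pos;

    FixedBlockOutputStream(OutputStream out, int blockSize) {
        super(out);
        if (blockSize <= 0 || blockSize % 512 != 0) {
            throw new IllegalArgumentException("block size must be a positive multiple of 512");
        }
        block = new byte[blockSize];
    }

    @Override
    public void write(int b) throws IOException {
        block[pos++] = (byte) b;
        if (pos == block.length) {
            out.write(block, 0, block.length); // one write() per physical block
            pos = 0;
        }
    }

    @Override
    public void flush() throws IOException {
        // deliberately do not flush a partial block; only complete blocks leave this layer
        out.flush();
    }

    @Override
    public void close() throws IOException {
        if (pos > 0) {
            Arrays.fill(block, pos, block.length, (byte) 0); // pad the final block with NULs
            out.write(block, 0, block.length);
            pos = 0;
        }
        super.close();
    }
}
{code}

(A real version would also override write(byte[], int, int) to avoid byte-at-a-time copying, but the behaviour would be the same.)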
 
If a record size of 512 bytes is being used, it could be useful to store that information
in an extended header at the start of the file. That allows for in-place appending to an archive
without having to read the entire archive first (as long as the original end-of-archive location
is journaled to support recovery). 
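If that route were taken, the record size could travel as an ordinary pax extended-header record. A minimal sketch of the "<length> <keyword>=<value>\n" encoding the pax format uses (the keyword name below is purely hypothetical, not a standardized or existing key):

{code:java}
public class PaxRecordExample {
    /**
     * Encodes one pax extended-header record. The leading decimal length
     * counts the entire record, including the length digits themselves,
     * the space, and the trailing newline.
     */
    static String paxRecord(String keyword, String value) {
        String body = " " + keyword + "=" + value + "\n"; // assumes ASCII content
        int len = body.length();
        while (String.valueOf(len).length() + body.length() > len) {
            len++; // grow the length until the length field accounts for itself
        }
        return len + body;
    }

    public static void main(String[] args) {
        // hypothetical keyword, for illustration only
        System.out.print(paxRecord("COMMONS.recordsize", "512"));
        // prints: 26 COMMONS.recordsize=512
    }
}
{code}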

There is even an advantage for xz-compressed files, as every block but the last can be copied
without having to decompress and recompress.
In the latter scenario, it would be useful to be able to signal to the subordinate layer to
start a new block before writing the final 1024 nulls; in that situation, either a new block
can be started overwriting the EOA and xz index blocks, with the saved index info written back
at the end, or the block immediately preceding the EOA markers can be decompressed and recompressed,
which will rebuild the dictionary and index structures to allow the block to be continued.
That's a different issue.
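For what that signalling might look like in the xz case, a rough sketch assuming XZ for Java's org.tukaani.xz.XZOutputStream and its endBlock() method; the layering shown here uses that library directly and is an assumption about how it could be wired up, not a description of Commons Compress's current XZCompressorOutputStream API:

{code:java}
import java.io.FileOutputStream;

import org.tukaani.xz.LZMA2Options;
import org.tukaani.xz.XZOutputStream;

public class XzBlockBoundaryExample {
    public static void main(String[] args) throws Exception {
        try (XZOutputStream xz = new XZOutputStream(
                new FileOutputStream("archive.tar.xz"), new LZMA2Options())) {
            // ... tar entries would be written here ...

            // finish the current xz block so everything before this point can later
            // be copied verbatim; the end-of-archive records then live in a block
            // of their own that can be overwritten when appending
            xz.endBlock();
            xz.write(new byte[1024]); // the two 512-byte zero records that end a tar archive
        }
    }
}
{code}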



> Block and Record Size issues in  TarArchiveOutputStream 
> --------------------------------------------------------
>
>                 Key: COMPRESS-403
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-403
>             Project: Commons Compress
>          Issue Type: Improvement
>          Components: Archivers
>    Affects Versions: 1.14
>            Reporter: Simon Spero
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
