hive-dev mailing list archives

From yongqiang he <heyongqiang...@gmail.com>
Subject Re: RCFile - some queries
Date Fri, 18 Mar 2011 18:17:13 GMT
>> but the recordLength is not the actual on-disk length of the record.
It is the actual on-disk length: the compressed key length plus the
compressed value length.

>> Similarly, the next field - key length - is not the on-disk length of the compressed key.

There are two key lengths: one is the compressed key length, the other
is the uncompressed key length.

For 2, it won't be a problem. The record length is the compressed length.
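
To illustrate why seeking by record length lands correctly when that
length is the on-disk length, here is a minimal Java sketch. The field
order (record length, uncompressed key length, compressed key length,
key bytes, value bytes) is a simplified assumption based on the
description above, and the class and method names are hypothetical. It
is not the RCFile code, and sync markers are left out.

```java
import java.nio.ByteBuffer;

// Hypothetical, simplified record layout for illustration only.
final class RecordLengthSketch {
    // Three int fields precede the payload in this assumed layout.
    static final int HEADER_BYTES = 3 * Integer.BYTES;

    // Encodes one record; the payloads are assumed to be already compressed.
    static byte[] encode(int uncompressedKeyLength,
                         byte[] compressedKey,
                         byte[] compressedValue) {
        // Record length = compressed key length + compressed value length.
        int recordLength = compressedKey.length + compressedValue.length;
        ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + recordLength);
        buf.putInt(recordLength);
        buf.putInt(uncompressedKeyLength);
        buf.putInt(compressedKey.length);
        buf.put(compressedKey);
        buf.put(compressedValue);
        return buf.array();
    }

    // Returns the offset of the record that follows the one at 'offset'.
    static int nextRecordOffset(byte[] data, int offset) {
        int recordLength = ByteBuffer.wrap(data).getInt(offset);
        return offset + HEADER_BYTES + recordLength;
    }
}
```

If the first field held the uncompressed key length plus the compressed
value length instead, nextRecordOffset() would overshoot whenever the
key compresses to fewer bytes, which is the concern raised in point 2.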

>>Thread-Safety.
It is not thread safe. Applications should handle synchronization themselves.
It was initially designed for Hive. Thread safety was there at first
and was then removed because Hive does not need it, and
'synchronized' may add extra overhead.
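
One way an application could do this itself is to route every access to
a shared reader through a single lock. The sketch below is only an
illustration: GuardedReader and withReader are hypothetical names, and
the type parameter R stands in for whatever reader object is shared. It
does not use or change any RCFile API.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

// Hypothetical wrapper that serializes access to a non-thread-safe reader.
final class GuardedReader<R> {
    private final R reader;
    private final ReentrantLock lock = new ReentrantLock();

    GuardedReader(R reader) {
        this.reader = reader;
    }

    // Runs 'action' while holding the lock, so at most one thread
    // touches the wrapped reader at any time.
    <T> T withReader(Function<R, T> action) {
        lock.lock();
        try {
            return action.apply(reader);
        } finally {
            lock.unlock();
        }
    }
}
```

A sequence of calls that must be atomic (for example, advancing and then
reading the current value) would go inside one withReader() call. Any
lazily decompressed value would also need to be consumed while the lock
is held, since the callback works on the reader's current buffer.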

>>3.1
Reader.nextBlock() was added later for file merge, so the normal reader
should not use this method.
>>3.2.
True.

On Fri, Mar 18, 2011 at 8:30 AM, Krishna Kumar <krishnak@yahoo-inc.com> wrote:
> Hello,
>
>    I was looking into the RCFile format, esp when used with compression; a
> picture of the file layout as I understand it in this case is attached.
>
>    Some queries/potential issues:
>
>    1. RCFile makes a claim of being sequence file compatible; but the
> recordLength is not the actual on-disk length of the record. As shown in the
> picture, it is the uncompressed key length plus the compressed value length.
> Similarly, the next field - key length - is not the on-disk length of the
> compressed key.
>
>    2. Record Length is also used for seeking on the inputstream. See
> Reader.seekToNextKeyBuffer(). Since record length is overstated for
> compressed records, this can result in incorrect positioning.
>
>    3. Thread-Safety: Is the RCFile.Reader class meant to be thread-safe?
> Some public methods are marked synchronized which gives that appearance but
> there are a few thread-safety issues I think.
>
>        3.1 Other public methods, such as Reader.nextBlock() are not
> synchronized which operate on the same data structures.
>
>        3.2. Callbacks such as LazyDecompressionCallbackImpl.decompress
> operates on the valuebuffer currentValue, which can be simultaneously
> modified by the public methods on the Reader.
>
> Cheers,
>  Krishna
>
>
