hive-dev mailing list archives

From "Krishna Kumar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-2065) RCFile issues
Date Fri, 08 Apr 2011 01:31:05 GMT

    [ https://issues.apache.org/jira/browse/HIVE-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017229#comment-13017229 ]

Krishna Kumar commented on HIVE-2065:
-------------------------------------

The minor version is needed so that we can still read 6.0 files correctly. To recap, 6.0 files
store an incorrect record length, and the reader makes the necessary recalculations to fix
it up while reading; files from 6.1 onwards have the correct record length stored on disk.
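
To make the reader-side fix-up concrete, here is a minimal sketch, not the code from the patch. The class and method names are hypothetical, and it assumes that the 6.0 writer overstated the record length by exactly the difference between the uncompressed and compressed key lengths (which is what issue 2 in the description below suggests):

```java
public class RecordLengthFixup {
    /**
     * Hypothetical helper: returns the record length a reader should use.
     * Assumption: minor version 0 (6.0) stored uncompressedKeyLength + valueLength
     * even for compressed files, while minor version 1 (6.1) onwards stores the
     * length of what is actually on disk.
     */
    static int effectiveRecordLength(int minorVersion,
                                     boolean compressed,
                                     int storedRecordLength,
                                     int uncompressedKeyLength,
                                     int compressedKeyLength) {
        if (minorVersion >= 1 || !compressed) {
            // 6.1+ files, and uncompressed 6.0 files, need no correction.
            return storedRecordLength;
        }
        // 6.0 compressed file: swap the uncompressed key length for the compressed one.
        return storedRecordLength - uncompressedKeyLength + compressedKeyLength;
    }
}
```

Without a minor version in the header, the reader would have no way to tell which of the two branches applies to a given file.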

[PS. I had suggested bumping up the sequence file version to 7 in a comment above, but I think
a minor version is a better idea. The layout itself is still 'kinda sorta' version-6-compatible.
For all we know, there may be a sequence file version 7, and then sequence file version 7
and rc file version 7 would be divergent.]

[PPS. For the sake of completeness of documentation, here are the reasons why the layout, even
after the current patch, still falls short of complete version-6 compatibility: [a] The KeyBuffer,
designated as the key class, cannot read or write itself from/to the disk stream, because
reading/writing the 4-byte key contents length field and the compression/decompression are
done by the reader/writer and not by the KeyBuffer class; and [b] the ValueBuffer, the value
class, must be compressed as a single unit to be compatible with the sequence file reader/writer,
but it is actually compressed as several units.]

> RCFile issues
> -------------
>
>                 Key: HIVE-2065
>                 URL: https://issues.apache.org/jira/browse/HIVE-2065
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Krishna Kumar
>            Assignee: Krishna Kumar
>            Priority: Minor
>         Attachments: HIVE.2065.patch.0.txt, HIVE.2065.patch.1.txt, Slide1.png, proposal.png
>
>
> Some potential issues with RCFile
> 1. Remove unwanted synchronized modifiers on the methods of RCFile. As per Yongqiang
> He, the class is not meant to be thread-safe (and it is not). Might as well get rid of the
> confusing and performance-impacting lock acquisitions.
> 2. Record length is overstated for compressed files. IIUC, the key compression happens after
> we have written the record length, so the length written uses the uncompressed key size.
> {code}
>       int keyLength = key.getSize();
>       if (keyLength < 0) {
>         throw new IOException("negative length keys not allowed: " + key);
>       }
>       out.writeInt(keyLength + valueLength); // total record length
>       out.writeInt(keyLength); // key portion length
>       if (!isCompressed()) {
>         out.writeInt(keyLength);
>         key.write(out); // key
>       } else {
>         keyCompressionBuffer.reset();
>         keyDeflateFilter.resetState();
>         key.write(keyDeflateOut);
>         keyDeflateOut.flush();
>         keyDeflateFilter.finish();
>         int compressedKeyLen = keyCompressionBuffer.getLength();
>         out.writeInt(compressedKeyLen);
>         out.write(keyCompressionBuffer.getData(), 0, compressedKeyLen);
>       }
> {code}
> 3. For sequence file compatibility, the field immediately after the record length should be
> the compressed key length, not the uncompressed key length.
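
For illustration, issues 2 and 3 in the description above amount to compressing the key before any length is written. The following is a minimal, self-contained sketch of that ordering, not the RCFile writer itself: the class and method names are hypothetical, it uses `java.util.zip.DeflaterOutputStream` in place of the Hadoop codec classes, and it leaves out where the uncompressed key length would go.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;

public class CompressedRecordSketch {
    /**
     * Hypothetical helper: writes [record length][compressed key length][key][value],
     * compressing the key first so both lengths reflect the bytes actually stored.
     */
    static byte[] writeRecord(byte[] key, byte[] value) {
        try {
            // Step 1: compress the key into a side buffer to learn its on-disk size.
            ByteArrayOutputStream compressedKey = new ByteArrayOutputStream();
            try (DeflaterOutputStream deflate = new DeflaterOutputStream(compressedKey)) {
                deflate.write(key);
            }
            int compressedKeyLen = compressedKey.size();

            // Step 2: only now write the lengths, followed by the payload.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(compressedKeyLen + value.length); // total record length
            out.writeInt(compressedKeyLen);                // key length as stored
            compressedKey.writeTo(out);                    // compressed key bytes
            out.write(value);                              // value bytes
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this ordering, a reader that skips `record length` bytes after the two length fields lands at the start of the next record, which is not the case when the uncompressed key length is used.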

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
