hive-dev mailing list archives

From "He Yongqiang (JIRA)" <>
Subject [jira] [Commented] (HIVE-2065) RCFile issues
Date Thu, 31 Mar 2011 23:14:06 GMT


He Yongqiang commented on HIVE-2065:

Column-specific compression is very interesting, but it is not directly related to making
RCFile compatible with SequenceFile. We can still do that without this compatibility.

Some input that may be useful to you:
we experimented with column groups, sorting the data internally on one column within a column
group. (We did not try different compression codecs across column groups.) We tried this on 3-4
tables and saw ~20% storage savings on one table compared with the previous RCFile. The main
problem with this approach is that it is hard to find the correct, most efficient column
group definitions.
For example, if table tbl_1 has 20 columns, a user can define:


This puts col_1, col_2, col_11, and col_13 into one column group and reorders that column group
by sorting on col_1 (0 is the first column in this column group); puts col_3, col_4, col_15,
and col_16 into another column group and reorders it by sorting on col_4; and finally puts all
other columns into the default column group in their original order.
It should also be easy to allow a different compression codec for each column group.
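
The grouping described above can be sketched roughly as follows. This is only an illustration,
not the code mentioned in this message: ColumnGroupSketch, ColumnGroup, project and defaultGroup
are hypothetical names, rows are assumed to be plain String arrays, and table-level column
indexes are assumed to be 0-based. It also leaves open how rows are re-aligned across groups
after sorting, which this message does not describe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these names come from Hive.
final class ColumnGroupSketch {

    /** A column group: table column indexes plus the in-group position of the sort column. */
    static final class ColumnGroup {
        final int[] columns;    // table-level column indexes (assumed 0-based)
        final int sortPosition; // position inside `columns` (0 = first); -1 keeps original order

        ColumnGroup(int[] columns, int sortPosition) {
            this.columns = columns;
            this.sortPosition = sortPosition;
        }
    }

    /** Projects the rows onto one group and reorders them by the group's sort column. */
    static List<String[]> project(List<String[]> rows, ColumnGroup group) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            String[] projected = new String[group.columns.length];
            for (int i = 0; i < group.columns.length; i++) {
                projected[i] = row[group.columns[i]];
            }
            out.add(projected);
        }
        if (group.sortPosition >= 0) {
            // List.sort is stable, so rows with equal sort values keep their relative order.
            out.sort(Comparator.comparing((String[] r) -> r[group.sortPosition]));
        }
        return out;
    }

    /** Collects every column not claimed by an explicit group into the default group. */
    static ColumnGroup defaultGroup(int columnCount, List<ColumnGroup> groups) {
        boolean[] taken = new boolean[columnCount];
        for (ColumnGroup g : groups) {
            for (int c : g.columns) {
                taken[c] = true;
            }
        }
        int[] rest = new int[columnCount];
        int n = 0;
        for (int c = 0; c < columnCount; c++) {
            if (!taken[c]) {
                rest[n++] = c;
            }
        }
        // -1 means "no sort column": the default group keeps the original row order.
        return new ColumnGroup(Arrays.copyOf(rest, n), -1);
    }
}
```

Under these assumptions, the first group in the example would be new ColumnGroup(new int[]{0, 1, 10, 12}, 0),
and a per-group codec could be attached as one more field on ColumnGroup.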

The main blocker for this approach is having a full set of utilities to find the best column
group definition.
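
One naive building block for such a utility might be to compress a column's values under each
candidate ordering and compare the sizes. The sketch below does only that. It is not an existing
tool: GroupSizeEstimateSketch and compressedSize are hypothetical names, DEFLATE stands in for
whatever codec the table actually uses, and the data in main is synthetic.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical illustration only; not part of Hive or RCFile.
final class GroupSizeEstimateSketch {

    /** Returns the DEFLATE-compressed size, in bytes, of the values stored one after another. */
    static int compressedSize(List<String> columnValues) {
        byte[] input = String.join("\n", columnValues).getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        // Drain the compressor; only the byte count matters here, not the output itself.
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // Synthetic low-cardinality column: 1000 values drawn from ten distinct strings.
        Random random = new Random(42);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            values.add(Integer.toString(random.nextInt(10)));
        }
        List<String> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        System.out.println("unsorted: " + compressedSize(values) + " bytes");
        System.out.println("sorted:   " + compressedSize(sorted) + " bytes");
    }
}
```

A real utility would have to search over many groupings and sort columns, which is the hard part
this sketch does not attempt.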

Instead of doing that in the existing RCFile, do you think it would be better to explore
it in the new one that I just mentioned? If you find this interesting, we can share the existing
code that we have for the things I mentioned. You could then work on the compression codec based
on the new one, and provide a utility to find the best column group definition.

What do you think?

> RCFile issues
> -------------
>                 Key: HIVE-2065
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Krishna Kumar
>            Assignee: Krishna Kumar
>            Priority: Minor
>         Attachments: HIVE.2065.patch.0.txt, Slide1.png, proposal.png
> Some potential issues with RCFile
> 1. Remove unwanted synchronized modifiers on the methods of RCFile. As per yongqiang he,
> the class is not meant to be thread-safe (and it is not). Might as well get rid of the
> confusing and performance-impacting lock acquisitions.
> 2. Record Length overstated for compressed files. IIUC, the key compression happens after
> we have written the record length.
> {code}
>       int keyLength = key.getSize();
>       if (keyLength < 0) {
>         throw new IOException("negative length keys not allowed: " + key);
>       }
>       out.writeInt(keyLength + valueLength); // total record length
>       out.writeInt(keyLength); // key portion length
>       if (!isCompressed()) {
>         out.writeInt(keyLength);
>         key.write(out); // key
>       } else {
>         keyCompressionBuffer.reset();
>         keyDeflateFilter.resetState();
>         key.write(keyDeflateOut);
>         keyDeflateOut.flush();
>         keyDeflateFilter.finish();
>         int compressedKeyLen = keyCompressionBuffer.getLength();
>         out.writeInt(compressedKeyLen);
>         out.write(keyCompressionBuffer.getData(), 0, compressedKeyLen);
>       }
> {code}
> 3. For sequence file compatibility, the compressed key length should be the next field
> to record length, not the uncompressed key length.
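
Items 2 and 3 of the quoted description both come down to compressing the key before any length
is written. A minimal sketch of that ordering follows. It is not the attached patch:
RecordHeaderSketch, compress and writeKeySection are hypothetical names, DEFLATE stands in for
the configured codec, and where the uncompressed key length would be stored is left out because
the description does not say.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;

// Hypothetical illustration only; not the RCFile writer and not the HIVE-2065 patch.
final class RecordHeaderSketch {

    /** Compresses the serialized key with DEFLATE (a stand-in for the configured codec). */
    static byte[] compress(byte[] rawKey) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflateOut = new DeflaterOutputStream(buffer)) {
            deflateOut.write(rawKey);
        } // closing the stream finishes the compressed data
        return buffer.toByteArray();
    }

    /** Writes record length, key length, then key bytes, using the compressed size for both lengths. */
    static void writeKeySection(DataOutputStream out, byte[] rawKey, int valueLength)
            throws IOException {
        byte[] compressedKey = compress(rawKey);       // compress BEFORE any length is written
        int compressedKeyLen = compressedKey.length;
        out.writeInt(compressedKeyLen + valueLength);  // record length matches the bytes written
        out.writeInt(compressedKeyLen);                // key length directly follows record length
        out.write(compressedKey, 0, compressedKeyLen); // compressed key bytes
    }
}
```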
