hbase-dev mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: DISCUSS : HFile V3 proposal for tags in 0.96
Date Wed, 24 Jul 2013 17:30:17 GMT
I was reading Owen's presentation at Hadoop Summit on ORC.

Slide #14 describes how codecs are used for generic compression.

I think we can adopt some of their ideas in HFile v3.


On Fri, Jul 19, 2013 at 9:48 AM, Andrew Purtell <apurtell@apache.org> wrote:

> On Fri, Jul 19, 2013 at 4:23 AM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
> > If tags are activated but empty, is it going to be the
> > same thing? Or are we going to have all the tags overhead? Like can we
> > have a byte to say "no tags in that file" in addition to "tags are
> > activated for that file"?
>
> This reminds me of an interesting discussion we had. So like with
> memstoreTS, if we determine that no cells in a file have tags (or
> timestamps) then we can flag that in file metadata and turn off any related
> persistence when writing out the data blocks. With millions of KVs in a
> file that can achieve substantial space savings. Having a new file format
> on the table also opens up possibilities like block headers: an N-byte
> structure (where N is something like 4 or 8 bytes maybe) at the start of
> each block that describes the encoding strategy taken for the block:
> whether tags are present or not, if we used FAST_DIFF, or some new packing
> together of related values (we put the keys up front with one or two byte
> pointers into the block where their values are, de-dup values in the latter
> part of the block), or a dictionary scheme (and with which dictionary in
> what meta block) etc. We might borrow ideas from Parquet or ORC. We can
> stop serializing HFile blocks as individual cells into streams and look at
> them as a group of cells to write into a bytebuffer, providing a lot more
> freedom for efficiently structuring the internal details of the block. Let
> me make sure this point makes it out into the public discussion, to
> highlight the additional benefit of having an experimental file format
> available in the 0.96 cycle - it's a place where we and users can go off on
> new directions far beyond inline tags. Of course such changes in unreleased
> trunk code could make that possible too, but what I have observed is
> "professional" HBase devs are much more likely to look at trunk than a
> user. Users really want to work on and contribute a patch for what they are
> running in production. Consider recent contributions from Yahoo and Taobao
> as an example of what I mean. The bar for putting something into V2 is
> extremely high, as it should be, on account of how performance-critical that
> code is. I'm not suggesting less rigor for V3; what I am suggesting is that
> V3 can provide design freedom by going in different directions than the
> legacy V2 code.
> --
> Best regards,
>    - Andy
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
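
To make the block-header idea above a bit more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a proposed HFile v3 layout: the 4-byte size, the field order, the flag bits, the encoding ids, and every name (BlockHeader, Encoding, KEYS_UP_FRONT, dict_meta_block) are hypothetical. Only the concepts come from the thread: whether tags or memstoreTS are present, which encoding strategy the block uses, and which meta block holds a dictionary.

```python
import struct
from dataclasses import dataclass
from enum import IntEnum


class Encoding(IntEnum):
    """Hypothetical ids for the per-block encoding strategy."""
    NONE = 0
    FAST_DIFF = 1
    KEYS_UP_FRONT = 2   # keys first, short pointers to de-duplicated values
    DICTIONARY = 3


# Hypothetical flag bits stored in the first header byte.
FLAG_HAS_TAGS = 0x01
FLAG_HAS_MEMSTORE_TS = 0x02

# Assumed layout, big-endian: 1 byte flags, 1 byte encoding id,
# 2 bytes index of the meta block holding the dictionary (0 if unused).
HEADER = struct.Struct(">BBH")


@dataclass(frozen=True)
class BlockHeader:
    has_tags: bool
    has_memstore_ts: bool
    encoding: Encoding
    dict_meta_block: int = 0

    def pack(self) -> bytes:
        """Serialize the header to HEADER.size (4) bytes."""
        flags = 0
        if self.has_tags:
            flags |= FLAG_HAS_TAGS
        if self.has_memstore_ts:
            flags |= FLAG_HAS_MEMSTORE_TS
        return HEADER.pack(flags, int(self.encoding), self.dict_meta_block)

    @classmethod
    def unpack(cls, block: bytes) -> "BlockHeader":
        """Read the header from the first bytes of a block."""
        flags, encoding_id, dict_meta_block = HEADER.unpack_from(block, 0)
        return cls(
            has_tags=bool(flags & FLAG_HAS_TAGS),
            has_memstore_ts=bool(flags & FLAG_HAS_MEMSTORE_TS),
            encoding=Encoding(encoding_id),
            dict_meta_block=dict_meta_block,
        )
```

A reader would call BlockHeader.unpack(block) first and then choose a decoder from the result; a block whose header has has_tags unset would carry no per-cell tag bytes at all, which is where the space savings described above would come from.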
