hbase-user mailing list archives

From Marcos Luis Ortiz Valmaseda <marcosluis2...@gmail.com>
Subject Re: should i use compression?
Date Wed, 03 Apr 2013 15:41:16 GMT
Here's the API documentation:

*FAST_DIFF*:
http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.html

"Encoder similar to DiffKeyDeltaEncoder
<http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/io/encoding/DiffKeyDeltaEncoder.html>
but supposedly faster.

Compress using:
- store size of common prefix
- save column family once in the first KeyValue
- use integer compression for key, value and prefix (7-bit encoding)
- use bits to avoid duplication key length, value length and type if it
  same as previous
- store in 3 bits length of prefix timestamp with previous KeyValue's
  timestamp
- one bit which allow to omit value if it is the same

Format:
- 1 byte: flag
- 1-5 bytes: key length (only if FLAG_SAME_KEY_LENGTH is not set in flag)
- 1-5 bytes: value length (only if FLAG_SAME_VALUE_LENGTH is not set in
  flag)
- 1-5 bytes: prefix length
- ... bytes: rest of the row (if prefix length is small enough)
- ... bytes: qualifier (or suffix depending on prefix length)
- 1-8 bytes: timestamp suffix
- 1 byte: type (only if FLAG_SAME_TYPE is not set in the flag)
- ... bytes: value (only if FLAG_SAME_VALUE is not set in the flag)"
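
The "1-5 bytes" length fields follow from the 7-bit integer encoding: each
byte carries 7 bits of the value plus one continuation bit, so a 32-bit
length needs at most 5 bytes and small lengths need only 1. Here is a
minimal sketch of that idea. The class name SevenBitIntSketch and the
least-significant-group-first byte order are my own assumptions for
illustration; this is not HBase's code and may not match its exact byte
layout.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative 7-bit variable-length integer codec (hypothetical, not an HBase class). */
final class SevenBitIntSketch {

    /** Encodes a non-negative int using 7 payload bits per byte, low-order group first. */
    static byte[] encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not supported in this sketch");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        // While more than 7 bits remain, emit 7 bits and set the continuation bit.
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value); // last byte: continuation bit clear
        return out.toByteArray();
    }

    /** Decodes a value produced by encode(). */
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result; // continuation bit clear: this was the last byte
            }
            shift += 7;
        }
        throw new IllegalArgumentException("truncated input");
    }

    public static void main(String[] args) {
        int[] samples = {0, 127, 128, 16383, 16384, Integer.MAX_VALUE};
        for (int v : samples) {
            byte[] enc = encode(v);
            System.out.println(v + " -> " + enc.length + " byte(s), decoded " + decode(enc));
        }
    }
}
```

Typical key, value and prefix lengths are well below 128, so in a scheme
like this each of those fields would usually cost a single byte.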

*DIFF*:
http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/io/encoding/DiffKeyDeltaEncoder.html

"Compress using:
- store size of common prefix
- save column family once, it is same within HFile
- use integer compression for key, value and prefix (7-bit encoding)
- use bits to avoid duplication key length, value length and type if it
  same as previous
- store in 3 bits length of timestamp field
- allow diff in timestamp instead of actual value

Format:
- 1 byte: flag
- 1-5 bytes: key length (only if FLAG_SAME_KEY_LENGTH is not set in flag)
- 1-5 bytes: value length (only if FLAG_SAME_VALUE_LENGTH is not set in
  flag)
- 1-5 bytes: prefix length
- ... bytes: rest of the row (if prefix length is small enough)
- ... bytes: qualifier (or suffix depending on prefix length)
- 1-8 bytes: timestamp or diff
- 1 byte: type (only if FLAG_SAME_TYPE is not set in the flag)
- ... bytes: value"
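
Both encoders start from the same idea, "store size of common prefix":
because keys in a block are sorted, each key usually shares its leading
bytes with the previous one, so only the prefix length and the differing
tail need to be written. Here is a minimal sketch of that step alone. The
class PrefixDeltaSketch, its Entry type and the "row/qualifier" string keys
are hypothetical stand-ins; real HBase keys are binary and also carry
family, timestamp and type, which this sketch ignores.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative prefix-delta coding of sorted keys (hypothetical, not an HBase class). */
final class PrefixDeltaSketch {

    /** One encoded key: number of leading bytes reused from the previous key, plus the new tail. */
    static final class Entry {
        final int prefixLength;
        final byte[] suffix;

        Entry(int prefixLength, byte[] suffix) {
            this.prefixLength = prefixLength;
            this.suffix = suffix;
        }
    }

    /** Length of the longest common prefix of a and b. */
    static int commonPrefix(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length);
        int i = 0;
        while (i < limit && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    /** Encodes each key relative to the key before it; the first key has prefix length 0. */
    static List<Entry> encode(List<byte[]> sortedKeys) {
        List<Entry> out = new ArrayList<>();
        byte[] previous = new byte[0];
        for (byte[] key : sortedKeys) {
            int shared = commonPrefix(previous, key);
            out.add(new Entry(shared, Arrays.copyOfRange(key, shared, key.length)));
            previous = key;
        }
        return out;
    }

    /** Rebuilds the original keys by replaying prefixes against the previously decoded key. */
    static List<byte[]> decode(List<Entry> entries) {
        List<byte[]> out = new ArrayList<>();
        byte[] previous = new byte[0];
        for (Entry e : entries) {
            byte[] key = new byte[e.prefixLength + e.suffix.length];
            System.arraycopy(previous, 0, key, 0, e.prefixLength);
            System.arraycopy(e.suffix, 0, key, e.prefixLength, e.suffix.length);
            out.add(key);
            previous = key;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] keys = {"row1/q:alpha", "row1/q:alpine", "row1/q:beta", "row2/q:alpha"};
        List<byte[]> raw = new ArrayList<>();
        for (String k : keys) {
            raw.add(k.getBytes(StandardCharsets.UTF_8));
        }
        for (Entry e : encode(raw)) {
            System.out.println("prefix=" + e.prefixLength
                    + " suffix=" + new String(e.suffix, StandardCharsets.UTF_8));
        }
    }
}
```

For a table like Prakash's, where one row holds many qualifiers and the
values are empty, consecutive keys repeat the whole row key, so a scheme of
this kind would be expected to remove most of the stored bytes.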

I was reading the FAQ and found nothing on this topic. It would be nice to
include it in the documentation.

Lars, what do you think? It would be nice if you could write a detailed
blog post about this topic.





2013/4/3 Jean-Marc Spaggiari <jean-marc@spaggiari.org>

> I read the JIRA already but it was not clear to me. However Cloudera's
> link is very clear. Thanks for that. Any idea what's the difference
> between DIFF and FAST_DIFF?
>
> 2013/4/3 Marcos Luis Ortiz Valmaseda <marcosluis2186@gmail.com>:
> > You can also read this JIRA issue:
> > https://issues.apache.org/jira/browse/HBASE-4218
> >
> >
> >
> > 2013/4/3 Marcos Luis Ortiz Valmaseda <marcosluis2186@gmail.com>
> >>
> >> Regards, Jean-Marc.
> >> The best resource that I found for this is a great blog post called
> >> "Apache HBase I/O - HFile" from Matteo Bertozzi on Cloudera's blog.
> >> Here's the link:
> >> http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
> >>
> >>
> >>
> >>
> >> 2013/4/3 Jean-Marc Spaggiari <jean-marc@spaggiari.org>
> >>>
> >>> Is there any documentation anywhere regarding the differences between
> >>> PREFIX, DIFF and FAST_DIFF?
> >>>
> >>> 2013/4/3 prakash kadel <prakash.kadel@gmail.com>:
> >>> > thank you very much.
> >>> > i will try with snappy compression with data_block_encoding
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > On Wed, Apr 3, 2013 at 11:21 PM, Kevin O'dell
> >>> > <kevin.odell@cloudera.com> wrote:
> >>> >
> >>> >> Prakash,
> >>> >>
> >>> >>   Yes, I would recommend Snappy Compression.
> >>> >>
> >>> >> On Wed, Apr 3, 2013 at 10:18 AM, Prakash Kadel
> >>> >> <prakash.kadel@gmail.com>
> >>> >> wrote:
> >>> >> > Thanks,
> >>> >> >     is there any specific compression that is recommended for the
> >>> >> > use case i have?
> >>> >> >    Since my values are all null will compression help?
> >>> >> >
> >>> >> >  I am thinking of using prefix data_block_encoding..
> >>> >> > Sincerely,
> >>> >> > Prakash Kadel
> >>> >> >
> >>> >> >
> >>> >> > On Apr 3, 2013, at 10:55 PM, Ted Yu wrote:
> >>> >> >
> >>> >> >> You should use data block encoding (in 0.94.x releases only). It is
> >>> >> >> helpful for reads.
> >>> >> >>
> >>> >> >> You can also enable compression.
> >>> >> >>
> >>> >> >> Cheers
> >>> >> >>
> >>> >> >>
> >>> >> >> On Wed, Apr 3, 2013 at 6:42 AM, Prakash Kadel
> >>> >> >> <prakash.kadel@gmail.com> wrote:
> >>> >> >>
> >>> >> >>> Hello,
> >>> >> >>>    I have a question.
> >>> >> >>>    I have a table where i store data in the column qualifiers (the
> >>> >> >>> values itself are null).
> >>> >> >>>    I just have 1 column family.
> >>> >> >>>   The number of columns per row is variable (1~ few thousands)
> >>> >> >>>
> >>> >> >>> Currently i don't use compression or the data_block_encoding.
> >>> >> >>>
> >>> >> >>> Should i?
> >>> >> >>> I want to have faster reads.
> >>> >> >>>
> >>> >> >>> Please suggest.
> >>> >> >>>
> >>> >> >>>
> >>> >> >>> Sincerely,
> >>> >> >>> Prakash Kadel
> >>> >> >
> >>> >>
> >>> >>
> >>> >>
> >>> >> --
> >>> >> Kevin O'Dell
> >>> >> Systems Engineer, Cloudera
> >>> >>
> >>
> >>
> >>
> >>
> >> --
> >> Marcos Ortiz Valmaseda,
> >> Data-Driven Product Manager at PDVSA
> >> Blog: http://dataddict.wordpress.com/
> >> LinkedIn: http://www.linkedin.com/in/marcosluis2186
> >> Twitter: @marcosluis2186
> >
> >
> >
> >
>



-- 
Marcos Ortiz Valmaseda,
*Data-Driven Product Manager* at PDVSA
*Blog*: http://dataddict.wordpress.com/
*LinkedIn: *http://www.linkedin.com/in/marcosluis2186
*Twitter*: @marcosluis2186 <http://twitter.com/marcosluis2186>
