hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Wide table vs narrower table with blob
Date Thu, 10 Sep 2015 18:00:43 GMT
w.r.t. FuzzyRowFilter, there is a bug fix (HBASE-14269) which is not in any
release yet.

Look for an upcoming release (1.2.0, 1.1.3, 0.98.15) that will contain the fix.

FYI
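
For readers unfamiliar with the filter discussed in this thread, the idea of fuzzy row-key matching can be sketched in plain Java without the HBase client. This is an illustration only, not the HBase implementation. It assumes the mask convention described in the FuzzyRowFilter API documentation (0 = the byte must match, 1 = any byte is accepted), fixed-width key fields, and a hypothetical key layout of a 4-byte user id followed by a 2-byte attribute code. The class and method names are made up for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of fuzzy row-key matching; this is NOT the HBase code.
class FuzzyRowKeySketch {

    // Assumed mask convention: 0 = byte must equal the pattern, 1 = any byte allowed.
    static boolean matches(byte[] rowKey, byte[] pattern, byte[] mask) {
        if (pattern.length != mask.length) {
            throw new IllegalArgumentException("pattern and mask must have the same length");
        }
        if (rowKey.length < pattern.length) {
            return false; // key too short to hold every fixed position
        }
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && rowKey[i] != pattern[i]) {
                return false; // a fixed position differs
            }
        }
        return true;
    }

    // Returns the keys that match the pattern, in input order.
    static List<String> select(List<String> rowKeys, String pattern, byte[] mask) {
        byte[] patternBytes = pattern.getBytes(StandardCharsets.US_ASCII);
        List<String> hits = new ArrayList<>();
        for (String key : rowKeys) {
            if (matches(key.getBytes(StandardCharsets.US_ASCII), patternBytes, mask)) {
                hits.add(key);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical layout: 4-byte user id followed by a 2-byte attribute code.
        List<String> rowKeys = Arrays.asList("u001aa", "u001ab", "u002aa", "u002zz");
        byte[] mask = {1, 1, 1, 1, 0, 0}; // user id is a wildcard, attribute code is fixed
        System.out.println(select(rowKeys, "????aa", mask));
    }
}
```

Running main prints [u001aa, u002aa]: the keys whose attribute code is "aa", regardless of user id. Note that this sketch checks every key in turn; it does not model any server-side skipping the real filter may perform.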

On Thu, Sep 10, 2015 at 10:36 AM, Vladimir Rodionov <vladrodionov@gmail.com>
wrote:

> It depends on your read pattern. If you mostly read a small subset of the
> columns (you have a lot of them), both approaches are bad. You will need to
> scan all your columns and deserialize blobs to extract only a few of them
> (that is 5MB at least). Consider adding more data (columns) to the rowkey
> and using FuzzyRowFilter; that should be faster.
>
> From a write performance point of view, blobs are better, of course.
>
> -Vlad
>
> On Thu, Sep 10, 2015 at 9:33 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> > You may have seen this:
> > http://hbase.apache.org/book.html#schema.smackdown
> >
> > bq. are part of one column family
> >
> > Are the columns equally likely to be read?
> > I ask this because you may be able to use the essential column family
> > feature by separating the columns which tend to be accessed more
> > frequently into their own column family.
> >
> > 0.94 is quite old.
> > Any chance of rerunning your benchmark on HBase 1.x?
> >
> > Thanks
> >
> > On Thu, Sep 10, 2015 at 9:00 AM, Melvin Kanasseril <
> > Melvin.Kanasseril@sophos.com> wrote:
> >
> > > Hi,
> > >
> > > This has probably come up before, but I wanted to know if there is a
> > > recommendation around having tables with all attribute data as separate
> > > columns vs. an approach with most of the attribute data stored as a blob
> > > in a single column and the rest as separate columns (for column filter
> > > searches). I am aware of the limitations of lumping the data into a blob
> > > but was curious to see if there is an improvement in throughput/latency.
> > >
> > > I am leaning towards there not being much of a difference, or this being
> > > a micro-optimization not worth the tradeoff, but when we ran a set of
> > > benchmarks to test this (on ver 0.94), the hybrid approach with the blob
> > > data seemed to show a 10-12% improvement in write throughput for the same
> > > number of client threads with evenly distributed puts over a pre-split
> > > table on a 12 node cluster. I used Avro for serialization and all the
> > > columns (there are about 40 without the blob column and 10 with it) are
> > > part of one column family. The size of data for a row is around 5 MB
> > > before serialization. Any thoughts on whether this is worth pursuing?
> > >
> > > Thanks,
> > > Melvin
> > >
> >
>
