hbase-user mailing list archives

From yonghu <yongyong...@gmail.com>
Subject Re: multiple data versions vs. multiple rows?
Date Tue, 20 Jan 2015 10:34:43 GMT
I think we need to look at different situations.

1. One column gets updated frequently and the others do not. If we use the
row representation, every tuple repeats the unchanged values of the other
columns. This can cause a lot of data redundancy. I think this explains why,
in my test, the multiple-data-version approach performed better than the
multiple-row approach.

2. All columns get updated evenly. In that case the data volume of the two
layouts will not differ much, because each data version is itself stored as
a key-value pair. In this situation, the performance difference between the
two approaches will not be significant.
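
To make the two cases concrete, here is a rough back-of-the-envelope sketch
in Python. It is an estimate, not a measurement. It assumes the uncompressed
KeyValue layout (20 fixed bytes plus row, family, qualifier and value),
ignores compression, block encoding and HFile index overhead, and assumes
that in the row layout every event writes a full tuple under a row key
extended by a hypothetical 8-byte suffix. The function names and the sample
sizes are made up for illustration.

```python
# Fixed part of one uncompressed HBase KeyValue:
# key length (4) + value length (4) + row length (2)
# + family length (1) + timestamp (8) + key type (1).
KV_FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1  # 20 bytes


def kv_size(row_len, family_len, qualifier_len, value_len):
    """Approximate bytes for one KeyValue, without compression or encoding."""
    return KV_FIXED_OVERHEAD + row_len + family_len + qualifier_len + value_len


def versions_layout_bytes(updates_per_column, row_len, family_len,
                          qualifier_len, value_len):
    """One row per user: every update of a column adds one cell version."""
    per_cell = kv_size(row_len, family_len, qualifier_len, value_len)
    return sum(updates_per_column) * per_cell


def rows_layout_bytes(updates_per_column, row_len, suffix_len, family_len,
                      qualifier_len, value_len):
    """One row per event: each row repeats every column (full tuple).

    Assumption: the number of rows equals the update count of the most
    frequently updated column, and the row key grows by `suffix_len`
    bytes (for example an event timestamp) to keep rows distinct.
    """
    num_rows = max(updates_per_column)
    num_columns = len(updates_per_column)
    per_cell = kv_size(row_len + suffix_len, family_len, qualifier_len, value_len)
    return num_rows * num_columns * per_cell


# Illustrative sizes only: 16-byte row key, 1-byte family,
# 4-byte qualifier, 8-byte value, 4 columns.
SIZES = dict(row_len=16, family_len=1, qualifier_len=4, value_len=8)

case1 = [10, 1, 1, 1]      # one hot column, the others written once
case2 = [10, 10, 10, 10]   # all columns updated evenly

for label, updates in (("case 1", case1), ("case 2", case2)):
    v = versions_layout_bytes(updates, **SIZES)
    r = rows_layout_bytes(updates, suffix_len=8, **SIZES)
    print(f"{label}: versions={v} bytes, rows={r} bytes, ratio={r / v:.2f}")
```

With these made-up sizes the row layout comes out roughly 3.6 times larger
in case 1, and only about 16% larger in case 2, where the gap comes entirely
from the assumed longer row key.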

Yong

On Tue, Jan 20, 2015 at 8:16 AM, Serega Sheypak <serega.sheypak@gmail.com>
wrote:

> Should performance differ significantly if the row value size is small and
> we don't have too many versions?
> Assume that the pack of versions for a key is smaller than the recommended
> HFile block (8KB to 1MB,
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/HFile.html
> ), which is the minimal "read unit". Should we see any difference at all?
> Am I right?
>
>
> 2015-01-20 0:33 GMT+03:00 Jean-Marc Spaggiari <jean-marc@spaggiari.org>:
>
> > Hi Yong,
> >
> > If you want to compare the performances, you need to run way bigger and
> > longer tests. Don't run them in parallel. Run them at least 10 times each
> > to make sure you have a good trend. Is the difference between the 2
> > significant? It should not be.
> >
> > JM
> >
> > 2015-01-19 15:17 GMT-05:00 yonghu <yongyong313@gmail.com>:
> >
> > > Hi,
> > >
> > > Thanks for your suggestion. I have already considered the first issue,
> > > that one row is not allowed to be split between 2 regions.
> > >
> > > However, I have made a small scan test with MapReduce. I first created
> > > a table t1 with 1 million rows and allowed each column to store 10 data
> > > versions. Then I translated t1 into t2, in which the multiple data
> > > versions in t1 were transformed into multiple rows. I wrote two
> > > MapReduce programs to scan t1 and t2 individually. What I got is that
> > > the table scanning time of t1 is shorter than that of t2. So I think
> > > that, for performance reasons, multiple data versions may be a better
> > > option than multiple rows.
> > >
> > > But just as you said, which approach to use depends on how many
> > > historical events you want to keep.
> > >
> > > regards!
> > >
> > > Yong
> > >
> > >
> > > On Mon, Jan 19, 2015 at 8:37 PM, Jean-Marc Spaggiari <
> > > jean-marc@spaggiari.org> wrote:
> > >
> > > > Hi Yong,
> > > >
> > > > A row will not split between 2 regions. If you plan on having
> > > > thousands of versions, then depending on the size of your data, you
> > > > might end up with a row bigger than your preferred region size.
> > > >
> > > > If you plan to keep just a few versions of the history to have a look
> > > > at it, I would say go with the versions approach. If you plan to have
> > > > one million versions because you want to keep all the events history,
> > > > go with the row approach.
> > > >
> > > > You can also consider going with the Column Qualifier approach. This
> > > > has the same constraint as the versions regarding the split in 2
> > > > regions, but it might be easier to manage and still give you the
> > > > consistency of being within a row.
> > > >
> > > > JM
> > > >
> > > > 2015-01-19 14:28 GMT-05:00 yonghu <yongyong313@gmail.com>:
> > > >
> > > > > Dear all,
> > > > >
> > > > > I want to record user history data. I know there exist two options:
> > > > > one is to store user events in a single row with multiple data
> > > > > versions, and the other is to use multiple rows. I wonder which one
> > > > > is better for performance?
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Yong
> > > > >
> > > >
> > >
> >
>
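
JM's advice quoted above (run each test at least 10 times, never in
parallel, then look at the trend) could be sketched roughly as follows. This
is only a minimal harness in Python under stated assumptions: the job
callables are hypothetical stand-ins for launching the t1 and t2 scan jobs
and waiting for them to finish, and alternating the two jobs within each
round is an added choice intended to spread cache and cluster-load effects
over both, not something prescribed in the thread.

```python
import statistics
import time


def benchmark(jobs, runs=10, clock=time.perf_counter):
    """Run each job `runs` times, strictly one at a time, and time it.

    `jobs` maps a label to a zero-argument callable (hypothetical
    stand-ins for "submit the scan job and block until it completes").
    Returns {label: [elapsed_seconds, ...]}.
    """
    timings = {name: [] for name in jobs}
    for _ in range(runs):
        # Alternate between the jobs in every round; nothing runs concurrently.
        for name, job in jobs.items():
            start = clock()
            job()
            timings[name].append(clock() - start)
    return timings


def summarize(timings):
    """Return {label: (mean, sample standard deviation)} in seconds."""
    return {
        name: (
            statistics.mean(samples),
            statistics.stdev(samples) if len(samples) > 1 else 0.0,
        )
        for name, samples in timings.items()
    }
```

Calling `summarize(benchmark({"t1_versions": scan_t1, "t2_rows": scan_t2}))`
with your own `scan_t1` and `scan_t2` functions (placeholder names) gives a
mean and spread per table. If the two means differ by less than the spread,
the difference is probably not significant.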
