hbase-user mailing list archives

From Serega Sheypak <serega.shey...@gmail.com>
Subject Re: multiple data versions vs. multiple rows?
Date Tue, 20 Jan 2015 07:16:20 GMT
Should performance really differ significantly if the row value size is small
and we don't have too many versions?
Assume that the pack of versions for a key is smaller than the recommended
HFile block size (8KB to 1MB, see
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/HFile.html),
which is the minimal "read unit". Should we see any difference at all?
Am I right?
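
To make the question concrete, here is a rough back-of-the-envelope sketch of the size of a "pack of versions" compared with the block size. It is only an estimate under stated assumptions: about 20 bytes of fixed KeyValue framing per cell, the default 64KB BLOCKSIZE, and no compression or data block encoding. The function names and the sample sizes are hypothetical.

```python
# Rough estimate of how many bytes all versions of one cell occupy,
# compared with the HFile block size.
# Assumption: each stored cell costs about 20 bytes of fixed KeyValue
# framing (key length, value length, row length, family length,
# timestamp, type) plus the row key, family, qualifier and value bytes.
# Compression and data block encoding are ignored.
KEYVALUE_FIXED_OVERHEAD = 20
DEFAULT_BLOCK_SIZE = 64 * 1024  # default column family BLOCKSIZE


def cell_size(row_len, family_len, qualifier_len, value_len):
    """Approximate on-disk size of a single cell, in bytes."""
    return (KEYVALUE_FIXED_OVERHEAD + row_len + family_len
            + qualifier_len + value_len)


def versions_pack_size(versions, row_len, family_len, qualifier_len, value_len):
    """Approximate size of all versions of one cell, in bytes."""
    return versions * cell_size(row_len, family_len, qualifier_len, value_len)


def fits_in_one_block(pack_size, block_size=DEFAULT_BLOCK_SIZE):
    """True if the pack is no larger than one block."""
    return pack_size <= block_size


# Hypothetical numbers: 10 versions, 16-byte row key, 1-byte family,
# 5-byte qualifier, 100-byte value.
pack = versions_pack_size(10, 16, 1, 5, 100)
print(pack, fits_in_one_block(pack))
```

Note that blocks are not aligned to row boundaries, so even a pack smaller than one block can straddle two blocks.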


2015-01-20 0:33 GMT+03:00 Jean-Marc Spaggiari <jean-marc@spaggiari.org>:

> Hi Yong,
>
> If you want to compare the performance, you need to run much bigger and
> longer tests. Don't run them in parallel. Run each at least 10 times to
> make sure you have a good trend. Is the difference between the 2
> significant? It should not be.
>
> JM
>
> 2015-01-19 15:17 GMT-05:00 yonghu <yongyong313@gmail.com>:
>
> > Hi,
> >
> > Thanks for your suggestion. I have already considered the first issue
> > that one row is not allowed to be split between 2 regions.
> >
> > However, I have made a small scan test with MapReduce. I first created a
> > table t1 with 1 million rows and allowed each column to store 10 data
> > versions. Then, I translated t1 into t2, in which the multiple data
> > versions in t1 were transformed into multiple rows in t2. I wrote two
> > MapReduce programs to scan t1 and t2 individually. What I found is that
> > the table scanning time of t1 is shorter than that of t2. So I think
> > that, for performance reasons, multiple data versions may be a better
> > option than multiple rows.
> >
> > But just as you said, which approach to use depends on how many
> > historical events you want to keep.
> >
> > regards!
> >
> > Yong
> >
> >
> > On Mon, Jan 19, 2015 at 8:37 PM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> > > Hi Yong,
> > >
> > > A row will not be split between 2 regions. If you plan on having
> > > thousands of versions, then depending on the size of your data, you
> > > might end up having a row bigger than your preferred region size.
> > >
> > > If you plan to keep just a few versions of the history to have a look
> > > at it, I would say go with it. If you plan to have one million versions
> > > because you want to keep all the events history, go with the row
> > > approach.
> > >
> > > You can also consider going with the Column Qualifier approach. This
> > > has the same constraint as the versions regarding the split in 2
> > > regions, but it might be easier to manage and still gives you the
> > > consistency of being within a row.
> > >
> > > JM
> > >
> > > 2015-01-19 14:28 GMT-05:00 yonghu <yongyong313@gmail.com>:
> > >
> > > > Dear all,
> > > >
> > > > I want to record user history data. I know there exist two options:
> > > > one is to store user events in a single row with multiple data
> > > > versions, and the other one is to use multiple rows. I wonder which
> > > > one is better for performance?
> > > >
> > > > Thanks!
> > > >
> > > > Yong
> > > >
> > >
> >
>
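
For readers comparing the three layouts discussed in this thread (multiple versions, multiple rows, column qualifiers), the sketch below shows how the same events could map to cell coordinates (row, family, qualifier, timestamp, value) in each case. It is a plain in-memory illustration, not HBase client code. The row-key format `user#timestamp`, the family name "e", the qualifier names and the sample events are all hypothetical.

```python
# Each cell is modeled as (row, family, qualifier, timestamp, value).
# The key formats and names below are illustrative assumptions only.

def as_versions(user, events):
    """One row, one column; every event is another version of the cell."""
    return [(user, "e", "event", ts, value) for ts, value in events]


def as_rows(user, events):
    """One row per event; the zero-padded timestamp in the row key keeps
    a user's rows sorted by time."""
    return [(f"{user}#{ts:013d}", "e", "event", ts, value)
            for ts, value in events]


def as_qualifiers(user, events):
    """One row per user; every event gets its own column qualifier."""
    return [(user, "e", f"{ts:013d}", ts, value) for ts, value in events]


def distinct_rows(cells):
    """Number of distinct row keys used by a layout."""
    return len({cell[0] for cell in cells})


# Hypothetical sample events: (timestamp in ms, payload).
events = [
    (1421700000000, "login"),
    (1421700060000, "click"),
    (1421700120000, "logout"),
]

for layout in (as_versions, as_rows, as_qualifiers):
    print(layout.__name__, distinct_rows(layout("u1", events)))
```

The versions and qualifier layouts keep a user's whole history in a single row, which is why both share the constraint that the row cannot be split between regions. The row layout spreads the history over many rows. For the versions layout, the column family also has to be configured to retain enough versions.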
