hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: Insert into tall table 50% faster than wide table
Date Thu, 23 Dec 2010 19:54:06 GMT
Hi all,

What does the region count look like between your tall and wide
tables? If you don't get a good spread of regions across your cluster,
you don't get full parallelism on all your hardware.

Row locks are another thing to watch out for: concurrent puts to the
same row will serialize on the row lock.

-ryan

On Thu, Dec 23, 2010 at 5:20 AM, Michael Segel
<michael_segel@hotmail.com> wrote:
>
> Uhm... just a couple of thoughts...
>
> For clarification... let's call Bryan's "order's columns" the detail
> of the order. "Columns of columns" is a bit confusing...
>
> It's becoming more apparent that schema design plays a large role in
> performance, and because it depends on HBase's internals, it may well
> be tied to specific versions.
> This means that as HBase evolves, those seeking optimum performance
> may have to periodically review their schema decisions.
>
> The first thing I'd recommend on the 'wide table' schema is to not
> store the individual order's columns as separate columns, but as part
> of the order itself. The main reason for this is that you will never
> fetch an order's detail by itself. A quick and cheap way of
> serializing the order detail is to use something Dick Pick did around
> 40 years ago. In the Pick databases (i.e., Revelation), a
> non-printable ASCII character was used as a column delimiter. You
> could use the '|' (pipe) character, but someone could point out that
> it's possible for it to occur in the data. A non-printable ASCII
> character (char 254??) would be less likely to be part of the data.
> This works well because when you want to get the order, you can fetch
> it from HBase and then parse the order with a string tokenizer. (Very
> fast and efficient.)
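A minimal sketch of this delimiter idea (the class name and field
layout are hypothetical, and ASCII 0x01 is used here as the
non-printable delimiter rather than the char 254 Mike floats):

  import org.apache.hadoop.hbase.util.Bytes;

  // Serialize an order's detail fields into one value using a
  // non-printable delimiter, and parse them back after a fetch.
  public class OrderCodec {
      // ASCII 0x01 (SOH) is non-printable and unlikely to appear in data.
      private static final char DELIM = '\u0001';

      public static byte[] serialize(String[] detailFields) {
          StringBuilder sb = new StringBuilder();
          for (int i = 0; i < detailFields.length; i++) {
              if (i > 0) sb.append(DELIM);
              sb.append(detailFields[i]);
          }
          return Bytes.toBytes(sb.toString());
      }

      public static String[] parse(byte[] value) {
          // Split on the delimiter to recover the detail fields.
          return Bytes.toString(value).split(String.valueOf(DELIM));
      }
  }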
>
> This will make life easier in the long run...
>
> It will also have a positive impact on your code.
> On each Mapper.map() iteration, or rather code iteration [see
> assumption below], you have your row_id, and then one put for the
> column write (which contains the 10 detail items). Note: which has
> the higher cost? Using a string buffer to concatenate your 10 detail
> items and then taking its bytes, or doing 10 put()s?
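To make the comparison concrete, a hedged sketch of the two write
patterns (the table handle, the "order" family, and the qualifier
scheme are illustrative; Option A reuses the OrderCodec sketch above):

  import java.io.IOException;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class OrderWriter {
      private static final byte[] FAMILY = Bytes.toBytes("order");

      // Option A: one column per order; the value is the
      // delimiter-concatenated detail (one KeyValue per order).
      public static void writeConcatenated(HTable table, byte[] rowId,
              String orderId, String[] detailFields) throws IOException {
          Put put = new Put(rowId);
          put.add(FAMILY, Bytes.toBytes(orderId),
                  OrderCodec.serialize(detailFields));
          table.put(put);
      }

      // Option B: ten columns per order, one per detail field.
      public static void writeSeparate(HTable table, byte[] rowId,
              String orderId, String[] detailFields) throws IOException {
          Put put = new Put(rowId);
          for (int i = 0; i < detailFields.length; i++) {
              put.add(FAMILY, Bytes.toBytes(orderId + ":" + i),
                      Bytes.toBytes(detailFields[i]));
          }
          table.put(put);
      }
  }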
>
> Note the following: the discussion above is about squeezing out
> maximum performance. There will be code improvements; however, they
> will be relatively modest when compared to other potential gains.
>
> Assumption(s):
> Bryan is attempting to create a simulation with 10K customers and 600
> orders each (10 items per order). This is a performance test.
> This probably isn't an m/r program but a single client doing inserts.
> Note that it's a relative performance comparison, so it would be
> easier to do as a single program and not a distributed one. This
> could be m/r if Bryan pre-builds the list of customer orders before
> starting the job... Or it could be a multi-threaded client where each
> thread reads from the pre-built list and performs an insert.
>
> If the assumption is true, then Bryan is going to randomly pick a
> customer id, create an order, and insert the order into HBase.
> (Randomly pick a number between 1 and N, where N is the number of
> customers who haven't yet placed 600 orders; count each customer's
> orders and remove a customer from the list once they reach 600.)
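A rough sketch of that selection loop (pure client-side logic;
allCustomerIds and insertOrder are placeholders for the test's data
and its actual HBase write):

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.Random;

  // Pick customers at random until each has placed 600 orders.
  List<String> candidates = new ArrayList<String>(allCustomerIds);
  Map<String, Integer> counts = new HashMap<String, Integer>();
  Random rnd = new Random();
  while (!candidates.isEmpty()) {
      int i = rnd.nextInt(candidates.size());
      String customer = candidates.get(i);
      insertOrder(customer);  // placeholder for the HBase insert
      Integer c = counts.get(customer);
      int n = (c == null) ? 1 : c + 1;
      counts.put(customer, n);
      if (n == 600) {
          candidates.remove(i);  // customer has placed all 600 orders
      }
  }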
>
> So this really wouldn't be a bulk-load app, but a simulation of
> multiple clients hitting HBase, measuring relative performance.
>
> If this is the case, I don't know that you want to use HFileOutputFormat...
>
> With respect to the 'wide' row, I'd hash the key. (You wouldn't want
> to do this in the 'tall' table because you want each customer's
> orders to be near each other.)
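A minimal sketch of hashing the wide-table row key so sequential
customer ids don't all land on one region (MD5 is one reasonable
choice here, not necessarily what Mike has in mind):

  import java.io.UnsupportedEncodingException;
  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;

  // Hash the customer id so row keys spread evenly across regions.
  public static byte[] hashedRowKey(String customerId)
          throws NoSuchAlgorithmException, UnsupportedEncodingException {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      return md5.digest(customerId.getBytes("UTF-8"));
  }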
>
> HTH
>
> -Mike
>
>
>> Date: Thu, 23 Dec 2010 10:55:43 +0000
>> Subject: Re: Insert into tall table 50% faster than wide table
>> From: lars.george@gmail.com
>> To: user@hbase.apache.org
>>
>> Writing data only hits the WAL and the MemStore, so that should
>> result in the same performance for both models. One thing that Mike
>> mentioned is how you distribute the load. How many servers are you
>> using? How are you inserting your data (sequentially or randomly)?
>> Why do you use a Put, since this sounds like a bulk insert and hence
>> should be much better done with an HFileOutputFormat-based MapReduce
>> job?
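For reference, a hedged sketch of the bulk-load setup Lars alludes to
(API roughly as of HBase 0.89/0.90; OrderMapper, the table name, and
the paths are illustrative):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  Configuration conf = HBaseConfiguration.create();
  Job job = new Job(conf, "order-bulk-load");
  job.setMapperClass(OrderMapper.class);  // emits row key / Put pairs
  job.setMapOutputKeyClass(ImmutableBytesWritable.class);
  job.setMapOutputValueClass(Put.class);
  // Wires in the reducer, partitioner, and HFile output format
  // matched to the table's current region boundaries.
  HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "orders"));
  FileInputFormat.addInputPath(job, new Path("/input/orders"));
  FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));
  job.waitForCompletion(true);
  // The generated HFiles are then moved into the table with the
  // completebulkload tool (LoadIncrementalHFiles).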
>>
>> You do have some row locking happening, as mentioned earlier, which
>> may block concurrent updates to the same row. Are you sending all
>> updates for one row in a single Put instance? Or are you creating
>> many Puts for each order but the same row?
>>
>> Lars
>>
>> On Thu, Dec 23, 2010 at 9:57 AM, Andrey Stepachev <octo47@gmail.com> wrote:
>> > 2010/12/23 Ted Dunning <tdunning@maprtech.com>
>> >
>> >> But the tall table is FASTER than the wide table.
>> >>
>> >
>> > Oops. :)
>> >
>> > Maybe you are writing more data? Are you using compression? (In
>> > the case of prefixed qualifiers you get more data; a UUID can be
>> > of comparable length to an order row key.)
>> >
>> >
>> >>
>> >> On Wed, Dec 22, 2010 at 11:14 PM, Andrey Stepachev <octo47@gmail.com>
>> >> wrote:
>> >>
>> >> > I think row locks slow things down here. Each row you insert
>> >> > tries to acquire a lock and then release it. The wide table has
>> >> > significantly fewer rows, and far fewer locks are acquired
>> >> > during the insert.
>> >> >
>> >> >
>> >> > 2010/12/23 Bryan Keller <bryanck@gmail.com>
>> >> >
>> >> > > I have been testing a couple of different approaches to
>> >> > > storing customer orders. One is a tall table, where each order
>> >> > > is a row. The other is a wide table where each customer is a
>> >> > > row, and orders are columns in the row. I am finding that
>> >> > > inserts into the tall table, i.e. adding rows for every order,
>> >> > > are roughly 50% faster than inserts into the wide table, i.e.
>> >> > > adding a row for a customer and then adding columns for
>> >> > > orders.
>> >> > >
>> >> > > In my test, there are 10,000 customers, each customer has 600
>> >> > > orders, and each order has 10 columns. The tall table approach
>> >> > > results in 6 million rows of 10 columns. The wide table
>> >> > > approach results in 10,000 rows of 6,000 columns. I'm using
>> >> > > HBase 0.89-20100924 and Hadoop 0.20.2. I am adding the orders
>> >> > > using a Put for each order, submitted in batches of 1000 as a
>> >> > > list of Puts.
>> >> > >
>> >> > > Are there techniques to speed up inserts with the wide table
>> >> > > approach that I am perhaps overlooking?
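For reference, a minimal sketch of the batched-Put pattern Bryan
describes, with the client-side write buffer enabled so the batch is
sent in a few large RPCs rather than one per Put (buffer size and
names are illustrative):

  import java.io.IOException;
  import java.util.List;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;

  // Buffer Puts client-side and flush the whole batch at once.
  public static void submitBatch(HTable table, List<Put> batch)
          throws IOException {
      table.setAutoFlush(false);
      table.setWriteBufferSize(2 * 1024 * 1024);  // 2 MB, tunable
      table.put(batch);      // queued in the write buffer
      table.flushCommits();  // push buffered Puts to the region servers
  }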
