hbase-user mailing list archives

From Bryan Keller <brya...@gmail.com>
Subject Re: Insert into tall table 50% faster than wide table
Date Thu, 23 Dec 2010 22:44:21 GMT
Correction, I ran the wrong test. Consolidating the Puts increased performance back to that
of the tall table. So it appears row locks were the issue. Thanks for the help everyone.
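
The consolidation described above can be sketched as follows. This is only an illustration in plain Python, not HBase client code: `consolidate_by_row`, `batches`, and the `order_id:field` qualifier layout are hypothetical names chosen for the example, and each `(row, cells)` pair stands in for one Put.

```python
from collections import defaultdict


def consolidate_by_row(orders):
    """Group (customer_id, order_id, fields) records into one mutation per row.

    Each returned (row, cells) pair is a stand-in for a single Put: one row
    key plus every column/value pair destined for that row.
    """
    cells_by_row = defaultdict(dict)
    for customer_id, order_id, fields in orders:
        for name, value in fields.items():
            # Assumed layout: prefix the qualifier with the order id so that
            # columns from different orders do not collide within the row.
            cells_by_row[customer_id][f"{order_id}:{name}"] = value
    return list(cells_by_row.items())


def batches(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

With this grouping, a batch carries at most one mutation per customer row, so a row's lock is taken once per batch rather than once per order.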

On Dec 23, 2010, at 2:28 PM, Bryan Keller wrote:

> I revised the test so that it creates a single Put for each customer. Previously I was creating a separate Put for each order, even if the order was for the same customer. I submit batches of Puts using HTable.put(List<Put>).
> Performance with both approaches was about the same. It doesn't appear as if row locks are an issue in my case, perhaps because the Puts for a customer's orders are mostly in the same List<Put>?
> As to cluster setup, I am testing tall vs wide on the exact same cluster. Keys are all random UUIDs so I'm assuming I should get a good spread. Are there configuration options I should be looking at that could help wide table performance for inserts?
> I was thinking about serializing the order data, but then I will run into issues of versioning and such, and then I am back to a tightly structured schema. Thus I did like storing the order fields in separate columns. Read performance seems to be very good, it is the writes that are slower.
> On Dec 23, 2010, at 11:54 AM, Ryan Rawson wrote:
>> Hi all,
>> What does the region count look like between your tall and wide
>> tables?  If you don't get a good spread of regions across your cluster
>> you don't get full parallelism on all your hardware.
>> The row lock thing is another thing to watch out for: concurrent puts
>> will serialize along the row lock.
>> -ryan
>> On Thu, Dec 23, 2010 at 5:20 AM, Michael Segel
>> <michael_segel@hotmail.com> wrote:
>>> Uhm... just a couple of thoughts...
>>> For clarification ... let's call Bryan's "order's columns" the detail of the order. Columns of columns is a bit confusing...
>>> It's becoming more apparent that schema design will be a large consideration in terms of performance, and because it's going to be dependent on HBase's internals, it's very possible that it can be tied to versions.
>>> This means that as HBase evolves, those seeking optimum performance may have to periodically review their schema decisions.
>>> The first thing I'd recommend on the 'wide table' schema is to not store the individual order's columns as separate columns, but as part of the order itself. The main reason for this is that you will never fetch an order's detail by itself. A quick and cheap way of serializing the order detail is to use something Dick Pick did around 40 years ago. In the Pick databases (i.e. Revelation), a non-printable ASCII character was used as a column delimiter. You could use the '|' (pipe) character, but someone could point out that it's possible that it could occur in the data. A non-printable ASCII character (char 254??) would be less likely to be part of the data. This works well because when you want to get the order, you can fetch it from HBase, then parse the order based on a string token. (Very fast and efficient.)
>>> This will make life easier in the long run...
>>> It will also have a positive impact on your code.
>>> On each Mapper.map() iteration, or rather code iteration [see assumption below], you have your row_id, and then one put for the column write (that contains the 10 detail items). Note: What has a higher cost? Using a string buffer and concatenation of your 10 detail items then taking its bytes, or doing 10 put()s?
>>> Note the following: The discussion above is for uber performance gains. There will be code improvements, however they will be relatively modest when compared to other potential
>>> Assumption(s):
>>> Bryan is attempting to create a simulation with 10K customers, 600 orders each (10 items per order). This is a performance test.
>>> This probably isn't a m/r program but a single client doing an insert. Note that it's a relative performance issue and it would be easier to do as a single program and not a distributed one. This could be a m/r if Bryan pre-builds the list of customer orders before starting the job... Or it could be a multi-threaded client where each thread reads from the pre-built list and performs an insert.
>>> If the assumption is true, then Bryan is going to randomly pick a customer id, create an order and insert the order into HBase. (Randomly pick a number between 1 and N where N represents the number of customers who haven't placed 600 orders, and then count the number of orders and remove each customer with 600 orders from the list.)
>>> So this really wouldn't be a bulk load app, but a simulation of multiple clients hitting HBase and its relative performance.
>>> If this is the case, I don't know if you want to use the HFileOutputFormat...
>>> With respect to the 'wide' row, I'd hash the key. (You wouldn't want to do this in the 'tall' table because you want each customer's orders to be near each other.)
>>> HTH
>>> -Mike
>>>> Date: Thu, 23 Dec 2010 10:55:43 +0000
>>>> Subject: Re: Insert into tall table 50% faster than wide table
>>>> From: lars.george@gmail.com
>>>> To: user@hbase.apache.org
>>>> Writing data only hits the WAL and MemStore, so that should result in
>>>> the same performance for both models. One thing that Mike mentioned is
>>>> how you distribute the load. How many servers are you using? How are
>>>> you inserting your data (sequential or random)? Why do you use a Put,
>>>> since this sounds like a bulk insert and hence should be much better
>>>> done with an HFileOutputFormat-based MapReduce job?
>>>> You do have some row locking happening as mentioned earlier, which may
>>>> block concurrent updates to the same row. Are you sending updates for
>>>> one row in a single Put instance? Or are you creating many Puts for
>>>> each order but the same row?
>>>> Lars
>>>> On Thu, Dec 23, 2010 at 9:57 AM, Andrey Stepachev <octo47@gmail.com> wrote:
>>>>> 2010/12/23 Ted Dunning <tdunning@maprtech.com>
>>>>>> But the tall table is FASTER than the wide table.
>>>>> Oops. :).
>>>>> Maybe you are writing more data? Are you using compression? (With prefixed qualifiers you write more data, since the UUID can have a length comparable to an order row.)
>>>>>> On Wed, Dec 22, 2010 at 11:14 PM, Andrey Stepachev <octo47@gmail.com>
>>>>>> wrote:
>>>>>>> I think row locks slow things down here. Each row you insert tries to acquire a lock and then release it. The wide table has significantly fewer rows, so far fewer locks are acquired during insert.
>>>>>>> 2010/12/23 Bryan Keller <bryanck@gmail.com>
>>>>>>>> I have been testing a couple of different approaches to storing customer orders. One is a tall table, where each order is a row. The other is a wide table where each customer is a row, and orders are columns in the row. I am finding that inserts into the tall table, i.e. adding rows for every order, is roughly 50% faster than inserts into the wide table, i.e. adding a row for a customer and then adding columns for orders.
>>>>>>>> In my test, there are 10,000 customers, each customer has 600 orders and each order has 10 columns. The tall table approach results in 6 mil rows of 10 columns. The wide table approach results in 10,000 rows of 6,000 columns.
>>>>>>>> I'm using hbase 0.89-20100924 and hadoop 0.20.2. I am adding the orders using a Put for each order, submitted in batches of 1000 as a list of Puts.
>>>>>>>> Are there techniques to speed up inserts with the wide table that I am perhaps overlooking?
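
Michael Segel's suggestion earlier in the thread, storing an order's detail as one delimited value instead of ten columns, might look roughly like the sketch below. It is plain Python rather than HBase client code; `pack_order` and `unpack_order` are hypothetical helpers, and the ASCII unit separator (0x1F) is an assumed delimiter chosen for the example, not the character Mike had in mind.

```python
# Assumption: ASCII unit separator as the delimiter; the thread only says
# "a non-printable character" and does not settle on a specific one.
DELIM = "\x1f"


def pack_order(values):
    """Join an order's detail fields into a single byte string."""
    for value in values:
        if DELIM in value:
            # The delimiter must never occur inside the data itself.
            raise ValueError("field contains the delimiter character")
    return DELIM.join(values).encode("utf-8")


def unpack_order(blob):
    """Split a packed order back into its fields, in the original order."""
    return blob.decode("utf-8").split(DELIM)
```

Field meaning here is purely positional, which is the versioning concern Bryan raises: adding or reordering fields changes how older values must be parsed.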
