hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Rowkey design question
Date Thu, 09 Apr 2015 02:43:30 GMT
When you say coprocessor, do you mean HBase coprocessors or do you mean a physical hardware
coprocessor? 

In terms of queries… 

HBase can perform a single get() and return the result quickly. (The size of the
data being returned will impact the overall timing.)

HBase also caches the results so that your first hit will take the longest, but as long as
the row is cached, the results are returned quickly. 

If you’re trying to do a scan with a start/stop row set… your timing could then
vary between sub-second and minutes depending on the query.
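
As a rough illustration, here is a minimal sketch of both access patterns using
the standard Java client API (table, family, and qualifier names are
hypothetical):

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    // assumes an open Connection named 'connection'
    Table table = connection.getTable(TableName.valueOf("t1"));

    // Point lookup: one row, one qualifier. Typically milliseconds,
    // faster still once the enclosing block is in the block cache.
    Get get = new Get(Bytes.toBytes("row-42"));
    get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q1"));
    Result result = table.get(get);

    // Bounded scan: cost grows with the width of the key range,
    // hence anywhere from sub-second to minutes.
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("row-42"));
    scan.setStopRow(Bytes.toBytes("row-50"));
    try (ResultScanner rs = table.getScanner(scan)) {
      for (Result r : rs) {
        // process r
      }
    }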


> On Apr 8, 2015, at 3:10 PM, Kristoffer Sjögren <stoffe@gmail.com> wrote:
> 
> But if the coprocessor is omitted then CPU cycles from region servers are
> lost, so where would the query execution go?
> 
> Queries need to be quick (sub-second rather than seconds) and HDFS is
> quite latency-hungry, unless there are optimizations that I'm unaware of?
> 
> 
> 
> On Wed, Apr 8, 2015 at 7:43 PM, Michael Segel <michael_segel@hotmail.com>
> wrote:
> 
>> I think you misunderstood.
>> 
>> The suggestion was to put the data into HDFS sequence files and to use
>> HBase to store an index into the file (URL of the file, then the offset
>> into the file for the start of the record…).
>> 
>> The reason you want to do this is that you’re reading in large amounts of
>> data, and it’s more efficient to do this from HDFS than through HBase.
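>> 
>> A rough sketch of that pattern, assuming an uncompressed SequenceFile of
>> (Text, BytesWritable) records; the position reported by writer.getLength()
>> just before each append is the offset you would store in the HBase index
>> (paths, table, and column names here are hypothetical):
>> 
>>     import org.apache.hadoop.conf.Configuration;
>>     import org.apache.hadoop.fs.Path;
>>     import org.apache.hadoop.io.*;
>>     import org.apache.hadoop.hbase.client.Put;
>>     import org.apache.hadoop.hbase.util.Bytes;
>> 
>>     // Write a record and index its (file, offset) pair in HBase.
>>     Configuration conf = new Configuration();
>>     Path file = new Path("/data/records.seq");
>>     SequenceFile.Writer writer = SequenceFile.createWriter(conf,
>>         SequenceFile.Writer.file(file),
>>         SequenceFile.Writer.keyClass(Text.class),
>>         SequenceFile.Writer.valueClass(BytesWritable.class));
>>     long offset = writer.getLength();   // record boundary before the append
>>     writer.append(new Text("record-1"), new BytesWritable(payload));
>>     writer.close();
>> 
>>     Put put = new Put(Bytes.toBytes("record-1"));
>>     put.addColumn(Bytes.toBytes("idx"), Bytes.toBytes("file"),
>>         Bytes.toBytes(file.toString()));
>>     put.addColumn(Bytes.toBytes("idx"), Bytes.toBytes("offset"),
>>         Bytes.toBytes(offset));
>>     indexTable.put(put);
>> 
>>     // Read path: fetch (file, offset) from HBase, then seek directly.
>>     SequenceFile.Reader reader = new SequenceFile.Reader(conf,
>>         SequenceFile.Reader.file(file));
>>     Text key = new Text();
>>     BytesWritable value = new BytesWritable();
>>     reader.seek(offset);
>>     reader.next(key, value);            // reads just that one record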
>> 
>>> On Apr 8, 2015, at 8:41 AM, Kristoffer Sjögren <stoffe@gmail.com> wrote:
>>> 
>>> Yes, I think you're right. Adding one or more dimensions to the rowkey
>>> would indeed make the table narrower.
>>> 
>>> And I guess it also makes sense to store actual values (bigger qualifiers)
>>> outside HBase. Keeping them in Hadoop, why not? Pulling hot ones out on SSD
>>> caches would be an interesting solution. And quite a bit simpler.
>>> 
>>> Good call and thanks for the tip! :-)
>>> 
>>> On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel <michael_segel@hotmail.com>
>>> wrote:
>>> 
>>>> Ok…
>>>> 
>>>> First, I’d suggest you rethink your schema by adding an additional
>>>> dimension.
>>>> You’ll end up with more rows, but a narrower table.
>>>> 
>>>> In terms of compaction… if the data is relatively static, you won’t have
>>>> compactions because nothing changed.
>>>> But if your data is that static… why not put the data in sequence files
>>>> and use HBase as the index? Could be faster.
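>>>> 
>>>> For example, with purely hypothetical names, instead of
>>>> 
>>>>     rowkey = entityId          (~100K qualifiers: bucket-00000 … bucket-99999)
>>>> 
>>>> you would move one dimension into the key:
>>>> 
>>>>     rowkey = entityId + '|' + bucketId   (a handful of qualifiers per row)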
>>>> 
>>>> HTH
>>>> 
>>>> -Mike
>>>> 
>>>>> On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren <stoffe@gmail.com> wrote:
>>>>> 
>>>>> I just read through the HBase MOB design document and one thing that
>>>>> caught my attention was the following statement.
>>>>> 
>>>>> "When HBase deals with large numbers of values > 100kb and up to ~10MB
>> of
>>>>> data, it encounters performance degradations due to write amplification
>>>>> caused by splits and compactions."
>>>>> 
>>>>> Is there any chance to run into this problem in the read path for data
>>>>> that is written infrequently and never changed?
>>>>> 
>>>>> On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren <stoffe@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> A small set of qualifiers will be accessed frequently, so keeping them
>>>>>> in block cache would be very beneficial. Some very seldom. So this
>>>>>> sounds very promising!
>>>>>> 
>>>>>> The reason why I'm considering a coprocessor is that I need to provide
>>>>>> very specific information in the query request. Same thing with the
>>>>>> response. Queries are also highly parallelizable across rows, and each
>>>>>> individual query produces a valid result that may or may not be
>>>>>> aggregated with other results in the client, maybe even inside the
>>>>>> region if it contained multiple rows targeted by the query.
>>>>>> 
>>>>>> So it's a bit like Phoenix but with a different storage format and query
>>>>>> engine.
>>>>>> 
>>>>>> On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk <ndimiduk@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Those rows are written out into HBase blocks on cell boundaries. Your
>>>>>>> column family has a BLOCK_SIZE attribute, which you may or may not have
>>>>>>> overridden from the default of 64k. Cells are written into a block
>>>>>>> until it is >= the target block size. So your single 500MB row will be
>>>>>>> broken down into thousands of HFile blocks in some number of HFiles.
>>>>>>> Some of those blocks may contain just a cell or two and be a couple MB
>>>>>>> in size, to hold the largest of your cells. Those blocks will be loaded
>>>>>>> into the Block Cache as they're accessed. If you're careful with your
>>>>>>> access patterns and only request cells that you need to evaluate,
>>>>>>> you'll only ever load the blocks containing those cells into the cache.
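>>>>>>> 
>>>>>>> If you do want a non-default block size for that family, it is set per
>>>>>>> column family at table creation; a minimal sketch with the 1.x admin
>>>>>>> API (table and family names hypothetical):
>>>>>>> 
>>>>>>>     HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("t1"));
>>>>>>>     HColumnDescriptor cf = new HColumnDescriptor("cf");
>>>>>>>     cf.setBlocksize(64 * 1024);  // the 64k default; tune to cell sizes
>>>>>>>     desc.addFamily(cf);
>>>>>>>     admin.createTable(desc);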
>>>>>>> 
>>>>>>>> Will the entire row be loaded or only the qualifiers I ask for?
>>>>>>> 
>>>>>>> So then, the answer to your question is: it depends on how you're
>>>>>>> interacting with the row from your coprocessor. The read path will only
>>>>>>> load blocks that your scanner requests. If your coprocessor is
>>>>>>> producing a scanner that seeks to specific qualifiers, you'll only load
>>>>>>> those blocks.
>>>>>>> 
>>>>>>> Related question: Is there a reason you're using a coprocessor instead
>>>>>>> of a regular filter, or a simple qualified get/scan to access data from
>>>>>>> these rows? The "default stuff" is already tuned to load data sparsely,
>>>>>>> as would be desirable for your schema.
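>>>>>>> 
>>>>>>> For instance, a qualified get touches only the blocks holding the
>>>>>>> requested cells (family and qualifier names hypothetical):
>>>>>>> 
>>>>>>>     Get get = new Get(rowkey);
>>>>>>>     get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q-17"));
>>>>>>>     get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q-42"));
>>>>>>>     Result result = table.get(get);
>>>>>>>     // only the blocks containing q-17 / q-42 are read into the cache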
>>>>>>> 
>>>>>>> -n
>>>>>>> 
>>>>>>> On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren <stoffe@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Sorry, I should have explained my use case a bit more.
>>>>>>>> 
>>>>>>>> Yes, it's a pretty big row and it's "close" to worst case. Normally
>>>>>>>> there would be fewer qualifiers and the largest qualifiers would be
>>>>>>>> smaller.
>>>>>>>> 
>>>>>>>> The reason why these rows get big is because they store aggregated
>>>>>>>> data in indexed, compressed form. This format allows for extremely
>>>>>>>> fast queries (on the local disk format) over billions of rows (not
>>>>>>>> rows in HBase speak) when touching smaller areas of the data. If I
>>>>>>>> were to store the data as regular HBase rows, things would get very
>>>>>>>> slow unless I had many, many region servers.
>>>>>>>> 
>>>>>>>> The coprocessor is used for doing custom queries on the indexed data
>>>>>>>> inside the region servers. These queries are not like a regular row
>>>>>>>> scan, but very specific as to how the data is formatted within each
>>>>>>>> column qualifier.
>>>>>>>> 
>>>>>>>> Yes, this is not possible if HBase loads the whole 500MB each time I
>>>>>>>> want to perform this custom query on a row. Hence my question :-)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Apr 7, 2015 at 11:03 PM, Michael Segel
>>>>>>>> <michael_segel@hotmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Sorry, but your initial problem statement doesn’t seem to parse…
>>>>>>>>> 
>>>>>>>>> Are you saying that you have a single row with approximately 100,000
>>>>>>>>> elements where each element is roughly 1-5KB in size, and in addition
>>>>>>>>> there are ~5 elements which will be between one and five MB in size?
>>>>>>>>> 
>>>>>>>>> And you then mention a coprocessor?
>>>>>>>>> 
>>>>>>>>> Just looking at the numbers… 100K * 5KB means that each row would end
>>>>>>>>> up being 500MB in size.
>>>>>>>>> 
>>>>>>>>> That’s a pretty fat row.
>>>>>>>>> 
>>>>>>>>> I would suggest rethinking your strategy.
>>>>>>>>> 
>>>>>>>>>> On Apr 7, 2015, at 11:13 AM, Kristoffer Sjögren <stoffe@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi
>>>>>>>>>> 
>>>>>>>>>> I have a row with around 100,000 qualifiers with mostly small values
>>>>>>>>>> around 1-5KB, and maybe 5 larger ones around 1-5 MB. A coprocessor
>>>>>>>>>> does random access of 1-10 qualifiers per row.
>>>>>>>>>> 
>>>>>>>>>> I would like to understand how HBase loads the data into memory.
>>>>>>>>>> Will the entire row be loaded or only the qualifiers I ask for (like
>>>>>>>>>> pointer access into a direct ByteBuffer)?
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> -Kristoffer

The opinions expressed here are mine, while they may reflect a cognitive thought, that is
purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com





