hbase-user mailing list archives

From Stanislav Barton <stanislav.bar...@internetmemory.net>
Subject Re: bulk loading and RegionObservers
Date Thu, 12 Jan 2012 11:03:49 GMT
Andrew Purtell <apurtell@...> writes:

> Yes this is correct.
> Coprocessors / RegionObservers and bulk loading have been developing
> separately in parallel.
> Now that bulk loading changes are settling down, I've been considering adding
> CP hooks into the bulk load process, at the HRegion level, without
> complicating atomicity. A simple and straightforward course of action is to
> give the CP the option of rewriting the submitted store file(s) before the
> regionserver attempts to validate and move them into the store. This is
> similar to how CPs are hooked into compaction.
> Would this be sufficient for what you want to do?
> Best regards,
>        - Andy
> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via
> Tom White)
> >________________________________
> > From: Stanislav Barton <stanislav.barton <at> internetmemory.net>
> >To: user@... 
> >Sent: Wednesday, January 11, 2012 6:47 AM
> >Subject: bulk loading and RegionObservers
> > 
> >Hello,
> >
> >I tried to find the information in the documentation but it is still
> >not clear to me. I do a lot of bulk loading using the MapReduce job
> >whose output is HFiles that are automatically loaded to HBase and I
> >was wondering whether this way (my guess is that it is so) I do bypass
> >the RegionObserver mechanisms. Meaning that such defined coprocessors
> >won't get fired up when the new data is loaded in HBase. Is my
> >assumption correct?
> >
> >Stan
> >
> >
> >

I think the people asking for this kind of access would like to be able to
trigger an action at the row level (that is, just as when a Put with new values
arrives). But I do not think this would scale: wouldn't it take a long time to
scan the new region and fire a prePut() call on the RO for each row in it? In my
experience, I do bulk load steps of about 30 GB into a pre-split table in order
to maintain the highest throughput and keep overhead as low as possible (on a
fairly small cluster (~10) of small machines).
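
To put a rough number on the scaling concern above, here is a back-of-envelope
sketch in Java. Only the 30 GB load size comes from the message; the average row
size and the per-call hook cost are made-up assumptions for illustration, not
measurements from any cluster, and the class and method names are hypothetical.
It does not use any HBase API.

```java
public class PerRowHookEstimate {

    // Estimated number of rows in a bulk load, given an assumed average row size.
    public static long estimateRows(long bulkLoadBytes, long avgRowBytes) {
        return bulkLoadBytes / avgRowBytes;
    }

    // Total seconds spent if a hook costing hookMicros runs once per row, serially.
    public static double serialHookSeconds(long rows, double hookMicros) {
        return rows * hookMicros / 1_000_000.0;
    }

    public static void main(String[] args) {
        long bulkLoadBytes = 30L * 1024 * 1024 * 1024; // 30 GB, from the message
        long avgRowBytes = 1024;   // ASSUMPTION: 1 KB per row on average
        double hookMicros = 50.0;  // ASSUMPTION: cost of one per-row hook call

        long rows = estimateRows(bulkLoadBytes, avgRowBytes);
        double seconds = serialHookSeconds(rows, hookMicros);

        System.out.printf("rows: %d, serial hook time: %.1f s%n", rows, seconds);
    }
}
```

Under these assumptions the per-row cost is multiplied by tens of millions of
rows per load step, which is why a per-file hook looks more attractive for bulk
loads than a per-row one. Different row sizes or hook costs change the figure
proportionally.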


