hbase-user mailing list archives

From Bertrand Dechoux <decho...@gmail.com>
Subject Re: Coprocessors vs MapReduce?
Date Wed, 25 Jul 2012 08:09:55 GMT

As Andrew pointed out, Cascading is indeed for MapReduce. I know the use
case has been discussed before; I wanted to know what the current state
is. (The blog entry is from 2010.) The use case is simple. I am doing log
analysis and would like to perform fast aggregations. These aggregations
are common (count/sum/average), but what exactly is aggregated will depend
on the type of log. With Apache logs, HTTP error codes may be interesting.
With other logs, request durations could be. I (well, any user) would like
to be able to reuse common code and write the coprocessors in a concise
(domain-specific) language. So I was checking that I had not missed any new
projects/developments on that idea. From what I understand of your answers,
implementing a kind of Cascading for coprocessors may be possible but has
not been done, and may not really be pertinent/safe/efficient with the
current architecture of coprocessors.
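
To make the aggregation idea concrete, here is a minimal Python sketch. It
is not HBase code: the record layout, the region lists, and the names
Partial, aggregate, server_error and duration are all hypothetical. It only
shows why count/sum/average can share one reusable piece of code: each
server returns a small (count, sum) pair, and the log-specific part is
reduced to an extractor function.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Mergeable partial aggregate: enough state for count, sum and average."""
    count: int = 0
    total: float = 0.0

    def add(self, value):
        self.count += 1
        self.total += value

    def merge(self, other):
        # Averages cannot be merged directly, but (count, sum) pairs can.
        return Partial(self.count + other.count, self.total + other.total)

    @property
    def average(self):
        return self.total / self.count if self.count else None


def aggregate(records, extract):
    """Aggregate one server's records; extract returns a number or None to skip."""
    part = Partial()
    for record in records:
        value = extract(record)
        if value is not None:
            part.add(value)
    return part


# Hypothetical log records, split as if they lived on two different servers.
region_a = [{"status": 200, "duration_ms": 12}, {"status": 500, "duration_ms": 40}]
region_b = [{"status": 503, "duration_ms": 8}]


def server_error(record):
    # Count only HTTP 5xx responses.
    return 1 if record["status"] >= 500 else None


def duration(record):
    return record["duration_ms"]


errors = aggregate(region_a, server_error).merge(aggregate(region_b, server_error))
durations = aggregate(region_a, duration).merge(aggregate(region_b, duration))
print(errors.count, durations.average)
```

Only the extractor changes between log types; the aggregation and merge
logic stay the same, which is the kind of reuse I have in mind.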


I forgot that the shell still requires the table to be offline. Thanks for
pointing that out. So, coprocessors are not meant to be loaded that often.

I am not sure I understand your answers. I have read about the
Bigtable/HBase architecture, but I may also not have expressed my problem
correctly. The way I see it, coprocessors would allow me to aggregate
information from recent logs. The problem I have with vanilla MapReduce is
that if the logs do not fill a full HDFS block, then MapReduce is a bit
overkill. I thought that for those cases, coprocessors would be more
appropriate. Is that the right way to see it? If so, is there any rule of
thumb for knowing when to select MapReduce versus coprocessors? On the
other side of the scale, I also assume that if I had 1 terabyte of data,
MapReduce would be faster because it allows more parallelism. Well... I
hope my concern is clearer.


I was talking specifically of coprocessorExec.

Since the return value is a Map, I assume that all the results are
gathered before it is returned. So the call would wait for all servers to
complete their work.
But theoretically, it should be possible to return early results so that
the caller could perform early aggregation of the results while waiting
for the remaining results to come. (Or I may be misunderstanding
something.)
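
What I mean by early aggregation can be sketched in plain Python. This is
not the HBase client API: region_call, aggregate_incrementally and the
region names are hypothetical stand-ins, and the delays simulate servers
that answer at different times. The caller folds each partial result into
a running total as soon as it arrives instead of waiting for the full map.
(I believe the 0.92 client also has a coprocessorExec variant that takes a
Batch.Callback invoked per region result, but that should be checked
against the Javadoc.)

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def region_call(region, delay, partial_sum):
    """Stand-in for one per-region endpoint call (hypothetical, not an HBase API)."""
    time.sleep(delay)  # simulate a server that takes a while to answer
    return region, partial_sum


def aggregate_incrementally(calls, on_update=None):
    """Fold each region's partial sum into a running total as results arrive."""
    running = 0
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = [pool.submit(region_call, *call) for call in calls]
        # as_completed yields futures in completion order, not submission order.
        for future in as_completed(futures):
            region, partial = future.result()
            running += partial
            if on_update is not None:
                on_update(region, running)
    return running


calls = [("region-1", 0.03, 10), ("region-2", 0.01, 5), ("region-3", 0.02, 7)]
seen = []
total = aggregate_incrementally(
    calls, lambda region, running: seen.append((region, running))
)
print(total, len(seen))
```

The order of the intermediate updates depends on which server answers
first, so only the final total is deterministic.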

Thanks for the previous feedback. That's already clearer for me.



On Tue, Jul 24, 2012 at 9:05 PM, Andrew Purtell <apurtell@apache.org> wrote:

> On Tue, Jul 24, 2012 at 7:59 AM, Bertrand Dechoux <dechouxb@gmail.com>
> wrote:
> > First, I thought coprocessors needed a restart but it seems a shell can
> > be used to add/remove them without requiring a restart. However, at the
> > moment the coprocessors are defined within jar and can not be
> > dynamically created. Could you confirm that?
> You can dynamically load new coprocessors by deploying a jarfile to
> HDFS, using the shell to disable the table, add the coprocessor, and
> then enable the table.
> To remove a coprocessor from a table, you can use the shell to disable
> the table, remove the coprocessor, and then enable the table again.
> However, whatever was loaded by the JVM will remain resident until the
> regionserver process is restarted.
> >  (I am thinking about the Cascading way of creating
> > the implementation which will then be serialized, sent and executed.)
> ... as a MapReduce job.
> MR jobs in Hadoop are really each individual submissions of
> application code to run on the cluster each and every time.
> In contrast, HBase coprocessors can be thought of like Linux loadable
> kernel modules. You add them to your infrastructure. HBase becomes
> more like an application deployment platform where the details of data
> colocation with the application code at scale is handled for you
> automatically, as is client side dispatch to the appropriate
> locations.
> An early design of coprocessors considered code shipping at request
> time, but that doesn't fit the extension model above well.
> But also consider that HBase is a short-request system. The latency of
> processing each individual RPC is important and expected to be as short
> as possible. For a table where you want to extend server side
> function, imagine the overhead if that extension were shipped in every
> request. Each RPC would be what? 10x? 100x? larger? And there would be
> the client side latency of figuring the transitive closure of classes
> to send up, and then server side latency of installing the bytecode
> for execution and then removing it for GC.
> > Second, I didn't see any way to give parameters to coprocessors. Is that
> > really the case? If not, how would the parameters be handled?
> A coprocessor can be an Observer, linked in to server side function.
> Parameters are handed to your installed extension via upcall from
> HBase code.
> Or, a coprocessor can be an Endpoint. This is a dynamic RPC endpoint.
> You can send up any parameter to an endpoint via Exec as long as HBase
> RPC can serialize it.
> For more information see:
> https://blogs.apache.org/hbase/entry/coprocessor_introduction
> > Third, I assume coprocessors are using the process/thread of the region
> > server. Does that mean that, if multiple blocks need to be processed,
> > MapReduce should be more efficient? Are there other ways to know whether
> > coprocessors or MapReduce should be chosen?
> Coprocessors operate on requests (RPCs), not blocks.
> If you address a coprocessor request to the whole table, whatever
> happens will happen on all regionservers in parallel. This is as far
> as the similarity to MapReduce goes.
> Conceivably you could implement a map() and reduce() interface on top
> of HBase using Coprocessors, but CPs themselves are a lower level
> extension framework.
> > Fourth, I know this is a really broad question but how would you compare
> > coprocessors to YARN? I have yet to know more about both subjects but I
> > feel that the concepts are not totally unrelated.
> Coprocessors are a low level extension framework, YARN is a general
> purpose high level cluster resource manager. Not in the same
> engineering ballpark.
> > Lastly, this is an implementation detail but how does the client side
> > wait for the results? Is it possible to perform early aggregation or
> > does the client need to receive all the information before doing
> > anything else?
> >
> > Regards
> >
> > Bertrand
> >
> >
> > PS: My two sources on that subject, both for HBase 0.92, are:
> > * https://blogs.apache.org/hbase/entry/coprocessor_introduction
> > * HBase The Definitive Guide.
> --
> Best regards,
>    - Andy
> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein (via Tom White)

Bertrand Dechoux
