hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: Questions about data distribution in HBase
Date Tue, 30 Mar 2010 00:33:08 GMT
This use case is an ideal one for coprocessors. Alas, the coprocessor feature is not finished.

More inline.

> From: William Kang
> Subject: Re: Questions about data distribution in HBase
> What I need is a low-latency system that can perform some
> video processing on the fly. For this reason, M/R probably
> won't do the job. The reason I chose Hadoop is its
> parallelization.

Unless I somehow misunderstand, Hadoop parallelization == M/R.

That is, the only parallel scheduling for user tasks on the Hadoop platform is MapReduce.

> I am trying to use multiple machines to process the video
> clips at the same time. Each video clip should be around
> 50M to 100M. A whole video has already been sliced into
> around 10 video clips. These clips should be stored in an
> HBase table for fast retrieval. But to do the processing on
> the fly for a real application, I need these 10 video clips
> to be processed at the same time, where they are stored.
> To satisfy this purpose, I need to implement "local
> awareness", that is to say, my program that processes video
> clips should run on the machine that stores the video
> clips. So, my question can be rephrased as:
> 1. Does HBase provide local awareness of where the data is
> stored?

You know the row key, so you can find via the master the region server currently hosting the
region which contains the key. Over time, after major compaction, region servers bring the
HDFS blocks backing a region local.
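
As a rough illustration of that lookup, here is a minimal sketch in plain Python. It does not call HBase; the region layout, the host names, and the helpers `locate` and `group_by_server` are all hypothetical. It assumes only that regions are sorted by start key and that each region covers the keys from its start key up to the next region's start key, compared byte-wise.

```python
import bisect

# Hypothetical region layout: (start_key, server) pairs sorted by start key.
# The first region starts at the empty key, so every row key falls in some region.
REGIONS = [
    (b"", "rs1.example.com"),
    (b"video-0400", "rs2.example.com"),
    (b"video-0800", "rs3.example.com"),
]


def locate(row_key, regions=REGIONS):
    """Return the server hosting the region whose key range contains row_key."""
    start_keys = [start for start, _ in regions]
    # bisect_right finds the first start key greater than row_key;
    # the region just before it is the one that contains the key.
    idx = bisect.bisect_right(start_keys, row_key) - 1
    return regions[idx][1]


def group_by_server(row_keys, regions=REGIONS):
    """Group row keys by hosting server so work can be sent to where the data lives."""
    groups = {}
    for key in row_keys:
        groups.setdefault(locate(key, regions), []).append(key)
    return groups
```

With the clip row keys grouped this way, each group could be handed to a worker on the matching host. In a real cluster the layout would come from the cluster's own metadata rather than a hard-coded list, and it can change as regions split or move.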

> 2. If yes to question 1, is there any current framework I
> can use to distribute my processes with hbase?

Coprocessors. HBASE-2000, HBASE-2001

Alas, unfinished.

> If no to question 2, I think I will have to make some
> custom rpc interfaces in my program.

It might be easier to help work on HBASE-2001.

> The reason I need local awareness and want to run the
> processes on the local data node is that I want to avoid
> transporting data over the network and to use multiple CPUs.

You can transcode at put time or at get time with a coprocessor.
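
To make the put-time versus get-time choice concrete, here is a toy model in plain Python. It is not the coprocessor API (which is unfinished); `HookedTable`, its `on_put` and `on_get` hooks, and `fake_transcode` are hypothetical names invented for this sketch, and the "transcoder" only tags the bytes.

```python
class HookedTable:
    """Toy in-memory table with optional hooks run on put and on get.

    Hypothetical model only: it is not the HBase coprocessor API.
    """

    def __init__(self, on_put=None, on_get=None):
        self._rows = {}
        self._on_put = on_put
        self._on_get = on_get

    def put(self, row_key, value):
        # Put-time hook: transform the value once, before it is stored.
        if self._on_put is not None:
            value = self._on_put(row_key, value)
        self._rows[row_key] = value

    def get(self, row_key):
        # Get-time hook: transform the stored value on every read.
        value = self._rows[row_key]
        if self._on_get is not None:
            value = self._on_get(row_key, value)
        return value


def fake_transcode(row_key, clip):
    """Stand-in for a real transcoder: just tags the bytes."""
    return b"transcoded:" + clip


# Transcode once when the clip is written.
eager = HookedTable(on_put=fake_transcode)
eager.put(b"clip-1", b"raw")

# Store the raw clip and transcode each time it is read.
lazy = HookedTable(on_get=fake_transcode)
lazy.put(b"clip-1", b"raw")
```

Both tables return the same bytes from `get`. The difference is where the work happens: the put-time variant pays the cost once per write and stores the result, while the get-time variant keeps the original clip and pays on every read.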

   - Andy

