hbase-user mailing list archives

From William Kang <weliam.cl...@gmail.com>
Subject Re: Questions about data distribution in HBase
Date Tue, 30 Mar 2010 00:14:42 GMT
Hi,
Thanks a lot for your detailed suggestions.

To answer Tim's question, let me elaborate a little on the case I am
working on. What I need is a low-latency system that can perform some video
processing on the fly. For this reason, an M/R job probably won't do. The
reason I chose Hadoop is its parallelization: I am trying to use
multiple machines to process the video at the same time. Each video
clip should be around 50 MB to 100 MB, and a whole video has already been
sliced into around 10 clips. These clips should be stored in an HBase table
for fast retrieval. But to process them on the fly in a real application, I
need these 10 video clips to be processed at the same time, on the machines
where they are stored.

To satisfy this purpose, I need to implement "local awareness"; that is to
say, the program that processes a video clip should run on the machine that
stores that clip. So my question can be rephrased as:
1. Does HBase provide local awareness of where the data is stored?
2. If yes to question 1, is there an existing framework I can use to
distribute my processes with HBase?
If no to question 2, I think I will have to build some custom RPC interfaces
into my program.

The reason I need local awareness, and to run the processes on the local
data node, is that I want to avoid transporting data over the network while
still using multiple CPUs. The reason I need HBase instead of Hadoop M/R, or
HDFS with RPC, is that latency is quite important for this on-the-fly process.
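The "local awareness" dispatch described above can be sketched independently of HBase: given, for each clip, the set of hosts holding its replicas, prefer a host that also runs a worker so the clip never crosses the network. This is only an illustrative sketch; all class and host names are made up, and in practice the replica lists would come from HDFS block location metadata.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of data-local dispatch: pick a worker host that
// already stores the clip's replicas, falling back to any worker.
public class LocalDispatch {
    /** Pick a host to process the clip: local if possible, else any worker. */
    static String chooseWorker(List<String> hostsStoringClip, List<String> workerHosts) {
        for (String host : hostsStoringClip) {
            if (workerHosts.contains(host)) {
                return host; // data-local: the clip need not be transferred
            }
        }
        return workerHosts.get(0); // no overlap: fall back to a remote worker
    }

    public static void main(String[] args) {
        List<String> workers = Arrays.asList("node1", "node2", "node3");
        // clip1's replicas live on node4 and node2; node2 runs a worker
        System.out.println(chooseWorker(Arrays.asList("node4", "node2"), workers)); // node2
        // clip2's replicas live only on node5; fall back to node1
        System.out.println(chooseWorker(Arrays.asList("node5"), workers)); // node1
    }
}
```

A real dispatcher would also balance load across replicas rather than always taking the first local match.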

If necessary, I can give a more detailed description of my case. Thanks
a lot.


William


On Mon, Mar 29, 2010 at 7:25 PM, Karthik K <oss.akk@gmail.com> wrote:

> William -
>  If you are processing video files (depending on how big they are), a
> better prospect might be to store the video files in HDFS only and use
> Hadoop RPC (see Avro) with a custom protocol to process them. The Katta
> suggestion inline is a great example of that (a custom protocol on top of
> Avro / Hadoop RPC).
>
>   To get a hint about the locality of the files on HDFS, you can use the
> following in DistributedFileSystem:
> BlockLocation[] DistributedFileSystem#getBlockLocations(String src, long
> start, long length);
>   and use it as a guiding factor for locality in your protocol.
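To make the "guiding factor" concrete without depending on a running Hadoop cluster, here is a self-contained sketch of the ranking step: given per-block replica lists (the kind of data the getBlockLocations call above returns), tally how many bytes of the file each host holds locally. The `Block` class is a minimal stand-in for Hadoop's BlockLocation, and all host names are illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: rank hosts by locally-held bytes, as a basis for routing a
// processing request to the host that can read the most data locally.
public class LocalityRank {
    // Minimal stand-in for Hadoop's BlockLocation (length + replica hosts).
    static class Block {
        final long length;
        final List<String> hosts;
        Block(long length, List<String> hosts) { this.length = length; this.hosts = hosts; }
    }

    /** Tally, per host, how many bytes of the file it stores locally. */
    static Map<String, Long> bytesPerHost(List<Block> blocks) {
        Map<String, Long> tally = new HashMap<>();
        for (Block b : blocks) {
            for (String h : b.hosts) {
                tally.merge(h, b.length, Long::sum); // add this block's bytes to each replica host
            }
        }
        return tally;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
            new Block(64L, Arrays.asList("node1", "node2")),
            new Block(64L, Arrays.asList("node2", "node3")));
        // node2 holds both blocks, so routing the request there moves no data
        System.out.println(bytesPerHost(blocks));
    }
}
```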
>
>
>
>
>
> On Mon, Mar 29, 2010 at 2:27 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>
> > Hi William,
> >
> > I think you are slightly confused about the usage and intention of HBase.
> > Let me first say that HBase is a *storage* system designed for
> > low-latency, random-access retrieval, built on top of HDFS for high
> > availability. That is, it's a storage system, not a processing system. It
> > solves the "large file problem" of HDFS, wherein access to an arbitrary
> > slice of a file requires scanning through every segment preceding it, by
> > giving random access by record key.
> >
> > For further details on HBase, I'll +1 the suggestion of reviewing the
> > post by Lars George.
> >
> > You haven't contributed any details re: what kind of "processing" you
> > wish to accomplish over this video data. Based on your focus on low
> > latency, I will assume the m/r batch processing suggested earlier is not
> > acceptable and you require some kind of low-latency, immediate-response
> > solution. If this is indeed the case, I suggest you look at using Katta (
> > http://katta.sourceforge.net/) for your low-latency processing. It says
> > "Distributed Lucene" but they actually mean "Distributed, Low-Latency
> > Aggregates." Perhaps an acceptable solution for you is random-access
> > storage of your video data in HBase combined with a custom Katta server
> > for processing of low-latency requests.
> >
> > Any further details you can provide about your project will aid in the
> > direction and advice The List can provide.
> >
> > Cheers,
> > -Nick
> >
> > On Sat, Mar 27, 2010 at 7:42 PM, William Kang <weliam.cloud@gmail.com> wrote:
> >
> > > Hi Dan,
> > > Thanks for your reply.
> > > But I still have some questions about your answers:
> > > 1. What difference does it make whether I use the HMaster or any other
> > > machine, since you mention "If you run the program from a single
> > > machine (don't use the HMaster) then yes, it would have to transfer the
> > > data to that machine using the network"? Is there a way to run the
> > > program on multiple machines without using M/R?
> > > 2. Still, what about the latency if we use M/R with HBase?
> > > Thanks.
> > >
> > >
> > > William
> > >
> > > On Sat, Mar 27, 2010 at 6:38 PM, Dan Washusen <dan@reactive.org> wrote:
> > >
> > > > Hi William,
> > > > I've put a few comments inline...
> > > >
> > > > On 28 March 2010 04:06, William Kang <weliam.cloud@gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > > I am quite confused about the distribution of data in an HBase
> > > > > system. For instance, if I store 10 videos in the cells of 10
> > > > > HTable rows, I assume that these 10 videos will be stored on
> > > > > different data nodes (region servers) in HBase.
> > > >
> > > > The distribution of the data would depend on the size of the videos.
> > > > Assuming the videos are 10 MB each, all of them will be contained
> > > > within a single region and served by a single region server.  Once a
> > > > region contains more than 256 MB of data (the default) the region is
> > > > split in two.  The two regions will then (probably) be served by two
> > > > region servers, etc...
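The arithmetic above can be checked with a quick back-of-envelope sketch: given the default 256 MB split threshold mentioned in the thread, how many regions would N clips of a given size occupy? Note this idealizes regions as perfectly packed; real splits happen by row key as a region grows, so actual counts can differ.

```java
// Back-of-envelope estimate of region count for N clips of equal size,
// assuming the 256 MB default split threshold from the discussion above.
public class RegionEstimate {
    static final long MAX_REGION_BYTES = 256L * 1024 * 1024; // default threshold (0.20-era)

    static long regionsNeeded(int clips, long clipBytes) {
        long total = clips * clipBytes;
        // ceiling division: total bytes / region size, rounded up, at least 1
        return Math.max(1, (total + MAX_REGION_BYTES - 1) / MAX_REGION_BYTES);
    }

    public static void main(String[] args) {
        // Dan's example: 10 clips of 10 MB = 100 MB, fits in one region
        System.out.println(regionsNeeded(10, 10L * 1024 * 1024)); // 1
        // William's upper bound: 10 clips of 100 MB = ~1000 MB, about 4 regions
        System.out.println(regionsNeeded(10, 100L * 1024 * 1024)); // 4
    }
}
```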
> > > >
> > > > You may also be getting the terminology a little mixed.  I'd suggest
> > > > having a read of the excellent HBase Architecture 101 article that
> > > > Lars George wrote:
> > > > http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
> > > >
> > > > > Now, if I wrote a program that does some processing on these 10
> > > > > videos in parallel, what's going to happen?
> > > > > Since I only deploy the program as a jar to the master server in
> > > > > HBase, will all the videos in the HBase system have to be
> > > > > transferred to the master server to get processed?
> > > >
> > > > If you run the program from a single machine (don't use the HMaster)
> > > > then yes, it would have to transfer the data to that machine over
> > > > the network.
> > > >
> > > > > 1. Or do I have another option to assign where the computing
> > > > > should happen, so I do not have to transfer the data over the
> > > > > network and can use the region server's CPU for the processing?
> > > > > 2. Or should I deploy the program jar to each region server so the
> > > > > region server can use its local CPU on the local data? Will HBase
> > > > > do that automatically?
> > > > > 3. Or do I need to plug M/R into HBase in order to get data
> > > > > locality and parallel processing?
> > > > > Many thanks.
> > > >
> > > > HBase uses HDFS to store files.  The data that a region server is
> > > > serving does not necessarily reside on the same machine as the
> > > > region server.  As a result, options 1 and 2 don't really make
> > > > sense...
> > > >
> > > > As Tim Robertson suggests, you are left with option 3 to consider...
> > > >
> > > > >
> > > > >
> > > > > William
> > > >
> > > > I hope that helps a little.  I'd really strongly recommend that you
> > > > have a read of the HBase Architecture 101 article...
> > > >
> > > > Cheers,
> > > > Dan
> > > >
> > >
> >
>
