Mailing-List: contact hbase-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hbase-user@hadoop.apache.org
MIME-Version: 1.0
In-Reply-To: <ca73442f1003271006t5d25228du3d333bc45dad50bc@mail.gmail.com>
References: <ca73442f1003271006t5d25228du3d333bc45dad50bc@mail.gmail.com>
Date: Sat, 27 Mar 2010 18:54:44 +0100
Message-ID: <32120a6a1003271054h79777b3bsf0cb575b6e21f161@mail.gmail.com>
Subject: Re: Questions about data distribution in HBase
From: Tim Robertson <timrobertson100@gmail.com>
To: hbase-user@hadoop.apache.org
Content-Type: text/plain; charset=ISO-8859-1

I would go with option 3) if it were me (I am not an expert).  It is
common to use HBase tables as the input format for MapReduce jobs.
I don't think it is as simple as assuming that the 10 videos will be
spread over 10 machines when you store them, but as the volume grows
the data will certainly be distributed across the region servers, and
with MR the processing will try to run as close to the data as
possible.
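
To make the two points above concrete, here is a toy sketch in plain Java.
It is not the HBase API: LocalitySketch, Region, splitIntoRegions and
planMapTasks are names made up for illustration, regions are split by row
count (real HBase splits on store size), and regions are assigned to
servers round-robin (real HBase balances them itself). It only models the
idea that a table input format hands out one unit of work per region and
that the scheduler prefers the server hosting that region.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these types come from the HBase API.
final class LocalitySketch {

    // A region: a contiguous, sorted range of rows hosted by one region server.
    static final class Region {
        final String server;
        final List<String> rowKeys;

        Region(String server, List<String> rowKeys) {
            this.server = server;
            this.rowKeys = rowKeys;
        }
    }

    // Cut the sorted row keys into regions of at most maxRowsPerRegion rows
    // and hand the regions to servers round-robin. (Assumption: real HBase
    // splits on size in bytes and does its own balancing.)
    static List<Region> splitIntoRegions(List<String> sortedRowKeys,
                                         int maxRowsPerRegion,
                                         List<String> servers) {
        List<Region> regions = new ArrayList<>();
        for (int start = 0; start < sortedRowKeys.size(); start += maxRowsPerRegion) {
            int end = Math.min(start + maxRowsPerRegion, sortedRowKeys.size());
            String server = servers.get(regions.size() % servers.size());
            regions.add(new Region(server, new ArrayList<>(sortedRowKeys.subList(start, end))));
        }
        return regions;
    }

    // One map task per region, placed on the server that hosts the region.
    // (Assumption: a real scheduler treats this as a preference, not a guarantee.)
    static Map<String, List<String>> planMapTasks(List<Region> regions) {
        Map<String, List<String>> rowsByServer = new LinkedHashMap<>();
        for (Region region : regions) {
            rowsByServer.computeIfAbsent(region.server, s -> new ArrayList<>())
                        .addAll(region.rowKeys);
        }
        return rowsByServer;
    }

    public static void main(String[] args) {
        List<String> videos = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            videos.add(String.format("video-%02d", i));
        }
        List<String> servers = Arrays.asList("rs1", "rs2", "rs3");

        // Small table: everything fits in one region, so one server does all the work.
        System.out.println(planMapTasks(splitIntoRegions(videos, 100, servers)));
        // Larger table (modelled here by a lower threshold): three regions, three servers.
        System.out.println(planMapTasks(splitIntoRegions(videos, 4, servers)));
    }
}
```

In the first call all 10 rows sit in a single region, so only one server
gets any work. In the second call the rows are split into three regions
and each server processes only the rows it hosts.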

Cheers,
Tim


On Sat, Mar 27, 2010 at 6:06 PM, William Kang <weliam.cloud@gmail.com> wrote:
> Hi,
> I am quite confused about how data is distributed in an HBase system.
> For instance, if I store 10 videos in the cells of 10 HTable rows, I assume
> that these 10 videos will be stored on different data nodes (regionservers)
> in HBase. Now, if I write a program that processes these 10 videos in
> parallel, what's going to happen?
> Since I only deployed the program as a jar to the master server in HBase,
> will all videos in the HBase system have to be transferred to the master
> server to get processed?
> 1. Or do I have another option to specify where the computation should
> happen, so that I do not have to transfer the data over the network and can
> use the region server's CPU for the processing?
> 2. Or should I deploy the program jar to each region server so that each
> region server can use its local CPU on its local data? Will HBase do that
> automatically?
> 3. Or do I need to plug M/R into HBase in order to use the local data and
> parallelize the processing?
> Many thanks.
>
>
> William
>