Subject: Re: Schema Design Question
From: Asaf Mesika <asaf.mesika@gmail.com>
To: user@hbase.apache.org
Date: Mon, 29 Apr 2013 08:46:27 +0300

I actually don't see the benefit of saving the data into HBase if all you do
is read it per job id and then purge it. Why not accumulate the data into HDFS
per job id and then dump the file? The way I see it, HBase is good for
querying parts of your data, even if it is only 10 rows. In your case a job
can run to 1 billion rows, so streaming it from HDFS seems faster.
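A minimal sketch of that approach, assuming one HDFS directory per job id
holding newline-delimited record files. The path layout, record format, and
class/method names below are illustrative assumptions, not anything specified
in this thread:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class JobStreamer {

        /** Stream every record accumulated for a job straight out of HDFS, then purge it. */
        public static void streamAndPurge(String jobId) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path jobDir = new Path("/data/jobs/" + jobId); // assumed layout: one directory per job id

            for (FileStatus file : fs.listStatus(jobDir)) {
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(file.getPath()), "UTF-8"));
                try {
                    String record;
                    while ((record = reader.readLine()) != null) {
                        process(record); // feed each record to the report-generation step
                    }
                } finally {
                    reader.close();
                }
            }

            // Once the report is done, the whole job is purged with a single recursive
            // delete, instead of issuing per-row Deletes against an HBase table.
            fs.delete(jobDir, true);
        }

        private static void process(String record) {
            // placeholder for the actual report logic
        }
    }
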
On Saturday, April 27, 2013, Enis Söztutar wrote:

> Hi,
>
> Interesting use case. I think it depends on how many jobIds you expect to
> have. If it is on the order of thousands, I would caution against going the
> one-table-per-jobId approach, since every table carries some master
> overhead, as well as file structures in HDFS. If the number of jobIds is
> manageable, going with separate tables makes sense if you want to
> efficiently delete all the data related to a job.
>
> Also, pre-splitting will depend on the expected number of jobIds / batchIds
> and their ranges versus the desired number of regions. You would want to
> keep the number of regions hosted by a single region server in the low
> tens, so your splits can be across jobs or within jobs depending on
> cardinality. Can you share some more?
>
> Enis
>
>
> On Fri, Apr 26, 2013 at 2:34 PM, Ted Yu wrote:
>
> > My understanding of your use case is that data for different jobIds would
> > be continuously loaded into the underlying table(s).
> >
> > Looks like you can have one table per job. This way you drop the table
> > after the map reduce is complete. In the single-table approach, you would
> > delete many rows in the table, which is not as fast as dropping the
> > separate table.
> >
> > Cheers
> >
> > On Sat, Apr 27, 2013 at 3:49 AM, Cameron Gandevia wrote:
> >
> > > Hi
> > >
> > > I am new to HBase. I have been trying to POC an application and have a
> > > design question.
> > >
> > > Currently we have a single table with the following key design:
> > >
> > > jobId_batchId_bundleId_uniquefileId
> > >
> > > This is an offline processing system, so data would be bulk loaded into
> > > HBase via map/reduce jobs. We only need to support report-generation
> > > queries using map/reduce over a batch (and possibly a single column
> > > filter), with the batchId as the start/end scan key. Once we have
> > > finished processing a job we are free to remove the data from HBase.
> > >
> > > We have varied workloads, so a job could be made up of 10 rows, 100,000
> > > rows or 1 billion rows, with the average falling somewhere around 10
> > > million rows.
> > >
> > > My question is related to pre-splitting. If we have a billion rows all
> > > with the same batchId (our map/reduce scan key), my understanding is we
> > > should perform pre-splitting to create buckets hosted by different
> > > regions. If a job's workload can be so varied, would it make sense to
> > > have a single table containing all jobs? Or should we create one table
> > > per job and pre-split the table for the given workload? If we had
> > > separate tables we could drop them when no longer needed.
> > >
> > > If we didn't have a separate table per job, how should we perform
> > > splitting? Should we choose our largest possible workload and split for
> > > that, even though 90% of our jobs would fall in the lower bound in terms
> > > of row count? Would we experience any issues purging jobs of varying
> > > sizes if everything was in a single table?
> > >
> > > Any advice would be greatly appreciated.
> > >
> > > Thanks
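
For reference, a rough sketch of the one-table-per-job, pre-split-then-drop
approach discussed above, written against the current HBase client API rather
than the 0.94-era one in use at the time. The table naming, column family,
and split-point scheme are assumptions for illustration only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptor;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PerJobTable {

        /** Create a table for one job, pre-split so the bulk load spreads across regions. */
        public static void createJobTable(String jobId, int numRegions) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {

                TableName table = TableName.valueOf("job_" + jobId); // assumed naming scheme

                // Assumes row keys inside a per-job table start with a zero-padded numeric
                // batchId, so evenly spaced prefixes avoid hot-spotting one region server.
                byte[][] splits = new byte[Math.max(numRegions - 1, 0)][];
                for (int i = 1; i < numRegions; i++) {
                    long splitPoint = (long) i * 100000L / numRegions;
                    splits[i - 1] = Bytes.toBytes(String.format("%08d", splitPoint));
                }

                TableDescriptor desc = TableDescriptorBuilder.newBuilder(table)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d")) // assumed family
                        .build();

                if (splits.length > 0) {
                    admin.createTable(desc, splits);
                } else {
                    admin.createTable(desc);
                }
            }
        }

        /** Purge a finished job by dropping its table instead of deleting rows one by one. */
        public static void dropJobTable(String jobId) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                TableName table = TableName.valueOf("job_" + jobId);
                admin.disableTable(table);
                admin.deleteTable(table);
            }
        }
    }

Whether zero-padded batchId prefixes match your real keys is the assumption to
check; the point is only that split points should follow however the
bulk-loaded keys are actually distributed for each job.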