From: Alok Kumar
To: user@hbase.apache.org
Date: Tue, 5 Aug 2014 17:56:22 +0530
Subject: Re: Question on the number of column families

You could narrow the number of rows to scan by using Filters. I don't
think you can optimize down to column-level I/O, though.
The block cache holds the data actually read from HDFS, per column family.
If your scan fetches random (or all) columns, you are going to hit the
blocks of every column family anyway, and "irrelevant" data will end up in
the block cache.
You could also limit the columns you want to fetch by setting them on the
scan on the client side; that will save network I/O.

Is your row size really 130 * 5 MB = 650 MB?

Thanks
Alok


On Tue, Aug 5, 2014 at 5:17 PM, innowireless TaeYun Kim <taeyun.kim@innowireless.co.kr> wrote:

> Plus,
> since most of the time a client will display an area that does not fit in
> 500x500, Scan operations are required. (Get is not enough.)
> So I'm worried that on scanning, a lot of irrelevant column data (data with
> the same rowkey, which is the position on the grid) would be read into the
> block cache, unless each column is placed in its own column family.
>
>
> -----Original Message-----
> From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
> Sent: Tuesday, August 05, 2014 8:36 PM
> To: user@hbase.apache.org
> Subject: RE: Question on the number of column families
>
> Thank you for your reply.
>
> I can decrease the size of the column values if it's not good for HBase.
> BTW, the values are for a point on a grid cell on a map.
> 250000 is 500x500, and 500x500 is somewhat related to the size of the
> client screen that displays the values on a map.
> Normally a client requests the values for the area that is displayed on
> the screen.
>
>
> -----Original Message-----
> From: Alok Kumar [mailto:alokawi@gmail.com]
> Sent: Tuesday, August 05, 2014 8:24 PM
> To: user@hbase.apache.org
> Subject: Re: Question on the number of column families
>
> Hi,
>
> HBase creates HFiles per column family. Having 130 column families is
> really not recommended.
> It will increase the number of file pointers (open file count) underneath.
>
> If you are sure which columns are "frequently" accessed by users, you
> could consider putting them in one column family, and the "non-frequently"
> accessed ones in another.
> Btw, a column value size of ~5MB is something to consider. We should wait
> for some expert advice here.
>
>
> Thanks
> Alok
>
>
> On Tue, Aug 5, 2014 at 4:50 PM, innowireless TaeYun Kim <taeyun.kim@innowireless.co.kr> wrote:
>
> > Plus,
> > the size of the value of each field can be ~5MB, since max 250000
> > lines of the source data will be merged into one record, to match the
> > request pattern.
> >
> >
> > -----Original Message-----
> > From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
> > Sent: Tuesday, August 05, 2014 8:11 PM
> > To: user@hbase.apache.org
> > Subject: Question on the number of column families
> >
> > Hi,
> >
> > According to http://hbase.apache.org/book/number.of.cfs.html, having
> > more than 2~3 column families is strongly discouraged.
> >
> > BTW, in my case, records on a table have the following characteristics:
> >
> > - The table is read-only. It is bulk-loaded once. When new data is
> > ready, a new table is created and the old table is deleted.
> >
> > - The size of the source data can be hundreds of gigabytes.
> >
> > - A record has about 130 fields.
> >
> > - The number of fields in a record is fixed.
> >
> > - The names of the fields are also fixed. (It's like a table in an RDBMS.)
> >
> > - About 40 (it varies) fields mostly have values, while the other fields
> > are mostly empty (null in an RDBMS).
> >
> > - It is unknown which fields will be dense. It depends on the source data.
> >
> > - Fields are accessed independently. Normally a user requests just one
> > field. A user can request several fields.
> >
> > - The range of a range query is the same for all fields. (No wider,
> > no narrower, regardless of the data density.)
> >
> > To me, it seems that it would be more efficient to have one
> > column family for each field, since it would cost less disk I/O:
> > only the needed column data would be read.
> >
> > Can the table have 130 column families in this case?
> >
> > Or must all the columns be in one column family?
> >
> > Thanks.
> >
>
> --
> Alok Kumar
> Email : alokawi@gmail.com
> http://sharepointorange.blogspot.in/
> http://www.linkedin.com/in/alokawi
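
The row-size question raised in this thread (130 fields, values of up to ~5MB, about 40 fields dense, 250000 points merged into one value) can be checked with a quick back-of-the-envelope calculation. This is only a sketch: the helper name is made up for illustration and is not part of HBase, and it assumes every populated field reaches the ~5MB maximum and that 1MB = 1,000,000 bytes, neither of which the thread states.

```python
# Rough upper bounds on row size, using only the figures quoted in the thread.
# row_size_mb is a hypothetical helper for illustration, not an HBase API.

def row_size_mb(populated_fields, max_value_mb=5):
    """Upper bound on row size if every populated field holds a max-size value."""
    return populated_fields * max_value_mb

TOTAL_FIELDS = 130            # "A record has about 130 fields"
DENSE_FIELDS = 40             # "About 40 (it varies) fields mostly have values"
POINTS_PER_VALUE = 500 * 500  # "250000 is 500x500"

all_fields_mb = row_size_mb(TOTAL_FIELDS)   # every field populated
dense_mb = row_size_mb(DENSE_FIELDS)        # only the dense fields populated

# Assumption: 1MB = 1,000,000 bytes.
bytes_per_point = 5 * 1_000_000 / POINTS_PER_VALUE

print("all fields populated:", all_fields_mb, "MB")
print("dense fields only:", dense_mb, "MB")
print("bytes per grid point:", bytes_per_point)
```

Under these assumptions the worst case matches the 130 * 5 = 650MB figure asked about above, while a row with only the dense fields populated would be closer to 200MB.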