Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (nike.apache.org: domain of alokawi@gmail.com designates
 74.125.82.178 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <000401cfb0a2$f8519380$e8f4ba80$@innowireless.co.kr>
References: <000001cfb09d$df123460$9d369d20$@innowireless.co.kr>
	<000801cfb09f$47a09a70$d6e1cf50$@innowireless.co.kr>
	<CAP=P7hs-KiUqCLrnZnz+rUxU+KRryJGCotMx+6GWeS5OSZz03Q@mail.gmail.com>
	<000301cfb0a1$7a8910a0$6f9b31e0$@innowireless.co.kr>
	<000401cfb0a2$f8519380$e8f4ba80$@innowireless.co.kr>
Date: Tue, 5 Aug 2014 17:56:22 +0530
Message-ID: 
 <CAP=P7huLWFS5526i+7SAPU+ftN6D3U7enVODx73th=qVm-ZQUw@mail.gmail.com>
Subject: Re: Question on the number of column families
From: Alok Kumar <alokawi@gmail.com>
To: user@hbase.apache.org
Content-Type: multipart/alternative; boundary=047d7bf10ade9f78fc04ffe0f6da

--047d7bf10ade9f78fc04ffe0f6da
Content-Type: text/plain; charset=UTF-8

You can narrow the number of rows to scan by using Filters, but I don't
think you can optimize I/O down to the column level.

The block cache holds the data actually read from HDFS, and that data is
organized per column family. If your scans fetch arbitrary (or all) columns,
you are going to hit the blocks of every column family anyway, and
"irrelevant" data will still end up in the block cache.
You could limit the scan to the columns you want to fetch by setting them on
the client side; that will save network I/O.
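
To make the difference between what is read and what is returned concrete,
here is a toy model in Python. It is not HBase code: the helper names
build_blocks and scan, and the block size of 4 cells, are made up for
illustration. It assumes the cells of one column family are stored sorted by
row and then by qualifier and cut into fixed-size blocks, and it ignores seek
optimizations and very large cells.

```python
from itertools import product

BLOCK_CELLS = 4  # hypothetical: cells per block (real blocks are sized in bytes)


def build_blocks(rows, qualifiers, block_cells=BLOCK_CELLS):
    """Lay out one family's cells sorted by (row, qualifier), then chunk them."""
    cells = sorted(product(rows, qualifiers))
    return [cells[i:i + block_cells] for i in range(0, len(cells), block_cells)]


def scan(blocks, wanted):
    """Return (blocks touched, cells read, cells returned) for the wanted qualifiers."""
    touched = [b for b in blocks if any(q in wanted for _, q in b)]
    cells_read = sum(len(b) for b in touched)
    cells_returned = sum(1 for b in touched for _, q in b if q in wanted)
    return len(touched), cells_read, cells_returned


rows = [f"row{i:02d}" for i in range(10)]
quals = ["a", "b", "c", "d"]

# One family holding all four columns: every block mixes qualifiers.
single_family = build_blocks(rows, quals)
# One family per column: each store holds only its own qualifier.
per_column = {q: build_blocks(rows, [q]) for q in quals}

restricted = scan(single_family, {"a"})      # one column from the shared family
everything = scan(single_family, set(quals))  # all columns from the shared family
separate = scan(per_column["a"], {"a"})      # one column from its own family

print(restricted, everything, separate)
```

In this model, restricting the scan to column "a" in the shared family
returns 10 cells but still reads all 40, so only the transfer shrinks. With a
family per column the same request reads 10 cells. When every column is
wanted, the shared family reads 40 and returns 40, so nothing is wasted.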

Is your row size really 130 * 5 MB = 650 MB?
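
The arithmetic behind that number, as a quick sketch. The function name
row_size_bytes is made up for illustration; the figures (130 fields, about 40
of them usually populated, values up to ~5MB) are the ones given earlier in
this thread, and 1 MB is taken as 1024 * 1024 bytes.

```python
MB = 1024 * 1024  # assumption: binary megabytes


def row_size_bytes(num_fields, value_bytes):
    """Upper bound on row size when every field holds a value of value_bytes."""
    return num_fields * value_bytes


worst_case = row_size_bytes(130, 5 * MB)  # every field populated
dense_only = row_size_bytes(40, 5 * MB)   # only the ~40 dense fields populated

print(worst_case // MB, dense_only // MB)  # 650 200
```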

Thanks
Alok

On Tue, Aug 5, 2014 at 5:17 PM, innowireless TaeYun Kim <
taeyun.kim@innowireless.co.kr> wrote:

> Plus,
> Since most of the time a client will display an area that does not fit in
> 500x500, Scan operations are required (Get is not enough).
> So I'm worried that, on scanning, a lot of irrelevant column data (data with
> the same rowkey, which is the position on the grid) would be read into the
> block cache unless each column is placed in its own column family.
>
>
> -----Original Message-----
> From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
> Sent: Tuesday, August 05, 2014 8:36 PM
> To: user@hbase.apache.org
> Subject: RE: Question on the number of column families
>
> Thank you for your reply.
>
> I can decrease the size of the column values if that is not good for HBase.
> BTW, the values are for a point on a grid cell on a map.
> 250000 is 500x500, and 500x500 is somewhat related to the size of the
> client screen that displays the values on a map.
> Normally a client requests the values for the area that is displayed on
> the screen.
>
>
> -----Original Message-----
> From: Alok Kumar [mailto:alokawi@gmail.com]
> Sent: Tuesday, August 05, 2014 8:24 PM
> To: user@hbase.apache.org
> Subject: Re: Question on the number of column families
>
> Hi,
>
> HBase creates HFiles per column family. Having 130 column families is
> really not recommended.
> It will increase the number of file pointers (open file count) underneath.
>
> If you are sure which columns are "frequently" accessed by users, you
> could consider putting them in one column family and the "non-frequently"
> accessed ones in another.
> Btw, a column value size of ~5MB is something to consider. We should wait
> for some expert advice here!!
>
>
> Thanks
> Alok
>
>
> On Tue, Aug 5, 2014 at 4:50 PM, innowireless TaeYun Kim <
> taeyun.kim@innowireless.co.kr> wrote:
>
> > Plus,
> > the size of the value of each field can be ~5MB, since max 250000
> > lines of the source data will be merged into one record, to match the
> > request pattern.
> >
> >
> > -----Original Message-----
> > From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
> > Sent: Tuesday, August 05, 2014 8:11 PM
> > To: user@hbase.apache.org
> > Subject: Question on the number of column families
> >
> > Hi,
> >
> >
> >
> > According to http://hbase.apache.org/book/number.of.cfs.html, having
> > more than 2~3 column families is strongly discouraged.
> >
> >
> >
> > BTW, in my case, records on a table have the following characteristics:
> >
> >
> >
> > - The table is read-only. It is bulk-loaded once. When new data is
> > ready, a new table is created and the old table is deleted.
> >
> > - The size of the source data can be hundreds of gigabytes.
> >
> > - A record has about 130 fields.
> >
> > - The number of fields in a record is fixed.
> >
> > - The names of the fields are also fixed. (it's like a table in RDBMS)
> >
> > - About 40 (it varies) fields mostly have values, while the other fields
> > are mostly empty (null in RDBMS).
> >
> > - It is unknown which field will be dense. It depends on the source data.
> >
> > - Fields are accessed independently. Normally a user requests just one
> > field. A user can request several fields.
> >
> > - The range of the range query is the same for all fields. (No wider,
> > no narrower, regardless of the data density)
> >
> > To me, it seems that it would be more efficient to have one column
> > family for each field, since that would cost less disk I/O: only the
> > needed column data would be read.
> >
> >
> >
> > Can the table have 130 column families for this case?
> >
> > Or must all the columns be in one column family?
> >
> >
> >
> > Thanks.
> >
> >
> >
> >
> >
>
>
> --
> Alok Kumar
> Email : alokawi@gmail.com
> http://sharepointorange.blogspot.in/
> http://www.linkedin.com/in/alokawi
>
>

--047d7bf10ade9f78fc04ffe0f6da--