hbase-user mailing list archives

From Ken Hampson <hamps...@gmail.com>
Subject Re: Table/column layout
Date Mon, 13 Jun 2016 02:33:46 GMT
Hi, Anil:

Thanks for the feedback! I'll proceed with the non-short column-naming.
It's good to have some feedback from real-world, production cases.

Thanks again,
- Ken

On Sat, Jun 11, 2016 at 2:47 PM anil gupta <anilgupta84@gmail.com> wrote:

> My 2 cents:
>
> #1. HBase's version timestamps are used internally for storing and purging
> historical data on the basis of TTL. If you build an app that toys with
> timestamps, you might run into issues, so you need to be very careful there.
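The TTL caveat above can be sketched with a toy model (pure Python, not the real HBase internals; the millisecond cutoff math mirrors how a wall-clock TTL sweep would treat the timestamp slot):

```python
import time

def purge_expired(cells, ttl_seconds, now_ms):
    """Toy TTL purge: drop any version whose timestamp slot is older
    than the TTL cutoff, as a wall-clock TTL sweep would."""
    cutoff = now_ms - ttl_seconds * 1000
    return {ts: v for ts, v in cells.items() if ts >= cutoff}

now_ms = int(time.time() * 1000)

# Wall-clock version timestamps survive a 1-day TTL as expected:
wallclock = {now_ms - 1000: "old", now_ms: "new"}
assert len(purge_expired(wallclock, 86400, now_ms)) == 2

# Caller-chosen logical versions (1, 2, 3, ...) look billions of
# milliseconds old to the same sweep and are purged immediately:
logical = {1: "analysis-run-1", 2: "analysis-run-2"}
assert purge_expired(logical, 86400, now_ms) == {}
```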
>
> #2. HBase usually suggests keeping column names to around 5-6 characters,
> because HBase stores data as key-value pairs and the qualifier is repeated
> in every cell. But it's hard to keep doing that in **real-world apps**.
> When you use block encoding/compression, the performance penalty of wide
> columns is reduced. For example, Apache Phoenix uses FAST_DIFF encoding by
> default because of its non-short column names.
> Here is another blog post that discusses the performance of
> encoding/compression:
>
> http://hadoop-hbase.blogspot.com/2016/02/hbase-compression-vs-blockencoding_17.html
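A greatly simplified sketch of the idea behind the PREFIX/FAST_DIFF block encodings, showing why long repeated qualifiers compress well (pure Python; the flattened `row/family:qualifier` key layout is illustrative, not HBase's on-disk format):

```python
def prefix_encode(keys):
    """Store each sorted key as (shared_prefix_len, suffix) relative to
    the previous key -- the core trick behind prefix/diff encodings."""
    out, prev = [], ""
    for k in keys:
        common = 0
        for a, b in zip(prev, k):
            if a != b:
                break
            common += 1
        out.append((common, k[common:]))
        prev = k
    return out

# Adjacent cells in a row share row key, family, and most of a long
# qualifier, so only the trailing difference is stored per cell:
keys = [f"rowkey01/d:analysisfoo_4_column{i}" for i in range(1, 4)]
encoded = prefix_encode(keys)
raw_bytes = sum(len(k) for k in keys)
encoded_bytes = sum(1 + len(suffix) for _, suffix in encoded)
assert encoded[1][1] == "2"          # only the changed tail is kept
assert encoded_bytes < raw_bytes     # long shared qualifiers shrink a lot
```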
> I have been using user-friendly column names (more readable rather than
> short abbreviations) and I still get decent performance in my apps.
> (Obviously, YMMV; my apps are performing within our SLA.)
> In prod, I have a table with 1100+ columns, and the column names are not
> short. Hence, I would recommend going ahead with your non-short column
> naming. You might need to try out different encoding/compression
> combinations to see what gives you the best performance.
>
> HTH,
> Anil Gupta
>
> On Fri, Jun 10, 2016 at 8:16 PM, Ken Hampson <hampsonk@gmail.com> wrote:
>
> > I realize that was probably a bit of a wall of text... =)
> >
> > So, TL;DR: I'm wondering:
> > 1) If people have used and had good experiences with caller-specified
> > version timestamps (esp. given the caveats in the HBase book doc re:
> > issues with deletions and TTLs).
> >
> > 2) About suggestions for optimal column naming for potentially large
> > numbers of different column groupings for very wide tables.
> >
> > Thanks,
> > - Ken
> >
> > On Tue, Jun 7, 2016 at 10:52 PM Ken Hampson <hampsonk@gmail.com> wrote:
> >
> > > Hi:
> > >
> > > I'm currently using HBase 1.1.2 and am in the process of determining
> > > how best to proceed with the column layout for an upcoming expansion
> > > of our data pipeline.
> > >
> > > Background:
> > >
> > > Table A: billions of rows, 1.3 TB (with snappy compression), rowkey is
> > sha1
> > > Table B: billions of rows (more than Table A), 1.8 TB (with snappy
> > > compression), rowkey is sha1
> > >
> > >
> > > These tables represent data obtained via a combined batch/streaming
> > > process. We want to expand our data pipeline to run an assortment of
> > > analyses on these tables (both batch and streaming) and be able to
> > > store the results in each table as appropriate. Table A is a set of
> > > unique entries with some example data, whereas Table B is correlated
> > > to Table A (via Table A's sha1) but is not de-duplicated (that is to
> > > say, it contains contextual data).
> > >
> > > For the expansion of the data pipeline, we want to store the results
> > > either in Table A, if context is not needed, or in Table B, if
> > > context is needed. We have a theoretically unlimited number of
> > > different analyses that we may want to perform and store the results
> > > for; that is to say, I need to assume there will be a substantial
> > > number of data sets that need to be stored in these tables, which
> > > will grow over time and could each themselves potentially be somewhat
> > > wide in terms of columns.
> > >
> > > Originally, I had considered storing these in column families, where
> > > each analysis is grouped together in a different column family.
> > > However, I have read in the HBase book documentation that HBase does
> > > not perform well with many column families (a few at most, ~10 max),
> > > so I have discarded this option.
> > >
> > > The next two options both involve using wide tables with many columns
> > > in a single column family (e.g. "d"), where all the various analyses
> > > would be grouped into the same family across a potentially large
> > > number of columns in total. Each of these analyses needs to maintain
> > > its own versions so we can correlate the data from each one. The
> > > variants which come to mind to accomplish that, and on which I would
> > > appreciate some feedback, are:
> > >
> > >    1. Use HBase's native versioning to store the version of the
> > >    analysis
> > >    2. Encode a version in the column name itself
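Variant 2 above can be sketched as a simple qualifier-naming convention (pure Python; the `analysisfoo_4_column1` layout is just the example from later in this message, not a fixed scheme):

```python
def make_qualifier(analysis, version, column):
    """Variant 2: fold the analysis version into the qualifier itself,
    e.g. analysisfoo_4_column1."""
    return f"{analysis}_{version}_{column}"

def parse_qualifier(qualifier):
    """Recover (analysis, version, column); assumes the analysis name
    itself contains no underscore."""
    analysis, version, column = qualifier.split("_", 2)
    return analysis, int(version), column

q = make_qualifier("analysisfoo", 4, "column1")
assert q == "analysisfoo_4_column1"
assert parse_qualifier(q) == ("analysisfoo", 4, "column1")
```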
> > >
> > > I know the HBase native versions use the server's timestamp by
> > > default but can take any long value, so we could assign a particular
> > > time value to be a version of a particular analysis. However, the doc
> > > also warned that there could be negative ramifications of this,
> > > because HBase uses the versions internally for things like TTL-based
> > > deletes/maintenance. Do people use versions in this way? Are the TTL
> > > issues of great concern? (We likely won't be deleting things often
> > > from these tables, but can't guarantee that we won't ever do so.)
> > >
> > > Encoding a version in the column name itself would make the column
> > > names bigger, and I know it's encouraged for column names to be as
> > > small as possible.
> > >
> > > Adjacent to the native-version-or-not question, there's the general
> > > column naming. I was originally thinking of having a prefix followed
> > > by the column name, optionally with the version in the middle
> > > depending on whether option 1 or 2 is chosen above. This would allow
> > > prefix filters to be used during gets/scans to gather all columns for
> > > a given analysis type, etc., but it would perhaps result in larger
> > > column names across billions of rows.
> > >
> > > e.g. *analysisfoo_4_column1*
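The prefix-filter idea can be emulated in miniature (pure Python; in the real Java client this would be a ColumnPrefixFilter on a Get or Scan, and the cell values below are made up):

```python
def filter_by_prefix(row_cells, prefix):
    """Keep only the qualifiers that start with the given prefix,
    the way a column-prefix filter narrows a get/scan."""
    return {q: v for q, v in row_cells.items() if q.startswith(prefix)}

cells = {
    "analysisfoo_4_column1": b"a",
    "analysisfoo_4_column2": b"b",
    "analysisbar_1_column1": b"c",
}
foo = filter_by_prefix(cells, "analysisfoo_")
assert sorted(foo) == ["analysisfoo_4_column1", "analysisfoo_4_column2"]
```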
> > >
> > > In practice, is this done, and can it perform well? Or is it better
> > > to pick a fixed width and use some number in its place that is then
> > > translated via, say, another table?
> > >
> > > e.g. *100000_1000_100000* (or something to that effect -- fixed-width
> > > numbers that are stand-in ids for potentially longer descriptions).
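The fixed-width-id variant amounts to keeping a small dictionary table on the side (pure Python sketch; the id values and zero-padding widths are arbitrary illustrations, and the registries would live in that other translation table):

```python
# Hypothetical id registries that would be stored in a separate table.
ANALYSIS_IDS = {"analysisfoo": 100000, "analysisbar": 100001}
COLUMN_IDS = {"column1": 100000, "column2": 100001}

def make_fixed_qualifier(analysis, version, column):
    """Fixed-width qualifier: 6-digit analysis id, 4-digit version,
    6-digit column id, e.g. 100000_0004_100000."""
    return (f"{ANALYSIS_IDS[analysis]:06d}_"
            f"{version:04d}_"
            f"{COLUMN_IDS[column]:06d}")

q = make_fixed_qualifier("analysisfoo", 4, "column1")
assert q == "100000_0004_100000"
# Every qualifier has the same byte width, regardless of the
# human-readable names behind the ids:
assert len(q) == len(make_fixed_qualifier("analysisbar", 1000, "column2"))
```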
> > >
> > > Thanks,
> > > - Ken
> > >
> >
>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>
