hbase-user mailing list archives

From Nick Dimiduk <ndimi...@gmail.com>
Subject Re: 1 table, 1 dense CF => N tables, 1 dense CF ?
Date Fri, 09 Jan 2015 20:03:21 GMT
I haven't written against this API yet, so I don't know all these answers
off the top of my head. The hooks you're interested in are the preCompact*
methods in
http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/coprocessor/RegionObserver.html
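
Untested sketch, to make the idea concrete: wiring a TTL-style expiration
policy (the FYI in the quoted thread below) into preCompact would look
roughly like this. The class name and the fixed retention constant are
placeholders; a real policy would look up the per-user retention:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.InternalScanner;
    import org.apache.hadoop.hbase.regionserver.ScanType;
    import org.apache.hadoop.hbase.regionserver.Store;

    public class ExpiringCompactionObserver extends BaseRegionObserver {

      // Placeholder: a real implementation would resolve per-user retention.
      private static final long RETENTION_MS = 30L * 24 * 60 * 60 * 1000;

      @Override
      public InternalScanner preCompact(
          ObserverContext<RegionCoprocessorEnvironment> e, Store store,
          final InternalScanner scanner, ScanType scanType) throws IOException {
        final long cutoff = System.currentTimeMillis() - RETENTION_MS;
        // Wrap the compaction scanner; cells dropped here never reach the
        // rewritten HFile, so no explicit Deletes are needed.
        return new InternalScanner() {
          @Override
          public boolean next(List<Cell> results) throws IOException {
            boolean more = scanner.next(results);
            for (Iterator<Cell> it = results.iterator(); it.hasNext();) {
              if (it.next().getTimestamp() < cutoff) {
                it.remove();
              }
            }
            return more;
          }

          @Override
          public boolean next(List<Cell> results, int limit) throws IOException {
            return next(results); // simplified for this sketch
          }

          @Override
          public void close() throws IOException {
            scanner.close();
          }
        };
      }
    }

The hook fires for both minor and major compactions; I believe the ScanType
argument (and the overload that takes a CompactionRequest) lets you tell the
two apart if you only want this behavior during majors.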

On Fri, Jan 9, 2015 at 6:35 AM, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:

> Hi,
>
> What Nick suggests below about using Compaction Coprocessor sounds
> potentially very useful for us.  Q below.
>
> On Wed, Jan 7, 2015 at 8:21 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>
> > Not to dig too deep into ancient history, but Tsuna's comments are mostly
> > still relevant today, except for...
> >
> > > You also generally end up with fewer, bigger regions, which is almost
> > > always better.  This entails that your RS are writing more data to
> > > fewer WALs, which leads to more sequential writes across the board.
> > > You'll end up with fewer HLogs, which is also a good thing.
> >
> >
> > HBase uses one WAL per region server, and has for as long as I've paid
> > attention. Unless I've missed something, the number of tables doesn't
> > change that.
> >
> > > If you use HBase's client (which is most likely the case as the only
> > > other alternative is asynchbase), beware that you need to create one
> > > HTable instance per table per thread in your application code.
> >
> >
> > You can still write your client application this way, but the preferred
> > idiom is to use a single Connection instance whose underlying resources
> > are shared across all HTable instances. This pattern is reinforced in
> > the new client API introduced in 1.0.
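
For reference, the single-Connection idiom under the 1.0 API looks roughly
like this (the table name is only an example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Table;

    Configuration conf = HBaseConfiguration.create();
    // One heavyweight, thread-safe Connection for the whole application...
    try (Connection connection = ConnectionFactory.createConnection(conf)) {
      // ...and lightweight, short-lived Table instances per thread/operation.
      try (Table table = connection.getTable(TableName.valueOf("metrics"))) {
        // issue Gets/Puts/Scans against table here
      }
    }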
> >
> > FYI, I think you can write a compaction coprocessor that implements your
> > data expiration policy through normal compaction operations, thereby
> > removing the need for the (expensive?) scan-and-write-deletes pattern
> > entirely.
> >
>
> We actually do 2 types of full scans:
> 1) scan everything and delete rows > N days old, where N can be different
> for different users
> 2) scan everything and merge multiple rows into 1 row via HBaseHUT -
> https://github.com/sematext/HBaseHUT
>
> 2) is more expensive than 1).
> I'm wondering if we could use Compaction Coprocessor for 2)?  HBaseHUT
> needs to be able to grab N rows and merge them into 1, delete those N rows,
> and just write that 1 new row.  This N could be several thousand rows.
> Could Compaction Coprocessor really be used for that?
>
> Also, would that come into play during minor or major compactions or both?
>
> Thanks,
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
>
> >
> > -n
> >
> > On Wed, Jan 7, 2015 at 9:27 AM, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > It's been asked before, but I didn't find any *definitive* answers, and
> > > a lot of the answers I did find are from a whiiiile back.
> > >
> > > e.g. Tsuna provided pretty convincing info here:
> > > http://search-hadoop.com/m/xAiiO8ttU2/%2522%2522I+generally+recommend+to+stick+to+a+single+table%2522&subj=Re+One+table+or+multiple+tables+
> > >
> > > ... but that is from 3 years ago.  Maybe things have changed?
> > >
> > > Here's our use case:
> > >
> > > Data/table layout:
> > > * HBase is used for storing metrics at different granularities (1 min,
> > > 5 min, ... - a total of 6 different granularities)
> > > * It's a multi-tenant system
> > > * Keys are carefully crafted and include userId + number, where this
> > > number contains the time and the granularity
> > > * Everything's in 1 table and 1 CF
> > >
> > > Access:
> > > * We only access 1 system at a time, for a specific time range, and a
> > > specific granularity
> > > * We periodically scan ALL data and delete data older than N days,
> > > where N varies from user to user
> > > * We periodically scan ALL data and merge multiple rows (of the same
> > > granularity) into 1
> > >
> > > Question:
> > > Would there be any advantage in having 6 tables - one for each
> > > granularity - instead of having everything in 1 table?
> > > Assume each table would still have just 1 CF and the keys would
> > > remain the same.
> > >
> > > Thanks,
> > > Otis
> > > --
> > > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > > Solr & Elasticsearch Support * http://sematext.com/
> > >
> >
>
