hbase-user mailing list archives

From jeremy p <athomewithagroove...@gmail.com>
Subject Re: Is there a problem with having 4000 tables in a cluster?
Date Wed, 25 Sep 2013 00:11:48 GMT
Perhaps there has been some confusion.  I'm concerned about hotspotting on
read, not on write.

So, for example, let's say it's time for me to process a 'document'.  For
the sake of this example, let's say the words are all 10 characters long.
I spin up 200 mappers; each one takes a 'line' from the 'document'.

Mapper #1 gets to the first word in the line.  That word is 'FO?BARQUUX'
where ? is a wildcard.  So, if my keys take the form [POSITION]_[WORD],
that means I'm asking HBase to do a filtered range query starting at
'1_FOABARQUUX' and ending at '1_FOZBARQUUX'.
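
In code, that first lookup would be something like the following (just a
sketch against the 0.94-era Java client; the table name 'words' is made
up, and note that HBase treats the stop row as exclusive, so I append a
zero byte to keep '1_FOZBARQUUX' itself in range):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.RegexStringComparator;
    import org.apache.hadoop.hbase.filter.RowFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WildcardLookup {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "words");
        // the range covers every possible expansion of the '?' wildcard
        Scan scan = new Scan(Bytes.toBytes("1_FOABARQUUX"),
            Bytes.toBytes("1_FOZBARQUUX\0"));
        // the regex filter discards rows that fall inside the range but
        // don't actually match the pattern (e.g. '1_FOMZZZZZZZ')
        scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
            new RegexStringComparator("^1_FO.BARQUUX$")));
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
          System.out.println(Bytes.toString(r.getRow()));
        }
        scanner.close();
        table.close();
      }
    }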

Mapper #2 is also starting at the first word in the line.  Let's say its
word is 'BA?BA?QUUX'.  This means it needs to do a filtered range query,
starting at '1_BAABAAQUUX' and ending at '1_BAZBAZQUUX'.  If I split this
table based on POSITION, both of these requests will go to the region that
holds POSITION 1.  Now realize that I have 200 mappers all working at the
same time.  This means all of my mappers will be hitting that region at the
same time.  Not good.

So my solution is to have a separate table for each POSITION, and use the
WORD as a key.  I'll split the tables such that each region has roughly the
same number of words.  This way, I can do my filtered range queries, and
have the keys distributed fairly well around my cluster.
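
Creating those pre-split tables would look something like this (again just
a sketch; the table names, the column family 'd', and the split words are
placeholders, and the real split words would come from sampling the actual
word distribution):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreatePositionTables {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        // placeholder split points; in practice, pick boundary WORDs so
        // each region holds roughly the same number of words
        String[] splitWords = { "E", "J", "O", "T" };
        byte[][] splits = new byte[splitWords.length][];
        for (int i = 0; i < splits.length; i++) {
          splits[i] = Bytes.toBytes(splitWords[i]);
        }
        // one table per POSITION, each keyed by WORD alone
        for (int pos = 1; pos <= 4000; pos++) {
          HTableDescriptor desc = new HTableDescriptor("words_" + pos);
          desc.addFamily(new HColumnDescriptor("d"));
          admin.createTable(desc, splits);
        }
        admin.close();
      }
    }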

Does this make sense?  I'm not sure how salting the keys in the reduce
phase would help, but it's quite possible that I'm missing something.

Thanks for the help!

--Jeremy


On Tue, Sep 24, 2013 at 4:32 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Who talked about salting in the reducer? Why do you want to do that? The
> usecase did not even talk about any reduce phase.
>
> Seems we need more details on what Jeremy wants to achieve.
>
> JM
>
>
> 2013/9/24 Varun Sharma <varun@pinterest.com>
>
> > So you should salt the keys in the reduce phase, but you do not salt the
> > keys in HBase.  That basically means that reducers do not see the keys
> > in sorted order, but they do see all the values for a specific key
> > together.
> >
> > So the hash is essentially a trick that stays within the MapReduce job
> > and does not make it into HBase.  This prevents you from hotspotting a
> > region, and since the hash is not being written into HBase, you can
> > still do your prefix/regex scans etc.
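> >
> > Roughly like this (just a sketch; I'm assuming a standard TableReducer
> > and making up the column family and qualifier names):
> >
> >     import java.io.IOException;
> >     import org.apache.hadoop.hbase.client.Put;
> >     import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> >     import org.apache.hadoop.hbase.mapreduce.TableReducer;
> >     import org.apache.hadoop.hbase.util.Bytes;
> >     import org.apache.hadoop.io.IntWritable;
> >     import org.apache.hadoop.io.Text;
> >
> >     public class UnsaltingReducer
> >         extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
> >       @Override
> >       protected void reduce(Text saltedKey, Iterable<IntWritable> values,
> >           Context context) throws IOException, InterruptedException {
> >         int sum = 0;
> >         for (IntWritable v : values) {
> >           sum += v.get();
> >         }
> >         // the mapper emitted "NN_originalKey"; strip the salt here so
> >         // the row key that reaches HBase is the original, sortable key
> >         String key = saltedKey.toString();
> >         String original = key.substring(key.indexOf('_') + 1);
> >         Put put = new Put(Bytes.toBytes(original));
> >         put.add(Bytes.toBytes("d"), Bytes.toBytes("count"),
> >             Bytes.toBytes(sum));
> >         context.write(null, put);
> >       }
> >     }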
> >
> >
> > > On Tue, Sep 24, 2013 at 4:16 PM, Jean-Marc Spaggiari <
> > > jean-marc@spaggiari.org> wrote:
> >
> > > If you have a fixed length like:
> > > XXXX_AAAAAAA
> > >
> > > Where XXXX is a number from 0000 to 4000 and AAAAAAA is your word,
> > > then simply split by the number?
> > >
> > > Then when you insert each line, it will write to 4000 different
> > > regions, which can be hosted on 4000 different servers if you have
> > > that many.  And there will be no hot-spotting?
> > >
> > > Then when you run the MR job, you will have one mapper per region.
> > > Each region will be evenly loaded.  And if you start 200 jobs, then
> > > each region will have 200 mappers, still evenly loaded, no?
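> > >
> > > For the split itself, something like this (a sketch with the 0.94
> > > admin API; the table name and column family are placeholders):
> > >
> > >     import org.apache.hadoop.conf.Configuration;
> > >     import org.apache.hadoop.hbase.HBaseConfiguration;
> > >     import org.apache.hadoop.hbase.HColumnDescriptor;
> > >     import org.apache.hadoop.hbase.HTableDescriptor;
> > >     import org.apache.hadoop.hbase.client.HBaseAdmin;
> > >     import org.apache.hadoop.hbase.util.Bytes;
> > >
> > >     public class SplitByPosition {
> > >       public static void main(String[] args) throws Exception {
> > >         Configuration conf = HBaseConfiguration.create();
> > >         HBaseAdmin admin = new HBaseAdmin(conf);
> > >         HTableDescriptor desc = new HTableDescriptor("words");
> > >         desc.addFamily(new HColumnDescriptor("d"));
> > >         // boundaries at 0002_, 0003_, ..., 4000_ give one region
> > >         // per POSITION for keys of the form NNNN_WORD
> > >         byte[][] splits = new byte[3999][];
> > >         for (int i = 0; i < splits.length; i++) {
> > >           splits[i] = Bytes.toBytes(String.format("%04d_", i + 2));
> > >         }
> > >         admin.createTable(desc, splits);
> > >         admin.close();
> > >       }
> > >     }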
> > >
> > > Or I might be missing something from your usecase.
> > >
> > > JM
> > >
> > >
> > > 2013/9/24 jeremy p <athomewithagroovebox@gmail.com>
> > >
> > > > Varun : I'm familiar with that method of salting.  However, in this
> > > > case, I need to do filtered range scans.  When I do a lookup for a
> > > > given WORD at a given POSITION, I'll actually be doing a regex on a
> > > > range of WORDs at that POSITION.  If I salt the keys with a hash,
> > > > the WORDs will no longer be sorted, and so I would need to do a full
> > > > table scan for every lookup.
> > > >
> > > > Jean-Marc : What problems do you see with my solution?  Do you have a
> > > > better suggestion?
> > > >
> > > > --Jeremy
> > > >
> > > >
> > > > On Tue, Sep 24, 2013 at 3:16 PM, Varun Sharma <varun@pinterest.com>
> > > > wrote:
> > > >
> > > > > It's better to do some "salting" in your keys for the reduce phase.
> > > > > Basically, make your key be something like "KeyHash + Key" and then
> > > > > decode it in your reducer and write to HBase.  This way you avoid
> > > > > the hotspotting problem on HBase due to MapReduce sorting.
> > > > >
> > > > >
> > > > > On Tue, Sep 24, 2013 at 2:50 PM, Jean-Marc Spaggiari <
> > > > > jean-marc@spaggiari.org> wrote:
> > > > >
> > > > > > Hi Jeremy,
> > > > > >
> > > > > > I don't see any issue for HBase to handle 4000 tables.  However,
> > > > > > I don't think it's the best solution for your use case.
> > > > > >
> > > > > > JM
> > > > > >
> > > > > >
> > > > > > 2013/9/24 jeremy p <athomewithagroovebox@gmail.com>
> > > > > >
> > > > > > > Short description : I'd like to have 4000 tables in my HBase
> > > > > > > cluster.  Will this be a problem?  In general, what problems
> > > > > > > do you run into when you try to host thousands of tables in a
> > > > > > > cluster?
> > > > > > >
> > > > > > > Long description : I'd like the performance advantage of
> > > > > > > pre-split tables, and I'd also like to do filtered range
> > > > > > > scans.  Imagine a keyspace where the key consists of
> > > > > > > [POSITION]_[WORD], where POSITION is a number from 1 to 4000,
> > > > > > > and WORD is a string consisting of 96 characters.  The value
> > > > > > > in the cell would be a single integer.  My app will examine a
> > > > > > > 'document', where each 'line' consists of 4000 WORDs.  For
> > > > > > > each WORD, it'll do a filtered regex lookup.  Only problem?
> > > > > > > Say I have 200 mappers and they all start at POSITION 1; my
> > > > > > > region servers would get hotspotted like crazy.  So my idea is
> > > > > > > to break it into 4000 tables (one for each POSITION), and then
> > > > > > > pre-split the tables such that each region gets an equal
> > > > > > > amount of the traffic.  In this scenario, the key would just
> > > > > > > be WORD.  Dunno if this is a bad idea; I'd be open to
> > > > > > > suggestions.
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > --J
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
