Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (nike.apache.org: domain of bbeaudreault@hubspot.com
 designates 74.125.149.71 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAHXz3_FUTYFKBnid26y+ePDMYFPamA3CBcvGz32rpPpd-XHN6g@mail.gmail.com>
References: 
 <CAHXz3_FUTYFKBnid26y+ePDMYFPamA3CBcvGz32rpPpd-XHN6g@mail.gmail.com>
From: Bryan Beaudreault <bbeaudreault@hubspot.com>
Date: Sat, 12 May 2012 22:07:36 -0400
Message-ID: 
 <CANZDn9uazkZGMsmk+MM4_S40U96bfk2D_ntcpHK_r1JA1Nhk=g@mail.gmail.com>
Subject: Re: MR job for creating splits
To: user@hbase.apache.org
Cc: mapreduce-user@hadoop.apache.org, hbase-user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=e89a8f3ba2a720063004bfe17027

--e89a8f3ba2a720063004bfe17027
Content-Type: text/plain; charset=ISO-8859-1

I took a very similar approach and it worked fine for me.  Just spot-check
the regions afterwards to make sure the boundaries look lexicographically
sorted.  I used ImmutableBytesWritable as my key, and the default Hadoop
sorting for that turned out to sort lexicographically, as required.  Our
HBase rows varied in size, so instead of counting rows we tallied up
KeyValue.getLength() for each KeyValue in a row until the running size
reached a certain limit, then started a new split there.
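
A rough sketch of the size-based tally, in plain Java so it runs without a
cluster.  The names here (SplitKeySketch, Row, chooseSplitKeys,
maxRegionBytes) are made up for illustration and are not from our actual job.
In a real job the per-cell sizes would come from KeyValue.getLength() and the
loop would sit in the single reducer, which sees the keys in sorted order.
compareKeys is only a stand-in for the unsigned byte-wise ordering we observed
with ImmutableBytesWritable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SplitKeySketch {

    /** Hypothetical stand-in for one row: its key plus the size of each cell. */
    static final class Row {
        final byte[] key;
        final int[] cellSizes; // in a real job: KeyValue.getLength() per cell

        Row(byte[] key, int... cellSizes) {
            this.key = key;
            this.cellSizes = cellSizes;
        }
    }

    /** Unsigned, byte-wise lexicographic comparison of two row keys. */
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff; // mask so bytes >= 0x80 sort after 0x7f
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length; // a shorter prefix sorts first
    }

    /**
     * Walks rows that are already sorted by key and returns the start key of
     * every region after the first. A new region starts at the first row seen
     * after the running byte total has reached maxRegionBytes.
     */
    static List<byte[]> chooseSplitKeys(List<Row> sortedRows, long maxRegionBytes) {
        List<byte[]> splits = new ArrayList<>();
        long bytesInRegion = 0;
        for (Row row : sortedRows) {
            if (bytesInRegion >= maxRegionBytes) {
                splits.add(row.key); // this row opens the next region
                bytesInRegion = 0;
            }
            for (int size : row.cellSizes) {
                bytesInRegion += size;
            }
        }
        return splits;
    }

    /** Spot check: true if the keys are in strictly increasing order. */
    static boolean isSorted(List<byte[]> keys) {
        for (int i = 1; i < keys.size(); i++) {
            if (compareKeys(keys.get(i - 1), keys.get(i)) >= 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(new byte[] {'a'}, 15, 25),
                new Row(new byte[] {'b'}, 70),
                new Row(new byte[] {'c'}, 10),
                new Row(new byte[] {'d'}, 100),
                new Row(new byte[] {'e'}, 5));
        List<byte[]> splits = chooseSplitKeys(rows, 100);
        for (byte[] key : splits) {
            System.out.println(new String(key));
        }
        System.out.println("sorted: " + isSorted(splits));
    }
}
```

The first region's start key is left out on purpose, since the split list only
needs the boundaries between regions.  The limit is a soft one: the row that
crosses it still lands in the current region.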

On Sat, May 12, 2012 at 7:21 PM, Something Something <
mailinglists19@gmail.com> wrote:

> Hello,
>
> This is really a MapReduce question, but the output from this will be used
> to create regions for an HBase table.  Here's what I want to do:
>
> Take an input file that contains data about users.
> Sort this file by a key (which consists of a few fields from the row)
> After every x # of rows write the key.
>
>
> Here's how I was going to structure my MapReduce:
>
> public Splitter {
>
>    static int counter;
>
>    private Mapper {
>        map() {
>            Build key by concatenating fields
>            Write key
>            increment counter;
>        }
>    }
>
>    //  # of reducers will be set to 1.  My understanding is that this will
>    //  send the lines to reducer in sorted order one at a time - is this a
>    //  correct assumption?
>    private Reducer {
>         static long i;
>         reduce() {
>             static long splitSize = counter / 300;  //  300 is region size
>             if (i == 0 || i == splitSize) {
>                 Write key;  // this will be used as a 'startkey'.
>                  i = 0;
>             }
>             i++;
>         }
>    }
> }
>
> To summarize, there are 2 questions:
>
> 1)  I am passing # of rows processed by Mapper to Reducer via a static
> counter.  Would this work?  Is there a better way?
> 2)  If I set # of reducers to 1, would the lines be sent to reducer in
> sorted order one at a time?
>
> Thanks in advance for the help.
>

--e89a8f3ba2a720063004bfe17027--