From: Bryan Beaudreault
Date: Sat, 12 May 2012 22:07:36 -0400
Subject: Re: MR job for creating splits
To: user@hbase.apache.org
Cc: mapreduce-user@hadoop.apache.org, hbase-user@hadoop.apache.org

I took a very similar approach and it worked fine for me. Just spot-check the
regions afterwards to make sure they look lexicographically sorted. I used
ImmutableBytesWritable as my key, and the default Hadoop sorting for that
turned out to sort lexicographically, as required.

Our HBase rows varied in size, so instead of counting the number of rows, we
tallied up KeyValue.getLength() for each KeyValue in a row until the size
reached a certain limit.

On Sat, May 12, 2012 at 7:21 PM, Something Something <mailinglists19@gmail.com> wrote:

> Hello,
>
> This is really a MapReduce question, but the output from this will be used
> to create regions for an HBase table. Here's what I want to do:
>
> Take an input file that contains data about users.
> Sort this file by a key (which consists of a few fields from the row)
> After every x # of rows write the key.
>
>
> Here's how I was going to structure my MapReduce:
>
> public Splitter {
>
>     static int counter;
>
>     private Mapper {
>         map() {
>             Build key by concatenating fields
>             Write key
>             increment counter;
>         }
>     }
>
>     // # of reducers will be set to 1. My understanding is that this will
>     // send the lines to reducer in sorted order one at a time - is this a
>     // correct assumption?
>     private Reducer {
>         static long i;
>         reduce() {
>             static long splitSize = counter / 300; // 300 is region size
>             if (i == 0 || i == splitSize) {
>                 Write key; // this will be used as a 'startkey'.
>                 i = 0;
>             }
>             i++;
>         }
>     }
> }
>
> To summarize, there are 2 questions:
>
> 1) I am passing # of rows processed by Mapper to Reducer via a static
> counter. Would this work? Is there a better way?
> 2) If I set # of reducers to 1, would the lines be sent to reducer in
> sorted order one at a time?
>
> Thanks in advance for the help.
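
For what it's worth, the size-based tally described at the top of this reply can be sketched roughly as below. This is only an illustration under stated assumptions, not the code we actually ran: the class name SplitKeySketch and the methods pickSplitKeys and compareBytes are made up for the example, and it works on plain in-memory lists rather than inside a real Reducer. In an actual job the row keys would arrive already sorted at the single reducer, and each row's size would be the sum of KeyValue.getLength() over its KeyValues.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone sketch; none of these names come from HBase or Hadoop.
public class SplitKeySketch {

    // Unsigned lexicographic comparison of two byte arrays, useful for
    // spot-checking that the emitted split keys are in ascending order.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Walks rows in sorted order and tallies their sizes. Once the running
    // tally has reached the limit, the next row's key becomes a region start
    // key and the tally resets. The first region needs no explicit start key.
    static List<byte[]> pickSplitKeys(List<byte[]> sortedRowKeys,
                                      List<Long> rowSizes,
                                      long limit) {
        List<byte[]> splits = new ArrayList<>();
        long tally = 0;
        for (int i = 0; i < sortedRowKeys.size(); i++) {
            if (tally >= limit) {
                splits.add(sortedRowKeys.get(i));
                tally = 0;
            }
            tally += rowSizes.get(i);
        }
        return splits;
    }
}
```

With row keys a, b, c, d, e of sizes 40, 70, 10, 120, 5 and a limit of 100, this picks c and e as start keys. The limit here is an arbitrary number chosen for the example; in practice it would be whatever target region size you are aiming for.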