Message-ID: <1579027249.542141269847287563.JavaMail.jira@brutus.apache.org>
Date: Mon, 29 Mar 2010 07:21:27 +0000 (UTC)
From: "Todd Lipcon (JIRA)"
To: hbase-issues@hadoop.apache.org
Subject: [jira] Updated: (HBASE-2378) Bulk insert with multiple reducers

[ https://issues.apache.org/jira/browse/HBASE-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HBASE-2378:
-------------------------------

    Attachment: hbase-2378.txt

Here's a fix as well as a test case.
The core of the issue was that ImmutableBytesWritable, when comparing Java objects in memory (as opposed to in serialized byte form), counted shorter keys as less than longer keys, even when the shorter key's leading bytes sorted higher. I ran Ruslan's code and was able to use the resulting table successfully. The one modification I made was to explicitly set the sort comparator class to ImmutableBytesWritable.Comparator, though I think it would work without that change.

> Bulk insert with multiple reducers
> ----------------------------------
>
>                 Key: HBASE-2378
>                 URL: https://issues.apache.org/jira/browse/HBASE-2378
>             Project: Hadoop HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.20.3
>            Reporter: Ruslan Salyakhov
>            Assignee: Todd Lipcon
>         Attachments: hbase-2378.txt, HFileOutputFormat.java, my_sample_log_1k.txt, MyHFilesWriter.java, MyKeyComparator.java, MyReadPerformance.java, MySampler.java, TestTotalOrderPartitionerForMyKeys.java, TotalOrderPartitioner.java
>
>
> If I run an MR job to prepare HFiles with more than one reducer, some keys do not appear in the table after the loadtable.rb script runs. With one reducer everything works fine.
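[Editor's note] The ordering discrepancy described in the comment above can be sketched with a standalone example. This is a simplified illustration, not the actual HBase source: `lengthFirstCompare` mimics the described in-memory behavior (shorter keys always sort first), while `lexCompare` is the byte-lexicographic ordering that serialized comparison, and HFile, require. The two orderings disagree on keys like "b" vs "aa".

{code}
public class OrderingSketch {
    // Hypothetical "buggy" ordering: shorter arrays always sort first,
    // mirroring the in-memory behavior described in the comment.
    static int lengthFirstCompare(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return a.length - b.length;
        }
        return lexCompare(a, b);
    }

    // Byte-lexicographic ordering, comparing unsigned byte values
    // left to right, with length as the tiebreaker.
    static int lexCompare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] shortHigh = "b".getBytes();  // shorter, but higher first byte
        byte[] longLow = "aa".getBytes();   // longer, but lower first byte

        // The two orderings disagree, so output sorted one way violates
        // the other ordering's expectations downstream.
        System.out.println(lengthFirstCompare(shortHigh, longLow) < 0); // true
        System.out.println(lexCompare(shortHigh, longLow) > 0);         // true
    }
}
{code}

When reducer output is sorted with one ordering but read back under the other, region boundaries and HFile key checks no longer line up, which is consistent with the symptoms reported below.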
> References:
> http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk
> - the row id must be formatted as an ImmutableBytesWritable
> - the MR job should ensure a total ordering among all keys
> MAPREDUCE-366 (patch-5668-3.txt)
> - TotalOrderPartitioner that uses the new API (attached)
> HBASE-2063
> - patched HFileOutputFormat (attached)
>
> Input data (attached):
> * my_sample_log_1k.txt - sample data, input for MyHFilesWriter
>
> Source (attached):
> * MyKeyComparator.java - comparator for my ImmutableBytesWritable keys
> * TestTotalOrderPartitionerForMyKeys.java - test case for my keys (note that I've set up MyKeyComparator to pass that test)
> * MyHFilesWriter.java - my MR job to prepare HFiles
> * HFileOutputFormat.java - from MAPREDUCE-366
> * TotalOrderPartitioner.java - from MAPREDUCE-366
> * MySampler.java - my RandomSampler based on Sampler from MAPREDUCE-366, except that I had to add the following line to the getSample method (without it, sampling doesn't work):
> {code}
> reader.initialize(splits.get(i), new TaskAttemptContext(job.getConfiguration(), new TaskAttemptID()));
> {code}
>
> Test case:
> # comment out the following line in MyHFilesWriter: //job.setSortComparatorClass(MyKeyComparator.class);
> # hadoop jar keyvalue-poc.jar MyHFilesWriter -in /test_hbase/my_sample_log_1k.txt -out /test_hbase/hfiles/01/ -r 1
> # hadoop jar keyvalue-poc.jar MyHFilesWriter -in /test_hbase/my_sample_log_1k.txt -out /test_hbase/hfiles/02/ -r 2
> # hbase> create 'tst_hfiles_01', {NAME => 'vals'}
> # hbase> create 'tst_hfiles_02', {NAME => 'vals'}
> # hbase org.jruby.Main /usr/lib/hbase-0.20/bin/loadtable.rb tst_hfiles_01 /test_hbase/hfiles/01
> # hbase org.jruby.Main /usr/lib/hbase-0.20/bin/loadtable.rb tst_hfiles_02 /test_hbase/hfiles/02
> # check values for keys
> # uncomment the following line in MyHFilesWriter: //job.setSortComparatorClass(MyKeyComparator.class);
> # hadoop jar keyvalue-poc.jar MyHFilesWriter -in /test_hbase/my_sample_log_1k.txt -out /test_hbase/hfiles/03/ -r 2
>
> For example, results:
> {code}
> hbase(main):006:0* count 'tst_hfiles_01', 100
> Current count: 100, row: 0.14.USA.IL.602.ELMHURST.1.1.0.0
> Current count: 200, row: 0.245.USA.ME.500.PORTLAND.1.1.0.0
> Current count: 300, row: 0.34.USA.FL.Rollup.Rollup.1.1.0.0
> Current count: 400, row: 0.443.USA.CA.803.LOS.ANGELES.1.1.0
> Current count: 500, row: 0.8.USA.CO.751.CASTLE.ROCK.1.1.0
> Current count: 600, row: 1.14.DZA.Rollup.Rollup.Rollup.1.1.0.1
> Current count: 700, row: 1.159.SWE.AB.Rollup.Rollup.1.1.0.1
> Current count: 800, row: 1.17.USA.TN.659.CLARKSVILLE.1.1.0.1
> Current count: 900, row: 1.220.USA.MI.505.SOUTHFIELD.1.1.0.1
> 999 row(s) in 0.0930 seconds
> hbase(main):007:0> count 'tst_hfiles_02', 100
> Current count: 100, row: 0.231.USA.GA.524.BUFORD.1.1.0.1
> Current count: 200, row: 0.4.USA.VA.573.Rollup.1.1.0.0
> Current count: 300, row: 0.9.ROU.B.-1.BUCHAREST.1.1.0.0
> Current count: 400, row: 1.16.USA.IA.679.Rollup.1.1.1.0
> Current count: 500, row: 1.245.NOR.03.-1.OSLO.1.1.0.0
> Current count: 600, row: 0.245.GBR.ENG.826005.BEXLEY.1.1.0.1
> Current count: 700, row: 0.48.GBR.ENG.826027.Rollup.1.1.0.1
> Current count: 800, row: 1.14.SWE.Rollup.Rollup.Rollup.1.1.0.1
> Current count: 900, row: 1.201.GBR.ENG.826005.LONDON.1.1.0.1
> 999 row(s) in 0.1630 seconds
> hbase(main):008:0> get 'tst_hfiles_01', '0.14.USA.IL.602.ELMHURST.1.1.0.0'
> COLUMN        CELL
>  vals:key0    timestamp=1269542753914, value=0
>  vals:key1    timestamp=1269542753914, value=14
>  vals:key2    timestamp=1269542753914, value=USA
>  vals:key3    timestamp=1269542753914, value=IL
>  vals:key4    timestamp=1269542753914, value=602
>  vals:key5    timestamp=1269542753914, value=ELMHURST
>  vals:key6    timestamp=1269542753914, value=1
>  vals:key7    timestamp=1269542753914, value=1
>  vals:key8    timestamp=1269542753914, value=0
>  vals:key9    timestamp=1269542753914, value=0
>  vals:val0    timestamp=1269542753914, value=2
> 11 row(s) in 0.0160 seconds
> hbase(main):009:0> get 'tst_hfiles_02', '0.14.USA.IL.602.ELMHURST.1.1.0.0'
> COLUMN        CELL
> 0 row(s) in 0.0220 seconds
> {code}
>
> With MyKeyComparator enabled, the job fails with:
> {code}
> java.io.IOException: Added a key not lexically larger than previous key=.103.FRA.V.-1.LYON.1.1.0.0valskey0'XXX, lastkey=1.20.USA.AOL.0.AOL.1.1.0.0valsval0'XXX
> 	at org.apache.hadoop.hbase.io.hfile.HFile$Writer.checkKey(HFile.java:551)
> 	at org.apache.hadoop.hbase.io.hfile.HFile$Writer.append(HFile.java:513)
> 	at org.apache.hadoop.hbase.io.hfile.HFile$Writer.append(HFile.java:481)
> 	at com.contextweb.hadoop.hbase.mapred.HFileOutputFormat$1.write(HFileOutputFormat.java:77)
> 	at com.contextweb.hadoop.hbase.mapred.HFileOutputFormat$1.write(HFileOutputFormat.java:49)
> 	at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:508)
> 	at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> 	at org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer.reduce(KeyValueSortReducer.java:46)
> 	at org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer.reduce(KeyValueSortReducer.java:35)
> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:170)
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
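[Editor's note] The "Added a key not lexically larger than previous key" exception quoted in the issue comes from HFile's requirement that appended keys arrive in strictly increasing byte order. A toy version of that check (a hypothetical sketch, not the real HFile.Writer.checkKey) shows how a sort comparator that disagrees with byte order trips it:

{code}
import java.io.IOException;

public class AppendCheckSketch {
    private byte[] lastKey;

    // Mimics the spirit of HFile.Writer's key check: reject any key
    // that is not lexicographically larger than the previous one.
    void append(byte[] key) throws IOException {
        if (lastKey != null && compareBytes(lastKey, key) >= 0) {
            throw new IOException("Added a key not lexically larger than previous key="
                + new String(key) + ", lastkey=" + new String(lastKey));
        }
        lastKey = key;
    }

    // Unsigned byte-lexicographic comparison, length as tiebreaker.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) throws IOException {
        AppendCheckSketch w = new AppendCheckSketch();
        w.append("1.20.USA".getBytes());
        try {
            // A custom sort comparator that disagrees with byte order can
            // emit this key after the one above, tripping the check.
            w.append("0.103.FRA".getBytes());
            System.out.println("no error");
        } catch (IOException e) {
            System.out.println("rejected out-of-order key");
        }
    }
}
{code}

This is why any custom sort comparator used to produce HFiles must agree exactly with the byte-level ordering the reader expects, which is the point of Todd's fix setting ImmutableBytesWritable.Comparator explicitly.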