hbase-user mailing list archives

From yeshwanth kumar <yeshwant...@gmail.com>
Subject Re: Spark HBase Bulk load using HFileFormat
Date Thu, 14 Jul 2016 19:13:50 GMT
Hi,

I have a few questions regarding bulk load.
Do the rows need to be in sorted order, or do the KeyValues within each row
need to be in sorted order?

Sometimes I see the exception between two different row keys, and sometimes I see
it between KeyValue pairs of the same row key.

For example:

Current cell =
123B51E8-574E-4029-BEA7-D0FF7B12DB30/C:Address/1468510623407/Put/vlen=176/seqid=0,
lastCell =
694E24E2-7484-4926-B587-466990F1A017/C:Year/1468510623407/Put/vlen=4/seqid=0

Here the order mismatch is between KeyValues in two different rows,

whereas

 Current cell =
200065494/C:GENERALDEPENDENCYMEDIUM/1468522415075/Put/vlen=176/seqid=0,
 lastCell = 200065494/C:R.PAIDPREP/1468522415075/Put/vlen=10/seqid=0

Here the order mismatch is between KeyValues of the same row.
In which sort order does the HFile output format (HFileOutputFormat2) expect the data?
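
To make the ordering question concrete, here is a minimal pure-Scala sketch of
the cell order HBase uses: row key, then column family, then qualifier (each
compared as unsigned bytes, ascending), then timestamp descending. It does not
use any HBase or Spark classes; Cell, compareBytes and cellOrdering are
hypothetical names made up for this illustration, and the put type and
sequence id are left out. The sample values are the four cells quoted above.

```scala
// Pure-Scala sketch; Cell, compareBytes and cellOrdering are hypothetical names.
object CellOrderSketch {
  // Stand-in for the parts of a KeyValue that take part in ordering.
  final case class Cell(row: String, family: String, qualifier: String, timestamp: Long)

  // Lexicographic comparison of byte arrays, treating each byte as unsigned.
  def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val d = (a(i) & 0xff) - (b(i) & 0xff)
      if (d != 0) return d
      i += 1
    }
    a.length - b.length
  }

  private def bytes(s: String): Array[Byte] = s.getBytes("UTF-8")

  // Row, family and qualifier ascending; timestamp descending (newest first).
  implicit val cellOrdering: Ordering[Cell] = new Ordering[Cell] {
    def compare(x: Cell, y: Cell): Int = {
      val byRow = compareBytes(bytes(x.row), bytes(y.row))
      if (byRow != 0) return byRow
      val byFamily = compareBytes(bytes(x.family), bytes(y.family))
      if (byFamily != 0) return byFamily
      val byQualifier = compareBytes(bytes(x.qualifier), bytes(y.qualifier))
      if (byQualifier != 0) return byQualifier
      java.lang.Long.compare(y.timestamp, x.timestamp)
    }
  }

  // The four cells from the two exception messages, in the order they were written.
  val sample: List[Cell] = List(
    Cell("694E24E2-7484-4926-B587-466990F1A017", "C", "Year", 1468510623407L),
    Cell("123B51E8-574E-4029-BEA7-D0FF7B12DB30", "C", "Address", 1468510623407L),
    Cell("200065494", "C", "R.PAIDPREP", 1468522415075L),
    Cell("200065494", "C", "GENERALDEPENDENCYMEDIUM", 1468522415075L)
  )

  // Sorting with the ordering above puts every cell after its predecessor.
  val sorted: List[Cell] = sample.sorted

  def main(args: Array[String]): Unit =
    sorted.foreach(c => println(s"${c.row}/${c.family}:${c.qualifier}/${c.timestamp}"))
}
```

Under this ordering both quoted pairs are out of order: in the first the row
keys are reversed, and in the second the qualifiers within the same row are
reversed. So both the rows and the KeyValues inside each row have to arrive
sorted. In a Spark job the same ordering would presumably have to hold within
each partition that is written to a single HFile.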

-Yeshwanth
Can you Imagine what I would do if I could do all I can - Art of War

On Thu, Jul 14, 2016 at 1:33 AM, yeshwanth kumar <yeshwanth43@gmail.com>
wrote:

>
> The following is the code snippet for saveAsHFile:
>
> def saveAsHFile(putRDD: RDD[(ImmutableBytesWritable, KeyValue)], outputPath: String) = {
>   val conf = ConfigFactory.getConf
>   val job = Job.getInstance(conf, "HBaseBulkPut")
>   job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
>   job.setMapOutputValueClass(classOf[Put])
>   val connection = ConnectionFactory.createConnection(conf)
>   val stTable= connection.getTable(TableName.valueOf("strecords"))
>   val regionLocator = new HRegionLocator(TableName.valueOf("strecords"), connection.asInstanceOf[ClusterConnection])
>   HFileOutputFormat2.configureIncrementalLoad(job, stTable, regionLocator)
>
>   putRDD.saveAsNewAPIHadoopFile(
>     outputPath,
>     classOf[ImmutableBytesWritable],
>     classOf[Put],
>     classOf[HFileOutputFormat2],
>     conf)
> }
>
> I just noticed that I am calling job.setMapOutputValueClass(classOf[Put])
>
> whereas I am writing KeyValue. Does that cause any issue?
>
> I will update the code and run it.
>
> Can you suggest how to sort within the partitions?
>
> Thanks,
>
> Yeshwanth
>
>
> -Yeshwanth
> Can you Imagine what I would do if I could do all I can - Art of War
>
> On Wed, Jul 13, 2016 at 7:46 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
>> Can you show the code inside saveAsHFile?
>>
>> Maybe the partitions of the RDD need to be sorted (for 1st issue).
>>
>> Cheers
>>
>> On Wed, Jul 13, 2016 at 4:29 PM, yeshwanth kumar <yeshwanth43@gmail.com>
>> wrote:
>>
>> > Hi, I am doing a bulk load into HBase in HFile format, by
>> > using saveAsNewAPIHadoopFile.
>> >
>> > I am on HBase 1.2.0-cdh5.7.0 and Spark 1.6.
>> >
>> > When I try to write, I get an exception:
>> >
>> >  java.io.IOException: Added a key not lexically larger than previous.
>> >
>> > The following is the code snippet:
>> >
>> > case class HBaseRow(rowKey: ImmutableBytesWritable, kv: KeyValue)
>> >
>> > val kAvroDF =
>> > sqlContext.read.format("com.databricks.spark.avro").load(args(0))
>> > val kRDD = kAvroDF.select("seqid", "mi", "moc", "FID", "WID").rdd
>> > val trRDD = kRDD.map(a => preparePUT(a(1).asInstanceOf[String],
>> > a(3).asInstanceOf[String]))
>> > val kvRDD = trRDD.flatMap(a => a).map(a => (a.rowKey, a.kv))
>> > saveAsHFile(kvRDD, args(1))
>> >
>> >
>> > preparePUT returns a list of HBaseRow(ImmutableBytesWritable, KeyValue)
>> > sorted on KeyValue. I then do a flatMap on the RDD to
>> > prepare an RDD[(ImmutableBytesWritable, KeyValue)] and pass it to
>> > saveAsHFile.
>> >
>> > I also tried using the Put API;
>> > it throws
>> >
>> > java.lang.Exception: java.lang.ClassCastException:
>> > org.apache.hadoop.hbase.client.Put cannot be cast to
>> > org.apache.hadoop.hbase.Cell
>> >
>> >
>> > Is there any way I can skip using the KeyValue API
>> > and still do the bulk load into HBase?
>> > Please help me resolve this issue.
>> >
>> > Thanks,
>> > -Yeshwanth
>> >
>>
>
>
