hbase-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: reading Hbase table in Spark
Date Mon, 10 Oct 2016 21:46:34 GMT
Thanks Ted,

So basically it involves Java programming, much like retrieving data over a
JDBC connection.
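
A minimal sketch of how that getValue call might be applied in the
spark-shell session quoted below, so that each row yields all three columns.
It assumes the resultRDD built in that session and the price_info column
family with the price, ticker and timecreated qualifiers shown in the scan
output. The names col and rowsRDD are made up for illustration. As far as I
understand it, Result.value only returns the value of the first column in
the Result, which would explain why only price showed up.

```scala
import org.apache.hadoop.hbase.util.Bytes

// Assumes resultRDD (an RDD of org.apache.hadoop.hbase.client.Result)
// from the quoted spark-shell session.
val rowsRDD = resultRDD.map { result =>
  // Column family taken from the scan output; built inside the closure
  // to keep the closure self-contained.
  val family = Bytes.toBytes("price_info")

  // Hypothetical helper: one column as a String, or None if the row has no such cell.
  def col(qualifier: String): Option[String] =
    Option(result.getValue(family, Bytes.toBytes(qualifier))).map(b => Bytes.toString(b))

  // Same rowkey handling as the original map: keep the part before the first space.
  val rowKey = Bytes.toString(result.getRow()).split(" ")(0)

  (rowKey, col("ticker"), col("timecreated"), col("price"))
}

rowsRDD.take(2).foreach(println)
```

getValue returns null when a cell is missing, so wrapping it in Option keeps
absent columns explicit instead of passing nulls around.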

Writing to HBase is pretty fast. I now have views in both Phoenix and Hive
on the underlying HBase tables.

I am looking for flexibility here, so I guess I should use Spark on Hive
tables with a view on the HBase table.

I also like tools such as Zeppelin that work with both SQL and Spark
functional programming.

It sounds like reading data from an HBase table is best done through some
form of SQL.
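
As a rough sketch of the SQL route from spark-shell, assuming a Spark build
with Hive support and a Hive external table defined over the HBase table.
The table name marketdatahbase_hive and its column names are hypothetical;
only the ticker, timecreated and price attributes come from the HBase scan
quoted below. HiveContext is the Spark 1.x entry point (deprecated in 2.0 in
favour of SparkSession).

```scala
import org.apache.spark.sql.hive.HiveContext

// sc is the SparkContext provided by spark-shell.
val hiveContext = new HiveContext(sc)

// Hypothetical Hive table and column names; substitute the real ones.
val prices = hiveContext.sql(
  "SELECT ticker, timecreated, price FROM marketdatahbase_hive LIMIT 2")

prices.show()
```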

What are your views on this approach?



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 10 October 2016 at 22:13, Ted Yu <yuzhihong@gmail.com> wrote:

> For org.apache.hadoop.hbase.client.Result, there is this method:
>
>   public byte[] getValue(byte [] family, byte [] qualifier) {
>
> which allows you to retrieve the value for a designated column.
>
>
> FYI
>
> On Mon, Oct 10, 2016 at 2:08 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>
> > Hi,
> >
> > I am trying to do some operations on an HBase table that is being
> > populated by Spark Streaming.
> >
> > Now this is just Spark on HBase as opposed to Spark on Hive -> view on
> > HBase etc. I also have a Phoenix view on this HBase table.
> >
> > This is sample code
> >
> > scala>     val tableName = "marketDataHbase"
> > scala>     val conf = HBaseConfiguration.create()
> > conf: org.apache.hadoop.conf.Configuration = Configuration:
> > core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
> > yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml,
> > hbase-default.xml, hbase-site.xml
> > scala>     conf.set(TableInputFormat.INPUT_TABLE, tableName)
> > scala>     // create rdd
> > scala> val hBaseRDD = sc.newAPIHadoopRDD(conf,
> >   classOf[TableInputFormat],
> >   classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
> >   classOf[org.apache.hadoop.hbase.client.Result])
> > hBaseRDD: org.apache.spark.rdd.RDD[(org.apache.hadoop.hbase.io.ImmutableBytesWritable,
> > org.apache.hadoop.hbase.client.Result)] = NewHadoopRDD[4] at
> > newAPIHadoopRDD at <console>:64
> > scala> hBaseRDD.count
> > res11: Long = 22272
> >
> > scala>     // transform (ImmutableBytesWritable, Result) tuples into an RDD of Results
> > scala> val resultRDD = hBaseRDD.map(tuple => tuple._2)
> > resultRDD: org.apache.spark.rdd.RDD[org.apache.hadoop.hbase.client.Result]
> > = MapPartitionsRDD[8] at map at <console>:41
> >
> > scala>  // transform into an RDD of (RowKey, ColumnValue)s; the RowKey has the time removed
> >
> > scala> val keyValueRDD = resultRDD.map(result =>
> > (Bytes.toString(result.getRow()).split(" ")(0),
> > Bytes.toString(result.value)))
> > keyValueRDD: org.apache.spark.rdd.RDD[(String, String)] =
> > MapPartitionsRDD[9] at map at <console>:43
> >
> > scala> keyValueRDD.take(2).foreach(kv => println(kv))
> > (000055e2-63f1-4def-b625-e73f0ac36271,43.89760813529593664528)
> > (000151e9-ff27-493d-a5ca-288507d92f95,57.68882040742382868990)
> >
> > OK, above I am only getting the rowkey (the UUID above) and the last
> > attribute (price).
> > However, I have the rowkey and 3 more columns there in the HBase table!
> >
> > scan 'marketDataHbase', "LIMIT" => 1
> > ROW                                    COLUMN+CELL
> >  000055e2-63f1-4def-b625-e73f0ac36271  column=price_info:price, timestamp=1476133232864, value=43.89760813529593664528
> >  000055e2-63f1-4def-b625-e73f0ac36271  column=price_info:ticker, timestamp=1476133232864, value=S08
> >  000055e2-63f1-4def-b625-e73f0ac36271  column=price_info:timecreated, timestamp=1476133232864, value=2016-10-10T17:12:22
> > 1 row(s) in 0.0100 seconds
> > So how can I get the other columns?
> >
> > Thanks
> >
> >
> > Dr Mich Talebzadeh
