hbase-dev mailing list archives

From Jerry He <jerry...@gmail.com>
Subject Re: Approach: Incremental data load from HBASE
Date Mon, 26 Dec 2016 23:25:13 GMT
There is no magic in the Sqoop incremental import. You need a key column or
a timestamp column to let Sqoop know where to start each incremental run.

HBase has a built-in timestamp on every cell.  Please look at the MR tool
bundled with HBase, Export: https://hbase.apache.org/book.html#tools
There are options that let you specify a starttime and an endtime.
You can also write your own MR or Spark job to do an incremental export of
HBase data by providing timestamps to the Scan or by providing a filter, as
sketched below.
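
For illustration only, a minimal MR driver that scans just the cells written
in a given time window could look roughly like this (the table name, output
path and class names are placeholders, not anything from the book):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableMapper;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IncrementalHBaseExport {
  public static void main(String[] args) throws Exception {
    // args: <starttime millis> <endtime millis> <output dir>
    long startTime = Long.parseLong(args[0]);   // e.g. timestamp of the last successful run
    long endTime = Long.parseLong(args[1]);     // e.g. now
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "hbase-incremental-export");
    job.setJarByClass(IncrementalHBaseExport.class);
    Scan scan = new Scan();
    scan.setTimeRange(startTime, endTime);      // only cells written in [startTime, endTime)
    TableMapReduceUtil.initTableMapperJob(
        "my_table",                             // placeholder table name
        scan,
        IdentityTableMapper.class,              // passes scanned rows through unchanged
        ImmutableBytesWritable.class,
        Result.class,
        job);
    job.setNumReduceTasks(0);                   // map-only: just dump what the scan returns
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The bundled Export tool sets up essentially the same time-ranged Scan from
its optional trailing arguments (versions, starttime, endtime).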

Jerry

On Sat, Dec 24, 2016 at 7:24 PM, Chetan Khatri <chetan.opensource@gmail.com>
wrote:

> Hello HBase Community,
>
> What is the suggested approach for an incremental import from HBase to HDFS?
> For RDBMS to HDFS, Sqoop provides support with the script below:
>
> sqoop job --create myssb1 -- import \
>   --connect jdbc:mysql://<hostname>:<port>/sakila \
>   --username admin --password admin --driver=com.mysql.jdbc.Driver \
>   --query "SELECT address_id, address, district, city_id, postal_code,
>              alast_update, cityid, city, country_id, clast_update
>            FROM (SELECT a.address_id as address_id, a.address as address,
>                    a.district as district, a.city_id as city_id,
>                    a.postal_code as postal_code, a.last_update as alast_update,
>                    c.city_id as cityid, c.city as city,
>                    c.country_id as country_id, c.last_update as clast_update
>                  FROM sakila.address a
>                  INNER JOIN sakila.city c ON a.city_id = c.city_id) as sub
>            WHERE \$CONDITIONS" \
>   --incremental lastmodified --check-column alast_update --last-value 1900-01-01 \
>   --target-dir /user/cloudera/ssb7 \
>   --hive-import --hive-table test.sakila -m 1 \
>   --hive-drop-import-delims --map-column-java address=String
>
>
> Thanks.
>
> On Wed, Dec 21, 2016 at 3:58 PM, Chetan Khatri <
> chetan.opensource@gmail.com>
> wrote:
>
> > Hello Guys,
> >
> > I would like to understand the different approaches for distributed
> > incremental load from HBase. Is there any *tool / incubator tool* which
> > satisfies this requirement?
> >
> > *Approach 1:*
> >
> > Write a Kafka producer, manually maintain a column flag for events, and
> > ingest them with LinkedIn Gobblin into HDFS / S3.
> >
> > *Approach 2:*
> >
> > Run a scheduled Spark job - read from HBase, do transformations, and
> > maintain the flag column at the HBase level.
> >
> > In both of the above approaches, I need to maintain column-level flags,
> > such as 0 - default, 1 - sent, 2 - sent and acknowledged. So next time the
> > producer will take another batch of 1000 rows where the flag is 0 or 1.
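> >
> > As a rough illustration only (the table name, column family and qualifier
> > below are placeholders, not a real schema), such a flag-filtered batch scan
> > could look like:
> >
> > import java.io.IOException;
> > import java.util.Arrays;
> > import org.apache.hadoop.hbase.HBaseConfiguration;
> > import org.apache.hadoop.hbase.TableName;
> > import org.apache.hadoop.hbase.client.*;
> > import org.apache.hadoop.hbase.filter.*;
> > import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
> > import org.apache.hadoop.hbase.util.Bytes;
> >
> > public class PendingRowsScan {
> >   public static void main(String[] args) throws IOException {
> >     byte[] cf = Bytes.toBytes("cf");        // placeholder column family
> >     byte[] flag = Bytes.toBytes("flag");    // placeholder flag qualifier
> >     Scan scan = new Scan();
> >     // match rows whose flag is still 0 (default) or 1 (sent, not yet acknowledged)
> >     scan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ONE, Arrays.asList(
> >         (Filter) new SingleColumnValueFilter(cf, flag, CompareOp.EQUAL, Bytes.toBytes("0")),
> >         new SingleColumnValueFilter(cf, flag, CompareOp.EQUAL, Bytes.toBytes("1")))));
> >     scan.setCaching(1000);                  // fetch 1000 rows per RPC
> >     try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
> >          Table table = conn.getTable(TableName.valueOf("my_table"));
> >          ResultScanner scanner = table.getScanner(scan)) {
> >       int taken = 0;
> >       for (Result r : scanner) {
> >         // ship r to Kafka / HDFS here, then flip the flag to 1 or 2 with a Put
> >         if (++taken >= 1000) break;         // stop after one batch of 1000 rows
> >       }
> >     }
> >   }
> > }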
> >
> > I am looking for a best-practice approach with any distributed tool.
> >
> > Thanks.
> >
> > - Chetan Khatri
> >
>
