hbase-user mailing list archives

From Josh Elser <els...@apache.org>
Subject Re: Scan vs TableInputFormat to process data
Date Sun, 02 Jun 2019 01:43:39 GMT
Hi Guillermo,

Yes, you are missing something.

TableInputFormat uses the Scan API just like Spark would.
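
To illustrate the point above, here is a minimal sketch (not part of the original message) of a MapReduce job that passes a Scan to TableInputFormat through TableMapReduceUtil. The column-family restriction and the filter set on the Scan are applied by the RegionServers. The table name `my_table`, the family `cf`, the row prefix `user-`, and the class names are hypothetical, and the sketch assumes the HBase client and MapReduce jars are on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanBackedJob {

  // Hypothetical mapper: emits the row key of every row the Scan returns.
  public static class RowKeyMapper extends TableMapper<Text, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context)
        throws IOException, InterruptedException {
      String rowKey = Bytes.toString(key.get(), key.getOffset(), key.getLength());
      context.write(new Text(rowKey), NullWritable.get());
    }
  }

  // Builds the Scan that TableInputFormat will run against each region.
  static Scan buildScan() {
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf"));                       // only this family (assumed name)
    scan.setFilter(new PrefixFilter(Bytes.toBytes("user-")));  // server-side filter (assumed prefix)
    scan.setCacheBlocks(false);                                // usual advice for full-table MR scans
    return scan;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan-backed-job");
    job.setJarByClass(ScanBackedJob.class);

    // Configures TableInputFormat with the table name and the serialized Scan.
    TableMapReduceUtil.initTableMapperJob(
        "my_table", buildScan(), RowKeyMapper.class, Text.class, NullWritable.class, job);

    job.setOutputFormatClass(NullOutputFormat.class);  // discard output; this is only a sketch
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The same Scan-based configuration applies when TableInputFormat is used from Spark, since Spark only supplies the execution engine for the input splits.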

Bypassing the RegionServer and reading from HFiles directly is 
accomplished by using the TableSnapshotInputFormat. You can only read 
from HFiles directly when you are using a Snapshot, as there are 
concurrency issues WRT the lifecycle of HFiles managed by HBase. It is 
not safe to try to read HFiles underneath HBase on your own unless you 
are confident you understand all the edge cases in how HBase manages files.
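
As a rough sketch of the snapshot route described above (again, not from the original message), TableMapReduceUtil can configure a job to use TableSnapshotInputFormat. This assumes a snapshot named `my_snapshot` already exists, that the job's user can read the HBase files in HDFS, and that the restore directory `/tmp/snapshot_restore` is a scratch location outside the HBase root directory. All of these names, along with the family `cf` and the class names, are hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SnapshotBackedJob {

  // Hypothetical mapper: emits the row key of every row read from the snapshot.
  public static class RowKeyMapper extends TableMapper<Text, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context)
        throws IOException, InterruptedException {
      String rowKey = Bytes.toString(key.get(), key.getOffset(), key.getLength());
      context.write(new Text(rowKey), NullWritable.get());
    }
  }

  // The Scan is still honored, but it is evaluated in the task, not by a RegionServer.
  static Scan buildScan() {
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf"));  // assumed family name
    return scan;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "snapshot-backed-job");
    job.setJarByClass(SnapshotBackedJob.class);

    // Reads the snapshot's HFiles directly from HDFS, bypassing the RegionServers.
    TableMapReduceUtil.initTableSnapshotMapperJob(
        "my_snapshot",                      // existing snapshot (assumed name)
        buildScan(),
        RowKeyMapper.class,
        Text.class,
        NullWritable.class,
        job,
        true,                               // ship HBase dependency jars with the job
        new Path("/tmp/snapshot_restore")); // scratch restore dir (assumed path)

    job.setOutputFormatClass(NullOutputFormat.class);  // discard output; this is only a sketch
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because the job reads a snapshot, it sees the table as of the time the snapshot was taken, not writes made afterwards.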

On 5/29/19 2:54 AM, Guillermo Ortiz Fernández wrote:
> Just to be sure: if I execute a Scan inside Spark, the execution goes
> through the RegionServers and I get all the features of HBase/Scan (filters and
> so on), and all the parallelization is handled by the RegionServers (even though
> I'm running the program with Spark).
> If I use TableInputFormat, I read all the column families (even if I don't
> want to), with no filtering beforehand; it just opens the files of an HBase
> table and processes them completely. All the parallelization is in Spark and
> it doesn't use HBase at all; it just reads from HDFS the files that HBase stored
> for a specific table.
> Am I missing something?
