hbase-user mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: Scan vs TableInputFormat to process data
Date Mon, 03 Jun 2019 13:16:42 GMT
Also, keep in mind that by bypassing the RegionServer you also bypass the
security rules...


On Sat, Jun 1, 2019 at 21:43, Josh Elser <elserj@apache.org> wrote:

> Hi Guillermo,
> Yes, you are missing something.
> TableInputFormat uses the Scan API just like Spark would.
> Bypassing the RegionServer and reading from HFiles directly is
> accomplished by using the TableSnapshotInputFormat. You can only read
> from HFiles directly when you are using a Snapshot, as there are
> concurrency issues WRT the lifecycle of HFiles managed by HBase. It is
> not safe to try to read HFiles underneath HBase on your own unless you are
> confident you understand all the edge cases in how HBase manages files.
> On 5/29/19 2:54 AM, Guillermo Ortiz Fernández wrote:
> > Just to be sure: if I execute a Scan inside Spark, the execution goes
> > through the RegionServers and I get all the features of HBase/Scan (filters
> > and so on), and all the parallelization is handled by the RegionServers
> > (even though I'm running the program with Spark).
> > If I use TableInputFormat, I read all the column families (even if I don't
> > want to), with no prior filtering either; it just opens the files of an
> > HBase table and processes them completely. All the parallelization is in
> > Spark and HBase is not used at all; it just reads from HDFS the files that
> > HBase stored for a specific table.
> >
> > Am I missing something?
> >
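
To illustrate the point in the reply above that TableInputFormat goes through the Scan API: the input format takes its scan settings from job configuration properties, so a job can be restricted to one column family or a row range rather than reading everything. The sketch below only collects those property names in a plain Java map so that it runs without HBase on the classpath. The class name, the table name "events", the family "d", and the row keys are hypothetical; in a real job the values would be set on the Hadoop Configuration, or a Scan object would be passed through TableMapReduceUtil.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TableInputFormatSettings {
    // Property names read by org.apache.hadoop.hbase.mapreduce.TableInputFormat.
    static final String INPUT_TABLE = "hbase.mapreduce.inputtable";
    static final String SCAN_COLUMN_FAMILY = "hbase.mapreduce.scan.column.family";
    static final String SCAN_ROW_START = "hbase.mapreduce.scan.row.start";
    static final String SCAN_ROW_STOP = "hbase.mapreduce.scan.row.stop";

    // Collects the settings that limit the scan to one family and one row range.
    // A plain map stands in for the Hadoop Configuration (assumption for this sketch).
    public static Map<String, String> scanSettings(
            String table, String family, String startRow, String stopRow) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put(INPUT_TABLE, table);
        conf.put(SCAN_COLUMN_FAMILY, family);
        conf.put(SCAN_ROW_START, startRow);
        conf.put(SCAN_ROW_STOP, stopRow);
        return conf;
    }

    public static void main(String[] args) {
        // Hypothetical table, family, and row keys, for illustration only.
        scanSettings("events", "d", "row-0000", "row-1000")
                .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

With settings like these, the reads still go through the RegionServers, so the column family and row range restrictions are applied by the scan rather than by the Spark or MapReduce job.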
