hbase-user mailing list archives

From Sachin Jain <sachinjain...@gmail.com>
Subject Implementation of full table scan using Spark
Date Thu, 29 Jun 2017 03:15:08 GMT
Hi,

I have used TableInputFormat with newAPIHadoopRDD (defined on sparkContext) to
do a full table scan and get an RDD from it.

The relevant piece of code looks like this:

import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// hbaseConfigurationForReading is our own helper that builds the HBase
// Configuration (table name, quorum, scan filter, versions, znode parent)
val rdd = sparkContext.newAPIHadoopRDD(
  HBaseConfigurationUtil.hbaseConfigurationForReading(
    table.getName.getNameWithNamespaceInclAsString,
    hbaseQuorum, hBaseFilter, versionOpt, zNodeParentOpt),
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result]
)


As per my understanding, this full table scan is fast because we are
reading HFiles directly.

*Q1. Does that mean we are skipping memstores?* If yes, then we should be
missing some data which is present in the memstore, because that data has
not been persisted to disk yet and hence is not available via HFiles.

*In my local setup, I always get all the data.* Since I am inserting only
10-20 entries, I assume they are still sitting in the memstore when I issue
the full table scan Spark job.
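
To verify this, I am thinking of a rough test (a sketch only; scanRowCount
below is a hypothetical wrapper around the Spark scan above, and the table
and column names are placeholders): write a row, run the scan before and
after an explicit memstore flush, and compare the counts.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val tableName = TableName.valueOf("test_table")  // placeholder table name
val htable = connection.getTable(tableName)

// This row lives only in the memstore until a flush happens
htable.put(new Put(Bytes.toBytes("row1"))
  .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v")))

val countBefore = scanRowCount()  // hypothetical wrapper around the Spark scan above

// Force the memstore contents out to an HFile on disk
connection.getAdmin.flush(tableName)

val countAfter = scanRowCount()
// If countBefore == countAfter, the scan is clearly not skipping the memstore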

Q2. When I issue a get command, is there a way to know whether the record is
served from the blockCache, the memstore, or an HFile?
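
The only rough workaround I can think of (an assumption on my side, not a
per-request API): compare the RegionServer's aggregate block-cache counters
before and after the get, via the standard JMX servlet on the default info
port 16030. A hit means the block was already in the blockCache, a miss
means it was read from an HFile, and a read served purely from the memstore
should move neither counter.

import scala.io.Source

// Assumes the default RegionServer info port 16030 and the standard Hadoop
// JMX servlet; blockCacheHitCount / blockCacheMissCount are the aggregate
// counters under the RegionServer "Server" metrics bean.
def jmxCounter(host: String, metric: String): Option[String] = {
  val url = s"http://$host:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Server"
  Source.fromURL(url).mkString
    .split("\n").map(_.trim)
    .find(_.startsWith("\"" + metric + "\""))
}

println(jmxCounter("localhost", "blockCacheHitCount"))
println(jmxCounter("localhost", "blockCacheMissCount"))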

Thanks
-Sachin
