hbase-user mailing list archives

From Jingcheng Du <dujin...@gmail.com>
Subject Re: Implementation of full table scan using Spark
Date Thu, 29 Jun 2017 04:01:14 GMT
Hi Sachin,
The TableInputFormat should read the memstore.
The TableInputFormat creates one scan per region, and the scan within each
region is a normal region scan, so the memstore should be included. That's
why you can always read all the data.
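
To illustrate the point above: a region scan merge-reads the sorted memstore
together with the sorted HFiles, so cells that have been written but not yet
flushed are still returned. The following is a simplified, self-contained
sketch of that idea (not HBase's actual KeyValueHeap; rows and values are
plain strings here for illustration):

```java
import java.util.TreeMap;

// Simplified illustration: a region scan merges the flushed cells (HFiles)
// with the in-memory, not-yet-flushed cells (memstore). The memstore holds
// the newest version of a cell, so its entries win on conflict.
public class RegionScanSketch {
    static TreeMap<String, String> scanRegion(TreeMap<String, String> memstore,
                                              TreeMap<String, String> hfile) {
        // Start from the flushed cells, then let memstore entries override them.
        TreeMap<String, String> merged = new TreeMap<>(hfile);
        merged.putAll(memstore);
        return merged;
    }

    public static void main(String[] args) {
        TreeMap<String, String> hfile = new TreeMap<>();
        hfile.put("row1", "flushed-v1");
        hfile.put("row2", "flushed-v1");

        TreeMap<String, String> memstore = new TreeMap<>();
        memstore.put("row2", "unflushed-v2"); // newer version, not yet flushed
        memstore.put("row3", "unflushed-v1"); // exists only in memory

        // All three rows are visible, including the in-memory-only row3.
        System.out.println(scanRegion(memstore, hfile));
    }
}
```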

bq. As per my understanding this full table scan works fast because we are
reading Hfiles directly.
I think the full table scan is fast because Spark runs the scan on each
region concurrently.
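
The fan-out described above can be sketched without Spark at all. This is a
hypothetical illustration, not Spark or HBase code: each "region" is just a
list of rows, and one task per region runs on a thread pool, the same shape
as newAPIHadoopRDD creating one partition per InputSplit (one per region):

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch: scanning all "regions" concurrently, one task per
// region, then concatenating the per-region results in region order.
public class ParallelRegionScan {
    static List<String> scanAllRegions(List<List<String>> regions) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(regions.size());
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> region : regions) {
                // "Scanning" a region is simulated by copying its rows.
                futures.add(pool.submit(() -> new ArrayList<>(region)));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                all.addAll(f.get()); // preserve region order in the output
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> regions = Arrays.asList(
            Arrays.asList("row1", "row2"),
            Arrays.asList("row3"),
            Arrays.asList("row4", "row5"));
        System.out.println(scanAllRegions(regions));
    }
}
```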

2017-06-29 11:33 GMT+08:00 Ted Yu <yuzhihong@gmail.com>:

> TableInputFormat doesn't read memstore.
>
> bq. I am inserting 10-20 entries only
>
> You can query JMX and check the values for the following:
>
> flushedCellsCount
> flushedCellsSize
>
> FlushMemstoreSize_num_ops
>
> For Q2, there is no client side support for knowing where the data comes
> from.
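
To make the JMX suggestion above concrete: the region server's info server
(port 16030 by default) serves these metrics as JSON at /jmx. The exact host,
port, and bean layout are deployment-specific, so this minimal sketch just
pulls the two flush counters out of a captured payload with a regex rather
than hitting a live server:

```java
import java.util.*;
import java.util.regex.*;

// Minimal sketch: extract flushedCellsCount / flushedCellsSize from a JMX
// JSON payload. In a live cluster the payload would come from e.g.
// http://<regionserver>:16030/jmx (host and port vary per deployment).
public class JmxFlushCounters {
    private static final Pattern COUNTER =
        Pattern.compile("\"(flushedCellsCount|flushedCellsSize)\"\\s*:\\s*(\\d+)");

    static Map<String, Long> flushCounters(String jmxJson) {
        Map<String, Long> out = new HashMap<>();
        Matcher m = COUNTER.matcher(jmxJson);
        while (m.find()) {
            out.put(m.group(1), Long.parseLong(m.group(2)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample payload shaped like the RegionServer Server bean (illustrative).
        String sample = "{\"beans\":[{\"name\":\"Hadoop:service=HBase,name=RegionServer,sub=Server\","
            + "\"flushedCellsCount\":42,\"flushedCellsSize\":4096}]}";
        System.out.println(flushCounters(sample));
    }
}
```

A nonzero flushedCellsCount would indicate that some of the inserted cells
have already been flushed to HFiles.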
>
> On Wed, Jun 28, 2017 at 8:15 PM, Sachin Jain <sachinjain024@gmail.com>
> wrote:
>
> > Hi,
> >
> > I have used TableInputFormat and newAPIHadoopRDD defined on sparkContext
> to
> > do a full table scan and get an rdd from it.
> >
> > Partial piece of code looks like this:
> >
> > sparkContext.newAPIHadoopRDD(
> >   HBaseConfigurationUtil.hbaseConfigurationForReading(
> >     table.getName.getNameWithNamespaceInclAsString,
> >     hbaseQuorum, hBaseFilter, versionOpt, zNodeParentOpt),
> >   classOf[TableInputFormat],
> >   classOf[ImmutableBytesWritable],
> >   classOf[Result]
> > )
> >
> >
> > As per my understanding, this full table scan works fast because we are
> > reading HFiles directly.
> >
> > *Q1. Does that mean we are skipping memstores?* If yes, then we should
> > have missed some data which is present in the memstore, because that data
> > has not been persisted to disk yet and hence is not available via an HFile.
> >
> > *In my local setup, I always get all the data*. Since I am inserting 10-20
> > entries only, I am assuming the data is present in the memstore when I
> > issue the full table scan Spark job.
> >
> > Q2. When I issue a get command, is there a way to know whether the record
> > is served from the blockCache, the memstore, or an HFile?
> >
> > Thanks
> > -Sachin
> >
>
