hbase-user mailing list archives

From Sachin Jain <sachinjain...@gmail.com>
Subject Re: Implementation of full table scan using Spark
Date Thu, 29 Jun 2017 05:27:33 GMT
@Ted Yu If a full table scan does not read the memstore, then why am I getting
the recently inserted data? I am pretty sure others have run into this earlier
and may simply not have noticed.

@Jingcheng Thanks for your answer. If you are right, then my understanding
was wrong. I will go through the code of TableInputFormat and see if I find
something new.

On Thu, Jun 29, 2017 at 9:31 AM, Jingcheng Du <dujingch@gmail.com> wrote:

> Hi Sachin,
> The TableInputFormat should read the memstore.
> The TableInputFormat is converted into a scan over each region; the scan run
> in each region is a normal scan, so the memstore is included.
> That's why you can always read all the data.
>
> bq. As per my understanding this full table scan works fast because we are
> reading HFiles directly.
> I think the full table scan is fast because you run the scan on each region
> concurrently in Spark.
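Jingcheng's point can be illustrated without HBase at all: the speedup comes
from concurrency across regions, not from a special read path. A toy sketch,
where a one-second `sleep` stands in for a hypothetical per-region scan:

```shell
# Four simulated "region scans", each taking ~1 second.
# Run serially they would take ~4s; with -P 4, xargs runs them
# concurrently, so the whole batch finishes in ~1s -- the same reason a
# Spark full scan with one partition per region feels fast.
printf '%s\n' region1 region2 region3 region4 \
  | xargs -P 4 -I{} sh -c 'sleep 1; echo "scanned {}"'
```

The region names are placeholders; in the real job, Spark schedules one task
per TableInputFormat split, and each split corresponds to one region.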
>
> 2017-06-29 11:33 GMT+08:00 Ted Yu <yuzhihong@gmail.com>:
>
> > TableInputFormat doesn't read memstore.
> >
> > bq. I am inserting 10-20 entries only
> >
> > You can query JMX and check the values for the following:
> >
> > flushedCellsCount
> > flushedCellsSize
> >
> > FlushMemstoreSize_num_ops
> >
> > For Q2, there is no client side support for knowing where the data comes
> > from.
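The metrics Ted names are exposed on the region server's JMX servlet. A
hedged sketch — the host and port are assumptions about your deployment
(16030 is the default region-server info port on recent releases; older
0.98-era setups used 60030):

```shell
# Query the region server's /jmx endpoint and pull the flush metrics.
# Host and port are assumptions; adjust for your deployment.
curl -s 'http://localhost:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Server' \
  | python3 -c '
import json, sys
bean = json.load(sys.stdin)["beans"][0]
print("flushedCellsCount:", bean["flushedCellsCount"])
print("flushedCellsSize:", bean["flushedCellsSize"])
'
# If flushedCellsCount is still 0, nothing has been flushed to HFiles yet,
# so any rows the scan returned can only have come from the memstore.
```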
> >
> > On Wed, Jun 28, 2017 at 8:15 PM, Sachin Jain <sachinjain024@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I have used TableInputFormat and newAPIHadoopRDD defined on sparkContext
> > > to do a full table scan and get an RDD from it.
> > >
> > > Partial piece of code looks like this:
> > >
> > > sparkContext.newAPIHadoopRDD(
> > >   HBaseConfigurationUtil.hbaseConfigurationForReading(
> > >     table.getName.getNameWithNamespaceInclAsString,
> > >     hbaseQuorum, hBaseFilter, versionOpt, zNodeParentOpt),
> > >   classOf[TableInputFormat],
> > >   classOf[ImmutableBytesWritable],
> > >   classOf[Result]
> > > )
> > >
> > >
> > > As per my understanding, this full table scan works fast because we are
> > > reading HFiles directly.
> > >
> > > *Q1. Does that mean we are skipping memstores?* If yes, then we should
> > > have missed some data that is present in the memstore, because that data
> > > has not been persisted to disk yet and hence is not available via HFiles.
> > >
> > > *In my local setup, I always get all the data.* Since I am inserting
> > > 10-20 entries only, I assume the data is still in the memstore when I
> > > issue the full table scan Spark job.
> > >
> > > Q2. When I issue a get command, is there a way to know whether the record
> > > is served from the block cache, the memstore, or an HFile?
> > >
> > > Thanks
> > > -Sachin
> > >
> >
>
