From: Ted Yu
Date: Thu, 29 Jun 2017 02:41:50 -0700
Subject: Re: Implementation of full table scan using Spark
To: "user@hbase.apache.org"

Sachin:
My previous answer was inaccurate.

Please take a look at TableRecordReaderImpl, where htable.getScanner() is called to obtain a ResultScanner.

The (relatively) fast table scan may be due to your table not having much data.

Cheers

On Wed, Jun 28, 2017 at 10:27 PM, Sachin Jain wrote:

> @Ted Yu If a full table scan does not read the memstore, then why am I getting
> the recently inserted data? I am pretty sure others may have seen this earlier
> and may not have noticed.
>
> @Jingcheng Thanks for your answer. If you are right, then my understanding
> was wrong. I will look at the code of TableInputFormat and see if I learn
> something new.
>
> On Thu, Jun 29, 2017 at 9:31 AM, Jingcheng Du wrote:
>
> > Hi Sachin,
> > The TableInputFormat should read the memstore.
> > The TableInputFormat is converted to a scan on each region; the operation in
> > each region should be a normal scan, so the memstore should be included.
> > That's why you can always read all the data.
> >
> > bq. As per my understanding this full table scan works fast because we are
> > reading Hfiles directly.
> > I think the full table scan is fast because you run the scan on each region
> > concurrently in Spark.
> >
> > 2017-06-29 11:33 GMT+08:00 Ted Yu:
> >
> > > TableInputFormat doesn't read memstore.
> > >
> > > bq. I am inserting 10-20 entries only
> > >
> > > You can query JMX and check the values for the following:
> > >
> > > flushedCellsCount
> > > flushedCellsSize
> > > FlushMemstoreSize_num_ops
> > >
> > > For Q2, there is no client side support for knowing where the data comes
> > > from.
> > >
> > > On Wed, Jun 28, 2017 at 8:15 PM, Sachin Jain wrote:
> > >
> > > > Hi,
> > > >
> > > > I have used TableInputFormat and newAPIHadoopRDD defined on sparkContext
> > > > to do a full table scan and get an RDD from it.
> > > >
> > > > A partial piece of the code looks like this:
> > > >
> > > > sparkContext.newAPIHadoopRDD(
> > > >   HBaseConfigurationUtil.hbaseConfigurationForReading(table.getName.getNameWithNamespaceInclAsString,
> > > >     hbaseQuorum, hBaseFilter, versionOpt, zNodeParentOpt),
> > > >   classOf[TableInputFormat],
> > > >   classOf[ImmutableBytesWritable],
> > > >   classOf[Result]
> > > > )
> > > >
> > > > As per my understanding this full table scan works fast because we are
> > > > reading Hfiles directly.
> > > >
> > > > *Q1. Does that mean we are skipping memstores?* If yes, then we should
> > > > have missed some data which is present in the memstore, because that data
> > > > has not been persisted to disk yet and hence is not available via HFile.
> > > >
> > > > *In my local setup, I always get all the data*. Since I am inserting only
> > > > 10-20 entries, I am assuming they are still in the memstore when I issue
> > > > the full table scan Spark job.
> > > >
> > > > Q2. When I issue a get command, is there a way to know if the record is
> > > > served from blockCache, memstore or Hfile?
> > > >
> > > > Thanks
> > > > -Sachin
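
To illustrate the point made in the thread (each region is read with a normal scan, so unflushed memstore data is returned together with data already in HFiles), here is a toy sketch in plain Scala. It is only a simplified model under stated assumptions, not HBase's actual implementation: Cell, Region, scan and fullTableScan are hypothetical names invented for this example, and real HBase scans also deal with versions, deletes, filters and so on.

```scala
// Toy model only: these types are hypothetical and are NOT HBase classes.
object RegionScanSketch {
  // One cell version: row key, value, and the timestamp of the write.
  final case class Cell(row: String, value: String, ts: Long)

  // A region keeps unflushed cells in memory plus zero or more flushed files.
  final case class Region(memstore: Seq[Cell], hfiles: Seq[Seq[Cell]])

  // A normal region scan merges every source and keeps the newest cell per row.
  def scan(region: Region): Seq[Cell] = {
    val all = region.memstore ++ region.hfiles.flatten
    all.groupBy(_.row).values.map(_.maxBy(_.ts)).toSeq.sortBy(_.row)
  }

  // A full table scan is one region scan per region
  // (in the Spark job these would run as separate, concurrent tasks).
  def fullTableScan(regions: Seq[Region]): Seq[Cell] =
    regions.flatMap(scan).sortBy(_.row)
}
```

In this model, a region whose only data sits in the memstore (nothing flushed yet, as with the 10-20 freshly inserted rows described above) still returns every row from scan, which matches what Sachin observed.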