Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id C5DED2009EE for ; Wed, 18 May 2016 13:58:54 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id C4668160A00; Wed, 18 May 2016 11:58:54 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id E29A31609B1 for ; Wed, 18 May 2016 13:58:53 +0200 (CEST) Received: (qmail 89954 invoked by uid 500); 18 May 2016 11:58:52 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 89944 invoked by uid 99); 18 May 2016 11:58:52 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 18 May 2016 11:58:52 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 164FC180361 for ; Wed, 18 May 2016 11:58:52 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.198 X-Spam-Level: * X-Spam-Status: No, score=1.198 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H2=-0.001, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd3-us-west.apache.org (amavisd-new); dkim=pass (1024-bit key) header.d=snapdeal.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id zLyID0YvmLkw for ; Wed, 18 May 2016 11:58:49 +0000 (UTC) Received: from mail-lb0-f176.google.com (mail-lb0-f176.google.com [209.85.217.176]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 169D15F239 for ; Wed, 18 May 2016 11:58:49 +0000 (UTC) Received: by mail-lb0-f176.google.com with SMTP id ww9so16522914lbc.2 for ; Wed, 18 May 2016 04:58:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=snapdeal.com; s=google; h=mime-version:in-reply-to:references:from:date:message-id:subject:to; bh=4foFJK5RZPhEBbIerYAOoMe4q3Hamr35TBBa6b7mTWY=; b=Vv9a3ggFLZHRJvvcyCTL24akjEoLDCz72pL4HQjyt4kpWnDNi+3xB7Mh8zZjYLawVx eWT7X8a6a+89EqfyX8gZ5qRy8lBQb61XNdf8nNuCbNcSn+pHxDrJ51q7WZMcmpvTIO82 vmScUYcGmLHEyGqI4yk59+A6e/qzWgjHbHcME= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to; bh=4foFJK5RZPhEBbIerYAOoMe4q3Hamr35TBBa6b7mTWY=; b=Qq269YJEE/1N+9NsPgHSW6cqGnoVWnsDZtsmQK5VqOUBi+Jw+4Jwa4YPVNLkHvNUlB W4hOW1baCoNLp5PN6D3d/oinFc/R1LxtG4ZMdkpazpriSUwYk/oPVfch63nkrgGImzEW TjLLctpjZ1PJsyZRANdiyxnxAF3YtgCITa+fpYraz0hLotxgOqSO8p7xtCUSAT/4Zkqo OyNwsl8wbsSz2aaLl2o6PBKQd5vnepiRlfxPDyYoSi8xMD7bmK9I4RpyG1imaATjror5 tmnMr0r86WpBzWAvNYbQTu9e4U9zlR76Y8FnYwdOwf1eu3sTG8B8s+cqOcf+O5IfTqbK Sw9w== X-Gm-Message-State: AOPr4FVHmGFTUNppxD08cNM2nQifCoSUIB46aTkDdLpjVzZFIPcg0DFJeR5olb/VfYAniZmyrcI2x9yIRTRX9L4e X-Received: by 10.112.184.79 with SMTP id es15mr2427396lbc.30.1463572721799; Wed, 18 May 2016 04:58:41 -0700 (PDT) MIME-Version: 1.0 Received: by 10.25.38.83 with HTTP; Wed, 18 May 2016 04:58:22 -0700 (PDT) In-Reply-To: References: From: Atul Saroha Date: Wed, 18 May 2016 17:28:22 +0530 Message-ID: Subject: Re: Low cardinality secondary index behaviour To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=001a1133ac902fb36105331c9436 archived-at: Wed, 18 May 2016 11:58:55 -0000 --001a1133ac902fb36105331c9436 Content-Type: text/plain; charset=UTF-8 Thanks Tyler, SPARSE SASI index solves my use case. Planing to upgrade the cassandra to 3.0.6 now. --------------------------------------------------------------------------------------------------------------------- Atul Saroha *Lead Software Engineer* *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369 Plot # 362, ASF Centre - Tower A, Udyog Vihar, Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA On Thu, May 12, 2016 at 9:18 PM, Tyler Hobbs wrote: > > On Tue, May 10, 2016 at 6:41 AM, Atul Saroha > wrote: > >> I have concern over using secondary index on field with low cardinality. >> Lets say I have few billion rows and each row can be classified in 1000 >> category. Lets say we have 50 node cluster. >> >> Now we want to fetch data for a single category using secondary index >> over a category. And query is paginated too with fetch size property say >> 5000. >> >> Since query on secondary index works as scatter and gatherer approach by >> coordinator node. Would it lead to out of memory on coordinator or timeout >> errors too much. >> > > Paging will prevent the coordinator from using excessive memory. With the > type of data that you described, timeouts shouldn't be huge problem because > it will only take a few token ranges (assuming you're using vnodes) to get > enough matching rows to hit the page size. > > >> >> How does pagination (token level data fetch) behave in scatter and >> gatherer approach? >> > > Secondary index queries fetch token ranges in sequential order [1], > starting with the minimum token. When you fetch a new page, it resumes > from the last token (and primary key) that it returned in the previous page. > > [1] As an optimization, multiple token ranges will be fetched in parallel > based on estimates of how many token ranges it will take to fill the page. > > >> >> Secondly, What If we create an inverted table with partition key as >> category. Then this will led to lots of data on single node. Then it might >> led to hot shard issue and performance issue of data fetching from single >> node as a single partition has millions of rows. >> >> How should we tackle such low cardinality index in Cassandra? > > > The data distribution that you described sounds like a reasonable fit for > secondary indexes. However, I would also take into account how frequently > you run this query and how fast you need it to be. Even ignoring the > scatter-gather aspects of a secondary index query, they are still expensive > because they fetch many non-contiguous rows from an SSTable. If you need > to run this query very frequently, that may add too much load to your > cluster, and some sort of inverted table approach may be more appropriate. > > -- > Tyler Hobbs > DataStax > --001a1133ac902fb36105331c9436 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Thanks Tyler,

SPARSE SASI index solves m= y use case. Planing to upgrade the cassandra to 3.0.6 now.

<= div dir=3D"ltr">
--------= ---------------------------------------------------------------------------= ----------------------------------
Atul Saroha

Lea= d Software Engineer
M: +91 8447784271=C2=A0T: +91 124-415-6069 EXT: 12369
Plot # 3= 62, ASF Centre - Tower A, Udyog Vihar,
=C2=A0Phase -4, Sector 18, Gurgao= n, Haryana 122016, INDIA
<= /div>

On Thu, May 12, 2016 at 9:18 PM, Tyler Hobbs= <tyler@datastax.com> wrote:

On Tue, May 10, 2016 at 6:41 AM, Atul Saroha <= atul.saroha@snapdeal.com> wrote:
I have concern over using secondary index on field wit= h low cardinality. Lets say I have few billion rows and each row can be cla= ssified in 1000 category. Lets say we have 50 node cluster.

No= w we want to fetch data for a single category using secondary index over a = category. And query is paginated too with fetch size property say 5000.
Since query on secondary index works as scatter and gatherer approach = by coordinator node. Would it lead to out of memory on coordinator or timeo= ut errors too much.

= Paging will prevent the coordinator from using excessive memory.=C2=A0 With= the type of data that you described, timeouts shouldn't be huge proble= m because it will only take a few token ranges (assuming you're using v= nodes) to get enough matching rows to hit the page size.
=C2=A0

How d= oes pagination (token level data fetch) behave in scatter and gatherer appr= oach?

Secondary inde= x queries fetch token ranges in sequential order [1], starting with the min= imum token.=C2=A0 When you fetch a new page, it resumes from the last token= (and primary key) that it returned in the previous page.

[1] As an optimization, multiple token ranges will be fetched in parallel = based on estimates of how many token ranges it will take to fill the page.<= br>
=C2=A0
<= div>

Secondly, What If we create an inverted table with parti= tion key as category. Then this will led to lots of data on single node. Th= en it might led to hot shard issue and performance issue of data fetching f= rom single node as a single partition has=C2=A0 millions of rows.

How should we tackle such low cardinality index in Cassandra?

The data distribution = that you described sounds like a reasonable fit for secondary indexes.=C2= =A0 However, I would also take into account how frequently you run this que= ry and how fast you need it to be.=C2=A0 Even ignoring the scatter-gather a= spects of a secondary index query, they are still expensive because they fe= tch many non-contiguous rows from an SSTable.=C2=A0 If you need to run this= query very frequently, that may add too much load to your cluster, and som= e sort of inverted table approach may be more appropriate.

-- <= br>
Tyler Hobbs
DataStax
<= /div>

--001a1133ac902fb36105331c9436--