Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Mon, 1 Dec 2014 14:21:13 +0000 (UTC)
From: Piotr Kołaczkowski (JIRA) <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12731795.1407172959000.48762.1417443673032@Atlassian.JIRA>
In-Reply-To: <JIRA.12731795.1407172959000@Atlassian.JIRA>
References: <JIRA.12731795.1407172959000@Atlassian.JIRA>
 <JIRA.12731795.1407172959502@arcas>
Subject: [jira] [Commented] (CASSANDRA-7688) Add data sizing to a system
 table
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/CASSANDRA-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229828#comment-14229828 ]

Piotr Kołaczkowski commented on CASSANDRA-7688:
-----------------------------------------------

We only need estimates, not exact values. An estimate that is off by a
factor of 1.5 is considered awesome, and one off by a factor of 3 is still
fairly good.
Also note that Spark/Hadoop does many token range scans. Maybe collecting
some statistics on the fly, during the scans (or during compaction), would
be viable? And running a full compaction to make the statistics more
accurate - why not? You need to do it anyway to get top speed when scanning
data in Spark, because a full table scan does a kind of implicit compaction
anyway, doesn't it?
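
To make the "factor 1.5x / 3x" wording concrete, here is a minimal sketch. It
assumes the error factor is the ratio between the larger and the smaller of
the estimated and actual sizes, which is my reading and not something the
ticket defines. The function names and the "poor" label for anything beyond
3x are hypothetical.

```python
def error_factor(estimated: float, actual: float) -> float:
    """Multiplicative error: how many times off the estimate is, in either direction."""
    if estimated <= 0 or actual <= 0:
        raise ValueError("sizes must be positive")
    return max(estimated / actual, actual / estimated)


def rate_estimate(estimated: float, actual: float) -> str:
    """Classify an estimate using the 1.5x and 3x thresholds from the comment."""
    factor = error_factor(estimated, actual)
    if factor <= 1.5:
        return "awesome"
    if factor <= 3.0:
        return "fairly good"
    return "poor"  # assumed label, not from the ticket
```

Under this definition, over- and underestimating by the same ratio are
treated alike, which seems to fit split planning where either kind of miss
skews the split sizes.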

Also, one more thing - it would be good to have those values per column
(sorry for making it even harder, I know it is not an easy task). At least
knowing that a column is responsible for xx% of the data in the table would
make a huge difference when estimating data size, because we're not always
fetching all columns and they may vary in size a lot (e.g. collections!).
Some sampling on insert would probably be enough.
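
A rough sketch of what "sampling on insert" could look like, purely as an
illustration: the class ColumnSizeSampler, its methods, and the idea of
passing each row as a dict of already-serialized values are all hypothetical
and not part of Cassandra. It records the byte size of each column for a
random fraction of inserts and reports each column's share of the sampled
bytes.

```python
import random
from collections import defaultdict


class ColumnSizeSampler:
    """Hypothetical helper: samples a fraction of inserts and tracks bytes per column."""

    def __init__(self, sample_rate: float = 0.01, seed=None):
        if not 0.0 < sample_rate <= 1.0:
            raise ValueError("sample_rate must be in (0, 1]")
        self.sample_rate = sample_rate
        self._rng = random.Random(seed)
        self._bytes = defaultdict(int)

    def on_insert(self, row: dict) -> None:
        """Record column sizes for a sampled row (column name -> serialized bytes)."""
        if self._rng.random() >= self.sample_rate:
            return  # row not sampled
        for column, value in row.items():
            self._bytes[column] += len(value)

    def column_shares(self) -> dict:
        """Return each column's fraction of all sampled bytes."""
        total = sum(self._bytes.values())
        if total == 0:
            return {}
        return {column: size / total for column, size in self._bytes.items()}
```

For example, with sample_rate=1.0 and rows whose "tags" value is three times
the size of "id", column_shares() reports 0.75 for "tags" and 0.25 for "id".
A reader fetching only "id" could then scale the table size estimate by 0.25.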


> Add data sizing to a system table
> ---------------------------------
>
>                 Key: CASSANDRA-7688
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7688
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jeremiah Jordan
>             Fix For: 2.1.3
>
>
> Currently you can't implement something similar to describe_splits_ex
> purely from a native protocol driver.
> https://datastax-oss.atlassian.net/browse/JAVA-312 is open to make
> ownership information easy to get from a client in the java-driver.  But
> you still need the data sizing part to get splits of a given size.  We
> should add the sizing information to a system table so that native clients
> can get to it.
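
To illustrate why the sizing part is needed, here is a minimal sketch of how
a client might build splits of a given size once it has both the token ranges
and a size estimate for each of them. The function plan_splits and the input
shape (a list of token range and estimated bytes pairs) are assumptions made
for this example. They are not the describe_splits_ex API or a proposed
system table schema.

```python
def plan_splits(range_sizes, target_split_bytes):
    """Group consecutive token ranges into splits of roughly target_split_bytes.

    range_sizes: list of ((start_token, end_token), estimated_bytes) pairs,
    in ring order. Returns a list of splits, each a list of token ranges.
    """
    if target_split_bytes <= 0:
        raise ValueError("target_split_bytes must be positive")
    splits = []
    current, current_bytes = [], 0
    for token_range, size in range_sizes:
        # Close the current split if adding this range would exceed the target.
        if current and current_bytes + size > target_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(token_range)
        current_bytes += size
    if current:
        splits.append(current)
    return splits
```

This sketch only merges small ranges. It does not subdivide a single range
that is larger than the target, which would additionally need some way to
pick intermediate tokens.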


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)