Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Tue, 22 Mar 2016 03:01:25 +0000 (UTC)
From: "Stefania (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12932942.1453341387000.13216.1458615685730@Atlassian.JIRA>
In-Reply-To: <JIRA.12932942.1453341387000@Atlassian.JIRA>
References: <JIRA.12932942.1453341387000@Atlassian.JIRA>
 <JIRA.12932942.1453341387756@arcas>
Subject: [jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix
 progress report and debug performance
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefania updated CASSANDRA-11053:
---------------------------------
    Attachment: bisect_test.py

> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11053
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>              Labels: doc-impacting
>             Fix For: 2.1.14, 2.2.6, 3.0.5, 3.5
>
>         Attachments: bisect_test.py, copy_from_large_benchmark.txt, copy_from_large_benchmark_2.txt, parent_profile.txt, parent_profile_2.txt, worker_profiles.txt, worker_profiles_2.txt
>
>
> h5. Description
> Running COPY FROM on a large dataset (20 GB divided into 20M records) revealed two issues:
> * The progress report is incorrect: it advances very slowly until almost the end of the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to that of smaller tests run locally against a smaller cluster (approx. 35,000 rows per second). By comparison, cassandra-stress manages 50,000 rows per second under the same set-up, which makes it roughly 1.5 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.
> h5. Doc-impacting changes to COPY FROM options
> * A new option was added: PREPAREDSTATEMENTS - it indicates whether prepared statements should be used; it defaults to true.
> * The default value of CHUNKSIZE changed from 1000 to 5000.
> * The default value of MINBATCHSIZE changed from 2 to 10.
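
As an illustration of the option changes listed above, here is a minimal Python sketch that assembles a cqlsh COPY FROM statement. It assumes the usual `WITH name = value AND ...` option syntax; the helper names, the table `ks.large_table`, and the file `large.csv` are hypothetical and are not part of the ticket or of cqlsh itself.

```python
# New defaults as listed in the ticket description above.
NEW_DEFAULTS = {"PREPAREDSTATEMENTS": True, "CHUNKSIZE": 5000, "MINBATCHSIZE": 10}


def format_value(value):
    """Render booleans as lowercase true/false, everything else via str()."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_copy_from(table, csv_path, **overrides):
    """Build a COPY FROM statement (hypothetical helper, for illustration only).

    Starts from the new defaults and applies any overrides, so the three
    options are always spelled out explicitly in the resulting statement.
    """
    options = dict(NEW_DEFAULTS)
    options.update((name.upper(), value) for name, value in overrides.items())
    clause = " AND ".join(
        "{} = {}".format(name, format_value(options[name]))
        for name in sorted(options)
    )
    return "COPY {} FROM '{}' WITH {};".format(table, csv_path, clause)


if __name__ == "__main__":
    # Explicitly restore the previous CHUNKSIZE and MINBATCHSIZE values.
    print(build_copy_from("ks.large_table", "large.csv",
                          chunksize=1000, minbatchsize=2))
```

Passing `chunksize=1000, minbatchsize=2` reproduces the values that were the defaults before this change, which may be useful when comparing benchmark runs across versions.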

