cassandra-commits mailing list archives

From "Stefania (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance
Date Tue, 02 Feb 2016 06:49:40 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Stefania updated CASSANDRA-11053:
---------------------------------
    Attachment: parent_profile.txt
                worker_profiles.txt

I've repeated the test in exactly the same conditions as described above, with {{cProfile}} profiling
all processes. I am attaching the full profile results (_worker_profiles.txt_ and _parent_profile.txt_).
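
For readers who want to reproduce this kind of per-process profiling, here is a minimal sketch using the standard {{cProfile}} and {{pstats}} modules. It is not the code used for the attached profiles: {{run_profiled}}, {{summarize}} and {{fake_worker}} are hypothetical names introduced for illustration. In a multi-process setup each process would presumably wrap its own entry point this way and write to a per-pid file.

```python
import cProfile
import io
import os
import pstats
import tempfile


def run_profiled(target, profile_path, *args, **kwargs):
    """Run target(*args, **kwargs) under cProfile and dump stats to profile_path."""
    profiler = cProfile.Profile()
    try:
        return profiler.runcall(target, *args, **kwargs)
    finally:
        # Dump even if the target raises, so a failing process still leaves a profile.
        profiler.dump_stats(profile_path)


def summarize(profile_path, limit=10):
    """Return the top `limit` entries, sorted by cumulative time, as text."""
    out = io.StringIO()
    stats = pstats.Stats(profile_path, stream=out)
    stats.sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


def fake_worker(n):
    # Hypothetical stand-in for a worker's main loop.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    # One profile file per process, keyed by pid (an assumed naming scheme).
    path = os.path.join(tempfile.gettempdir(), "worker_%d.prof" % os.getpid())
    run_profiled(fake_worker, path, 100000)
    print(summarize(path))
```

Each process has to finish writing its file before it exits, which is presumably why the parent needed the artificial sleep mentioned below.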


The total test time was approx 15 minutes (900 seconds), of which 15 seconds were an artificial
sleep in the parent to allow workers to dump their profile results.

It is clear that with these large datasets we can no longer afford to read all data in the
parent and dish out rows, as has been the approach so far. In fact, we spend over 600 seconds
in {{read_rows}}. We also spend significant time in the worker processes receiving data (30
seconds). Distributing file names to workers and letting them do all the work is pretty easy
to do and would solve both issues. However, it comes with some consequences:

* We would end up with one process per file unless we somehow split large files, but splitting
large files would take time and users can prepare their data themselves. Further, COPY TO
can now export to multiple files. Therefore I think we should keep things simple and adapt
our bulk tests to export to multiple files.
* Either we change the meaning of the *max ingest rate* and make it per worker process, or
we would need to use a global lock, which could become a bottleneck. I would prefer changing
the meaning of max ingest rate, as users can always specify a rate that is equal to
{{max_rate / num_processes}} if they really need to.
* To keep things simple, retries would be best handled by worker processes, and therefore if
one process fails then the import fails at least partially; I think we can live with this.
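
The first two points can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual cqlsh implementation: {{assign_files}}, {{per_worker_rate}} and {{WorkerRateLimiter}} are hypothetical names, files are assigned round-robin without being split, and the limiter uses a simple one-second window. Each worker owns its limiter, so no lock is shared between processes.

```python
import time


def assign_files(file_names, num_processes):
    """Assign whole files to workers round-robin; no file is ever split."""
    buckets = [[] for _ in range(num_processes)]
    for i, name in enumerate(file_names):
        buckets[i % num_processes].append(name)
    return buckets


def per_worker_rate(max_rate, num_processes):
    """Rate a user would pass per worker to approximate a global max_rate."""
    return max_rate / num_processes


class WorkerRateLimiter:
    """Per-process limiter over one-second windows; nothing is shared across workers."""

    def __init__(self, rows_per_second, clock=time.monotonic, sleep=time.sleep):
        self.rows_per_second = rows_per_second
        self._clock = clock
        self._sleep = sleep
        self._window_start = clock()
        self._rows_in_window = 0

    def acquire(self, rows):
        """Account for `rows` rows, sleeping first if the current window is full."""
        now = self._clock()
        if now - self._window_start >= 1.0:
            # The previous window has expired: start a fresh one.
            self._window_start = now
            self._rows_in_window = 0
        if self._rows_in_window + rows > self.rows_per_second:
            # Budget exhausted: wait for the window to end, then start a new one.
            remaining = 1.0 - (now - self._window_start)
            if remaining > 0:
                self._sleep(remaining)
            self._window_start = self._clock()
            self._rows_in_window = 0
        self._rows_in_window += rows
```

With this scheme a worker would call {{acquire(len(batch))}} before sending each batch. The aggregate rate is only bounded by {{num_processes}} times the per-worker rate, which is the change of meaning described above.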


In terms of the worker processes, there is room for improvement there too, but it is not as
straightforward. One interesting thing to do would be to use a cythonized driver version, but
this would not work out of the box due to the formatting hooks we inject in the driver. We
spend a lot of time batching records, getting the replicas, binding parameters and hashing
(_murmur3).

WDYT [~pauloricardomg] and [~thobbs]?

> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11053
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>         Attachments: copy_from_large_benchmark.txt, parent_profile.txt, worker_profiles.txt
>
>
> Running COPY FROM on a large dataset (20G divided in 20M records) revealed two issues:
> * The progress report is incorrect: it is very slow until almost the end of the test,
at which point it catches up extremely quickly.
> * The performance in rows per second is similar to running smaller tests with a smaller
cluster locally (approx 35,000 rows per second). As a comparison, cassandra-stress manages
50,000 rows per second under the same set-up, making it roughly 1.4 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
