cassandra-commits mailing list archives

From "Stefania (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-9304) COPY TO improvements
Date Thu, 05 Nov 2015 09:11:27 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991279#comment-14991279 ]

Stefania edited comment on CASSANDRA-9304 at 11/5/15 9:11 AM:
--------------------------------------------------------------

Thank you for your input. 

Regarding version support for Windows, 2.2+ is fine, but for completeness I'll point out that
the only obstacle left in 2.1 is the name of the file (_cqlsh_ -> _cqlsh.py_).

Regarding the problem with pipes, I've replaced pipes with queues so we don't need to deal
with the low-level, platform-specific details. Queues can also be safely used from the callback
threads, which was not the case for pipes.
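
To illustrate the queue-based approach, here is a minimal sketch, not the code from the patch:
callbacks running on worker threads push results onto a queue and the main thread drains it.
The names {{fetch_pages}}, {{on_page}} and {{collect}} are hypothetical, and the standard-library
{{queue.Queue}} (Python 3 naming) stands in for whatever queue type the patch uses;
{{multiprocessing.Queue}} offers the same put/get interface across processes.

```python
import queue
import threading


def fetch_pages(num_pages, on_page):
    """Simulate a driver that delivers each page on its own callback thread."""
    threads = [threading.Thread(target=on_page, args=(i, ["row-%d" % i]))
               for i in range(num_pages)]
    for t in threads:
        t.start()
    return threads


def collect(num_pages):
    results = queue.Queue()  # thread-safe: put() may be called from any thread

    def on_page(page_no, rows):
        # Runs on a callback thread, not on the main thread.
        results.put((page_no, rows))

    threads = fetch_pages(num_pages, on_page)
    # The main thread blocks on get() until each page has arrived.
    pages = [results.get(timeout=5) for _ in range(num_pages)]
    for t in threads:
        t.join()
    return sorted(pages)
```

Because the queue does its own locking, the callbacks need no platform-specific handling of
file descriptors or handles, which is the point made above about pipes.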

Regarding the problem with the driver, -I haven't tested in 2.2 but I don't think it matters
which version since- I verified that the problem applies to 2.2 as well. Yesterday I was using
the latest cassandra-test driver version; today I used 2.7.2. The column type is the same,
{{cassandra.cqltypes.BytesType}}, and the method called from {{recv_result_rows()}} is the same,
{{<bound method CassandraTypeType.from_binary of <class 'cassandra.cqltypes.BytesType'>>}},
but {{cls.deserialize}} in {{from_binary}} is a lambda in the case that works and the default
implementation {{_CassandraType.deserialize}} in the case that does not work. I don't know
where the lambda comes from, but I noticed there is a Cython deserializer for {{BytesType}}
in deserializers.pyx. I don't know how Cython works, but if this is picked up in the normal
case then the problem is again with the way multiprocessing imports modules.

The problem can be solved by adding a deserialize implementation to BytesType, as is done
for other types:

{code}
Stefi@Lila MINGW64 ~/git/cstar/python-driver ((2.7.2))
$ git diff
diff --git a/cassandra/cqltypes.py b/cassandra/cqltypes.py
index f39d28b..eb8d3b6 100644
--- a/cassandra/cqltypes.py
+++ b/cassandra/cqltypes.py
@@ -350,6 +350,10 @@ class BytesType(_CassandraType):
     def serialize(val, protocol_version):
         return six.binary_type(val)

+    @staticmethod
+    def deserialize(byts, protocol_version):
+        return bytearray(byts)
+

 class DecimalType(_CassandraType):
     typename = 'decimal'
{code}
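
A stripped-down sketch of the dispatch involved, using toy classes rather than the driver's real
ones: {{from_binary}} calls {{cls.deserialize}}, so a type without its own override falls through
to the base implementation and returns the raw bytes unchanged. {{ToyType}}, {{ToyBlobUnpatched}}
and {{ToyBlobPatched}} are hypothetical names for illustration only.

```python
class ToyType(object):
    @staticmethod
    def deserialize(byts, protocol_version):
        # Default: hand back the wire bytes untouched.
        return byts

    @classmethod
    def from_binary(cls, byts, protocol_version):
        # Dispatches to whichever deserialize the concrete class provides.
        return cls.deserialize(byts, protocol_version)


class ToyBlobUnpatched(ToyType):
    pass  # no override, so the default above is used


class ToyBlobPatched(ToyType):
    @staticmethod
    def deserialize(byts, protocol_version):
        # Mirrors the one-line fix in the diff.
        return bytearray(byts)


raw = b"\xca\xfe"
unpatched = ToyBlobUnpatched.from_binary(raw, 3)  # plain byte string
patched = ToyBlobPatched.from_binary(raw, 3)      # bytearray
```

With the override in the class body, the result no longer depends on anything being attached
to the class at runtime.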

If this is not enough and you want to debug some more [~aholmber], you can use the 2.1 patch
attached. I'm still working on the 2.2 merge. You need to generate a table with a blob; I
used cassandra-stress. Then run {{COPY <anytable> TO 'anyfile';}} from cqlsh, which
should result in a Unicode decode error on Windows because the blob is received as a string.
If you prefer me to test things for you, that works too.
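
Why a blob arriving as a string can end in a decode error, as a rough sketch: this assumes a
formatter that hex-encodes {{bytearray}} values and treats any plain byte string as encoded text.
{{format_value}} is a hypothetical helper written for this illustration, not the cqlsh formatting code.

```python
import binascii


def format_value(val):
    """Hypothetical formatter: blobs as hex, everything else as text."""
    if isinstance(val, bytearray):
        return "0x" + binascii.hexlify(val).decode("ascii")
    if isinstance(val, bytes):
        # Assumption: raw byte strings are taken to be UTF-8 encoded text.
        return val.decode("utf-8")
    return str(val)


blob = b"\xff\xfe\x00\x01"  # arbitrary binary data, not valid UTF-8

as_blob = format_value(bytearray(blob))  # formatted as hex

try:
    format_value(blob)  # same data, but received as a byte string
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```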



was (Author: stefania):
Thank you for your input. 

Regarding version support for Windows, fine for 2.2+ but for completeness I'll point out that
the only obstacle left in 2.1 is the name of the file (_cqlsh_ -> _cqlsh.py_).

Regarding the problem with pipes, I've replaced pipes with queues so we don't need to deal
with the low level platform specific details. Queues can also be safely used from the callback
threads, which was not the case for pipes.

Regarding the problem with the driver, I haven't tested in 2.2 but I don't think it matters
which version since yesterday I was using the latest cassandra-test driver version. Today
I used 2.7.2. The column type is the same, {{cassandra.cqltypes.BytesType}}, the method called
from {{recv_result_rows()}} is the same, {{<bound method CassandraTypeType.from_binary
of <class 'cassandra.cqltypes.BytesType'>>}} but {{cls.serialize}} in {{from_binary}}
is a lambda for the case that works and the default implementation {{CassandraType.deserialize}}
for  the case that does not work. I don't know where the lambda comes from but I noticed there
is a cython deserialize for {{BytesType}} in deserializers.pyx. I don't know how cython works
but if this is picked up in the normal case then the problem is again with the way multiprocessing
imports modules. 

The problem can be solved by adding a deserialize implementation to BytesType, like it's done
for other types:

{code}
Stefi@Lila MINGW64 ~/git/cstar/python-driver ((2.7.2))
$ git diff
diff --git a/cassandra/cqltypes.py b/cassandra/cqltypes.py
index f39d28b..eb8d3b6 100644
--- a/cassandra/cqltypes.py
+++ b/cassandra/cqltypes.py
@@ -350,6 +350,10 @@ class BytesType(_CassandraType):
     def serialize(val, protocol_version):
         return six.binary_type(val)

+    @staticmethod
+    def deserialize(byts, protocol_version):
+        return bytearray(byts)
+

 class DecimalType(_CassandraType):
     typename = 'decimal'
{code}

If this is not enough and you want to debug some more [~aholmber], you can use the 2.1 patch
attached. I'm still working on the 2.2. merge. You need to generate a table with a blob, I
used cassandra-stress. Then run {{COPY <anytable> TO 'anyfile';}} from cqlsh and this
should result in a Unicode decode error on Windows because the blob is received as a string.
If you prefer me to test things for you, that works too.


> COPY TO improvements
> --------------------
>
>                 Key: CASSANDRA-9304
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9304
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Stefania
>            Priority: Minor
>              Labels: cqlsh
>             Fix For: 3.x, 2.1.x, 2.2.x
>
>
> COPY FROM has gotten a lot of love.  COPY TO not so much.  One obvious improvement could
> be to parallelize reading and writing (write one page of data while fetching the next).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
