From: aaron morton
Mime-Version: 1.0 (Apple Message framework v1244.3)
Content-Type: multipart/alternative; boundary="Apple-Mail=_5FEAF954-B2E8-4E9D-A859-359D952CFA86"
Subject: Re: Cassandra bulk import confusion
Date: Tue, 2 Aug 2011 09:55:04 +1200
To: user@cassandra.apache.org
Reply-To: user@cassandra.apache.org
Message-Id: <6B0312AB-1927-4268-A712-7031A8CC1AA5@thelastpickle.com>

--Apple-Mail=_5FEAF954-B2E8-4E9D-A859-359D952CFA86
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=us-ascii

In case you missed it, fresh off the press: http://www.datastax.com/dev/blog/bulk-loading

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

--Apple-Mail=_5FEAF954-B2E8-4E9D-A859-359D952CFA86
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=us-ascii

http://www.datastax.com/dev/blog/bulk-loading

Cheers

http://www.thelastpickle.com

On 30 Jul 2011, at 04:10, Jeff Schmidt wrote:

Hello:

I'm relatively new to Cassandra, but I've been searching around, and it looks like Cassandra 0.8.x has improved support for bulk importing of data. I keep finding references to the json2sstable command, and I've read about that on the DataStax and Apache documentation pages.

There's a lot of detail here if you want it, otherwise please skip to the end. json2sstable seems to run successfully, but I cannot see the data in the new CF using the CLI.

My goal is to extract data from various sources, munge it together in some manner, and then bulk load it into Cassandra. That is as opposed to using Hector to programmatically insert the data. I'd like to deploy these files to the cloud (Puppet), instruct Cassandra to bulk load them, and then inform the application that new data exists. This is for a periodic content update of certain column families of curated, read-only data that occurs on a monthly basis. I'm thinking of using JMX to signal the application to switch to a new set of CFs and keep running w/o downtime. At a later time, I'll delete the old CFs.

I'm using Cassandra 0.8.2 and I'm just playing with this concept. I create a test CF using the CLI:

[default@Ingenuity] use Test;
Authenticated to keyspace: Test
[default@Test] create column family TestCF with comparator = UTF8Type and column_metadata = [{column_name: nodeId, validation_class: UTF8Type}];
28991070-b9f9-11e0-0000-242d50cf1fb5
Waiting for schema agreement...
... schemas agree across the cluster
[default@Test] update column family TestCF with key_validation_class=UTF8Type;
2af88440-b9f9-11e0-0000-242d50cf1fb5
Waiting for schema agreement...
... schemas agree across the cluster
[default@Test] set TestCF['SID|123']['nodeId'] = 'ING:001';
Value inserted.
[default@Test] set TestCF['EG|3030']['nodeId'] = 'ING:002';
Value inserted.
[default@Test] set TestCF['EG|3031']['nodeId'] = 'ING:003';
Value inserted.
[default@Test] list TestCF;
Using default limit of 100
-------------------
RowKey: EG|3030
=> (column=nodeId, value=ING:002, timestamp=1311954072252000)
-------------------
RowKey: EG|3031
=> (column=nodeId, value=ING:003, timestamp=1311954073631000)
-------------------
RowKey: SID|123
=> (column=nodeId, value=ING:001, timestamp=1311954072249000)

3 Rows Returned.
[default@Test]
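
The timestamps in the `list` output above look like microseconds since the Unix epoch. A minimal Python sketch for reading them, assuming that interpretation is right (the helper name `column_timestamp_to_utc` is made up for illustration and is not part of any Cassandra tool):

```python
from datetime import datetime, timedelta, timezone

# Reference point for the conversion: the Unix epoch in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def column_timestamp_to_utc(timestamp_us: int) -> datetime:
    """Treat a column timestamp as microseconds since the epoch (assumption)."""
    # Integer microseconds avoid the rounding a float division would introduce.
    return EPOCH + timedelta(microseconds=timestamp_us)


# The three timestamps shown in the CLI listing above.
for ts in (1311954072252000, 1311954073631000, 1311954072249000):
    print(ts, column_timestamp_to_utc(ts).isoformat())
```

If the values decode to a moment shortly before the message was written, the microsecond assumption is probably correct.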

Now, cassandra.yaml is stock, except I changed it to place the data in a non-default location:

# directories where Cassandra should store data on disk.
data_file_directories:
    - /usr/local/ingenuity/isec/cassandra/datastore/data

# commit log
commitlog_directory: /usr/local/ingenuity/isec/cassandra/datastore/commitlog

# saved caches
saved_caches_directory: /usr/local/ingenuity/isec/cassandra/datastore/saved_caches

In that data directory:

[imac:datastore/data/Test] jas% pwd
/usr/local/ingenuity/isec/cassandra/datastore/data/Test
[imac:datastore/data/Test] jas% ls
[imac:datastore/data/Test] jas%

There is nothing there. Perhaps Cassandra has not yet felt the need to write the SSTables. So, since I need to reference an actual data file with sstable2json, I ran nodetool flush:

[imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/nodetool -h localhost flush Test TestCF
[imac:isec/cassandra/apache-cassandra-0.8.2] jas%

Now, I have files!

[imac:datastore/data/Test] jas% pwd
/usr/local/ingenuity/isec/cassandra/datastore/data/Test
[imac:datastore/data/Test] jas% ls
TestCF-g-1-Data.db        TestCF-g-1-Index.db
TestCF-g-1-Filter.db      TestCF-g-1-Statistics.db
[imac:datastore/data/Test] jas%
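
The file names in the listing above appear to follow a `<column family>-<version>-<generation>-<component>.db` pattern. The Python sketch below splits a name into those parts. Treat the pattern itself as an assumption inferred from this listing rather than a documented contract; `parse_sstable_name` and `SSTableFile` are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple


class SSTableFile(NamedTuple):
    column_family: str
    version: str      # e.g. "g" in the listing above (assumed to be a format version)
    generation: int   # e.g. 1
    component: str    # e.g. "Data", "Index", "Filter", "Statistics"


# Assumed layout: <cf>-<version>-<generation>-<component>.db
_NAME = re.compile(r"^(?P<cf>.+)-(?P<version>[a-z]+)-(?P<gen>\d+)-(?P<component>\w+)\.db$")


def parse_sstable_name(name: str) -> SSTableFile:
    """Split an SSTable component file name into its parts."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised SSTable file name: {name!r}")
    return SSTableFile(
        match["cf"], match["version"], int(match["gen"]), match["component"]
    )


for file_name in ("TestCF-g-1-Data.db", "TestCF-g-1-Index.db"):
    print(parse_sstable_name(file_name))
```

This helps when choosing the output path for json2sstable, since the target name has to name the right column family and the Data component.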

Given that, I'm able to run sstable2json and I can see I'm getting what's in that CF:

[imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/sstable2json /usr/local/ingenuity/isec/cassandra/datastore/data/Test/TestCF-g-1-Data.db > testcf.jason
[imac:isec/cassandra/apache-cassandra-0.8.2] jas% cat testcf.jason
{
"45477c33303330": [["nodeId","ING:002",1311954072252000]],
"45477c33303331": [["nodeId","ING:003",1311954073631000]],
"5349447c313233": [["nodeId","ING:001",1311954072249000]]
}
[imac:isec/cassandra/apache-cassandra-0.8.2] jas%
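
The row keys in the dump above are hex-encoded bytes of the original keys; for example, `45477c33303330` is the UTF-8 encoding of `EG|3030`. A small Python sketch that decodes them for inspection, assuming the keys are UTF-8 text as in this test CF (`decode_rows` is an illustrative helper, not part of Cassandra):

```python
import json


def decode_rows(dump_text: str) -> dict:
    """Map hex-encoded row keys in an sstable2json dump back to UTF-8 strings."""
    rows = json.loads(dump_text)
    return {bytes.fromhex(hex_key).decode("utf-8"): columns
            for hex_key, columns in rows.items()}


# The dump shown above, pasted in as a string.
sample = """{
"45477c33303330": [["nodeId","ING:002",1311954072252000]],
"45477c33303331": [["nodeId","ING:003",1311954073631000]],
"5349447c313233": [["nodeId","ING:001",1311954072249000]]
}"""

decoded = decode_rows(sample)
for row_key, columns in decoded.items():
    print(row_key, columns)
```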

Oops, okay, that file extension should be json not jason, but oh well... :)

Okay, so now I have data in the proper format for importing with json2sstable. Like I said, I want to import this data into a new CF. Let's call it TestCF2 (in the same keyspace):

[default@Test] create column family TestCF2 with comparator = UTF8Type and column_metadata = [{column_name: nodeId, validation_class: UTF8Type}];
4dcc44b0-b9fa-11e0-0000-242d50cf1fb5
Waiting for schema agreement...
... schemas agree across the cluster
[default@Test] update column family TestCF2 with key_validation_class=UTF8Type;
5092dec0-b9fa-11e0-0000-242d50cf1fb5
Waiting for schema agreement...
... schemas agree across the cluster
[default@Test]

Again there are no files created in the data directory, so I do a flush for the new CF:

[imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/nodetool -h localhost flush Test TestCF2
[imac:isec/cassandra/apache-cassandra-0.8.2] jas%

Well, that did not help, still no files for TestCF2. There is no actual data yet, so I'm guessing the system tables have what they need. So, I go ahead and import the data using json2sstable:

[imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/json2sstable -K Test -c TestCF2 testcf.jason /usr/local/ingenuity/isec/cassandra/datastore/data/Test/TestCF2-g-1-Data.db
Importing 3 keys...
3 keys imported successfully.
[imac:isec/cassandra/apache-cassandra-0.8.2] jas%
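
For the monthly load, the input file for json2sstable could be generated directly instead of exported from another CF. The Python sketch below writes JSON in the same shape as the dump shown earlier: hex-encoded row keys mapping to `[name, value, timestamp]` triples. It assumes UTF-8 keys and plain-text column names and values, as in this UTF8Type test CF; `build_import_json` is a hypothetical helper, and other comparators or validators may need a different encoding.

```python
import json
import time
from typing import Optional


def build_import_json(rows: dict, timestamp_us: Optional[int] = None) -> str:
    """Render {row_key: {column_name: value}} in the layout seen in the dump."""
    if timestamp_us is None:
        # Assumption: column timestamps are microseconds since the Unix epoch.
        timestamp_us = int(time.time() * 1_000_000)
    payload = {}
    for row_key, columns in rows.items():
        hex_key = row_key.encode("utf-8").hex()  # e.g. "SID|123" -> "5349447c313233"
        payload[hex_key] = [[name, value, timestamp_us]
                            for name, value in columns.items()]
    return json.dumps(payload, indent=1)


print(build_import_json({"SID|123": {"nodeId": "ING:001"}}, 1311954072249000))
```

The result can be written to a file and passed to json2sstable in place of testcf.jason.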

Okay, and the files did show up:

[imac:datastore/data/Test] jas% pwd
/usr/local/ingenuity/isec/cassandra/datastore/data/Test
[imac:datastore/data/Test] jas% ls
TestCF-g-1-Data.db          TestCF2-g-1-Data.db
TestCF-g-1-Filter.db        TestCF2-g-1-Filter.db
TestCF-g-1-Index.db         TestCF2-g-1-Index.db
TestCF-g-1-Statistics.db    TestCF2-g-1-Statistics.db
[imac:datastore/data/Test] jas%

Back in the CLI:

[default@Test] list TestCF2;
Using default limit of 100

0 Row Returned.
[default@Test]

However, if I edit TestCF-g-1-Data.db, I can sort of see the data is present. Quitting and restarting the CLI has no effect. What gets the CF data into the MemTables so it's accessible to a Cassandra client? I tried various nodetool commands (repair, compact, cleanup, flush, invalidatekeycache, invalidaterowcache) and I don't see any rows for TestCF2 in the CLI.

Anyway, it seems this procedure works as I'd expect, well except for not seeing the new data. :)

What am I missing here?

Thanks,

Jeff
--
Jeff Schmidt
535 Consulting
jas@535consulting.com
http://www.535consulting.com
(650) 423-1068






--Apple-Mail=_5FEAF954-B2E8-4E9D-A859-359D952CFA86--