From: Jeff Schmidt
To: user@cassandra.apache.org
Subject: Cassandra bulk import confusion
Date: Fri, 29 Jul 2011 10:10:42 -0600

Hello:

I'm relatively new to Cassandra, but
I've been searching around, and it looks like Cassandra 0.8.x has improved support for bulk importing of data. I keep finding references to the json2sstable command, and I've read about that on the Datastax and Apache documentation pages.

There's a lot of detail here if you want it, otherwise please skip to the end. json2sstable seems to run successfully, but I cannot see the data in the new CF using the CLI.

My goal is to extract data from various sources, munge it together in some manner, and then bulk load it into Cassandra. That is as opposed to using Hector to programmatically insert the data. I'd like to deploy these files to the cloud (Puppet), instruct Cassandra to bulk load them, and then inform the application that new data exists. This is for a periodic content update of certain column families of curated, read-only data that occurs on a monthly basis. I'm thinking of using JMX to signal the application to switch to a new set of CFs and keep running w/o downtime. At a later time, I'll delete the old CFs.

I'm using Cassandra 0.8.2 and I'm just playing with this concept. I create a test CF using the CLI:

[default@Ingenuity] use Test;
Authenticated to keyspace: Test
[default@Test] create column family TestCF with comparator = UTF8Type and column_metadata = [{column_name: nodeId, validation_class: UTF8Type}];
28991070-b9f9-11e0-0000-242d50cf1fb5
Waiting for schema agreement...
... schemas agree across the cluster
[default@Test] update column family TestCF with key_validation_class=UTF8Type;
2af88440-b9f9-11e0-0000-242d50cf1fb5
Waiting for schema agreement...
... schemas agree across the cluster
[default@Test] set TestCF['SID|123']['nodeId'] = 'ING:001';
Value inserted.
[default@Test] set TestCF['EG|3030']['nodeId'] = 'ING:002';
Value inserted.
[default@Test] set TestCF['EG|3031']['nodeId'] = 'ING:003';
Value inserted.
[default@Test] list TestCF;
Using default limit of 100
-------------------
RowKey: EG|3030
=> (column=nodeId, value=ING:002, timestamp=1311954072252000)
-------------------
RowKey: EG|3031
=> (column=nodeId, value=ING:003, timestamp=1311954073631000)
-------------------
RowKey: SID|123
=> (column=nodeId, value=ING:001, timestamp=1311954072249000)

3 Rows Returned.
[default@Test]

Now, cassandra.yaml is stock, except I changed it to place the data in a non-default location:

# directories where Cassandra should store data on disk.
data_file_directories:
    - /usr/local/ingenuity/isec/cassandra/datastore/data

# commit log
commitlog_directory: /usr/local/ingenuity/isec/cassandra/datastore/commitlog

# saved caches
saved_caches_directory: /usr/local/ingenuity/isec/cassandra/datastore/saved_caches

In that data directory:

[imac:datastore/data/Test] jas% pwd
/usr/local/ingenuity/isec/cassandra/datastore/data/Test
[imac:datastore/data/Test] jas% ls
[imac:datastore/data/Test] jas%

There is nothing there. Perhaps Cassandra has not yet felt the need to write the SSTables. So, since I need to reference an actual data file with sstable2json, I ran nodetool flush:

[imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/nodetool -h localhost flush Test TestCF
[imac:isec/cassandra/apache-cassandra-0.8.2] jas%

Now, I have files!

[imac:datastore/data/Test] jas% pwd
/usr/local/ingenuity/isec/cassandra/datastore/data/Test
[imac:datastore/data/Test] jas% ls
TestCF-g-1-Data.db        TestCF-g-1-Index.db
TestCF-g-1-Filter.db      TestCF-g-1-Statistics.db
[imac:datastore/data/Test] jas%

Given that, I'm able to run sstable2json and I can see I'm getting what's in that CF:

[imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/sstable2json /usr/local/ingenuity/isec/cassandra/datastore/data/Test/TestCF-g-1-Data.db > testcf.jason
[imac:isec/cassandra/apache-cassandra-0.8.2] jas% cat testcf.jason
{
"45477c33303330": [["nodeId","ING:002",1311954072252000]],
"45477c33303331": [["nodeId","ING:003",1311954073631000]],
"5349447c313233": [["nodeId","ING:001",1311954072249000]]
}
[imac:isec/cassandra/apache-cassandra-0.8.2] jas%

Oops, okay, that file extension should be json not jason, but oh well... :)

Okay, so now I have data in the proper format for importing with json2sstable. Like I said, I want to import this data into a new CF. Let's call it TestCF2 (in the same keyspace):

[default@Test] create column family TestCF2 with comparator = UTF8Type and column_metadata = [{column_name: nodeId, validation_class: UTF8Type}];
4dcc44b0-b9fa-11e0-0000-242d50cf1fb5
Waiting for schema agreement...
... schemas agree across the cluster
[default@Test] update column family TestCF2 with key_validation_class=UTF8Type;
5092dec0-b9fa-11e0-0000-242d50cf1fb5
Waiting for schema agreement...
... schemas agree across the cluster
[default@Test]

Again there are no files created in the data directory, so I do a flush for the new CF:

[imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/nodetool -h localhost flush Test TestCF2
[imac:isec/cassandra/apache-cassandra-0.8.2] jas%

Well, that did not help, still no files for TestCF2. There is no actual data yet, so I'm guessing the system tables have what they need.
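
A side note on the testcf.jason dump shown earlier: the row keys in it look like hex-encoded bytes of the keys set in the CLI. The sketch below is not part of the procedure in this message. It is a small Python illustration that decodes the dump, assuming the keys are UTF-8 text (as key_validation_class=UTF8Type suggests) and that the column timestamps are microseconds since the Unix epoch. The function names are made up for illustration.

```python
import json
from datetime import datetime, timedelta, timezone

# The sstable2json dump from this message, copied verbatim.
DUMP = """
{
"45477c33303330": [["nodeId","ING:002",1311954072252000]],
"45477c33303331": [["nodeId","ING:003",1311954073631000]],
"5349447c313233": [["nodeId","ING:001",1311954072249000]]
}
"""

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_row_key(hex_key: str) -> str:
    """Turn a hex-encoded row key back into text (assumes UTF-8 keys)."""
    return bytes.fromhex(hex_key).decode("utf-8")


def decode_timestamp(micros: int) -> datetime:
    """Read a column timestamp as microseconds since the epoch (assumption)."""
    return EPOCH + timedelta(microseconds=micros)


rows = json.loads(DUMP)
for hex_key, columns in rows.items():
    # Each column in the dump is a [name, value, timestamp] triple.
    for name, value, ts in columns:
        print(decode_row_key(hex_key), name, value, decode_timestamp(ts).isoformat())
```

With these assumptions, "45477c33303330" decodes to EG|3030 and "5349447c313233" to SID|123, which matches the keys inserted through the CLI.
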
So, I go ahead and import the data using json2sstable:

[imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/json2sstable -K Test -c TestCF2 testcf.jason /usr/local/ingenuity/isec/cassandra/datastore/data/Test/TestCF2-g-1-Data.db
Importing 3 keys...
3 keys imported successfully.
[imac:isec/cassandra/apache-cassandra-0.8.2] jas%

Okay, and the files did show up:

[imac:datastore/data/Test] jas% pwd
/usr/local/ingenuity/isec/cassandra/datastore/data/Test
[imac:datastore/data/Test] jas% ls
TestCF-g-1-Data.db          TestCF2-g-1-Data.db
TestCF-g-1-Filter.db        TestCF2-g-1-Filter.db
TestCF-g-1-Index.db         TestCF2-g-1-Index.db
TestCF-g-1-Statistics.db    TestCF2-g-1-Statistics.db
[imac:datastore/data/Test] jas%

Back in the CLI:

[default@Test] list TestCF2;
Using default limit of 100

0 Row Returned.
[default@Test]

However, if I edit TestCF-g-1-Data.db, I can sort of see the data is present. Quitting and restarting the CLI has no effect. What gets the CF data into the MemTables so it's accessible to a Cassandra client? I tried various nodetool commands (repair, compact, cleanup, flush, invalidatekeycache, invalidaterowcache) and I still don't see any rows for TestCF2 in the CLI.

Anyway, it seems this procedure works as I'd expect, well, except for not seeing the new data. :)

What am I missing here?

Thanks,

Jeff
--
Jeff Schmidt
535 Consulting
jas@535consulting.com
http://www.535consulting.com
(650) 423-1068