cassandra-user mailing list archives

From Jeff Schmidt <...@535consulting.com>
Subject Cassandra bulk import confusion
Date Fri, 29 Jul 2011 16:10:42 GMT
Hello:

I'm relatively new to Cassandra, but I've been searching around, and it looks like Cassandra
0.8.x has improved support for bulk importing of data.  I keep finding references to the json2sstable
command, and I've read about that on the DataStax and Apache documentation pages.

There's a lot of detail here if you want it, otherwise please skip to the end. json2sstable
seems to run successfully, but I cannot see the data in the new CF using the CLI.

My goal is to extract data from various sources, munge it together in some manner, and then bulk
load it into Cassandra, as opposed to using Hector to programmatically insert the
data.  I'd like to deploy these files to the cloud (Puppet), instruct Cassandra to
bulk load them, and then inform the application that new data exists.  This is for a periodic
content update of certain column families of curated, read-only data that occurs on a monthly
basis. I'm thinking of using JMX to signal the application to switch to a new set of CFs and
keep running w/o downtime.  At a later time, I'll delete the old CFs.
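
To make the "munge it together" step concrete, here is a minimal Python sketch of writing munged rows out in the same JSON shape that sstable2json prints further down in this message: hex-encoded row keys, each mapped to a list of [column name, value, timestamp] triples. The names build_dump and rows are hypothetical, the microsecond timestamp is an assumption based on the timestamps the CLI shows, and whether a given json2sstable version accepts exactly this shape is not something I have verified.

```python
import json
import time


def build_dump(rows, timestamp=None):
    """Build a dict shaped like the sstable2json dump in this message.

    rows is a hypothetical {row_key: {column_name: value}} mapping.
    """
    if timestamp is None:
        # Assumption: timestamps are microseconds since the epoch.
        timestamp = int(time.time() * 1_000_000)
    dump = {}
    for row_key, columns in rows.items():
        # Row keys appear hex-encoded in the dump, e.g. "EG|3030" -> "45477c33303330".
        hex_key = row_key.encode("utf-8").hex()
        # Columns are emitted sorted by name (an assumption about what the importer expects).
        dump[hex_key] = [[name, value, timestamp]
                         for name, value in sorted(columns.items())]
    return dump


rows = {"EG|3030": {"nodeId": "ING:002"}, "SID|123": {"nodeId": "ING:001"}}
print(json.dumps(build_dump(rows, timestamp=1311954072252000), indent=1))
```

Writing the result of json.dumps to a file would give something to hand to json2sstable, under the assumptions above.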

I'm using Cassandra 0.8.2 and I'm just playing with this concept.  I created a test CF using
the CLI:

[default@Ingenuity] use Test;
Authenticated to keyspace: Test
[default@Test] create column family TestCF with comparator = UTF8Type and column_metadata = [{column_name: nodeId, validation_class: UTF8Type}];
28991070-b9f9-11e0-0000-242d50cf1fb5
Waiting for schema agreement...
... schemas agree across the cluster
[default@Test] update column family TestCF with key_validation_class=UTF8Type; 
2af88440-b9f9-11e0-0000-242d50cf1fb5
Waiting for schema agreement...
... schemas agree across the cluster
[default@Test] set TestCF['SID|123']['nodeId'] = 'ING:001';  
Value inserted.
[default@Test] set TestCF['EG|3030']['nodeId'] = 'ING:002';  
Value inserted.
[default@Test] set TestCF['EG|3031']['nodeId'] = 'ING:003'; 
Value inserted.
[default@Test] list TestCF;
Using default limit of 100
-------------------
RowKey: EG|3030
=> (column=nodeId, value=ING:002, timestamp=1311954072252000)
-------------------
RowKey: EG|3031
=> (column=nodeId, value=ING:003, timestamp=1311954073631000)
-------------------
RowKey: SID|123
=> (column=nodeId, value=ING:001, timestamp=1311954072249000)

3 Rows Returned.
[default@Test] 

Now, cassandra.yaml is stock, except I changed it to place the data in a non-default location:

# directories where Cassandra should store data on disk.
data_file_directories:
    - /usr/local/ingenuity/isec/cassandra/datastore/data

# commit log
commitlog_directory: /usr/local/ingenuity/isec/cassandra/datastore/commitlog

# saved caches
saved_caches_directory: /usr/local/ingenuity/isec/cassandra/datastore/saved_caches

In that data directory:

[imac:datastore/data/Test] jas% pwd
/usr/local/ingenuity/isec/cassandra/datastore/data/Test
[imac:datastore/data/Test] jas% ls
[imac:datastore/data/Test] jas% 

There is nothing there.  Perhaps Cassandra has not yet felt the need to write the SSTables.
So, since I need to reference an actual data file with sstable2json, I ran nodetool flush:

[imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/nodetool -h localhost flush Test TestCF
[imac:isec/cassandra/apache-cassandra-0.8.2] jas% 

Now, I have files!

[imac:datastore/data/Test] jas% pwd
/usr/local/ingenuity/isec/cassandra/datastore/data/Test
[imac:datastore/data/Test] jas% ls
TestCF-g-1-Data.db		TestCF-g-1-Index.db
TestCF-g-1-Filter.db		TestCF-g-1-Statistics.db
[imac:datastore/data/Test] jas% 

Given that, I'm able to run sstable2json and I can see I'm getting what's in that CF:

[imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/sstable2json /usr/local/ingenuity/isec/cassandra/datastore/data/Test/TestCF-g-1-Data.db > testcf.jason
[imac:isec/cassandra/apache-cassandra-0.8.2] jas% cat testcf.jason 
{
"45477c33303330": [["nodeId","ING:002",1311954072252000]],
"45477c33303331": [["nodeId","ING:003",1311954073631000]],
"5349447c313233": [["nodeId","ING:001",1311954072249000]]
}
[imac:isec/cassandra/apache-cassandra-0.8.2] jas% 

Oops, okay, that file extension should be json not jason, but oh well... :)
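
The row keys in that dump are hex-encoded bytes, not the text I typed in the CLI. As a sanity check, a short Python sketch like the following turns them back into readable keys. The function name decode_row_keys is hypothetical, and it assumes the keys are UTF-8 text, which matches key_validation_class=UTF8Type here.

```python
import json


def decode_row_keys(dump_text):
    """Map each hex-encoded row key in an sstable2json dump to UTF-8 text."""
    rows = json.loads(dump_text)
    return {bytes.fromhex(hex_key).decode("utf-8"): columns
            for hex_key, columns in rows.items()}


# The dump exactly as printed by sstable2json above.
dump = """{
"45477c33303330": [["nodeId","ING:002",1311954072252000]],
"45477c33303331": [["nodeId","ING:003",1311954073631000]],
"5349447c313233": [["nodeId","ING:001",1311954072249000]]
}"""

for key, columns in decode_row_keys(dump).items():
    print(key, columns)
```

The three keys decode to EG|3030, EG|3031 and SID|123, matching the rows listed in the CLI.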

Okay, so now I have data in the proper format for importing with json2sstable.  Like I said,
I want to import this data into a new CF. Let's call it TestCF2 (in the same keyspace):

[default@Test] create column family TestCF2 with comparator = UTF8Type and column_metadata = [{column_name: nodeId, validation_class: UTF8Type}];
4dcc44b0-b9fa-11e0-0000-242d50cf1fb5
Waiting for schema agreement...
... schemas agree across the cluster
[default@Test] update column family TestCF2 with key_validation_class=UTF8Type; 
5092dec0-b9fa-11e0-0000-242d50cf1fb5
Waiting for schema agreement...
... schemas agree across the cluster
[default@Test] 

Again there are no files created in the data directory, so I do a flush for the new CF:

[imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/nodetool -h localhost flush Test TestCF2
[imac:isec/cassandra/apache-cassandra-0.8.2] jas% 

Well, that did not help, still no files for TestCF2.  There is no actual data yet, so I'm
guessing the system tables have what they need. So, I go ahead and import the data using json2sstable:

[imac:isec/cassandra/apache-cassandra-0.8.2] jas% bin/json2sstable -K Test -c TestCF2 testcf.jason /usr/local/ingenuity/isec/cassandra/datastore/data/Test/TestCF2-g-1-Data.db
Importing 3 keys...
3 keys imported successfully.
[imac:isec/cassandra/apache-cassandra-0.8.2] jas% 

Okay, and the files did show up:

[imac:datastore/data/Test] jas% pwd
/usr/local/ingenuity/isec/cassandra/datastore/data/Test
[imac:datastore/data/Test] jas% ls
TestCF-g-1-Data.db		TestCF2-g-1-Data.db
TestCF-g-1-Filter.db		TestCF2-g-1-Filter.db
TestCF-g-1-Index.db		TestCF2-g-1-Index.db
TestCF-g-1-Statistics.db	TestCF2-g-1-Statistics.db
[imac:datastore/data/Test] jas% 

Back in the CLI:

[default@Test] list TestCF2;
Using default limit of 100

0 Row Returned.
[default@Test] 

However, if I open TestCF-g-1-Data.db in an editor, I can sort of see the data is present.  Quitting and
restarting the CLI has no effect. What gets the CF data into the memtables so it's accessible
to a Cassandra client?  I tried various nodetool commands (repair, compact, cleanup, flush,
invalidatekeycache, invalidaterowcache) and I still don't see any rows for TestCF2 in the CLI.

Anyway, it seems this procedure works as I'd expect, well, except for not seeing the new data. :)

What am I missing here?

Thanks,

Jeff
--
Jeff Schmidt
535 Consulting
jas@535consulting.com
http://www.535consulting.com
(650) 423-1068









