Sorry for the belated reply.
 
Your approach #2 should work.
 
One concern I can think of: if the machines are spread out geographically, there is a substantial difference in network speed between them, and you want to sync data, this may take a very long time.
Also add a robust check that the schema was actually applied.
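
A minimal sketch of such a check, assuming the client can read one schema version string per node (for example from the schema_version columns of system.local and system.peers). The fetch_versions callable is hypothetical and has to be supplied by the application; nothing here is a driver API.

```python
import time


def wait_for_schema_agreement(fetch_versions, timeout=30.0, interval=1.0,
                              clock=time.monotonic, sleep=time.sleep):
    """Poll until every node reports the same schema version.

    fetch_versions is a hypothetical callable supplied by the caller; it
    should return an iterable with one schema version string per node.
    Returns True on agreement, False if the timeout expires first.
    """
    deadline = clock() + timeout
    while True:
        # Collapse the per-node versions; one distinct value means agreement.
        versions = set(fetch_versions())
        if len(versions) == 1:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

An instance would call this after issuing its CREATE statements and only start real work once it returns True. The timeout and interval values are placeholders to tune for the actual network.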
 
Good Luck!
 
Sent: Friday, May 24, 2013 1:14 AM
Subject: Re: Creating namespace and column family from multiple nodes concurrently
 
I am sorry if I was not clear. I was using "nodes" to refer to machines (and vice versa).
 
Let me put it another way...
 
The application is composed of multiple instances of an executable, and it runs on multiple machines concurrently. All the instances are going to issue the same CQL commands and try to create exactly the same namespace and column families.
 
Thank you
Emalayan
 

From: Arthur Zubarev <Arthur.Zubarev@Aol.com>
To: Emalayan Vairavanathan <svemalayan@yahoo.com>; user@cassandra.apache.org
Sent: Thursday, 23 May 2013 1:15 PM
Subject: Re: Creating namespace and column family from multiple nodes concurrently
 
So where are the multiple nodes? I am just puzzled.
 
Sent: Thursday, May 23, 2013 3:43 PM
Subject: Re: Creating namespace and column family from multiple nodes concurrently
 
"Would each device/machine have its own keyspace?"
 
No. All the machines are going to run exactly the same CQL commands and create the same namespace and column families.
 
Thank you
Emalayan
 

From: Arthur Zubarev <Arthur.Zubarev@Aol.com>
To: Emalayan Vairavanathan <svemalayan@yahoo.com>; user@cassandra.apache.org
Sent: Thursday, 23 May 2013 12:20 PM
Subject: Re: Creating namespace and column family from multiple nodes concurrently
 
Would each device/machine have its own keyspace?
 
Basically, your client needs to make sure the schema was created successfully and perform any other verification, and that is going to be time-consuming.
 
Sent: Thursday, May 23, 2013 3:07 PM
Subject: Re: Creating namespace and column family from multiple nodes concurrently
 
Hi Arthur and Farraz,

Thank you for getting back to me.

I am trying to avoid synchronization among concurrent instances, which is why I prefer Option - 2. Further, in my application I have a reasonable window between the initialization phase and the runtime phase. So as long as Cassandra can safely handle concurrent creation, I should be fine.

Do you have any idea how Cassandra handles concurrent namespace and column family creation? (Here all the instances are going to create the same namespace and column families concurrently.)
        - Does Cassandra take a long time to agree on a final schema (for example, if it uses some sort of exponential back-off algorithm to handle schema conflicts)?
        - Or is it going to result in schema conflicts that need manual intervention?
        - Or will this result in race conditions?
        - Or some other issues, e.g. memory/CPU/network bottlenecks?

Thank you
Emalayan
 

From: Arthur Zubarev <arthur.zubarev@aol.com>
To: user@cassandra.apache.org; svemalayan@yahoo.com
Sent: Wednesday, 22 May 2013 8:07 PM
Subject: Re: Creating namespace and column family from multiple nodes concurrently
 
I am assuming here you want to sync all the 100s of nodes once the application is airborne. I suspect this would flood the network and even potentially affect the machine itself memory-wise. How are you going to maintain the nodes (compaction+repair)?
 
 
Regards,

Arthur

 
 
-----Original Message-----
From: Emalayan Vairavanathan <svemalayan@yahoo.com>
To: user <user@cassandra.apache.org>
Sent: Wed, May 22, 2013 8:31 pm
Subject: Creating namespace and column family from multiple nodes concurrently

Hi all,
 
I am implementing a distributed application which runs on 100s of machines concurrently. This application is going to use Cassandra as its underlying storage.
 
The application creates the schema (namespace and column families) during its initialization phase. It seems I have two options for creating the schema.

Option - 1: Using a single node for schema creation.
Option - 2: Having all the nodes (> 100) run the same schema creation logic (first, each node checks whether the schema is already available, and then tries to create it if it is not).
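
A minimal sketch of the check-then-create logic in Option - 2, as each instance might run it. It only covers the client side: it treats "someone else created it first" as a normal outcome, and says nothing about how Cassandra itself resolves concurrent schema changes. The schema_exists and create_schema callables and the AlreadyExists exception are hypothetical placeholders, not driver APIs.

```python
class AlreadyExists(Exception):
    """Hypothetical stand-in for the client's 'already exists' error."""


def ensure_schema(schema_exists, create_schema):
    """Create the schema unless it is already there.

    schema_exists and create_schema are hypothetical callables supplied
    by the application. Returns "present" if the check found the schema,
    "created" if this instance created it, and "lost_race" if another
    instance created it between the check and the create.
    """
    if schema_exists():
        return "present"
    try:
        create_schema()
    except AlreadyExists:
        # Another instance won the race; the schema is there either way.
        return "lost_race"
    return "created"
```

The gap between the check and the create is where concurrent instances can collide, which is why the sketch does not treat the "already exists" error as a failure.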
 
To keep the initialization phase simple, I prefer to go with Option - 2. However, I am not sure how Cassandra is going to behave if multiple nodes try to create the same schema (namespace and column families) concurrently. It would be nice if someone could tell me about the implications of Option - 2 with Cassandra version 1.2.2.

Please let me know if you have any questions.

Thank you
VE