@Jan

 Your use-case is different from what I thought. So basically you have only one data source (the feed) and many consumers (the workers).

 Only one worker is allowed to consume the feed at a time.

 This can be modeled very easily using a distributed lock with C.A.S:

CREATE TABLE feed_lock (
    lock text PRIMARY KEY,
    worker_token text
);

1) First, initialize the table with: INSERT INTO feed_lock (lock) VALUES ('lock'). The primary key value is hard-coded and always the same: 'lock'; its actual value does not really matter. At this point the worker_token column is null, since we did not insert any value into it.

2) If a worker w1 wants to get the lock: UPDATE feed_lock SET worker_token='token1' WHERE lock='lock' IF worker_token=null.
w1 acquires the lock only if it is not held by any other worker (IF worker_token=null).

3) Concurrently, if another worker w2 tries to acquire the lock, it will fail since worker_token is no longer null.

4) w1 will release the lock with: UPDATE feed_lock SET worker_token=null WHERE lock='lock' IF worker_token='token1'.
The important detail here is that only w1 knows the value of 'token1', so only w1 is able to release the lock.

This is a very simple design.
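To make steps 1-4 concrete, here is a minimal in-memory sketch in Python. The FeedLock class and its compare_and_set method are illustrative stand-ins for the feed_lock row and the conditional UPDATE ... IF statements above; a real worker would run those CQL statements through a driver instead.

```python
import threading
import uuid

class FeedLock:
    """In-memory stand-in for the feed_lock table: a single row whose
    worker_token column is updated only if the expected value matches,
    mimicking a C* lightweight transaction (UPDATE ... IF worker_token=...)."""

    def __init__(self):
        self._token = None               # worker_token column, initially null
        self._guard = threading.Lock()   # models C*'s serialization of the C.A.S

    def compare_and_set(self, expected, new):
        with self._guard:
            if self._token == expected:
                self._token = new
                return True              # [applied] = true
            return False                 # [applied] = false

def acquire(lock):
    token = str(uuid.uuid4())            # secret token only this worker knows
    return token if lock.compare_and_set(None, token) else None

def release(lock, token):
    # only the worker holding the secret token can release the lock
    return lock.compare_and_set(token, None)

lock = FeedLock()
t1 = acquire(lock)                       # w1 acquires: worker_token was null
t2 = acquire(lock)                       # w2 fails: worker_token is not null
assert t1 is not None and t2 is None
assert release(lock, "wrong-token") is False   # a stranger cannot release
assert release(lock, t1) is True               # w1 releases with its token
```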

 Now it can happen that the lock is never released, because the worker holding it crashed and the secret token is lost. To circumvent that, you can hold the lock using an update with a TTL. The lock will then be released automatically once the TTL expires, no matter what. If processing the feed takes longer than the TTL, the current worker can always extend the lease with another update with TTL, which resets the expiry.
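The TTL variant can be sketched the same way. Here timestamps stand in for Cassandra's server-side expiry (as if the row were written with USING TTL); the LeasedLock class, its ttl parameter and renew method are my own illustrative names, not Cassandra API:

```python
import time
import uuid

class LeasedLock:
    """Sketch of the TTL-based lease: the lock auto-expires after ttl seconds,
    so a crashed worker cannot hold it forever. The holder can renew the
    lease, which models re-running the UPDATE ... USING TTL."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._token = None
        self._expires_at = 0.0

    def _expired(self):
        return time.monotonic() >= self._expires_at

    def try_acquire(self):
        # succeeds if the lock row is absent or has expired (worker_token "null")
        if self._token is None or self._expired():
            token = str(uuid.uuid4())
            self._token = token
            self._expires_at = time.monotonic() + self.ttl
            return token
        return None

    def renew(self, token):
        # only the current holder, before expiry, can reset the TTL clock
        if self._token == token and not self._expired():
            self._expires_at = time.monotonic() + self.ttl
            return True
        return False

lock = LeasedLock(ttl=0.2)               # short TTL for the demo
t1 = lock.try_acquire()
assert t1 is not None
assert lock.try_acquire() is None        # held and not yet expired
assert lock.renew(t1) is True            # lease extended in time
time.sleep(0.25)                         # let the TTL lapse
assert lock.renew(t1) is False           # too late: lease already expired
t2 = lock.try_acquire()                  # another worker can now take over
assert t2 is not None and t2 != t1
```

In production the renewal interval should be well below the TTL, so a healthy worker never loses the lease mid-processing.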

 Hope that helps

 Regards

 Duy Hai DOAN



 


On Fri, Apr 4, 2014 at 6:48 PM, Jan Algermissen <jan.algermissen@nordsc.com> wrote:
Hi DuyHai,


On 04 Apr 2014, at 13:58, DuyHai Doan <doanduyhai@gmail.com> wrote:

> @Jan
>
>  This subject of distributed workers & queues has been discussed in the mailing list many times.

Sorry + thanks.

Unfortunately, I do not want to use C* as a queue, but to coordinate workers that page through an (XML) data feed of events every N seconds.

Let me try again:

- I have N instances of the same system, replicated to ensure work is
  being done despite failure of instances
- the instances are masterless and know nothing about each other. Giving
  them an integer ID isn’t really possible, and the number of instances isn’t
  really known
- there is a schedule controlling how often the feed is read, say once every minute
- the schedule might change by way of an administrator of the ‘feed polling’
- the worker instances check for work frequently, e.g. every 10 secs
- once a worker starts, it checks whether there is work to do (the schedule aspect) and if so,
  starts polling the feed until the last event has been reached.
- During that time, no other worker must poll the feed
- once the working worker is done, it saves the timestamp or ID of the last seen event and sets the next schedule

- the processing of the events might take much longer than the schedule intervals

I hope this explains better what I am up to. Maybe I can adapt your suggestion, I just do not see how.

Jan



> Basically one implementation can be:
>
> 1) p data providers, c data consumers
> 2) create partitions (physical rows) of arbitrary number of columns (let's say 10 000, not too big though). Partition key = bucket number (#b)
> 3) assign an integer id (pId) to each provider, same for each consumer (cId)
> 4) each provider can only write messages in bucket number such that #b mod p = pId mod p
> 5) once the provider reaches 10 000 messages per bucket, it switches to the next one with new #b = old #b + p
> 6) the consumers follow the same rule for bucket switching
>
> Example:
>
>  p = 5, c = 3
>
>  - p1 writes messages into buckets {1,6,11,16...} // 1, 1+5, 1+5+5, ....
>  - p2 writes messages into buckets {2,7,12,17...} // 2, 2+5, 2+5+5,...
>  - p3 writes messages into buckets {3,8,13,18...}
>  - p4 writes messages into buckets {4,9,14,19...}
>  - p5 writes messages into buckets {5,10,15,20...}
>
>  - c1 consumes messages from buckets {1,4,7,10...} // 1, 1+3, 1+3+3...
>  - c2 consumes messages from buckets {2,5,8,11...}
>  - c3 consumes messages from buckets {3,6,9,12...}
>
> Of course, consumers cannot re-put messages into the bucket, otherwise the counting (10 000 elements/bucket) is screwed
>
> Alternatively, you can insert messages with a TTL to automatically expire "consumed" buckets after a while, saving you the hassle of cleaning up old buckets to reclaim disk space.
>
>
>  There are other implementations based on distributed locks using C* C.A.S too, but the above algorithm does not require any lock.
>
> Regards
>
>  Duy Hai DOAN
>
>
>
>
>
> On Fri, Apr 4, 2014 at 12:47 PM, prem yadav <ipremyadav@gmail.com> wrote:
> Oh ok. I thought you did not have a cassandra cluster already. Sorry about that.
>
>
> On Fri, Apr 4, 2014 at 11:42 AM, Jan Algermissen <jan.algermissen@nordsc.com> wrote:
>
> On 04 Apr 2014, at 11:18, prem yadav <ipremyadav@gmail.com> wrote:
>
>> Though cassandra can work but to me it looks like you could use a persistent queue for example (rabbitMQ) to implement this. All your workers can subscribe to a queue.
>> In fact, why not just MySQL?
>
> Hey, I have got a C* cluster that can (potentially) do CAS.
>
> Why would I set up a MySQL cluster to solve that problem?
>
> And yeah, I could use a queue or redis or whatnot, but I want to avoid yet another moving part :-)
>
> Jan
>
>
>>
>>
>> On Thu, Apr 3, 2014 at 11:44 PM, Jan Algermissen <jan.algermissen@nordsc.com> wrote:
>> Hi,
>>
>> maybe someone knows a nice solution to the following problem:
>>
>> I have N worker processes that are intentionally masterless and do not know about each other - they are stateless and independent instances of a given service system.
>>
>> These workers need to poll an event feed, say about every 10 seconds and persist a state after processing the polled events so the next worker knows where to continue processing events.
>>
>> I would like to use C*’s CAS feature to coordinate the workers and protect the shared state (a row or cell in a C* keyspace, too).
>>
>> Has anybody done something similar and can suggest a ‘clever’ data model design and interaction?
>>
>>
>>
>> Jan
>>
>
>
>