cayenne-user mailing list archives

From Andrus Adamchik <and...@objectstyle.org>
Subject Re: Partitioning a query result..
Date Fri, 16 Dec 2016 18:35:17 GMT
Actually this is an interesting architectural discussion. Speaking for myself, I certainly
like having it here.

The 2 main approaches have already been mentioned:

1. Single dispatcher -> message queue -> multiple workers.
2. Multiple workers that somehow guess their part of the workload.

Both can be made to work. Generally I like #1, as it is not brittle at all; this is essentially
how work is parallelized in the cloud. The dispatcher instance would poll the DB and post a
stream of IDs to the queue. Workers would grab the IDs from the queue and do their processing.
Worker instances can come and go. A good choice for the message queue is Apache Kafka, which
supports automatically spreading messages across multiple consumers (and yes, there's Bootique
support for it). If you can make your dispatcher fast (and I don't see why you can't... fetching
10K IDs can be done in milliseconds with proper DB indexes), you can keep adding as many workers
as needed.
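To make this concrete, here is a rough sketch of the dispatcher side (not production code: the
topic name, table, and JDBC URL are placeholders, and the ID fetch uses plain JDBC for brevity):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class Dispatcher {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
                Connection con = DriverManager.getConnection("jdbc:..."); // your DB URL
                Statement st = con.createStatement();
                // fetch PKs only; with an index this is cheap even for 10K rows
                ResultSet rs = st.executeQuery(
                        "SELECT ID FROM MY_TABLE WHERE PROCESSED = 0")) {

            while (rs.next()) {
                // no message key: Kafka spreads records round-robin across the
                // topic partitions, and each partition is owned by exactly one
                // worker in the consumer group
                producer.send(new ProducerRecord<>(
                        "row-ids", Long.toString(rs.getLong(1))));
            }
        }
    }
}

The workers are then ordinary Kafka consumers subscribed to "row-ids" in a single consumer
group. When a worker instance joins or dies, Kafka rebalances the partitions among the remaining
members, so the ID stream redistributes itself without any coordination code on your side.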

So, to confirm your scenario:

* On each job run do you need to reprocess previously seen records, or do you only care about
new records since the last run?
* On a single instance, do you have an idea of "the main query time" vs "processing time +
any extra queries and commits"?

Andrus


> On Dec 16, 2016, at 8:50 PM, Giaccone, Tony <anthony.giaccone@nytimes.com> wrote:
> 
> Right so I agree with the partitioning of the database, that's a thing that
> can be done.
> 
> Andrus, I'm a bit less confident in the proposal you're suggesting. I want
> to be able to spin up new instances, potentially in new containers, and run
> them in different environments. If we're moving to a cloud-based
> infrastructure, then parallelizing within a single app doesn't match up with
> that kind of deployment. I recognize there are limits on my solution as
> well: you have to deal with how you split up the rows into partitions.
> 
> The problem, generally stated, is: if I have 10,000 records and I want to
> distribute them across N workers, how do I do that? How can I partition the
> result set at run time into an arbitrary number of workers?
> 
> I also realize this is quickly expanding outside the scope of the cayenne
> users mailing list.
> 
> On Thu, Dec 15, 2016 at 3:18 AM, Andrus Adamchik <andrus@objectstyle.org>
> wrote:
> 
>> Here is another idea:
>> 
>> * read all data in one thread using iterated query and DataRows
>> * append received rows to an in-memory queue (individually or in small
>> batches)
>> * run a thread pool of processors that read from the queue and do the work.
>> 
>> As with all things performance, this needs to be measured and compared
>> against a single-threaded baseline. This will not help with an IO
>> bottleneck, but the processing part will happen in parallel. If you see any
>> Cayenne bottlenecks during the last step, you can start multiple
>> ServerRuntimes - one per thread.
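>> 
>> A minimal sketch of that pipeline, assuming Cayenne 4.x (MyEntity and
>> process() are placeholders, and the poison-pill row is just one way to
>> shut the pool down):
>> 
>> // needs java.util.concurrent.* plus org.apache.cayenne.DataRow,
>> // org.apache.cayenne.query.ObjectSelect
>> BlockingQueue<DataRow> queue = new ArrayBlockingQueue<>(1024);
>> DataRow poison = new DataRow(1); // sentinel marking end-of-input
>> 
>> ExecutorService pool = Executors.newFixedThreadPool(4);
>> for (int i = 0; i < 4; i++) {
>>     pool.submit(() -> {
>>         while (true) {
>>             DataRow row = queue.take();
>>             if (row == poison) {
>>                 queue.put(poison); // re-post so the other workers see it too
>>                 return null;
>>             }
>>             process(row); // per-row work, possibly with its own ObjectContext
>>         }
>>     });
>> }
>> 
>> // single reader: streams DataRows without materializing the whole result
>> context.iterate(ObjectSelect.dataRowQuery(MyEntity.class), (DataRow row) -> {
>>     try {
>>         queue.put(row);
>>     } catch (InterruptedException e) {
>>         throw new RuntimeException(e);
>>     }
>> });
>> queue.put(poison); // calling method declares throws InterruptedException
>> pool.shutdown();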
>> 
>> Andrus
>> 
>>> On Dec 15, 2016, at 3:06 AM, John Huss <johnthuss@gmail.com> wrote:
>>> 
>>> Unless your DB disk is striped into at least four parts this won't be
>>> faster.
>>> On Wed, Dec 14, 2016 at 5:46 PM Tony Giaccone <tgiaccone@gmail.com>
>>> wrote:
>>> 
>>>> I want to speed things up by running multiple instances of a job that
>>>> fetches data from a table, so that, for example, if I need to process
>>>> 10,000 rows, the query runs on each instance and returns 4 sets of 2,500
>>>> rows, one for each instance, with no duplication.
>>>> 
>>>> My first thought in SQL was to add something like this to the where
>>>> clause..
>>>> 
>>>> and MOD(ID, INSTANCE_COUNT) = INSTANCE_ID;
>>>> 
>>>> so that if the instance count was 4 then the instance IDs would run
>>>> 0,1,2,3.
>>>> 
>>>> I'm not quite sure how you would structure that using the query API. Any
>>>> suggestions about that?
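>>>> 
>>>> Something like this sketch is the shape I have in mind, if raw SQL via
>>>> SQLSelect is the way to go (MyEntity stands in for the real class):
>>>> 
>>>> List<MyEntity> myPart = SQLSelect
>>>>         .query(MyEntity.class,
>>>>                 "SELECT * FROM MY_TABLE "
>>>>                         + "WHERE MOD(ID, #bind($count)) = #bind($instance)")
>>>>         .params("count", instanceCount)
>>>>         .params("instance", instanceId)
>>>>         .select(context);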
>>>> 
>>>> And there are some problems with this idea, as you have to be certain your
>>>> IDs increase in a manner that aligns with your math, so that the partitions
>>>> come out equal in size. For example, if your sequence increments by 20,
>>>> then MOD(ID, 4) is always 0 and every row lands on the same instance, so
>>>> you would have to futz around with your math to get the right partitioning.
>>>> That is the problem with this technique: it's brittle; it depends on
>>>> getting a bunch of things in "sync".
>>>> 
>>>> Does anyone have another idea of how to segment out rows that would yield
>>>> a solution that's not quite so brittle?
>>>> 
>>>> 
>>>> 
>>>> Tony Giaccone
>>>> 
>> 
>> 

