cayenne-user mailing list archives

From "Giaccone, Tony" <anthony.giacc...@nytimes.com>
Subject Re: Partitioning a query result..
Date Fri, 16 Dec 2016 17:50:00 GMT
Right, so I agree with partitioning the database; that's a thing that can
be done.

Andrus, I'm a bit less confident in the proposal you're suggesting. I want
to be able to spin up new instances, potentially in new containers, and run
them in different environments. If we're moving to a cloud-based
infrastructure, then parallelizing in a single app doesn't match up with that
kind of deployment. I recognize there are limits on my solution as well:
you have to deal with how you split up the rows into partitions.

The problem, generally stated, is this: if I have 10,000 records and I want to
distribute them across N workers, how do I do that? How can I
partition the result set at run time into an arbitrary number of workers?
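To make that concrete, the MOD(ID, INSTANCE_COUNT) idea from my original
message (quoted below) looks roughly like this in plain Java. The class and
method names here are illustrative, not Cayenne API, and the demo also shows
the skew problem when IDs don't increment by 1:

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionDemo {

    /** Worker index (0..instanceCount-1) that owns a given row ID. */
    static int partitionOf(long id, int instanceCount) {
        return (int) (id % instanceCount);
    }

    /** Filters a list of row IDs down to the ones this instance should process. */
    static List<Long> rowsForInstance(List<Long> ids, int instanceCount, int instanceId) {
        List<Long> mine = new ArrayList<>();
        for (long id : ids) {
            if (partitionOf(id, instanceCount) == instanceId) {
                mine.add(id);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        // 10,000 IDs that increment by 20, split across 4 instances.
        // Because the step (20) shares a factor with the instance count (4),
        // every ID lands on instance 0 -- exactly the brittleness described below.
        List<Long> ids = new ArrayList<>();
        for (long i = 0; i < 10_000; i += 20) {
            ids.add(i);
        }
        System.out.println(rowsForInstance(ids, 4, 0).size()); // 500
        System.out.println(rowsForInstance(ids, 4, 1).size()); // 0
    }
}
```

With IDs incrementing by 1 the four partitions come out even; with a
sequence step of 20 they collapse onto one instance, which is the "sync"
problem in a nutshell.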

I also realize this is quickly expanding outside the scope of the Cayenne
users mailing list.

On Thu, Dec 15, 2016 at 3:18 AM, Andrus Adamchik <andrus@objectstyle.org>
wrote:

> Here is another idea:
>
> * read all data in one thread using iterated query and DataRows
> * append received rows to an in-memory queue (individually or in small
> batches)
> * run a thread pool of processors that read from the queue and do the work.
>
> As with all things performance, this needs to be measured and compared
> with a single-threaded base line. This will not help with IO bottleneck,
> but the processing part will happen in parallel. If you see any Cayenne
> bottlenecks during the last step, you can start multiple ServerRuntimes -
> one per thread.
>
> Andrus
>
> > On Dec 15, 2016, at 3:06 AM, John Huss <johnthuss@gmail.com> wrote:
> >
> > Unless your DB disk is striped into at least four parts this won't be
> > faster.
> > On Wed, Dec 14, 2016 at 5:46 PM Tony Giaccone <tgiaccone@gmail.com>
> wrote:
> >
> >> I want to speed things up by running multiple instances of a job that
> >> fetches data from a table, so that, for example, if I need to process
> >> 10,000 rows, the query runs on each instance and returns 4 sets of
> >> 2,500 rows, one for each instance, with no duplication.
> >>
> >> My first thought in SQL was to add something like this to the WHERE
> >> clause:
> >>
> >> and MOD(ID, INSTANCE_COUNT) = INSTANCE_ID;
> >>
> >> so that if the instance count was 4 then the instance IDs would run
> >> 0,1,2,3.
> >>
> >> I'm not quite sure how you would structure that using the query API. Any
> >> suggestions about that?
> >>
> >> And there are some problems with this idea: you have to be certain your
> >> IDs increase in a manner that aligns with your math, so that the
> >> partitions are equal in size. For example, if your sequence increments
> >> by 20, you would have to futz around with your math to get the right
> >> partitioning, and that is the problem with this technique: it's
> >> brittle; it depends on getting a bunch of things in sync.
> >>
> >> Does anyone have another idea of how to segment out rows that would
> yield a
> >> solution that's not quite so brittle?
> >>
> >>
> >>
> >> Tony Giaccone
> >>
>
>
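For reference, the queue idea Andrus describes above (one reader, an
in-memory queue, a pool of processing threads) could be sketched with plain
JDK classes roughly like this. The class name is illustrative, the maps
stand in for Cayenne DataRows, and the counter increment stands in for real
per-row work:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class QueuePipelineDemo {

    // Sentinel "poison pill" row that tells a worker to stop.
    // A fresh HashMap guarantees a unique identity for the == check below.
    static final Map<String, Object> POISON = new HashMap<>();

    /**
     * Feeds rows from a single producer (standing in for an iterated query
     * returning DataRows) into a bounded queue, fans them out to a pool of
     * worker threads, and returns the number of rows processed.
     */
    static int run(List<Map<String, Object>> rows, int workers) {
        BlockingQueue<Map<String, Object>> queue = new ArrayBlockingQueue<>(100);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger processed = new AtomicInteger();

        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        Map<String, Object> row = queue.take();
                        if (row == POISON) break;    // identity check on the sentinel
                        processed.incrementAndGet(); // real per-row work goes here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try {
            // Single reader thread appending rows to the queue.
            for (Map<String, Object> row : rows) {
                queue.put(row);
            }
            // One poison pill per worker so every thread shuts down.
            for (int i = 0; i < workers; i++) {
                queue.put(POISON);
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = new java.util.ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(Map.of("id", i));
        }
        System.out.println(run(rows, 4)); // 1000
    }
}
```

As Andrus notes, this parallelizes only the processing step, not the IO,
so it still needs to be measured against a single-threaded baseline.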
