aurora-dev mailing list archives

From Bill Farner <wfar...@apache.org>
Subject Re: Speeding up Aurora client job creation
Date Tue, 17 Mar 2015 02:58:00 GMT
Exploring the possibilities - can you use python 2.7?  If so, you could
leverage some of the private libraries within the client and lower the
surface area of what you need to build.  It won't be a stable programmatic
API, but you might get moving faster.  I assume this is what Stephan is
suggesting.

-=Bill

On Mon, Mar 16, 2015 at 7:52 PM, Hussein Elgridly <
hussein@broadinstitute.org> wrote:

> I'm not quite sure I understand your question, so I'll be painfully
> explicit instead.
>
> I don't want to use the existing Aurora client because it's slow (Pystachio
> + repeated HTTP connection overheads, as detailed earlier in this thread).
> Instead, I want to use the Thrift interface to talk to the Aurora scheduler
> directly (I can skip Pystachio entirely and keep the HTTP connection
> open).
>
> I cannot use the official Thrift bindings for Python as they do not yet
> support Python 3 [1]. There is a third-party, pure Python implementation of
> Thrift that does support Python 3 called thriftpy [2]. However, thriftpy
> does not include a THTTPClient transport, which is what the Aurora
> scheduler uses. I will therefore have to write my own THTTPClient transport
> (and probably contribute it back to thriftpy).
>
> [1] https://issues.apache.org/jira/browse/THRIFT-1857
> [2] https://github.com/eleme/thriftpy
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard
>
>
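The transport gap described above comes down to this: thriftpy's default TSocket writes the serialized call straight onto the TCP connection, whereas the scheduler's API is served over HTTP, so each call has to travel as the body of an HTTP request. That mismatch is a likely reason the getJobSummary() attempt quoted further down never got a reply. Below is a minimal sketch of the idea in Python 3, standard library only. The class name HttpPostTransport, the "/api" path and the 10-second timeout are assumptions for illustration; this is not thriftpy's or Apache Thrift's transport API, and wiring it into thriftpy's protocol classes is not shown.

```python
import http.client


class HttpPostTransport:
    """Buffer outgoing bytes and POST them on flush().

    Hypothetical helper for illustration only; not part of thriftpy
    or Apache Thrift.
    """

    def __init__(self, host, port, path="/api", timeout=10):
        # "/api" is an assumed endpoint path; verify it for your scheduler.
        self._conn = http.client.HTTPConnection(host, port, timeout=timeout)
        self._path = path
        self._wbuf = bytearray()
        self._rbuf = b""

    def write(self, data):
        # Collect the serialized call; nothing is sent until flush().
        self._wbuf += data

    def flush(self):
        body = bytes(self._wbuf)
        self._wbuf.clear()
        # One POST per call. The HTTPConnection object is kept, so the
        # underlying TCP connection can be reused between calls.
        self._conn.request(
            "POST",
            self._path,
            body=body,
            headers={"Content-Type": "application/x-thrift"},
        )
        response = self._conn.getresponse()
        payload = response.read()  # drain fully so the connection stays usable
        if response.status != 200:
            raise IOError("HTTP %d from server" % response.status)
        self._rbuf = payload

    def read(self, size):
        # Hand back up to `size` bytes of the last response body.
        chunk, self._rbuf = self._rbuf[:size], self._rbuf[size:]
        return chunk

    def close(self):
        self._conn.close()
```

Because the connection object outlives a single call, consecutive flush() calls can share one TCP connection as long as the server keeps it alive, which is the same saving reported elsewhere in this thread from reusing the client's api object.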
> On 16 March 2015 at 19:11, Erb, Stephan <Stephan.Erb@blue-yonder.com>
> wrote:
>
> > Just to make sure I get this correctly: You say, you cannot use the
> > existing python client because it is python 2.7 only so you want to
> > write a new one in python 3?
> >
> > Regards,
> > Stephan
> > ________________________________________
> > From: Hussein Elgridly <hussein@broadinstitute.org>
> > Sent: Monday, March 16, 2015 11:44 PM
> > To: dev@aurora.incubator.apache.org
> > Subject: Re: Speeding up Aurora client job creation
> >
> > So this has now bubbled back to the top of my TODO list and I'm actively
> > working on it. I am entirely new to Thrift so please forgive the newbie
> > questions...
> >
> > I would like to talk to the Aurora scheduler directly from my (Python)
> > application using Thrift. Since I'm on Python 3.4 I've had to use
> > thriftpy: https://github.com/eleme/thriftpy
> >
> > As far as I can tell, the following should work (by default, thriftpy
> > uses a TBufferedTransport around a TSocket):
> >
> > ---
> > import thriftpy
> > import thriftpy.rpc
> >
> > aurora_api = thriftpy.load("api.thrift")
> >
> > client = thriftpy.rpc.make_client(aurora_api.AuroraSchedulerManager,
> > host="localhost", port=8081,
> > proto_factory=thriftpy.protocol.TJSONProtocolFactory() )
> >
> > print(client.getJobSummary())
> > ---
> >
> > Obviously I wouldn't be writing this email if it did work :) It hangs.
> >
> > I jumped into pdb and found it was sending the following payload:
> >
> > b'\x00\x00\x00\\{"metadata": {"name": "getJobSummary", "seqid": 0, "ttype": 1, "version": 1}, "payload": {}}'
> >
> > to a socket that looked like this:
> >
> > <socket.socket fd=3, family=AddressFamily.AF_INET, type=2049, proto=0,
> > laddr=('<localhost's_private_ip>', 49167),
> > raddr=('<localhost's_private_ip>', 8081)>
> >
> > ...but was waiting forever to receive any data. Adding a timeout just
> > triggered the timeout.
> >
> > I'm stumped. Any clues?
> >
> >
> > Hussein Elgridly
> > Senior Software Engineer, DSDE
> > The Broad Institute of MIT and Harvard
> >
> >
> > On 12 February 2015 at 04:15, Erb, Stephan <Stephan.Erb@blue-yonder.com>
> > wrote:
> >
> > > Hi Hussein,
> > >
> > > we also had slight performance problems when talking to Aurora. We
> > > ended up using the existing python client directly in our code (see
> > > apache.aurora.client.api.__init__.py). This allowed us to reuse the api
> > > object and its scheduler connection, dropping a connection latency of
> > > about 0.3-0.4 seconds per request.
> > >
> > > Best Regards,
> > > Stephan
> > > ________________________________________
> > > From: Bill Farner <wfarner@apache.org>
> > > Sent: Wednesday, February 11, 2015 9:29 PM
> > > To: dev@aurora.incubator.apache.org
> > > Subject: Re: Speeding up Aurora client job creation
> > >
> > > To reduce that time you will indeed want to talk directly to the
> > > scheduler.  This will definitely require you to roll up your sleeves a
> > > bit and set up a thrift client to our api (based on api.thrift [1]),
> > > since you will need to specify your tasks in a format that the thermos
> > > executor can understand.  Turns out this is JSON data, so it should not
> > > be *too* prohibitive.
> > >
> > > However, there is another technical limitation you will hit for the
> > > submission rate you are after.  The scheduler is backed by a durable
> > > store whose write latency is at minimum the amount of time required to
> > > fsync.
> > >
> > > [1] https://github.com/apache/incubator-aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift
> > >
> > > -=Bill
> > >
> > > On Wed, Feb 11, 2015 at 11:46 AM, Hussein Elgridly <
> > > hussein@broadinstitute.org> wrote:
> > >
> > > > Hi folks,
> > > >
> > > > I'm looking at a use case that involves submitting potentially
> > > > hundreds of jobs a second to our Mesos cluster. My tests show that the
> > > > aurora client is taking 1-2 seconds for each job submission, and that
> > > > I can run about four client processes in parallel before they peg the
> > > > CPU at 100%. I need more throughput than this!
> > > >
> > > > Squashing jobs down to the Process or Task level doesn't really make
> > > > sense for our use case. I'm aware that with some shenanigans I can
> > > > batch jobs together using job instances, but that's a lot of work on
> > > > my current timeframe (and of questionable utility given that the jobs
> > > > certainly won't have identical resource requirements).
> > > >
> > > > What I really need is (at least) an order of magnitude speedup in
> > > > terms of being able to submit jobs to the Aurora scheduler (via the
> > > > client or otherwise).
> > > >
> > > > Conceptually it doesn't seem like adding a job to a queue should be a
> > > > thing that takes a couple of seconds, so I'm baffled as to why it's
> > > > taking so long. As an experiment, I wrapped the call to
> > > > client.execute() in client.py:proxy_main in cProfile and called
> > > > aurora job create with a very simple test job.
> > > >
> > > > Results of the profile are in the Gist below:
> > > >
> > > > https://gist.github.com/helgridly/b37a0d27f04a37e72bb5
> > > >
> > > > Out of a 0.977s profile time, the two things that stick out to me are:
> > > >
> > > > 1. 0.526s spent in Pystachio for a job that doesn't use any templates
> > > > 2. 0.564s spent in create_job, presumably talking to the scheduler
> > > >    (and setting up the machinery for doing so)
> > > >
> > > > I imagine I can sidestep #1 with a check for "{{" in the job file and
> > > > bypass Pystachio entirely. Can I also skip the Aurora client entirely
> > > > and talk directly to the scheduler? If so what does that entail, and
> > > > are there any risks associated?
> > > >
> > > > Thanks,
> > > > -Hussein
> > > >
> > > > Hussein Elgridly
> > > > Senior Software Engineer, DSDE
> > > > The Broad Institute of MIT and Harvard
> > > >
> > >
> >
>
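
The check for "{{" proposed in the original message can be sketched as below. needs_templating is a hypothetical helper, not part of the Aurora client, and the two sample strings are made up for illustration. The sketch only detects template markers; whether the client can safely skip Pystachio when none are found is not settled in this thread.

```python
def needs_templating(job_text):
    """Return True if the job definition text contains a '{{' marker."""
    return "{{" in job_text


# Hypothetical job-file fragments, for illustration only.
plain = "cmdline = 'echo hello'"
templated = "cmdline = 'echo {{profile.name}}'"

print(needs_templating(plain))      # False
print(needs_templating(templated))  # True
```

A substring test like this is cheap next to the 0.526s attributed to Pystachio in the profile, but it would also match a literal "{{" that is not meant as a template reference.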
