aurora-dev mailing list archives

From Hussein Elgridly <huss...@broadinstitute.org>
Subject Re: Speeding up Aurora client job creation
Date Mon, 16 Mar 2015 22:58:03 GMT
I dug into TRequestsTransport and I get it now. Sending raw bytes across a
socket is not the same as doing an HTTP POST with said bytes stuffed in the
body!
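
For anyone else tracing this: the leading b'\x00\x00\x00\\' in the payload quoted below is a four-byte big-endian length prefix, and it decodes to exactly the length of the JSON that follows. A quick check in plain Python:

```python
import struct

# The captured prefix: four big-endian bytes; 0x5c == 92.
prefix = b"\x00\x00\x00\x5c"
(frame_len,) = struct.unpack(">I", prefix)

# The JSON body from the captured payload.
body = (b'{"metadata": {"name": "getJobSummary", "seqid": 0, '
        b'"ttype": 1, "version": 1}, "payload": {}}')

assert frame_len == len(body) == 92
```

So the serialized bytes themselves are fine; the scheduler just never answers them on a raw socket, because it expects them inside an HTTP request.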

I guess I too will be rolling my own HTTP transport...
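
For the record, a minimal sketch of such a transport using only the standard library. The /api path and Content-Type header are assumptions modeled on what TRequestsTransport does, and the class is not wired into thriftpy's factory machinery:

```python
import io
import http.client

class THttpTransport:
    """Buffer protocol writes, then send them as one HTTP POST on flush."""

    def __init__(self, host, port, path="/api"):
        self.host, self.port, self.path = host, port, path
        self._wbuf = io.BytesIO()  # outgoing request body
        self._rbuf = io.BytesIO()  # response body, read back by the protocol

    def is_open(self):
        return True

    def open(self):
        pass

    def close(self):
        pass

    def write(self, buf):
        # The protocol hands us serialized bytes; accumulate until flush.
        self._wbuf.write(buf)

    def read(self, sz):
        return self._rbuf.read(sz)

    def flush(self):
        # POST the buffered payload and expose the response for reading.
        body = self._wbuf.getvalue()
        self._wbuf = io.BytesIO()
        conn = http.client.HTTPConnection(self.host, self.port)
        conn.request("POST", self.path, body,
                     {"Content-Type": "application/x-thrift"})
        self._rbuf = io.BytesIO(conn.getresponse().read())
        conn.close()
```

The protocol object calls write() repeatedly while serializing, then flush() once per RPC, then read() to deserialize the response; buffering until flush is what turns the byte stream into a single well-formed POST body.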

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard


On 16 March 2015 at 18:44, Hussein Elgridly <hussein@broadinstitute.org>
wrote:

> So this has now bubbled back to the top of my TODO list and I'm actively
> working on it. I am entirely new to Thrift so please forgive the newbie
> questions...
>
> I would like to talk to the Aurora scheduler directly from my (Python)
> application using Thrift. Since I'm on Python 3.4 I've had to use thriftpy:
> https://github.com/eleme/thriftpy
>
> As far as I can tell, the following should work (by default, thriftpy uses
> a TBufferedTransport around a TSocket):
>
> ---
> import thriftpy
> import thriftpy.rpc
>
> aurora_api = thriftpy.load("api.thrift")
>
> client = thriftpy.rpc.make_client(
>     aurora_api.AuroraSchedulerManager,
>     host="localhost", port=8081,
>     proto_factory=thriftpy.protocol.TJSONProtocolFactory())
>
> print(client.getJobSummary())
> ---
>
> Obviously I wouldn't be writing this email if it did work :) It hangs.
>
> I jumped into pdb and found it was sending the following payload:
>
> b'\x00\x00\x00\\{"metadata": {"name": "getJobSummary", "seqid": 0,
> "ttype": 1, "version": 1}, "payload": {}}'
>
> to a socket that looked like this:
>
> <socket.socket fd=3, family=AddressFamily.AF_INET, type=2049, proto=0,
> laddr=('<localhost's_private_ip>', 49167),
> raddr=('<localhost's_private_ip>', 8081)>
>
> ...but was waiting forever to receive any data. Adding a timeout just
> triggered the timeout.
>
> I'm stumped. Any clues?
>
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard
>
>
> On 12 February 2015 at 04:15, Erb, Stephan <Stephan.Erb@blue-yonder.com>
> wrote:
>
>> Hi Hussein,
>>
>> We also had slight performance problems when talking to Aurora. We ended
>> up using the existing Python client directly in our code (see
>> apache.aurora.client.api.__init__.py). This allowed us to reuse the api
>> object and its scheduler connection, dropping a connection-setup latency
>> of about 0.3-0.4 seconds per request.
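
The reuse pattern Stephan describes generalizes: pay connection setup once and keep the client object alive across submissions. A stand-in sketch (the connect/submit internals here are placeholders, not the real apache.aurora.client.api interface):

```python
class JobSubmitter:
    """Create the scheduler connection lazily, then reuse it per request."""

    def __init__(self, connect):
        self._connect = connect
        self._conn = None  # cached after the first submission

    def submit(self, job):
        if self._conn is None:  # setup cost paid exactly once
            self._conn = self._connect()
        return (self._conn, job)

setups = []

def fake_connect():
    # Stand-in for the expensive ~0.3-0.4s scheduler handshake.
    setups.append(1)
    return "scheduler-connection"

submitter = JobSubmitter(fake_connect)
for job in ("job-a", "job-b", "job-c"):
    submitter.submit(job)

assert len(setups) == 1  # three submissions, one connection setup
```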
>>
>> Best Regards,
>> Stephan
>> ________________________________________
>> From: Bill Farner <wfarner@apache.org>
>> Sent: Wednesday, February 11, 2015 9:29 PM
>> To: dev@aurora.incubator.apache.org
>> Subject: Re: Speeding up Aurora client job creation
>>
>> To reduce that time you will indeed want to talk directly to the
>> scheduler.  This will definitely require you to roll up your sleeves a bit
>> and set up a thrift client to our api (based on api.thrift [1]), since you
>> will need to specify your tasks in a format that the thermos executor can
>> understand.  Turns out this is JSON data, so it should not be *too*
>> prohibitive.
>>
>> However, there is another technical limitation you will hit for the
>> submission rate you are after.  The scheduler is backed by a durable store
>> whose write latency is at minimum the amount of time required to fsync.
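
That fsync floor is easy to measure on any box; a rough, Aurora-agnostic probe:

```python
import os
import tempfile
import time

# Time a handful of small durable writes: each one waits on fsync,
# which is the floor described above for the scheduler's write latency.
fd, path = tempfile.mkstemp()
N = 5
start = time.perf_counter()
for _ in range(N):
    os.write(fd, b"x" * 128)
    os.fsync(fd)
per_sync = (time.perf_counter() - start) / N
os.close(fd)
os.remove(path)

print(f"~{per_sync * 1e3:.3f} ms per fsync")
```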
>>
>> [1]
>>
>> https://github.com/apache/incubator-aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift
>>
>> -=Bill
>>
>> On Wed, Feb 11, 2015 at 11:46 AM, Hussein Elgridly <
>> hussein@broadinstitute.org> wrote:
>>
>> > Hi folks,
>> >
>> > I'm looking at a use case that involves submitting potentially
>> hundreds of
>> > jobs a second to our Mesos cluster. My tests show that the aurora
>> client is
>> > taking 1-2 seconds for each job submission, and that I can run about
>> four
>> > client processes in parallel before they peg the CPU at 100%. I need
>> more
>> > throughput than this!
>> >
>> > Squashing jobs down to the Process or Task level doesn't really make
>> sense
>> > for our use case. I'm aware that with some shenanigans I can batch jobs
>> > together using job instances, but that's a lot of work on my current
>> > timeframe (and of questionable utility given that the jobs certainly
>> won't
>> > have identical resource requirements).
>> >
>> > What I really need is (at least) an order of magnitude speedup in terms
>> of
>> > being able to submit jobs to the Aurora scheduler (via the client or
>> > otherwise).
>> >
>> > Conceptually it doesn't seem like adding a job to a queue should be a
>> thing
>> > that takes a couple of seconds, so I'm baffled as to why it's taking so
>> > long. As an experiment, I wrapped the call to client.execute() in
>> > client.py:proxy_main in cProfile and called aurora job create with a
>> very
>> > simple test job.
>> >
>> > Results of the profile are in the Gist below:
>> >
>> > https://gist.github.com/helgridly/b37a0d27f04a37e72bb5
>> >
>> > Out of a 0.977s profile time, the two things that stick out to me are:
>> >
>> > 1. 0.526s spent in Pystachio for a job that doesn't use any templates
>> > 2. 0.564s spent in create_job, presumably talking to the scheduler (and
>> > setting up the machinery for doing so)
>> >
>> > I imagine I can sidestep #1 with a check for "{{" in the job file and
>> > bypass Pystachio entirely. Can I also skip the Aurora client entirely
>> and
>> > talk directly to the scheduler? If so what does that entail, and are
>> there
>> > any risks associated?
>> >
>> > Thanks,
>> > -Hussein
>> >
>> > Hussein Elgridly
>> > Senior Software Engineer, DSDE
>> > The Broad Institute of MIT and Harvard
>> >
>>
>
>
