spark-dev mailing list archives

From Michael Heuer <>
Subject Re: Please keep s3://spark-related-packages/ alive
Date Tue, 27 Feb 2018 15:36:06 GMT
On Tue, Feb 27, 2018 at 8:17 AM, Sean Owen <> wrote:

> See
> d3kbcqa49mib13-cloudfront-net-td22427.html -- it was 'retired', yes.
> Agree with all that, though they're intended for occasional individual use
> and not a case where performance and uptime matter. For that, I think you'd
> want to just host your own copy of the bits you need.
> The notional problem was that the S3 bucket wasn't obviously
> controlled/blessed by the ASF and yet was a source of official bits. It was
> another set of third-party credentials to hand around to release managers,
> which was IIRC a little problematic.
> Homebrew does host distributions of ASF projects, like Spark, FWIW.

To clarify, the apache-spark.rb formula in Homebrew uses the Apache mirror
closer.lua script.


> On Mon, Feb 26, 2018 at 10:57 PM Nicholas Chammas <> wrote:
>> If you go to the Downloads page and download Spark 2.2.1, you’ll get a
>> link to an Apache mirror. It didn’t use to be this way. As recently as
>> Spark 2.2.0, downloads were served via CloudFront, which was backed by an
>> S3 bucket named spark-related-packages.
>> It seems that we’ve stopped using CloudFront, and the S3 bucket behind it
>> has stopped receiving updates (e.g. Spark 2.2.1 isn’t there). I’m guessing
>> this is part of an effort to use the Apache mirror network, like other
>> Apache projects do.
>> From a user perspective, the Apache mirror network is several steps down
>> from using a modern CDN. Let me summarize why:
>>    1. *Apache mirrors are often slow.* Apache does not impose any
>>    performance requirements on its mirrors.
>>    The difference between getting a good mirror and a bad one means
>>    downloading Spark in less than a minute vs. 20 minutes. The problem is so
>>    bad that I’ve thought about adding an Apache mirror blacklist
>>    to Flintrock to avoid getting one of these dud mirrors.
>>    2. *Apache mirrors are inconvenient to use.* When you download
>>    something from an Apache mirror, you get a link like this one
>>    <>.
>>    Instead of automatically redirecting you to your download, though, you need
>>    to process the results you get back
>>    to find your download target. And you need to handle the high download
>>    failure rate, since sometimes the mirror you get doesn’t have the file it
>>    claims to have.
>>    3. *Apache mirrors are incomplete.* Apache mirrors only keep around
>>    the latest releases, save for a few “archive” mirrors, which are often
>>    slow. So if you want to download anything but the latest version of Spark,
>>    you are out of luck.
>> Some of these problems can be mitigated by picking a specific mirror that
>> works well and hardcoding it in your scripts, but that defeats the purpose
>> of dynamically selecting a mirror and makes you a “bad” user of the mirror
>> network.
>> I raised some of these issues over on INFRA-10999. The ticket sat for
>> a year before I heard anything back, and the bottom line was that none of
>> the above problems have a solution on the horizon. It’s fine. I understand
>> that Apache is a volunteer organization and that the infrastructure team
>> has a lot to manage as it is. I still find it disappointing that an
>> organization of Apache’s stature doesn’t have a better solution for this in
>> collaboration with a third party. Python serves PyPI downloads using
>> Fastly and Homebrew serves packages using Bintray. They both work really,
>> really well. Why don’t we have something as good for Apache projects?
>> Anyway, that’s a separate discussion.
>> What I want to say is this:
>> Dear whoever owns the spark-related-packages S3 bucket,
>> Please keep the bucket up-to-date with the latest Spark releases,
>> alongside the past releases that are already on there. It’s a huge help to
>> the Flintrock project, and it’s
>> an equally big help to those of us writing infrastructure automation
>> scripts that deploy Spark in other contexts.
>> I understand that hosting this stuff is not free, and that I am not
>> paying anything for this service. If it needs to go, so be it. But I wanted
>> to take this opportunity to lay out the benefits I’ve enjoyed thanks to
>> having this bucket around, and to make sure that if it did die, it didn’t
>> die a quiet death.
>> Sincerely,
>> Nick

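For readers scripting around the mirror network, the steps Nick describes in point 2 (process the closer.lua response to find the download target, then cope with a mirror that does not have the file) might look roughly like the sketch below. It is only an illustration under stated assumptions: it assumes the mirror script can return JSON with `preferred` and `path_info` keys, and that archive.apache.org/dist is an acceptable fallback. The helper names (`mirror_url`, `candidate_urls`, `first_working`) are hypothetical and are not taken from Flintrock or Homebrew.

```python
from typing import Callable, Iterable, List, Mapping

# Fallback for releases a mirror does not carry (assumed acceptable here).
ARCHIVE_BASE = "https://archive.apache.org/dist/"


def mirror_url(closer_response: Mapping[str, str]) -> str:
    """Build a download URL from an already-parsed mirror response.

    Assumption: the response has a 'preferred' mirror base URL and a
    'path_info' relative path, e.g. 'spark/spark-2.2.1/<file>.tgz'.
    """
    base = closer_response["preferred"].rstrip("/")
    path = closer_response["path_info"].lstrip("/")
    return f"{base}/{path}"


def candidate_urls(closer_response: Mapping[str, str]) -> List[str]:
    """Return the suggested mirror first, then the archive as a fallback."""
    path = closer_response["path_info"].lstrip("/")
    return [mirror_url(closer_response), ARCHIVE_BASE + path]


def first_working(urls: Iterable[str], fetch: Callable[[str], bytes]) -> bytes:
    """Try each URL in order and return the first successful download.

    `fetch` is any callable that returns the file contents or raises
    OSError (urllib.error.URLError is a subclass of OSError).
    """
    errors = []
    for url in urls:
        try:
            return fetch(url)
        except OSError as exc:
            errors.append(f"{url}: {exc}")
    raise OSError("all download sources failed: " + "; ".join(errors))
```

In a real script, `fetch` could be something like `lambda u: urllib.request.urlopen(u, timeout=30).read()`. Passing it in as a parameter keeps the retry logic testable without touching the network.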