giraph-dev mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: [VOTE][CHANGED] Release Giraph 1.0 (rc1)
Date Sun, 14 Apr 2013 21:26:43 GMT
Thank you, Avery, wish I had found the bug earlier.
On 14.04.2013 23:25, "Avery Ching" <aching@apache.org> wrote:

> Thanks for your input Sebastian.  Given the choice between removing
> PageRankVertex and adding the fix, I've added your fix and will cut RC2 a
> bit later today.  I really hope this is the last RC.
>
> Avery
>
> On 4/14/13 9:34 AM, Sebastian Schelter wrote:
>
>> Hi Avery,
>>
>> I see your concerns. The benchmarking question is difficult; we had very
>> bad experiences with Mahout in that regard. E.g., we once had a
>> M/R-based PageRank implementation in Mahout that used our integer-based
>> vectors and removed it as we got public complaints that you can't fit
>> the whole web into the range of an integer. Personally, I'd also refrain
>> from using floats instead of doubles for benchmarks, as this simply
>> means you give up on accuracy.
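The accuracy cost of floats can be illustrated with a small Python sketch (not Giraph or Mahout code). It emulates a 32-bit accumulator by rounding through `struct`; the helper names `to_float32` and `accumulate` and the chosen addend and count are assumptions made for this illustration only.

```python
import struct

def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(value: float, count: int, single_precision: bool) -> float:
    """Add `value` to a running total `count` times.

    With single_precision=True, the addend and every intermediate total are
    rounded to 32 bits, emulating a float accumulator.
    """
    if single_precision:
        value = to_float32(value)
    total = 0.0
    for _ in range(count):
        total += value
        if single_precision:
            total = to_float32(total)
    return total

# Summing many small contributions, as a PageRank update does per vertex.
double_sum = accumulate(1e-6, 1_000_000, single_precision=False)
single_sum = accumulate(1e-6, 1_000_000, single_precision=True)
print("double error:", abs(double_sum - 1.0))
print("float error: ", abs(single_sum - 1.0))
```

The exact sum is 1.0, so the printed values show how far each accumulator drifts from it.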
>>
>> Regarding benchmarks, I guess the best thing we could do is publish our
>> own numbers. The current runtimes I've seen are already very good:
>> Giraph beat a very optimized Stratosphere implementation that we did for
>> a recent paper by approx. 25%.
>>
>> To conclude, I in no way want to hold up the current release. I'm
>> perfectly fine with not including the patch and optimizing the
>> implementation for a 1.0.1 release, but then we should remove the
>> current examples.PageRankVertex from the 1.0 release, as the convergence
>> detection is broken and we should not knowingly ship bugged code.
>>
>> Best,
>> Sebastian
>>
>>
>> On 14.04.2013 18:18, Avery Ching wrote:
>>
>>> Hi Sebastian,
>>>
>>> Thanks for the patch.  I'll try to take a look at it.
>>>
>>> The only reason I bring the optimizations up is that a lot of folks tend
>>> to compare PageRank performance.  The optimizations I'm referring to are
>>> Giraph ones, not algorithmic ones.  We use ints and floats for ids and
>>> messages, respectively, instead of longs and doubles (1/2 the network
>>> traffic), and IntNullArrayEdges vertex edges (efficient array-backed
>>> edges) instead of ByteArrayEdges.  You can see
>>> https://issues.apache.org/jira/browse/giraph-543 for more details.
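As a rough illustration of the "1/2 network traffic" figure, the Python sketch below compares the packed size of one (id, message) pair for the two type choices. The function name `pair_size` and the fixed-width big-endian layout are assumptions for illustration; this is not Giraph's actual wire format, which carries overhead not modeled here.

```python
import struct

def pair_size(id_format: str, value_format: str) -> int:
    """Return the packed size in bytes of one (vertex id, message) pair.

    Uses big-endian, fixed-width encoding with no padding. This is an
    illustrative layout, not Giraph's actual serialization.
    """
    return struct.calcsize(">" + id_format + value_format)

# 'i' = 4-byte int, 'f' = 4-byte float, 'q' = 8-byte long, 'd' = 8-byte double
small = pair_size("i", "f")
large = pair_size("q", "d")
print(small, large, small / large)
```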
>>>
>>> Anyway, given that we are going to ship a 1.0.1 release in a few weeks
>>> for a variety of reasons, should this really hold up the current
>>> release?  I would prefer to not cut any more RCs unless things are
>>> totally broken (i.e. profiles not compiling, major Giraph bugs, etc.).
>>> There are still a lot of outstanding issues in JIRA; we can't fix them
>>> all for the 1.0 release.
>>>
>>> Let me know what you think.
>>>
>>> Avery
>>>
>>> On 4/13/13 10:46 AM, Sebastian Schelter wrote:
>>>
>>>> Hi Avery,
>>>>
>>>> I found the bug and I can provide a patch today or tomorrow, so
>>>> hopefully we can include that in the release (to not knowingly ship
>>>> bugged code). Furthermore I improved the code to protect against
>>>> rounding errors.
>>>>
>>>> I don't really get what you mean by the missing optimization in
>>>> comparison to the benchmark PageRank implementation.
>>>>
>>>> The implementation in o.a.g.examples.PageRankVertex aims to be a robust
>>>> real-world implementation. As an optimization, it dismisses edge weights
>>>> and reuses objects where possible. Furthermore it is able to handle
>>>> dangling vertices that are present in almost every real-world network
>>>> and it automatically detects the number of supersteps to run. With the
>>>> patch, it should also provide improved numerical stability.
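The features listed above (dangling-vertex handling, automatic detection of the number of supersteps, protection against rounding errors) can be sketched in plain Python. This is not the code in o.a.g.examples.PageRankVertex and not the patch discussed here; the function name, the L1 convergence test, the damping factor, and the tolerance are all assumptions chosen for illustration.

```python
def pagerank(out_edges, damping=0.85, tolerance=1e-6, max_supersteps=100):
    """Power-iteration PageRank over a dict {vertex: [targets]}.

    Every vertex must appear as a key; dangling vertices map to an empty
    list. Returns (ranks, supersteps_run).
    """
    n = len(out_edges)
    ranks = {v: 1.0 / n for v in out_edges}
    for superstep in range(1, max_supersteps + 1):
        # Rank held by vertices without out-edges would otherwise be lost,
        # so it is collected and spread evenly over all vertices.
        dangling = sum(ranks[v] for v, targets in out_edges.items() if not targets)
        incoming = {v: 0.0 for v in out_edges}
        for v, targets in out_edges.items():
            for t in targets:
                incoming[t] += ranks[v] / len(targets)
        new_ranks = {
            v: (1.0 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in out_edges
        }
        # Renormalize so rounding errors do not accumulate across supersteps.
        total = sum(new_ranks.values())
        new_ranks = {v: r / total for v, r in new_ranks.items()}
        # Convergence detection: L1 change between consecutive rank vectors.
        delta = sum(abs(new_ranks[v] - ranks[v]) for v in out_edges)
        ranks = new_ranks
        if delta < tolerance:
            break
    return ranks, superstep

# Vertex "e" is dangling (no out-edges).
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"], "e": []}
ranks, supersteps = pagerank(graph)
print(supersteps, ranks)
```

The loop stops as soon as the rank vector changes by less than the tolerance, so the caller does not have to pick a superstep count up front.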
>>>>
>>>> If the runtimes don't look good enough when compared to the benchmark
>>>> implementation, this might also be caused by the dataset, which has a
>>>> skewed degree distribution (like most real-world networks). The
>>>> benchmark uses a uniform degree distribution AFAIK.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>> On 13.04.2013 15:46, Avery Ching wrote:
>>>>
>>>>> That's great Sebastian.  I would also recommend taking a look at the
>>>>> PageRankBenchmark for a performance comparison.  It has had a lot of
>>>>> speed improvements and should be a bunch faster than PageRankVertex.
>>>>> Even that, though, is not totally optimized.  Hopefully we'll be adding
>>>>> a "how to optimize performance" guide in the near future.  Should we
>>>>> delay the release or simply ship a 1.1, say in the next month, with this
>>>>> fix and supporting YARN's 2.0.4?  I'd like to get on a more normal
>>>>> release cycle rather than once a year =).
>>>>>
>>>>> Avery
>>>>>
>>>>> On 4/13/13 3:02 AM, Sebastian Schelter wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I've got some good and bad news. I tested PageRankVertex (not the
>>>>>> Benchmark but the example implementation o.a.g.examples.PageRankVertex)
>>>>>> from trunk compiled for Hadoop 1.0 on a cluster of 26 machines with 208
>>>>>> cores.
>>>>>>
>>>>>> I used the Webbase2001 dataset [1], which has 115M vertices and more
>>>>>> than 1B edges, and got some awesome running times: the average superstep
>>>>>> takes 15 seconds (!!!). Awesome work, I have to say!
>>>>>>
>>>>>> Unfortunately, there seems to be an issue with the convergence
>>>>>> detection, as it didn't get the correct convergence behavior. I'd like
>>>>>> to have a look into that this week, so we can ship a performant
>>>>>> PageRank implementation which automatically runs an appropriate number
>>>>>> of supersteps. Hope this doesn't delay the release too much.
>>>>>>
>>>>>> Best,
>>>>>> Sebastian
>>>>>>
>>>>>>
>>>>>> [1] http://law.di.unimi.it/webdata/webbase-2001/
>>>>>>
>>>>>>
>>>>>> On 13.04.2013 07:39, Avery Ching wrote:
>>>>>>
>>>>>>> Thanks to the quick feedback from Roman and Lewis, we have cut a
>>>>>>> new RC1 that addresses the following issues.
>>>>>>>
>>>>>>> * Got rid of .git repo in tarball
>>>>>>> * Fixed issue with not compiling without git repo (GIRAPH-628)
>>>>>>> * Used gnutar in OSX rather than tar to generate the tarball and
>>>>>>>   get rid of warnings
>>>>>>> * Pushed GIRAPH-627 to support the yarn profile better
>>>>>>> * Tarball name changed to the final artifact name (giraph-1.0.tar.gz)
>>>>>>>
>>>>>>> Release notes:
>>>>>>> http://people.apache.org/~aching/giraph-1.0-RC1/RELEASE_NOTES.html
>>>>>>>
>>>>>>> Release artifacts:
>>>>>>> http://people.apache.org/~aching/giraph-1.0-RC1/
>>>>>>>
>>>>>>> Corresponding git tag:
>>>>>>> https://git-wip-us.apache.org/repos/asf?p=giraph.git;a=shortlog;h=refs/tags/release-1.0-RC1
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Signing keys:
>>>>>>> http://people.apache.org/keys/group/giraph.asc
>>>>>>>
>>>>>>> The vote runs for 72 hours, until Monday 11pm PST.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Avery
>>>>>>>
>>>>>>> Original message below regarding rc0:
>>>>>>>
>>>>>>> -------------------------------
>>>>>>>
>>>>>>> Fellow Giraphers,
>>>>>>>
>>>>>>> We have our first release candidate since graduating from
>>>>>>> incubation.  This is a source release, primarily due to the different
>>>>>>> versions of Hadoop we support with munge (similar to the 0.1 release).
>>>>>>> Since 0.1, we've made A TON of progress on overall performance,
>>>>>>> optimizing memory use, split vertex/edge inputs, easy interoperability
>>>>>>> with Apache Hive, and a bunch of other areas.  In many ways, this is an
>>>>>>> almost totally different codebase.  Thanks everyone for your hard work!
>>>>>>>
>>>>>>> Apache Giraph has been running in production at Facebook (against
>>>>>>> Facebook's Corona implementation of Hadoop -
>>>>>>> https://github.com/facebook/hadoop-20/tree/master/src/contrib/corona)
>>>>>>> since around last December.  It has proven to be very scalable and
>>>>>>> performant, and it enables a bunch of new applications.  Based on the
>>>>>>> drastic improvements and the use of Giraph in production, it seems
>>>>>>> appropriate to bump up our version to 1.0.
>>>>>>>
>>>>>>> While anyone can vote, the ASF requires majority approval from the PMC
>>>>>>> -- i.e., at least three PMC members must vote affirmatively for
>>>>>>> release, and there must be more positive than negative votes. Releases
>>>>>>> may not be vetoed. Before voting +1, PMC members are required to
>>>>>>> download the signed source code package, compile it as provided, and
>>>>>>> test the resulting executable on their own platform, along with also
>>>>>>> verifying that the package meets the requirements of the ASF policy on
>>>>>>> releases.
>>>>>>>
>>>>>>> Please test this against many other Hadoop versions and let us know
>>>>>>> how this goes!
>>>>>>>
>>>>>>> Release notes:
>>>>>>> http://people.apache.org/~aching/giraph-1.0-RC0/RELEASE_NOTES.html
>>>>>>>
>>>>>>> Release artifacts:
>>>>>>> http://people.apache.org/~aching/giraph-1.0-RC0/
>>>>>>>
>>>>>>> Corresponding git tag:
>>>>>>> https://git-wip-us.apache.org/repos/asf?p=giraph.git;a=shortlog;h=refs/tags/release-1.0-RC0
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Signing keys:
>>>>>>> http://people.apache.org/keys/group/giraph.asc
>>>>>>>
>>>>>>> The vote runs for 72 hours, until Monday 4pm PST.
>>>>>>>
>>>>>>> Thanks everyone for your patience with this release!
>>>>>>>
>>>>>>> Avery
>>>>>>>
>>>>>>
>
