giraph-user mailing list archives

From Avery Ching <ach...@apache.org>
Subject Re: [VOTE][CHANGED] Release Giraph 1.0 (rc1)
Date Sun, 14 Apr 2013 21:25:13 GMT
Thanks for your input, Sebastian.  Given the choice between removing
PageRankVertex and adding the fix, I've added your fix and will cut RC2 a
bit later today.  I really hope this is the last RC.

Avery

On 4/14/13 9:34 AM, Sebastian Schelter wrote:
> Hi Avery,
>
> I see your concerns. The benchmarking question is difficult; we had very
> bad experiences with Mahout in that regard. E.g., we once had an
> M/R-based PageRank implementation in Mahout that used our integer-based
> vectors, and we removed it after getting public complaints that you can't
> fit the whole web into the range of an integer. Personally, I'd also
> refrain from using floats instead of doubles for benchmarks, as this
> simply means you give up on accuracy.
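
To illustrate the accuracy point above (this is not code from the thread), here is a minimal Java sketch that adds up many small contributions once in a float accumulator and once in a double accumulator. The class name, the count of ten million, and the value 1/n are assumptions chosen for illustration; they do not come from Giraph or Mahout.

```java
/** Hypothetical illustration: accumulated rounding error in float vs. double. */
final class PrecisionSketch {

    /** Sums n copies of value using a float accumulator. */
    static float sumAsFloat(float value, int n) {
        float sum = 0f;
        for (int i = 0; i < n; i++) {
            sum += value;
        }
        return sum;
    }

    /** Sums n copies of value using a double accumulator. */
    static double sumAsDouble(double value, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 10_000_000;       // assumed number of tiny contributions
        double value = 1.0 / n;   // the exact total would be 1.0
        double floatError = Math.abs(sumAsFloat((float) value, n) - 1.0);
        double doubleError = Math.abs(sumAsDouble(value, n) - 1.0);
        System.out.println("float error:  " + floatError);
        System.out.println("double error: " + doubleError);
    }
}
```

The float total drifts visibly away from 1.0 because each addition is rounded to roughly 7 significant decimal digits, while the double total stays very close to the exact value.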
>
> Regarding benchmarks, I guess the best thing we could do is publish our
> own numbers. The current runtimes I've seen are already very good:
> Giraph beat a very optimized Stratosphere implementation that we did for
> a recent paper by approx. 25%.
>
> To conclude, I in no way want to hold up the current release. I'm
> perfectly fine with not including the patch and optimizing the
> implementation for a 1.0.1 release, but then we should remove the
> current examples.PageRankVertex from the 1.0 release, as the convergence
> detection is broken and we should not knowingly ship bugged code.
>
> Best,
> Sebastian
>
>
> On 14.04.2013 18:18, Avery Ching wrote:
>> Hi Sebastian,
>>
>> Thanks for the patch.  I'll try to take a look at it.
>>
>> The only reason I bring the optimizations up is that a lot of folks tend
>> to compare PageRank performance.  The optimizations I'm referring to are
>> Giraph ones, not algorithmic ones.  We use ints for ids and floats for
>> messages instead of longs and doubles (1/2 the network traffic), and
>> IntNullArrayEdges for vertex edges (efficient array-backed edges)
>> instead of ByteArrayEdges.  You can see
>> https://issues.apache.org/jira/browse/giraph-543 for more details.
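
As a rough check of the "1/2 network traffic" figure above, the following Java sketch serializes one (id, message) pair with each type choice and compares the byte counts. It uses plain java.io streams rather than Giraph's or Hadoop's Writable classes, so the class and method names are hypothetical and any framing overhead a real transport adds is ignored.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical illustration: payload size of one (vertex id, message) pair. */
final class MessageSizeSketch {

    /** Bytes written for an int id plus a float message. */
    static int compactPairBytes(int id, float message) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(id);         // 4 bytes
            out.writeFloat(message);  // 4 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    /** Bytes written for a long id plus a double message. */
    static int widePairBytes(long id, double message) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeLong(id);         // 8 bytes
            out.writeDouble(message);  // 8 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        System.out.println("int/float pair:   " + compactPairBytes(42, 0.15f) + " bytes");
        System.out.println("long/double pair: " + widePairBytes(42L, 0.15) + " bytes");
    }
}
```

The compact pair takes 8 bytes and the wide pair 16, which matches the halving mentioned above for the raw payload.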
>>
>> Anyway, given that we are going to ship a 1.0.1 release in a few weeks
>> for a variety of reasons, should this really hold up the current
>> release?  I would prefer not to cut any more RCs unless things are
>> totally broken (i.e. profiles not compiling, major Giraph bugs, etc.).
>> There are still a lot of outstanding issues in JIRA; we can't fix them
>> all for the 1.0 release.
>>
>> Let me know what you think.
>>
>> Avery
>>
>> On 4/13/13 10:46 AM, Sebastian Schelter wrote:
>>> Hi Avery,
>>>
>>> I found the bug and I can provide a patch today or tomorrow, so
>>> hopefully we can include that in the release (to not knowingly ship
>>> bugged code). Furthermore, I improved the code to protect against
>>> rounding errors.
>>>
>>> I don't really get what you mean by the missing optimization in
>>> comparison to the benchmark PageRank implementation.
>>>
>>> The implementation in o.a.g.examples.PageRankVertex aims to be a robust
>>> real-world implementation. As an optimization, it dismisses edge weights
>>> and reuses objects where possible. Furthermore, it is able to handle
>>> dangling vertices, which are present in almost every real-world network,
>>> and it automatically detects the number of supersteps to run. With the
>>> patch, it should also provide improved numerical stability.
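
The two features described above, handling dangling vertices and stopping automatically once the ranks settle, can be sketched on a single machine as follows. This is not the code of o.a.g.examples.PageRankVertex and does not use the Giraph API; the damping factor of 0.85, the L1-norm stopping rule, the tolerance, and all names are assumptions made for illustration.

```java
import java.util.Arrays;

/** Hypothetical single-machine PageRank with dangling-vertex handling. */
final class PageRankSketch {
    static final double DAMPING = 0.85; // assumed damping factor

    /** Final ranks plus the number of iterations that were actually run. */
    static final class Result {
        final double[] ranks;
        final int iterations;

        Result(double[] ranks, int iterations) {
            this.ranks = ranks;
            this.iterations = iterations;
        }
    }

    /** outEdges[v] lists the targets of v; an empty array marks a dangling vertex. */
    static Result pageRank(int[][] outEdges, double tolerance, int maxIterations) {
        int n = outEdges.length;
        double[] ranks = new double[n];
        Arrays.fill(ranks, 1.0 / n);
        int iterations = 0;
        while (iterations < maxIterations) {
            double[] next = new double[n];
            double danglingMass = 0.0;
            for (int v = 0; v < n; v++) {
                if (outEdges[v].length == 0) {
                    danglingMass += ranks[v]; // no out-edges: collect the rank
                } else {
                    double share = ranks[v] / outEdges[v].length;
                    for (int target : outEdges[v]) {
                        next[target] += share;
                    }
                }
            }
            double delta = 0.0; // L1 distance between successive rank vectors
            for (int v = 0; v < n; v++) {
                // Dangling mass is spread uniformly so the ranks keep summing to 1.
                next[v] = (1 - DAMPING) / n + DAMPING * (next[v] + danglingMass / n);
                delta += Math.abs(next[v] - ranks[v]);
            }
            ranks = next;
            iterations++;
            if (delta < tolerance) {
                break; // converged: stop without a fixed iteration count
            }
        }
        return new Result(ranks, iterations);
    }

    public static void main(String[] args) {
        int[][] graph = { {1, 2}, {2}, {0}, {} }; // vertex 3 is dangling
        Result result = pageRank(graph, 1e-6, 200);
        System.out.println(Arrays.toString(result.ranks)
                + " after " + result.iterations + " iterations");
    }
}
```

In a vertex-centric system the dangling mass and the delta would typically be gathered through some global aggregation step rather than a local loop; the sketch only shows the arithmetic.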
>>>
>>> If the runtimes don't look good enough when compared to the benchmark
>>> implementation, this might also be caused by the dataset, which has a
>>> skewed degree distribution (like most real-world networks). The
>>> benchmark uses a uniform degree distribution AFAIK.
>>>
>>> Best,
>>> Sebastian
>>>
>>> On 13.04.2013 15:46, Avery Ching wrote:
>>>> That's great, Sebastian.  I would also recommend taking a look at the
>>>> PageRankBenchmark for a performance comparison.  It has had a lot of
>>>> speed improvements and should be a bunch faster than PageRankVertex.
>>>> Even that, though, is not totally optimized.  Hopefully we'll be adding a
>>>> "how to optimize performance" guide in the near future.  Should we delay
>>>> the release or simply ship a 1.1, say in the next month, with this
>>>> fix and supporting YARN's 2.0.4?  I'd like to get on a more normal
>>>> release cycle rather than once a year =).
>>>>
>>>> Avery
>>>>
>>>> On 4/13/13 3:02 AM, Sebastian Schelter wrote:
>>>>> Hi there,
>>>>>
>>>>> I have some good and bad news. I tested PageRankVertex (not the
>>>>> Benchmark but the example implementation
>>>>> o.a.g.examples.PageRankVertex) from trunk compiled for Hadoop 1.0 on
>>>>> a cluster of 26 machines with 208 cores.
>>>>>
>>>>> I used the Webbase2001 dataset [1], which has 115M vertices and more
>>>>> than 1B edges, and got some awesome running times: the average
>>>>> superstep takes 15 seconds (!!!). Awesome work, I have to say!
>>>>>
>>>>> Unfortunately, there seems to be an issue with the convergence
>>>>> detection, as it didn't get the correct convergence behavior. I'd like
>>>>> to have a look into that this week, so we can ship a performant
>>>>> PageRank implementation which automatically runs an appropriate
>>>>> number of supersteps. Hope this doesn't delay the release too much.
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>>
>>>>> [1] http://law.di.unimi.it/webdata/webbase-2001/
>>>>>
>>>>>
>>>>> On 13.04.2013 07:39, Avery Ching wrote:
>>>>>> Thanks to the quick feedback from Roman and Lewis, we have cut a
>>>>>> new RC1 that addresses the following issues.
>>>>>>
>>>>>> * Got rid of .git repo in tarball
>>>>>> * Fixed issue with not compiling without git repo (GIRAPH-628)
>>>>>> * Used gnutar in OSX rather than tar to generate the tarball and
>>>>>>   get rid of warnings
>>>>>> * Pushed GIRAPH-627 to support the yarn profile better
>>>>>> * Tarball name changed to the final artifact name (giraph-1.0.tar.gz)
>>>>>>
>>>>>> Release notes:
>>>>>> http://people.apache.org/~aching/giraph-1.0-RC1/RELEASE_NOTES.html
>>>>>>
>>>>>> Release artifacts:
>>>>>> http://people.apache.org/~aching/giraph-1.0-RC1/
>>>>>>
>>>>>> Corresponding git tag:
>>>>>> https://git-wip-us.apache.org/repos/asf?p=giraph.git;a=shortlog;h=refs/tags/release-1.0-RC1
>>>>>>
>>>>>> Signing keys:
>>>>>> http://people.apache.org/keys/group/giraph.asc
>>>>>>
>>>>>> The vote runs for 72 hours, until Monday 11pm PST.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Avery
>>>>>>
>>>>>> Original message below regarding rc0:
>>>>>>
>>>>>> -------------------------------
>>>>>>
>>>>>> Fellow Giraphers,
>>>>>>
>>>>>> We have our first release candidate since graduating from
>>>>>> incubation.  This is a source release, primarily due to the
>>>>>> different versions of Hadoop we support with munge (similar to the
>>>>>> 0.1 release).  Since 0.1, we've made A TON of progress on overall
>>>>>> performance, optimizing memory use, split vertex/edge inputs, easy
>>>>>> interoperability with Apache Hive, and a bunch of other areas.  In
>>>>>> many ways, this is an almost totally different codebase.  Thanks
>>>>>> everyone for your hard work!
>>>>>>
>>>>>> Apache Giraph has been running in production at Facebook (against
>>>>>> Facebook's Corona implementation of Hadoop -
>>>>>> https://github.com/facebook/hadoop-20/tree/master/src/contrib/corona)
>>>>>> since around last December.  It has proven to be very scalable and
>>>>>> performant, and it enables a bunch of new applications.  Based on the
>>>>>> drastic improvements and the use of Giraph in production, it seems
>>>>>> appropriate to bump up our version to 1.0.
>>>>>>
>>>>>> While anyone can vote, the ASF requires majority approval from the
>>>>>> PMC -- i.e., at least three PMC members must vote affirmatively for
>>>>>> release, and there must be more positive than negative votes.
>>>>>> Releases may not be vetoed. Before voting +1, PMC members are
>>>>>> required to download the signed source code package, compile it as
>>>>>> provided, and test the resulting executable on their own platform,
>>>>>> along with also verifying that the package meets the requirements
>>>>>> of the ASF policy on releases.
>>>>>>
>>>>>> Please test this against many other Hadoop versions and let us know
>>>>>> how this goes!
>>>>>>
>>>>>> Release notes:
>>>>>> http://people.apache.org/~aching/giraph-1.0-RC0/RELEASE_NOTES.html
>>>>>>
>>>>>> Release artifacts:
>>>>>> http://people.apache.org/~aching/giraph-1.0-RC0/
>>>>>>
>>>>>> Corresponding git tag:
>>>>>> https://git-wip-us.apache.org/repos/asf?p=giraph.git;a=shortlog;h=refs/tags/release-1.0-RC0
>>>>>>
>>>>>> Signing keys:
>>>>>> http://people.apache.org/keys/group/giraph.asc
>>>>>>
>>>>>> The vote runs for 72 hours, until Monday 4pm PST.
>>>>>>
>>>>>> Thanks everyone for your patience with this release!
>>>>>>
>>>>>> Avery

