giraph-dev mailing list archives

From Avery Ching <ach...@apache.org>
Subject Re: [VOTE][CHANGED] Release Giraph 1.0 (rc1)
Date Sun, 14 Apr 2013 22:28:53 GMT
I generally agree and can understand that this is typically true, but 
many other benchmarks are doing this to show off performance.  Also, if 
you have the FB graph of a billion users, it could theoretically fit 
into a 32-bit integer.

Avery
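
The id-width argument above can be checked with a little arithmetic. What follows is only a minimal Java sketch, not Giraph code: the class and method names are hypothetical, ids are assumed to be dense and to start at 0, and the byte counts cover only the raw primitive payload (no serialization framing or per-message overhead). Under those assumptions it shows that a billion ids fit in a signed 32-bit int, and that an (int id, float value) pair is half the size of a (long id, double value) pair, which is the "1/2 network traffic" figure mentioned further down the thread.

```java
// Hypothetical illustration only; none of these names come from Giraph.
class IdWidthSketch {

    /** True if dense vertex ids 0..vertexCount-1 all fit in a signed 32-bit int. */
    public static boolean fitsInInt(long vertexCount) {
        return vertexCount >= 0 && vertexCount - 1 <= Integer.MAX_VALUE;
    }

    /** Raw payload bytes of one (id, value) pair, ignoring any framing overhead. */
    public static int payloadBytes(boolean narrow) {
        return narrow
            ? Integer.BYTES + Float.BYTES   // int id + float value  = 4 + 4
            : Long.BYTES + Double.BYTES;    // long id + double value = 8 + 8
    }

    public static void main(String[] args) {
        long billion = 1_000_000_000L;
        System.out.println("1B ids fit in int: " + fitsInInt(billion));
        System.out.println("3B ids fit in int: " + fitsInInt(3 * billion));
        System.out.println("narrow payload bytes: " + payloadBytes(true));
        System.out.println("wide payload bytes:   " + payloadBytes(false));
    }
}
```

The 3-billion case is an arbitrary example of a graph that would overflow an int id, which is the concern raised in the quoted replies; it is not a measurement of any real dataset.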

On 4/14/13 2:41 PM, Gianmarco De Francisci Morales wrote:
> Hi,
>
> only one quick comment on optimizations and using ints as ids.
> In my opinion, if you can use an int as an id for your dataset, probably
> you don't need Giraph for your problem.
> Just my 2c
>
> Cheers,
>
> --
> Gianmarco
>
>
> On Sun, Apr 14, 2013 at 11:26 PM, Sebastian Schelter <ssc@apache.org> wrote:
>
>> Thank you, Avery, wish I had found the bug earlier.
>> On 14.04.2013 23:25, "Avery Ching" <aching@apache.org> wrote:
>>
>>> Thanks for your input Sebastian.  Given the choice between removing
>>> PageRankVertex and adding the fix, I've added your fix and will cut RC2 a
>>> bit later today.  I really hope this is the last RC.
>>>
>>> Avery
>>>
>>> On 4/14/13 9:34 AM, Sebastian Schelter wrote:
>>>
>>>> Hi Avery,
>>>>
>>>> I see your concerns. The benchmarking question is difficult; we had very
>>>> bad experiences with Mahout in that regard. E.g., we once had an
>>>> M/R-based PageRank implementation in Mahout that used our integer-based
>>>> vectors and removed it as we got public complaints that you can't fit
>>>> the whole web into the range of an integer. Personally, I'd also refrain
>>>> from using floats instead of doubles for benchmarks, as this simply
>>>> means you give up on accuracy.
>>>>
>>>> Regarding benchmarks, I guess the best thing we could do is publish our
>>>> own numbers. The current runtimes I've seen are already very good:
>>>> Giraph beat a very optimized Stratosphere implementation that we did for
>>>> a recent paper by approx. 25%.
>>>>
>>>> To conclude, I do in no way want to hold up the current release. I'm
>>>> perfectly fine with not including the patch and optimizing the
>>>> implementation for a 1.0.1 release, but then we should remove the
>>>> current examples.PageRankVertex from the 1.0 release, as the convergence
>>>> detection is broken and we should not knowingly ship bugged code.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>>
>>>> On 14.04.2013 18:18, Avery Ching wrote:
>>>>
>>>>> Hi Sebastian,
>>>>>
>>>>> Thanks for the patch.  I'll try to take a look at it.
>>>>>
>>>>> The only reason I bring the optimizations up is that a lot of folks tend
>>>>> to compare PageRank performance.  The optimizations I'm referring to are
>>>>> Giraph ones, not algorithmic ones.  We use ints and floats for ids and
>>>>> messages, respectively, instead of longs and doubles (1/2 network
>>>>> traffic), and IntNullArrayEdges vertex edges (efficient array-backed
>>>>> edges) instead of ByteArrayEdges.  You can see
>>>>> https://issues.apache.org/jira/browse/giraph-543 for more details.
>>>>> Anyway, given that we are going to ship a 1.0.1 release in a few weeks
>>>>> for a variety of reasons, should this really hold up the current
>>>>> release?  I would prefer not to cut any more RCs unless things are
>>>>> totally broken (i.e. profiles not compiling, major Giraph bugs, etc.).
>>>>> There are still a lot of outstanding issues in JIRA; we can't fix them
>>>>> all for the 1.0 release.
>>>>>
>>>>> Let me know what you think.
>>>>>
>>>>> Avery
>>>>>
>>>>> On 4/13/13 10:46 AM, Sebastian Schelter wrote:
>>>>>
>>>>>> Hi Avery,
>>>>>>
>>>>>> I found the bug and I can provide a patch today or tomorrow, so
>>>>>> hopefully we can include that in the release (to not knowingly ship
>>>>>> bugged code). Furthermore, I improved the code to protect against
>>>>>> rounding errors.
>>>>>>
>>>>>> I don't really get what you mean by the missing optimization in
>>>>>> comparison to the benchmark PageRank implementation.
>>>>>>
>>>>>> The implementation in o.a.g.examples.PageRankVertex aims to be a robust
>>>>>> real-world implementation. As an optimization, it dismisses edge weights
>>>>>> and reuses objects where possible. Furthermore, it is able to handle
>>>>>> dangling vertices that are present in almost every real-world network
>>>>>> and it automatically detects the number of supersteps to run. With the
>>>>>> patch, it should also provide improved numerical stability.
>>>>>>
>>>>>> If the runtimes don't look good enough when compared to the benchmark
>>>>>> implementation, this might also be caused by the dataset, which has a
>>>>>> skewed degree distribution (like most real-world networks). The
>>>>>> benchmark uses a uniform degree distribution AFAIK.
>>>>>>
>>>>>> Best,
>>>>>> Sebastian
>>>>>>
>>>>>> On 13.04.2013 15:46, Avery Ching wrote:
>>>>>>
>>>>>>> That's great Sebastian.  I would also recommend taking a look at the
>>>>>>> PageRankBenchmark for a performance comparison.  It has had a lot of
>>>>>>> speed improvements and should be a bunch faster than PageRankVertex.
>>>>>>> Even that, though, is not totally optimized.  Hopefully we'll be adding
>>>>>>> a "how to optimize performance" guide in the near future.  Should we
>>>>>>> delay the release or simply ship a 1.1, say in the next month, with this
>>>>>>> fix and supporting YARN's 2.0.4?  I'd like to get on a more normal
>>>>>>> release cycle rather than once a year =).
>>>>>>>
>>>>>>> Avery
>>>>>>>
>>>>>>> On 4/13/13 3:02 AM, Sebastian Schelter wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> I've got some good and bad news. I tested PageRankVertex (not the
>>>>>>>> Benchmark but the example implementation o.a.g.examples.PageRankVertex)
>>>>>>>> from trunk compiled for Hadoop 1.0 on a cluster of 26 machines with
>>>>>>>> 208 cores.
>>>>>>>>
>>>>>>>> I used the Webbase2001 dataset [1], which has 115M vertices and more
>>>>>>>> than 1B edges, and got some awesome running times: the average superstep
>>>>>>>> takes 15 seconds (!!!). Awesome work, I have to say!
>>>>>>>>
>>>>>>>> Unfortunately, there seems to be an issue with the convergence
>>>>>>>> detection, as it didn't get the correct convergence behavior. I'd like
>>>>>>>> to have a look into that this week, so we can ship a performant PageRank
>>>>>>>> implementation which automatically runs an appropriate number of
>>>>>>>> supersteps. Hope this doesn't delay the release too much.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Sebastian
>>>>>>>>
>>>>>>>>
>>>>>>>> [1] http://law.di.unimi.it/webdata/webbase-2001/
>>>>>>>>
>>>>>>>> On 13.04.2013 07:39, Avery Ching wrote:
>>>>>>>>
>>>>>>>>> Thanks to the quick feedback from Roman and Lewis, we have cut a new RC1
>>>>>>>>> that addresses the following issues.
>>>>>>>>>
>>>>>>>>> * Got rid of .git repo in tarball
>>>>>>>>> * Fixed issue with not compiling without git repo (GIRAPH-628)
>>>>>>>>> * Used gnutar in OSX rather than tar to generate the tarball and get rid
>>>>>>>>>   of warnings
>>>>>>>>> * Pushed GIRAPH-627 to support the yarn profile better
>>>>>>>>> * Tarball name changed to the final artifact name (giraph-1.0.tar.gz)
>>>>>>>>>
>>>>>>>>> Release notes:
>>>>>>>>> http://people.apache.org/~aching/giraph-1.0-RC1/RELEASE_NOTES.html
>>>>>>>>>
>>>>>>>>> Release artifacts:
>>>>>>>>> http://people.apache.org/~aching/giraph-1.0-RC1/
>>>>>>>>>
>>>>>>>>> Corresponding git tag:
>>>>>>>>> https://git-wip-us.apache.org/repos/asf?p=giraph.git;a=shortlog;h=refs/tags/release-1.0-RC1
>>>>>>>>>
>>>>>>>>> Signing keys:
>>>>>>>>> http://people.apache.org/keys/group/giraph.asc
>>>>>>>>>
>>>>>>>>> The vote runs for 72 hours, until Monday 11pm PST.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Avery
>>>>>>>>>
>>>>>>>>> Original message below regarding rc0:
>>>>>>>>>
>>>>>>>>> -------------------------------
>>>>>>>>>
>>>>>>>>> Fellow Giraphers,
>>>>>>>>>
>>>>>>>>> We have our first release candidate since graduating from incubation.
>>>>>>>>> This is a source release, primarily due to the different versions of
>>>>>>>>> Hadoop we support with munge (similar to the 0.1 release).  Since 0.1,
>>>>>>>>> we've made A TON of progress on overall performance, optimizing memory
>>>>>>>>> use, split vertex/edge inputs, easy interoperability with Apache Hive,
>>>>>>>>> and a bunch of other areas.  In many ways, this is an almost totally
>>>>>>>>> different codebase.  Thanks everyone for your hard work!
>>>>>>>>>
>>>>>>>>> Apache Giraph has been running in production at Facebook (against
>>>>>>>>> Facebook's Corona implementation of Hadoop -
>>>>>>>>> https://github.com/facebook/hadoop-20/tree/master/src/contrib/corona)
>>>>>>>>> since around last December.  It has proven to be very scalable and
>>>>>>>>> performant, and it enables a bunch of new applications.  Based on the
>>>>>>>>> drastic improvements and the use of Giraph in production, it seems
>>>>>>>>> appropriate to bump up our version to 1.0.
>>>>>>>>>
>>>>>>>>> While anyone can vote, the ASF requires majority approval from the PMC
>>>>>>>>> -- i.e., at least three PMC members must vote affirmatively for release,
>>>>>>>>> and there must be more positive than negative votes. Releases may not be
>>>>>>>>> vetoed. Before voting +1, PMC members are required to download the signed
>>>>>>>>> source code package, compile it as provided, and test the resulting
>>>>>>>>> executable on their own platform, along with also verifying that the
>>>>>>>>> package meets the requirements of the ASF policy on releases.
>>>>>>>>>
>>>>>>>>> Please test this against many other Hadoop versions and let us know how
>>>>>>>>> this goes!
>>>>>>>>>
>>>>>>>>> Release notes:
>>>>>>>>> http://people.apache.org/~aching/giraph-1.0-RC0/RELEASE_NOTES.html
>>>>>>>>>
>>>>>>>>> Release artifacts:
>>>>>>>>> http://people.apache.org/~aching/giraph-1.0-RC0/
>>>>>>>>>
>>>>>>>>> Corresponding git tag:
>>>>>>>>> https://git-wip-us.apache.org/repos/asf?p=giraph.git;a=shortlog;h=refs/tags/release-1.0-RC0
>>>>>>>>>
>>>>>>>>> Signing keys:
>>>>>>>>> http://people.apache.org/keys/group/giraph.asc
>>>>>>>>>
>>>>>>>>> The vote runs for 72 hours, until Monday 4pm PST.
>>>>>>>>>
>>>>>>>>> Thanks everyone for your patience with this release!
>>>>>>>>>
>>>>>>>>> Avery
>>>>>>>>>

