giraph-dev mailing list archives

From Gianmarco De Francisci Morales <g...@apache.org>
Subject Re: [VOTE][CHANGED] Release Giraph 1.0 (rc1)
Date Sun, 14 Apr 2013 21:41:38 GMT
Hi,

Just one quick comment on optimizations and using ints as ids.
In my opinion, if you can use an int as an id for your dataset, you
probably don't need Giraph for your problem.
Just my 2c

Cheers,

--
Gianmarco


On Sun, Apr 14, 2013 at 11:26 PM, Sebastian Schelter <ssc@apache.org> wrote:

> Thank you, Avery, wish I had found the bug earlier.
> On 14.04.2013 23:25, "Avery Ching" <aching@apache.org> wrote:
>
> > Thanks for your input Sebastian.  Given the choice between removing
> > PageRankVertex and adding the fix, I've added your fix and will cut RC2 a
> > bit later today.  I really hope this is the last RC.
> >
> > Avery
> >
> > On 4/14/13 9:34 AM, Sebastian Schelter wrote:
> >
> >> Hi Avery,
> >>
> >> I see your concerns. The benchmarking question is difficult; we had very
> >> bad experiences with Mahout in that regard. E.g., we once had an
> >> M/R-based PageRank implementation in Mahout that used our integer-based
> >> vectors and removed it as we got public complaints that you can't fit
> >> the whole web into the range of an integer. Personally, I'd also refrain
> >> from using floats instead of doubles for benchmarks, as this simply
> >> means you give up on accuracy.
> >>
> >> Regarding benchmarks, I guess the best thing we could do is publish our
> >> own numbers. The current runtimes I've seen are already very good,
> >> Giraph beat a very optimized Stratosphere implementation that we did for
> >> a recent paper by approx. 25%.
> >>
> >> To conclude, I in no way want to hold up the current release. I'm
> >> perfectly fine with not including the patch and optimizing the
> >> implementation for a 1.0.1 release, but then we should remove the
> >> current examples.PageRankVertex from the 1.0 release, as the convergence
> >> detection is broken and we should not knowingly ship bugged code.
> >>
> >> Best,
> >> Sebastian
> >>
> >>
> >> On 14.04.2013 18:18, Avery Ching wrote:
> >>
> >>> Hi Sebastian,
> >>>
> >>> Thanks for the patch.  I'll try to take a look at it.
> >>>
> >>> The only reason I bring the optimizations up is that a lot of folks tend
> >>> to compare PageRank performance.  The optimizations I'm referring to are
> >>> Giraph ones, not algorithmic ones.  We use ints and floats for ids and
> >>> messages, respectively, instead of longs and doubles (1/2 the network
> >>> traffic), and IntNullArrayEdges vertex edges (efficient array-backed
> >>> edges) instead of ByteArrayEdges.  You can see
> >>> https://issues.apache.org/jira/browse/giraph-543 for more details.
> >>>
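The "1/2 network traffic" figure quoted above can be checked with a minimal sketch, assuming each message is written as a plain (target id, value) pair through java.io.DataOutputStream. The class and method names below are hypothetical and not part of Giraph, and the sketch ignores any framing, batching, or compression that a real run would add.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical helper (not a Giraph class): bytes needed per (id, value) pair. */
final class MessageSizeSketch {

    /** Bytes written for one message with an int target id and a float value. */
    static int intFloatBytes(int targetId, float value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(targetId);  // 4 bytes
            out.writeFloat(value);   // 4 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    /** Bytes written for one message with a long target id and a double value. */
    static int longDoubleBytes(long targetId, double value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(targetId);  // 8 bytes
            out.writeDouble(value);   // 8 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        int small = intFloatBytes(42, 0.15f);
        int large = longDoubleBytes(42L, 0.15d);
        System.out.println("int id + float value:   " + small + " bytes");
        System.out.println("long id + double value: " + large + " bytes");
        System.out.println("ratio: " + (double) small / large);
    }
}
```

The payload halves because both fields halve; whether total traffic halves in practice depends on overhead this sketch does not model.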
> >>> Anyway, given that we are going to ship a 1.0.1 release in a few weeks
> >>> for a variety of reasons, should this really hold up the current
> >>> release?  I would prefer to not cut any more RCs unless things are
> >>> totally broken (i.e. profiles not compiling, major Giraph bugs, etc.).
> >>> There are still a lot of outstanding issues in JIRA, we can't fix them
> >>> all for the 1.0 release.
> >>>
> >>> Let me know what you think.
> >>>
> >>> Avery
> >>>
> >>> On 4/13/13 10:46 AM, Sebastian Schelter wrote:
> >>>
> >>>> Hi Avery,
> >>>>
> >>>> I found the bug and I can provide a patch today or tomorrow, so
> >>>> hopefully we can include that in the release (to not knowingly ship
> >>>> bugged code). Furthermore I improved the code to protect against
> >>>> rounding errors.
> >>>>
> >>>> I don't really get what you mean by the missing optimization in
> >>>> comparison to the benchmark PageRank implementation.
> >>>>
> >>>> The implementation in o.a.g.examples.PageRankVertex aims to be a robust
> >>>> real-world implementation. As an optimization, it ignores edge weights
> >>>> and reuses objects where possible. Furthermore, it is able to handle
> >>>> dangling vertices, which are present in almost every real-world network,
> >>>> and it automatically detects the number of supersteps to run. With the
> >>>> patch, it should also provide improved numerical stability.
> >>>>
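The two features described above, dangling-vertex handling and automatic detection of the number of iterations, can be illustrated with a single-machine sketch. This is not the code in o.a.g.examples.PageRankVertex or the patch under discussion; the class name, the damping factor of 0.85, the L1-change stopping rule, and the adjacency-array input are all assumptions made for the example.

```java
import java.util.Arrays;

/** Hypothetical single-machine PageRank sketch (not Giraph code). */
final class PageRankSketch {
    static final double DAMPING = 0.85;  // assumed damping factor

    /** Final ranks plus the number of iterations actually run. */
    static final class Result {
        final double[] ranks;
        final int iterations;

        Result(double[] ranks, int iterations) {
            this.ranks = ranks;
            this.iterations = iterations;
        }
    }

    /** outEdges[u] lists the targets of u; an empty array marks a dangling vertex. */
    static Result pageRank(int[][] outEdges, double tolerance, int maxIterations) {
        int n = outEdges.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        int iterations = 0;
        while (iterations < maxIterations) {
            double[] next = new double[n];
            double danglingMass = 0.0;
            for (int u = 0; u < n; u++) {
                if (outEdges[u].length == 0) {
                    danglingMass += rank[u];  // rank that has no edge to flow along
                } else {
                    double share = rank[u] / outEdges[u].length;
                    for (int v : outEdges[u]) {
                        next[v] += share;
                    }
                }
            }
            double delta = 0.0;  // L1 change between consecutive rank vectors
            for (int v = 0; v < n; v++) {
                // Spread the dangling mass uniformly so the ranks keep summing to 1.
                next[v] = (1.0 - DAMPING) / n + DAMPING * (next[v] + danglingMass / n);
                delta += Math.abs(next[v] - rank[v]);
            }
            rank = next;
            iterations++;
            if (delta < tolerance) {
                break;  // converged: stop without a hand-picked iteration count
            }
        }
        return new Result(rank, iterations);
    }

    public static void main(String[] args) {
        int[][] graph = {{1, 2}, {2, 3}, {0}, {}};  // vertex 3 is dangling
        Result result = pageRank(graph, 1e-6, 200);
        System.out.println("iterations: " + result.iterations);
        System.out.println("ranks: " + Arrays.toString(result.ranks));
    }
}
```

In a vertex-centric setting, the dangling mass and the L1 change would presumably be collected through some global aggregation step rather than local variables; that part is left out here.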
> >>>> If the runtimes don't look good enough when compared to the benchmark
> >>>> implementation, this might also be caused by the dataset, which has a
> >>>> skewed degree distribution (like most real-world networks). The
> >>>> benchmark uses a uniform degree distribution AFAIK.
> >>>>
> >>>> Best,
> >>>> Sebastian
> >>>>
> >>>> On 13.04.2013 15:46, Avery Ching wrote:
> >>>>
> >>>>> That's great Sebastian.  I would also recommend taking a look at the
> >>>>> PageRankBenchmark for a performance comparison.  It has had a lot of
> >>>>> speed improvements and should be a bunch faster than PageRankVertex.
> >>>>> Even that, though, is not totally optimized.  Hopefully we'll be adding
> >>>>> a "how to optimize performance" guide in the near future.  Should we
> >>>>> delay the release or simply just ship a 1.1, say in the next month, with
> >>>>> this fix and supporting YARN's 2.0.4?  I'd like to get on a more normal
> >>>>> release cycle rather than once a year =).
> >>>>>
> >>>>> Avery
> >>>>>
> >>>>> On 4/13/13 3:02 AM, Sebastian Schelter wrote:
> >>>>>
> >>>>>> Hi there,
> >>>>>>
> >>>>>> I've got some good and bad news. I tested PageRankVertex (not the
> >>>>>> Benchmark but the example implementation
> >>>>>> o.a.g.examples.PageRankVertex) from trunk compiled for Hadoop 1.0
> >>>>>> on a cluster of 26 machines with 208 cores.
> >>>>>>
> >>>>>> I used the Webbase2001 dataset [1], which has 115M vertices and more
> >>>>>> than 1B edges, and got some awesome running times: the average
> >>>>>> superstep takes 15 seconds (!!!). Awesome work, I have to say!
> >>>>>>
> >>>>>> Unfortunately, there seems to be an issue with the convergence
> >>>>>> detection, as it didn't get the correct convergence behavior. I'd like
> >>>>>> to have a look into that this week, so we can ship a performant
> >>>>>> PageRank implementation which automatically runs an appropriate number
> >>>>>> of supersteps. Hope this doesn't delay the release too much.
> >>>>>>
> >>>>>> Best,
> >>>>>> Sebastian
> >>>>>>
> >>>>>>
> >>>>>> [1] http://law.di.unimi.it/webdata/webbase-2001/
> >>>>>>
> >>>>>>
> >>>>>> On 13.04.2013 07:39, Avery Ching wrote:
> >>>>>>
> >>>>>>> Thanks to the quick feedback from Roman and Lewis, we have cut a
> >>>>>>> new RC1 that addresses the following issues.
> >>>>>>>
> >>>>>>> * Got rid of .git repo in tarball
> >>>>>>> * Fixed issue with not compiling without git repo (GIRAPH-628)
> >>>>>>> * Used gnutar in OSX rather than tar to generate the tarball and
> >>>>>>>   get rid of warnings
> >>>>>>> * Pushed GIRAPH-627 to support the yarn profile better
> >>>>>>> * Tarball name changed to the final artifact name (giraph-1.0.tar.gz)
> >>>>>>>
> >>>>>>> Release notes:
> >>>>>>> http://people.apache.org/~aching/giraph-1.0-RC1/RELEASE_NOTES.html
> >>>>>>>
> >>>>>>> Release artifacts:
> >>>>>>> http://people.apache.org/~aching/giraph-1.0-RC1/
> >>>>>>>
> >>>>>>> Corresponding git tag:
> >>>>>>> https://git-wip-us.apache.org/repos/asf?p=giraph.git;a=shortlog;h=refs/tags/release-1.0-RC1
> >>>>>>>
> >>>>>>> Signing keys:
> >>>>>>> http://people.apache.org/keys/group/giraph.asc
> >>>>>>>
> >>>>>>> The vote runs for 72 hours, until Monday 11pm PST.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Avery
> >>>>>>>
> >>>>>>> Original message below regarding rc0:
> >>>>>>>
> >>>>>>> -------------------------------
> >>>>>>>
> >>>>>>> Fellow Giraphers,
> >>>>>>>
> >>>>>>> We have our first release candidate since graduating from
> >>>>>>> incubation.  This is a source release, primarily due to the different
> >>>>>>> versions of Hadoop we support with munge (similar to the 0.1 release).
> >>>>>>> Since 0.1, we've made A TON of progress on overall performance,
> >>>>>>> optimizing memory use, split vertex/edge inputs, easy interoperability
> >>>>>>> with Apache Hive, and a bunch of other areas.  In many ways, this is an
> >>>>>>> almost totally different codebase.  Thanks everyone for your hard work!
> >>>>>>>
> >>>>>>> Apache Giraph has been running in production at Facebook (against
> >>>>>>> Facebook's Corona implementation of Hadoop -
> >>>>>>> https://github.com/facebook/hadoop-20/tree/master/src/contrib/corona)
> >>>>>>> since around last December.  It has proven to be very scalable and
> >>>>>>> performant, and enables a bunch of new applications.  Based on the
> >>>>>>> drastic improvements and the use of Giraph in production, it seems
> >>>>>>> appropriate to bump up our version to 1.0.
> >>>>>>>
> >>>>>>> While anyone can vote, the ASF requires majority approval from the
> >>>>>>> PMC -- i.e., at least three PMC members must vote affirmatively for
> >>>>>>> release, and there must be more positive than negative votes. Releases
> >>>>>>> may not be vetoed. Before voting +1 PMC members are required to
> >>>>>>> download the signed source code package, compile it as provided, and
> >>>>>>> test the resulting executable on their own platform, along with also
> >>>>>>> verifying that the package meets the requirements of the ASF policy on
> >>>>>>> releases.
> >>>>>>>
> >>>>>>> Please test this against many other Hadoop versions and let us know
> >>>>>>> how this goes!
> >>>>>>>
> >>>>>>> Release notes:
> >>>>>>> http://people.apache.org/~aching/giraph-1.0-RC0/RELEASE_NOTES.html
> >>>>>>>
> >>>>>>> Release artifacts:
> >>>>>>> http://people.apache.org/~aching/giraph-1.0-RC0/
> >>>>>>>
> >>>>>>> Corresponding git tag:
> >>>>>>> https://git-wip-us.apache.org/repos/asf?p=giraph.git;a=shortlog;h=refs/tags/release-1.0-RC0
> >>>>>>>
> >>>>>>> Signing keys:
> >>>>>>> http://people.apache.org/keys/group/giraph.asc
> >>>>>>>
> >>>>>>> The vote runs for 72 hours, until Monday 4pm PST.
> >>>>>>>
> >>>>>>> Thanks everyone for your patience with this release!
> >>>>>>>
> >>>>>>> Avery
> >>>>>>>
> >>>>>>
> >
>
