crunch-dev mailing list archives

From Chao Shi <stepi...@live.com>
Subject Re: About status web page
Date Wed, 27 Feb 2013 05:07:06 GMT
Josh,

It is exactly what I need. It helps to decouple the status web server
from the core crunch jar (the server depends on jetty, which is not
necessary for everyone).

Just to make sure I understand correctly:

runAsync returns a future-like object, e.g. RunningPipeline. A user can use
it to start a web server.

RunningPipeline runningPipeline = pipeline.runAsync();
StatusServer statusServer = new StatusServer(runningPipeline, port);
statusServer.start();
runningPipeline.waitUntilDone();
statusServer.stop();
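
A minimal sketch of what such a StatusServer could look like. It is only an illustration under stated assumptions: it uses the JDK's built-in com.sun.net.httpserver.HttpServer rather than jetty, and a Supplier<String> stands in for however a RunningPipeline would expose its dot plan. StatusServer, its methods, and the /plan.dot path are hypothetical names, not existing Crunch API. Passing port 0 asks the OS for an unused port, and getPort() reports the port actually bound so that it can be logged.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

/** Hypothetical status server (not part of Crunch): serves the current dot plan as text. */
final class StatusServer {
    private final HttpServer server;

    /**
     * @param dotPlan assumed hook that returns the pipeline's current dot graph
     * @param port    port to bind; 0 lets the OS choose an unused one
     */
    StatusServer(Supplier<String> dotPlan, int port) throws IOException {
        // create() binds the socket immediately, so the real port is known before start().
        server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/plan.dot", exchange -> {
            // Re-read the plan on every request so a refresh shows the latest state.
            byte[] body = dotPlan.get().getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
    }

    void start() {
        server.start();
    }

    /** Port actually bound, which matters when the constructor was given 0. */
    int getPort() {
        return server.getAddress().getPort();
    }

    void stop() {
        server.stop(0); // 0 = do not wait for in-flight exchanges
    }
}
```

With this shape, the core jar would only need to expose the plan; the server (jetty-based or not) could live in a separate module, and the randomly picked port could be written to the log right after start().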


It would also be nice to expose information about each MR stage. If
that requires careful API design, the dot graph is enough for now.
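
If the dot graph ends up being rendered on the server instead of by viz.js, one option is to pipe the plan through the graphviz command line. The sketch below is only an illustration: DotRenderer and render are hypothetical names, not Crunch API, and calling it with List.of("dot", "-Tsvg") assumes a graphviz `dot` binary is installed and on the PATH.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical helper (not part of Crunch): pipes dot source through an external command. */
final class DotRenderer {
    /**
     * Runs the given command, writes the dot source to its stdin, and returns its stdout.
     * Assumed usage: render(List.of("dot", "-Tsvg"), plan).
     */
    static String render(List<String> command, String dot)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // pass the tool's errors through
                .start();

        // Feed stdin from a separate thread so a large plan cannot block on a full pipe
        // while this thread is reading stdout.
        Thread writer = new Thread(() -> {
            try (OutputStream stdin = process.getOutputStream()) {
                stdin.write(dot.getBytes(StandardCharsets.UTF_8));
            } catch (IOException ignored) {
                // The process exited early; the exit status check below reports it.
            }
        });
        writer.start();

        byte[] output = process.getInputStream().readAllBytes();
        writer.join();
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("renderer exited with status " + exit);
        }
        return new String(output, StandardCharsets.UTF_8);
    }
}
```

The command is a parameter rather than hard-coded so that the output format (for example -Tsvg or -Tpng) or the binary location can be changed without touching the helper.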

On Wed, Feb 27, 2013 at 12:29 PM, Josh Wills <jwills@cloudera.com> wrote:

> Hey Chao,
>
> Does the asynchronous pipeline execution work in
> https://issues.apache.org/jira/browse/CRUNCH-156 help with this? Right now,
> it returns a ListenableFuture<PipelineResult> from runAsync, but we could
> add support for returning the graphviz plan as well, so that you could fire
> up a server to visualize the file while the job was running.
>
> J
>
>
> On Tue, Feb 26, 2013 at 8:03 PM, Chao Shi <stepinto@live.com> wrote:
>
> > Yes, it is for debugging and monitoring.
> >
> > I'm developing a complex pipeline (30+ MRs plus lots of joins). I have a
> > hard time understanding which part of the pipeline takes the most running
> > time and how much intermediate output it produces. Crunch's optimization
> > work is great, but it makes the execution plan difficult to understand.
> > Each time I modify the pipeline, I have to dump the dot file and run
> > graphviz to generate a new picture and check whether anything is wrong.
> >
> > About security, I'm not familiar with how Hadoop does it. I will try to
> > reuse hadoop's HttpServer (does it have something to do with security?).
> > The bottom line is to make this feature disabled by default, and let users
> > enable it at their own risk.
> >
> > If this feature is enabled, the user can choose to use an unused port or a
> > specified port. I don't yet know how the user would find out the randomly
> > picked port (via the log?). I will be working on a prototype version first,
> > and see if the status page is generally useful.
> >
> > On Wed, Feb 27, 2013 at 2:30 AM, Matthias Friedrich <matt@mafr.de> wrote:
> >
> > > Hi Chao,
> > >
> > > sounds interesting - just a couple of things that come to mind:
> > >
> > > Is this intended as a debugging aid or for operational monitoring?
> > >
> > > A Crunch job is a temporary thing, to me this doesn't sound like a
> > > good match for a web service because it disappears after a (possibly
> > > short) time. Also, when multiple jobs are executed concurrently from
> > > the same machine, you can't work with a well-known port, you'd have to
> > > pick an unused port for each job.
> > >
> > > It also looks to me like this has security implications? Right now,
> > > Crunch is just a client library and we're part of Hadoop's security
> > > framework. A web service we might have to secure in some way.
> > >
> > > Regards,
> > >   Matthias
> > >
> > > On Tuesday, 2013-02-26, Chao Shi wrote:
> > > > Hi Crunch Devs,
> > > >
> > > > I'm interested in adding a web status page to crunch. I'm working on a
> > > > prototype first, which simply runs a jetty server and renders the dot
> > > > file produced by DotFileWriter in the browser. The dot rendering work is
> > > > done by viz.js <https://github.com/mdaines/viz.js>. It can successfully
> > > > render the plan into SVG.
> > > >
> > > > I think there are 2 issues I hit with viz.js:
> > > >
> > > > 1. The license of viz.js is unclear. It is compiled from GraphViz source
> > > > code with emscripten. GraphViz is licensed under the Eclipse Public
> > > > License 1.0.
> > > >
> > > > 2. viz.js is big and slow. It is a 1.4MB compressed JS. It takes 1 or 2
> > > > seconds on my laptop to render my pipeline (30+ MRs). I think it would be
> > > > good to have the graph refresh frequently and show the running status of
> > > > the pipeline (i.e. whether MRs are done or not). For that, the rendering
> > > > would be too slow.
> > > >
> > > > Another approach is to call the graphviz command on the server side, if
> > > > viz.js is not possible. I can't find any pure Java implementation of
> > > > graphviz.
> > > >
> > > > Looking forward to your advice.
> > > >
> > > > Thanks,
> > > > Chao
> > >
> > >
> >
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
