crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: About status web page
Date Wed, 27 Feb 2013 05:38:02 GMT
On Tue, Feb 26, 2013 at 9:07 PM, Chao Shi <stepinto@live.com> wrote:

> Josh,
>
> It is exactly what I need. It can help to decouple the status web server
> from core crunch jar (as it depends on jetty, which is not necessary for
> everyone).
>
> Just to make sure I understand correctly:
>
> runAsync returns a future-like object, e.g. RunningPipeline. A user can use
> it to start a web server.
>
> > RunningPipeline runningPipeline = pipeline.runAsync();
> > StatusServer statusServer = new StatusServer(runningPipeline, port);
> > statusServer.start();
> > runningPipeline.waitUntilDone();
> > statusServer.stop();
>
>
> It would also be nice to expose information about each MR stage. If
> this requires careful API design, the dot graph is enough for now.
>

Yeah, exactly. I'll make the implementation an interface that extends
ListenableFuture<PipelineResult> so we can add methods to it as appropriate.

J
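
A minimal sketch of the shape discussed above: a handle that is both a future
and a place to hang extra accessors such as the plan's dot text. Everything
here is an assumption made for illustration. ResultSketch stands in for
PipelineResult, RunningPipelineSketch and getPlanDotFile() are hypothetical
names, and the JDK's Future/CompletableFuture are used in place of Guava's
ListenableFuture so the example compiles without extra dependencies.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

// Hypothetical stand-in for Crunch's PipelineResult.
final class ResultSketch {
    final boolean succeeded;
    ResultSketch(boolean succeeded) { this.succeeded = succeeded; }
}

// Hypothetical handle: a future of the result plus accessors that can be
// added over time (here, only the graphviz plan as dot text).
interface RunningPipelineSketch extends Future<ResultSketch> {
    String getPlanDotFile();
}

// In-memory implementation; a real one would be completed by the job runner.
final class InMemoryRunningPipeline extends CompletableFuture<ResultSketch>
        implements RunningPipelineSketch {
    private final String dot;

    InMemoryRunningPipeline(String dot) { this.dot = dot; }

    @Override
    public String getPlanDotFile() { return dot; }
}

final class RunningPipelineDemo {
    public static void main(String[] args) {
        InMemoryRunningPipeline running =
                new InMemoryRunningPipeline("digraph G { a -> b; }");
        // The plan is available while the pipeline is still running.
        System.out.println("plan: " + running.getPlanDotFile());
        System.out.println("done before completion? " + running.isDone());
        // Simulate the runner finishing the job.
        running.complete(new ResultSketch(true));
        System.out.println("succeeded? " + running.join().succeeded);
    }
}
```

Because the handle is an interface, a status server only needs the handle and
never has to live in the core jar.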


>
> On Wed, Feb 27, 2013 at 12:29 PM, Josh Wills <jwills@cloudera.com> wrote:
>
> > Hey Chao,
> >
> > Does the asynchronous pipeline execution work in
> > https://issues.apache.org/jira/browse/CRUNCH-156 help with this? Right now,
> > it returns a ListenableFuture<PipelineResult> from runAsync, but we could
> > add support for returning the graphviz plan as well, so that you could fire
> > up a server to visualize the file while the job was running.
> >
> > J
> >
> >
> > On Tue, Feb 26, 2013 at 8:03 PM, Chao Shi <stepinto@live.com> wrote:
> >
> > > Yes, it is for debugging and monitoring.
> > >
> > > I'm developing a complex pipeline (30+ MRs plus lots of joins). I have a
> > > hard time understanding which part of the pipeline takes the most running
> > > time and how much intermediate output it produces. Crunch's optimization
> > > work is great, but it makes the execution plan difficult to understand.
> > > Each time I modify the pipeline, I have to dump the dot file and run
> > > graphviz to generate a new picture and check whether anything is wrong.
> > >
> > > About security, I'm not familiar with how Hadoop does it. I will try to
> > > reuse Hadoop's HttpServer (does it have something to do with security?).
> > > The bottom line is to make this feature disabled by default, and let users
> > > enable it at their own risk.
> > >
> > > If this feature is enabled, the user can choose between an unused port and
> > > a specified port. I don't yet know how the user would learn the randomly
> > > picked port (via the log?). I will work on a prototype version first,
> > > and see if the status page is generally useful.
> > >
> > > On Wed, Feb 27, 2013 at 2:30 AM, Matthias Friedrich <matt@mafr.de> wrote:
> > >
> > > > Hi Chao,
> > > >
> > > > sounds interesting - just a couple of things that come to mind:
> > > >
> > > > Is this intended as a debugging aid or for operational monitoring?
> > > >
> > > > A Crunch job is a temporary thing. To me this doesn't sound like a
> > > > good match for a web service because it disappears after a (possibly
> > > > short) time. Also, when multiple jobs are executed concurrently from
> > > > the same machine, you can't work with a well-known port; you'd have to
> > > > pick an unused port for each job.
> > > >
> > > > It also looks to me like this has security implications? Right now,
> > > > Crunch is just a client library and we're part of Hadoop's security
> > > > framework. A web service we might have to secure in some way.
> > > >
> > > > Regards,
> > > >   Matthias
> > > >
> > > > On Tuesday, 2013-02-26, Chao Shi wrote:
> > > > > Hi Crunch Devs,
> > > > >
> > > > > I'm interested in adding a web status page to Crunch. I'm working on a
> > > > > prototype first, which simply runs a jetty server and renders the dot
> > > > > file produced by DotFileWriter in the browser. The dot rendering is done
> > > > > by viz.js <https://github.com/mdaines/viz.js>. It can successfully render
> > > > > the plan into SVG.
> > > > >
> > > > > I think there are 2 issues I hit with viz.js:
> > > > >
> > > > > 1. The license of viz.js is unclear. It is compiled from GraphViz source
> > > > > code with emscripten. GraphViz is under the Eclipse Public License 1.0.
> > > > >
> > > > > 2. viz.js is big and slow. It is a 1.4MB compressed JS file. It takes 1 or 2
> > > > > seconds on my laptop to render my pipeline (30+ MRs). I think it would be
> > > > > good to have the graph refresh frequently and show the running status of
> > > > > the pipeline (i.e. whether MRs are done or not), so this rendering time
> > > > > would be too slow.
> > > > >
> > > > > Another approach, if viz.js is not possible, is to call the graphviz
> > > > > command on the server side. I can't find any pure Java implementation of
> > > > > graphviz.
> > > > >
> > > > > Looking forward to your advice.
> > > > >
> > > > > Thanks,
> > > > > Chao
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Director of Data Science
> > Cloudera <http://www.cloudera.com>
> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >
>
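
On the question quoted above of how a user would learn a randomly picked port:
one common approach is to bind to port 0, let the OS choose a free port, and
read the chosen port back from the server before logging it. The sketch below
is only an illustration under stated assumptions. StatusServerSketch and the
/plan.dot path are hypothetical names, and it uses the JDK's built-in
com.sun.net.httpserver.HttpServer rather than jetty or Hadoop's HttpServer.
It binds to the loopback address only, which is one conservative default
given the security concerns raised in the thread.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical status server that serves a fixed dot string as plain text.
final class StatusServerSketch {
    private final HttpServer server;

    // Pass port 0 to let the OS pick any unused port.
    StatusServerSketch(String dot, int port) {
        HttpServer s;
        try {
            // Binds immediately; loopback only, so it is not reachable remotely.
            s = HttpServer.create(
                    new InetSocketAddress(InetAddress.getLoopbackAddress(), port), 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.server = s;
        byte[] body = dot.getBytes(StandardCharsets.UTF_8);
        server.createContext("/plan.dot", exchange -> {
            exchange.getResponseHeaders()
                    .set("Content-Type", "text/plain; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
    }

    // Starts serving and returns the port actually bound.
    int start() {
        server.start();
        return server.getAddress().getPort();
    }

    void stop() {
        server.stop(0);
    }
}

final class StatusServerDemo {
    public static void main(String[] args) {
        StatusServerSketch status =
                new StatusServerSketch("digraph G { a -> b; }", 0);
        int port = status.start();
        // A real pipeline would write this line to its log.
        System.out.println("status page at http://127.0.0.1:" + port + "/plan.dot");
        status.stop();
    }
}
```

Since each job gets its own OS-assigned port, concurrent jobs on one machine
do not collide, at the cost of the URL changing from run to run.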



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
