hadoop-zookeeper-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Expiring session... timeout of 600000ms exceeded
Date Tue, 21 Sep 2010 19:48:08 GMT
Generally, the best practice for crawlers is that no process runs for
more than an hour or five.  All crawler processes update a central
state store with their progress, but they exit when they reach a time
limit, knowing that somebody else will take up the work where they
leave off.  This avoids a multitude of ills.
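To make that concrete, here is a minimal sketch of a time-boxed worker.
The StateStore interface, Checkpoint record, and crawlOnePage() step are
hypothetical stand-ins for whatever central store (ZooKeeper, a database,
etc.) and crawl logic the real system uses:

// A minimal sketch of a time-boxed crawler worker; all names below are
// hypothetical stand-ins, not an existing API.
import java.time.Duration;
import java.time.Instant;

public class TimeBoxedCrawler {

    // Hypothetical central state store shared by all workers.
    interface StateStore {
        Checkpoint claimNextWorkItem();           // resume where a previous worker left off
        void saveProgress(Checkpoint checkpoint); // record progress for the next worker
    }

    // Hypothetical cursor into a crawl: which site, which page, whether finished.
    record Checkpoint(String site, long nextPage, boolean done) {}

    private static final Duration TIME_LIMIT = Duration.ofHours(1);

    public static void main(String[] args) {
        StateStore store = connectToStore();      // assumption: some factory for the store client
        Instant deadline = Instant.now().plus(TIME_LIMIT);

        Checkpoint cp = store.claimNextWorkItem();
        while (cp != null && !cp.done() && Instant.now().isBefore(deadline)) {
            cp = crawlOnePage(cp);                // one small unit of work
            store.saveProgress(cp);               // checkpoint after every unit
        }
        // Exit whether or not the crawl is finished; another process picks up
        // from the last saved checkpoint.
    }

    private static Checkpoint crawlOnePage(Checkpoint cp) {
        // Placeholder for the real fetch/parse step.
        return new Checkpoint(cp.site(), cp.nextPage() + 1, cp.nextPage() + 1 > 1000);
    }

    private static StateStore connectToStore() {
        throw new UnsupportedOperationException("wire up the real state store here");
    }
}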

On Tue, Sep 21, 2010 at 11:53 AM, Tim Robertson
<timrobertson100@gmail.com> wrote:

> > On the topic of your application, why are you using processes
> > instead of threads?  With threads, you can get your memory
> > overhead down to tens of kilobytes as opposed to tens of megabytes.
>
> I am just prototyping scaling out to many processes, potentially
> across multiple machines.  Our live crawler runs in a single JVM,
> but some of these crawls take 4-6 weeks, so long-running jobs block
> others, and I was looking at alternatives.  The live crawler also
> uses DOM-based XML parsing, so it hits memory limits; SAX would
> address this.  We also want to be able to deploy patches to the
> crawlers without interrupting those long-running jobs if possible.
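(For reference, the streaming approach mentioned above might look like the
sketch below: SAX handles elements as they arrive, so memory stays roughly
constant instead of growing with the document as it does with DOM.  The
"record" element name is only an illustrative assumption about the crawled
documents.)

// Minimal SAX sketch using the standard javax.xml.parsers API.
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxCrawlParser {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File(args[0]), new DefaultHandler() {
            private int records;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("record".equals(qName)) {  // hypothetical element of interest
                    records++;                 // process it here, then let it be garbage collected
                }
            }

            @Override
            public void endDocument() {
                System.out.println("Parsed " + records + " records");
            }
        });
    }
}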
