hawq-dev mailing list archives

From Lei Chang <lei_ch...@apache.org>
Subject Re: overcommit_memory setting in cluster with hawq and hadoop deployed
Date Sat, 17 Dec 2016 05:53:16 GMT
This issue has been raised many times. I think Taylor gave a good proposal.

In the long term, I think we should add more tests around killing processes
randomly.

If that leads to corruption, I think it is a bug. From a database perspective,
we should not assume that processes cannot be killed under specific
conditions or at particular times.
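
As a rough illustration of such a test, here is a minimal sketch in Python;
the process name and the recovery probe are placeholders, not the actual
HAWQ test harness:

    import os
    import random
    import signal
    import subprocess
    import time

    def pick_victim(name="postgres"):
        # pgrep prints one PID per line; pick one at random.
        out = subprocess.run(["pgrep", name], capture_output=True, text=True)
        pids = [int(p) for p in out.stdout.split()]
        return random.choice(pids) if pids else None

    def cluster_healthy():
        # Placeholder probe: a real test would run queries through the
        # master and verify catalog and data consistency afterwards.
        return subprocess.run(["psql", "-c", "SELECT 1"],
                              capture_output=True).returncode == 0

    victim = pick_victim()
    if victim is not None:
        os.kill(victim, signal.SIGKILL)  # simulate the OOM killer
        time.sleep(10)                   # give the cluster time to recover
        assert cluster_healthy(), "cluster did not recover from a random kill"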

Thanks
Lei


On Sat, Dec 17, 2016 at 1:43 AM, Taylor Vesely <tvesely@pivotal.io> wrote:

> Hi Ruilong,
>
> I've been brainstorming the issue, and this is my proposed solution. Please
> tell me what you think.
>
> Segments are stateless. In Greenplum, we worry about catalog corruption
> when a segment dies, but in HAWQ all of the data nodes are stateless. Even if
> the OOM killer ends up killing a segment, we shouldn't need to worry about
> catalog corruption. *Only the master has a catalog that matters.*
>
> My proposition:
>
> Because the catalog matters only on the master, we should probably continue
> to run master nodes with vm.overcommit_memory = 2. On the segments, however,
> I think we shouldn't worry so much about an OOM event. The problem remains
> that all queries across the cluster will be canceled if a data node goes
> offline (at least until HAWQ is able to restart failed query executors).
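> A minimal sketch of that split, assuming a hypothetical hostname convention
> for telling masters from segments (the write needs root):
>
>     import socket
>
>     # Hypothetical convention: master hostnames start with "hawq-master".
>     role = "master" if socket.gethostname().startswith("hawq-master") else "segment"
>     mode = "2" if role == "master" else "0"  # strict on master, heuristic elsewhere
>     with open("/proc/sys/vm/overcommit_memory", "w") as f:
>         f.write(mode)
>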
> If we *really* want to prevent the segments from being killed, we could
> tell the kernel to prefer killing the other processes on the node via the
> /proc/<pid>/oom_score_adj facility. Because Hadoop processes are generally
> resilient enough to restart failed containers, most Java processes can be
> treated as more expendable than HAWQ processes.
>
> /proc/<pid>/oom_score_adj ref:
> https://www.kernel.org/doc/Documentation/filesystems/proc.txt
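>
> As a rough sketch of what that could look like (Python, run as root; the
> process names and adjustment values are placeholders, and oom_score_adj
> ranges from -1000, never kill, to 1000, kill first):
>
>     import subprocess
>
>     def set_oom_score_adj(pid, score):
>         # Write the adjustment into the per-process procfs knob.
>         with open("/proc/%d/oom_score_adj" % pid, "w") as f:
>             f.write(str(score))
>
>     def pids_of(pattern):
>         out = subprocess.run(["pgrep", "-f", pattern],
>                              capture_output=True, text=True)
>         return [int(p) for p in out.stdout.split()]
>
>     for pid in pids_of("postgres"):    # HAWQ segment processes (assumed name)
>         set_oom_score_adj(pid, -500)   # prefer to keep these alive
>     for pid in pids_of("java"):        # Hadoop containers; YARN restarts them
>         set_oom_score_adj(pid, 500)    # sacrifice these first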
>
> Thanks,
>
> Taylor Vesely
>
> On Fri, Dec 16, 2016 at 7:01 AM, Ruilong Huo <rhuo@pivotal.io> wrote:
>
> > Hi HAWQ Community,
> >
> > The overcommit_memory setting in Linux controls the behaviour of memory
> > allocation. In a cluster deployed with both HAWQ and Hadoop, setting
> > overcommit_memory for the nodes is controversial: HAWQ recommends
> > overcommit strategy 2, while Hadoop recommends 1 or 0.
> >
> > This thread is to start a discussion of the options, so that we can make
> > a reasonable choice that works well for both products.
> >
> > *1. From the HAWQ perspective*
> >
> > It is recommended to use vm.overcommit_memory = 2 (rather than 0 or 1) to
> > prevent random kills of HAWQ processes and the resulting backend resets.
> >
> > If the nodes of the cluster are set to overcommit_memory = 0 or 1, there is
> > a risk that running queries get terminated due to a backend reset. Even
> > worse, with overcommit_memory = 1, there is a chance that data files and
> > transaction logs get corrupted due to insufficient cleanup during process
> > exit when OOM happens. More details on the overcommit_memory setting in
> > HAWQ can be found at: Linux-Overcommit-strategies-and-Pivotal-GPDB-HDB
> > <https://discuss.pivotal.io/hc/en-us/articles/202703383-Linux-Overcommit-strategies-and-Pivotal-Greenplum-GPDB-Pivotal-HDB-HDB->.
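> >
> > As a quick check of how a given node is configured, a minimal sketch
> > (stdlib Python only) that reports the current policy:
> >
> >     # 0 = heuristic overcommit, 1 = always overcommit,
> >     # 2 = strict accounting (the mode recommended for HAWQ).
> >     with open("/proc/sys/vm/overcommit_memory") as f:
> >         mode = int(f.read())
> >     print("vm.overcommit_memory =", mode,
> >           {0: "heuristic", 1: "always overcommit", 2: "strict"}[mode])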
> >
> > *2. From the Hadoop perspective*
> >
> > A datanode crash usually happens when there is not enough heap memory for
> > the JVM. To be specific, the JVM allocates more heap (via a malloc or mmap
> > system call) after the address space has been exhausted. When
> > overcommit_memory = 2 and we run out of available address space, the
> > system returns ENOMEM for the system call, and the JVM crashes.
> >
> > This is due to the fact that Java is very address-space greedy: it will
> > allocate large regions of address space that it isn't actually using. The
> > overcommit_memory = 2 setting doesn't actually restrict physical memory
> > use; it restricts address space use. Many applications (especially Java
> > ones) allocate sparse pages of memory and rely on the kernel/OS to provide
> > physical memory only when a page fault occurs.
> >
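> > To make the sparse-allocation point concrete, here is a minimal sketch
> > (plain Python, illustrative size) that reserves a large anonymous mapping
> > without touching any page. Under overcommit_memory = 0 or 1 the
> > reservation typically succeeds; under strict accounting (mode 2) the same
> > request is charged against CommitLimit up front and can fail with ENOMEM,
> > which is exactly what a JVM heap reservation sees:
> >
> >     import mmap
> >
> >     size = 4 << 30  # ask for 4 GiB of address space; no page is touched
> >     try:
> >         region = mmap.mmap(-1, size)  # anonymous mapping, filled on fault
> >         print("reserved 4 GiB of address space, physical use ~0")
> >         region.close()
> >     except OSError as e:
> >         print("reservation refused, as a strict-accounting JVM sees:", e)
> >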
> > Best regards,
> > Ruilong Huo
> >
>
