Subject: Re: replication in HDFS
From: Zheng Da
To: hdfs-dev@hadoop.apache.org
Date: Mon, 31 Oct 2011 15:50:08 -0400

Hello Ram,

Sorry, I didn't notice your reply.

I don't really have a complete design in mind. I was wondering whether the
community is interested in an alternative scheme for data reliability, and
whether there are plans to pursue one.

You are right: we might need to buffer the source blocks on the local disk,
and parity blocks might not gain us much when we try to achieve the same
reliability that a small replication factor provides. I think a larger HDFS
cluster needs a large replication factor (>= 3), right? Furthermore, network
bandwidth is a scarce resource, so saving it matters even more.

Thanks,
Da
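For concreteness, a rough back-of-the-envelope sketch of that bandwidth/space
tradeoff, assuming 64 MB blocks and a hypothetical (10,4) erasure-coded layout
(illustrative numbers only, not HDFS-RAID defaults):

// A rough, illustrative comparison of bytes written per stripe under 3x
// replication vs. a hypothetical (10,4) erasure code. The block size and
// code parameters below are assumptions, not HDFS-RAID settings.
public class OverheadSketch {
    public static void main(String[] args) {
        final double blockMB = 64;     // assumed HDFS block size
        final int dataBlocks = 10;     // source blocks per stripe (assumption)
        final int parityBlocks = 4;    // parity blocks per stripe (assumption)
        final int replication = 3;

        double logicalMB = dataBlocks * blockMB;
        double replicatedMB = logicalMB * replication;
        double codedMB = (dataBlocks + parityBlocks) * blockMB;

        System.out.printf("3x replication: %.0f MB written per %.0f MB of data (%.1fx)%n",
                replicatedMB, logicalMB, replicatedMB / logicalMB);
        System.out.printf("(10,4) erasure code: %.0f MB written per %.0f MB of data (%.1fx)%n",
                codedMB, logicalMB, codedMB / logicalMB);
    }
}

Under these assumptions the coded layout writes roughly 1.4x the logical data
instead of 3x, which is where the disk and network savings would come from.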
On Tue, Oct 25, 2011 at 12:51 AM, Ramkumar Vadali wrote:

> (Sorry for the delay in replying.)
>
> Hi Zheng,
>
> You are right about HDFS RAID. It is used to save space and is not involved
> in the file write path. The generation of parity blocks and the reduction of
> the replication factor happen after a configurable amount of time.
>
> What is the design you have in mind? When an HDFS file is being written, the
> data is generated block by block. But generating parity blocks requires
> multiple source blocks to be ready, so the writer will need to buffer the
> original data, either in memory or on disk. If it is saved on disk because
> of memory pressure, will this be similar to writing the file with
> replication 2?
>
> Ram
>
>
> On Thu, Oct 13, 2011 at 1:16 AM, Zheng Da wrote:
>
>> Hello all,
>>
>> Right now HDFS still uses simple replication to increase data reliability.
>> Even though it works, it wastes disk space, network bandwidth, and disk
>> bandwidth. For data-intensive applications (those that need to write large
>> results to HDFS), it limits the throughput of MapReduce. It is also very
>> energy-inefficient.
>>
>> Is the community trying to use erasure codes to increase data reliability?
>> I know someone is working on HDFS-RAID, but it only addresses disk space.
>> In many cases, network and disk bandwidth are more important, since they
>> are the factors limiting the throughput of MapReduce. Has anyone tried to
>> use erasure codes to reduce the amount of data written to HDFS? I know that
>> reducing the number of replicas might hurt read performance, but I think it
>> is still important to reduce the amount of data written to HDFS.
>>
>> Thanks,
>> Da
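To make the buffering point above concrete, a minimal sketch of parity
generation over buffered blocks. Plain XOR parity stands in for whatever
erasure code would actually be used; the point is only that every source
block in the stripe has to be available before the parity can be computed.

import java.util.Arrays;
import java.util.List;

// Minimal sketch: computing a parity block requires the whole stripe of
// source blocks to be buffered, in memory or on disk, before any parity
// bytes can be produced. XOR parity is used here purely as an illustration.
public class ParityBufferSketch {
    static byte[] xorParity(List<byte[]> bufferedBlocks) {
        byte[] parity = new byte[bufferedBlocks.get(0).length];
        for (byte[] block : bufferedBlocks) {   // needs ALL blocks of the stripe
            for (int i = 0; i < parity.length; i++) {
                parity[i] ^= block[i];
            }
        }
        return parity;
    }

    public static void main(String[] args) {
        byte[] b1 = {1, 2, 3, 4};
        byte[] b2 = {5, 6, 7, 8};
        byte[] b3 = {9, 10, 11, 12};
        System.out.println(Arrays.toString(xorParity(Arrays.asList(b1, b2, b3))));
    }
}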