From: Todd Lipcon
Date: Thu, 8 Sep 2011 00:00:23 -0700
Subject: Re: Question about hdfs close * hflush behavior
To: hdfs-user@hadoop.apache.org
Reply-To: hdfs-user@hadoop.apache.org

2011/9/7 kang hua :
> Thanks, my friend!
> Please allow me to ask more questions about the details!
> 1 Yes, I can use hadoop fs -tail or -cat xxx to see that file's content,
> but how can I get that file's real size in another process if the namenode
> has not changed? What I really want is to read the data at the tail of
> that file.

You can open the file and then use an API on the DFSInputStream class to
find the length. I don't recall the name of the API, but if you look in
there, you should see it.

> 2 Why is it that "when I reboot hdfs, I can see that file's content that
> I flushed, again by 'hadoop fs -ls xxx'"?

On restart, the namenode triggers block synchronization, and the
up-to-date length is determined.

> 3 In append mode: if I close the file and open it in append mode again
> and again, the real data space increases normally, but the namenode shows
> DFS used space increasing too fast. Is that a bug?

Might be a bug, yes.

> 4 In which version of hdfs is append not buggy?

0.21, which is buggy in other aspects. So, no stable released version has
a working append() call.

In truth I've never seen a _good_ use case for append-to-an-existing-file.
Usually you can do just as well by keeping the file open and periodically
hflushing, or rolling to a new file when you want to add more records to
an existing dataset.

-Todd

>> From: todd@cloudera.com
>> Date: Wed, 7 Sep 2011 14:17:10 -0700
>> Subject: Re: Question about hdfs close * hflush behavior
>> To: hdfs-user@hadoop.apache.org
>>
>> 2011/9/7 kang hua :
>> >
>> > Hi friends:
>> > I have two questions.
>> > The first one is:
>> > I use libhdfs's hflush to flush my data to a file; in the same process
>> > context I can read it. But I find that the file looks unchanged if I
>> > check from the hadoop shell ---- its length is zero (checked by
>> > "hadoop fs -ls xxx" or by reading it in a program); however, when I
>> > reboot hdfs, I can read that file's flushed content again. Why?
>>
>> If we were to update the file metadata on hflush, it would be very
>> expensive, since the metadata lives in the NameNode.
>>
>> If you do hadoop fs -cat xxx, you should see the entirety of the
>> flushed data.
>>
>> > Can I hflush data to a file without closing it, and at the same time
>> > read the flushed data from another process?
>>
>> Yes.
>>
>> > The second one is:
>> > Once an hdfs file is closed, is the last written block untouched? Even
>> > if I open that file in append mode, will the namenode allocate a new
>> > block for the appended data?
>>
>> No, it reopens the last block of the existing file for append.
>>
>> > I find that if I close the file and open it in append mode again and
>> > again, the hdfs report will show "used space much more than the file's
>> > logical size".
>>
>> Not sure I follow what you mean by this. Can you give more detail?
>>
>> > btw: I use cloudera ch2
>>
>> The actual "append()" function has some bugs in all of the 0.20
>> releases, including Cloudera's. The hflush/sync() API is fine to use,
>> but I would recommend against using append().
>>
>> -Todd
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>

--
Todd Lipcon
Software Engineer, Cloudera
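
[Editor's note: Todd's answer to question 1 -- data made visible with hflush() can be read by another reader even though the NameNode-reported length lags -- follows the same shape as flushing a buffered local stream. Below is a minimal sketch of that pattern against the local filesystem with plain java.io, so it needs no running HDFS cluster; on HDFS the writer would hold an open FSDataOutputStream and call hflush() instead of flush(). The class and file names are illustrative, not part of the HDFS API.]

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FlushVisibility {
    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("records", ".log");

        // The writer keeps the file open, analogous to an open FSDataOutputStream.
        BufferedOutputStream out =
                new BufferedOutputStream(new FileOutputStream(log.toFile()));
        out.write("record-1\n".getBytes("UTF-8"));

        // Before the flush, an independent reader sees nothing yet.
        System.out.println("before flush: " + Files.readAllBytes(log).length + " bytes");

        // flush() pushes the buffered bytes out, analogous to hflush()
        // making data visible to new readers without closing the file.
        out.flush();
        System.out.println("after flush: " + Files.readAllBytes(log).length + " bytes");

        out.close();
        Files.delete(log);
    }
}
```

The second read here plays the role of kang hua's "other process": it opens the file independently of the writer's still-open stream.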
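
[Editor's note: Todd's closing advice -- roll to a new file instead of reopening with append() -- can be sketched as a small writer that switches to a fresh part file once the current one passes a size threshold. This is a local-filesystem illustration; the threshold, the part-file naming, and the RollingWriter helper are made up for the sketch. On HDFS each roll would be a FileSystem.create() of a new path.]

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RollingWriter {
    private final Path dir;
    private final long maxBytes; // roll threshold, chosen for illustration
    private int part = 0;

    RollingWriter(Path dir, long maxBytes) {
        this.dir = dir;
        this.maxBytes = maxBytes;
    }

    // Instead of append()ing to a previously closed file, start a fresh
    // part file once the current one reaches the threshold.
    void write(String record) throws IOException {
        Path current = dir.resolve("part-" + part);
        if (Files.exists(current) && Files.size(current) >= maxBytes) {
            part++;
            current = dir.resolve("part-" + part);
        }
        Files.write(current, (record + "\n").getBytes("UTF-8"),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("dataset");
        RollingWriter writer = new RollingWriter(dir, 20);
        for (int i = 0; i < 5; i++) {
            writer.write("record-" + i); // each record is 9 bytes with newline
        }
        List<String> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) parts.add(p.getFileName().toString());
        }
        Collections.sort(parts);
        System.out.println(parts);
    }
}
```

Five 9-byte records against a 20-byte threshold end up spread across two part files, so a reader of the "dataset" simply lists and concatenates the parts instead of depending on a working append().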