Subject: Re: Why other processes can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?
From: Devin Suiter RDX <dsuiter@rdx.com>
Date: Thu, 19 Dec 2013 08:55:46 -0500
To: user@hadoop.apache.org

Hello,

In my experience with Flume, watching the HDFS Sink verbose output, I know that even after a file has been flushed but is still open, it reads as a 0-byte file, even if there is actually data contained in it.

A HDFS "file" is a meta-location that can acc= ept streaming input for as long as it is open, so the length cannot be math= ematically defined until a start and an end are in place.

The flush operation moves data from a buffer to a storage medium, but I don't think that necessarily means it tells the HDFS RecordWriter to place the "end of stream/EOF" marker down, since the "file" meta-location in HDFS is a pile of actual files around the cluster on physical disk that HDFS presents to you as one file. The HDFS "file" and the physical file splits on disk are distinct, and I would suspect that your HDFS flush calls are forcing Hadoop to move the physical file splits from their datanode buffers to disk, but are not telling HDFS that you expect no further input - that is what the HDFS close will do.

One thing you could try - instead of asking for the length property, which probably isn't updated until the close call, try asking for/viewing the contents of the file.
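
For instance, a small libhdfs read loop like this - just an untested sketch, assuming you already have a connected hdfsFS handle and the path you are writing to - counts how many bytes a brand-new reader can actually see:

    #include <fcntl.h>
    #include "hdfs.h"

    /* Open the file as a fresh reader and count the bytes that are
     * actually visible, rather than trusting the reported length. */
    long visible_bytes(hdfsFS fs, const char *path)
    {
        hdfsFile f = hdfsOpenFile(fs, path, O_RDONLY, 0, 0, 0);
        if (f == NULL)
            return -1;

        char buf[4096];
        long total = 0;
        tSize n;
        while ((n = hdfsRead(fs, f, buf, sizeof(buf))) > 0)
            total += n;

        hdfsCloseFile(fs, f);
        return total;
    }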

Your scenario step 3 says "according to the header hdfs.h, after this call returns, new readers should be able to see the data," which isn't the same as "new readers can obtain an updated property value from the file metadata" - one is looking at the data inside the container, and the other is asking the container to describe itself.
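
Concretely, through the webhdfs interface (the hostname, port, and path here are made up), a metadata call like

    http://namenode:50070/webhdfs/v1/tmp/test.log?op=GETFILESTATUS

may well keep reporting "length":0 while the file is open, whereas reading the contents back with

    http://namenode:50070/webhdfs/v1/tmp/test.log?op=OPEN

should return whatever bytes have already been flushed.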

I hope that helps with your problem!


Devin Suiter
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Thu, Dec 19, 2013 at 7:50 AM, Xiaobin She <xiaobinshe@gmail.com> wrote:

Sorry to reply to my own thread.

Does anyone know the answer to this question?
If so, can you please tell me if my understanding is right or wrong?

Thanks.



2013/12/17 Xiaobin She <xiaobinshe@gmail.com>
Hi,

I'm using libhdfs to deal with HDFS in a C++ program.

And I have encountered a problem.

Here is the scenario (a simplified sketch of the calls follows the list):
1. First I call hdfsOpenFile with the O_WRONLY flag to open a file.
2. Call hdfsWrite to write some data.
3. Call hdfsHFlush to flush the data; according to the header hdfs.h, after this call returns, new readers should be able to see the data.
4. Use an HTTP GET request to get the file list of that directory through the webhdfs interface (I have to use webhdfs here because I need to deal with symlink files).
5. In the JSON response returned by webhdfs, I found that the length of the file is still 0.
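
In code, the sequence is roughly this (a simplified sketch; the path and data are made up):

    #include <fcntl.h>
    #include <string.h>
    #include "hdfs.h"

    int main(void)
    {
        hdfsFS fs = hdfsConnect("default", 0);     /* use the configured namenode */
        hdfsFile f = hdfsOpenFile(fs, "/tmp/test.log", O_WRONLY, 0, 0, 0);

        const char *msg = "some data\n";
        hdfsWrite(fs, f, msg, (tSize)strlen(msg)); /* step 2 */
        hdfsHFlush(fs, f);                         /* step 3 */

        /* steps 4-5: listing the directory through webhdfs at this point
         * still reports length 0; it only updates after the close below. */
        hdfsCloseFile(fs, f);
        hdfsDisconnect(fs);
        return 0;
    }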

I have tried replacing hdfsHFlush with hdfsFlush or hdfsSync, and calling all three together, but it still doesn't work.

But if I call hdfsCloseFile after hdfsHFlush, then I do get the correct file length through the webhdfs interface.


Is this right? I mean, if you want another process to see the change of data, do you need to call hdfsCloseFile?

Or is there something I did wrong?

Thank you very much for your help.





