From: java8964 <java8964@hotmail.com>
To: user@hadoop.apache.org
Subject: RE: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?
Date: Fri, 20 Dec 2013 09:13:56 -0500

I don't think that in HDFS a file can be written concurrently. Process B won't be able to write the file (but it can read it) until it is CLOSED by process A.

Yong

Date: Fri, 20 Dec 2013 15:55:00 +0800
Subject: Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?
From: xiaobinshe@gmail.com
To: user@hadoop.apache.org

To Peyman,

thank you for your reply.

So the property of the file is stored in the namenode, and it will not be updated until the file is closed.

But won't this cause some problems?

For example:
1. process A opens the file in write mode, writes 1 MB of data, flushes the data, and holds the file handle open
2. process B opens the file in write mode, writes 1 MB of data, flushes the data
3. process B closes the file; at this point, other processes should be able to see the new length of the file
4. process C opens the file in read mode, gets the length of the file, and reads that many bytes of data

So at this point, the last 1 MB of data that process C read was written by process A? Am I right?
If this is right, then process C reads 1 MB more data because process B closed the file, but the data it reads was written by process A.

This seems a little weird.

2013/12/20 Peyman Mohajerian <mohajeri@gmail.com>

Ok, I just read the book section on this (the Definitive Guide to Hadoop), just to be sure: the length of a file is stored in the Name Node, and it is updated only after the client calls the Name Node after closing the file.
At that point, if the Name Node has received all the ACKs from the Data Nodes, then it will set the length metadata (e.g. minimum replication is met). So it is one of the last steps, and it is done this way for performance reasons; the client decides when it is done writing.

On Thu, Dec 19, 2013 at 8:36 AM, Xiaobin She <xiaobinshe@gmail.com> wrote:

To Devin,

thank you very much for your explanation.

I did find that I can read the data out of the file even if I did not close the file I'm writing to (the read operation is done on another file handle opened on the same file, but still within the same process), which made me more confused at the time, because I thought: since I can read the data from the file, why can't I get the length of the file correctly?

But from the explanation you have given, I think I can understand it now.

So it seems that in order to do what I want (write some data to the file, and then get the length of the file through the webhdfs interface), I have to open and close the file every time I do the write operation.

Thank you very much again.

xiaobinshe

2013/12/19 Devin Suiter RDX <dsuiter@rdx.com>

Hello,

In my experience with Flume, watching the HDFS Sink verbose output, I know that even after a file has been flushed, while it is still open, it reads as a 0-byte file, even if there is actually data contained in it.

An HDFS "file" is a meta-location that can accept streaming input for as long as it is open, so the length cannot be mathematically defined until a start and an end are in place.

The flush operation moves data from a buffer to a storage medium, but I don't think that necessarily means that it tells the HDFS RecordWriter to place the "end of stream/EOF" marker down, since the "file" meta-location in HDFS is a pile of actual files around the cluster on physical disk that HDFS presents to you as one file.
The HDFS "file" and the physical file splits on disk are distinct, and I would suspect that your HDFS flush calls are forcing Hadoop to move the physical filesplits from their physical datanode buffers to disk, but are not telling HDFS that you expect no further input - that is what the HDFS close will do.

One thing you could try - instead of asking for the length property, which is probably unavailable until the close call, try asking for/viewing the contents of the file.

Your scenario step 3 says "according to the header hdfs.h, after this call returns, new readers should be able to see the data", which isn't the same as "new readers can obtain an updated property value from the file metadata" - one is looking at the data inside the container, and the other is asking the container to describe itself.

I hope that helps with your problem!

Devin Suiter
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

On Thu, Dec 19, 2013 at 7:50 AM, Xiaobin She <xiaobinshe@gmail.com> wrote:

sorry to reply to my own thread.

Does anyone know the answer to this question?
If so, can you please tell me if my understanding is right or wrong?

thanks.

2013/12/17 Xiaobin She <xiaobinshe@gmail.com>

hi,

I'm using libhdfs to deal with hdfs in a C++ programme.

And I have encountered a problem.

Here is the scenario:
1. first I call hdfsOpenFile with the O_WRONLY flag to open a file
2. call hdfsWrite to write some data
3. call hdfsHFlush to flush the data; according to the header hdfs.h, after this call returns, new readers should be able to see the data
4. I use an http get request to get the file list in that directory through the webhdfs interface; here I have to use the webhdfs interface because I need to deal with symlink files
5. from the json response which is returned by webhdfs, I found that the length of the file is still 0

I have tried to replace hdfsHFlush with hdfsFlush or hdfsSync, or to call these three together, but it still doesn't work.

But if I call hdfsCloseFile after I call hdfsHFlush, then I can get the correct file length through the webhdfs interface.

Is this right? I mean, if you want another process to see the change of data, do you need to call hdfsCloseFile?

Or is there something I did wrong?

thank you very much for your help.