Subject: Re: what exactly does data in HDFS look like?
From: Bertrand Dechoux
To: user@hadoop.apache.org
Date: Sat, 19 Jul 2014 13:34:56 +0200

But basically you are right: it is the same concept as with a classical
file system. A file is seen as a sequence of bytes. For various efficiency
reasons, the whole sequence is not stored in one piece but is first split
into blocks (subsequences). With a local file system, these blocks sit on
the local drives. With HDFS, they are spread across the cluster (and
replicated, of course).

So really, the filesystem doesn't care about what is inside the file; the
format is something it is entirely oblivious to.

Bertrand
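As a concrete illustration of that block layout, here is a minimal sketch
(assuming a Hadoop client with core-site.xml on the classpath; the file
path is hypothetical) that asks the FileSystem API where the blocks of a
file actually live:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the core-site.xml on the classpath
        FileSystem fs = FileSystem.get(new Configuration());

        Path path = new Path("/data/huge.json"); // hypothetical file
        FileStatus status = fs.getFileStatus(path);

        // One BlockLocation per block; each one lists the datanodes
        // holding a replica of that block
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
    }
}

Each printed line is one block (subsequence) of the file together with the
nodes storing its replicas.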
On Sat, Jul 19, 2014 at 7:02 AM, Shahab Yunus <shahab.yunus@gmail.com> wrote:

> The data itself is eventually stored in the form of files. Each block of
> the file, and its replicas, are stored as files in directories on
> different nodes. The NameNode keeps and maintains the information about
> each file and where its blocks (and replica blocks) exist in the cluster.
>
> As for the format, it is stored as bytes. In the normal case you use the
> DFS or FileOutputStream classes to write data, and in those instances it
> is written in byte form (conversion to bytes, i.e. serializing the data).
> When you read the data, you use the counterpart classes like InputStream,
> and those convert the data from bytes back to text (i.e.
> deserialization). The point being, HDFS is oblivious to whether it was
> JSON or XML.
>
> This would be more evident if you look at code that reads/writes from
> HDFS (writing example at the link below; a minimal sketch is also
> appended after the quoted thread):
>
> https://sites.google.com/site/hadoopandhive/home/how-to-write-a-file-in-hdfs-using-hadoop
>
> On the other hand, if you were using compression or other storage formats
> like Avro or Parquet, then those formats come with their own classes
> which take care of serialization and deserialization.
>
> For basic cases, this should be helpful:
>
> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-3/data-flow
>
> More here on data storage:
>
> http://stackoverflow.com/questions/2358402/where-hdfs-stores-files-locally-by-default
> http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Organization
> https://developer.yahoo.com/hadoop/tutorial/module1.html#data
>
> Regards,
> Shahab
>
> On Sat, Jul 19, 2014 at 12:12 AM, Adaryl "Bob" Wakefield, MBA
> <adaryl.wakefield@hotmail.com> wrote:
>
>> And by that I mean: is there an HDFS file type? I feel like I'm missing
>> something. Let's say I have a HUGE JSON file that I import into HDFS.
>> Does it retain its JSON format in HDFS? What if it's just random tweets
>> I'm streaming? Is it kind of like a normal disk, where there are all
>> kinds of files sitting on disk in their own format, it's just that in
>> HDFS they are spread out over nodes?
>>
>> B.
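To make the write/read path Shahab describes concrete, here is a minimal
sketch (not from the original exchange) using the stock
org.apache.hadoop.fs.FileSystem API; the fs.defaultFS value and the path
are made up for illustration. HDFS only ever sees the bytes; the
"JSON-ness" lives entirely in the application's serialization and
deserialization:

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBytesDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode address; normally read from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/demo.json"); // hypothetical path

        // Write: the JSON string is serialized to plain bytes before
        // it ever reaches HDFS
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("{\"tweet\":\"hello\"}".getBytes(StandardCharsets.UTF_8));
        }

        // Read: HDFS hands the same bytes back; deserializing them into
        // JSON (or anything else) is the application's job
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[(int) fs.getFileStatus(path).getLen()];
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }
    }
}

The same applies to the Avro/Parquet case above: their writers and readers
are simply doing this serialization and deserialization for you.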