From: Peyman Mohajerian <mohajeri@gmail.com>
Date: Mon, 30 Sep 2013 10:40:13 -0700
Subject: Re: File formats in Hadoop: Sequence files vs AVRO vs RC vs ORC
To: user@hadoop.apache.org
Cc: wolfgang.wyremba@hotmail.com

It is not recommended to keep data at rest in SequenceFile format, because it is Java-specific and you cannot easily share it with non-Java systems; it is, however, well suited to running MapReduce jobs. One approach would be to bring all the data, in its different formats, into HDFS as is and then convert it to the single format that works best for you, depending on whether you will export this data again or not (in addition to many other considerations). But, as already mentioned, Hive can read any of these formats directly.

On Mon, Sep 30, 2013 at 1:08 AM, Raj K Singh <rajkrrsingh@gmail.com> wrote:
> For XML file processing, Hadoop comes with a class for this purpose called
> StreamXmlRecordReader. You can use it by setting your input format to
> StreamInputFormat and setting the stream.recordreader.class property to
> org.apache.hadoop.streaming.StreamXmlRecordReader.
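A rough, driver-side sketch of that setup with the old "mapred" API is below. Treat it as an illustration only: the <record> begin/end tags and the XmlJobDriver/YourXmlMapper names are placeholders, not anything defined in this thread; StreamInputFormat, StreamXmlRecordReader and the stream.recordreader.* properties are the actual streaming pieces.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamInputFormat;

// Sketch: configure a job so the mapper receives whole XML elements instead
// of arbitrary line splits. Requires the hadoop-streaming jar on the classpath.
public class XmlJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(XmlJobDriver.class);
    conf.setJobName("xml-ingest-sketch");

    // StreamInputFormat delegates record splitting to the reader named below.
    conf.setInputFormat(StreamInputFormat.class);
    conf.set("stream.recordreader.class",
             "org.apache.hadoop.streaming.StreamXmlRecordReader");
    // Begin/end patterns that mark one logical record; adjust to your documents.
    conf.set("stream.recordreader.begin", "<record>");
    conf.set("stream.recordreader.end", "</record>");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    // conf.setMapperClass(YourXmlMapper.class);  // placeholder mapper class

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

With that in place, each map call receives one <record>...</record> span as a single record rather than a line, so the mapper can parse the XML per record.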
>
> For JSON files, an open-source project, ElephantBird, which contains some
> useful utilities for working with LZO compression, has an LzoJsonInputFormat
> that can read JSON, but it requires the input file to be LZOP-compressed.
> We'll use this code as a template for our own JSON InputFormat, which
> doesn't have the LZOP compression requirement.
>
> If you are dealing with small files, the SequenceFile format comes to the
> rescue: it stores sequences of binary key-value pairs. Sequence files are
> well suited as a format for MapReduce data since they are splittable and
> support compression.
>
> ::::::::::::::::::::::::::::::::::::::::
> Raj K Singh
> http://in.linkedin.com/in/rajkrrsingh
> http://www.rajkrrsingh.blogspot.com
> Mobile Tel: +91 (0)9899821370
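A note on the JSON route above: if LZO is not in the picture and your JSON files are newline-delimited (one object per line), a simpler option than ElephantBird is the default TextInputFormat plus a JSON parser inside the mapper. A minimal sketch using Jackson follows; Jackson itself and the "id" field name are assumptions for illustration, not something mandated by Hadoop.

import java.io.IOException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: each map input value is one line of text (TextInputFormat), which is
// parsed as a standalone JSON object and re-emitted keyed by a chosen field.
public class JsonLineMapper extends Mapper<LongWritable, Text, Text, Text> {
  private static final ObjectMapper JSON = new ObjectMapper();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    JsonNode record = JSON.readTree(line.toString());
    JsonNode id = record.get("id");           // "id" is a placeholder field name
    if (id != null) {
      ctx.write(new Text(id.asText()), new Text(record.toString()));
    }
  }
}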
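And on the small-files point: a common way to get there is a small packing tool that rolls a directory of little files into one SequenceFile, using the file name as the key and the raw bytes as the value. The sketch below assumes a Hadoop 2.x client, a local source directory in args[0] and an output path on the cluster's default filesystem in args[1]; the class name and paths are made up for illustration.

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

// Sketch: pack many small local files into one block-compressed SequenceFile
// so MapReduce sees a few splittable containers instead of millions of tiny
// files (key = original file name, value = raw file bytes).
public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path(args[1]);                       // e.g. /data/xml/packed.seq
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(out),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class),
        SequenceFile.Writer.compression(CompressionType.BLOCK));
    try {
      for (File f : new File(args[0]).listFiles()) {    // local source directory
        byte[] bytes = Files.readAllBytes(f.toPath());
        writer.append(new Text(f.getName()), new BytesWritable(bytes));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}

Block compression keeps the container splittable at its sync points, which is what makes the packed file friendly to MapReduce.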
>
> On Mon, Sep 30, 2013 at 1:10 PM, Wolfgang Wyremba
> <wolfgang.wyremba@hotmail.com> wrote:
>
>> Hello,
>>
>> the file format topic is still confusing me, and I would appreciate it if
>> you could share your thoughts and experience with me.
>>
>> From reading different books/articles/websites I understand that
>> - Sequence files (used frequently, but not only, for binary data),
>> - AVRO,
>> - RC (developed to work best with Hive - columnar storage) and
>> - ORC (a successor of RC to give Hive another performance boost - Stinger
>>   initiative)
>> are all container file formats that solve the "small files problem" and
>> all support compression and splitting.
>> Additionally, each file format was developed with specific
>> features/benefits in mind.
>>
>> Imagine I have the following text source data:
>> - 1 TB of XML documents (some millions of small files)
>> - 1 TB of JSON documents (some hundreds of thousands of medium-sized files)
>> - 1 TB of Apache log files (some thousands of bigger files)
>>
>> How should I store this data in HDFS to process it using Java MapReduce,
>> Pig and Hive?
>> I want to use the best tool for my specific problem - with the "best"
>> performance, of course - i.e. maybe one problem on the Apache log data is
>> best solved using Java MapReduce, another one using Hive or Pig.
>>
>> Should I simply put the data into HDFS as it comes in - i.e. as plain text
>> files?
>> Or should I convert all my data to a container file format like sequence
>> files, AVRO, RC or ORC?
>>
>> Based on this example, I believe
>> - the XML documents will need to be converted to a container file format
>>   to overcome the "small files problem",
>> - the JSON documents could/should not be affected by the "small files
>>   problem",
>> - the Apache log files should definitely not be affected by the "small
>>   files problem", so they could be stored as plain text files.
>>
>> So, some source data needs to be converted to a container file format,
>> other data not necessarily.
>> But what is really advisable?
>>
>> Is it advisable to store all data (XML, JSON, Apache logs) in one specific
>> container file format in the cluster - let's say you decide to use
>> sequence files?
>> Having only one file format in HDFS is of course a benefit in terms of
>> managing the files and writing Java MapReduce/Pig/Hive code against it.
>> Sequence files in this case are certainly not a bad idea, but Hive queries
>> could probably benefit more from, let's say, RC/ORC.
>>
>> Therefore, is it better to use a mix of plain text files and/or one or
>> more container file formats simultaneously?
>>
>> I know that there will be no crystal-clear answer here, as it always
>> "depends", but what approach should be taken here, or what is usually
>> used in the community out there?
>>
>> I welcome any feedback and the experiences you have made.
>>
>> Thanks