hadoop-user mailing list archives

From Rahul Bhattacharjee <rahul.rec....@gmail.com>
Subject Re: File formats in Hadoop: Sequence files vs AVRO vs RC vs ORC
Date Mon, 30 Sep 2013 17:45:46 GMT
Sequence files are language-neutral, as Avro is. That said, I am not sure how
well libraries in other languages support processing sequence files.


On Mon, Sep 30, 2013 at 11:10 PM, Peyman Mohajerian <mohajeri@gmail.com> wrote:

> It is not recommended to keep data at rest in sequence file format,
> because it is Java-specific and cannot easily be shared with non-Java
> systems; it is, however, ideal for running MapReduce jobs. One approach would
> be to bring all the data in its different formats into HDFS as is, and then
> convert it to a single format that works best for you, depending on
> whether you will export this data or not (in addition to many other
> considerations). But as already mentioned, Hive can directly read any of
> these formats.
> On Mon, Sep 30, 2013 at 1:08 AM, Raj K Singh <rajkrrsingh@gmail.com> wrote:
>> For processing XML files, Hadoop comes with a class for this purpose
>> called StreamXmlRecordReader. You can use it by setting your input format
>> to StreamInputFormat and setting the
>> stream.recordreader.class property to
>> org.apache.hadoop.streaming.StreamXmlRecordReader.
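The XML reader settings mentioned above could be collected as in the following minimal Python sketch. It only builds the property values and renders them as -D options; it does not submit a job. The property names stream.recordreader.begin and stream.recordreader.end, the <record> tags, and both helper names are assumptions for illustration and do not come from the message.

```python
# Sketch: collect the job properties for reading XML records with
# StreamXmlRecordReader. Nothing here talks to a cluster.

READER_CLASS = "org.apache.hadoop.streaming.StreamXmlRecordReader"


def xml_reader_properties(begin_tag, end_tag):
    """Return the job properties as a dict (hypothetical helper)."""
    return {
        "stream.recordreader.class": READER_CLASS,
        # Assumed property names for the record delimiters.
        "stream.recordreader.begin": begin_tag,
        "stream.recordreader.end": end_tag,
    }


def as_generic_options(properties):
    """Render the properties as a flat list of -D key=value arguments."""
    args = []
    for key, value in properties.items():
        args.extend(["-D", f"{key}={value}"])
    return args


# <record> and </record> are placeholder tags; use the element that
# delimits one record in your own XML documents.
props = xml_reader_properties("<record>", "</record>")
options = as_generic_options(props)
```

The resulting list can be spliced into whatever command line or job configuration you already use; check the option syntax against the Hadoop version on your cluster.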
>> For JSON files, the open-source project ElephantBird, which contains some
>> useful utilities for working with LZO compression, has an
>> LzoJsonInputFormat that can read JSON, but it requires the input
>> file to be LZOP-compressed. Its code can serve as a template for your own JSON
>> InputFormat that does not have the LZOP compression requirement.
>> If you are dealing with small files, the sequence file format comes to the
>> rescue: it stores sequences of binary key-value pairs. Sequence files
>> are well suited as a format for MapReduce data because they are
>> splittable and support compression.
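To illustrate the idea of packing many small files into one container of binary key-value pairs, here is a minimal Python sketch. It is not the SequenceFile on-disk format (which also has a header, sync markers and optional compression) and it does not use any Hadoop API. The length-prefixed record layout, the file names and the helper names are all assumptions made for the example.

```python
import io
import struct


def write_record(stream, key, value):
    """Append one record: 4-byte key length, 4-byte value length, key, value."""
    stream.write(struct.pack(">II", len(key), len(value)))
    stream.write(key)
    stream.write(value)


def read_records(stream):
    """Yield (key, value) pairs until the stream is exhausted."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        key_len, value_len = struct.unpack(">II", header)
        key = stream.read(key_len)
        value = stream.read(value_len)
        yield key, value


# Hypothetical small files: file name as key, file content as value.
small_files = {
    b"a.xml": b"<doc>1</doc>",
    b"b.xml": b"<doc>2</doc>",
}

# Pack all small files into a single in-memory container.
container = io.BytesIO()
for name, content in small_files.items():
    write_record(container, name, content)

# Read them back in the order they were written.
container.seek(0)
unpacked = list(read_records(container))
```

Using the file name as the key is one common choice for this kind of packing; any key that identifies a record would work equally well.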
>> ::::::::::::::::::::::::::::::::::::::::
>> Raj K Singh
>> http://in.linkedin.com/in/rajkrrsingh
>> http://www.rajkrrsingh.blogspot.com
>> Mobile  Tel: +91 (0)9899821370
>> On Mon, Sep 30, 2013 at 1:10 PM, Wolfgang Wyremba <
>> wolfgang.wyremba@hotmail.com> wrote:
>>> Hello,
>>> the file format topic is still confusing me and I would appreciate it if you
>>> could share your thoughts and experience with me.
>>> From reading different books/articles/websites I understand that
>>> - Sequence files (used frequently but not only for binary data),
>>> - AVRO,
>>> - RC (developed to work best with Hive; columnar storage) and
>>> - ORC (a successor of RC to give Hive another performance boost; Stinger
>>> initiative)
>>> are all container file formats to solve the "small files problem" and all
>>> support compression and splitting.
>>> Additionally, each file format was developed with specific
>>> features/benefits in mind.
>>> Imagine I have the following text source data
>>> - 1 TB of XML documents (some millions of small files)
>>> - 1 TB of JSON documents (some hundred thousands of medium sized files)
>>> - 1 TB of Apache log files (some thousands of bigger files)
>>> How should I store this data in HDFS to process it using Java MapReduce,
>>> Pig, and Hive?
>>> I want to use the best tool for my specific problem - with "best"
>>> performance of course - i.e. maybe one problem on the Apache log data
>>> can be best solved using Java MapReduce, another one using Hive or Pig.
>>> Should I simply put the data into HDFS as it arrives - i.e. as
>>> plain text files?
>>> Or should I convert all my data to a container file format like sequence
>>> files, AVRO, RC or ORC?
>>> Based on this example, I believe
>>> - the XML documents will need to be converted to a container file
>>> format to overcome the "small files problem".
>>> - the JSON documents could/should not be affected by the "small files
>>> problem".
>>> - the Apache files should definitely not be affected by the "small files
>>> problem", so they could be stored as plain text files.
>>> So, some source data needs to be converted to a container file format,
>>> others not necessarily.
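A rough way to check the small-files reasoning above is to compare the average file size of each data set with the HDFS block size. The Python sketch below does that arithmetic. The file counts (3,000,000, 300,000 and 3,000) are hypothetical stand-ins for "some millions", "some hundred thousands" and "some thousands", and the 64 MB block size is an assumed setting; neither comes from the message.

```python
TB = 1024 ** 4
MB = 1024 ** 2
BLOCK_SIZE = 64 * MB  # assumed HDFS block size; check your cluster setting

# Hypothetical file counts for the three 1 TB data sets.
file_counts = {
    "xml": 3_000_000,
    "json": 300_000,
    "apache_logs": 3_000,
}


def average_file_size(total_bytes, file_count):
    """Average size of one file in bytes."""
    return total_bytes / file_count


def blocks_per_file(avg_size, block_size=BLOCK_SIZE):
    """Average file size expressed as a fraction (or multiple) of one block."""
    return avg_size / block_size


report = {
    name: blocks_per_file(average_file_size(TB, count))
    for name, count in file_counts.items()
}

for name, ratio in report.items():
    print(f"{name}: average file is {ratio:.4f} of a block")
```

With these assumed counts, only the log files average more than one block per file. The picture changes with the real file counts and the block size configured on the cluster, so it is worth rerunning the calculation with your own numbers.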
>>> But what is really advisable?
>>> Is it advisable to store all data (XML, JSON, Apache logs) in one
>>> specific container file format in the cluster - let's say you decide to
>>> use sequence files?
>>> Having only one file format in HDFS is of course a benefit in terms of
>>> managing the files and writing Java MapReduce/Pig/Hive code against it.
>>> Sequence files in this case are certainly not a bad idea, but Hive queries
>>> would probably benefit more from, let's say, RC/ORC.
>>> Therefore, is it better to use a mix of plain text files and/or one or
>>> more container file formats simultaneously?
>>> I know that there will be no crystal-clear answer here as it always
>>> "depends", but what approach should be taken here, or what is usually
>>> used in the community out there?
>>> I welcome any feedback and experience you have.
>>> Thanks
