parquet-commits mailing list archives

Subject svn commit: r1610824 [10/10] - in /incubator/parquet/site: ./ source/ source/assets/ source/assets/css/ source/assets/fonts/ source/assets/img/ source/assets/js/ source/documentation/ source/documentation/latest/ source/layouts/
Date Tue, 15 Jul 2014 19:30:19 GMT
Added: incubator/parquet/site/source/
--- incubator/parquet/site/source/ (added)
+++ incubator/parquet/site/source/ Tue Jul 15 19:30:18 2014
@@ -0,0 +1,107 @@
+# How To Contribute
+## Pull Requests
+We prefer to receive contributions in the form of GitHub pull requests. Please send pull requests against the []( repository. If you've previously forked Parquet from its old location, you will need to add a remote or update your origin remote to
+Here are a few tips to get your contribution in:
+  1. Break your work into small, single-purpose patches if possible. It's much harder to merge in a large change with a lot of disjoint features.
+  2. Create a JIRA for your patch on the [Parquet Project JIRA](
+  3. Submit the patch as a GitHub pull request against the master branch. For a tutorial, see the GitHub guides on forking a repo and sending a pull request. Prefix your pull request name with the JIRA name (ex:
+  4. Make sure that your code passes the unit tests. You can run the tests with `mvn test` in the root directory.
+  5. Add new unit tests for your code.
+If you'd like to report a bug but don't have time to fix it, you can still post it to our [issue tracker](, or email the mailing list (
+## Committers
+Merging a pull request requires being a committer on the project.
+How to merge a pull request (assuming you have apache and github-apache remotes set up):
+	git remote add github-apache
+	git remote add apache
+Run the following command:
+	dev/
+Example output:
+	Which pull request would you like to merge? (e.g. 34):
+Type the pull request number (from and hit enter.
+	=== Pull Request #X ===
+	title	Blah Blah Blah
+	source	repo/branch
+	target	master
+	url
+	Proceed with merging pull request #3? (y/n): 
+If this looks good, type y and hit enter.
+	From
+	* [new branch]      master     -> PR_TOOL_MERGE_PR_3_MASTER
+	Switched to branch 'PR_TOOL_MERGE_PR_3_MASTER'
+	Merge complete (local ref PR_TOOL_MERGE_PR_3_MASTER). Push to apache? (y/n):
+A local branch with the merge has been created. Type `y` and hit enter to push it to apache.
+	Counting objects: 67, done.
+	Delta compression using up to 4 threads.
+	Compressing objects: 100% (26/26), done.
+	Writing objects: 100% (36/36), 5.32 KiB, done.
+	Total 36 (delta 17), reused 0 (delta 0)
+	To
+	   b767ac4..485658a  PR_TOOL_MERGE_PR_X_MASTER -> master
+	Restoring head pointer to b767ac4e
+	Note: checking out 'b767ac4e'.
+	You are in 'detached HEAD' state. You can look around, make experimental
+	changes and commit them, and you can discard any commits you make in this
+	state without impacting any branches by performing another checkout.
+	If you want to create a new branch to retain commits you create, you may
+	do so (now or later) by using -b with the checkout command again. Example:
+	  git checkout -b new_branch_name
+	HEAD is now at b767ac4... Update
+	Deleting local branch PR_TOOL_MERGE_PR_X
+	Deleting local branch PR_TOOL_MERGE_PR_X_MASTER
+	Pull request #X merged!
+	Merge hash: 485658a5
+	Would you like to pick 485658a5 into another branch? (y/n):
+For now, type `n`, as we only have one branch.
+## Website
+We use middleman to generate the website content from markdown and other 
+dynamic templates. The following steps assume you have a working 
+Ruby environment set up:
+	gem install bundler
+	bundle install
+### Generating the website
+To generate the static website for Apache Parquet, run the following command:
+	bundle exec middleman build
+### Live Development 
+Live development of the site enables automatic reload when changes are saved. 
+To enable it, run the following command, then open a browser and navigate to 
+	bundle exec middleman 
+### Publishing the Site
+The website uses svnpubsub. The publish folder contains the website's content,
+and when committed to the svn repository it will be automatically deployed to 
+the live site. 

Added: incubator/parquet/site/source/
--- incubator/parquet/site/source/ (added)
+++ incubator/parquet/site/source/ Tue Jul 15 19:30:18 2014
@@ -0,0 +1,13 @@
+# Developers
+<div class="row-fluid">
+  <h4 name="reportbugs">Report or track a bug</h4>
+  <p>New bugs can be reported on our <a href="">Jira issue tracker</a>. In order to create a new issue, you'll need to sign up for an account.</p>
+  <h4 name="contribute">Contribute a core patch</h4>
+  <p>Follow our <a href="/docs/howtocontribute/">contribution guidelines</a> when submitting a patch.</p>
+  <h4 name="ircchannel">IRC Channel</h4>
+	<p>Many of the Parquet developers and users chat in the #parquet channel on</p>
+	<p>If you are new to IRC, you can use a <a href="">web-based

Added: incubator/parquet/site/source/documentation/
--- incubator/parquet/site/source/documentation/ (added)
+++ incubator/parquet/site/source/documentation/ Tue Jul 15 19:30:18 2014
@@ -0,0 +1,204 @@
+## Motivation
+We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.
+Parquet is built from the ground up with complex nested data structures in mind, and uses the [record shredding and assembly algorithm]( described in the Dremel paper. We believe this approach is superior to simple flattening of nested name spaces.
+Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.
+Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult-to-set-up dependencies.
+## Modules
+The `parquet-format` project contains format specifications and Thrift definitions of metadata required to properly read Parquet files.
+The `parquet-mr` project contains multiple sub-modules, which implement the core components of reading and writing a nested, column-oriented data stream, map this core onto the Parquet format, and provide Hadoop Input/Output Formats, Pig loaders, and other Java-based utilities for interacting with Parquet.
+The `parquet-compatibility` project contains compatibility tests that can be used to verify that implementations in different languages can read and write each other's files.
+## Building
+Java resources can be built using `mvn package`. The current stable version should always be available from Maven Central.
+C++ Thrift resources can be generated via `make`.
+Thrift definitions can also be code-generated into any other Thrift-supported language.
+## Glossary
+  - Block (hdfs block): This means a block in hdfs and the meaning is 
+    unchanged for describing this file format.  The file format is 
+    designed to work well on top of hdfs.
+  - File: An hdfs file that must include the metadata for the file.
+    It does not need to actually contain the data.
+  - Row group: A logical horizontal partitioning of the data into rows.
+    There is no physical structure that is guaranteed for a row group.
+    A row group consists of a column chunk for each column in the dataset.
+  - Column chunk: A chunk of the data for a particular column.  Column chunks
+    live in a particular row group and are guaranteed to be contiguous in the file.
+  - Page: Column chunks are divided up into pages.  A page is conceptually
+    an indivisible unit (in terms of compression and encoding).  There can
+    be multiple page types, which are interleaved in a column chunk.
+Hierarchically, a file consists of one or more row groups.  A row group
+contains exactly one column chunk per column.  Column chunks contain one or
+more pages. 
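The containment hierarchy above can be sketched with a few data classes (the names here are for exposition only and are not part of the format):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Page:
    """Indivisible unit of encoding and compression."""
    encoding: str
    data: bytes

@dataclass
class ColumnChunk:
    """All data for one column within one row group; contiguous in the file."""
    column: str
    pages: List[Page] = field(default_factory=list)

@dataclass
class RowGroup:
    """Exactly one column chunk per column in the schema."""
    chunks: List[ColumnChunk] = field(default_factory=list)

@dataclass
class ParquetFile:
    """One or more row groups (file metadata omitted from this sketch)."""
    row_groups: List[RowGroup] = field(default_factory=list)
```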
+## Unit of parallelization
+  - MapReduce - File/Row Group
+  - IO - Column chunk
+  - Encoding/Compression - Page
+## File format
+This file and the thrift definition should be read together to understand the format.
+    4-byte magic number "PAR1"
+    <Column 1 Chunk 1 + Column Metadata>
+    <Column 2 Chunk 1 + Column Metadata>
+    ...
+    <Column N Chunk 1 + Column Metadata>
+    <Column 1 Chunk 2 + Column Metadata>
+    <Column 2 Chunk 2 + Column Metadata>
+    ...
+    <Column N Chunk 2 + Column Metadata>
+    ...
+    <Column 1 Chunk M + Column Metadata>
+    <Column 2 Chunk M + Column Metadata>
+    ...
+    <Column N Chunk M + Column Metadata>
+    File Metadata
+    4-byte length in bytes of file metadata
+    4-byte magic number "PAR1"
+In the above example, there are N columns in this table, split into M row 
+groups.  The file metadata contains the start locations of all the column 
+metadata.  More details on what is contained in the metadata can be found 
+in the thrift files.
+Metadata is written after the data to allow for single pass writing.
+Readers are expected to first read the file metadata to find all the column 
+chunks they are interested in.  The column chunks should then be read sequentially.
+ ![File Layout](
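Because the file ends with a 4-byte length of the file metadata followed by the 4-byte magic, a reader can locate the metadata by seeking to the end of the file. A minimal sketch (assuming the whole file fits in a byte buffer; real readers seek instead, and the length is little-endian):

```python
import struct

MAGIC = b"PAR1"

def locate_footer_metadata(buf: bytes):
    """Return (offset, length) of the file metadata in a Parquet byte buffer.

    Layout at the tail of the file:
      <file metadata> <4-byte little-endian metadata length> <"PAR1">
    """
    if len(buf) < 12 or buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    # The 4 bytes just before the trailing magic hold the metadata length.
    (meta_len,) = struct.unpack("<I", buf[-8:-4])
    meta_start = len(buf) - 8 - meta_len
    if meta_start < 4:
        raise ValueError("corrupt metadata length")
    return meta_start, meta_len
```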
+## Metadata
+There are three types of metadata: file metadata, column (chunk) metadata and page
+header metadata.  All thrift structures are serialized using the TCompactProtocol.
+ ![Metadata diagram](
+## Types
+The types supported by the file format are intended to be as minimal as possible,
+with a focus on how the types affect on-disk storage.  For example, 16-bit ints
+are not explicitly supported in the storage format since they are covered by
+32-bit ints with an efficient encoding.  This reduces the complexity of implementing
+readers and writers for the format.  The types are:
+  - BOOLEAN: 1 bit boolean
+  - INT32: 32 bit signed ints
+  - INT64: 64 bit signed ints
+  - INT96: 96 bit signed ints
+  - FLOAT: IEEE 32-bit floating point values
+  - DOUBLE: IEEE 64-bit floating point values
+  - BYTE_ARRAY: arbitrarily long byte arrays.
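As an illustration of how minimal these types are on disk, a simplified sketch of the format's plain encoding for fixed-width types, which lays values out back to back, little-endian, with no per-value header (a sketch, not the full specification):

```python
import struct

def plain_encode_int32(values):
    # INT32 values packed back to back, 4 bytes each, little-endian.
    return b"".join(struct.pack("<i", v) for v in values)

def plain_encode_double(values):
    # DOUBLE values as IEEE 754 64-bit little-endian, 8 bytes each.
    return b"".join(struct.pack("<d", v) for v in values)
```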
+### Logical Types
+Logical types are used to extend the types that parquet can be used to store,
+by specifying how the primitive types should be interpreted. This keeps the set
+of primitive types to a minimum and reuses parquet's efficient encodings. For
+example, strings are stored as byte arrays (binary) with a UTF8 annotation.
+These annotations define how to further decode and interpret the data.
+Annotations are stored as a `ConvertedType` in the file metadata and are
+documented in
+## Nested Encoding
+To encode nested columns, Parquet uses the Dremel encoding with definition and 
+repetition levels.  Definition levels specify how many optional fields in the 
+path for the column are defined.  Repetition levels specify at which repeated field
+in the path the value is repeated.  The max definition and repetition levels can
+be computed from the schema (i.e. how much nesting there is).  This defines the
+maximum number of bits required to store the levels (levels are defined for all
+values in the column).  
+Two encodings for the levels are supported: BIT_PACKED and RLE.  Only RLE is now used, as it supersedes BIT_PACKED.
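The max levels can be derived by walking the path from the root to the column: each non-required field adds one to the max definition level, and each repeated field adds one to the max repetition level. A sketch (the 'required'/'optional'/'repeated' labels mirror the schema's field repetition):

```python
def max_levels(path):
    """Compute (max definition level, max repetition level) for one column.

    `path` lists the repetition of each field from the root to the leaf,
    e.g. ["optional", "repeated", "required"].
    """
    max_def = sum(1 for r in path if r != "required")  # optional + repeated
    max_rep = sum(1 for r in path if r == "repeated")
    return max_def, max_rep
```

For example, a required top-level field needs no levels at all, `max_levels(["required"]) == (0, 0)`, while an optional group containing a repeated leaf gives `(2, 1)`; the max level then bounds the number of bits needed to store each level.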
+## Nulls
+Nullity is encoded in the definition levels (which are run-length encoded).  NULL values 
+are not encoded in the data.  For example, in a non-nested schema, a column with 1000 NULLs
+would be encoded with run-length encoding (0, 1000 times) for the definition levels and
+nothing else.
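The 1000-NULL example can be pictured with a toy run-length encoder over definition levels (this sketch ignores Parquet's actual hybrid RLE/bit-packing wire format):

```python
from itertools import groupby

def run_length_encode(levels):
    # Collapse runs of equal definition levels into (value, count) pairs.
    return [(value, len(list(run))) for value, run in groupby(levels)]

# A non-nested optional column holding 1000 NULLs: every definition level
# is 0, so the whole column collapses to a single run.
print(run_length_encode([0] * 1000))  # [(0, 1000)]
```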
+## Data Pages
+For data pages, the 3 pieces of information are encoded back to back, after the page
+header.  We have the 
+ - definition levels data,  
+ - repetition levels data, 
+ - encoded values.
+The size specified in the header is for all 3 pieces combined.
+The data for the data page is always required.  The definition and repetition levels
+are optional, based on the schema definition.  If the column is not nested (i.e.
+the path to the column has length 1), we do not encode the repetition levels (it would
+always have the value 1).  For data that is required, the definition levels are
+skipped (if encoded, it will always have the value of the max definition level). 
+For example, in the case where the column is non-nested and required, the data in the
+page is only the encoded values.
+The supported encodings are described in [](
+## Column chunks
+Column chunks are composed of pages written back to back.  The pages share a common 
+header and readers can skip over pages they are not interested in.  The data for the 
+page follows the header and can be compressed and/or encoded.  The compression and 
+encoding is specified in the page metadata.
+## Checksumming
+Data pages can be individually checksummed.  This allows disabling of checksums at the 
+HDFS file level, to better support single row lookups.
+## Error recovery
+If the file metadata is corrupt, the file is lost.  If the column metadata is corrupt, 
+that column chunk is lost (but column chunks for this column in other row groups are 
+okay).  If a page header is corrupt, the remaining pages in that chunk are lost.  If 
+the data within a page is corrupt, that page is lost.  The file will be more 
+resilient to corruption with smaller row groups.
+Potential extension: With smaller row groups, the biggest issue is placing the file 
+metadata at the end.  If an error happens while writing the file metadata, all the 
+data written will be unreadable.  This can be fixed by writing the file metadata 
+every Nth row group.  
+Each file metadata would be cumulative and include all the row groups written so 
+far.  Combining this with the strategy used for RC or Avro files using sync markers, 
+a reader could recover partially written files.  
+## Separating metadata and column data
+The format is explicitly designed to separate the metadata from the data.  This
+allows splitting columns into multiple files, as well as having a single metadata
+file reference multiple parquet files.  
+## Configurations
+- Row group size: Larger row groups allow for larger column chunks which makes it 
+possible to do larger sequential IO.  Larger groups also require more buffering in 
+the write path (or a two pass write).  We recommend large row groups (512MB - 1GB). 
+Since an entire row group might need to be read, we want it to completely fit on 
+one HDFS block.  Therefore, HDFS block sizes should also be set to be larger.  An 
+optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block 
+per HDFS file.
+- Data page size: Data pages should be considered indivisible so smaller data pages 
+allow for more fine grained reading (e.g. single row lookup).  Larger page sizes 
+incur less space overhead (fewer page headers) and potentially less parsing overhead 
+(processing headers).  Note: for sequential scans, it is not expected to read a page 
+at a time; this is not the IO chunk.  We recommend 8KB for page sizes.
+## Extensibility
+There are many places in the format for compatible extensions:
+- File Version: The file metadata contains a version.
+- Encodings: Encodings are specified by enum and more can be added in the future.  
+- Page types: Additional page types can be added and safely skipped.

Added: incubator/parquet/site/source/
--- incubator/parquet/site/source/ (added)
+++ incubator/parquet/site/source/ Tue Jul 15 19:30:18 2014
@@ -0,0 +1,23 @@
+# Downloads
+The Parquet team recently moved to the Apache Software Foundation and is working to publish its first release there.
+### Downloading from the Maven central repository
+The Parquet team publishes its [releases to Maven Central](
+Add the following dependency section to your pom.xml:
+	<dependencies>
+	  ...
+	  <dependency>
+	    <groupId>com.twitter</groupId>
+	    <artifactId>parquet</artifactId>
+	    <version>2.1.0</version> <!-- or latest version -->
+	  </dependency>
+	  ...
+	</dependencies>
+### Older Releases
+Older releases can be found on GitHub: 

Added: incubator/parquet/site/source/
--- incubator/parquet/site/source/ (added)
+++ incubator/parquet/site/source/ Tue Jul 15 19:30:18 2014
@@ -0,0 +1,25 @@
+<!-- masthead -->
+<div class="masthead">
+	<div class="jumbotron">
+	  <div>
+	    <p class="lead">Apache Parquet is a <a href="">columnar
storage</a> format available to any project in the Hadoop ecosystem, regardless of the
choice of data processing framework, data model or programming language.</p>
+	  </div>
+	</div>
+</div><!-- /masthead -->
+<!-- overviewsection -->
+<div class="row">
+            <div class="col-lg-6">
+           	 <h3>Parquet Videos (more <a href="/presentations">presentations</a>)</h3>
+            <iframe width="460" height="315" src="//"
frameborder="0" allowfullscreen></iframe>
+	    </div>
+            <div class="col-lg-6">
+                <h3>News</h3>
+		<p>
+                <a class="twitter-timeline" href="" data-widget-id="487276069435633664">Tweets
by @ApacheParquet</a>
+<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);;js.src=p+"://";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script>
+                </p>
+            </div>

Added: incubator/parquet/site/source/layouts/layout.erb
--- incubator/parquet/site/source/layouts/layout.erb (added)
+++ incubator/parquet/site/source/layouts/layout.erb Tue Jul 15 19:30:18 2014
@@ -0,0 +1,38 @@
+    <head>
+        <meta charset="utf-8">
+        <title>Apache Parquet</title>
+		    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+		    <meta name="description" content="">
+		    <meta name="author" content="">
+		    <link href="/assets/css/bootstrap.css" rel="stylesheet">
+		    <link href="/assets/css/bootstrap-theme.css" rel="stylesheet">
+                    <link href="/assets/css/font-awesome.css" rel="stylesheet">
+		    <!-- JS -->
+		    <script type="text/javascript" src="/assets/js/jquery-2.1.1.min.js"></script>
+		    <script type="text/javascript" src="/assets/js/bootstrap.js"></script>
+				<!-- Analytics -->
+				<script type="text/javascript">
+					  var _gaq = _gaq || [];
+					  _gaq.push(['_setAccount', 'UA-45879646-1']);
+					  _gaq.push(['_setDomainName', '']);
+					  _gaq.push(['_trackPageview']);
+					  (function() {
+					    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async
= true;
+					    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www')
+ '';
+					    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga,
+					  })();
+				</script>
+	</head>
+    <body>	
+      <%= partial "header" %>	
+      <div class="container">
+        <%= yield %>
+	  </div>
+      <%= partial "footer" %>
+	</body>

Added: incubator/parquet/site/source/
--- incubator/parquet/site/source/ (added)
+++ incubator/parquet/site/source/ Tue Jul 15 19:30:18 2014
@@ -0,0 +1,21 @@
+# Presentations
+## Videos
+### Hadoop Summit 2014: Efficient Data Storage for Analytics with Parquet 2.0
+<iframe width="560" height="315" src="//"
frameborder="0" allowfullscreen></iframe>
+### #CONF 2014: Parquet Format at Twitter
+<iframe width="560" height="315" src="//" frameborder="0" allowfullscreen></iframe>
+## Slides
+### Hadoop Summit 2014: Efficient Data Storage for Analytics with Parquet 2.0
+<iframe src="//" width="512" height="421"
frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;
border-width:1px 1px 0; margin-bottom:5px; max-width: 100%;" allowfullscreen> </iframe>
<div style="margin-bottom:5px"> <strong> <a href=""
title="Efficient Data Storage for Analytics with Apache Parquet 2.0" target="_blank">Efficient
Data Storage for Analytics with Apache Parquet 2.0</a> </strong> from <strong><a
href="" target="_blank">Cloudera, Inc.</a></strong> </div>
+### Strata 2013: Parquet: Columnar storage for the people
+<iframe src="//" width="512" height="421"
frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;
border-width:1px 1px 0; margin-bottom:5px; max-width: 100%;" allowfullscreen> </iframe>
<div style="margin-bottom:5px"> <strong> <a href=""
title="Parquet Strata/Hadoop World, New York 2013" target="_blank">Parquet Strata/Hadoop
World, New York 2013</a> </strong> from <strong><a href=""
target="_blank">Julien Le Dem</a></strong> </div>
