hadoop-common-issues mailing list archives

From "Allen Wittenauer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-12620) Advanced Hadoop Architecture (AHA) - Common
Date Mon, 07 Dec 2015 21:30:11 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15045772#comment-15045772 ]

Allen Wittenauer commented on HADOOP-12620:

In case it isn't obvious yet: *EVERY* update to *ANY* field which generates email sends out
the entire description.  This includes any and all comments made to the JIRA....

> Advanced Hadoop Architecture (AHA) - Common
> -------------------------------------------
>                 Key: HADOOP-12620
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12620
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Dinesh S. Atreya
> h1. Advanced Hadoop Architecture (AHA) / Advanced Hadoop Adaptabilities (AHA)
> One main motivation for this JIRA is to address a comprehensive set of use cases with
minimal enhancements to Hadoop, moving Hadoop toward an Advanced/Cloud Data Architecture.
> HDFS traditionally had a write-once-read-many access model for files until the “[Append
to files in HDFS | https://issues.apache.org/jira/browse/HADOOP-1700 ]” capability was
introduced. The next minimal enhancement to core Hadoop is the capability to do “updates-in-place”
in HDFS:
> •	Support seeks for writes (in addition to reads).
> •	After a seek, if the new byte length is the same as the old byte length, an in-place
update is allowed.
> •	A delete is an update with an appropriate delete marker.
> •	If the byte length is different, the old entry is marked as deleted and the new one
is appended as before.
> •	It is at the client’s discretion to perform an update, an append, or both, and the
API changes in the different Hadoop components should provide these capabilities.
> Please note that this JIRA is limited to one specific type of update: in-place updates
that do not change the byte length (e.g., buffer spaces are included in the length).
Updates that change the byte length are not supported in place and are treated as appends/inserts.
Similarly, deletes that create holes are not supported. The reason is simple: fragmentation
and holes cause performance penalties, complicate the process, and may require extensive
changes to Hadoop, so they are out of scope.
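The rules above can be illustrated with a small in-memory sketch. This is not HDFS code and does not use any Hadoop API: the `RecordFile` class, the one-byte markers and the 5-byte header layout are all assumptions made up for illustration.

```python
import struct

# Hypothetical record layout: 1-byte marker + 4-byte payload length + payload.
LIVE, DELETED = b"L", b"D"
HEADER = struct.Struct(">cI")


class RecordFile:
    """In-memory stand-in for a file that allows seeks for writes."""

    def __init__(self):
        self.buf = bytearray()

    def append(self, payload: bytes) -> int:
        """Append a live record and return its offset."""
        offset = len(self.buf)
        self.buf += HEADER.pack(LIVE, len(payload)) + payload
        return offset

    def read(self, offset: int):
        """Return (marker, payload) for the record stored at offset."""
        marker, length = HEADER.unpack_from(self.buf, offset)
        start = offset + HEADER.size
        return marker, bytes(self.buf[start:start + length])

    def delete(self, offset: int) -> None:
        """A delete is an in-place update of the marker byte only."""
        self.buf[offset:offset + 1] = DELETED

    def update(self, offset: int, payload: bytes) -> int:
        """Overwrite in place when the length is unchanged; otherwise
        mark the old record as deleted and append the new one."""
        marker, old = self.read(offset)
        if marker == DELETED:
            raise ValueError("record is already deleted")
        if len(payload) == len(old):
            start = offset + HEADER.size
            self.buf[start:start + len(payload)] = payload
            return offset
        self.delete(offset)
        return self.append(payload)


f = RecordFile()
first = f.append(b"alice ")          # padded to a fixed width of 6 bytes
same = f.update(first, b"bob   ")    # same length: updated in place
moved = f.update(first, b"charlie")  # different length: delete + append
```

Padding the payload ("buffer spaces") is what keeps the first update in place; the second update changes the length, so the file grows and no hole is created.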
> These minimal changes lay the basis for transforming core Hadoop into an interactive,
real-time platform and for introducing significant native capabilities to Hadoop. They
provide a foundation for all of the following processing styles to be supported natively
and dynamically:
> •	Real time 
> •	Mini-batch  
> •	Stream based data processing
> •	Batch – which is the default now.
> Hadoop engines can dynamically choose which processing style to use based on the type
and volume of the data sets, and can enhance or replace prevailing approaches.
> With this, Hadoop engines can evolve to utilize modern CPU, memory and I/O resources
with increasing efficiency. The Hadoop task engines can use vectorized/pipelined processing
and make greater use of memory throughout the Hadoop platform.
> These changes will enable enhanced performance optimizations to be implemented in HDFS
and made available to all the Hadoop components. This will enable fast processing of big data
and improve handling of its volume, velocity and variety.
> There are many influences for this umbrella JIRA:
> •	Preserve and Accelerate Hadoop
> •	Efficient Data Management of variety of Data Formats natively in Hadoop
> •	Enterprise Expansion 
> •	Internet and Media 
> •	Databases offer native support for a variety of Data Formats such as JSON, XML Indexes,
and Temporal etc. – Hadoop should do the same.
> Many sub-JIRAs will probably be created to address portions of this. This JIRA captures
a variety of use cases in one place. Some initial data management/platform use cases are
given hereunder.
> h2. WEB
> With the AHA (Advanced Hadoop Architecture) enhancements, a variety of Web standards and
formats can be natively supported, such as updateable JSON [http://json.org/], XML, RDF and other documents.
> While the origins of Hadoop can be traced to the Web, some of the [Web standards | http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html]
are not completely supported natively in Hadoop, such as HTTP PUT and PATCH (PUT and POST are
only partially supported, in terms of creation). With the proposed enhancement, POST, PUT
and PATCH (a newer addition to the Web standards) can all be completely supported natively
through Hadoop, in addition to GET.
> Hypertext Transfer Protocol -- HTTP/1.1 ([Original RFC | http://tools.ietf.org/html/rfc2616],
[Current RFC | http://tools.ietf.org/html/rfc7231] ) 
> Current RFCs:
> •	[ Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing | http://tools.ietf.org/html/rfc7230 ]
> •	[ Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content | http://tools.ietf.org/html/rfc7231 ]
> •	[ Hypertext Transfer Protocol (HTTP/1.1): Conditional Requests | http://tools.ietf.org/html/rfc7232 ]
> •	[ Hypertext Transfer Protocol (HTTP/1.1): Range Requests | http://tools.ietf.org/html/rfc7233 ]
> •	[ Hypertext Transfer Protocol (HTTP/1.1): Caching | http://tools.ietf.org/html/rfc7234 ]
> •	[ Hypertext Transfer Protocol (HTTP/1.1): Authentication | http://tools.ietf.org/html/rfc7235 ]
> RFC 5789 ([PATCH Method for HTTP | http://tools.ietf.org/html/rfc5789#section-9.1])
provides direct support for updates.
> Roy Fielding himself said that [PATCH was something he created for the initial HTTP/1.1
proposal because partial PUT is never RESTful | https://twitter.com/fielding/status/275471320685367296
]. With HTTP PATCH  you are not transferring a complete representation, but REST does not
require representations to be complete anyway. 
> The PATCH method is not idempotent in general. With the proposed enhancement, we can
formalize the behavior and provide feedback to the Web standard RFC:
> •	If the update can be carried out in place, it is idempotent.
> •	If the update creates new data (the first entry is marked as deleted and a corresponding
insert/append is made), then it is not idempotent.
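The distinction between the two cases can be sketched by applying the same request twice. The helpers `patch_in_place` and `patch_by_append` and the dictionary-based log are hypothetical stand-ins, not part of any HTTP or Hadoop API.

```python
def patch_in_place(store: bytearray, offset: int, data: bytes) -> None:
    """Same-length overwrite: repeating it leaves the store unchanged."""
    store[offset:offset + len(data)] = data


def patch_by_append(log: list, key: str, data: bytes) -> None:
    """Mark the live entry for key as deleted, then append a new entry.
    Repeating it grows the log, so the stored state differs each time."""
    for entry in log:
        if entry["key"] == key and not entry["deleted"]:
            entry["deleted"] = True
    log.append({"key": key, "data": data, "deleted": False})


store = bytearray(b"hello world")
patch_in_place(store, 6, b"HDFS!")
after_once = bytes(store)
patch_in_place(store, 6, b"HDFS!")
same_state = bytes(store) == after_once     # state is identical after a retry

log = [{"key": "k", "data": b"v1", "deleted": False}]
patch_by_append(log, "k", b"v22")
entries_after_once = len(log)
patch_by_append(log, "k", b"v22")
grew_again = len(log) > entries_after_once  # a retry changed the stored state
```

In this sketch the logical value seen by a reader is the same after the second append, but the physical state is not, which is the sense in which the second case is not idempotent.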
> h3. JSON
> Some RFCs for JSON are given hereunder.
> •	[JavaScript Object Notation (JSON) Patch | http://tools.ietf.org/html/rfc6902 ]
> •	[JSON Merge Patch | https://tools.ietf.org/html/rfc7386 ]
> h3. RDF
> RDF Schema 1.1: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/ 
> RDF Triple: http://www.w3.org/TR/2014/REC-n-triples-20140225/ 
> The simplest triple statement is a sequence of (subject, predicate, object) terms, separated
by whitespace and terminated by '.' after each triple.
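As a rough illustration of that statement form, the sketch below splits one line into its three terms. It is deliberately restricted: it assumes IRIs in angle brackets and plain double-quoted literals without escapes, and it does not handle blank nodes, language tags, datatypes or comments, so it is not a conforming N-Triples parser. The `example.org` IRIs are made up.

```python
import re

# Subject and predicate: <IRI>. Object: <IRI> or a plain "literal".
TRIPLE = re.compile(
    r'^\s*(<[^>]*>)\s+(<[^>]*>)\s+(<[^>]*>|"[^"]*")\s*\.\s*$'
)


def parse_triple(line: str):
    """Return (subject, predicate, object) for one simple triple line."""
    match = TRIPLE.match(line)
    if match is None:
        raise ValueError("not a simple triple: %r" % line)
    return match.groups()


s, p, o = parse_triple(
    '<http://example.org/john> <http://example.org/livesIn> "Smallville" .'
)
```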
> h2. Mobile Apps Data and Resources
> With the proposed enhancements, in addition to the Web, app data and resources can also
be managed using Hadoop. Examples of such usage include app data and resources
for Apple and other app stores.
> About Apps Resources: https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/Introduction/Introduction.html
> On-Demand Resources Essentials: https://developer.apple.com/library/prerelease/ios/documentation/FileManagement/Conceptual/On_Demand_Resources_Guide/
> Resource Programming Guide: https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/LoadingResources.pdf
> h2. Natural Support for ETL and Analytics
> With native support for updates and deletes in addition to appends/inserts, Hadoop will
have proper and natural support for ETL and Analytics.
> h2. Key-Value Store
> With the proposed enhancements, it will become very convenient to implement Key-Value
Store natively in Hadoop.
> h2. MVCC (Multi Version Concurrency Control)
> A modified example, adapted from PostgreSQL MVCC, of how MVCC can be implemented with
the proposed enhancements is given hereunder. https://wiki.postgresql.org/wiki/MVCC
> http://momjian.us/main/writings/pgsql/mvcc.pdf
> || Data ID || Activity || Data Create Counter || Data Expiry Counter || Comments
> | 1 | Insert | 40 | MAX_VAL | Conventionally MAX_VAL is null. In order to maintain update size, MAX_VAL is pre-seeded for our purposes.
> | 1 | Delete | 40 | 47 | Marked as deleted when the current counter was 47.
> | 2 | Update (old Delete) | 64 | 78 | Mark old data as DELETE.
> | 2 | Update (new insert) | 78 | MAX_VAL | Insert new data.
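The table can be read as a visibility rule over fixed-width counters. The sketch below uses the counter values from the table; the `Version` class, the 8-byte counter encoding and the rule "visible when create counter <= current counter < expiry counter" are assumptions for illustration, simplified from how PostgreSQL-style MVCC is usually described.

```python
import struct
from dataclasses import dataclass

MAX_VAL = 2**64 - 1                 # pre-seeded instead of null
COUNTERS = struct.Struct(">QQ")     # create counter + expiry counter


@dataclass
class Version:
    data_id: int
    created: int
    expired: int = MAX_VAL
    payload: str = ""

    def header(self) -> bytes:
        """Fixed-width encoding of the two counters."""
        return COUNTERS.pack(self.created, self.expired)


def visible(versions, counter):
    """Versions that exist at the given counter value (assumed rule)."""
    return [v for v in versions if v.created <= counter < v.expired]


versions = [
    Version(1, 40, 47, "deleted at 47"),
    Version(2, 64, 78, "old value"),
    Version(2, 78, MAX_VAL, "new value"),
]
at_70 = [v.payload for v in visible(versions, 70)]
at_78 = [v.payload for v in visible(versions, 78)]

# Expiring a version only rewrites the expiry counter, and the encoded
# header has the same size before and after.
same_size = len(Version(1, 40).header()) == len(Version(1, 40, 47).header())
```

Because MAX_VAL occupies the same number of bytes as any real counter, marking a version as expired is a same-length update and can be done in place under the rules of this proposal.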
> h2. Graph Stores
> Enable native storage and processing for a variety of graph stores. 
> h3. Graph Store 1 (Spark GraphX)
> 1. EdgeTable(pid, src, dst, data): stores the adjacency
> structure and edge data. Each edge is represented as a
> tuple consisting of the source vertex id, destination vertex id,
> and user-defined data as well as a virtual partition identifier
> (pid). Note that the edge table contains only the vertex ids
> and not the vertex data. The edge table is partitioned by the
> pid.
> 2. VertexDataTable(id, data): stores the vertex data,
> in the form of vertex (id, data) pairs. The vertex data table
> is indexed and partitioned by the vertex id.
> 3. VertexMap(id, pid): provides a mapping from the id
> of a vertex to the ids of the virtual partitions that contain
> adjacent edges.
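The three tables can be sketched with plain Python containers. This does not use Spark or the GraphX API; `build_tables` is a hypothetical helper, and assigning an edge to partition `src % num_partitions` is an arbitrary choice made only to keep the example small.

```python
from collections import defaultdict


def build_tables(edges, vertex_data, num_partitions=2):
    """Build EdgeTable, VertexDataTable and VertexMap from an edge list.

    edges: iterable of (src, dst, data) tuples
    vertex_data: iterable of (id, data) pairs
    """
    edge_table = []                  # rows of (pid, src, dst, data)
    vertex_map = defaultdict(set)    # vertex id -> pids holding its edges
    for src, dst, data in edges:
        pid = src % num_partitions   # hypothetical partitioning rule
        edge_table.append((pid, src, dst, data))
        vertex_map[src].add(pid)
        vertex_map[dst].add(pid)
    vertex_data_table = dict(vertex_data)   # indexed by vertex id
    return edge_table, vertex_data_table, dict(vertex_map)


edge_table, vertex_data_table, vertex_map = build_tables(
    edges=[(1, 2, "follows"), (2, 3, "likes")],
    vertex_data=[(1, "alice"), (2, "bob"), (3, "carol")],
)
```

Note that the edge table carries only vertex ids, so updating a vertex's data touches the vertex data table alone.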
> h3. Graph Store 2 (Facebook Social Graph - TAO)
> Object:  (id) → (otype,(key → value)∗ )
> Assoc.: (id1,atype,id2) → (time,(key → value) ∗ )
> TAO: Facebook’s Distributed Data Store for the Social Graph
> https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson
> https://cs.uwaterloo.ca/~brecht/courses/854-Emerging-2014/readings/data-store/tao-facebook-distributed-datastore-atc-2013.pdf
> TAO: The power of the graph
> https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-thegraph/10151525983993920
> h2. Temporal Data 
> https://en.wikipedia.org/wiki/Temporal_database 
> https://en.wikipedia.org/wiki/Valid_time 
> In temporal data, records may be updated to reflect changes in the facts they describe.
> For example, the data may change from
> Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
> Person(John Doe, Bigtown, 26-Aug-1994, 1-Apr-2001)
> to
> Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
> Person(John Doe, Bigtown, 26-Aug-1994, 1-Jun-1995)
> Person(John Doe, Beachy, 1-Jun-1995, 3-Sep-2000)
> Person(John Doe, Bigtown, 3-Sep-2000, 1-Apr-2001)
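The change above amounts to splitting one validity interval around a newly recorded stay. A minimal sketch, assuming half-open intervals (start inclusive, end exclusive) and a hypothetical helper `apply_change`:

```python
from datetime import date


def apply_change(history, city, start, end):
    """Insert (city, start, end) and trim or split overlapping intervals."""
    result = []
    for old_city, old_start, old_end in history:
        if old_end <= start or old_start >= end:
            result.append((old_city, old_start, old_end))  # no overlap
            continue
        if old_start < start:
            result.append((old_city, old_start, start))    # part before
        if old_end > end:
            result.append((old_city, end, old_end))        # part after
    result.append((city, start, end))
    return sorted(result, key=lambda row: row[1])


before = [
    ("Smallville", date(1975, 4, 3), date(1994, 8, 26)),
    ("Bigtown", date(1994, 8, 26), date(2001, 4, 1)),
]
after = apply_change(before, "Beachy", date(1995, 6, 1), date(2000, 9, 3))
```

If the dates are stored at a fixed width, shortening the first Bigtown row is a same-length update, while the Beachy row and the second Bigtown row are appends.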
> h2. Media
> Media production typically involves a lot of changes and updates prior to release. The
enhancements will lay a basis for the full lifecycle to be managed in the Hadoop ecosystem.
> h2. Indexes
> With the changes, a variety of updatable indexes can be supported natively in Hadoop.
Search software such as Solr, ElasticSearch etc. can then in turn leverage Hadoop’s enhanced
native capabilities. 
> h2. Google References
> While Google’s research in this area is interesting (some papers are listed hereunder),
the evolution of Hadoop is also quite interesting. The proposed enhancements to support in-place
updates in core Hadoop will make a variety of enhancements easier for each of
the Hadoop components, and they have a variety of influences, as indicated in this JIRA.
> We propose a basis for a system that incrementally processes updates to large
data sets and reduces the overhead of always having to run large batches. Hadoop engines can
dynamically choose which processing style to use based on the type and volume of the data sets,
and can enhance or replace prevailing approaches.
> || Year || Title || Links
> | 2015 | Announcing Google Cloud Bigtable: The same database that powers Google Search, Gmail and Analytics is now available on Google Cloud Platform | http://googlecloudplatform.blogspot.co.uk/2015/05/introducing-Google-Cloud-Bigtable.html https://cloud.google.com/bigtable/
> | 2014 | Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/42851.pdf
> | 2013 | F1: A Distributed SQL Database That Scales | http://research.google.com/pubs/pub41344.html
> | 2013 | Online, Asynchronous Schema Change in F1 | http://research.google.com/pubs/pub41376.html
> | 2013 | Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41318.pdf
> | 2012 | F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business |
> | 2012 | Spanner: Google's Globally-Distributed Database | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/39966.pdf
> | 2012 | Clydesdale: structured data processing on MapReduce | http://dl.acm.org/citation.cfm?doid=2247596.2247600
> | 2011 | Megastore: Providing Scalable, Highly Available Storage for Interactive Services | http://research.google.com/pubs/pub36971.html
> | 2011 | Tenzing A SQL Implementation On The MapReduce Framework | http://research.google.com/pubs/pub37200.html
> | 2010 | Dremel: Interactive Analysis of Web-Scale Datasets | http://research.google.com/pubs/pub36632.html
> | 2010 | FlumeJava: Easy, Efficient Data-Parallel Pipelines | http://research.google.com/pubs/pub35650.html
> | 2010 | Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications | http://research.google.com/pubs/pub36726.html https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Peng.pdf
> h2. Application Domains
> The enhancements will lay a path for comprehensive support of all application domains
in Hadoop. A small collection is given hereunder.
> Data Warehousing and Enhanced ETL processing  
> Supply Chain Planning
> Web Sites 
> Mobile App Stores
> Financials 
> Media 
> Machine Learning
> Social Media
> Enterprise Applications such as ERP, CRM 
> Corresponding umbrella JIRAs can be found for each of the following Hadoop platform components.

This message was sent by Atlassian JIRA
