hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "Hbase/ShellPlans" by stack
Date Thu, 12 Jul 2007 19:12:52 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by stack:

The comment on the change is:
Page two of hbase shell split (after chatting with Edward Yoon)

New page:
* Work in progress


= Introduction =
A basic version of the [wiki:Hbase/HbaseShell HBase Shell] was added to HBase in July 2007.
This page discusses future HBase Shell features and directions.

= HBase Shell Goals =
 * Simplified import/export/migrate functionality between different data sources (Hadoop,
 * Simplified processing of a logical data model
 * Simplified algebraic operations
 * Simplified parallel numerical analysis by abstracting/numericalizing points, lines, or plane data across multiple maps in HBase.

== HBase Shell Background ==

I expect Hadoop + HBase to handle sparsity and data explosion very well in the near future. Moreover,
I believe the design of the multi-dimensional structure and the 3-dimensional space model of the data
are optimized for rapid ad-hoc information retrieval in any orientation, as well as for fast,
flexible calculation and transformation of raw data based on formulaic relationships.

Then, I thought it would require a more user-friendly interface to enable querying the data

=== Rationale ===

It will probably take a while for Hadoop + HBase to provide a reliable real-time service like
other DBMSs. Thus, I decided to develop a shell for linear algebra computation and large-scale
data processing using Hadoop's parallel processing and HBase storage.

''Then you may ask "How is this different from MapReduce using MapFiles?"''

I don't expect it to deliver high performance just yet,
but it will surely make data management and development much easier.
First, let's take a look at HBase's data model.

HBase provides a unified data model that represents data in three dimensions:
Row, Column, and Timestamp. Rows and columns may also be extended without limit.

If we slice the data model at a fixed time version, then we may view the new data as a 2D
If the index is a string, we may view it as a huge map. If the index is an integer, then it is one
huge 2D array.
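
The slicing idea above can be illustrated with a small in-memory sketch. This is not HBase code: the `cells` dictionary and the `put` and `slice_at` functions are hypothetical names, and the rule that a slice keeps the newest value at or before the chosen timestamp is an assumption made for the example. It only shows how fixing the timestamp dimension turns a 3-dimensional store into a 2D map.

```python
# Illustrative only: a toy in-memory model, not the HBase API.
# cells maps a (row, column, timestamp) coordinate to a value.
cells = {}


def put(row, column, timestamp, value):
    """Store one value at a (row, column, timestamp) coordinate."""
    cells[(row, column, timestamp)] = value


def slice_at(timestamp):
    """Return a 2D view {row: {column: value}}.

    Assumption: for each (row, column) the view keeps the newest value
    whose timestamp is less than or equal to the requested timestamp.
    """
    newest = {}  # (row, column) -> (timestamp, value)
    for (row, column, ts), value in cells.items():
        if ts <= timestamp:
            seen = newest.get((row, column))
            if seen is None or ts > seen[0]:
                newest[(row, column)] = (ts, value)
    view = {}
    for (row, column), (_, value) in newest.items():
        view.setdefault(row, {})[column] = value
    return view


put("movie1", "info:length", 1, 95)
put("movie1", "info:length", 2, 120)
put("movie2", "info:year", 1, 1980)

# With string keys the slice behaves like a nested map.
view = slice_at(1)
```

With integer row and column keys the same view could be read as a sparse 2D array.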

So each table may hold such 3D data stores (column families).
A locality group (column family) is a relationship that can occur between multiple references
whenever one reference brings in much of the data used by the other references.

  ''-- I hope physical files on networks are grouped together with locality grouping.[[BR]]by

== People Involved ==

 * [:udanax:Edward Yoon] [[MailTo(udanax AT SPAMFREE nhncorp DOT com)]] (NHN corp.)
 * [:boyo:Sewon Kim] [[MailTo(ebow31 AT SPAMFREE gmail.com)]] (Empas, Inc.)
 * [:mskim:Minsu Kim] [[MailTo(minsu.kim AT SPAMFREE gmail.com)]] (Daum, Inc.)

= Suggested Future HBase Shell Operators =
'''Note''' that data should be located by row, column, and timestamp.

== Commands ==
||<bgcolor="#ececec">'''Command''' ||<bgcolor="#ececec">'''Explanation''' ||
||Substitute || '''Substitute''' expression to [A~Z][[BR]][[BR]]~-''X = Matrix(table_name,
||Store ||The '''STORE''' command stores results to the specified table. [[BR]][[BR]]~-''A = Table('movieLog_table');[[BR]]B = A.Selection('length' > 100);[[BR]]STORE B TO X run_style;''-~ ||
||Set ||The '''SET''' command changes the values. [[BR]][[BR]]~-''SET table_name[[BR]] VALUES('columnfamily_name:column_key','entry')[[BR]]WHERE row='row_key' AND time='Specified_Timestamp';''-~ ||
== Relational Operators ==

||<bgcolor="#ececec">'''Operator''' ||<bgcolor="#ececec">'''Explanation''' ||
||Projection ||<99%>'''Projection''' of a relation ~+R+~ makes a new relation as the set obtained when all tuples (rows) in ~+R+~ are restricted to the set {columnfamily,,1,,,...,columnfamily,,n,,}.[[BR]][[BR]]~-''A = Table('movieLog_table');[[BR]]B = A.Projection('year','length');''-~||
||Selection ||<99%>'''Selection''' of a relation ~+R+~ makes a new relation as the set of specified tuples (rows) of the relation ~+R+~.[[BR]]'''Set Operations''': ~-''OR, AND, NOT''-~[[BR]][[BR]]~-''A = Table('movieLog_table');[[BR]]B = A.Selection('length' > 100);[[BR]]C = A.Selection('length' > 100 AND 'year' > 1979);''-~||
||Product ||<99%>'''Product''' of relations R and S makes a new relation as the set of all possible combinations of tuples of the two operand relations.[[BR]]'''NOTE''' that this is the most computationally expensive operator in the relational algebra.||
||Rename ||<99%>'''Rename''' r to x: the columnfamily names in the columnfamily-list replace the columnfamily names of the relation.[[BR]][[BR]]~-''A = Table('movieLog_table');[[BR]]B = A.Rename('length' = 'movieLength');''-~||
||Group ||<99%>'''Group''' tuples by the value of an attribute and apply an aggregate function independently to each group of tuples.[[BR]]'''Aggregate Functions''': ~-''AVG( attribute ), SUM( attribute ), COUNT( attribute ), MIN( attribute ), MAX( attribute )''-~[[BR]][[BR]]~-''A = Table('movieLog_table');[[BR]]B = A.Group('studioName', MIN('year'));''-~||
||Sort ||<99%>'''Sort''' the tuples (rows) of R, ordered according to the columnfamilies in the columnfamily-list.[[BR]][[BR]]~-''A = Table('movieLog_table');[[BR]]B = A.Sort('length', 'vote');''-~||
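
To make the semantics of the operators in the table concrete, here is a minimal Python sketch that models a relation as a list of dictionaries. It is not the proposed shell and not an HBase API: the function names mirror the operator names above, but their signatures, the sample `movies` data, and the use of Python callables in place of the shell's predicate and aggregate syntax are all assumptions made for illustration.

```python
# Illustrative only: relations modelled as lists of dicts, not the HBase Shell.
from itertools import groupby

# Hypothetical sample data standing in for 'movieLog_table'.
movies = [
    {"title": "a", "studioName": "s1", "year": 1975, "length": 90},
    {"title": "b", "studioName": "s1", "year": 1985, "length": 130},
    {"title": "c", "studioName": "s2", "year": 1990, "length": 110},
]


def projection(rel, *columns):
    """Keep only the named columns of every row."""
    return [{c: row[c] for c in columns} for row in rel]


def selection(rel, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rel if predicate(row)]


def product(r, s):
    """Pair every row of r with every row of s.

    Assumes r and s have disjoint column names; the result has
    len(r) * len(s) rows, which is why this operator is expensive.
    """
    return [{**a, **b} for a in r for b in s]


def rename(rel, old, new):
    """Rename column `old` to `new` in every row."""
    return [{(new if k == old else k): v for k, v in row.items()} for row in rel]


def group(rel, key, column, aggregate):
    """Group rows by `key` and apply `aggregate` to `column` per group."""
    ordered = sorted(rel, key=lambda row: row[key])  # groupby needs sorted input
    return [
        {key: k, column: aggregate(row[column] for row in rows)}
        for k, rows in groupby(ordered, key=lambda row: row[key])
    ]


def sort(rel, *columns):
    """Order rows by the listed columns, left to right."""
    return sorted(rel, key=lambda row: tuple(row[c] for c in columns))


# Rough equivalents of the shell examples in the table.
b = projection(movies, "year", "length")
c = selection(movies, lambda row: row["length"] > 100 and row["year"] > 1979)
d = group(movies, "studioName", "year", min)
```

Each function returns a new list and leaves its input untouched, which matches the table's description of every operator as making a new relation.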

== Matrix Operators ==

 * Matrix operators

||<bgcolor="#ececec">'''Operator''' ||<bgcolor="#ececec">'''Explanation''' ||
||Addition ||<99%>... ||
||Subtraction ||<99%>... ||
||Multiplication ||<99%>... ||
||Division ||<99%>... ||
||Transpose ||<99%>interchanging rows and columns ||
||Permutation ||<99%>... ||
||Norms ||<99%>... ||
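
Since the page does not yet specify how these operators would work, the following is only a sketch of what three of them compute, using a sparse matrix stored as a dictionary keyed by (row, column) index pairs. The representation and the function names `transpose`, `add`, and `multiply` are assumptions chosen for illustration; they are not part of HBase or of any planned shell syntax.

```python
# Illustrative only: sparse matrices as {(i, j): value} dicts.
# Missing keys are treated as zero entries.


def transpose(a):
    """Interchange rows and columns."""
    return {(j, i): v for (i, j), v in a.items()}


def add(a, b):
    """Entry-wise sum; entries that cancel are kept as explicit zeros."""
    result = dict(a)
    for key, v in b.items():
        result[key] = result.get(key, 0) + v
    return result


def multiply(a, b):
    """Matrix product: result[i, j] = sum over k of a[i, k] * b[k, j]."""
    # Index b by row so each entry of a only meets matching rows of b.
    b_rows = {}
    for (k, j), v in b.items():
        b_rows.setdefault(k, []).append((j, v))
    result = {}
    for (i, k), av in a.items():
        for j, bv in b_rows.get(k, []):
            result[(i, j)] = result.get((i, j), 0) + av * bv
    return result


m = {(0, 0): 1, (0, 1): 2, (1, 1): 3}
n = {(0, 0): 4, (1, 0): 5}

m_t = transpose(m)
m_plus_n = add(m, n)
m_times_n = multiply(m, n)
```

Only non-zero cells are stored, which is in keeping with the sparsity that the background section expects Hadoop + HBase to handle.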

 * Decompositions

||<bgcolor="#ececec">'''Operator''' ||<bgcolor="#ececec">'''Explanation''' ||
||LU ||<99%>... ||
||QR ||<99%>... ||
||Cholesky ||<99%>... ||
||SVD ||<99%>... ||
||Inverse ||<99%>the matrix B such that AB = BA = I (defined only for nonsingular square matrices) ||
||Pseudoinverse ||<99%>... ||
||Condition ||<99%>... ||
||Determinant ||<99%>... ||
||Rank ||<99%>... ||
