hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "Hbase/HbaseShell/Altools" by udanax
Date Wed, 19 Sep 2007 05:31:58 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by udanax:
http://wiki.apache.org/lucene-hadoop/Hbase/HbaseShell/Altools

The comment on the change is:
The name Altools will no longer be used due to trademark issues

------------------------------------------------------------------------------
+ deleted
- [[TableOfContents(5)]]
- ----
-  ''-- This project is currently in the planning stage.  [https://issues.apache.org/jira/browse/HADOOP-1608
HADOOP-1608] to add "Relational Algrebra Operators" is currently in process.
  
- = Hbase Shell Altools Plan =
- 
- Hbase altools is an Hbase Shell sub 'interpreter' (or 'shell)' program to provide scalable
data processing capabilities like  aggregation, algebraic calculation(groups and sets, commutative
rings, algebraic geometry, and linear algebra) on Hadoop + Hbase based parallel machines.
especially, it will focus on storing and manipulating numeric, sparse matrices on Hbase.
- 
- Altools operations will show or explain how Google search's LSI, Google Earth's algebraic
topology, Google News' recommendation system are related to Bigtable.
- 
- == Initial Contributor ==
-  * [:udanax:Edward Yoon] (R&D center, NHN corp.)
-   -- If you have constructive ideas, Please advise me. [[MailTo(webmaster AT SPAMFREE udanax
DOT org)]]''
- 
- == Background ==
- I expect Hadoop + Hbase to handle sparsity and data explosion very well in near future.
Moreover, i believe the design of the multi-dimensional map structure and the 3d space model
of the data are optimized for rapid ad-hoc information retrieval in any orientation, as well
as for fast, flexible calculation and transformation of raw data based on formulaic relationships.
It is advantageous with respect to Analysis Processing as it allows users to easily formulate
complex queries, and filter or slice data into meaningful subsets, among other things.
- 
- === Parallelization ===
- 
- Altools provides automatic parallelization of the most time-consuming relational/matrix/vector
operations, and will ensure that the iterative solvers are scalable.
- 
-  * Survey List
-   * Parallel Processing of Relational Data
-   * Parallel Algorithms of Multi-Dimensional Matrix Operations
-   * Parallel Gaussian Elimination Algorithm
- 
- ----
- 
- = Suggested Hbase altools Syntax =
- '''Note''' that Data should be located by their row, column, and timestamp.
- 
- == Commands ==
- ||<bgcolor="#E5E5E5">'''Command''' ||<bgcolor="#E5E5E5">'''Explanation''' ||
- ||Table ||<99%>'''Table''' command loads specified table. [[BR]][[BR]]~-''Table('table_name');''-~
||
- ||Matrix ||<99%>'''Matrix''' command constructs the configuration of the logic matrix.[[BR]]'''Options'''
: features not yet. [[BR]][[BR]]~-''Matrix(table_name, columnfamily_name[, option]);''-~ ||
- ||Substitute ||<99%>'''Substitute''' specific algebraic expression to [A~Z][[BR]][[BR]]~-''A
= Projection(column-list);''-~ ||
- ||IF...ELSE ||<99%>'''IF...ELSE''', Imposes conditions on the execution. [[BR]][[BR]]~-''IF
( boolean_expression )[[BR]]{{{    }}}B = command_statements;[[BR]]ELSE[[BR]]{{{    }}}B =
command_statements;''-~||
- ||Store ||<99%>'''Store''' command will store results to specified table or file.
[[BR]][[BR]]~-''A = Table('table_name'); [[BR]]B = A.Selection(condition_expression); [[BR]]Store
B TO table(result_table)[or file('result_file_name')];''-~ ||
- 
- '''Type''' 'help;' for Hbase altools usage.
- 
- {{{
- HBase > altools;
- 
- Hbase altools, 0.0.1 version
- Type 'help;' for Hbase altools usage.
- 
- Hbase.altools > help;
- }}}
- 
- == Relational Operators ==
- ||<bgcolor="#E5E5E5">'''Operator''' ||<bgcolor="#E5E5E5">'''Explanation''' ||
- ||Projection ||<99%>'''Projection''' of a relation ~+R+~, It makes a new relation
as the set that is obtained when all tuples(rows) in ~+R+~ are restricted to the set {columnfamily,,1,,,...,columnfamily,,n,,}.[[BR]][[BR]]~-''A
= Table('table_name');[[BR]]B = A.Projection(column-list);{{{    }}}'''//π,,column-list,,(A)'''
''-~ ||
- ||Selection ||<99%>'''Selection''' of a relation ~+R+~, It makes a new relation as
the set of specified tuples(rows) of the relation ~+R+~.[[BR]]'''Set Operations''' : ~-''OR,
AND, NOT''-~[[BR]][[BR]]~-''A = Table('table_name');[[BR]]B = A.Selection(condition_expression);{{{
   }}}'''//σ,,condition,,(A)''' ''-~ ||
- ||JOINs ||<99%>Table '''JOIN''' operations, linking and extracting data from two different
internal source.[[BR]]'''Operations''' : ~-''naturalJoin(), thetaJoin(), cartesianProduct()
''-~ [[BR]][[BR]]~-''R = Table('table_name1');[[BR]]S = Table('table_name2');[[BR]]C = R.naturalJoin(S);{{{
   }}}'''//C = R▷◁S''' ''-~ ||
- ||Group ||<99%>'''Group''' tuples by value of an attribute and apply aggregate function
independently to each group of tuples.[[BR]]'''Aggregate Functions''' : ~-''AVG(attribute),
SUM(attribute), COUNT(attribute), MIN(attribute), MAX(attribute)''-~[[BR]][[BR]]~-''A = Table('table_name');[[BR]]B
= A.Group(column-list);{{{    }}}'''//γ,,column-list,,(A)''' ''-~ ||
- ||Sort ||<99%>'''Sort''' of tuples(rows) of R, ordered according to columnfamilies
on columnfamily-list.[[BR]][[BR]]~-''A = Table('table_name');[[BR]]B = Sort A by (column-list);{{{
   }}}'''//τ,,column-list,,(A)''' ''-~ ||
- 
- '''(ex. 1)''' Search the subject and the year of the movies which were produced by 'Fox'
company and where running time is more than 100 minutes.
- [[BR]]~-'''''π ,,title.year,, (σ ,,length > 100,, (movieLog_table) ∩ σ ,,studioName
= 'Fox',, (movieLog_table))'''''-~
- 
- {{{
- Hbase.altools > A = Table('movieLog_table'); 
- Hbase.altools > B = A.Selection(length > 100 AND studioName = 'Fox'); 
- Hbase.altools > C = B.Projection('year'); 
- 
- Hbase.altools > store C to table('result_table'); 
- }}}
- 
- '''(ex. 2)''' Theta Join : ▷◁,,C,,
- [[BR]]~-'''''movieStars_table▷◁,,actor < year,,movieLog_table = σ,,actor < year,,(movieStars_table
X movieLog_table)'''''-~
- 
- {{{
- Hbase.altools > A = Table('movieStars_table'); 
- Hbase.altools > B = Table('movieLog_table');
- Hbase.altools > C = A.thetaJoin(B);
- 
- Hbase.altools > store C to table('result_table'); 
- }}}
- 
- '''(ex. 3)''' Find the year of the earliest movie for each actor.
- [[BR]]~-'''''γ ,,starName.MIN(year) → minYear,, (movieStars_table)'''''-~
- 
- {{{
- Hbase.altools > A = Table('movieStars_table');
- Hbase.altools > B = A.Group('starName',MIN('year'));
- 
- Hbase.altools > store B to table('result_table');
- }}}
- 
- == Matrix Operators ==
- '''Note''' that matrix operations are the core of many linear systems.
- === Arithmetic Operators ===
- ||<bgcolor="#E5E5E5">'''Operator''' ||<bgcolor="#E5E5E5">'''Explanation''' ||
- ||Addition ||<99%>'''Adding''' entries with the same indices. [[BR]][[BR]]~-''A =
Matrix('table_name1','columnfamily_name1');[[BR]]B = Matrix('table_name2','columnfamily_name2');[[BR]]C
= A + B;{{{    }}}'''// c,,ij,, = a,,ij,, + b,,ij,, (i : row key, j : column key)''' ''-~
||
- ||Subtraction ||<99%>'''Subtracting''' entries with the same indices.[[BR]][[BR]]~-''A
= Matrix('table_name1','columnfamily_name1');[[BR]]B = Matrix('table_name2','columnfamily_name2');[[BR]]C
= A - B;{{{    }}}'''// c,,ij,, = a,,ij,, - b,,ij,, (i : row key, j : column key)''' ''-~
||
- ||Multiplication ||<99%>'''Multiplication''' of two matrices, Product C of two matrices
A and B.[[BR]][[BR]]~-''A = Matrix('table_name1','columnfamily_name1');[[BR]]B = Matrix('table_name2','columnfamily_name2');[[BR]]C
= A * B;{{{    }}}'''//C = A · B''' ''-~ ||
- ||Division ||<99%>'''Division''' is solving the matrix equation AX = B for X.[[BR]][[BR]]~-''A
= Matrix('table_name1','columnfamily_name1');[[BR]]B = Matrix('table_name2','columnfamily_name2');[[BR]]C
= A /[or \] B;{{{    }}}'''// C = A / B''' ''-~||
- ||Transpose ||<99%>'''Transpose''' of a Matrix, A matrix which is formed by turning
all the rows of a given matrix into columns and vice-versa.[[BR]][[BR]]~-''A = Matrix('table_name1','columnfamily_name1');[[BR]]B
= Transpose(A);{{{    }}}'''// B = A'''' ''-~||
- 
- 
- '''(ex. 1)''' Matrix Addition
- [[BR]]~-'''''C = A + B = (a,,ij,, + b,,ij,,)'''''-~
- 
- {{{
- //Set up the matrix A, B from mapped matrix in hbase.
- 
- Hbase.altools > A = Matrix('m_table','cf_1');
- Hbase.altools > B = Matrix('m_table','cf_2');
- Hbase.altools > C = A + B;
- }}}
- 
- 
- '''(ex. 2)''' The product C of two matrices A and B
- [[BR]]~-'''''C,,ij,, = ΣA,,ik,,B,,kj,, (1 ≤ i ≤ m , 1 ≤ j ≤n)'''''-~
- 
- {{{
- //Set up the matrix A, B from mapped matrix in hbase.
- 
- Hbase.altools > A = Matrix('m_table','cf_1');
- Hbase.altools > B = Matrix('m_table','cf_2');
- Hbase.altools > C = A * B;
- }}}
- 
- === Factorization and Decomposition Operators ===
- 
- ||<bgcolor="#E5E5E5">'''Function''' ||<bgcolor="#E5E5E5">'''Explanation''' ||
- ||LU ||<99%>'''LU Decomposition'''[[BR]]A procedure for decomposing an N by N matrix
A into a product of a lower triangular matrix L and an upper triangular matrix U, LU = A.[[BR]]'''Functions'''
: ~-''getL(), getU(), isSingular(), getPivot()''-~ [[BR]][[BR]]~-''A = Matrix('table_name','columnfamily_name');[[BR]]B
= LUDecomposition(A);[[BR]]C = B.getU();[[BR]]D = B.getL();''-~||
- ||QR ||<99%>'''QR Decomposition'''[[BR]]For an m-by-n matrix A with m >= n, the
QR decomposition is an m-by-n orthogonal matrix Q and an n-by-n upper triangular matrix R
so that A = Q*R.[[BR]]'''Functions''' : ~-''getH(), getQ(), getR()''-~[[BR]][[BR]]~-''A =
Matrix('table_name','columnfamily_name');[[BR]]B = QRDecomposition(A);[[BR]]C = B.getH();''-~||
- ||Cholesky ||<99%>'''Cholesky Decomposition'''[[BR]]It is a special case of LU decomposition
applicable only if matrix to be decomposed is symmetric positive definite.[[BR]]'''Functions'''
: ~-''getL(), getU(), isSPD()''-~ [[BR]][[BR]]~-''A = Matrix('table_name','columnfamily_name');[[BR]]B
= CholeskyDecomposition(A);[[BR]]C = B.getL();''-~||
- ||SVD ||<99%>'''SVD(Singular Value Decomposition)'''[[BR]]For an m-by-n matrix A with
m >= n, the singular value decomposition is an m-by-n orthogonal matrix U, an n-by-n diagonal
matrix S, and an n-by-n orthogonal matrix V so that A = U*S*V'.[[BR]]'''Functions''' : ~-''getS(),
getU(), getV(), norm2(), rank()''-~ [[BR]][[BR]]~-''A = Matrix('table_name','columnfamily_name');[[BR]]B
= SVDdecomposition(A);[[BR]]C = B.getU();''-~||
- 
- {{{
- //Set up the matrix M from mapped matrix in hbase.
- Hbase.altools > M = Matrix('m_table','cf_1'); 
- 
- M ([1, 2],
-    [3, 4])
- }}}
- 
- '''(ex. 1)''' To find the Singular Value decomposition in Altools, do the following:
- [[BR]]~-'''''M = UΣV*'''''-~
- 
- {{{
- Hbase.altools > A = M.SVDdecomposition();
- Hbase.altools > U = A.getU();
- Hbase.altools > S = A.getS();
- Hbase.altools > V = A.getV();
- 
- U ([[-0.40455358, -0.9145143 ],
-     [-0.9145143 ,  0.40455358]])
- 
- S ([ 5.4649857 ,  0.36596619])
- 
- V ([[-0.57604844, -0.81741556],
-     [ 0.81741556, -0.57604844]])
- }}}
- 
- '''(ex. 2)''' To find the QR decomposition in Altools, do the following:
- [[BR]]~-'''''M = QR'''''-~
- 
- {{{
- Hbase.altools > A = M.QRDecomposition();
- Hbase.altools > U = A.getQ();
- Hbase.altools > U = A.getR();
- 
- Q ([[-0.31622777, -0.9486833 ],
-     [-0.9486833 ,  0.31622777]])
- 
- R ([[-3.16227766, -4.42718872],
-     [ 0.        , -0.63245553]])
- }}}
- ----
- = Example Of Altools Use =
- == Latent Semantic Analysis by SVD ==
- Latent semantic analysis (LSA) is a technique in natural language processing, in particular
in vectorial semantics, of analyzing relationships between a set of documents and the terms
they contain by producing a set of concepts related to the documents and terms. LSA was patented
in 1988 by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer,
Karen Lochbaum and Lynn Streeter. In the context of its application to information retrieval,
it is sometimes called latent semantic indexing (LSI). -- wikipedia
- 
- This LSA example used SVD decomposition with k=3. 
- 
- The SVD is typically computed using large matrix methods (for example, Lanczos methods)
but may also be computed incrementally and with greatly reduced resources via a neural network-like
approach which does not require the large, full-rank matrix to be held in memory -- wikipedia
- 
-  * ~-'''NOTATION'''-~
-   * ~-''T,,0,,'' : orthogonal, unit-length columns-~
-   * ~-''D,,0,,'' : orthogonal, unit-length columns-~
-   * ~-''S,,0,,'' : eigenvalues of a diagonal matrix-~
-   * ~-''t'' : Row number of matrix X-~
-   * ~-''d'' : Column number of matrix X-~
-   * ~-''m'' : Rank of matrix X ( ≤ min(t,d) )-~
- 
- I made a new relation "t_table" using "relational operators" from raw data(web_table) as
described below :
- [[BR]]~-''TODO: do explain "theoretical way of manipulating table(web_table) using relational
operators"''-~
- 
- ||Row Key ||<-6>Column Families ||
- ||<rowbgcolor="#ececec">term ||<-2> corpus: ||<-2>link: ||<-2>..
||
- ||t01 ||corpus:c01 ||0 ||link:cnn.com ||1 ||.. ||.. ||
- || ||corpus:c03 ||0 ||  ||  ||.. ||.. ||
- ||t02 ||corpus:c01 ||1 || || ||.. ||.. ||
- || ||corpus:c02 ||0 ||link:google.com  ||0  || || ||
- ||.. ||.. ||.. ||.. ||.. ||.. ||.. ||
- 
- {{{
- //Set up the matrix X
- Hbase.altools > X = Matrix('t_table','corpus'); 
- }}}
- 
- This example used tf*idf for more efficient and economical calculations.
- 
-  * ~-''w,,ij,, = tf,,ij,, '''·''' log( N/df,,j,, )''-~
-   * ~-''w'' : weighted value-~
-   * ~-''tf'' : the number of times the word appears in a i-th document-~
-   * ~-''N'' : the number of documents of an entire corpus-~
-   * ~-''df'' : the number of documents containing the j-th term-~
- 
- ||<rowbgcolor="#ececec"> ||c01 ||c02 ||c03 ||.. ||
- ||<bgcolor="#ececec">t01 ||0.00 ||0.00 ||0.15 ||.. ||
- ||<bgcolor="#ececec">t02 ||0.08 ||0.00 ||0.00 ||.. ||
- ||<bgcolor="#ececec">t02 ||0.00 ||0.26 ||0.00 ||.. ||
- ||<bgcolor="#ececec">.. ||.. ||.. ||.. ||.. ||
- 
- Weighted Matrix W
- 
- {{{
- //Diagonal eigenvalue S
- 
- Hbase.altools > M = W.SVDdecomposition(); 
- Hbase.altools > S = M.getS();
- }}}
- 
-  * ~-''X = T,,0,,S,,0,,D,,0,,''-~
- 
- {{{
- //Othonormal eigenvector T, D
- 
- Hbase.altools > T = M.getU();
- Hbase.altools > D = M.getV();
- }}}
- 
- SVD allows a simple strategy for optimal approximate fit using smaller matrices. If the
singular
- values in S0 are ordered by size, the first k largest values may be kept and the remaining
smaller ones
- set to zero. The product of the resulting matrices is a matrix X′ which is only approximately
equal to X,
- and is of rank k. Since zeros were introduced into S0, the representation can be simplified
by deleting
- the zero rows and columns of S0 to obtain a new diagonal matrix S, and deleting the corresponding
- columns of T0 and D0 to obtain T and D respectively. The result is a reduced model :
- 
-  * ~-''X′ = TSDT ≈ X''-~
- 
- which is the rank-k model with the best possible least square fit to X.
- 
- ||<rowbgcolor="#ececec"> ||c01 ||c02 ||c03 ||.. ||
- ||<bgcolor="#ececec">t01 ||0.18 ||0.03 ||0.12 ||.. ||
- ||<bgcolor="#ececec">t02 ||0.08 ||0.00 ||0.30 ||.. ||
- ||<bgcolor="#ececec">t02 ||0.00 ||0.26 ||0.07 ||.. ||
- ||<bgcolor="#ececec">.. ||.. ||.. ||.. ||.. ||
- 
- Reduced Matrix R
- 
- ----
- = Papers =
-  * [http://www.uib.no/People/nmabh/art/hpj.pdf High performance numerical libraries in Java]
-  * ''[http://labs.google.com/papers/bigtable.html Bigtable] : A Distributed Storage System
for Structured Data''
-  * ''Interpreting the Data: Parallel Analysis with [http://labs.google.com/papers/sawzall.html
Sawzall]''
-  * ''Y!'s Research Project : [http://research.yahoo.com/project/pig Pig] Document''
-  * ''[http://portal.acm.org/citation.cfm?doid=1247480.1247602 Map-Reduce-Merge] : Simplified
Relational Data Processing on Large Clusters''
-  * ''[http://www.pytables.org/ PyTables] : Hierarchical Datasets in Python''
-  * ''[http://numpy.scipy.org/ Numpy] : Scientific Tools for Python''
-  * ''[http://db.lcs.mit.edu/projects/cstore/ C-Store] : A Column Oriented DBMS''
- 

Mime
View raw message