hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "Hbase/HbaseShell" by udanax
Date Mon, 02 Jul 2007 09:23:14 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by udanax:
http://wiki.apache.org/lucene-hadoop/Hbase/HbaseShell

New page:
'''research/work in progress''' 

 * https://issues.apache.org/jira/browse/HADOOP-1375 [[BR]]Implementation has not yet started.

[[TableOfContents(4)]]

----
= Hbase Shell Introduction =
Hbase Shell is an 'interpreter' (or 'shell') that provides scalable data processing capabilities
such as [[BR]]aggregation and algebraic calculation on Hadoop + Hbase.

== Hbase Shell Goals ==
HBase Shell is developed to achieve the following goals.

 * Simplified import/export/migration between different data sources (Hadoop, HBase)
 * Simplified processing of a logical data model
 * Simplified algebraic operations
 * Simplified parallel numerical analysis by abstracting/numericalizing point, line, [[BR]]or plane data across multiple maps in HBase.

== Background ==

I expect Hadoop + Hbase to handle sparsity and data explosion very well in the near future. [[BR]]Moreover,
I believe the design of the multi-dimensional structure and the 3-dimensional space model of the data
are [[BR]]optimized for rapid ad-hoc information retrieval in any orientation, as well as
for fast, flexible calculation and transformation of [[BR]]raw data based on formulaic relationships.

I therefore think it needs a more user-friendly interface for querying the data
interactively.

== Rationale ==

It will probably take a while for Hadoop + HBase to provide reliable real-time service like
other DBMSs. 
[[BR]]Thus, I decided to develop a shell for linear algebra computation 
[[BR]]and large-scale data processing, using Hadoop's parallel processing and HBase storage. 

''Then you may ask "How is this different from MapReduce using MapFiles?"''

I don't expect it to give us high performance just yet, 
[[BR]]but it will surely make data management and development much easier. 
[[BR]]First, let's take a look at HBase's data model. 

HBase provides a unified data model that represents data in 3 dimensions 
[[BR]]- Row, Column, and Timestamp. Also, Row and Column may be extended indefinitely. 
  
If we slice the data model at a fixed timestamp (version), we may view the result as a 2D
table. 
[[BR]]If the index is a string, we may view it as a huge map. If the index is an integer, then it
is one huge 2D array. 
[[BR]]So each table may contain such 3D data stores (ColumnFamilies).
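
The slicing idea can be sketched in a few lines of Python. This is only an illustration of the three dimensions, assuming a nested-map layout `table[row][column][timestamp]`; it is not how HBase stores data, and the timestamps 1 and 2, the earlier vote value, and the helper name `slice_at` are made up for the example.

```python
# Hypothetical in-memory model: table[row][column][timestamp] -> value.
table = {
    "Star Wars": {
        "year:": {1: "1977"},
        "vote:user_1": {1: "4", 2: "5"},  # two versions of the same cell
    },
    "Mighty Ducks": {
        "year:": {1: "1991"},
        "vote:user_1": {2: "2"},
    },
}


def slice_at(table, timestamp):
    """Cut the 3-D model at a timestamp, keeping the newest value at or before it."""
    view = {}
    for row, columns in table.items():
        for column, versions in columns.items():
            visible = [t for t in versions if t <= timestamp]
            if visible:
                view.setdefault(row, {})[column] = versions[max(visible)]
    return view


latest = slice_at(table, timestamp=2)   # 2-D view: row -> column -> value
earlier = slice_at(table, timestamp=1)  # the same table as it looked at time 1
```

With string row keys the slice behaves like a large map; replacing the keys with integers would give the 2D-array view described above.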


----
= Hbase Shell Client Syntax Definition =
'''Note''' that data is located by its row, column, and timestamp.

== Commands ==
||<bgcolor="#ececec">'''Command''' ||<bgcolor="#ececec">'''Explanation''' ||
||HELP ||<99%>'''Help''' command provides information about the use of shell script.[[BR]][[BR]]~-''HELP
[function_name];''-~ ||
||SHOW ||<99%>'''Show''' command will list the tables.[[BR]][[BR]]~-''SHOW tables;''-~
||
||DESC ||'''Desc''' command provides information about the columnfamilies in a table.[[BR]][[BR]]~-''DESC table_name;''-~ ||
||CREATE ||'''Create''' command will create a new table.[[BR]][[BR]]~-''CREATE table_name[[BR]]COLUMNFAMILIES('columnfamily_name1'[,
'columnfamily_name2', ...])[[BR]][LIMIT=limitNumber_of_Version];''-~ ||
||DROP ||'''Drop''' command drops columnfamilies in a table, or whole tables.[[BR]][[BR]]~-''DROP table_name1[, table_name2, ...] or columnfamily_name1[, columnfamily_name2, ...];''-~ ||
||SUBSTITUTE[[BR]] || '''Substitute''' assigns an expression to a variable [A~Z].[[BR]][[BR]]~-''X = Matrix(table_name, columnfamily_name);''-~||
||STORE ||'''STORE''' command will store results to specified table. [[BR]][[BR]]~-''A = Table('movieLog_table');
[[BR]]B = A.Selection('length' > 100); [[BR]]STORE B TO X run_style;''-~ ||
||EXIT ||<99%>'''Exit''' from the current shell script.[[BR]][[BR]]~-''EXIT;''-~ ||
Commands for manipulating data manually at a finer level of detail:
||<bgcolor="#ececec">'''Command''' ||<bgcolor="#ececec">'''Explanation''' ||
||INSERT ||<99%>'''Insert''' command will insert one row into the table with a value for each specified column.[[BR]][[BR]]~-''INSERT table_name ('columnfamily_name1:column_key'[, 'columnfamily_name2:column_key', ...])[[BR]] VALUES ('entry1'[, 'entry2', ...])[[BR]]WHERE row='row_key';''-~ ||
||SET ||'''SET''' command will change the values. [[BR]][[BR]]~-''SET table_name[[BR]] VALUES('columnfamily_name:column_key','entry')[[BR]]WHERE
row='row_key' AND time='Specified_Timestamp';''-~ ||
||DELETE ||'''Delete''' command will delete specified rows in table. [[BR]][[BR]]~-''DELETE
table_name[[BR]]WHERE row='row_key'[[BR]][AND column='columnfamily_name:column_key'];''-~
||
||SELECT ||<99%>'''Select''' command will retrieve rows from a table.[[BR]][[BR]]~-''SELECT table_name[[BR]][WHERE row='row_key'][[BR]][AND column='columnfamily_name:column_key'][[BR]][AND time='Specified_Timestamp'][[BR]][LIMIT=Number_of_Version];''-~ ||

== Relational Operations ==

||<bgcolor="#ececec">'''Operators''' ||<bgcolor="#ececec">'''Explanation''' ||
||PROJECTION||<99%>'''Projection''' is defined as the set obtained when all tuples in ~+R+~ are restricted to the attribute set {a,,1,,,...,a,,n,,}.||
||SELECTION||<99%>...||
||PRODUCT||<99%>...||
||RENAME||<99%>'''Rename''' r to x||
||GROUP||<99%>...||
||SORT||<99%>...||


||<bgcolor="#ececec">'''Operators''' ||<bgcolor="#ececec">'''Explanation''' ||
||UNION ||<99%>'''Union''' A∪B contains every element that is in A or in B.||
||INTERSECTION ||<99%>'''Intersection''' A∩B contains every element that is in both A and B, so it is a subset of A and of B.||
||DIFFERENCE ||'''Difference''' of A and B (A-B).||
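
As a rough illustration of the three set operators, here is a sketch using Python's built-in sets over row keys. The contents of `A` and `B` are made up for the example; nothing here is the shell's implementation.

```python
# Hypothetical row-key sets standing in for two query results A and B.
A = {"Star Wars", "Mighty Ducks"}
B = {"Mighty Ducks", "Wayne's World"}

union = A | B          # every row key in A or in B
intersection = A & B   # only row keys present in both
difference = A - B     # row keys in A that are not in B
```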

||<bgcolor="#ececec">'''Functions''' ||<bgcolor="#ececec">'''Explanation''' ||
||AVG ||<99%>...||
||SUM ||<99%>...||
||COUNT ||<99%>...||
||MIN ||<99%>...||
||MAX ||<99%>...||

== Matrix Operations ==

||<bgcolor="#ececec">'''Operation''' ||<bgcolor="#ececec">'''Explanation''' ||
||DOUBLEMATRIX||<99%>...||
||BOOLEANMATRIX||<99%>...||

||<bgcolor="#ececec">'''Functions''' ||<bgcolor="#ececec">'''Explanation''' ||
||QR ||<99%>...||
||LU||<99%>...||
||SVD ||<99%>...||

----
= Example Of Hbase Shell Use =
== Basic Usage ==

=== Create the table in a HBase ===

~-''CREATE movieLog_table
[[BR]]COLUMNFAMILIES('year', 'length', 'inColor', 'studioName', 'vote', 'producer')
[[BR]]LIMIT=1;''-~ 

=== Insert data into a table ===
~-''INSERT movieLog_table ('year:', 'length:', 'inColor:', 'studioName:', 'vote:user_1', 'producer:')
[[BR]]VALUES ('1977', '124', 'true', 'Fox', '5', 'George Lucas')
[[BR]]WHERE row='Star Wars';''-~ 

=== Show all data in a table ===
~-''SELECT movieLog_table;''-~ 

||Row Key ||<-12>Column Families ||
||<rowbgcolor="#ececec">title ||<-2> year ||<-2>length ||<-2>inColor
||<-2> studioName ||<-2> vote ||<-2> producer ||
||Star Wars ||year: || 1977 ||length: || 124 ||inColor: || true ||studioName: || Fox || vote:''user_1''
|| 5 || producer: || George Lucas ||
|| || || || || || || || || || vote:''user_2'' || 2 || || ||
||Mighty Ducks ||year: || 1991 ||length: || 104 ||inColor: || true ||studioName: || Disney
|| vote:''user_1'' || 2 || producer: || Blair Peters ||
|| || || || || || || || || || vote:''user_3'' || 4 || || ||
||Wayne's World ||year: || 1992 ||length: || 95 ||inColor: || true ||studioName: || Paramount
|| vote:''user_2'' || 3 || producer: || Penelope Spheeris ||
|| || || || || || || || || || vote:''user_3'' || 4 || || ||


== Relational Operations ==

=== Projection ===

~-''A = Table('movieLog_table');
[[BR]]B = A.Projection('year','length');''-~

'''~+^π^+~'''~-title-~,~-year-~,~-length-~'''~+^(movieLog_table)^+~'''

||<rowbgcolor="#ececec">title ||year ||length ||
||Star Wars ||1977 ||124 ||
||Mighty Ducks ||1991 ||104 ||
||Wayne's World ||1992 ||95 ||
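
A minimal sketch of what Projection does, assuming rows are held as a map from row key to column values. The `projection` helper and the dictionary layout are hypothetical; the row key (title) is always retained, which matches the title column in the result above.

```python
# Hypothetical in-memory copy of part of movieLog_table: row key -> columns.
rows = {
    "Star Wars": {"year": "1977", "length": "124", "studioName": "Fox"},
    "Mighty Ducks": {"year": "1991", "length": "104", "studioName": "Disney"},
    "Wayne's World": {"year": "1992", "length": "95", "studioName": "Paramount"},
}


def projection(rows, *families):
    """Keep only the named column families; row keys are preserved."""
    return {
        key: {name: columns[name] for name in families if name in columns}
        for key, columns in rows.items()
    }


B = projection(rows, "year", "length")
```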



=== Selection ===

~-''A = Table('movieLog_table');
[[BR]]B = A.Selection('length' > 100);''-~

'''~+^σ^+~'''~-length>100-~'''~+^(movieLog_table)^+~'''

||<rowbgcolor="#ececec">title ||year ||length ||inColor ||studioName ||producer ||
||Star Wars ||1977 ||124 ||true ||Fox ||George Lucas ||
||Mighty Ducks ||1991 ||104 ||true ||Disney ||Blair Peters ||
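
Selection can be sketched as a row filter. The `selection` helper is hypothetical, and converting the stored string to an integer before comparing is an assumption; the draft does not say how cell values are typed.

```python
# Hypothetical in-memory copy of part of movieLog_table: row key -> columns.
rows = {
    "Star Wars": {"year": "1977", "length": "124"},
    "Mighty Ducks": {"year": "1991", "length": "104"},
    "Wayne's World": {"year": "1992", "length": "95"},
}


def selection(rows, predicate):
    """Keep only the rows whose columns satisfy the predicate."""
    return {key: columns for key, columns in rows.items() if predicate(columns)}


# Values are stored as strings here, so convert before the numeric comparison.
B = selection(rows, lambda columns: int(columns["length"]) > 100)
```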


=== Example ===

~-''A = Table('movieLog_table');
[[BR]]B = A.Selection(length > 100 AND studioName = 'Fox');
[[BR]]C = B.Projection('year');''-~

'''~+^π^+~'''~-title-~,~-year-~'''~+^(σ^+~'''~-length>100-~'''~+^(movieLog_table)∩σ^+~'''~-studioName='Fox'-~'''~+^(movieLog_table))^+~'''

||<rowbgcolor="#ececec">title ||year ||
||Star Wars ||1977 ||
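
The expression above writes the AND condition as an intersection of two selections. That rests on a standard relational-algebra identity, shown below with R standing for movieLog_table and p, q for the two predicates (these symbols are chosen here for illustration only):

```latex
\begin{align*}
\sigma_{p \land q}(R) &= \sigma_{p}(R) \cap \sigma_{q}(R) \\
\pi_{\text{title},\,\text{year}}\bigl(\sigma_{\text{length} > 100 \,\land\, \text{studioName} = \text{'Fox'}}(R)\bigr)
  &= \pi_{\text{title},\,\text{year}}\bigl(\sigma_{\text{length} > 100}(R) \cap \sigma_{\text{studioName} = \text{'Fox'}}(R)\bigr)
\end{align*}
```

Either form yields the single row for Star Wars on the example data.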

== Matrix Operations ==

Let's construct an abstract sparse row-by-column matrix.

~-''A = doubleMatrix('movieLog_table','vote');''-~

||<rowbgcolor="#ececec"> ||user_1 ||user_2 ||user_3 ||
||<bgcolor="#ececec">Star Wars || 5.0 || 2.0 ||   ||
||<bgcolor="#ececec">Mighty Ducks || 2.0 ||   || 4.0 ||
||<bgcolor="#ececec">Wayne's World ||   || 3.0 || 4.0 ||
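
One way to picture how a column family becomes a sparse matrix is sketched below. The `double_matrix` helper and the dictionary layout are hypothetical stand-ins for doubleMatrix, under the assumption that column keys within the family become matrix columns and that missing cells are simply absent.

```python
# Hypothetical in-memory copy of the vote cells: row key -> "family:key" -> value.
cells = {
    "Star Wars": {"vote:user_1": "5", "vote:user_2": "2", "year:": "1977"},
    "Mighty Ducks": {"vote:user_1": "2", "vote:user_3": "4", "year:": "1991"},
    "Wayne's World": {"vote:user_2": "3", "vote:user_3": "4", "year:": "1992"},
}


def double_matrix(cells, family):
    """Build a sparse row-by-column matrix of floats from one column family."""
    prefix = family + ":"
    matrix = {}
    for row, columns in cells.items():
        for column, value in columns.items():
            if column.startswith(prefix):
                # The part after "family:" becomes the matrix column label.
                matrix.setdefault(row, {})[column[len(prefix):]] = float(value)
    return matrix


A = double_matrix(cells, "vote")
```

Empty cells in the table above correspond to keys that are absent from the inner maps.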


----
= Matrix Extension Example On Hbase Shell =
== Latent Semantic Analysis By Singular Value Decomposition ==
'''Motivation''': lexical matching at the term level is inaccurate (claimed).

  * Polysemy - a word with multiple ‘meanings’ - term matching returns irrelevant documents - impacts precision
  * Synonymy - multiple words with the same ‘meaning’ - term matching misses relevant documents - impacts recall

LSA assumes that there is a LATENT structure in word usage, obscured by variability in
word choice. 
[[BR]]This is analogous to the signal + additive noise model in signal processing.
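
The draft does not spell out the decomposition, so the following is the textbook formulation rather than anything specific to Hbase Shell. Here A is assumed to be a term-by-document matrix, k is a chosen number of retained singular values, and E_k is the discarded remainder (all symbols are introduced here for illustration):

```latex
\begin{align*}
A   &= U \Sigma V^{\mathsf{T}} \\
A_k &= U_k \Sigma_k V_k^{\mathsf{T}}, \qquad k < \operatorname{rank}(A) \\
A   &= A_k + E_k
\end{align*}
```

In the signal-processing analogy, A_k plays the role of the signal (the latent structure) and E_k the additive noise; A_k is the best rank-k approximation of A in the Frobenius norm.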



== Scalable Collaborative Filtering With A Large User-By-Item Matrix ==

Title-By-Title Triangular Matrix 

||<rowbgcolor="#ececec"> ||Star Wars  ||Mighty Ducks ||Wayne's World ||
||<bgcolor="#ececec">Star Wars ||   || 0.415 || 0.222 ||
||<bgcolor="#ececec">Mighty Ducks ||   ||   || 0.715 ||
||<bgcolor="#ececec">Wayne's World ||   ||   ||   ||
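
The draft does not say how the title-by-title values are computed. The figures shown are consistent with cosine similarity over the vote matrix with missing votes treated as zero, truncated to three decimals; that is an inference, not something the draft states. A minimal sketch under that assumption, with hypothetical helper names `cosine` and `upper_triangle`:

```python
import math

# Sparse vote matrix from the earlier example: title -> user -> vote.
votes = {
    "Star Wars": {"user_1": 5.0, "user_2": 2.0},
    "Mighty Ducks": {"user_1": 2.0, "user_3": 4.0},
    "Wayne's World": {"user_2": 3.0, "user_3": 4.0},
}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; absent entries count as zero."""
    dot = sum(a[user] * b[user] for user in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def upper_triangle(votes):
    """Compute each unordered pair of titles once (the strict upper triangle)."""
    titles = list(votes)
    return {
        (first, second): cosine(votes[first], votes[second])
        for i, first in enumerate(titles)
        for second in titles[i + 1:]
    }


similarity = upper_triangle(votes)
```

Only the upper triangle is computed because the measure is symmetric, which halves the work for a large title-by-title matrix.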


== Consistency Assessment Of Topological Relationship By Matrix-Union ==
..

----
= Performance Reports =
..

----
= People Involved =

 * [:udanax:Edward Yoon] udanax@nhncorp.com
 * [:boyo:Sewon Kim] ebow31@gmail.com
 * [:mskim:Minsu Kim] minsu.kim@gmail.com
