hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "Hbase/RDF" by InchulSong
Date Sat, 18 Aug 2007 06:53:29 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by InchulSong:
http://wiki.apache.org/lucene-hadoop/Hbase/RDF

The comment on the change is:
related projects added

------------------------------------------------------------------------------
  [[TableOfContents(4)]]
  ----
- = HbaseRDF, an Hbase subsystem for RDF =
+ = HbaseRDF, an Hbase Subsystem for RDF =
  
   -- ''Any comments on HbaseRDF are welcomed.''
  
@@ -16, +16 @@

  We can further accelerate query performance by using MapReduce for 
  parallel, distributed query processing. 
  
+ === Related projects ===
+   * The issue [https://issues.apache.org/jira/browse/HADOOP-1608 Relational Algebra Operators] covers the design and implementation of relational algebra operators. See [http://wiki.apache.org/lucene-hadoop/Hbase/ShellPlans Algebraic Tools] for the various algebraic operators we are designing and planning to implement, including relational algebra operators.
+   * [http://wiki.apache.org/lucene-hadoop/Hbase/HbaseShell HbaseShell] provides a command-line tool for manipulating tables in Hbase. We also plan to use HbaseShell to manipulate and query RDF data stored in Hbase.
+  
  == Initial Contributors ==
  
-  * [:udanax:Edward Yoon] [[MailTo(webmaster AT SPAMFREE udanax DOT org)]] (Research and Development center, NHN corp.)
+  * [:udanax:Edward Yoon] [[MailTo(udanax AT SPAMFREE nhncorp DOT com)]] (Research and Development center, NHN corp.)
   * [:InchulSong: Inchul Song] [[MailTo(icsong AT SPAMFREE gmail DOT com)]] (Database Lab., KAIST)
  
  == Considerations ==
- The Sawzall paper from Google says that MapReduce framework 
- is not good for table joins. It is possible, but  while we are reading one table 
+ When we store RDF data in a single Hbase table and process queries on it, an important issue we have to consider is how to reduce the costly self-joins needed to process RDF path queries.
+ 
+ To speed up these costly self-joins, it is natural to consider using 
+ the MapReduce framework we already have. However, in the Sawzall paper from Google, the authors argue that the MapReduce framework is 
+ ill-suited for performing table joins. 
+ It is possible, but while we are reading one table in map 
- in map or reduce functions, we have to read other tables on the fly.
+ or reduce functions, we have to read the other tables on the fly, which 
+ results in less parallelized join processing.
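To make the join mechanics concrete, here is a hypothetical sketch (none of this code is from HbaseRDF) of a reduce-side self-join for the path query "?x knows ?y . ?y knows ?z", with map and reduce simulated in-process:

```python
# Hypothetical sketch of a reduce-side self-join over RDF triples for the
# path query "?x knows ?y . ?y knows ?z". In a real MapReduce job, map()
# and reduce() would run on separate machines; here we simulate the
# shuffle with a dict.
from collections import defaultdict

triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "knows", "dave"),
]

def map_phase(triples):
    # Emit each triple twice, keyed on the join variable ?y:
    # once by object (left side of the join), once by subject (right side).
    for s, p, o in triples:
        if p == "knows":
            yield (o, ("left", s))   # ?x knows ?y, keyed on ?y
            yield (s, ("right", o))  # ?y knows ?z, keyed on ?y

def reduce_phase(key, values):
    # Join the two sides that arrived under the same key (?y).
    lefts = [v for tag, v in values if tag == "left"]
    rights = [v for tag, v in values if tag == "right"]
    for x in lefts:
        for z in rights:
            yield (x, key, z)

groups = defaultdict(list)  # simulated shuffle: group values by key
for k, v in map_phase(triples):
    groups[k].append(v)

results = [r for k, vs in groups.items() for r in reduce_phase(k, vs)]
print(sorted(results))  # → [('alice', 'bob', 'carol'), ('alice', 'bob', 'dave')]
```

Note that the map phase must read (and re-emit) the same table twice, and each reducer buffers one side of the join in memory, which illustrates why a plain MapReduce join is less parallel than one would like.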
  
  There is a paper on this subject written by Yang et al., from Yahoo (SIGMOD 07). 
  The paper provides Map-Reduce-Merge, which is an extended version of the MapReduce framework,

  that implements several relational operators, including joins. They have extended the 
  MapReduce framework with an additional Merge phase to implement efficient data relationship
processing.
  See the Paper section below for more information. -- Thanks stack.
+ (Somebody help us here!)
  
  But the problem is that there is an initial delay in executing MapReduce jobs due to 
  the time spent in assigning the computations to multiple machines. This 
@@ -46, +56 @@

  
  == HbaseRDF Data Loader ==
  HbaseRDF Data Loader (HDL) reads RDF data from a file, and organizes the data 
- into a Hbase table in such a way that efficient query processing is possible.
+ into a Hbase table in such a way that efficient query processing is possible. In Hbase, we can store everything in a single table.
+ The sparsity of RDF data is not a problem, because Hbase, which is 
+ a column-oriented store that adopts various compression techniques, 
+ is very good at dealing with nulls in the table.
+ 
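As a toy illustration (hypothetical code, not the HDL implementation), the single-table layout can be sketched as one row per subject with one column per predicate, where unused columns are simply absent:

```python
# Hypothetical sketch: RDF triples stored in a single wide, sparse table,
# one row per subject and one column per predicate. Columns a subject does
# not use are simply absent (a column store keeps no cell for them), so
# the "nulls" cost nothing.
triples = [
    ("alice", "name", "Alice"),
    ("alice", "knows", "bob"),
    ("bob", "email", "bob@example.org"),
]

table = {}  # row key (subject) -> {column (predicate): value (object)}
for s, p, o in triples:
    table.setdefault(s, {})[p] = o

print(table["alice"])             # → {'name': 'Alice', 'knows': 'bob'}
print("email" in table["alice"])  # → False: no stored null for this column
```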
- It reads a triple at a time and inserts the triple into a Hbase table as follows:
+ HDL reads a triple at a time and inserts the triple into a Hbase table as follows:
  
  {{{#!python numbering=off
  value_count = 0
@@ -90, +104 @@

  Query processing steps are as follows:
  
   * Parsing, in which a parse tree, representing the SPARQL query is constructed.
-  * Query rewrite, in which the parse tree is converted to an initial query plan, which is,
in turn, transformed into an equivalent plan that is expected to require less time to execute.
We have to choose which algorithm to use for each operation in the selected plan. Among them
are MapReduce jobs for parallel algorithms.
+  * Query rewrite, in which the parse tree is converted to an initial query plan, which is, in turn, transformed into an equivalent plan that is expected to take less time to execute. We then choose which algorithm to use for each operation in the selected plan; among the candidates are parallel versions of algorithms, such as parallel joins with Map-Reduce-Merge.
   * Execute the plan
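The steps above can be sketched with toy data structures (hypothetical code; none of these names come from HbaseRDF):

```python
# Hypothetical sketch of the query-processing steps: parse a SPARQL-like
# basic graph pattern into a "parse tree", then rewrite it into a plan of
# scans joined pairwise, choosing an algorithm (here a parallel join) for
# each join operator.

def parse(query):
    # Parsing: split the query into triple patterns (a trivial parse tree).
    patterns = [tuple(tp.split()) for tp in query.strip(" .").split(" . ")]
    return {"type": "bgp", "patterns": patterns}

def rewrite(tree):
    # Query rewrite: build an initial left-deep plan of scans and joins,
    # and pick an algorithm for each join operator.
    plan = {"op": "scan", "pattern": tree["patterns"][0]}
    for p in tree["patterns"][1:]:
        plan = {"op": "join", "algorithm": "parallel-merge-join",
                "left": plan, "right": {"op": "scan", "pattern": p}}
    return plan

tree = parse("?x knows ?y . ?y knows ?z .")
plan = rewrite(tree)
print(plan["op"], plan["algorithm"])  # → join parallel-merge-join
```

Executing such a plan would then map each operator to Hbase scans and, for the joins, to MapReduce (or Map-Reduce-Merge) jobs.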
   
  == HbaseRDF Data Materializer ==
@@ -137, +151 @@

  
  [http://www.hadoop.co.kr/wiki/moin.cgi/Hbase/RDF?action=AttachFile&do=get&target=bt_lay.jpg]
  
- == Rationale ==
- 
-  * What is RDF
-  * Previous methods for storing RDF data and processing queries
-   * Their weak points
-  * The method in Hbase
-   * Strong points
- 
- == Food for thought ==
-  * What are the differences between Hbase and C-Store.
-  
-  * Is DSM suitable for Hbase?
- 
-  * How to translate SPARQL queries into MapReduce functions, or Hbase APIs. 
- 
  == Hbase RDF Storage Subsystems Architecture ==
- 
   * [:Hbase/RDF/Architecture] Hbase RDF Storage Subsystems Architecture.
   * [:Hbase/HbaseShell/HRQL] Hbase Shell RDF Query Language.
  
