hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "Hbase/RDF" by InchulSong
Date Fri, 17 Aug 2007 03:13:50 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by InchulSong:
http://wiki.apache.org/lucene-hadoop/Hbase/RDF

The comment on the change is:
a related work added.

------------------------------------------------------------------------------
  ----
  = Hbase RDF Storage Subsystems =
  
+  -- ''Any comments on HbaseRDF are welcomed.''
- We Start to think about storing and querying RDF and RDF Schema in Hbase. 
- [[BR]]but we'll do it at the last after prudence investigation and Hbase shell's HQL, Altools
POC review.
  
+ We have started to think about storing and querying RDF data in Hbase. But we'll jump into
its implementation after prudence investigation. 
+ 
- We propose a Hbase subsystem for RDF called HbaseRDF, which uses Hbase + MapReduce to store
RDF data and execute queries (e.g., SPARQL) on them.
+ We propose an Hbase subsystem for RDF called HbaseRDF, which uses Hbase + MapReduce to store
RDF data and execute queries (e.g., SPARQL) on them.
  We can store very sparse RDF data in a single table in Hbase, with as many columns as 
  they need. For example, we might make a row for each RDF subject in a table and store all
the properties and their values as columns in the table. 
  This reduces costly self-joins, which results in efficient processing of queries, although
we still need self-joins for RDF path queries.
@@ -18, +19 @@

  == Initial Contributors ==
  
   * [:udanax:Edward Yoon] [[MailTo(udanax AT SPAMFREE nhncorp DOT com)]] (Research and Development
center, NHN corp.)
-  * [:InchulSong: Inchul Song] [[MailTo(icsong AT SPAMFREE gmail DOT com)]] (Database Lab
Division of Computer Science, KAIST) 
+  * [:InchulSong: Inchul Song] [[MailTo(icsong AT SPAMFREE gmail DOT com)]] (Database Lab.
, KAIST) 
  
  == Background ==
  
@@ -35, +36 @@

    * Strong points
  
  == Considerations ==
- The Sawzall paper says that the record-at-a-time model is not good for table joins. I think
this problem occurs in typical join operations. Think about what kinds of join operations
Sawzall or MapReduce can perform efficienly in parallel, or possibly process them at all at
the first time. 
+ The Sawzall paper says that the record-at-a-time model is not good for table joins. 
+ We think this problem occurs for typical join operations. 
  
- When we perform a nested loop join, for each tuple in the outer table, table R, we have
to go through the inner table, table S, to find 
+ Let us think how to implement general join operations with MapReduce. 
+ When we perform a nested loop join, for each tuple in the outer table, 
+ table R, we have to go through the inner table, table S, to find 
  all the joining tuples. To perform nested loop joins with MapReduce,
  we divide table R into M partitions, and each 
  map worker joins one partition of the table at a time with S. 
  Each map worker produces the amount of join results corresponding 
  to the partition it is in charge of.
  
- However, in merge joins, since table R is already sorted by 
+ For merge joins, since table R is already sorted by 
  subject, each map worker only needs to read logN tuples in table S, where
  N is the number of tuples in S, which results in fast join performance.
  
- C-Store also says that join operations can be performed efficiently in DSM because
- tables in DSM are sorted by subject, thus we can use merge joins on a sorted attribute.

+ We have been notified by one of the researchers who have already done 
+ a good research on this subject. There is a paper on this subject 
+ written by Yang. et al, from Yahoo (SIGMOD 07). The paper provides 
+ Map-Reduce-Merge, which is an extended version of MapReduce, 
+ that implements several relational operators including joins. 
+ See the Paper section below for more information.
  
- The key parts are how fast Sawzall or MapReduce performs merge joins and 
+ The key parts are how efficienly MapReduce performs joins and 
  how cleverly we materialize join results for efficient query processing of RDF path queries.
Especially, the formal part is the key to beat 
- C-Store query performance: defeat C-Store with massively parallelized query processing.
+ C-Store query performance: defeat C-Store with massively parallelized query 
+ processing.
  
  But the problem is that there is an initial delay in executing MapReduce jobs due to 
  the time spent in assigning the computations to multiple machines. This 
- might take far more time than necessary, thus hurt query response time. So, parallelism
obtained by using MapReduce is best enjoyable for queries over huge amount of RDF data, where
+ might take far more time than necessary, thus hurt query response time. So, parallelism
obtained by using MapReduce is best enjoyable for queries over huge amount of RDF data, where
it takes much time to process them. 
- it takes much time to process them. 
  We might consider a selective parallelism where 
  people can decide whether to use MapReduce or not to process their queries, as in 
  "select ... '''in parallel'''".
@@ -169, +177 @@

   * ~-OSDI 2004, MapReduce: Simplified Data Processing on Large Clusters" - proposes a very
simple, but powerfull, and highly parallelized data processing technique.-~
   * ~-CIDR 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadicidr07.pdf Column-Stores For
Wide and Sparse Data]'' - discusses the benefits of using C-Store to store RDF and XML data.-~
   * ~-VLDB 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadirdf.pdf Scalable Semantic
Web Data Management Using Vertical Partitoning]'' - proposes an efficient method to store
RDF data in table projections (i.e., columns) and executes queries on them.-~
+  * ~-SIGMOD 2007, ''Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters''
-- MapReduce implementation of several relational operators.-~
  

Mime
View raw message