hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Trivial Update of "Hbase/RDF" by udanax
Date Fri, 07 Dec 2007 01:41:56 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by udanax:
http://wiki.apache.org/lucene-hadoop/Hbase/RDF

------------------------------------------------------------------------------
  
  We have started to think about storing and querying RDF data in Hbase, but we will jump into its implementation only after prudent investigation. 
  
- We introduce an Hbase subsystem for RDF, called HbaseRDF, which uses Hbase + MapReduce to
store RDF data and execute queries (e.g., SPARQL) on them.
+ We introduce an Hbase subsystem for RDF, called HbaseRDF, which uses Hbase + !MapReduce
to store RDF data and execute queries (e.g., SPARQL) on them.
  We can store very sparse RDF data in a single table in Hbase, with as many columns as the data needs. For example, we might make a row for each RDF subject and store all of its properties and their values as columns in the table. 
  This reduces the need for costly self-joins when answering queries about a single subject, which makes query processing more efficient, although we still need self-joins to answer RDF path queries.
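  As a minimal sketch of this layout (the table name ''rdf'', the column family ''p'', and the example URIs are invented for illustration, and the Java client API shown is today's rather than the 2007-era one):
{{{
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RdfRowPerSubject {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("rdf"))) {

      // One row per subject; each property becomes a column qualifier in
      // family "p", with the property value stored as the cell value.
      Put put = new Put(Bytes.toBytes("http://example.org/alice"));
      put.addColumn(Bytes.toBytes("p"), Bytes.toBytes("foaf:name"), Bytes.toBytes("Alice"));
      put.addColumn(Bytes.toBytes("p"), Bytes.toBytes("foaf:knows"), Bytes.toBytes("http://example.org/bob"));
      table.put(put);

      // All properties of a subject come back with a single row lookup, so
      // subject-centric queries need no self-join.
      Result row = table.get(new Get(Bytes.toBytes("http://example.org/alice")));
      row.getFamilyMap(Bytes.toBytes("p")).forEach((qualifier, value) ->
          System.out.println(Bytes.toString(qualifier) + " -> " + Bytes.toString(value)));
    }
  }
}
}}}
  The sketch ignores multi-valued properties, which would need versioned cells or suffixed qualifiers.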
  
- We can further accelerate query performance by using MapReduce for 
+ We can further accelerate query performance by using !MapReduce for 
  parallel, distributed query processing. 
  
  === Related projects ===
  
-  * [:Hbase/HbaseShell: HbaseShell] provides a command-line tool with which we can manipulate tables in Hbase. We are also planning to use HbaseShell to manipulate and query RDF data stored in Hbase.
+  * [:Hbase/HbaseShell: Hbase Shell] provides a command-line tool with which we can manipulate tables in Hbase. We are also planning to use !HbaseShell to manipulate and query RDF data stored in Hbase.
   * People in [http://www.openrdf.org/forum/mvnforum/viewthread?thread=1423 a forum at Aduna/Sesame] would be interested in working with this group.
   
  === Initial Contributors ===
@@ -29, +29 @@

  When we store RDF data in a single Hbase table and process queries over it, an important issue to consider is how to efficiently perform the costly self-joins needed to answer RDF path queries (for example, a path query such as ''?x foaf:knows ?y . ?y foaf:knows ?z'' joins the subject table with itself on the intermediate resource ?y). 
  
  To speed up these costly self-joins, it is natural to think about using 
- the MapReduce framework we already have. However, in the Sawzall paper from Google, the authors say that the MapReduce framework is 
+ the !MapReduce framework we already have. However, in the Sawzall paper from Google, the authors say that the !MapReduce framework is 
  not well suited, or even inappropriate, for performing table joins. 
  Joins are possible, but while we are reading one table in the map or reduce functions, we have to read the other tables on the fly, which results in less parallelized join processing.
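  For illustration, a common reduce-side join pattern in plain !MapReduce for the two-step ''knows'' path query might look like the following sketch; the flat ''subject<TAB>predicate<TAB>object'' input format and the class names are assumptions made for the example, not how the data would actually sit in Hbase:
{{{
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce-side self-join answering the path query "?x knows ?y . ?y knows ?z".
// Input: one triple per line, "subject<TAB>predicate<TAB>object".
public class KnowsPathJoin {

  public static class TripleMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] t = line.toString().split("\t");
      if (t.length != 3 || !"foaf:knows".equals(t[1])) return;
      // Emit each triple twice, tagged with its role in the join and keyed by
      // the shared variable ?y: once as "x knows y", once as "y knows z".
      ctx.write(new Text(t[2]), new Text("L\t" + t[0]));
      ctx.write(new Text(t[0]), new Text("R\t" + t[2]));
    }
  }

  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text y, Iterable<Text> tagged, Context ctx)
        throws IOException, InterruptedException {
      // Buffer both sides of the join for this intermediate resource ?y.
      List<String> xs = new ArrayList<>();
      List<String> zs = new ArrayList<>();
      for (Text v : tagged) {
        String[] p = v.toString().split("\t", 2);
        if ("L".equals(p[0])) xs.add(p[1]); else zs.add(p[1]);
      }
      // Every (x, z) pair that shares this y is an answer to the path query.
      for (String x : xs)
        for (String z : zs)
          ctx.write(new Text(x), new Text(z));
    }
  }
}
}}}
  The reducer has to buffer all matching triples for each intermediate resource before it can emit a single joined pair, which is exactly the kind of awkwardness described above.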
  
  There is a paper on this subject written by Yang et al., from Yahoo (SIGMOD 07). 
- The paper proposes Map-Reduce-Merge, an extended version of the MapReduce framework 
+ The paper proposes Map-Reduce-Merge, an extended version of the !MapReduce framework 
  that implements several relational operators, including joins. They have extended the 
- MapReduce framework with an additional Merge phase to support efficient processing of data relationships.
+ !MapReduce framework with an additional Merge phase to support efficient processing of data relationships.
  See the Papers section below for more information. -- Thanks stack.
- (Edward is now implementing join operators using the MapReduce framework.)
+ (Edward is now implementing join operators using the !MapReduce framework.)
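  A rough Java rendering of the map/reduce/merge signatures sketched in that paper; these interfaces are purely illustrative (Hadoop itself has no Merge phase), and the type parameters simply stand in for the paper's key and value types:
{{{
import java.util.List;
import java.util.Map;

// Illustrative only: the alpha and beta lineages each run an ordinary
// map/reduce pass, and the extra merge step then combines their reduced
// outputs, e.g. to join records that share a key.
public interface MapReduceMerge<K1, V1, K2, V2, V3, K3, V4, K4, V5> {

  List<Map.Entry<K2, V2>> map(K1 key, V1 value);

  Map.Entry<K2, List<V3>> reduce(K2 key, List<V2> values);

  List<Map.Entry<K4, V5>> merge(K2 alphaKey, List<V3> alphaValues,
                                K3 betaKey, List<V4> betaValues);
}
}}}
  Because the merger sees the already-reduced outputs of two separate lineages, a join no longer has to be squeezed into a single map or reduce function.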
  
- But the problem is that there is an initial delay in executing MapReduce jobs due to 
+ But the problem is that there is an initial delay in executing !MapReduce jobs due to 
  the time spent assigning the computations to multiple machines. This 
- might take far more time than necessary and thus hurt query response time. So the parallelism obtained by using MapReduce is most beneficial for queries over huge amounts of RDF data, which take a long time to process anyway. 
+ might take far more time than necessary and thus hurt query response time. So the parallelism obtained by using !MapReduce is most beneficial for queries over huge amounts of RDF data, which take a long time to process anyway. 
  We might consider selective parallelism, where 
- people can decide whether or not to use MapReduce to process their queries, as in 
+ people can decide whether or not to use !MapReduce to process their queries, as in 
  "select ... '''in parallel'''".
  
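  A purely hypothetical sketch of how such a hint might be handled in a query front end; the ''QueryRunner'' interface, its methods, and the hint parsing are invented here to illustrate the idea and are not an existing Hbase or HbaseRDF API:
{{{
// Hypothetical dispatcher for a selective-parallelism hint: the user decides,
// per query, whether the start-up cost of a MapReduce job is worth paying.
public class SelectiveParallelism {

  public interface QueryRunner {
    Iterable<String[]> runLocalScan(String query);      // direct Hbase API calls
    Iterable<String[]> runMapReduceJob(String query);   // submit a parallel job
  }

  public static Iterable<String[]> execute(String query, QueryRunner runner) {
    String trimmed = query.trim();
    if (trimmed.toLowerCase().endsWith("in parallel")) {
      // Strip the hint and pay the MapReduce start-up cost only when asked to.
      String bare = trimmed.substring(0, trimmed.length() - "in parallel".length()).trim();
      return runner.runMapReduceJob(bare);
    }
    return runner.runLocalScan(trimmed);
  }
}
}}}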
- Now that we have two sets of join algorithms, non-parallel versions and parallel versions with MapReduceMerge,
+ Now that we have two sets of join algorithms, non-parallel versions and parallel versions with !MapReduceMerge,
  we are ready to do massively parallel query processing on tremendous amounts of RDF data.
  Currently, C-Store shows the best query performance on RDF data.
- However, armed with Hbase and MapReduceMerge, we can do even better.
+ However, armed with Hbase and !MapReduceMerge, we can do even better.
  ----
  == Resources ==
   * http://www.w3.org/TR/rdf-sparql-query/ - The SPARQL RDF Query Language, a candidate recommendation
of W3C as of 14 June 2007.
@@ -72, +72 @@

  
  === HbaseRDF Query Processor ===
  The HbaseRDF Query Processor (HQP) executes queries on RDF data stored in an Hbase table.
- It translates RDF queries into API calls to Hbase, or MapReduce jobs, then gathers and returns the results 
+ It translates RDF queries into API calls to Hbase, or !MapReduce jobs, then gathers and returns the results 
  to the user. 
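  As one plausible sketch of the simplest case, assuming the row-per-subject layout above (table ''rdf'', family ''p''), a triple pattern whose subject and predicate are both bound can be answered with a single Get instead of a !MapReduce job; the class and method names here are invented for illustration:
{{{
import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative only: answer a pattern such as
// { <http://example.org/alice> foaf:knows ?who } with one row lookup.
public class SimplePatternTranslator {

  public static String[] lookup(Table rdfTable, String subject, String predicate)
      throws IOException {
    Get get = new Get(Bytes.toBytes(subject));
    get.addColumn(Bytes.toBytes("p"), Bytes.toBytes(predicate));
    Result result = rdfTable.get(get);
    byte[] object = result.getValue(Bytes.toBytes("p"), Bytes.toBytes(predicate));
    return object == null ? new String[0] : new String[] { Bytes.toString(object) };
  }
}
}}}
  Patterns with unbound subjects, or chains of patterns, would instead become table scans or the join jobs discussed above.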
  
  Query processing steps are as follows:
@@ -117, +117 @@

  ----
  == Papers ==
  
-  * OSDI 2004, ''MapReduce: Simplified Data Processing on Large Clusters'', proposes a very simple but powerful and highly parallelized data processing technique.
+  * OSDI 2004, ''!MapReduce: Simplified Data Processing on Large Clusters'', proposes a very simple but powerful and highly parallelized data processing technique.
   * CIDR 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadicidr07.pdf Column-Stores For
Wide and Sparse Data]'', discusses the benefits of using C-Store to store RDF and XML data.
   * VLDB 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadirdf.pdf Scalable Semantic Web Data Management Using Vertical Partitioning]'', proposes an efficient method to store RDF data in table projections (i.e., columns) and execute queries on them.
-  * SIGMOD 2007, ''Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters'', a MapReduce implementation of several relational operators.
+  * SIGMOD 2007, ''Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters'', a !MapReduce implementation of several relational operators.
  
