hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "Hbase/RDF" by InchulSong
Date Mon, 20 Aug 2007 21:08:14 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by InchulSong:
http://wiki.apache.org/lucene-hadoop/Hbase/RDF

------------------------------------------------------------------------------
- 
  [[TableOfContents(4)]]
  ----
- = HbaseRDF, an Hbase Subsystem for RDF =
+ == HbaseRDF, an Hbase Subsystem for RDF ==
  
-  -- ''Any comments on HbaseRDF are welcomed.''
+  -- ''Volunteers and any comments on HbaseRDF are welcome.''
  
  We have started to think about storing and querying RDF data in Hbase, but we'll jump into
its implementation only after prudent investigation. 
  
- We propose an Hbase subsystem for RDF called HbaseRDF, which uses Hbase + MapReduce to store
RDF data and execute queries (e.g., SPARQL) on them.
+ We call for the development of an Hbase subsystem for RDF, called HbaseRDF, which uses
Hbase + MapReduce to store RDF data and execute queries (e.g., SPARQL) on it.
  We can store very sparse RDF data in a single table in Hbase, with as many columns as 
  needed. For example, we might make a row for each RDF subject in a table and store all
the properties and their values as columns in the table. 
  This reduces the costly self-joins needed to answer queries about the same subject,
resulting in more efficient query processing, although we still need self-joins to answer
RDF path queries.
@@ -18, +17 @@

  parallel, distributed query processing. 
  
  === Related projects ===
-   * The issue [https://issues.apache.org/jira/browse/HADOOP-1608 Relational Algrebra Operators]
is designing and implementing relational algebra operators. See [http://wiki.apache.org/lucene-hadoop/Hbase/ShellPlans
Algebric Tools] for various algebric operators we are designing and planing to implement,
including relational algebra operators.
+  * The issue [https://issues.apache.org/jira/browse/HADOOP-1608 HADOOP-1608 Relational Algebra
Operators] is designing and implementing relational algebra operators. See [http://wiki.apache.org/lucene-hadoop/Hbase/ShellPlans
Algebraic Tools] for the various algebraic operators we are designing and planning to implement,
including relational algebra operators.
-   * [http://wiki.apache.org/lucene-hadoop/Hbase/HbaseShell HbaseShell] provides a command
line tool in which we can manipulate tables in Hbase. We are also planning to use HbaseShell
to manipulate and query RDF data to be stored in Hbase.
+  * [http://wiki.apache.org/lucene-hadoop/Hbase/HbaseShell HbaseShell] provides a command-line
tool with which we can manipulate tables in Hbase. We are also planning to use HbaseShell
to manipulate and query RDF data to be stored in Hbase.
+  * [https://issues.apache.org/jira/browse/HADOOP-1120 contrib/data_join] provides helper
classes for implementing data join operations as MapReduce jobs. Thanks to Runping.
   
- == Initial Contributors ==
+ === Initial Contributors ===
  
   * [:udanax:Edward Yoon] [[MailTo(webmaster AT SPAMFREE udanax DOT org)]] (Research and
Development center, NHN corp.)
-  * [:InchulSong: Inchul Song] [[MailTo(icsong AT SPAMFREE gmail DOT com)]] (Database Lab.
, KAIST) 
+  * [:InchulSong: Inchul Song] [[MailTo(icsong AT SPAMFREE gmail DOT com)]] (Database Lab,
KAIST) 
  
- == Considerations ==
+ == Some Ideas ==
  When we store RDF data in a single Hbase table and process queries on it, an important
issue to consider is how to efficiently perform the costly self-joins needed to process
RDF path queries. 
  
  To speed up these costly self-joins, it is natural to think about using 
@@ -55, +55 @@

  Currently, C-Store shows the best query performance on RDF data.
  However, armed with Hbase and MapReduceMerge, we can do even better.
  
+ == Resources ==
+  * http://www.w3.org/TR/rdf-sparql-query/ - The SPARQL RDF Query Language, a candidate recommendation
of W3C as of 14 June 2007.
+  * A test suite for SPARQL can be found at http://www.w3.org/2001/sw/DataAccess/tests/r2.
The web page provides test RDF data, SPARQL queries, and expected results.
+  * [https://jena.svn.sourceforge.net/svnroot/jena/ARQ/trunk/Grammar/sparql.jj SPARQL Grammar
in JavaCC] - from Jena ARQ
+  * [http://esw.w3.org/topic/LargeTripleStores Large triple stores]
+ 
+ == Architecture Sketch ==
+ 
- == HbaseRDF Data Loader ==
+ === HbaseRDF Data Loader ===
  HbaseRDF Data Loader (HDL) reads RDF data from a file and organizes the data 
  into an Hbase table in such a way that efficient query processing is possible. In Hbase,
we can store everything in a single table.
  The sparsity of RDF data is not a problem, because Hbase, which is 
@@ -65, +73 @@

  HDL reads one triple at a time and inserts it into an Hbase table as follows:
  
  {{{#!python numbering=off
- value_count = 0
+ value_count = 1
  for s, p, o in triples:
    insert into rdf_table ('p:value_count') values ('o')
      where row='s'
    value_count = value_count + 1
  }}}
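A runnable sketch of the pseudocode above, with a plain Python dict standing in for the Hbase table. One assumption made here: the value counter is kept per subject/property pair rather than as a single global counter, so that `p:1`, `p:2`, ... enumerate the values of property `p` for one subject. The function name `load_triples` is hypothetical.

```python
from collections import defaultdict

def load_triples(triples):
    """Organize RDF triples into a row-per-subject layout.

    The table is modelled as a plain dict (a stand-in for an Hbase
    table): table[subject][column] = object, where the column name is
    'property:n' and n distinguishes multiple values of one property.
    """
    table = defaultdict(dict)
    counters = defaultdict(int)  # per (subject, property) value counter
    for s, p, o in triples:
        counters[(s, p)] += 1
        table[s]['%s:%d' % (p, counters[(s, p)])] = o
    return dict(table)
```

With the C-Store example data, `load_triples([("ID1", "type", "BookType"), ("ID1", "title", "XYZ")])` yields one row, `{"ID1": {"type:1": "BookType", "title:1": "XYZ"}}` -- all properties of a subject land in the same row, which is what lets same-subject queries avoid self-joins.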
  
- Examples with the data from C-Store.
- 
- {{{#!CSV ;  
- Subj.; Prop.; Obj.
- ID1; type; BookType
- ID1; title; “XYZ”
- ID1; author; “Fox, Joe”
- ID1; copyright; “2001”
- ID2; type; CDType
- ID2; title; “ABC”
- ID2; artist; “Orr, Tim”
- ID2; copyright; “1985”
- ID2; language; “French”
- ID3; type; BookType
- ID3; title; “MNO”
- ID3; language; “English”
- ID4; type; DVDType
- ID4; title; “DEF”
- ID5; type; CDType
- ID5; title; “GHI”
- ID5; copyright; “1995”
- ID6; type; BookType
- ID6; copyright; “2004”
- }}}
- 
- == HbaseRDF Query Processor ==
+ === HbaseRDF Query Processor ===
  HbaseRDF Query Processor (HQP) executes RDF queries on RDF data stored in a Hbase table.

  It translates RDF queries into Hbase API calls or MapReduce jobs, then gathers the results
and returns them to the user. 
@@ -108, +91 @@

   * Query rewrite, in which the parse tree is converted to an initial query plan, which is,
in turn, transformed into an equivalent plan that is expected to take less time to execute.
We then choose which algorithm to use for each operation in the selected plan; among the
candidates are parallel algorithms, such as parallel joins with MapReduceMerge.
   * Execute the plan
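Execution can be sketched in miniature for a single triple pattern. Assuming the row-per-subject layout described earlier (modelled here as a plain dict), a pattern with a fixed subject becomes a one-row lookup, while an unbound subject forces a full scan -- the case HQP would hand off to a MapReduce job. The function name `match_pattern` is hypothetical; this is a toy stand-in, not the planned implementation.

```python
def match_pattern(table, s, p, o):
    """Match one (s, p, o) triple pattern against a row-per-subject
    table (table[subject]['property:n'] = value). None acts as a
    variable; returns the matching (subject, property, value) triples.
    """
    # A fixed subject narrows the scan to a single row lookup.
    rows = [s] if s is not None else list(table)
    results = []
    for subj in rows:
        for col, val in table.get(subj, {}).items():
            prop = col.split(':', 1)[0]  # strip the value counter suffix
            if p is not None and prop != p:
                continue
            if o is not None and val != o:
                continue
            results.append((subj, prop, val))
    return results
```

For example, the SPARQL-style pattern `ID1 title ?x` maps to `match_pattern(table, "ID1", "title", None)`, a single-row read.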
   
- == HbaseRDF Data Materializer ==
+ === HbaseRDF Data Materializer ===
  HbaseRDF Data Materializer (HDM) pre-computes RDF path queries and stores the results
  in an Hbase table. Later, HQP uses this materialized data for efficient processing of 
  RDF path queries. 
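A minimal sketch of what such pre-computation could look like for a 2-step path, again with a dict standing in for the Hbase table; the function name `materialize_path` and the derived column naming `'p1/p2:n'` are assumptions for illustration only.

```python
def materialize_path(table, p1, p2):
    """Pre-compute the 2-step path query  ?s p1 ?x . ?x p2 ?y  over a
    row-per-subject table (table[subject]['property:n'] = value) and
    return the results as a derived table keyed by subject, under a
    'p1/p2:n' column -- the kind of result HDM would store back for
    HQP to reuse instead of re-running the self-join.
    """
    derived = {}
    for s, cols in table.items():
        n = 0
        for col, x in cols.items():
            if col.split(':', 1)[0] != p1:
                continue
            # Follow the intermediate node x, if it is itself a subject.
            for col2, y in table.get(x, {}).items():
                if col2.split(':', 1)[0] == p2:
                    n += 1
                    derived.setdefault(s, {})['%s/%s:%d' % (p1, p2, n)] = y
    return derived
```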
  
- == Hbase Shell Extention ==
+ === Hbase Shell Extension ===
- === Hbase Shell - RDF Shell ===
+ 
  {{{
  Hbase > rdf;
  
@@ -134, +117 @@

  Hbase > 
  }}}
  
- === Hbase SPARQL ===
-  * Support for the full SPARQL syntax
-  * Support for a syntax to load RDF data into an Hbase table
- 
  == Alternatives ==
  * A triples table stores RDF triples in a single table with three attributes: subject,
property, and object.
  
@@ -146, +125 @@

  * A decomposed storage model (DSM): one table for each property, sorted by the subject.
Used in C-Store.
    * ''Actually, the decomposed storage model is almost the same as the storage model in
Hbase.''
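The cost the row-per-subject layout is meant to avoid can be illustrated with the triples-table alternative: answering the path query `?s p1 ?x . ?x p2 ?y` there requires a self-join of the table with itself. A toy hash-join sketch, with a list of tuples standing in for the triples table (the function name is hypothetical):

```python
def path_query_triples_table(triples, p1, p2):
    """Answer  ?s p1 ?x . ?x p2 ?y  over a plain triples table via a
    hash self-join: build a (subject, property) -> objects index, then
    probe it with the object of every p1 triple.
    """
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault((s, p), []).append(o)  # build side of the join
    results = []
    for s, p, o in triples:                          # probe side of the join
        if p == p1:
            for y in by_subject.get((o, p2), []):
                results.append((s, y))
    return results
```

Same-subject queries pay this join cost too in a triples table, whereas the row-per-subject layout answers them with a single row read.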
  
- == Hbase Storage for RDF ==
- 
- ~-''Do explain : Why do we think about storing and retrieval RDF in Hbase? -- udanax''-~
- 
- [http://www.hadoop.co.kr/wiki/moin.cgi/Hbase/RDF?action=AttachFile&do=get&target=bt_lay.jpg]
- 
- == Hbase RDF Storage Subsystems Architecture ==
-  * [:Hbase/RDF/Architecture] Hbase RDF Storage Subsystems Architecture.
-  * [:Hbase/HbaseShell/HRQL] Hbase Shell RDF Query Language.
- 
- ----
- = Papers =
+ == Papers ==
  
  * ~-OSDI 2004, ''MapReduce: Simplified Data Processing on Large Clusters'' - proposes a very
simple, but powerful and highly parallelized data processing technique.-~
   * ~-CIDR 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadicidr07.pdf Column-Stores For
Wide and Sparse Data]'' - discusses the benefits of using C-Store to store RDF and XML data.-~
