From: Apache Wiki
To: hadoop-commits@lucene.apache.org
Date: Mon, 20 Aug 2007 21:08:14 -0000
Message-ID: <20070820210814.9398.58486@eos.apache.org>
Subject: [Lucene-hadoop Wiki] Update of "Hbase/RDF" by InchulSong

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by InchulSong:
http://wiki.apache.org/lucene-hadoop/Hbase/RDF

------------------------------------------------------------------------------
- [[TableOfContents(4)]]
  ----
- = HbaseRDF, an Hbase Subsystem for RDF =
+ == HbaseRDF, an Hbase Subsystem for RDF ==
- -- ''Any comments on HbaseRDF are welcomed.''
+ -- ''Volunteers and any comments on HbaseRDF are welcome.''
  We have started to think about storing and querying RDF data in Hbase,
  but we will jump into the implementation only after a prudent investigation.
- We propose an Hbase subsystem for RDF called HbaseRDF, which uses Hbase + MapReduce to store RDF data and execute queries (e.g., SPARQL) on them.
+ We call for the introduction of an Hbase subsystem for RDF, called HbaseRDF, which uses Hbase + MapReduce to store RDF data and execute queries (e.g., SPARQL) on them.
  We can store very sparse RDF data in a single Hbase table, with as many columns as needed. For example, we might create a row for each RDF subject and store all of its properties and their values as columns in that row. This avoids the costly self-joins otherwise needed to answer queries about a single subject, which makes query processing more efficient, although we still need self-joins to answer RDF path queries.
@@ -18, +17 @@
  parallel, distributed query processing.

  === Related projects ===
- * The issue [https://issues.apache.org/jira/browse/HADOOP-1608 Relational Algebra Operators] is designing and implementing relational algebra operators. See [http://wiki.apache.org/lucene-hadoop/Hbase/ShellPlans Algebraic Tools] for the various algebraic operators we are designing and planning to implement, including relational algebra operators.
+ * The issue [https://issues.apache.org/jira/browse/HADOOP-1608 HADOOP-1608 Relational Algebra Operators] is designing and implementing relational algebra operators.
See [http://wiki.apache.org/lucene-hadoop/Hbase/ShellPlans Algebraic Tools] for the various algebraic operators we are designing and planning to implement, including relational algebra operators.
- * [http://wiki.apache.org/lucene-hadoop/Hbase/HbaseShell HbaseShell] provides a command-line tool with which we can manipulate tables in Hbase. We are also planning to use HbaseShell to manipulate and query RDF data stored in Hbase.
+ * [http://wiki.apache.org/lucene-hadoop/Hbase/HbaseShell HbaseShell] provides a command-line tool with which we can manipulate tables in Hbase. We are also planning to use HbaseShell to manipulate and query RDF data stored in Hbase.
+ * [https://issues.apache.org/jira/browse/HADOOP-1120 contrib/data_join] provides helper classes for implementing data join operations as MapReduce jobs. Thanks to Runping.

- == Initial Contributors ==
+ === Initial Contributors ===
  * [:udanax:Edward Yoon] [[MailTo(webmaster AT SPAMFREE udanax DOT org)]] (Research and Development Center, NHN Corp.)
- * [:InchulSong: Inchul Song] [[MailTo(icsong AT SPAMFREE gmail DOT com)]] (Database Lab. , KAIST)
+ * [:InchulSong: Inchul Song] [[MailTo(icsong AT SPAMFREE gmail DOT com)]] (Database Lab, KAIST)

- == Considerations ==
+ == Some Ideas ==
  When we store RDF data in a single Hbase table and process queries on it, an important issue is how to efficiently perform the costly self-joins needed to process RDF path queries. To speed up these costly self-joins, it is natural to think about using
@@ -55, +55 @@
  Currently, C-Store shows the best query performance on RDF data.
  However, we, armed with Hbase and MapReduceMerge, can do even better.

+ == Resources ==
+ * http://www.w3.org/TR/rdf-sparql-query/ - The SPARQL RDF Query Language, a W3C candidate recommendation as of 14 June 2007.
+ * A test suite for SPARQL can be found at http://www.w3.org/2001/sw/DataAccess/tests/r2. The web page provides test RDF data, SPARQL queries, and expected results.
+ * [https://jena.svn.sourceforge.net/svnroot/jena/ARQ/trunk/Grammar/sparql.jj SPARQL Grammar in JavaCC] - from Jena ARQ
+ * [http://esw.w3.org/topic/LargeTripleStores Large triple stores]
+
+ == Architecture Sketch ==
+
- == HbaseRDF Data Loader ==
+ === HbaseRDF Data Loader ===
  HbaseRDF Data Loader (HDL) reads RDF data from a file and organizes the data into an Hbase table in such a way that efficient query processing is possible. In Hbase, we can store everything in a single table. The sparsity of RDF data is not a problem, because Hbase, which is
@@ -65, +73 @@
  HDL reads one triple at a time and inserts it into an Hbase table as follows:
  {{{#!python numbering=off
- value_count = 0
+ value_count = 1
  for s, p, o in triples:
      insert into rdf_table ('p:value_count') values ('o') where row='s'
      value_count = value_count + 1
  }}}
- Examples with the data from C-Store.
-
- {{{#!CSV ;
- Subj.; Prop.; Obj.
- ID1; type; BookType
- ID1; title; “XYZ”
- ID1; author; “Fox, Joe”
- ID1; copyright; “2001”
- ID2; type; CDType
- ID2; title; “ABC”
- ID2; artist; “Orr, Tim”
- ID2; copyright; “1985”
- ID2; language; “French”
- ID3; type; BookType
- ID3; title; “MNO”
- ID3; language; “English”
- ID4; type; DVDType
- ID4; title; “DEF”
- ID5; type; CDType
- ID5; title; “GHI”
- ID5; copyright; “1995”
- ID6; type; BookType
- ID6; copyright; “2004”
- }}}
-
- == HbaseRDF Query Processor ==
+ === HbaseRDF Query Processor ===
  HbaseRDF Query Processor (HQP) executes RDF queries on RDF data stored in an Hbase table. It translates RDF queries into Hbase API calls or MapReduce jobs, then gathers the results and returns them to the user.
@@ -108, +91 @@
  * Query rewrite, in which the parse tree is converted to an initial query plan, which is, in turn, transformed into an equivalent plan that is expected to require less time to execute. We have to choose which algorithm to use for each operation in the selected plan.
Among them are parallel versions of algorithms, such as parallel joins with MapReduceMerge.
  * Execute the plan

- == HbaseRDF Data Materializer ==
+ === HbaseRDF Data Materializer ===
  HbaseRDF Data Materializer (HDM) pre-computes RDF path queries and stores the results in an Hbase table. Later, HQP uses those materialized data for efficient processing of RDF path queries.

- == Hbase Shell Extention ==
+ === Hbase Shell Extension ===
- === Hbase Shell - RDF Shell ===
+
  {{{
  Hbase > rdf;
@@ -134, +117 @@
  Hbase >
  }}}

- === Hbase SPARQL ===
- * Support for the full SPARQL syntax
- * Support for a syntax to load RDF data into an Hbase table

  == Alternatives ==
  * A triples table stores RDF triples in a single table with three attributes: subject, property, and object.
@@ -146, +125 @@
  * A decomposed storage model (DSM): one table for each property, sorted by the subject. Used in C-Store.
  * ''Actually, the decomposed storage model is almost the same as the storage model in Hbase.''

- == Hbase Storage for RDF ==
-
- ~-''Do explain: Why do we think about storing and retrieving RDF in Hbase? -- udanax''-~
-
- [http://www.hadoop.co.kr/wiki/moin.cgi/Hbase/RDF?action=AttachFile&do=get&target=bt_lay.jpg]
-
- == Hbase RDF Storage Subsystems Architecture ==
- * [:Hbase/RDF/Architecture] Hbase RDF Storage Subsystems Architecture.
- * [:Hbase/HbaseShell/HRQL] Hbase Shell RDF Query Language.
-
- ----
- = Papers =
+ == Papers ==
  * ~-OSDI 2004, ''MapReduce: Simplified Data Processing on Large Clusters'' - proposes a very simple but powerful and highly parallelized data processing technique.-~
  * ~-CIDR 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadicidr07.pdf Column-Stores For Wide and Sparse Data]'' - discusses the benefits of using C-Store to store RDF and XML data.-~
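The single-table storage model sketched by the HDL pseudocode (one row per subject, one `property:value_count` column per property value) can be illustrated with a plain-Python sketch. This is not the project's actual implementation: a nested dict stands in for the Hbase table, the names `load_triples` and `properties_of` are invented for illustration, and a per-property counter is used (rather than the pseudocode's single running counter) so that multi-valued properties get distinct columns.

{{{#!python numbering=off
# Sketch only: a dict of dicts stands in for an Hbase table.
def load_triples(triples):
    """Organize (subject, property, object) triples one row per subject."""
    table = {}
    for s, p, o in triples:
        row = table.setdefault(s, {})
        # Mimic the 'p:value_count' column naming: a counter per property
        # distinguishes multiple values of the same property.
        count = sum(1 for col in row if col.startswith(p + ':'))
        row['%s:%d' % (p, count + 1)] = o
    return table

def properties_of(table, subject):
    """All properties of a subject sit in one row, so a same-subject
    query is a single row lookup -- no self-join needed."""
    return table.get(subject, {})

triples = [
    ('ID1', 'type', 'BookType'),
    ('ID1', 'title', 'XYZ'),
    ('ID1', 'author', 'Fox, Joe'),
    ('ID2', 'type', 'CDType'),
]
table = load_triples(triples)
print(properties_of(table, 'ID1'))
# {'type:1': 'BookType', 'title:1': 'XYZ', 'author:1': 'Fox, Joe'}
}}}

Note how a query asking for several properties of `ID1` touches exactly one row, which is the efficiency argument made above for the single-table layout.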
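An RDF path query still needs self-joins under this layout: each step of the path is a lookup back into the same table. The sketch below, again with a dict standing in for the Hbase table and with invented names (`path_query`, `materialize_path`), shows both the chained lookups and the HDM-style idea of pre-computing a path for every subject so HQP can later read the answer directly.

{{{#!python numbering=off
# Sketch only: columns are named 'property:count' as in the loader pseudocode.
def path_query(table, start, props):
    """Follow a property path (a chain of self-joins over subject rows),
    e.g. start --author--> x --name--> y."""
    frontier = {start}
    for p in props:
        next_frontier = set()
        for subj in frontier:
            for col, obj in table.get(subj, {}).items():
                if col.split(':')[0] == p:
                    next_frontier.add(obj)
        frontier = next_frontier
    return frontier

def materialize_path(table, props):
    """HDM-style pre-computation: store the path-query answer for every
    subject, trading storage for faster query processing later."""
    return {s: path_query(table, s, props) for s in table}

table = {
    'ID1': {'author:1': 'A1'},
    'A1':  {'name:1': 'Fox, Joe'},
}
print(path_query(table, 'ID1', ['author', 'name']))  # {'Fox, Joe'}
}}}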
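The parallel joins that HQP would push into MapReduce jobs (the reduce-side join pattern that contrib/data_join helps with) can be sketched in miniature. `map_phase` and `reduce_phase` are illustrative stand-ins for a real Hadoop mapper and reducer, not the contrib/data_join API: the mapper tags each record with its side of the join and emits it under the join key, and the reducer combines the two sides per key.

{{{#!python numbering=off
from collections import defaultdict

def map_phase(triples, prop_a, prop_b):
    """Emit (join key, tagged value) pairs, as a mapper would.
    Joins the object of prop_a triples to the subject of prop_b triples."""
    for s, p, o in triples:
        if p == prop_a:
            yield (o, ('a', s))   # join on the object side
        if p == prop_b:
            yield (s, ('b', o))   # join on the subject side

def reduce_phase(pairs):
    """Group by join key and cross the two sides, as a reducer would."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    for key, tagged_values in groups.items():
        left = [v for tag, v in tagged_values if tag == 'a']
        right = [v for tag, v in tagged_values if tag == 'b']
        for l in left:
            for r in right:
                yield (l, key, r)

pairs = map_phase(
    [('ID1', 'author', 'A1'), ('A1', 'name', 'Fox, Joe')],
    'author', 'name')
results = list(reduce_phase(pairs))
print(results)  # [('ID1', 'A1', 'Fox, Joe')]
}}}

In a real job the shuffle, not a local dict, would do the grouping, and each reduce group could run in parallel, which is the point of pushing the self-join into MapReduce.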