hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "Hbase/RDF" by udanax
Date Fri, 17 Aug 2007 01:43:21 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by udanax:

New page:
= Hbase RDF Storage Subsystems =

We Start to think about storing and querying RDF and RDF Schema in Hbase. 
[[BR]]but we'll do it at the last after prudence investigation and Hbase shell's HQL, Altools
POC review.

We propose a Hbase subsystem for RDF called HbaseRDF, which uses Hbase + MapReduce to store
RDF data and execute queries (e.g., SPARQL) on them.
We can store very sparse RDF data in a single table in Hbase, with as many columns as 
they need. For example, we might make a row for each RDF subject in a table and store all
the properties and their values as columns in the table. 
This reduces costly self-joins, which results in efficient processing of queries, although
we still need self-joins for RDF path queries.

We can further accelerate query performance by using MapReduce for 
parallel, distributed query processing. 

== Initial Contributors ==

 * [:udanax:Edward Yoon] [[MailTo(udanax AT SPAMFREE nhncorp DOT com)]] (Research and Development
center, NHN corp.)
 * [:InchulSong: Inchul Song] [[MailTo(icsong AT SPAMFREE gmail DOT com)]] (Database Lab Division
of Computer Science, KAIST) 

== Background ==

~-''Do explain : Why do we think about storing and retrieval RDF in Hbase? -- udanax''-~


== Rationale ==

 * What is RDF
 * Previous methods for storing RDF data and processing queries
  * Their weak points
 * The method in Hbase
  * Strong points

== Considerations ==
The Sawzall paper says that the record-at-a-time model is not good for table joins. I think
this problem occurs in typical join operations. Think about what kinds of join operations
Sawzall or MapReduce can perform efficienly in parallel, or possibly process them at all at
the first time. 

When we perform a nested loop join, for each tuple in the outer table, table R, we have to
go through the inner table, table S, to find 
all the joining tuples. To perform nested loop joins with MapReduce,
we divide table R into M partitions, and each 
map worker joins one partition of the table at a time with S. 
Each map worker produces the amount of join results corresponding 
to the partition it is in charge of.

However, in merge joins, since table R is already sorted by 
subject, each map worker only needs to read logN tuples in table S, where
N is the number of tuples in S, which results in fast join performance.

C-Store also says that join operations can be performed efficiently in DSM because
tables in DSM are sorted by subject, thus we can use merge joins on a sorted attribute. 

The key parts are how fast Sawzall or MapReduce performs merge joins and 
how cleverly we materialize join results for efficient query processing of RDF path queries.
Especially, the formal part is the key to beat 
C-Store query performance: defeat C-Store with massively parallelized query processing.

But the problem is that there is an initial delay in executing MapReduce jobs due to 
the time spent in assigning the computations to multiple machines. This 
might take far more time than necessary, thus hurt query response time. So, parallelism obtained
by using MapReduce is best enjoyable for queries over huge amount of RDF data, where
it takes much time to process them. 
We might consider a selective parallelism where 
people can decide whether to use MapReduce or not to process their queries, as in 
"select ... '''in parallel'''".

== HbaseRDF Data Loader ==
HbaseRDF Data Loader (HDL) reads RDF data from a file, and organizes the data 
into a Hbase table in such a way that efficient query processing is possible.
It reads a triple at a time and inserts the triple into a Hbase table as follows:

{{{#!python numbering=off
value_count = 0
for s, p, o in triples:
  insert into rdf_table ('p:value_count') values ('o')
    where row='s'
  value_count = value_count + 1

Examples with the data from C-Store.

{{{#!CSV ;  
Subj.; Prop.; Obj.
ID1; type; BookType
ID1; title; “XYZ”
ID1; author; “Fox, Joe”
ID1; copyright; “2001”
ID2; type; CDType
ID2; title; “ABC”
ID2; artist; “Orr, Tim”
ID2; copyright; “1985”
ID2; language; “French”
ID3; type; BookType
ID3; title; “MNO”
ID3; language; “English”
ID4; type; DVDType
ID4; title; “DEF”
ID5; type; CDType
ID5; title; “GHI”
ID5; copyright; “1995”
ID6; type; BookType
ID6; copyright; “2004”

== HbaseRDF Query Processor ==
HbaseRDF Query Processor (HQP) executes RDF queries on RDF data stored in a Hbase table. 
It translates RDF queries into API calls to Hbase, or MapReduce jobs, gathers and returns
the results
to the user. 

Query processing steps are as follows:

 * Parsing, in which a parse tree, representing the SPARQL query is constructed.
 * Query rewrite, in which the parse tree is converted to an initial query plan, which is,
in turn, transformed into an equivalent plan that is expected to require less time to execute.
We have to choose which algorithm to use for each operation in the selected plan. Among them
are MapReduce jobs for parallel algorithms.
 * Execute the plan
== HbaseRDF Data Materializer ==
HbaseRDF Data Materializer (HDM) pre-computes RDF path queries and stores the results
into a Hbase table. Later, HQP uses those materialized data for efficient processing of 
RDF path queries. 

== Hbase Shell Extention ==
=== Hbase Shell - RDF Shell ===
Hbase > rdf;

Hbase RDF version 0.1
Type 'help;' for help.

Hbase.RDF > SELECT ?title
          > FROM rdf_table
          > WHERE { ?book author ‘‘Fox, Joe’’
          >         ?book copyright ‘‘2001’’
          >         ?book title ?title }

results here.

Hbase.RDF > exit;

Hbase > 

=== Hbase SPARQL ===
 * Support for the full SPARQL syntax
 * Support for a syntax to load RDF data into an Hbase table

== Alternatives ==
 * A triples table stores RDF triples in a single table with three attributes, subject, property,
and object.

 * A property table. Put properties frequently queried togather into a single table to reduce
costly self-joins. Used in Jena and Oracle. 

 * A dicomposed storage model (DSM), one table for each property, sorted by the subject. Used
in C-Store.
  * ''Actually, the discomposed storage model is almost the same as the storage model in Hbase.''

== Food for thought ==
 * What are the differences between Hbase and C-Store.
 * Is DSM suitable for Hbase?

 * How to translate SPARQL queries into MapReduce functions, or Hbase APIs. 

== Hbase RDF Storage Subsystems Architecture ==

 * [:Hbase/RDF/Architecture] Hbase RDF Storage Subsystems Architecture.
 * [:Hbase/HbaseShell/HRQL] Hbase Shell RDF Query Language.

= Papers =

 * ~-OSDI 2004, MapReduce: Simplified Data Processing on Large Clusters" - proposes a very
simple, but powerfull, and highly parallelized data processing technique.-~
 * ~-CIDR 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadicidr07.pdf Column-Stores For
Wide and Sparse Data]'' - discusses the benefits of using C-Store to store RDF and XML data.-~
 * ~-VLDB 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadirdf.pdf Scalable Semantic Web
Data Management Using Vertical Partitoning]'' - proposes an efficient method to store RDF
data in table projections (i.e., columns) and executes queries on them.-~

View raw message