lucene-java-user mailing list archives

From "Gregor Heinrich" <Gregor.Heinr...@igd.fhg.de>
Subject RE: Lucene as a high-performance RDF database.
Date Mon, 11 Aug 2003 09:27:11 GMT
Hi Kevin,

Your idea could work for the higher megabyte range, I guess; I don't know
about several terabytes.

We have been considering using Lucene as an RDF backend for a semantic
search engine because of its reported excellent scalability, on the order
of tens of megs. The idea was similar to yours, but we thought of using
some index extension to introduce the class/property hierarchy (i.e., RDF
Schema) and make it searchable via cascaded index lookups.

We didn't have the time to test it, though, and I would be grateful if you
could comment.

Here are the fields. In a draft with three index parts, it looks something like this:

node (unique)
clss (class in schema)
prop (position-ordered)
prwt (a scalar value, weighting the relation or 1, position-ordered)
rsrc (resource, position-ordered)

and for the ontology itself:

clss
spcl (superclass, multi-inheritance)

and

prop (property)
sprp (super-property, multi-inheritance)
domn (domain)
rnge (range)
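
To make the cascaded lookup a bit more concrete, here is a minimal sketch. It is an assumption-laden illustration, not tested against Lucene: plain Java collections stand in for the index parts, no Lucene API is used, and the class name CascadedLookup, its methods, and any sample data are made up. The first lookup expands a class to all of its subclasses via the clss/spcl part; the second fetches the nodes for each class found.

```java
import java.util.*;

// Sketch only: in-memory maps stand in for two of the index parts.
// No Lucene API is used; all names here are hypothetical.
class CascadedLookup {
    // Ontology part: clss -> spcl (several superclasses allowed).
    final Map<String, Set<String>> superclasses = new HashMap<>();
    // Instance part: clss -> nodes declared with that class.
    final Map<String, Set<String>> nodesByClass = new HashMap<>();

    void addClass(String clss, String... spcl) {
        superclasses.computeIfAbsent(clss, k -> new TreeSet<>())
                    .addAll(Arrays.asList(spcl));
    }

    void addNode(String node, String clss) {
        nodesByClass.computeIfAbsent(clss, k -> new TreeSet<>()).add(node);
    }

    // First lookup: the class itself plus every class that lists it,
    // directly or transitively, as a superclass.
    Set<String> subclassesOf(String clss) {
        Set<String> result = new TreeSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.add(clss);
        while (!todo.isEmpty()) {
            String current = todo.poll();
            if (!result.add(current)) {
                continue; // already visited; also guards against cycles
            }
            for (Map.Entry<String, Set<String>> e : superclasses.entrySet()) {
                if (e.getValue().contains(current)) {
                    todo.add(e.getKey());
                }
            }
        }
        return result;
    }

    // Second lookup: fetch the nodes for each class found by the first one.
    Set<String> nodesOfClass(String clss) {
        Set<String> nodes = new TreeSet<>();
        for (String c : subclassesOf(clss)) {
            nodes.addAll(nodesByClass.getOrDefault(c, Collections.emptySet()));
        }
        return nodes;
    }
}
```

In a real index each map access would presumably become a term query against the corresponding index part, so the cost grows with the depth and breadth of the class hierarchy; whether that is acceptable is exactly what we have not tested.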

Best regards,

gregor



-----Original Message-----
From: Kevin A. Burton [mailto:burton@newsmonster.org]
Sent: Monday, August 11, 2003 12:33 AM
To: lucene-user@jakarta.apache.org
Subject: Lucene as a high-performance RDF database.


I have been giving some thought to using Lucene as an RDF database.
I'm specifically thinking about the RDF model and not the RDF syntax.

Essentially this would just comprise triples encoded in a document as
fields.

So, for example, we would have the subject, predicate, and object of each
relationship as document fields.  Subjects and predicates would be Tokens,
and the object field would be indexed.

For example a triple (document) would be:

    http://jakarta.apache.org -> title -> "A great Java developer's website"

This would be just one document in the index.

This would have a lot of advantages, most importantly the speed and
reliability of Lucene and the ability to run a full-text query on objects.
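
As a rough sketch of the one-triple-per-document layout, the following toy inverted index tokenizes only the object text and returns the subjects of matching triples. This is an illustration under stated assumptions, not Lucene code: it uses plain Java collections, and the class TripleIndex and its methods are hypothetical names.

```java
import java.util.*;

// Sketch only: a toy inverted index over triple objects.
// No Lucene API is used; all names here are hypothetical.
class TripleIndex {
    // One "document" per triple.
    static final class Triple {
        final String subject, predicate, object;
        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    private final List<Triple> docs = new ArrayList<>();
    // Term from the object text -> ids of the triples containing it.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    void add(String subject, String predicate, String object) {
        int docId = docs.size();
        docs.add(new Triple(subject, predicate, object));
        // Only the object is tokenized; subject and predicate stay verbatim.
        // The LinkedHashSet drops repeated terms within one object.
        Set<String> terms = new LinkedHashSet<>(
                Arrays.asList(object.toLowerCase().split("\\W+")));
        for (String term : terms) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(docId);
            }
        }
    }

    // Full-text query on objects; returns the subjects of matching triples.
    List<String> subjectsMatching(String term) {
        List<String> subjects = new ArrayList<>();
        List<Integer> hits =
                postings.getOrDefault(term.toLowerCase(), Collections.emptyList());
        for (int docId : hits) {
            subjects.add(docs.get(docId).subject);
        }
        return subjects;
    }
}
```

Every call to add creates a new tiny document and touches the postings of each term in the object, which is the update pattern that worries me at scale.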

For example, we could query on "Java" and get back
"http://jakarta.apache.org".

The major downside I could see is that this would mean that we would be
indexing a LOT of small documents with a LOT of index updates.

Can anyone see any problems here?  This database will eventually grow to
around 2TB in the next month so performance issues are non-trivial.

Most people have deployed Lucene with large document sizes, and the fact
that most people are citing document COUNT makes me nervous.

Kevin


