incubator-jena-dev mailing list archives

From Andy Seaborne <>
Subject Re: Lucene/Solr and Jena
Date Sun, 20 Feb 2011 21:20:29 GMT
I'd add to Paolo's remarks:

TDB is itself extensible - that wasn't the primary design criterion, but it
can be done.

For example:

is TDB running over Berkeley DB (Java or C) rather than TDB's own indexing.

The key design decisions you first face are:

1/ What technical use cases are you addressing?
    Scale or performance?  UI or batch query processing?

2/ What indexing and storage capabilities of Solr do you want to exploit?

then you can start to think about:

3/ How to store RDF terms (IRIs, bnodes, literals)

4/ What indexing structures do you plan to use?
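
To make points 3/ and 4/ concrete, here is a minimal sketch of one common layout: each RDF term is mapped to a numeric id in a dictionary, and the triples are kept in several sorted indexes with the components permuted. The class and method names are hypothetical and the in-memory collections are only stand-ins for real storage; this is not TDB's code, nor a claim about how a Solr-backed store would have to work.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: dictionary-encoded RDF terms plus three sorted triple indexes. */
class TripleIndexSketch {
    // Term dictionary: every term (IRI, bnode label or literal, in string form) gets one id.
    private final Map<String, Long> termToId = new HashMap<>();
    private final List<String> idToTerm = new ArrayList<>();

    // Lexicographic order over 3-element id tuples, so prefix lookups become range scans.
    private static final Comparator<long[]> LEX = (a, b) -> {
        for (int i = 0; i < 3; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return 0;
    };

    // The same triples stored three times, each with a different leading component.
    private final TreeSet<long[]> spo = new TreeSet<>(LEX);
    private final TreeSet<long[]> pos = new TreeSet<>(LEX);
    private final TreeSet<long[]> osp = new TreeSet<>(LEX);

    /** Returns the id for a term, allocating a new one on first sight. */
    long encode(String term) {
        Long id = termToId.get(term);
        if (id == null) {
            id = (long) idToTerm.size();
            termToId.put(term, id);
            idToTerm.add(term);
        }
        return id;
    }

    String decode(long id) {
        return idToTerm.get((int) id);
    }

    void add(String s, String p, String o) {
        long si = encode(s), pi = encode(p), oi = encode(o);
        spo.add(new long[] {si, pi, oi});
        pos.add(new long[] {pi, oi, si});
        osp.add(new long[] {oi, si, pi});
    }

    /** Answers the pattern (?s, p, o) with a range scan over the POS index. */
    List<String> subjects(String p, String o) {
        List<String> out = new ArrayList<>();
        Long pi = termToId.get(p), oi = termToId.get(o);
        if (pi == null || oi == null) return out; // unknown term: no matches
        long[] lo = {pi, oi, Long.MIN_VALUE};
        long[] hi = {pi, oi, Long.MAX_VALUE};
        for (long[] row : pos.subSet(lo, true, hi, true)) {
            out.add(decode(row[2]));
        }
        return out;
    }
}
```

Whatever backend is chosen, the same two questions come up: how a term is turned into something cheap to compare, and which orderings are needed so that every triple pattern has an index to scan.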


On 18/02/11 15:45, Paolo Castagna wrote:
> Hi Frank,
> nice to meet you and I am glad you wrote your reply to the jena-dev
> mailing list.
> My comments are inline.
> Frank Tanz wrote:
>  > Hi Paolo,
>  >
>  > Our team has composed more details about the intentions surrounding
>  > this initiative. As requested, I am submitting this through the
>  > Jena-dev List.
>  >
>  > Regards,
>  > Frank
>  >
>  >
>  > 2/17/2011
>  >
>  >
>  >
>  > Dear Paolo:
>  >
>  >
>  > Thank you for your thoughtful response to our proposed project. First
>  > and foremost we want to acknowledge the complexity of this endeavor.
>  > As graduate students, we are very enthusiastic about the opportunity
>  > to participate in an actual Open Source project.
> As I have already written, you should really read:
> I'll continue to point you at it every time I suspect there
> is a misunderstanding on that. :-) If my suspicion is wrong, all the better.
> I apologize.
>  > Keeping with the Open Source spirit and philosophy, we really
>  > appreciate your guidance pointing us to SIREN, LARQ, SARQ, and EARQ.
>  > We have done some high level research into these four projects and
>  > have determined that while they present some interesting similarities
>  > into some of the underlying design aspects for parts of what our
>  > interface must do, none of these endeavors are striving to meet our
>  > vision. We thought SIREN was very interesting as it provides plugins
>  > to Lucene which provide the capability for Lucene to index and query
>  > RDF/XML graphs natively. However, this solution requires additional
>  > software parts in addition to Lucene and currently works outside of
>  > Jena with an existing persisted RDF/XML graph. LARQ and SARQ were also
>  > very attention-grabbing to research as the ability exists to build
>  > Lucene indexes within Jena as well as the ability to query the Lucene
>  > index from ARQ. Unfortunately the RDF/XML model must exist separate
>  > from the Lucene index. Lastly, we reviewed EARQ. This project seems to
>  > provide a layer of abstraction that would allow the developer to plug
>  > and play an available index mechanism such as Lucene or Solr.
>  > Unfortunately, this project would be feasible to build upon only if
>  > and when our interface could be achieved. While these four Open Source
>  > projects do not provide “plug and play” capabilities for our immediate
>  > purpose, they do provide some really good technical guidance for
>  > design discussions and decisions that we must make in the weeks to
>  > come.
> There isn't a fundamental difference between LARQ, SARQ or EARQ. They
> all provide similar functionality; they are just proofs of concept
> of a possible evolutionary path for LARQ within the Jena project.
> From your comment, you are confirming my impression that you want to
> actually store RDF data into Lucene and implement a Jena graph over
> it. This is not what LARQ (or SARQ or EARQ) does. They assume you use
> Lucene only for free text searches, therefore they index only literals.
> SIREn is probably closer to what you want to do. What are the additional
> software parts you refer to?
> What makes you think that storing RDF data into Lucene will give you
> better performance than a native RDF store such as, say, TDB?
> How are you planning to evaluate the performance of your solution?
> May I suggest BSBM and TDB as your baseline?
>  > We have included our Business Case for moving forward with this
>  > project. In addition to the Business Case, please understand that this
>  > project is more than just an academic exercise. Our academic advisor,
>  > Scott Streit serves as a CTO for a commercial corporation and has many
>  > clients utilizing Semantic Web applications. These clients include
>  > NITRD (The Networking and Information Technology Research and
>  > Development) and the U.S. Military. Upon successful completion of this
>  > interface, next steps would include transitioning the storage
>  > mechanisms that these clients currently use for RDF/XML graphs to
>  > Lucene/Solr structures.
>  > Please review and approve this initiative so we can begin our design
>  > activities.
> I can read your message and share my opinions or give technical suggestions
> on projects I have used, but it's not my role to review business cases or
> approve initiatives of people who want to do something interesting
> with Jena.
> Once again, I point you at:
>  >
>  > Sincerely,
>  >
>  > The SolrStore Project Team:
>  >
>  > Frank Tanz, Bharti Gupta, Bala Krishna Chitneni, Nimesh Shah
>  >
>  > SolrStore Business Case:
>  > It is our vision to add to Jena the capability to persist RDF/XML
>  > graphs by creating the data store directly within a Lucene inverted
>  > index structure. Simply said, our approach is to do this without the
>  > need for additional software parts and without using an additional
>  > RDBMS.
> TDB is a native RDF storage system for Jena and it does not use an RDBMS.
> What makes you think a solution that stores RDF data directly in Lucene
> will be faster/better?
> Don't get me wrong, I am not sure it will or it won't and I am myself
> curious about it. But, I have doubts. If I were you, I would try to
> quickly prove it's possible to achieve better performance with a small
> prototype.
>  > While Jena’s existing ability to persist RDF/XML graphs to an RDBMS
>  > is a convenient storage choice, we argue that an RDBMS is not really
>  > appropriate for the Semantic Web, as transaction processing and
>  > normalized schemas are not part of the dynamic nature of the Semantic
>  > Web domain. Building upon this argument, the dynamic nature of the
>  > Semantic Web is better suited to use versioning in lieu of heavy duty
>  > transaction processing. It is our intent to exploit and leverage this
>  > inherent capability within Lucene and ultimately present it to Jena
>  > developers in an abstract way within the Jena API. Additionally, we
>  > believe that the Lucene/Solr indexing engine is underutilized in that
>  > it serves primarily as an index with pointers back to the original
>  > data source. We intend to not only use Lucene/Solr as an indexing
>  > engine, but also as the repository for the data source.
>  > The prime directive for our project is to provide layers of
>  > abstraction between the Jena API and the Lucene/Solr APIs. This
>  > commitment is extremely important to us as the complexities of our
>  > interface should not overburden a Jena developer who might have
>  > limited experience with the components within Lucene and Solr. We
>  > acknowledge that the use of an RDBMS to persist RDF/XML graphs within
>  > the Jena API was an innovative design choice for the timeframe of its
>  > creation. Our team’s objective is to evolve that innovation by
>  > building upon it with new technologies that are now available and
>  > accessible.
> If I understand correctly, your motivation/rationale for wanting to
> store RDF in a Lucene index is a sort of discontent with solutions
> which use an RDBMS.
> However, you have not mentioned or looked at the native RDF storage system
> which comes with Jena.
> """
> There are two subsystems for persisting RDF and OWL data, SDB or TDB.
> These are separate downloads.
> TDB is a high-performance, native persistence engine using custom
> indexing and storage. SDB is a persistence layer that uses an SQL
> database and supports full ACID transactions.
> TDB is faster and simpler to setup.
> * TDB documentation:
> * SDB documentation:
> The original RDB system is still shipped with Jena for legacy
> applications. It is deprecated for new development.
> """
> --
> So, please, have a look at the TDB documentation, try to install and use
> it:
> -
>  > While this initiative is not a trivial task, we believe that the
>  > objective is important and if successful can benefit the Jena
>  > community.
> It's true that what you want to do is not trivial.
> However, IMHO, you should be able to prove with a quick prototype of a
> Jena Graph SPI that an RDF storage solution over Lucene indexes is faster
> than what's already there (in particular TDB). Then you will probably get
> some more attention.
> I don't think we have specific documentation to guide you on how to
> put Jena over a different store/indexing system (in this case Lucene)
> implementing the Graph SPI. Have we?
> I can point you at these, though:
> -
> ... see GraphTDB and GraphTDBBase
> -
> ... this copied the TDB approach, but I'd like to see how things could
> work over HBase (it's not finished/working yet).
> -
> HTH,
> Paolo
>  >
>  > ---------- Forwarded message ----------
>  > From: Paolo Castagna
> <<>>
>  > Date: Wed, Feb 9, 2011 at 10:07 AM
>  > Subject: Re: Fwd: Lucene/Solr and Jena
>  > To:<>
>  > Cc: Scott Streit <<>>
>  >
>  >
>  > Hi Scott (hi all),
>  > first of all, thank you for your email and nice to "meet" you. Even if
>  > only via email, and even if we have never had the chance to interact
>  > before. (We clearly have common contacts though!).
>  >
>  > We (@Talis) use Lucene as well as Solr (as well as something else in the
>  > future) to provide our free text search capabilities. However we do not
>  > actually store RDF into Lucene indexes. For that, we use a "proper" RDF
>  > store with SPARQL support which otherwise you will need to implement on
>  > top of Lucene (and it's not a trivial task).
>  >
>  > I am very interested in the topic of free text search in the context
>  > of RDF and how free text searches can be 'integrated' with SPARQL.
>  >
>  > I'd like to know more about your project plans and, indeed, your
>  > motivations.
>  >
>  > I am not completely sure if your attachment made it to the jena-dev
>  > mailing list. I have received the attachment anyway, since you added
>  > my work related email (which I tend to try to protect from evil
>  > spammers) to the To: field.
>  > I am subscribed to the mailing list, so we can discuss here.
>  >
>  > Coming back to the idea of "placing Lucene and Solr into Jena as
>  > persistent store", can I suggest you take a look at SIREn [1]? There
>  > is a good chapter (a case study) on the "Lucene in Action, Second
>  > Edition" book [2]. I really recommend the book, it's a good one.
>  > SIREn's aim is to use Lucene indexes to provide a complete storage
>  > system for RDF, however I cannot possibly comment on the support for
>  > RDF store APIs or their level of compliance in relation to SPARQL
>  > queries, for example.
>  >
>  > A different approach is the one taken by LARQ [3] (and/or similar):
>  >
>  > "LARQ is a combination of ARQ and Lucene. It gives ARQ the ability to
>  > perform free text searches. Lucene indexes are additional information
>  > for accessing the RDF graph, not storage for the graph itself."
>  > --
>  >
>  > LARQ is, at the moment, included in ARQ, but we have an open JIRA issue
>  > (i.e. JENA-9 [4]) to separate it out as a separate module depending
>  > on ARQ.
>  > A development version of LARQ as a separate module, ready to be tested,
>  > is available here:
>  > If you, or some of your students have time to try it, let me know if you
>  > have problems with it.
>  >
>  > As an experiment, I did a similar thing with Solr, it's called SARQ
>  > and it's available here:
>  > Labeled "experimental (and unsupported)" since I did it out-of-band as
>  > a proof of concept, but, because the design and functionalities are the
>  > same as LARQ, it should not require a lot of effort to make it ready for
>  > production. If others think this might be useful.
>  >
>  > While I was writing SARQ, I thought: "wouldn't it be nice to make it
>  > extremely easy for developers to plug in different indexing systems
>  > such as Lucene, Solr or Elastic Search?". So, I gave it a go with EARQ.
>  > It's available here:
>  > Again, it's labeled "experimental (and unsupported)", but if needed
>  > and people are interested in it, it might require only little
>  > improvements.
>  >
>  > One of the biggest problems I had in relation to LARQ, SARQ and EARQ is
>  > how to manage "deletes/removals". I've used a Jena Model as source for
>  > a poor man's reference counting to decide when to remove a document
>  > from the Lucene index. The source code should be clear on this.
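
The reference-counting idea in the paragraph above can be sketched roughly as follows. This is an illustration only, not the LARQ/SARQ/EARQ source: the class and method names are hypothetical, an explicit map stands in for the counts that the message says were derived from a Jena Model, and a plain set stands in for the Lucene index.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: keep an index entry only while some triple still uses the literal. */
class RefCountedIndexSketch {
    // How many triples currently carry each literal.
    private final Map<String, Integer> refCounts = new HashMap<>();
    // Stand-in for the text index; a real system would add/delete index documents here.
    private final Set<String> indexed = new HashSet<>();

    /** Called when a triple with this literal is added to the graph. */
    void tripleAdded(String literal) {
        int n = refCounts.merge(literal, 1, Integer::sum);
        if (n == 1) {
            indexed.add(literal); // first reference: index the literal
        }
    }

    /** Called when a triple with this literal is removed from the graph. */
    void tripleRemoved(String literal) {
        Integer n = refCounts.get(literal);
        if (n == null) {
            return; // literal was never counted; nothing to do
        }
        if (n == 1) {
            refCounts.remove(literal);
            indexed.remove(literal); // last reference gone: drop the index entry
        } else {
            refCounts.put(literal, n - 1);
        }
    }

    boolean isIndexed(String literal) {
        return indexed.contains(literal);
    }
}
```

The point of the counter is that the same literal can appear in many triples, so removing one triple must not remove the index entry while others still refer to it.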
>  >
>  > Last but not least, in relation to part of the content of your
>  > attachment,
>  > Jena is still in its incubating phase at Apache, but things work almost
>  > the same as for the Apache Software Foundation. Please, have a look at
>  > "How the ASF works" [5].
>  >
>  > Let's keep the discussion flowing and invite your students to interact
>  > with us on the jena-dev.
>  >
>  > Let me know your motivations for wanting to store RDF in a Lucene/Solr
>  > index.
>  >
>  > Regarding the "cloud" references in your project proposal, we should
>  > probably discuss it on a separate thread/message, always, on jena-dev.
>  >
>  > Paolo
>  >
>  >
>  > [1]
>  > [2]
>  > [3]
>  > [4]
>  > [5]
>  >
>  > Damian Steer wrote:
>  > (I didn't get a moderation message about this, but Paolo was Ccd and
>  > forwarded to me. Is moderation working for anyone?)
>  >
>  > Begin forwarded message:
>  >
>  > ---------- Forwarded message ----------
>  > From: Scott Streit <<>>
>  > Date: Wed, Feb 9, 2011 at 12:44 PM
>  > Subject: Lucene/Solr and Jena
>  > To:
>  >
>  >
>  > Jena-dev,
>  >
>  > A group of my students at Villanova would like their Master's Degree
>  > project to include placing lucene and solr into Jena as a persistent
>  > store. We are adding two more students.
>  >
>  > Attached is an overall project plan. Upon your approval, the next
>  > step is a design document.
>  >
>  > Scott Streit
>  >
>  >
