incubator-clerezza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reto Bachmann-Gmuer <reto.bachm...@trialox.org>
Subject Re: [VOTE] Accept the proposed patch of CLEREZZA-540
Date Tue, 31 May 2011 23:52:46 GMT
On Sun, May 29, 2011 at 7:06 PM, Henry Story <henry.story@bblfish.net>wrote:

>
> On 26 May 2011, at 20:31, Reto Bachmann-Gmuer wrote:
>
> > With CLEREZZA-540 I suggest a GraphNodeProvider-Service that returns a
> > GraphNode given a named resource. Mainly this code that used to be in
> > DiscoBitsTypeHandler which has been generalized.
> >
> > The issue is described as:
> > "Implement a platform service that returns GraphNodes for URIs. The
> > GraphNode is the resource identified by that uri with as BaseGraph
> sources
> > considered authoritative for that resource. "
> >
> > Of course "considered authoritative" it not a very sharp description. The
> > issue is labeled with "platform" which implies it is not a generic
> utility
> > of clerezza.rdf but that it relies on platform default graphs.
> >
> > The solution proposed in commit
> > #1125477<http://svn.apache.org/viewvc?view=rev&rev=1125477>and
> > #1125652 <http://svn.apache.org/viewvc?view=rev&rev=1125652> sets the
> > basegarph as follows:
> > - always trust the content graph
> > - for remote resource trust the graph you get by dereferencing the uri
> > - for resources in the user-uri space trust that user
>
> Of course what one thinks of this patch depends completely on how this
> provider
> gets used. It has quite a lot of limitations it seems to me as implemented
> currently - which is of course not a final failing in a developing project,
> but it does seem like a good discussion would help us narrow perhaps on
> better
> solutions or on fixes, which is part of the reason I thought the ISSUE
> should not
> yet be closed.
>
Keeping the issue open blocks development, an issue is associated to a patch
as long as an issue is open other code should not depend on code committed
under this issue. This is way an issue should be closed shortly after the
patch has been committed, and conversely why code shouldn't be kept in trunk
it the issue cannot be closed. An open issue is more than a discussion
reminder.



>
> The failings I mentioned in the ISSUE-540, and develop below are:
>
> 1. it to relies on more and more Services that each require TcProviders.
> This
>  feels very ad-hoc. The ones I mentioned are
>
>    - UserManager: requires a TcManager
>    - WebIdGraphsService: also requires a TcManager
>    - PlatformConfig: requires TcManager
>    - ContentGraphProvider: requires TcManager
>    - TcManager
>
>   Each of these is used, and makes a call to the database. And the
> TcManager itself each time
> iterates through a number of TcProviders.
>
Yes, TcManager is the main entry point to the rdf data. I don't see any code
smell here. I a classical RDBMS java application a component will typically
use jdbc and rely on other components that use jdbc, here it's TcManager
instead of jdbc,



>
> 2. when asking for an external URI, you get the whole content graph too
>
You get a GraphNode


>   On my fresh install of ZZ that is 20 times more information than the
> initial graph.
>
- What is 20 times more?
- What do you mean by "get"?

A graphnode point to a resource and is designed for browsing from resource
to resource. It is not a graph but a node in a graph. The object is
associated to a base-graph which used to identify the propertied and
instanctiate another graphnode when hoping to a property value. The
underlying graph could for instance be Timbl's GGG (giant gloabl graph aka
the web).


> How big is that going to become as one's content graph grows over time?

The underlying graph would be the size of the virtual content graph + the
size of the remote graph - duplicate triples


> Is this not going to create a huge bottleneck very quickly?

No, the time complexity for accessing a property of graphnode grows linearly
to the number of graphs in the graph-union (which in this case is 2) but is
only O(log n) for the number of triples  in the indexed triple store.

I thought I had heard people mention issues with speed on this list.
>
You may check archives of this list at:
http://mail-archives.apache.org/mod_mbox/incubator-clerezza-dev/

I mentioned performance concern directly to you when I saw that your
EasyGraph class copies to memory all triples of a TripleCollection rather
than accessing them in the original graph. So yes, the performance
implications of our design choices should always be in our minds.


>
>  So to verify this do the following:
>
> zz> import org.apache.clerezza.platform.graphnodeprovider._
> zz> val gnp = $[GraphNodeProvider]
> zz> val tbl = gnp.get(new UriRef("
> http://www.w3.org/People/Berners-Lee/card#i"))
> zz> tbl.getGraph.size
> res0: Int = 1878
>
> getGraph returns "the graph the node represented by this instance is in",
as I mentioned before this could be GGG. On a GraphNode you should usually
not invoke getGraph, the object exists to hop from resource to resource.


> If I get Tim Berners Lee's Graph on the command line I find
>
> $ rapper http://www.w3.org/People/Berners-Lee/card | wc
> rapper: Parsing URI http://www.w3.org/People/Berners-Lee/card with parser
> rdfxml
> rapper: Serializing with serializer ntriples
> rapper: Parsing returned 78 triples
>      78     380    9978
>
> So here we have 78 triples, but the resulting answer schlepped around is 20
> times bigger - on a new installation!
>
If by schlepped around you mean that they are copied in memory or something
like that this is not the case. It's a reference, that's all.



>
> I wondered what was going on, because to my surprise on a new installation
> the Content Graph contains only 1 triple.
> So I looked into a running instance of ContentGraphProvider and found that
> the additions array contained the following graphs in addition to the
> content graph:
>
>  - <urn:x-localinstance:/documentation.graph>   1002 triples
>  - <urn:x-localinstance:/config.graph>           176 triples
>  - <urn:x-localinstance:/web-resources.graph>    621 triples
>  - <urn:x-localinstance:/enrichment.graph>         0 triples
>
> So that does then indeed add up to the number.
>
Great, you found the answer :)


>
> What I am wondering is in what cases is this needed? It seems like this may
> indeed what a particular application may require, but does it have to be
> a general service? The name certainly suggests a very general service, not
> one required for a particular application.
>
This is about ContentGraphProvider then, not about issue 540. It's the
ContentGraphProvider which provides the graph of instance-wide and public
information for the platform



>
> Perhaps changing the name from GraphNodeProvider to
> ContentGraphPlusOtherProvider
> would make more sense.
>
It's a platform service that provides GraphNodes. Being a platform service
implies it usesthe platform means of getting trusted content. If it would
just dereference URIs the it would probably be placed in a subpackage of
clerezza.rdf.



>
> > This might not match an intuitive understanding of "authoritative" and
> I'm
> > happy to redefine the issue so that no confusion arises.
>
> One thing I am not quite clear about yet, is who writes to the content
> graph?
> I see a lot of modules use it.
>
Modules can write to the content graph or add temporary additions to it.
Actually writing to the content graph should happen when public and trusted
information is added. An information is considered trusted when added by a
user with respective permission or verified by privileged code (e.g. that
allows the public to add see-also references).


>
> >
> > What I do strongly believe is that the proposed patch offers a major and
> > very useful new functionality. Especially as it allows the following
> > features to be implemented:
> > - Thanks to CLEREZZA-544 one can call the render-method to delegate the
> > rendering of resources with a UriRef instead of a resource,
>
> I think you mean a "UriRef instead of a Graph".
>
"UriRef instead of a GraphNode" (GraphNode is rougly what is called
"Resource" in the jena api)

>
> Yes, that makes sense. But why does the GraphNodeProvider have to cast
> such a wide net to catch so many triples? It seems to me that if one
> is to use a URI then it would be better that the URI refer precisely to
> that named graph (or to a node it it). One could use other tools to create
> virtual graphs, like Simon Schenk's Networked Graphs I mentioned
>
> http://blogs.oracle.com/bblfish/entry/opening_sesame_with_networked_graphs
>
> These allow one to have virtual graphs depending on a SPARQL query pattern.
> There it would be easy for different services to specify different ones.
> And I think something like that would be really good to have.
>

The possible GGG base graph could be implemented exactly as you describe,
but using sparql would probably be quite an inefficient approach.



>
> > in this case the
> > resource is rendered using its own baseGraph rather than the one of the
> > calling template. An example usecase for this is rendering the author of
> a
> > comment, the whole profile of the (possibly remote) commenter isn't and
> > shall not be part of the baseGraph of the GraphNode returned by the
> jax-rs
> > resource method, yet for rendering the comment-author infobox it might be
> > beneficial to render a GarphNode with a baseGraph containing all of the
> > information in the users profile-document
>
> But why also all the information from the documentation and the config
> graphs?
> It may be useful in some very limited cases, but it may mostly not be. It
> seems that
> some use cases would be useful to help describe this in more detail.
>

The config graph will seldomly be in the reachability graph of a resource
but if it probably make sense of having it there.



>
> > - With CLEREZZA-541 the GraphNodeService is accessed from TypeHandler, I
> > posted a resolution to this issue because it was already quite there on
> my
> > local machine when Herny reopened CLEREZZA-540, to respect the reopening
> of
> > the issue I didn't mark the dependent issue as resolved. I will of course
> > revert the changes if requested to do so by a qualifying -1.
> >
> > I'm not arguing that my patches solve all issues one might have around
> > getting resource descriptions but I do think it is very valuable and to
> > allow to base other stuff on this service I would like the issue to be
> > closed. As Henry reopened the issue twice and I don't want to close the
> > issue again without a broader discussion. Yet as many thing depend on the
> > issue leaving it open doesn't seem an option to me.
>
> What depends on it is something you are wanting to do in your projects it
> seems
> to me, and that is not that clearly laid out.

Also because of the negative expirience with the monster issues recently I'm
trying to be very explicit on dependent issues. A motivator for this issues
was also your code around browsing remote foaf-profiles.


> Because it does not seem obvious to
> me why a service should make the decisions this one does about what is
> authoritative.
>
I mentioned earlier that other trust settings might be added with other
issues. You seem to agree that it is useful to assume the graph resulting
from dereferencing the resource as authoritative. The content graph is
authoritative by platform conventions. There is no specific decision of the
service here



> >
> > Future enhancement might include:
>
> > - manually force refresh of caches for graphs related to a requested
> > resource
>
> Yes, indeed. But why here, when it is not in the WebProxy? You would think
> cache
> update functionality should go in the WebProxy right?
>
Because when you access resources you don't (usually) care about the
underlying graphs. You can already forces cache-refreshes when using the
WebProxy.



>
>
> > - force an alternative set of baseGraphs to be used (e.g. Only local or
> only
> > remote sources)
>
> What I am wondering is why all this is done like this? If I go over the
> changes of the
> past few weeks this is what I see:
>
> So if we go over the history of refactorings that led us here.
>
> 1. You did not like the initial WebProxy you I wrote by refactoring your
> WebIdGraphsService.
>   Neither did I in fact - but it did work  at least and added minimum
> change - being new to ZZ
>  I did not want to play around too much in the internals.
>
well your code massively expanded the rdf api. Compare this with the length
of discussion which is just about one method and one interface in the higher
level platform api.



> 2. You moved the old WebProxy to what seemed like a nicer interface: the
> TcProvider interface. And
>   indeed that does look a lot better. BUT but this interface is really
> meant for direct, no interpretation
>   access to the database and so lack
>   - key notions of caching (well I suppose they could make sense even for
> other sesame or jena graphs?)
>
No I don't think this would make sense for jena or sesame graphs. I think
that most clients don't need to force custom caching/update policies, but if
they do they can access the WebProxy services which offers methods not
available via TcManager


>   - does not provide a method for returning the final name of the graph
> (for redireted resources, or foaf:knows),
>     when the WebProxy gets called
>      (since this the TcProvider assumes you give it exactly the correct
> name of the graph)
>   => So really it is quite uncomfortable there somehow.
>
This is something you could open an issue about, something like "Graph
aliasing in TcManager/TcProvider"




> 3. This led you then to move to this GraphNodeProvider in order get a graph
> from a URI -



> which is very similar
>   to the TcProvider in many ways, right?

No, its something fundamentally different

It even uses the code of the original WebProxy to do a HEAD
>   on a remote resource to find the graph name  (and which one would assume
> would be part of the WebProxy
>   code since it  will be making the real HTTP Connection, and so can follow
> the changes of the graph names.)
>
So what? (do you want to hear that your name is in the class comment or that
this particular code comes from the WebId-service which you extracted when
introducing your WebProxy)



>   But because the TcManager interface is really a database layer interface,
> that cannot be placed there, and
>   so is now placed into something outside - this class you have now
> written.
>
I don't understand. I think the possibilities the new storage.web service
offers are quite cool. For example that you can do sparql-queries on remote
graphs using the clerezza endpoint (if you have appropriate privileges). You
can browse resource in remote graphs as described here:
http://mail-archives.apache.org/mod_mbox/incubator-clerezza-dev/201105.mbox/%3CBANLkTinFnYRFNkje0HmdH--VsBidLgpr2g@mail.gmail.com%3E


> 4. But instead of just having a GraphNodeProvider that just returns the
> graph, you have added some twists to
>   it and return more than jut the named graph. There is nothing to say that
> a named graph cannot be the union
>   of many other graphs, but it seems really arbitrary for me to get the
> documentation of clerezza along with the
>   triples of Tim Berners Lee's graph.
>

>   Somehow things have gone a bit haywire at the end here.

If you call getGraph on a GraphNode you're leaving the scope of the
GraphNode. Probably all this discussion would not be necessary if had been
using getNodeContext instead of getGraph. The NodeContext is what related to
the node. Using getGraph is a bit like doing the following:

File file = SomeService.getFileDescribing("Tim Berners Lee")
file.getParent().getParent().getParent().listChildrenRecursively()

The listed files can contain thigs that are completely unrelated to Tim
Berners Lee



> And I think this is due to a bit of confusion of the needs
>   of your application with trying to keep the general architecture clean.
>
As I said, I did not made this particularly for an application, my wall
application is merely a demo. When we want to do something like a
foaf-browser we want to be able to display the resource in their context,
just a usecase.


>
>   Now on the whole I have learnt a lot about Clerezza by following this,
> but I just can't say that this looks like
>  a good long term solution.  We are constantly moving around and around
> something.
>
This is your impression. I hope my explanations to the concrete points you
mention could help changing this impression.



>
>   Would any of the following work?
>
>   - TcProvider extended to specify caching options?
>
Unrelated to the issue. But no, I think only storage.web needs those options


>   - Graph to be extended so that it can contain its name (so that one can
> ask for a resource in a TcProvider,
>    and find out what its name really was by inspecting the resulting graph)
>
Again, his is unrelated to the issue at hand.

Having triple collection-aliases with primary names accessible via TcManager
would be fine, but I would not agree at having them as part of the
TripleCollection


>    -> if not, should WebProxy really be a TcProvider?
>
Since there is no way of knowing ahead of time what the name
>       of a graph for a resource is, given that redirects can occur at any
> time.
>
>    The WebProxy as TcProvider mostly makes sense otherwise, so it does feel
> like the above two things would help.
>
> >
> > So I'm asking you to kindly review the proposed code and vote about
> closing
> > CLEREZZA-540
> >
> > [ ] +1, I agree with accepting the proposed code into trunk
> > [ ] 0, I don't care
> > [ ] -1, I don't want this code in trunk (must specify a technical
> > explanation, please also specify what would have to be changed for the
> patch
> > to be acceptable to you.
>
> -1 for the moment on closing the issue. (not on removing the code)
>   Please answer the above points carefully.
>
-1 are against code, keeping the code in trunk if you can't accept makes
little sense to me.


Okay, I see two reasons which could qualify as technical reasons:
- The service returns huge amount of triples: this is just wrong as it
returns graphnodes
- The class should be named ContentGraphPlusOtherProvider instead of
GraphNodeProvider: As It doesn't provide a Graph but a GraphNode your name
seems wrong rather than just imprecise. A precise name might be
GraphNodeBasedOnContentGraphPlusOtherProvider.

Would the rename be okay for you to accept the proposed path? (I really
would like to go back to productive work, so I rather have a horrible name
than seeing the project stalled by your veto).

Reto

PS: You seem to be extensively using you're right to veto while ignoring
other's veto on your code, looking at
https://issues.apache.org/jira/browse/CLEREZZA-515 I see that the commits
have not been reverted even more than one week after my veto and request to
revert.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message