gump-general mailing list archives
From Stefano Mazzocchi <stef...@apache.org>
Subject Re: RDF 102 s.v.p...
Date Wed, 01 Sep 2004 20:24:46 GMT
Adam R. B. Jack wrote:

>>>1.1) Ought we define the URI for a project (or other entity) to point to
>>>the standalone RDF for that entity? I'm sure there is no need to, but it
>>>might allow tools to discover upon demand.
>>
>>This would be a URL and my suggestion would be something like
>>
>>http://gump.apache.org/data/path/project/20040827
> 
> 
> Hmm. I wonder if we ought to have something like a 'timeless' URI of:
> 
>     http://apache.org/project/${project}
> 
> ... relying upon the organization to manage its project names, and them
> (most likely) not being re-used over time.

yes, we could do that, but you don't gain much. Those dates need not be 
precise; just the year of the project's creation would suffice.

keep in mind that this is a URI, not a URL, referring to a model. It is 
the identifier of the project; it could well be "urn:apache.org:23" for 
all we know, and by design it does not contain anything.

Several people in the semweb community (Dirk included), in fact, 
promote the use of URNs instead of http-URIs because they allow more 
transparent persistence... but that's a long debate and it's not that 
useful here.

> and then:
> 
>   http://gump.apache.org/data/path/project/${project}/20040827
> 
> to refer to the 'make-up' of that project on that day? We'd have a triple to
> assert that this URI relates to the top (fixed) one, and carries information
> for it.

Well, that's how I would have done it anyway: gump information is 
transitory and should not be in the same model as the project 
information, which is much less so.

I see three layers:

  1) project own metadata (changes very slowly)
  2) project dependencies data (changes now and then)
  3) project gump-originated metadata (changes potentially at every gump 
run)

the three things should be grouped in three different models, then 
aggregated when needed. All of them, IMO, should have URIs that are 
either numeric or date-based.
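
The three layers could serialize to three separate models along these 
lines (the gump namespace and all URIs below are illustrative, not an 
agreed-upon scheme):

```turtle
@prefix gump: <http://gump.apache.org/ns#> .   # hypothetical namespace

# 1) project metadata, year-based URI (changes very slowly)
<http://gump.apache.org/data/cocoon/1999>
    gump:name "Cocoon" .

# 2) dependency data, dated per change (changes now and then)
<http://gump.apache.org/data/cocoon/deps/20040827>
    gump:dependsOn <http://gump.apache.org/data/avalon/1999> .

# 3) gump-originated metadata, one model per run
<http://gump.apache.org/data/cocoon/run/20040827>
    gump:buildStatus "success" .
```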

> I don't think there can be a magic bullet for solving changes over time, but
> this seems like one approach that might (at least) hint at time sensitivity.
> 
> I would really like to see version information introduced (what version of
> the project is it [i.e. what is HEAD to become when released], and perhaps
> what version of metadata is there). Change detection is something I think is
> of interest here (i.e. when was dependency X added), so somehow I'd like to be
> able to determine that from this information. Hmm, I wonder if changes are
> really part of the information we wish to be publishing, e.g. versionX
> addedDependency Y.
> 
> BTW: what is the purpose/value of data/path in the URI above?

path was supposed to be the TLP in case you have subprojects (like in 
the jakarta stuff), even if it's very unlikely that the ASF will allow 
projects to have the same name and be hosted in different TLPs, so we 
could get rid of that.

data was supposed to make it easier to use mod_rewrite for that URL 
subspace; it could well be "ns", but this is not really a namespace.
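
With mod_rewrite, a single rule can route the whole /data/ subspace to 
one handler; a sketch along these lines (the script name and query 
parameter are assumptions, not an existing Gump setup):

```apache
# route everything under /data/ to a script that serves the
# right model for the requested project and date
RewriteEngine On
RewriteRule ^/data/(.*)$ /cgi-bin/rdf-serve.py?path=$1 [PT,L]
```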

>>>If Cocoon
>>>dependsOn Avalon today, but not tomorrow, what happens to the Cocoon
>>>dependsOn Avalon triple? Is it wrong? Expired?
>>
>>This is where it starts to get very tricky.
> 
> Yup, I hear that. I want something stable and simple, some way for a store
> to extract Gump-produced project information (once a day, whenever) and make
> some good current and historical determinations from it. I don't think we
> can expect masses of data to be stored semi-indefinitely, so perhaps triples
> about deltas are a way to compress the redundancy.

Don't! Premature optimization. Just publish all the data you have in a 
way that is consistent and persistent over time; the users making use of 
that data will do the rest (we could even host an "RDQL" web service 
on top of that data in the future).

>>One way of doing it is by encoding "provenance". One way of doing it is
>>to add further statements about the statements using "reification".
>>Reification is the act of using a statement as a subject of another
>>statement. Basically, when you have a statement like
>>
>>  "Cocoon dependsOn Avalon"
>>
>>you can also say
>>
>>  ["Cocoon dependsOn Avalon"] wasAsserted 20040827
>>  ["Cocoon dependsOn Avalon"] wasAssertedBy <uri>
> 
> Does this assert two things at once, or can one reference an assertion by an
> ID or something?
> 
> I just don't feel comfortable with this approach, although maybe it is nice
> and simple. It just seems so incredibly verbose.

yep, that's why everybody thinks it's really elegant but nobody uses it ;-)
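
For what it's worth, standard RDF reification of the example above looks 
like this in Turtle; note it takes four triples just to name the 
statement before saying anything about it (the gump terms are 
hypothetical, only the rdf: vocabulary is standard):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix gump: <http://gump.apache.org/ns#> .   # hypothetical namespace

_:stmt a rdf:Statement ;
    rdf:subject   gump:Cocoon ;
    rdf:predicate gump:dependsOn ;
    rdf:object    gump:Avalon ;
    gump:wasAsserted   "20040827" ;
    gump:wasAssertedBy <http://gump.apache.org/> .
```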

>>Dirk's group uses another method, basically encoding provenance directly
>>inside the statement (things called 'quads' instead of 'triples'); this
>>is a non-recommended method and it's not as flexible as reification, but
>>it's a *lot* more efficient. Their quad-based RDFStore is open source
>>(and very fast, I hear) but there are no bindings in python (as of now).
> 
> Interesting. I do suspect some form of versioning/timestamping of facts to
> be in order. That said, maybe also 'who told me this' (so you can judge how
> well you trust it). Hmm, I wonder if triples just need attributes...

eheh, the "provenance" thing will be huge when the W3C attempts to tackle 
the 'trust' issue, which they don't want to do just yet, so I suggest we 
don't even go there for Gump ;-)
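
Just to make the quad idea concrete: the fourth element carries the 
provenance (here, the URL of the gump run that asserted the fact) 
alongside each triple, instead of reifying. A minimal pure-Python 
sketch, with all class and attribute names being illustrative rather 
than any real store's API:

```python
# Toy quad store: each fact carries its provenance as a fourth element.
from collections import namedtuple

Quad = namedtuple("Quad", ["subject", "predicate", "object", "source"])

class QuadStore:
    def __init__(self):
        self.quads = []

    def add(self, s, p, o, source):
        self.quads.append(Quad(s, p, o, source))

    def match(self, s=None, p=None, o=None, source=None):
        """Return quads matching the given pattern (None = wildcard)."""
        return [q for q in self.quads
                if (s is None or q.subject == s)
                and (p is None or q.predicate == p)
                and (o is None or q.object == o)
                and (source is None or q.source == source)]

store = QuadStore()
store.add("Cocoon", "dependsOn", "Avalon",
          "http://gump.apache.org/data/cocoon/20040827")
store.add("Cocoon", "dependsOn", "Excalibur",
          "http://gump.apache.org/data/cocoon/20040828")

# everything Cocoon depends on, each fact tagged with the run that said so
deps = store.match(s="Cocoon", p="dependsOn")
```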

>>How to solve this?
>>
>>Well, I would just create a new model every time, just loading the last
>>statements. For example, you can have a URL such as:
>>
>>http://gump.apache.org/data/path/project/20040827
>>
>>that gives you the /path/project of today or
>>
>>http://gump.apache.org/data/path/project
>>
>>that gives you the "latest" one.
> 
> 
> So similar to what I suggested, where the non-dated URI was the project
> entity, and the dated one was a view of it. Is 'latest' -- a moving concept --
> a risky proposition? Yesterday's latest is today's history, so a triple might
> fail to be true as time passes.

I really don't know what to say here. If the web architectural group 
can't agree on what a URI means, it's going to be hard for us to do it.

Also, the RDF data access WG is working on web services that allow you 
to access the RDF data that you want (rather than just harvest 
everything and do it yourself) [take a look at "joseki" 
http://www.joseki.org/ for an example of what I mean]

So, "latest" might well be just "you know what day it is, so just ask 
for that one".
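
That client-side notion of "latest" could be as simple as computing 
today's dated URL and falling back through previous days; a sketch 
assuming the dated URL layout discussed above (not a real Gump 
endpoint):

```python
# Resolve "latest" client-side: build today's dated URL, plus fallbacks.
from datetime import date, timedelta

def dated_url(project, day):
    # URL layout assumed from the scheme under discussion
    return "http://gump.apache.org/data/%s/%s" % (project, day.strftime("%Y%m%d"))

def latest_urls(project, days_back=7):
    """Candidate URLs, newest first; a client tries each until one responds."""
    today = date.today()
    return [dated_url(project, today - timedelta(days=n))
            for n in range(days_back)]
```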

>>>2.2) I think we wish to map the Gump Ontology to DOAP and others (even
>>>parts of FOAF). How would we do that?
>>
>>with some OWL ontologies.
>>
> 
> I want to try to play nice with DOAP. I want us to be flexible (a
> prototypical approach so we can flesh out time issues, etc.) so I don't want
> to be bound to DOAP, but I'd like to benefit from their endeavours. Can
> anybody help with such a mapping?

Just don't worry about it, focus on your stuff first, the mappings will 
come later.

>>>3) Usages:
>>>
>>>3.1) I was hoping to work on PSP to do queries into the RDBMS. This is
>>>primarily for historical information, but I was thinking about using it
>>>for dependency information also. The more I think about the RDF
>>>information, and triple queries, it seems an RDF store might be a better
>>>place to hold/maintain and query. This information seems RDF-ish, not
>>>RDBMS-ish.
>>
>>Agreed. I would use a triple store with an RDQL query engine (Redland
>>has such a thing and has Python hooks)
> 
> 
> I might try the Jena (Java) version that Sam referenced. I think it is good
> to use Python inside Gump, but allow RDF (serialized to XML) to freely
> separate monitoring/using tools.

Our group uses Jena and it's very well written.
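
For a sense of what such queries look like, an RDQL query (the syntax 
Jena and Redland support) asking for a project's dependencies might 
read as follows; the gump vocabulary and project URIs are assumptions:

```
SELECT ?dep
WHERE  (<http://gump.apache.org/data/cocoon/1999>,
        <http://gump.apache.org/ns#dependsOn>, ?dep)
```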

> Would we want to host a triple store on brutus and allow applications to
> access it? Or, would we want to publish RDF in XML and allow remote clients
> to download?

We could do both: first we publish, then we can aggregate the thing 
ourselves and serve an RDQL web service for people to run queries 
against... but again, this is a subsequent step so don't worry about it 
for now.

>>>3.2) What other 'users' of this descriptor information seem viable?
>>>Ought tools (e.g. Depot) be wishing to figure things out from it?
> 
> Others?
> 
>>Once the RDF infrastructure is in place, one of my goals is to add
>>"legal" metadata to the project and create an inferencing layer that
>>indicates whether or not a project is *legal* depending on the
>>combination of the licenses.
> 
> 
> Awesome, I love that idea. Ought we add the type attribute <license
> type="ASF2.0" (or whatever) to Gump XML-based metadata?

yep, that's the plan, but it should have a URI identifying the license, 
like the RDF version of creative commons.
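
The inferencing layer could start as something as blunt as a 
license-compatibility table keyed by URI. A toy sketch only: the URIs 
are real license locations, but the compatibility table is a 
deliberately simplified illustration, not legal advice and not a 
planned Gump API:

```python
# Toy "legal combination" check over license URIs.
ASL20 = "http://www.apache.org/licenses/LICENSE-2.0"
GPL2  = "http://www.gnu.org/licenses/gpl-2.0.html"

# which dependency licenses a project under a given license may combine
# with (illustrative placeholder, not a real compatibility ruling)
COMPATIBLE = {ASL20: {ASL20}}

def combination_ok(project_license, dep_licenses):
    """True if every dependency license is in the allowed set."""
    allowed = COMPATIBLE.get(project_license, set())
    return all(lic in allowed for lic in dep_licenses)
```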

> Me, I'm primarily interested in version compatibility (what led me to Depot
> [http://incubator.apache.org/depot/version/] in the first place). I'd like
> us to be able to query this knowledge base to determine what products can
> co-exist, at what levels, and so forth.
> That, and recursive downloads from a repository.
> 
> Other thoughts?

oh, ok. that's an interesting requirement.

my suggestion is that we try to make gump work and publish that data 
first, then we find out what to do with it.

-- 
Stefano.

