www-repository mailing list archives

From "Andrew C. Oliver" <acoli...@apache.org>
Subject Re: [proposal] repository URI format
Date Sat, 08 Mar 2003 15:47:30 GMT
I'd like to take this opportunity to contrast my approach: "stick some 
XML descriptors on a webserver wherever you like and point at existing 
files without renaming/moving them wherever they might be". The 
"virtual" repository. 

Leo Simons wrote:

> Hi all,
>
> just read what's in the archive until now. I've summarized (well, not
> really summarized, more elongated) the discussions up till now, added
> my own thoughts, done some research, and then I came to a conclusion. I
> suggest y'all rip this apart, put it together again, (it's in
> wiki-compatible format :D) and then someone tallies a vote on whatever
> list is appropriate.
>
> cheers,
>
> - Leo
>
>         = THE URI FORMAT FOR A SOFTWARE ARTIFACT REPOSITORY =
>
> = Conclusion =
>
> I'll provide my conclusion first, as this is rather a lot of text :D
>
> When the following is known, the URI for any software distribution
> artifact is uniquely specified:
>
> * <FQDN>         - fully qualified domain name of the repository as
>                    defined in the URL spec
> * <protocol>     - <scheme> as defined in the URI spec
> * <base>         - base directory on the machine identified by <FQDN>
>                    (probably relative to documentroot), preferably
>                    consisting of lowercase letters, dashes and slashes
> * <organisation> - the inverse of the domain name of the organisation
>                    that produces the artifact
> * <project>      - the division/group within the organisation that
>                    produces the artifact, preferably consisting of
>                    lowercase letters, dashes and slashes, with a website
>                    at http://<project>.<organisation>/
> * <name>         - the name of the artifact (unique within the
>                    <project>), preferably consisting of lowercase
>                    letters and dashes
> * <type>         - the filetype of the artifact, consisting of whatever
>                    part of the artifact filename normally identifies the
>                    filetype
> * <version>      - the version of the artifact, consisting of any set of
>                    characters allowed in a URI, augmented with any
>                    information about software or hardware platform
>                    requirements if not normally part of the version,
>                    preferably consisting of numbers, letters, dashes and
>                    points
>
> and for various reasons detailed below I think the URI should be
> composed based on the above as follows:
>
> <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<artifact>
>
> = Goal of the Discussion =
>
> The goal of this part of the discussion is to define additional
> constraints/ guidelines above and beyond the URI specification to make
> it possible to uniquely and unambiguously define the location of a
> software distribution artifact, when the following are known:
>
> * the /FQDN/ of the repository (ie repo.apache.org)
> * the /protocol/ used to access the repository (ie http)
> * the /base/ directory of the repository (ie /dist/repository or /)
> * the /organisation/ that produces the artifact
> * the /project/ within the organisation that produces the artifact
> * the /name/ of the artifact
> * the distribution /type/ of the artifact
> * the /version/ of the artifact
>
> = Requirements =
>
> * The URI should be stable
> * The URI should be easy to generate by humans and machines when
>   the above listed items are known
> * The URI should be unique based on the uniqueness of the above
>   listed items, ie:
>
> if ! otherURI.FQDN == thisURI.FQDN
>    return false
> elseif ! otherURI.protocol == thisURI.protocol
>    return false
> elseif ! otherURI.basedir == thisURI.basedir
>    return false
> elseif ! otherURI.organisation == thisURI.organisation
>    return false
> elseif ! otherURI.project == thisURI.project
>    return false
> elseif ! otherURI.name == thisURI.name
>    return false
> elseif ! otherURI.version == thisURI.version
>    return false
> else
>    return true
>
> * the part of the URI not containing FQDN, protocol and basedir should
>   be common across repositories, ie it is desirable that an artifact
>   identified by
>
> ** the organisation that produces the artifact
> ** the project within the organisation that produces the artifact
> ** the name of the artifact
> ** the version of the artifact
>
>   can be found on any repository by substituting the repository FQDN,
>   protocol and basedir from the current URI
>
> = Proposals =
>
> == Base Identification conventions ==
>
> base will often be "", but in the case of mirrors mirroring many
> repositories (ie ibiblio), that might be impractical, in which case
> I suggest the base is whatever maps to a directory on the filesystem
> the repository is using (ie whatever ext2/3, fat32, whatever accepts
> as a directory identifier).
>
> == Organisation Identification conventions ==
>
> It has been suggested that the identification of the organisation is
> done by reverse domain names, ie "org.apache", "org.sun" and "com.ibm".
>
> It has also been suggested that the organisation is not identified
> separately (ie as is current practice on http://www.ibiblio.org/maven/).
>
> == Project Identification conventions ==
>
> It has been suggested that the identification of a project is done by
> lowercase letters separated by dashes, ie jakarta-commons.
>
> I have seen no suggestions as to how the apache project structure should
> map into the project names in the repository, IOW, is the project part
> of commons-logging.jar to be "jakarta", "jakarta-commons", or
> "jakarta-commons-logging"? My suggestion is that the project structure
> mapping is based on top-level-projects (ie *.apache.org), so the answer
> to that question is "jakarta".
>
> In the context of sourceforge, the project identification would map
> similarly, ie the convention of ${projectname}.${host}.org would lead to
> project names of "jboss", "jedit", etc. Hence this sounds like a smart
> mapping to me.
>
> == Artifact Naming conventions ==
>
> It has been suggested that the name of the artifact is to be determined
> by the project providing the artifact, so that the "jakarta" project
> determines what artifact name it will associate with the subsubproject
> http://jakarta.apache.org/commons/logging. Of course, a project could
> choose to delegate such a choice to a subproject or subsubproject; I
> suggest we do not try and define who makes the artifact name choice
> within a project :D
>
> It has been suggested that the name of the artifact is to be comprised
> of lowercase letters separated by dashes, ie commons-logging.
>
> == Versioning conventions ==
>
> I have seen no suggestions with regard to versioning. I assume everyone
> agrees that the format of a version is determined by a project, though
> the recommended practice is that a version is comprised of numbers
> separated by dashes and dots, and optionally containing lowercase
> letters identifying part of the development cycle, ie
>
> * 1.0
> * 1.0a
> * 1.0-alpha
> * 1.0-alpha-1
> * 08032003
> * 03082003
> * 2003-03-08
> * SNAPSHOT-03.08.2003
>
> are all acceptable, and the choice is made to conform to the versioning
> number used by whoever supplies the artifact.
>
> == Distribution type conventions ==
>
> It has been suggested that a distribution type is defined by its
> three-letter acronym, in lowercase, ie:
>
> jar
> war
> ear
> rpm
> tgz
> zip
>
> I have not seen other suggestions. I myself suggest a distribution type
> is identified by whatever filename component normally represents the
> distribution type for a given artifact distribution, ie common types
> would be:
>
> jar
> war
> rpm
> tar.gz
> tgz
> zip
>
> where the use of tar.gz versus the use of tgz depends on the convention
> used by the authoritative distributor of the artifact (ie for apache
> httpd, the files are provided as .tar.gz, so the distribution type is
> tar.gz and not tgz).
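
As a rough sketch of the "whatever filename component normally represents the distribution type" rule, the function below reads the type off a filename. The name `distribution_type`, the `KNOWN_TYPES` tuple, and the sample filenames are assumptions made for this example. The list of types is the one given above.

```python
# Types taken from the lists in the proposal; the tuple itself is an
# assumption of this sketch, not a registry defined anywhere.
KNOWN_TYPES = ("jar", "war", "ear", "rpm", "tar.gz", "tgz", "zip")


def distribution_type(filename, known_types=KNOWN_TYPES):
    """Return the distribution type encoded in the filename's suffix."""
    # Check longer suffixes first so a multi-part type such as "tar.gz"
    # is matched as a whole rather than by its last component.
    for candidate in sorted(known_types, key=len, reverse=True):
        if filename.endswith("." + candidate):
            return candidate
    raise ValueError(f"no known distribution type in {filename!r}")


# Hypothetical filenames, used only to exercise the function.
print(distribution_type("httpd-2.0.tar.gz"))
print(distribution_type("commons-logging-1.0.jar"))
```

Because the suffix is read as-is, a file published as .tar.gz keeps the type tar.gz and one published as .tgz keeps tgz, following the distributor's convention.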
>
> == The URI format ==
>
> (refer to http://www.ietf.org/rfc/rfc2396.txt; note we can make the
> assumption <protocol> == <scheme>)
>
> Adopting the convention <thing> to identify the parts of the URI, I
> have seen the following suggestions:
>
> * <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<version>/<artifact>
> * <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<artifact>
> * <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<artifact>
> * <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<artifact>
> * <protocol>://<FQDN>/<base><name>/<type>s/<artifact>
>
> where proposals for the format of <artifact> can be any of
>
> * <artifact> = <name>-<version>.<type>
> * <artifact> = <name>.<type>
> * <artifact> = ANY_VALID_URI_CHARACTERS
> * <artifact> = <name>-<version>.<type> | <name>.<type>
>
> === The current maven repository format ===
>
> Maven uses two different setups:
>
> <artifact> = <name>-<version>.<type>
> <protocol>://<FQDN>/<base><name>/<type>s/<artifact>
>
> if <type> == jar and
>
> <artifact> = <name>-<version>.<type>
> <protocol>://<FQDN>/<base><name>/distributions/<artifact>
>
> if <type> == zip || <type> == tar.gz
>
> I don't think it provides other <type>s in its repo atm.
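
The two Maven setups as described above could be modelled like this. It is a sketch of the description in this message only, not of Maven's actual code. The function name `maven_style_uri` and the sample host and base are assumptions.

```python
def maven_style_uri(protocol, fqdn, base, name, version, type_):
    """Build a URI following the two layouts described in the proposal.

    `base` is expected to be "" or to end with a slash, because the
    templates place it directly before the next path element.
    """
    artifact = f"{name}-{version}.{type_}"
    if type_ == "jar":
        directory = f"{type_}s"          # <name>/<type>s/<artifact>
    elif type_ in ("zip", "tar.gz"):
        directory = "distributions"      # <name>/distributions/<artifact>
    else:
        # The proposal does not say where other types would go.
        raise ValueError(f"layout for type {type_!r} is not described")
    return f"{protocol}://{fqdn}/{base}{name}/{directory}/{artifact}"


# Sample values chosen for illustration.
print(maven_style_uri("http", "www.ibiblio.org", "maven/",
                      "commons-logging", "1.0", "jar"))
```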
>
> === How to choose a format ===
>
> I think we should start with taking into account
>
> if ! otherURI.FQDN == thisURI.FQDN
>    return false
> elseif ! otherURI.protocol == thisURI.protocol
>    return false
> elseif ! otherURI.basedir == thisURI.basedir
>    return false
> elseif ! otherURI.organisation == thisURI.organisation
>    return false
> elseif ! otherURI.project == thisURI.project
>    return false
> elseif ! otherURI.name == thisURI.name
>    return false
> elseif ! otherURI.version == thisURI.version
>    return false
> else
>    return true
>
> And once we have that settled, we should choose a layout which does
> not duplicate information, in order to keep the URI short, ie I cannot
> see why it is a good idea to specify (.*)<version>(.*)<version>(.*) for
> putting the version in the URI.
>
> The next choice is between
>
> * <artifact> = <name>-<version>.<type>
> * <artifact> = <name>.<type>
> * <artifact> = ANY_VALID_URI_CHARACTERS
> * <artifact> = <name>-<version>.<type> | <name>.<type>
>
> and when that is settled we can determine the rest of the URI.
>
> Note that the choice of <artifact> is important, as this is what most
> applications will provide as the normal name for the user to save the
> files.
>
> === My case for <artifact> ===
>
> The advantage of ANY_VALID_URI_CHARACTERS is that it reduces the need
> for renaming of files when included in the repository: one can just use
> the same filename as provided by the original artifact distributor.
>
> The big disadvantage is that this doesn't satisfy the requirement that a
> URI should be identified as detailed above: you need to know <artifact>
> in addition to all the other information. While this is easily solved
> using metainformation or introspection (in the case of machines), I
> think it makes a URI much harder to guess for a human, and is hence
> inconvenient.
>
> This argument also applies to <name>-<version>.<type> | <name>.<type>,
> though less so because you have to guess from only two possibilities.
> However, you still need to guess, defeating the "U" in URI.
>
> So I suggest we choose either
>
> * <artifact> = <name>-<version>.<type>
>
> or
>
> * <artifact> = <name>.<type>
>
> where my preference is for the former based on the dominant practice in
> distribution repository setup (re: maven, rpm, apt, ports, cpan, pear).
>
> === My case for the entire URI ===
>
> ==== Common Ground ====
>
> I think everyone agrees that the first part of the URI needs to be
>
> <protocol>://<FQDN>/<base>
>
> so let's start from that. The rest is based on the principle that the
> URI should be as short as possible and simple to remember, and contain
> no duplicate information, and on the assumption that
>
> * <artifact> = <name>-<version>.<type>
>
> ==== My Preference ====
>
> My preference is for
>
> * <protocol>://<FQDN>/<base><organisation>/<project>/<artifact>
>
> so the below information
>
> FQDN = www.apache.org
> protocol = http
> base = dist/repository/
> organisation = org.apache
> project = jakarta
> name = commons-logging
> type = jar
> version = 1.0
>
> results in a URI of
>
> http://www.apache.org/dist/repository/org.apache/jakarta/commons-logging-1.0.jar
>
> ==== Coping with filesystem limits ====
>
> However, the potential danger here is that the project "jakarta" might
> distribute 100s of files (which it does), producing a very long list
> of files in the "jakarta" directory on the server and too much output
> when visiting
>
> http://www.apache.org/dist/repository/org.apache/jakarta/
>
> with a normal browser (a problem common when browsing RPM repositories,
> for example). To avoid that, I suggest we make the URI a bit longer by
> repeating the <name> and <type> elements:
>
> * <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<artifact>
>
> resulting in
>
> http://www.apache.org/dist/repository/org.apache/jakarta/commons-logging/jars/commons-logging-1.0.jar
>
> The choice of <name> as a repetition element is, I think, accepted by all.
> The rationale is that a user visiting
>
> http://www.apache.org/dist/repository/org.apache/jakarta
>
> will know what project he is looking for, but not necessarily what
> version ("just give me the latest") or what type ("I'll take whatever
> you got, my tool can decompress anything").
>
> The choice of one of
>
> * <type>
> * <type>s
> * <type>/<version>
> * <version>/<type>
> * <version>/<type>s
> * <version>
>
> is less easy. I somewhat doubt that using either <type> or <version>
> will result in very long lists of files in a single directory, so I
> can't think of much of an argument for choosing between those. Using
> both of them is ruled out, I'd say, for reasons of wanting a short
> URI.
>
> So, <version> or <type>? Based on looking at the setup used by rpm and
> maven, I think the most common practice is <type>s, so I suggest we
> go with that.
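
The layout settled on in this section can be sketched as a pair of small functions. The names `artifact_filename` and `repository_uri` are made up for this example. The input values are the ones used in the worked example above, with version 1.0 as it appears in the resulting URI.

```python
def artifact_filename(name, version, type_):
    # <artifact> = <name>-<version>.<type>
    return f"{name}-{version}.{type_}"


def repository_uri(protocol, fqdn, base, organisation, project,
                   name, version, type_):
    """Compose the proposed URI.

    <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<artifact>
    `base` is expected to be "" or to end with a slash, as in the
    example value "dist/repository/".
    """
    artifact = artifact_filename(name, version, type_)
    return (f"{protocol}://{fqdn}/{base}{organisation}/{project}/"
            f"{name}/{type_}s/{artifact}")


print(repository_uri("http", "www.apache.org", "dist/repository/",
                     "org.apache", "jakarta", "commons-logging",
                     "1.0", "jar"))
```

With those inputs the function reproduces the commons-logging URI given earlier in this section.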
>
> = We forgot something: architecture, os, language! =
>
> Since we're mostly java developers, we don't need to worry about
> architecture. However, for a general convention, we should take into
> account other languages, like C and C++, which often result in specific
> binaries. Even for java, there often are windows and linux-specific
> versions (though I know of no java package for 386 as opposed to 686
> architecture).
>
> Architecture can be split into operating system and hardware platform,
> though there is often some or a lot of overlap. Let's call the hardware
> platform "architecture", and the operating system "os".
>
> Then there's the case of languages: many software packages are not
> multi-lingual, and specific versions are provided for many different
> languages.
>
> I suggest we wrap "architecture", "os" and "language" into "version",
> allowing distributors to figure out for themselves how to differentiate
> between the various options. This makes life easier for java developers
> and doesn't change the mess for other developers.
>
> I couldn't find a common pattern anyway. Many linux vendors separate on
> language early (and then there's this dumb "en" directory with no
> friends, as everyone uses english anyways), don't separate on os (being
> all about a single os after all), and separate on architecture after
> having separated on type. But even here things are inconsistent:
> just look at the language packs for KDE in a subsubsubdirectory of
> /en/ in the case of RedHat.
>
> Apache HTTPD does not separate on language, but separates binaries early
> on, then includes the architecture as part of the version.
>
> So there's no lesson to learn from prior art other than that it is a bit
> messy :D
>
>


