www-repository mailing list archives

From "Mark R. Diggory" <mdigg...@latte.harvard.edu>
Subject Re: duplicate data
Date Sun, 22 Feb 2004 05:59:58 GMT
I'll try to expand on the functionalities of Maven below.

Sander Striker wrote:

> On Sat, 2004-02-21 at 01:01, Mark R. Diggory wrote:
>>Noel J. Bergman wrote:
>>>>The issue is... the jars/distributables are placed into the
>>>>java-repository using maven.
> Can you explain this a bit?  I thought Maven was used to fetch
> projects and dependencies.  Of course I can read up on Maven,
> but a quick summary of the technicalities would be appreciated.

Maven is used both to fetch jars from the repository and to publish jars 
to it. For publishing, it basically works through ssh sessions in which it 
runs a number of commands (scp, md5, chmod, chgrp). Because this is 
encapsulated within Maven, the user can rely on Maven's deployment 
mechanism to set up the jar/signature in the repository for their project. 
Since it's scripted, it is done the same way every time. This takes a great 
deal of the effort involved in publishing jars to the repository out of the 
user's hands.
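
To make that sequence concrete, here is a rough sketch of the commands such a deployment might issue. It is only an illustration, not Maven's actual deployment code: the function name `deploy_commands`, the host, the remote directory, and the group are all hypothetical, and the checksum tool name is a parameter because it differs between platforms. The function only builds the command lines; it does not run anything.

```python
import posixpath
import shlex


def deploy_commands(jar, host, remote_dir, group, md5_tool="md5sum"):
    """Return the argv lists for one deployment; nothing is executed here.

    All arguments are illustrative assumptions, not Maven settings.
    """
    name = posixpath.basename(jar)
    target = posixpath.join(remote_dir, name)
    quoted = shlex.quote(target)
    return [
        # Copy the jar into the repository directory.
        ["scp", jar, f"{host}:{target}"],
        # Write a checksum file next to it (tool name varies by platform).
        ["ssh", host, f"{md5_tool} {quoted} > {quoted}.md5"],
        # Make both files group-writable and world-readable.
        ["ssh", host, f"chmod 664 {quoted} {quoted}.md5"],
        # Hand both files to the project's Unix group.
        ["ssh", host, f"chgrp {shlex.quote(group)} {quoted} {quoted}.md5"],
    ]
```

Each returned list could be passed to something like `subprocess.run`. Keeping the steps in one function is what gives the "same way every time" property.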

Maven is really doing nothing more than acting as an ssh client for the 
user and automating the deployment process for them using their apache 

This benefits Maven because it can rely on the repository being 
maintained in a structure it can predict and locate dependencies within.
>> so, currently, if you look in
>>>>something like the commons project.properties you'll see that
>>>>they are pointing to the central repository for the location
>>>>of where to "publish" files.
>>>>The "convergence issues" we currently have for the repository:
>>>> 1.) We want single copies of files on the mirrors.
>>> +1
> This is the core point.

Yes, we all agree on this one...

>>>>My best conclusion is
>>>>keep "jars" in the java-repository, do not keep them
>>>>in your /dist/<project>/<binaries> directory.  Remove all
>>>>[jar/zip/tar files] from the java-repository.
>>>>symlink the appropriate java-repository dir into their appropriate
>>>>"dist" directory.
> That would mean that this entire area would have to be rw to all
> groups producing releases that are to be in there.  This kind of means
> apcvs group ownership, which I don't really fancy doing.  The other
> way around, control and access of each project's dist/ area separated,
> and symlinking to that from java-repository, seems a bit sa[fn]er to
> me.

Ultimately we are seeking a convergence here between what the repository 
folks want to see, the maven users want to see and the infrastructure 
folks want to see.

1.) For the repository (and Maven) folks, we want to see the contents of 
dist become standardized according to the Repository URI specification. 
This means "all" distributables (java or not) are organized according to 
this specification.

2.) For Maven users, no matter what happens, we need to maintain a 
functionally working repository that works with the existing version of 

3.) For Infrastructure, all this needs to be properly secured and 
maintained according to Apache standards.

The java-repository structure is broken down into


this would mean each project would need to maintain a separate set of 
symlinks for "jars", "distributables", "...".

>>>That sounds OK to me, but folks like Sander and others more involved in
>>>mirroring should be put in the loop.  Everything we put under dist/ affects
>>>100s of mirrors.
> Not me specifically, but Infrastructure.  Others are more actively
> maintaining the mirrors list and monitoring the mirrors.  The mirrors
> are a precious resource and we want to be careful not to 'scare' any
> mirrors away with actions on our end.
>>Yes, I learned that the hard way when we created the contents of 
>>java-repository... that was not a happy weekend. I don't make any "rash" 
>>changes to dist any more...Only well thought out moves. But we are in a 
>>state of cleanup now as well, we have to consider what we are going to 
>>do next.
> If you are making large changes to the directory structure and the
> majority of the files is already on the mirrors, send a mail to
> mirrors@, attach a shell script that moves everything around locally,
> and give them a heads up on when this shuffle is happening.  This
> saves a _lot_ of bandwidth.
> Also, when adding a lot, make sure to inform the mirrors, so they
> are prepared.
>>>>Discussion about how to finalize the directory structure such
>>>>that "Repository", "Dist/Mirror" and "Maven" has to move forward.
> I don't parse this, but since Noel can read it, I am probably missing
> context/background.

Just that these groups are all focused on different aspects of the 
distributables in the dist directory:

The Repository project's URL structure is important in standardizing and 
improving the dist contents into a more formal structure.

The Maven project represents a working example of a tool that is built 
upon this structure.

The dist directory maintainers and the mirrors out there represent a 
"control" on the whole situation: if it doesn't work for them, then it's 
not realistic as a strategy.

>>>That would be good.
>>In our last discussion, I think one of the conclusions that was arrived 
>>at as well, was the idea of breaking the java-repository up into two 
>>different locations.
>>www/cvs.apache.org/dist/java-repository --> nightly builds
>>www/www.apache.org/dist/java-repository --> official releases.
>>the idea was that nightly/weekly builds are not things we want to see on 
>>mirrors but to be available for developers. And that official release of 
>>jars are things we want to see mirrored.
> Is Maven using the mirrors today, like getting the list of active
> mirrors from the main site and finding the closest?  Or is it only
> using the main site and perhaps ibiblio?

Currently, all Maven clients use www.ibiblio.org/maven to retrieve 
content. www.ibiblio.org is also a mirror of /java-repository for all 
its apache content. Actually Maven users DO NOT go to 
www.apache.org/dist/java-repository to download files, and only Apache 
developers can publish to www.apache.org/dist/java-repository.

Which server is used currently depends on the configuration of the Maven 
client; servers currently have no way to hand a client off to another 
mirror. I think that in the future, as the Repository comes into existence 
along with machine-readable metadata or mechanisms for directing clients 
to mirrors, clients like Maven will implement such capabilities.
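
As a rough illustration of client-side selection, the sketch below tries an ordered list of repository base URLs and returns the first one that has the file. This is not Maven's code: the function names, the example hosts, and the caller-supplied `exists` check are all assumptions made for the example, and the relative path is taken as given rather than derived from any repository layout.

```python
def candidate_urls(repositories, relative_path):
    """Return the URLs a client would try, in configured order."""
    path = relative_path.lstrip("/")
    return [f"{base.rstrip('/')}/{path}" for base in repositories]


def first_available(repositories, relative_path, exists):
    """Return the first candidate URL for which exists(url) is true, else None.

    `exists` is supplied by the caller (for example an HTTP HEAD check), so
    this sketch performs no network access itself.
    """
    for url in candidate_urls(repositories, relative_path):
        if exists(url):
            return url
    return None


# Hypothetical configuration: a nightly-build server listed ahead of a
# release mirror, so the nightly copy wins whenever it is present.
EXAMPLE_REPOSITORIES = [
    "http://nightly.example.org/repo/",
    "http://releases.example.org/repo",
]
```

Because the order lives entirely in the client's configuration, putting a different server first changes where files come from without any cooperation from the servers.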

>>When it comes to things like the ibiblio maven repository, it would only 
>>maintain full version releases of apache projects.
> Can you explain why ibiblio is special here?  I mean, what you describe
> is what is supposed to be on all the mirrors right?

Just because it is the "default" repository used by the Maven Client.
>> If you're an apache 
>>project and need to be on the bleeding edge for a component, then you 
>>can simply add
>>as your first repository location and get your apache jars straight off 
>>the nightly builds...
>>The big question is how to facilitate this as a build process, I think the 
>>last decision on the Jakarta Commons/General/Maven lists was that we 
>>would automate the build process for releasing the nightly jars into
>>And the only publishing of jars by actual humans (Release Managers) 
>>would be the full releases onto
> Symlinks I hope.  Mirrors handle symlinks efficiently, that is,
> if they follow our rsync instructions.

The only mirroring that would be done would be via:


All other content in cvs.apache.org or archive.apache.org is not to be 
"synced", as it's not to be published out to mirrors; such content is 
"developer build" material and not for public consumption.

Within the www.apache.org/dist directory, yes symlinking should be used 
to resolve duplication.

> Take a look at http://www.apache.org/~henkp/md5/, specifically
> the fyi: some duplicates section.  Dups are a waste of bandwidth
> and diskspace.

Yes, approx 50% of instances of duplication on this page are currently 
caused by avalon components (avalon also was using their dist directory 
as a private maven repository). For example:


I understand the policy can be that, when rsyncing, a symlink will not be 
followed if it and its target directory do not have the same ownership. I 
believe this creates a problem, in that I cannot simply create symlinks 
from java-repository/excalibur-component/ to avalon/excalibur-component/, 
as they will not be followed by rsync.

However, the other 50% of duplicates within the java-repository 
directory should be properly alleviated with symlinking. I can work on 
this as I now (as of a couple of days ago) own all the files :-). I will 
start working on a script I can run periodically which will accomplish this.
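
A minimal sketch of what such a periodic script could look like, assuming duplicates are defined as files with identical size and MD5 under one root, and that the first copy found (in sorted traversal order) is the one to keep. The function names are hypothetical, it assumes a POSIX filesystem, and it does not check file ownership, so it does not address the rsync concern above.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def symlink_duplicates(root, dry_run=True):
    """Replace later copies of identical files under root with relative symlinks.

    Returns a list of (duplicate, original) pairs. With dry_run=True nothing
    on disk is changed.
    """
    first_seen = {}  # (size, md5) -> path of the copy that is kept
    replaced = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # leave existing symlinks and special files alone
            key = (os.path.getsize(path), md5_of(path))
            original = first_seen.setdefault(key, path)
            if original == path:
                continue
            replaced.append((path, original))
            if not dry_run:
                link_target = os.path.relpath(original, os.path.dirname(path))
                os.remove(path)
                os.symlink(link_target, path)
    return replaced
```

Running it first with `dry_run=True` and reviewing the returned pairs seems prudent, given how sensitive the mirrors are to changes under dist. Relative link targets are used so the links still resolve after the tree is copied elsewhere.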

> I'll ask Henk to disable the checks for presence of md5 in the
> dist/java-repository, since that doesn't seem to be applicable
> there.  It seems to me that you do want to do some verification
> in maven, but you are probably storing signature information
> somewhere in the maven 'database'?

No, it is in the directory structure (no db), and md5s should exist next 
to the files. There is a bug in Maven caused by the fact that on BSD 
checksums are generated by "md5", not "md5sum" as on Linux; this needs to 
be addressed. For example, you can see my md5 was bad on the math jar 
(which I just fixed).
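
To illustrate the difference: GNU md5sum prints the digest followed by the file name, while BSD md5 prints "MD5 (file) = digest". One way a checker could tolerate both is sketched below. This is an assumption about how the problem might be handled, not a description of Maven's fix, and the function names are hypothetical.

```python
import hashlib
import re

# GNU md5sum prints "<digest>  <file>"; BSD md5 prints "MD5 (<file>) = <digest>".
_BSD_LINE = re.compile(r"^MD5 \((?P<name>.*)\) = (?P<digest>[0-9a-fA-F]{32})$")
_GNU_LINE = re.compile(r"^(?P<digest>[0-9a-fA-F]{32})(?:\s+\*?(?P<name>.+))?$")


def parse_md5_line(line):
    """Return the lowercase hex digest from either output style, or None."""
    line = line.strip()
    match = _BSD_LINE.match(line) or _GNU_LINE.match(line)
    return match.group("digest").lower() if match else None


def matches_checksum(data, checksum_line):
    """Check raw bytes against a checksum line in either format."""
    expected = parse_md5_line(checksum_line)
    return expected is not None and hashlib.md5(data).hexdigest() == expected
```

A bare digest with no file name is also accepted, since a .md5 file may contain only the hash.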

Mark Diggory
Software Developer
Harvard MIT Data Center
