Received: by taz.hyperreal.com (8.6.12/8.6.5) id JAA18464; Thu, 30 May 1996 09:31:49 -0700
Received: from life.ai.mit.edu by taz.hyperreal.com (8.6.12/8.6.5) with SMTP id JAA18456; Thu, 30 May 1996 09:31:47 -0700
Received: from volterra.ai.mit.edu by life.ai.mit.edu (4.1/AI-4.10) for new-httpd@hyperreal.com id AA26827; Thu, 30 May 96 12:31:45 EDT
From: rst@ai.mit.edu (Robert S. Thau)
Received: by volterra.ai.mit.edu (8.6.12/AI-4.10) id MAA01219; Thu, 30 May 1996 12:31:43 -0400
Date: Thu, 30 May 1996 12:31:43 -0400
Message-Id: <199605301631.MAA01219@volterra.ai.mit.edu>
To: new-httpd@hyperreal.com
Subject: Stuff that happened at a workshop in Cambridge.
Sender: owner-new-httpd@apache.org
Precedence: bulk
Reply-To: new-httpd@hyperreal.com

I got a last-minute invite to a workshop on distributed searching and
indexing which happened in Cambridge over the last couple of days.
Some things that happened which might be of interest to people here,
in no particular order:

1) I was asked a couple of times, "could Apache support XXX"?

   My general answer, after stressing that I was not in a position to
   make any commitments on behalf of the group, and that we're still
   trying to make up our minds about what *exactly* the criteria ought
   to be, was that if someone wanted to add Apache support for their
   hack of choice, the right approach would be to:

   a) Supply code, if it isn't easy to do that, and be willing to see
      it distributed on our terms.

   b) Persuade the group that it was a good feature for Apache to
      support, fit well with the distribution, and was worth the risk
      of taking a CERT advisory on the code.

   c) Be willing to stick around and support the code when problems
      came up (as is inevitable in the real world).

   I hope this isn't too far from the actual (still evolving, I think)
   consensus.

2) It turns out that a *lot* of people want some way of asking a
   server, "what's changed since Tuesday?".
   Netscape actually has implemented exactly that capability in their
   enterprise and catalog servers; they've just been very quiet about
   the spec. Hopefully, in this instance at least, they'll be adopting
   a more open attitude --- if the stars are right, a description of
   what they have implemented (which, BTW, the Netscape rep said he
   wouldn't mind seeing cloned), may be showing up as a W3C draft in
   the near term.

   One thing that would fall out of this sort of effort, BTW, is the
   ability to give a well-mannered spider very precise directions
   about what it should and should not try to index at any particular
   site. (Ill-mannered spiders are, unfortunately, uncontrollable by
   definition).

   BTW, one of the reasons that people were asking about putting stuff
   into Apache is that we could potentially provide a vehicle for
   widespread deployment of this sort of thing if we *did* slot it
   into our own core distribution.

3) Another item which was of interest to a number of people and groups
   is establishing a common query format. There's a group at Stanford
   which is trying to do exactly this by getting Excite, Verity, etc.,
   to agree on whatever they can manage to agree on (with a spec that
   leaves hooks for proprietary extensions).

4) In the meantime, the Z39.50 community is wondering "why don't these
   people just use *our* stuff"? General Magic is wondering the same
   thing, but they have even less of a chance.

   (Z39.50 is a common query protocol framework which is used by a lot
   of library catalog systems, and has been used in other fields; it's
   also the basis of WAIS).

5) Yet *another* thing that may appear over the medium term is a
   common full-text inverted index format. There is an interest on the
   part of a number of search-engine vendors in defining such a thing,
   provided that it leaves them with enough information to deploy
   their own proprietary tricks without exposing them in public.

6) Since there was a representative from Microsoft, I asked him if he
   knew anything about the WBCLI and TinyWeb spiders which seemed to
   be giving people trouble here last month; I also forwarded him a
   few samples of the complaints I was seeing (with names stripped
   off, in case any of you are contemplating business deals with
   Microsoft).

   FWIW, jericho2 is actually Microsoft's firewall. It has a lot of
   individual users behind it, and the access patterns they
   collectively generate can wind up looking like a "stupid robot".
   However, it isn't obvious to either me or the Microsoft guy that
   that accounts for WBCLI or TinyWeb; he said he'd try talking things
   over with the people who run jericho, and see if they have any idea
   what's up.

7) For your amusement, since the better spiders have stopped indexing
   the now-ubiquitous HTML comments, current practice in fooling
   search engines has evolved. The new state of the art is apparently
   as follows:

      sex breast sex breast sex breast...

   This says something about human ingenuity and resourcefulness, but
   I'm not sure I want to know what.

8) On a similar note, there was a talk by one of the guys running the
   c|net virtual software library; one of their serious concerns is
   trying to figure out a way *not* to tell a ten-year-old who looks
   at their list of most popular downloads that the number 3 item (or
   whatever it is) is the hooters screen saver --- in particular,
   they'd like a way to do this which does not involve human editorial
   decisions and the associated potential for legal liability.

rst