From: rst@ai.mit.edu (Robert S. Thau)
Date: Thu, 30 May 1996 12:31:43 -0400
Message-Id: <199605301631.MAA01219@volterra.ai.mit.edu>
To: new-httpd@hyperreal.com
Subject: Stuff that happened at a workshop in Cambridge.
Sender: owner-new-httpd@apache.org
Precedence: bulk
Reply-To: new-httpd@hyperreal.com

I got a last-minute invite to a workshop on distributed searching and
indexing which happened in Cambridge over the last couple of days.
Some things that happened which might be of interest to people here,
in no particular order:

1) I was asked a couple of times, "could Apache support XXX"?  My
   general answer, after stressing that I was not in a position to make
   any commitments on behalf of the group, and that we're still trying
   to make up our minds about what *exactly* the criteria ought to be,
   was that if someone wanted to add Apache support for their hack of
   choice, the right approach would be to:

   a) Supply code, if it isn't easy to do that, and be willing to see
      it distributed on our terms.
   b) Persuade the group that it was a good feature for Apache to
      support, fit well with the distribution, and was worth the risk
      of taking a CERT advisory on the code.
   c) Be willing to stick around and support the code when problems
      came up (as is inevitable in the real world).

   I hope this isn't too far from the actual (still evolving, I think)
   consensus. 

2) It turns out that a *lot* of people want some way of asking a
   server, "what's changed since Tuesday?".  Netscape actually has
   implemented exactly that capability in their Enterprise and Catalog
   servers; they've just been very quiet about the spec.  Hopefully,
   in this instance at least, they'll be adopting a more open attitude
   --- if the stars are right, a description of what they have
   implemented (which, BTW, the Netscape rep said he wouldn't mind
   seeing cloned) may be showing up as a W3C draft in the near term.

   One thing that would fall out of this sort of effort, BTW, is the
   ability to give a well-mannered spider very precise directions
   about what it should and should not try to index at any particular
   site.  (Ill-mannered spiders are, unfortunately, uncontrollable by
   definition). 

   BTW, one of the reasons that people were asking about putting stuff
   into Apache is that we could potentially provide a vehicle for
   widespread deployment of this sort of thing if we *did* slot it
   into our own core distribution.
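
   To make the "what's changed since Tuesday?" question concrete, here
   is a rough sketch of the lookup a server would have to answer.  It
   says nothing about Netscape's actual design, which hasn't been
   published; the function name, the catalog layout, and the sample
   paths are all made up for illustration.

```python
from datetime import datetime


def changed_since(catalog, cutoff):
    """Return the paths in `catalog` modified strictly after `cutoff`, sorted.

    `catalog` is a hypothetical mapping of URL path -> last-modified time;
    a real server would get these times from its filesystem or database.
    """
    return sorted(path for path, modified in catalog.items() if modified > cutoff)


# Hypothetical site contents, invented for this example.
catalog = {
    "/index.html": datetime(1996, 5, 29, 9, 0),
    "/docs/install.html": datetime(1996, 5, 20, 14, 30),
    "/news.html": datetime(1996, 5, 30, 8, 15),
}

tuesday = datetime(1996, 5, 28)
print(changed_since(catalog, tuesday))  # ['/index.html', '/news.html']
```

   A spider that got back a list like this would only have to refetch
   the documents named in it, rather than re-crawling the whole site.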

3) Another item which was of interest to a number of people and groups
   is establishing a common query format.  There's a group at Stanford
   which is trying to do exactly this by getting Excite, Verity, etc.,
   to agree on whatever they can manage to agree on (with a spec that
   leaves hooks for proprietary extensions).

4) In the meantime, the Z39.50 community is wondering "why don't these
   people just use *our* stuff"?  General Magic is wondering the same
   thing, but they have even less of a chance.  (Z39.50 is a common
   query protocol framework which is used by a lot of library catalog
   systems, and has been used in other fields; it's also the basis of
   WAIS).

5) Yet *another* thing that may appear over the medium term is a
   common full-text inverted index format.  There is an interest on
   the part of a number of search-engine vendors in defining such a
   thing, provided that it leaves them with enough information to
   deploy their own proprietary tricks without exposing them in
   public.

6) Since there was a representative from Microsoft, I asked him if he
   knew anything about the WBCLI and TinyWeb spiders which seemed to
   be giving people trouble here last month; I also forwarded him a
   few samples of the complaints I was seeing (with names stripped
   off, in case any of you are contemplating business deals with
   Microsoft).

   FWIW, jericho2 is actually Microsoft's firewall.  It has a lot of
   individual users behind it, and the access patterns they
   collectively generate can wind up looking like a "stupid robot".
   However, it isn't obvious to either me or the Microsoft guy that
   that accounts for WBCLI or TinyWeb; he said he'd try talking things
   over with the people who run jericho, and see if they have any idea
   what's up.

7) For your amusement, since the better spiders have stopped indexing
   the now-ubiquitous <!-- sex breast sex breast sex breast ... -->
   HTML comments, current practice in fooling search engines has
   evolved.  The new state of the art is apparently as follows:

   <body bgcolor=white>
   <font color=white>
   sex breast sex breast sex breast...
   </font>

   This says something about human ingenuity and resourcefulness, but
   I'm not sure I want to know what.
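
   For illustration only, here is roughly how a spider might decline
   to index text whose font colour matches the page background.  The
   class and function names are made up for this sketch, it compares
   colour values literally (so "white" and "#ffffff" would not match),
   and it is not a description of what any real spider does.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text, skipping runs whose <font color> equals the page background."""

    def __init__(self):
        super().__init__()
        self.background = None  # colour taken from the <body> tag, if any
        self.font_colors = []   # stack of colours from currently open <font> tags
        self.visible = []       # text runs judged to be visible

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            # Assumption: accept either attribute spelling as the page colour.
            value = attrs.get("bgcolor") or attrs.get("background") or ""
            self.background = value.lower() or None
        elif tag == "font":
            self.font_colors.append((attrs.get("color") or "").lower() or None)

    def handle_endtag(self, tag):
        if tag == "font" and self.font_colors:
            self.font_colors.pop()

    def handle_data(self, data):
        # The innermost <font> tag that actually set a colour wins.
        color = next((c for c in reversed(self.font_colors) if c), None)
        if color is not None and color == self.background:
            return  # same colour as the background: treat as hidden
        if data.strip():
            self.visible.append(data.strip())


def visible_text(html):
    """Return the text of `html` with background-coloured runs left out."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.visible)


page = """<body bgcolor=white>
Welcome to my page
<font color=white>
sex breast sex breast
</font>
</body>"""
print(visible_text(page))  # Welcome to my page
```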

8) On a similar note, there was a talk by one of the guys running the
   c|net virtual software library; one of their serious concerns is
   trying to figure out a way *not* to tell a ten-year-old who looks
   at their list of most popular downloads that the number 3 item (or
   whatever it is) is the hooters screen saver --- in particular,
   they'd like a way to do this which does not involve human editorial
   decisions and the associated potential for legal liability.

rst