Message-ID: <501BF8A9.3080009@gmail.com>
Date: Fri, 03 Aug 2012 09:13:29 -0700
From: Kay Schenk
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:13.0)
 Gecko/20120601 Thunderbird/13.0
MIME-Version: 1.0
To: ooo-dev@incubator.apache.org
Subject: Re: investigation using Google Webmaster tools
References: <5019BFA7.6010107@gmail.com>
In-Reply-To:
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Virus-Checked: Checked by ClamAV on apache.org

On 08/02/2012 07:45 AM, Rob Weir wrote:
> On Wed, Aug 1, 2012 at 7:45 PM, Kay Schenk wrote:
>>
>>
>> On 08/01/2012 04:29 PM, Rob Weir wrote:
>>>
>>> On Wed, Aug 1, 2012 at 7:06 PM, Kay Schenk wrote:
>>>>
>>>> Hello all --
>>>>
>>>> I am exploring the www.openoffice.org site using the Google Webmaster tool
>>>> that Rob told us about on Jul 19.
>>>>
>>>> I am ONLY getting started by looking at the 62,962 404 errors (!!!!!)
>>>>
>>>> Many of these are links to VERY old docs which we no longer have -- like
>>>> source trees for 1.0.1, 1.0.2 etc. -- or have to do with the OLD
>>>> architecture -- servlet references etc.
>>>>
>>>
>>> If I understand this correctly, Google is looking at links on
>>> webpages, not just our webpages, but also links from 3rd party
>>> websites, and if they point to an openoffice.org page that doesn't
>>> exist, it shows up on this list. This could happen for any reason.
>>> In some cases the original link might have had a typo.
>>
>>
>> yes, this is correct, and you are right about this too...some of the 404s
>> reference pages we probably NEVER had.
>>
>>
>>>
>>>> Some of these issues could be solved with rather extensive use of sym
>>>> links (yes, you can actually use these in svn -- kind of) and of course
>>>> some not -- many missing old security bulletins.
>>>>
>>>
>>> For the security bulletins, I wonder if this is actually a redirection
>>> error.
>>> We have many of them here:
>>>
>>> http://www.openoffice.org/security/bulletin.html
>>
>>
>> ah...yes, they are there...the problem is we would need to construct a LOT
>> of just "redirect" pages to right some of these since they all seem to have
>> the form
>>
>> "/security/cvs-bulletin-number".html
>>
>
> So let's take a specific example.
>
> Google is reporting a 404 error for this URL:
> http://www.openoffice.org/security/bulletin-20060629.html
>
> It is linked to from at least 10 external web pages, for example
> the last link in this table:
>
> http://www.ccip.govt.nz/vulnerability-alerts/archives/2006/AlertArchive0607.html
>
> (Whoops, make that at least 12 links, since the Apache and MarkMail
> list archives will now link to this)
>
> There is no file of this name in
> https://svn.apache.org/repos/asf/incubator/ooo/ooo-site/trunk/content/security/
>
> Looking at the svn log I don't see any record of the files ever being here.
>
> I searched the complete ooo-site tree and this file
> (bulletin-20060629.html) doesn't exist anywhere.
>
> The Wayback Machine shows the page did exist in 2006:
>
> http://web.archive.org/web/20060703040511/http://www.openoffice.org/security/bulletin-20060629.html
>
> But it was broken already by 2009:
>
> http://web.archive.org/web/20091006090657/http://www.openoffice.org/security/bulletin-20060629.html
>
> So this is a pre-existing problem, and nothing we can do about it.
>
> Ughh. Obviously we cannot do this kind of research for every one of
> the 64 thousand links.
>
> But in other cases we can help. For example this link is giving a 404 error:
>
> http://www.openoffice.org/licenses/lgpl_license.html
>
> I think we removed that intentionally, since that is no longer our
> license.
> However, that link was used by many other websites,
> including university course materials looking at open source licenses,
> etc.: http://www.cs.utsa.edu/~bylander/cs1023/chapter8links.html
>
> So in cases like this we might want to restore the page. Do our part
> to help prevent bit rot and entropy from destroying the web.

Well this particular one I really AM not in favor of restoring to our
site. What we could do on this one, is put in a page with just a redirect
to where the actual license lives. (and yes, this is really one of the
most critical ones in my opinion)

>
> But to put it in perspective, although we have 64 thousand 404 errors
> on our website, we also have nearly 16 million incoming links that do
> not give errors.

Well that's a relief eh? :)

OK, I will have another look at this. At any rate, we absolutely should
put in place a generic "error.html" and have infra reconfigure
www.openoffice.org with THAT as our 404. That way we can assist folks in
dealing with link problems.

>
> -Rob
>
>>
>>>
>>> But we're redirecting security.openoffice.org to
>>> http://incubator.apache.org/openofficeorg/security.html
>>>
>>> So if there are outstanding URL's that are of the form
>>> security.openoffice.org/foo.html then they might be broken now.
>>
>>
>> see above...it's the actual placement of the bulletins within the tree
>> that's the problem I think
>>
>>
>>
>>>
>>>> So, to those of you using this tool, I may mark many of these as "fixed".
>>>> Of course they are not -- and they may show up again. Some of them only
>>>> show up in BZ issues!! (Google is amazingly thorough).
>>>>
>>>> I don't know how long it will take for them to "show up" again. The
>>>> problem is some of these are very very very old references, and not
>>>> likely we can do anything about at this point in time.
>>>> If you're not using this tool, you probably don't care about this.
>>>> If you are using it, and have another opinion before I start chunking
>>>> away at hiding these, please weigh in.
>>>>
>>>
>>> The way I understand it the links at the top of the list are the ones
>>> Google considers the most important. I think this is based on the
>>> number of links to that page. Maybe they factor in other things as
>>> well. So I'd recommend looking more at the top 100 or so broken
>>> links, make this a manageable task.
>>
>>
>> Well the problem is "how" to make it manageable... :(
>>
>>
>>>
>>> Or -- and here is a challenge for the algorithm experts -- maybe there
>>> is an easy way to take that entire list of 62,962 links and determine
>>> what the top base paths are that are broken.
>>
>>
>> if only this were so :( They're all over the place.
>>
>>
>>> In other words, if the links are:
>>>
>>> foo.openoffice.org/bar/baz1
>>> foo.openoffice.org/bar/baz2
>>> foo.openoffice.org/bar/baz2
>>> foo.openoffice.org/bar2/baz1
>>> foo2.openoffice.org/bar1/baz1
>>>
>>> Then this would tell us that foo.openoffice.org/bar/* was a top source
>>> of broken links. This might indicate important patterns of where the
>>> most broken links are.
>>>
>>> It seems like this could be done via a prefix tree (a "trie"):
>>> http://en.wikipedia.org/wiki/Trie
>>>
>>> Maybe other (simpler) ways as well.
>>
>>
>> I'll look at this article. It's a daunting task any way you look at it.
>>
>>>
>>> Regards,
>>
>>
>> What happens when things get moved a LOT with no regard for the end user.
>> Don't get me started on the ways I've had to deal with this in the past.
>>
>>
>>>
>>> -Rob
>>>
>>>>
>>>>
>>>> --
>>>>
>>>> ----------------------------------------------------------------------------------------
>>>> MzK
>>>>
>>>> "I'm just a normal jerk who happens to make music.
>>>> As long as my brain and fingers work, I'm cool."
>>>> -- Eddie Van Halen
>>
>>
>> --
>> ------------------------------------------------------------------------
>> MzK
>>
>> "I'm just a normal jerk who happens to make music.
>> As long as my brain and fingers work, I'm cool."
>> -- Eddie Van Halen
>>
>>

--
------------------------------------------------------------------------
MzK

"I'm just a normal jerk who happens to make music.
As long as my brain and fingers work, I'm cool."
-- Eddie Van Halen
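
P.S. On Rob's prefix-tree suggestion quoted above: here is a rough sketch of how the counting might look in Python. It assumes the 404 list can be exported from the Webmaster tool as plain URLs (with or without "http://"). The names PrefixNode, build_trie and top_prefixes are made up for this illustration and are not part of any existing tool. Treat it as a starting point, not a finished script.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class PrefixNode:
    """One trie node; count is how many broken URLs sit below this prefix."""

    def __init__(self):
        self.count = 0
        self.children = defaultdict(PrefixNode)


def build_trie(urls):
    """Insert each URL as host + path segments, counting every base path."""
    root = PrefixNode()
    for url in urls:
        # Entries without a scheme get "//" so urlsplit still finds the host.
        parts = urlsplit(url if "//" in url else "//" + url)
        segments = [parts.netloc] + [s for s in parts.path.split("/") if s]
        root.count += 1
        node = root
        # Skip the last segment (the page itself): only base paths are counted.
        for segment in segments[:-1]:
            node = node.children[segment]
            node.count += 1
    return root


def top_prefixes(root, limit=10):
    """Return (count, prefix) pairs, most frequently broken prefix first."""
    ranked = []
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        for segment, child in node.children.items():
            path = f"{prefix}/{segment}" if prefix else segment
            ranked.append((child.count, path))
            stack.append((path, child))
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked[:limit]


# Rob's five sample links from the quoted message.
links = [
    "foo.openoffice.org/bar/baz1",
    "foo.openoffice.org/bar/baz2",
    "foo.openoffice.org/bar/baz2",
    "foo.openoffice.org/bar2/baz1",
    "foo2.openoffice.org/bar1/baz1",
]
for count, prefix in top_prefixes(build_trie(links), limit=3):
    print(count, prefix)
```

On those five links the host foo.openoffice.org accounts for 4 entries and foo.openoffice.org/bar for 3, which matches the pattern Rob describes. Duplicates are counted as listed; whether to de-duplicate first is a judgment call I have left open.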