incubator-ooo-dev mailing list archives

From Rob Weir <robw...@apache.org>
Subject Re: investigation using Google Webmaster tools
Date Fri, 03 Aug 2012 16:29:44 GMT
On Fri, Aug 3, 2012 at 12:13 PM, Kay Schenk <kay.schenk@gmail.com> wrote:
>
>
> On 08/02/2012 07:45 AM, Rob Weir wrote:
>>
>> On Wed, Aug 1, 2012 at 7:45 PM, Kay Schenk <kay.schenk@gmail.com> wrote:
>>>
>>>
>>>
>>> On 08/01/2012 04:29 PM, Rob Weir wrote:
>>>>
>>>>
>>>> On Wed, Aug 1, 2012 at 7:06 PM, Kay Schenk <kay.schenk@gmail.com> wrote:
>>>>>
>>>>>
>>>>> Hello all --
>>>>>
>>>>> I am exploring the www.openoffice.org site using the Google Webmaster tool
>>>>> that
>>>>> Rob told us about on Jul 19.
>>>>>
>>>>> I am ONLY getting started by looking at the 62,962 404 errors (!!!!!)
>>>>>
>>>>> Many of these are links to VERY old docs which we no longer have --
>>>>> like
>>>>> source trees for 1.0.1, 1.0.2 etc.--  or have to do with the OLD
>>>>> architecture -- servlet references etc.
>>>>>
>>>>
>>>> If I understand this correctly, Google is looking at links on
>>>> webpages, not just our webpages, but also links from 3rd party
>>>> websites, and if they point to an openoffice.org page that doesn't
>>>> exist, it shows up on this list.   This could happen for any reason.
>>>> In some cases the original link might have had a typo.
>>>
>>>
>>>
>>> yes, this is correct, and you are right about this too...some of the 404s
>>> reference pages we probably NEVER had.
>>>
>>>
>>>>
>>>>> Some of these issues could be solved with rather extensive use of sym
>>>>> links
>>>>> (yes, you can actually use these in svn -- kind of) and of course some
>>>>> not
>>>>> -- many missing old security bulletins.
>>>>>
>>>>
>>>> For the security bulletins, I wonder if this is actually a redirection
>>>> error.  We have many of them here:
>>>>
>>>> http://www.openoffice.org/security/bulletin.html
>>>
>>>
>>>
>>> ah...yes, they are there...the problem is we would need to construct a
>>> LOT
>>> of just "redirect" pages to right some of these since they all seem to
>>> have
>>> the form
>>>
>>> "/security/cvs-bulletin-number".html
>>>
>>
>> So let's take a specific example.
>>
>> Google is reporting a 404 error for this URL:
>> http://www.openoffice.org/security/bulletin-20060629.html
>>
>> It is linked to from at least 10 external web pages, for example
>> the last link in this table:
>>
>>
>> http://www.ccip.govt.nz/vulnerability-alerts/archives/2006/AlertArchive0607.html
>>
>> (Whoops, make that at least 12 links, since the Apache and MarkMail
>> list archives will now link to this)
>>
>> There is no file of this name in
>>
>> https://svn.apache.org/repos/asf/incubator/ooo/ooo-site/trunk/content/security/
>>
>> Looking at the svn log I don't see any record of the files ever being
>> here.
>>
>> I searched the complete ooo-site tree and this file
>> (bulletin-20060629.html) doesn't exist anywhere.
>>
>> The Wayback Machine shows the page did exist in 2006:
>>
>>
>> http://web.archive.org/web/20060703040511/http://www.openoffice.org/security/bulletin-20060629.html
>>
>> But it was broken already by 2009:
>>
>>
>> http://web.archive.org/web/20091006090657/http://www.openoffice.org/security/bulletin-20060629.html
>>
>> So this is a pre-existing problem, and nothing we can do about it.
>>
>> Ughh.   Obviously we cannot do this kind of research for every one of
>> the 64 thousand links.
>>
>> But in other cases we can help.  For example this link is giving 404
>> error:
>>
>> http://www.openoffice.org/licenses/lgpl_license.html
>>
>> I think we removed that intentionally, since that is no longer our
>> license.  However, that link was used by many other websites,
>> including university course materials looking at open source licenses,
>> etc.:   http://www.cs.utsa.edu/~bylander/cs1023/chapter8links.html
>>
>> So in cases like this we might want to restore the page.  Do our part
>> to help prevent bit rot and entropy from destroying the web.
>
>
> Well this particular one I really AM not in favor of restoring to our site.
> What we could do on this one, is put in a page with just a redirect to where
> the actual license lives. (and yes, this is really one of the most critical
> ones in my opinion)
>

That would be fine, a page at that URL that says our license has
changed, and that the LGPL can be found at the Free Software
Foundation website, and link to that.  Everyone's happy then.

>
>>
>> But to put it in perspective, although we have 64 thousand 404 errors
>> on our website, we also have nearly 16 million incoming links that do
>> not give errors.
>
>
> Well that's a relief eh? :)
>
> OK, I will have another look at this. At any rate, we absolutely should put
> in place a generic "error.html" and have infra reconfigure
> www.openoffice.org with THAT as our 404. That way we can assist folks in
> dealing with link problems.
>

The nice thing about a custom error page is we can also put a Google
custom search box there, to let the user do a site-wide search to try
to find their answer that way.

-Rob

>
>
>>
>> -Rob
>>
>>>
>>>>
>>>> But we're redirecting security.openoffice.org to
>>>> http://incubator.apache.org/openofficeorg/security.html
>>>>
>>>> So if there are outstanding URL's that are of the form
>>>> security.openoffice.org/foo.html then they might be broken now.
>>>
>>>
>>>
>>> see above...it's the actual placement of the bulletins within the tree
>>> that's the problem I think
>>>
>>>
>>>
>>>>
>>>>> So, to those of you using this tool, I may mark many of these as
>>>>> "fixed".
>>>>> Of course they are not -- and they may show up again. Some of them only
>>>>> show up in BZ issues!! (Google is amazingly thorough).
>>>>>
>>>>> I don't know how long it will take for them to "show up" again. The
>>>>> problem
>>>>> is some of these are very very very old references, and not likely we
>>>>> can
>>>>> do anything about at this point in time.
>>>>> If you're not using this tool, you probably don't care about this. If
>>>>> you
>>>>> are using it, and have another opinion before I start chunking away at
>>>>> hiding these, please weigh in.
>>>>>
>>>>
>>>> The way I understand it the links at the top of the list are the ones
>>>> Google considers the most important.  I think this is based on the
>>>> number of links to that page.  Maybe they factor in other things as
>>>> well.  So I'd recommend looking more at the top 100 or so broken
>>>> links, make this a manageable task.
>>>
>>>
>>>
>>> Well the problem is "how" to make it manageable... :(
>>>
>>>
>>>>
>>>> Or -- and here is a challenge for the algorithm experts -- maybe there
>>>> is an easy way to take that entire list of 62,962 links and determine
>>>> what the top base paths are that are broken.
>>>
>>>
>>>
>>> if only this were so :( They're all over the place.
>>>
>>>
>>>   In other words, if the
>>>>
>>>>
>>>> links are:
>>>>
>>>> foo.openoffice.org/bar/baz1
>>>> foo.openoffice.org/bar/baz2
>>>> foo.openoffice.org/bar/baz2
>>>> foo.openoffice.org/bar2/baz1
>>>> foo2.openoffice.org/bar1/baz1
>>>>
>>>> Then this would tell us that foo.openoffice.org/bar/* was a top source
>>>> of broken links.  This might indicate important patterns of where the
>>>> most broken links are.
>>>>
>>>> It seems like this could be done via a prefix tree (a "trie"):
>>>> http://en.wikipedia.org/wiki/Trie
>>>>
>>>> Maybe other (simpler) ways as well.
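
One of the simpler ways might look like the sketch below. It is not a
trie: it counts every host/directory prefix of each broken URL in a flat
counter, which gives the same per-prefix totals for a list of this size.
The function name prefix_counts and the sample list are illustrative
assumptions, and it assumes the broken links are available as plain URL
strings, one per entry. Repeated links are counted each time they appear.

```python
from collections import Counter
from urllib.parse import urlsplit


def prefix_counts(urls):
    """Count how many broken URLs fall under each host/directory prefix."""
    counts = Counter()
    for url in urls:
        # Add a "//" marker when the scheme is missing so urlsplit finds the host.
        parts = urlsplit(url if "://" in url else "//" + url)
        segments = [s for s in parts.path.split("/") if s]
        prefix = parts.netloc
        counts[prefix] += 1
        # Every directory above the final segment is a prefix of this URL.
        for segment in segments[:-1]:
            prefix = prefix + "/" + segment
            counts[prefix] += 1
    return counts


if __name__ == "__main__":
    # Sample links taken from the example in this thread.
    links = [
        "foo.openoffice.org/bar/baz1",
        "foo.openoffice.org/bar/baz2",
        "foo.openoffice.org/bar/baz2",
        "foo.openoffice.org/bar2/baz1",
        "foo2.openoffice.org/bar1/baz1",
    ]
    for prefix, count in prefix_counts(links).most_common(2):
        print(prefix, count)
```

For the sample list this ranks foo.openoffice.org first (4 links) and
foo.openoffice.org/bar second (3 links), which is the pattern described
above. Sorting by count over the full list would surface the directories
that account for the most 404s.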
>>>
>>>
>>>
>>> I'll look at this article. It's a daunting task any way you look at it.
>>>
>>>>
>>>> Regards,
>>>
>>>
>>>
>>> What happens when things get moved a LOT with no regard for the end user.
>>> Don't get me started on the ways I've had to deal with this in the past.
>>>
>>>
>>>>
>>>> -Rob
>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>> ----------------------------------------------------------------------------------------
>>>>> MzK
>>>>>
>>>>> "I'm just a normal jerk who happens to make music.
>>>>>    As long as my brain and fingers work, I'm cool."
>>>>>                                 -- Eddie Van Halen
>>>
>>>
>>>
>>>
>>>
>
>
>
