incubator-ooo-dev mailing list archives

From Kay Schenk <kay.sch...@gmail.com>
Subject Re: investigation using Google Webmaster tools
Date Fri, 03 Aug 2012 16:13:29 GMT


On 08/02/2012 07:45 AM, Rob Weir wrote:
> On Wed, Aug 1, 2012 at 7:45 PM, Kay Schenk <kay.schenk@gmail.com> wrote:
>>
>>
>> On 08/01/2012 04:29 PM, Rob Weir wrote:
>>>
>>> On Wed, Aug 1, 2012 at 7:06 PM, Kay Schenk <kay.schenk@gmail.com> wrote:
>>>>
>>>> Hello all --
>>>>
>>>> I am exploring the www.openoffice.org site using the Google Webmaster
>>>> tool that Rob told us about on Jul 19.
>>>>
>>>> I am ONLY getting started by looking at the 62,962 404 errors (!!!!!)
>>>>
>>>> Many of these are links to VERY old docs which we no longer have -- like
>>>> source trees for 1.0.1, 1.0.2 etc.--  or have to do with the OLD
>>>> architecture -- servlet references etc.
>>>>
>>>
>>> If I understand this correctly, Google is looking at links on
>>> webpages, not just our webpages, but also links from 3rd party
>>> websites, and if they point to an openoffice.org page that doesn't
>>> exist, it shows up on this list.   This could happen for any reason.
>>> In some cases the original link might have had a typo.
>>
>>
>> yes, this is correct, and you are right about this too...some of the 404s
>> reference pages we probably NEVER had.
>>
>>
>>>
>>>> Some of these issues could be solved with rather extensive use of sym
>>>> links (yes, you can actually use these in svn -- kind of) and of course
>>>> some not -- many missing old security bulletins.
>>>>
>>>
>>> For the security bulletins, I wonder if this is actually a redirection
>>> error.  We have many of them here:
>>>
>>> http://www.openoffice.org/security/bulletin.html
>>
>>
>> ah...yes, they are there...the problem is we would need to construct a LOT
>> of just "redirect" pages to right some of these since they all seem to have
>> the form
>>
>> "/security/cvs-bulletin-number".html
>>
>
> So let's take a specific example.
>
> Google is reporting a 404 error for this URL:
> http://www.openoffice.org/security/bulletin-20060629.html
>
> It is linked to from at least 10 external web pages, for example
> the last link in this table:
>
> http://www.ccip.govt.nz/vulnerability-alerts/archives/2006/AlertArchive0607.html
>
> (Whoops, make that at least 12 links, since the Apache and MarkMail
> list archives will now link to this)
>
> There is no file of this name in
> https://svn.apache.org/repos/asf/incubator/ooo/ooo-site/trunk/content/security/
>
> Looking at the svn log I don't see any record of the files ever being here.
>
> I searched the complete ooo-site tree and this file
> (bulletin-20060629.html) doesn't exist anywhere.
>
> The Wayback Machine shows the page did exist in 2006:
>
> http://web.archive.org/web/20060703040511/http://www.openoffice.org/security/bulletin-20060629.html
>
> But it was broken already by 2009:
>
> http://web.archive.org/web/20091006090657/http://www.openoffice.org/security/bulletin-20060629.html
>
> So this is a pre-existing problem, and nothing we can do about it.
>
> Ughh.   Obviously we cannot do this kind of research for every one of
> the 64 thousand links.
>
> But in other cases we can help.  For example this link is giving 404 error:
>
> http://www.openoffice.org/licenses/lgpl_license.html
>
> I think we removed that intentionally, since that is no longer our
> license.  However, that link was used by many other websites,
> including university course materials looking at open source licenses,
> etc.:   http://www.cs.utsa.edu/~bylander/cs1023/chapter8links.html
>
> So in cases like this we might want to restore the page.  Do our part
> to help prevent bit rot and entropy from destroying the web.

Well this particular one I really AM not in favor of restoring to our 
site. What we could do on this one, is put in a page with just a 
redirect to where the actual license lives. (and yes, this is really one 
of the most critical ones in my opinion)
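
A redirect-only page of the kind described above could be generated with a few lines of Python. This is only a sketch: the function name `redirect_page` and the target URL are placeholders, since the thread does not say where the license text actually lives, and a server-side redirect configured by infra may well be preferable to a meta refresh.

```python
from html import escape


def redirect_page(target_url):
    """Return a minimal HTML page that forwards visitors to target_url."""
    # Escape the URL so characters such as & and " are safe inside attributes.
    safe = escape(target_url, quote=True)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f'<meta http-equiv="refresh" content="0; url={safe}">\n'
        f'<link rel="canonical" href="{safe}">\n'
        "<title>Page moved</title>\n"
        "</head>\n<body>\n"
        f'<p>This page has moved to <a href="{safe}">{safe}</a>.</p>\n'
        "</body>\n</html>\n"
    )


# Placeholder target: the real location of the license is an assumption here.
print(redirect_page("https://example.org/licenses/lgpl.html"))
```

The visible link in the body is a fallback for clients that ignore the refresh.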

>
> But to put it in perspective, although we have 64 thousand 404 errors
> on our website, we also have nearly 16 million incoming links that do
> not give errors.

Well that's a relief eh? :)

OK, I will have another look at this. At any rate, we absolutely should 
put in place a generic "error.html" and have infra reconfigure 
www.openoffice.org with THAT as our 404. That way we can assist folks in 
dealing with link problems.


>
> -Rob
>
>>
>>>
>>> But we're redirecting security.openoffice.org to
>>> http://incubator.apache.org/openofficeorg/security.html
>>>
>>> So if there are outstanding URL's that are of the form
>>> security.openoffice.org/foo.html then they might be broken now.
>>
>>
>> see above...it's the actual placement of the bulletins within the tree
>> that's the problem I think
>>
>>
>>
>>>
>>>> So, to those of you using this tool, I may mark many of these as "fixed".
>>>> Of course they are not -- and they may show up again. Some of them only
>>>> show up in BZ issues!! (Google is amazingly thorough).
>>>>
>>>> I don't know how long it will take for them to "show up" again. The
>>>> problem is some of these are very very very old references, and not
>>>> likely we can do anything about at this point in time.
>>>> If you're not using this tool, you probably don't care about this. If you
>>>> are using it, and have another opinion before I start chunking away at
>>>> hiding these, please weigh in.
>>>>
>>>
>>> The way I understand it the links at the top of the list are the ones
>>> Google considers the most important.  I think this is based on the
>>> number of links to that page.  Maybe they factor in other things as
>>> well.  So I'd recommend looking more at the top 100 or so broken
>>> links, make this a manageable task.
>>
>>
>> Well the problem is "how" to make it manageable... :(
>>
>>
>>>
>>> Or -- and here is a challenge for the algorithm experts -- maybe there
>>> is an easy way to take that entire list of 62,962 links and determine
>>> what the top base paths are that are broken.
>>
>>
>> if only this were so :( They're all over the place.
>>
>>
>>> In other words, if the links are:
>>>
>>> foo.openoffice.org/bar/baz1
>>> foo.openoffice.org/bar/baz2
>>> foo.openoffice.org/bar/baz2
>>> foo.openoffice.org/bar2/baz1
>>> foo2.openoffice.org/bar1/baz1
>>>
>>> Then this would tell us that foo.openoffice.org/bar/* was a top source
>>> of broken links.  This might indicate important patterns of where the
>>> most broken links are.
>>>
>>> It seems like this could be done via a prefix tree (a "trie"):
>>> http://en.wikipedia.org/wiki/Trie
>>>
>>> Maybe other (simpler) ways as well.
>>
>>
>> I'll look at this article. It's a daunting task any way you look at it.
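
One of the simpler ways might look like the sketch below. It does not build an explicit trie; it just counts every host/path prefix with a Counter, which gives the same per-prefix totals. The function name `prefix_counts` is made up for illustration, and the input is assumed to be a plain list of URLs exported from the tool, using the example links quoted above.

```python
from collections import Counter
from urllib.parse import urlsplit


def prefix_counts(urls):
    """Count how many broken URLs fall under each host/path prefix."""
    counts = Counter()
    for url in urls:
        # urlsplit only fills in the host when a scheme or "//" is present.
        parts = urlsplit(url if "//" in url else "//" + url)
        segments = [s for s in parts.path.split("/") if s]
        prefix = parts.netloc
        counts[prefix] += 1
        # Walk the directory segments, skipping the final (leaf) segment.
        for segment in segments[:-1]:
            prefix = prefix + "/" + segment
            counts[prefix] += 1
    return counts


broken = [
    "foo.openoffice.org/bar/baz1",
    "foo.openoffice.org/bar/baz2",
    "foo.openoffice.org/bar/baz2",
    "foo.openoffice.org/bar2/baz1",
    "foo2.openoffice.org/bar1/baz1",
]

# Print the prefixes that account for the most broken links.
for prefix, n in prefix_counts(broken).most_common(3):
    print(n, prefix)
```

For a list of roughly 63 thousand links this should run in well under a second, so sorting the counts and reading the top of the list may be enough to spot the worst directories.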
>>
>>>
>>> Regards,
>>
>>
>> What happens when things get moved a LOT with no regard for the end user.
>> Don't get me started on the ways I've had to deal with this in the past.
>>
>>
>>>
>>> -Rob
>>>
>>>>
>>>>
>>>> --
>>>>
>>>> ----------------------------------------------------------------------------------------
>>>> MzK
>>>>
>>>> "I'm just a normal jerk who happens to make music.
>>>>    As long as my brain and fingers work, I'm cool."
>>>>                                 -- Eddie Van Halen
>>
>>
>>
>>

-- 
------------------------------------------------------------------------
MzK

"I'm just a normal jerk who happens to make music.
  As long as my brain and fingers work, I'm cool."
                               -- Eddie Van Halen


