incubator-ooo-dev mailing list archives

From Roberto Galoppini <rgalopp...@geek.net>
Subject Re: investigation using Google Webmaster tools
Date Fri, 03 Aug 2012 15:49:25 GMT
On Thu, Aug 2, 2012 at 4:45 PM, Rob Weir <robweir@apache.org> wrote:
> On Wed, Aug 1, 2012 at 7:45 PM, Kay Schenk <kay.schenk@gmail.com> wrote:
>>
>>
>> On 08/01/2012 04:29 PM, Rob Weir wrote:
>>>
>>> On Wed, Aug 1, 2012 at 7:06 PM, Kay Schenk <kay.schenk@gmail.com> wrote:
>>>>
>>>> Hello all --
>>>>
>>>> I am exploring the www.openoffice.org site using the Google Webmaster tool
>>>> that
>>>> Rob told us about on Jul 19.
>>>>
>>>> I am ONLY getting started by looking at the 62,962 404 errors (!!!!!)
>>>>
>>>> Many of these are links to VERY old docs which we no longer have -- like
>>>> source trees for 1.0.1, 1.0.2 etc.--  or have to do with the OLD
>>>> architecture -- servlet references etc.
>>>>
>>>
>>> If I understand this correctly, Google is looking at links on
>>> webpages, not just our webpages, but also links from 3rd party
>>> websites, and if they point to an openoffice.org page that doesn't
>>> exist, it shows up on this list.   This could happen for any reason.
>>> In some cases the original link might have had a typo.
>>
>>
>> yes, this is correct, and you are right about this too...some of the 404s
>> reference pages we probably NEVER had.
>>
>>
>>>
>>>> Some of these issues could be solved with rather extensive use of sym
>>>> links
>>>> (yes, you can actually use these in svn -- kind of) and of course some
>>>> not
>>>> -- many missing old security bulletins.
>>>>
>>>
>>> For the security bulletins, I wonder if this is actually a redirection
>>> error.  We have many of them here:
>>>
>>> http://www.openoffice.org/security/bulletin.html
>>
>>
>> ah...yes, they are there...the problem is we would need to construct a LOT
>> of just "redirect" pages to right some of these since they all seem to have
>> the form
>>
>> "/security/cvs-bulletin-number".html
>>
>
> So let's take a specific example.
>
> Google is reporting a 404 error for this URL:
> http://www.openoffice.org/security/bulletin-20060629.html
>
> It is linked to from at least 10 external web pages, for example
> the last link in this table:
>
> http://www.ccip.govt.nz/vulnerability-alerts/archives/2006/AlertArchive0607.html
>
> (Whoops, make that at least 12 links, since the Apache and MarkMail
> list archives will now link to this)
>
> There is no file of this name in
> https://svn.apache.org/repos/asf/incubator/ooo/ooo-site/trunk/content/security/
>
> Looking at the svn log I don't see any record of the files ever being here.
>
> I searched the complete ooo-site tree and this file
> (bulletin-20060629.html) doesn't exist anywhere.
>
> The Wayback Machine shows the page did exist in 2006:
>
> http://web.archive.org/web/20060703040511/http://www.openoffice.org/security/bulletin-20060629.html
>
> But it was broken already by 2009:
>
> http://web.archive.org/web/20091006090657/http://www.openoffice.org/security/bulletin-20060629.html
>
> So this is a pre-existing problem, and nothing we can do about it.
>
> Ughh.   Obviously we cannot do this kind of research for every one of
> the 64 thousand links.
>
> But in other cases we can help.  For example this link is giving 404 error:
>
> http://www.openoffice.org/licenses/lgpl_license.html
>
> I think we removed that intentionally, since that is no longer our
> license.  However, that link was used by many other websites,
> including university course materials looking at open source licenses,
> etc.:   http://www.cs.utsa.edu/~bylander/cs1023/chapter8links.html
>
> So in cases like this we might want to restore the page.  Do our part
> to help prevent bit rot and entropy from destroying the web.
>
> But to put it in perspective, although we have 64 thousand 404 errors
> on our website, we also have nearly 16 million incoming links that do
> not give errors.

Given our rank, I'd assume those 64k 404 errors don't hurt our site's
popularity, since we also have 16 million working links. That said, we
might consider restoring pages like that one, adding a note about
the license change.

Roberto

> -Rob
>
>>
>>>
>>> But we're redirecting security.openoffice.org to
>>> http://incubator.apache.org/openofficeorg/security.html
>>>
>>> So if there are outstanding URLs that are of the form
>>> security.openoffice.org/foo.html then they might be broken now.
>>
>>
>> see above...it's the actual placement of the bulletins within the tree
>> that's the problem I think
>>
>>
>>
>>>
>>>> So, to those of you using this tool, I may mark many of these as "fixed".
>>>> Of course they are not -- and they may show up again. Some of them only
>>>> show up in BZ issues!! (Google is amazingly thorough).
>>>>
>>>> I don't know how long it will take for them to "show up" again. The
>>>> problem
>>>> is some of these are very very very old references, and not likely we can
>>>> do anything about at this point in time.
>>>> If you're not using this tool, you probably don't care about this. If you
>>>> are using it, and have another opinion before I start chunking away at
>>>> hiding these, please weigh in.
>>>>
>>>
>>> The way I understand it the links at the top of the list are the ones
>>> Google considers the most important.  I think this is based on the
>>> number of links to that page.  Maybe they factor in other things as
>>> well.  So I'd recommend looking more at the top 100 or so broken
>>> links, make this a manageable task.
>>
>>
>> Well the problem is "how" to make it manageable... :(
>>
>>
>>>
>>> Or -- and here is a challenge for the algorithm experts -- maybe there
>>> is an easy way to take that entire list of 62,962 links and determine
>>> what the top base paths are that are broken.
>>
>>
>> if only this were so :( They're all over the place.
>>
>>
>>>
>>> In other words, if the links are:
>>>
>>> foo.openoffice.org/bar/baz1
>>> foo.openoffice.org/bar/baz2
>>> foo.openoffice.org/bar/baz2
>>> foo.openoffice.org/bar2/baz1
>>> foo2.openoffice.org/bar1/baz1
>>>
>>> Then this would tell us that foo.openoffice.org/bar/* was a top source
>>> of broken links.  This might indicate important patterns of where the
>>> most broken links are.
>>>
>>> It seems like this could be done via a prefix tree (a "trie"):
>>> http://en.wikipedia.org/wiki/Trie
>>>
>>> Maybe other (simpler) ways as well.
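
A minimal sketch of the prefix counting Rob describes above, in Python.
It uses a flat counter keyed by host and directory prefix rather than an
explicit trie; the totals per prefix are the same ones a trie node would
hold. The names prefix_counts and broken are made up for illustration,
and the input is assumed to be a plain list of URLs, with or without a
scheme, as in Rob's example.

```python
from collections import Counter
from urllib.parse import urlsplit


def prefix_counts(urls):
    """Count how many URLs fall under each host/directory prefix."""
    counts = Counter()
    for url in urls:
        # urlsplit only recognises the host when a scheme is present.
        parts = urlsplit(url if "://" in url else "http://" + url)
        segments = [s for s in parts.path.split("/") if s]
        prefix = parts.netloc
        counts[prefix + "/*"] += 1
        # Walk the directory segments, leaving out the final file name.
        for segment in segments[:-1]:
            prefix = prefix + "/" + segment
            counts[prefix + "/*"] += 1
    return counts


# Hypothetical input: the five sample links from the message above.
broken = [
    "foo.openoffice.org/bar/baz1",
    "foo.openoffice.org/bar/baz2",
    "foo.openoffice.org/bar/baz2",
    "foo.openoffice.org/bar2/baz1",
    "foo2.openoffice.org/bar1/baz1",
]

for prefix, n in prefix_counts(broken).most_common(2):
    print(n, prefix)
```

On the five sample links this ranks foo.openoffice.org/* first with 4
and foo.openoffice.org/bar/* second with 3. Repeated entries are counted
each time they appear, so the list may need de-duplicating first if that
is not wanted.
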
>>
>>
>> I'll look at this article. It's a daunting task any way you look at it.
>>
>>>
>>> Regards,
>>
>>
>> What happens when things get moved a LOT with no regard for the end user.
>> Don't get me started on the ways I've had to deal with this in the past.
>>
>>
>>>
>>> -Rob
>>>
>>>>
>>>>
>>>> --
>>>>
>>>> ----------------------------------------------------------------------------------------
>>>> MzK
>>>>
>>>> "I'm just a normal jerk who happens to make music.
>>>>   As long as my brain and fingers work, I'm cool."
>>>>                                -- Eddie Van Halen


