incubator-ooo-dev mailing list archives

From Rob Weir <robw...@apache.org>
Subject Re: investigation using Google Webmaster tools
Date Wed, 01 Aug 2012 23:29:46 GMT
On Wed, Aug 1, 2012 at 7:06 PM, Kay Schenk <kay.schenk@gmail.com> wrote:
> Hello all --
>
> I am exploring the www.openoffice.site using the Google Webmaster tool that
> Rob told us about on Jul 19.
>
> I am ONLY getting started by looking at the 62,962 404 errors (!!!!!)
>
> Many of these are links to VERY old docs which we no longer have -- like
> source trees for 1.0.1, 1.0.2 etc.--  or have to do with the OLD
> architecture -- servlet references etc.
>

If I understand this correctly, Google is looking at links on
webpages, not just our webpages, but also links from 3rd party
websites, and if they point to an openoffice.org page that doesn't
exist, it shows up on this list.   This could happen for any reason.
In some cases the original link might have had a typo.

> Some of these issues could be solved with rather extensive use of sym links
> (yes, you can actually use these in svn -- kind of) and of course some not
> -- many missing old security bulletins.
>

For the security bulletins, I wonder if this is actually a redirection
error.  We have many of them here:

http://www.openoffice.org/security/bulletin.html

But we're redirecting security.openoffice.org to
http://incubator.apache.org/openofficeorg/security.html

So if there are outstanding URLs that are of the form
security.openoffice.org/foo.html then they might be broken now.

> So, to those of you using this tool, I may mark many of these as "fixed".
> Of course they are not -- and they may show up again. Some of them only
> show up in BZ issues!! (Google is amazingly thorough).
>
> I don't know how long it will take for them to "show up" again. The problem
> is some of these are very very very old references, and not likely we can
> do anything about at this point in time.
> If you're not using this tool, you probably don't care about this. If you
> are using it, and have another opinion before I start chunking away at
> hiding these, please weigh in.
>

The way I understand it, the links at the top of the list are the ones
Google considers the most important.  I think this is based on the
number of links to that page.  Maybe they factor in other things as
well.  So I'd recommend looking first at the top 100 or so broken
links, to make this a manageable task.

Or -- and here is a challenge for the algorithm experts -- maybe there
is an easy way to take that entire list of 62,962 links and determine
what the top base paths are that are broken.  In other words, if the
links are:

foo.openoffice.org/bar/baz1
foo.openoffice.org/bar/baz2
foo.openoffice.org/bar/baz2
foo.openoffice.org/bar2/baz1
foo2.openoffice.org/bar1/baz1

Then this would tell us that foo.openoffice.org/bar/* was a top source
of broken links.  This might indicate important patterns of where the
most broken links are.

It seems like this could be done via a prefix tree (a "trie"):
http://en.wikipedia.org/wiki/Trie

Maybe other (simpler) ways as well.
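
As a rough sketch of one such simpler way, assuming Python and a plain
list of broken URLs exported from the tool: count every host/path
prefix in a flat table instead of building an explicit trie.  The
function name prefix_counts and the input list are made up for
illustration; the URLs are the example ones from above.

```python
from collections import Counter
from urllib.parse import urlsplit


def prefix_counts(urls):
    """Count how many broken URLs fall under each host/path prefix."""
    counts = Counter()
    for url in urls:
        # urlsplit only recognizes the host when a scheme or "//" is present.
        if "//" not in url:
            url = "//" + url
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        prefix = parts.netloc
        counts[prefix] += 1
        # Every directory level above the final segment counts as a prefix.
        for segment in segments[:-1]:
            prefix = prefix + "/" + segment
            counts[prefix] += 1
    return counts


# Hypothetical input: the example links from this message.
broken = [
    "foo.openoffice.org/bar/baz1",
    "foo.openoffice.org/bar/baz2",
    "foo.openoffice.org/bar/baz2",
    "foo.openoffice.org/bar2/baz1",
    "foo2.openoffice.org/bar1/baz1",
]

counts = prefix_counts(broken)
for prefix, n in counts.most_common(2):
    print(n, prefix + "/*")
```

On the example list this counts 4 links under foo.openoffice.org/* and
3 under foo.openoffice.org/bar/*.  Duplicate links are counted each
time they appear; whether to de-duplicate first is a judgment call.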

Regards,

-Rob

>
>
> --
> ----------------------------------------------------------------------------------------
> MzK
>
> "I'm just a normal jerk who happens to make music.
>  As long as my brain and fingers work, I'm cool."
>                               -- Eddie Van Halen
