incubator-ooo-dev mailing list archives

From Kay Schenk <kay.sch...@gmail.com>
Subject Re: investigation using Google Webmaster tools
Date Fri, 03 Aug 2012 19:52:07 GMT
On Fri, Aug 3, 2012 at 9:29 AM, Rob Weir <robweir@apache.org> wrote:

> On Fri, Aug 3, 2012 at 12:13 PM, Kay Schenk <kay.schenk@gmail.com> wrote:
> >
> >
> > On 08/02/2012 07:45 AM, Rob Weir wrote:
> >>
> >> On Wed, Aug 1, 2012 at 7:45 PM, Kay Schenk <kay.schenk@gmail.com> wrote:
> >>>
> >>>
> >>>
> >>> On 08/01/2012 04:29 PM, Rob Weir wrote:
> >>>>
> >>>>
> >>>> On Wed, Aug 1, 2012 at 7:06 PM, Kay Schenk <kay.schenk@gmail.com> wrote:
> >>>>>
> >>>>>
> >>>>> Hello all --
> >>>>>
> >>>>> I am exploring the www.openoffice.org site using the Google Webmaster
> >>>>> tool that Rob told us about on Jul 19.
> >>>>>
> >>>>> I am ONLY getting started by looking at the 62,962 404 errors (!!!!!)
> >>>>>
> >>>>> Many of these are links to VERY old docs which we no longer have --
> >>>>> like source trees for 1.0.1, 1.0.2 etc. -- or have to do with the OLD
> >>>>> architecture -- servlet references etc.
> >>>>>
> >>>>
> >>>> If I understand this correctly, Google is looking at links on
> >>>> webpages, not just our webpages, but also links from 3rd party
> >>>> websites, and if they point to an openoffice.org page that doesn't
> >>>> exist, it shows up on this list.   This could happen for any reason.
> >>>> In some cases the original link might have had a typo.
> >>>
> >>>
> >>>
> >>> yes, this is correct, and you are right about this too...some of the 404s
> >>> reference pages we probably NEVER had.
> >>>
> >>>
> >>>>
> >>>>> Some of these issues could be solved with rather extensive use of sym
> >>>>> links (yes, you can actually use these in svn -- kind of) and of course
> >>>>> some not -- many missing old security bulletins.
> >>>>>
> >>>>
> >>>> For the security bulletins, I wonder if this is actually a redirection
> >>>> error.  We have many of them here:
> >>>>
> >>>> http://www.openoffice.org/security/bulletin.html
> >>>
> >>>
> >>>
> >>> ah...yes, they are there...the problem is we would need to construct a LOT
> >>> of just "redirect" pages to right some of these since they all seem to have
> >>> the form
> >>>
> >>> "/security/cvs-bulletin-number".html
> >>>
> >>
> >> So let's take a specific example.
> >>
> >> Google is reporting a 404 error for this URL:
> >> http://www.openoffice.org/security/bulletin-20060629.html
> >>
> >> It is linked to from at least 10 external web pages, for example
> >> the last link in this table:
> >>
> >> http://www.ccip.govt.nz/vulnerability-alerts/archives/2006/AlertArchive0607.html
> >>
> >> (Whoops, make that at least 12 links, since the Apache and MarkMail
> >> list archives will now link to this)
> >>
> >> There is no file of this name in
> >>
> >> https://svn.apache.org/repos/asf/incubator/ooo/ooo-site/trunk/content/security/
> >>
> >> Looking at the svn log I don't see any record of the files ever being
> >> here.
> >>
> >> I searched the complete ooo-site tree and this file
> >> (bulletin-20060629.html) doesn't exist anywhere.
> >>
> >> The Wayback Machine shows the page did exist in 2006:
> >>
> >> http://web.archive.org/web/20060703040511/http://www.openoffice.org/security/bulletin-20060629.html
> >>
> >> But it was broken already by 2009:
> >>
> >> http://web.archive.org/web/20091006090657/http://www.openoffice.org/security/bulletin-20060629.html
> >>
> >> So this is a pre-existing problem, and nothing we can do about it.
> >>
> >> Ughh.   Obviously we cannot do this kind of research for every one of
> >> the 64 thousand links.
> >>
> >> But in other cases we can help.  For example this link is giving 404
> >> error:
> >>
> >> http://www.openoffice.org/licenses/lgpl_license.html
> >>
> >> I think we removed that intentionally, since that is no longer our
> >> license.  However, that link was used by many other websites,
> >> including university course materials looking at open source licenses,
> >> etc.:   http://www.cs.utsa.edu/~bylander/cs1023/chapter8links.html
> >>
> >> So in cases like this we might want to restore the page.  Do our part
> >> to help prevent bit rot and entropy from destroying the web.
> >
> >
> > Well this particular one I really AM not in favor of restoring to our site.
> > What we could do on this one is put in a page with just a redirect to where
> > the actual license lives. (and yes, this is really one of the most critical
> > ones in my opinion)
> >
>
> That would be fine, a page at that URL that says our license has
> changed, and that the LGPL can be found at the Free Software
> Foundation website, and link to that.  Everyone's happy then.
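[A sketch of the stub page described above: a short notice that the license has changed, a link to where the LGPL text lives, and a delayed redirect. This is only an illustration. The function name render_moved_page, the wording of the notice, the five-second delay, and the target URL are all assumptions, not anything that exists in the ooo-site tree.]

```python
from html import escape


def render_moved_page(title, message, target_url, delay_seconds=5):
    """Return a small static HTML page that explains a move and links to target_url.

    Hypothetical helper for illustration; not part of any existing site tooling.
    """
    safe_title = escape(title)
    safe_message = escape(message)
    safe_url = escape(target_url, quote=True)
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        "<head>\n"
        '  <meta charset="utf-8">\n'
        f"  <title>{safe_title}</title>\n"
        # Delayed client-side redirect, so a visitor has time to read the notice.
        f'  <meta http-equiv="refresh" content="{delay_seconds}; url={safe_url}">\n'
        "</head>\n"
        "<body>\n"
        f"  <p>{safe_message}</p>\n"
        f'  <p><a href="{safe_url}">{safe_url}</a></p>\n'
        "</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    # The target URL below is an assumed placeholder; use wherever the
    # license text actually lives.
    print(render_moved_page(
        title="License page moved",
        message="Our license has changed. The LGPL text can be found at the link below.",
        target_url="https://www.gnu.org/licenses/lgpl.html",
    ))
```

[The output could be saved at the old path (licenses/lgpl_license.html) so existing inbound links land on something useful. A server-side redirect would be another option, but that depends on what infra supports.]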
>
> >
> >>
> >> But to put it in perspective, although we have 64 thousand 404 errors
> >> on our website, we also have nearly 16 million incoming links that do
> >> not give errors.
> >
> >
> > Well that's a relief eh? :)
> >
> > OK, I will have another look at this. At any rate, we absolutely should put
> > in place a generic "error.html" and have infra reconfigure
> > www.openoffice.org with THAT as our 404. That way we can assist folks in
> > dealing with link problems.
> >
>
> The nice thing about a custom error page is we can also put a Google
> custom search box there, to let the user do a site-wide search to try
> to find their answer that way.
>
> -Rob
>

EXACTLY! And that's just what was done when I've been in other environments
and come up against this.


>
> >
> >
> >>
> >> -Rob
> >>
> >>>
> >>>>
> >>>> But we're redirecting security.openoffice.org to
> >>>> http://incubator.apache.org/openofficeorg/security.html
> >>>>
> >>>> So if there are outstanding URL's that are of the form
> >>>> security.openoffice.org/foo.html then they might be broken now.
> >>>
> >>>
> >>>
> >>> see above...it's the actual placement of the bulletins within the tree
> >>> that's the problem I think
> >>>
> >>>
> >>>
> >>>>
> >>>>> So, to those of you using this tool, I may mark many of these as
> >>>>> "fixed".
> >>>>> Of course they are not -- and they may show up again. Some of them only
> >>>>> show up in BZ issues!! (Google is amazingly thorough).
> >>>>>
> >>>>> I don't know how long it will take for them to "show up" again. The
> >>>>> problem is some of these are very very very old references, and it is not
> >>>>> likely we can do anything about them at this point in time.
> >>>>> If you're not using this tool, you probably don't care about this. If
> >>>>> you are using it, and have another opinion before I start chunking away at
> >>>>> hiding these, please weigh in.
> >>>>>
> >>>>
> >>>> The way I understand it the links at the top of the list are the ones
> >>>> Google considers the most important.  I think this is based on the
> >>>> number of links to that page.  Maybe they factor in other things as
> >>>> well.  So I'd recommend looking more at the top 100 or so broken
> >>>> links, make this a manageable task.
> >>>
> >>>
> >>>
> >>> Well the problem is "how" to make it manageable... :(
> >>>
> >>>
> >>>>
> >>>> Or -- and here is a challenge for the algorithm experts -- maybe there
> >>>> is an easy way to take that entire list of 62,962 links and determine
> >>>> what the top base paths are that are broken.
> >>>
> >>>
> >>>
> >>> if only this were so :( They're all over the place.
> >>>
> >>>
> >>>> In other words, if the links are:
> >>>>
> >>>> foo.openoffice.org/bar/baz1
> >>>> foo.openoffice.org/bar/baz2
> >>>> foo.openoffice.org/bar/baz2
> >>>> foo.openoffice.org/bar2/baz1
> >>>> foo2.openoffice.org/bar1/baz1
> >>>>
> >>>> Then this would tell us that foo.openoffice.org/bar/* was a top source
> >>>> of broken links.  This might indicate important patterns of where the
> >>>> most broken links are.
> >>>>
> >>>> It seems like this could be done via a prefix tree (a "trie"):
> >>>> http://en.wikipedia.org/wiki/Trie
> >>>>
> >>>> Maybe other (simpler) ways as well.
> >>>
> >>>
> >>>
> >>> I'll look at this article. It's a daunting task any way you look at it.
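[One of the "simpler ways" might be to skip the trie and just count every host/directory prefix of each broken URL, then sort by count. The sketch below assumes the 404 list has been exported as one URL per line. The function name prefix_counts and the max_depth cutoff are made up for illustration, and nothing here depends on any Webmaster Tools API.]

```python
from collections import Counter
from urllib.parse import urlsplit


def prefix_counts(urls, max_depth=3):
    """Count how many broken URLs fall under each host/directory prefix.

    Hypothetical helper: the last path segment is treated as the page itself
    unless the path ends in "/", in which case it is treated as a directory.
    """
    counts = Counter()
    for url in urls:
        url = url.strip()
        if not url:
            continue
        # urlsplit only recognizes the host when a scheme is present.
        parts = urlsplit(url if "://" in url else "http://" + url)
        host = parts.netloc.lower()
        segments = [s for s in parts.path.split("/") if s]
        dirs = segments if parts.path.endswith("/") else segments[:-1]
        prefix = host
        counts[prefix + "/"] += 1
        for segment in dirs[:max_depth]:
            prefix = prefix + "/" + segment
            counts[prefix + "/"] += 1
    return counts


if __name__ == "__main__":
    # The five example links from the message above.
    sample = [
        "foo.openoffice.org/bar/baz1",
        "foo.openoffice.org/bar/baz2",
        "foo.openoffice.org/bar/baz2",
        "foo.openoffice.org/bar2/baz1",
        "foo2.openoffice.org/bar1/baz1",
    ]
    for prefix, count in prefix_counts(sample).most_common():
        print(count, prefix)
```

[On the five example links this reports 4 hits under foo.openoffice.org/ and 3 under foo.openoffice.org/bar/, which matches the pattern described above. A real trie would use less memory on a very large list, but for roughly 63 thousand URLs a flat counter should be enough.]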
> >>>
> >>>>
> >>>> Regards,
> >>>
> >>>
> >>>
> >>> What happens when things get moved a LOT with no regard for the end user.
> >>> Don't get me started on the ways I've had to deal with this in the past.
> >>>
> >>>
> >>>>
> >>>> -Rob
> >>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>>
> >>>>>
> >>>>>
> ----------------------------------------------------------------------------------------
> >>>>> MzK
> >>>>>
> >>>>> "I'm just a normal jerk who happens to make music.
> >>>>>    As long as my brain and fingers work, I'm cool."
> >>>>>                                 -- Eddie Van Halen
> >>>
> >>>
> >>>
> >>> --
> >>>
> ------------------------------------------------------------------------
> >>> MzK
> >>>
> >>> "I'm just a normal jerk who happens to make music.
> >>>   As long as my brain and fingers work, I'm cool."
> >>>                                -- Eddie Van Halen
> >>>
> >>>
> >
> > --
> > ------------------------------------------------------------------------
> > MzK
> >
> > "I'm just a normal jerk who happens to make music.
> >  As long as my brain and fingers work, I'm cool."
> >                               -- Eddie Van Halen
> >
> >
>



-- 
----------------------------------------------------------------------------------------
MzK

"I'm just a normal jerk who happens to make music.
 As long as my brain and fingers work, I'm cool."
                              -- Eddie Van Halen
