lucene-general mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Open Relevance Project?
Date Mon, 18 May 2009 09:59:49 GMT
André Warnier wrote:
> Hi.
> There has been an earlier suggestion here, later endorsed by someone 
> else, to use the documentation of the Apache projects as a corpus.
> Being far from an expert, I am just naively wondering why the experts on 
> this list seem to totally ignore it, without providing any argument.
> Is it somehow unsuitable, unpractical, inappropriate, bad, unfeasible, 
> useless, uninteresting or ... ?

The documentation is mostly on a single topic - programming. The 
vocabulary is, let's not deceive ourselves, limited ;) Pages contain a 
lot of noise (Forrest navigation, javadoc dressing, common class names, 
code snippets, etc).

For a general-purpose corpus you would want several topics in 
well-balanced proportions, a broad vocabulary, and a low level of 
noise.
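
One rough way to put a number on "broad vocabulary" is the type-token 
ratio: distinct word forms divided by total words. The sketch below is 
only an illustration; the function name and both sample strings are 
made up for this example and are not taken from any Apache 
documentation. The ratio also drops as texts get longer, so it is only 
meaningful when comparing samples of similar length.

```python
import re

def type_token_ratio(text):
    """Distinct lowercase word forms divided by total word count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Hypothetical samples, invented for illustration only.
api_doc = "returns the index reader returns the index writer returns the index"
general = "the river flooded farms while parliament debated grain prices"

# The repetitive API-style sample scores lower than the general-text sample.
print(type_token_ratio(api_doc))
print(type_token_ratio(general))
```

A real comparison would also need to strip the navigation and javadoc 
boilerplate first, since repeated page furniture pulls the ratio down 
regardless of the actual content.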

Additionally, this collection gets relatively little endorsement (links 
with meaningful anchors) from within apache.org, so the typical PageRank 
scoring wouldn't work too well (on the other hand, it resembles intranet 
linkage, so it could be useful for studying scoring algos for enterprise 
search).
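
To illustrate why sparse linking hurts link-based scoring, here is a 
minimal sketch of a simplified power-iteration PageRank. It is not the 
scoring code of Nutch or any other library; the function, the damping 
value of 0.85, and the four-page graphs are assumptions chosen for the 
example. With no links between pages every page ends up with the same 
score, so the ranking carries no signal.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank

# Hypothetical graphs, invented for illustration only.
unlinked = {"a": [], "b": [], "c": [], "d": []}           # no endorsement at all
hub = {"a": ["d"], "b": ["d"], "c": ["d"], "d": ["a"]}    # "d" is widely endorsed

print(pagerank(unlinked))  # every page gets the same score
print(pagerank(hub))       # "d" gets the highest score
```

The intranet-like case sits between these two extremes: a few 
navigation links exist, but they say little about which pages are 
actually authoritative.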

So, while this collection is not useless, it's not the best fit either.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

