mahout-user mailing list archives

From Jake Mannix
Subject Re: Wikipedia things/strings dataset
Date Sat, 19 May 2012 20:38:46 GMT
Nice to note: the license for this data is very nicely compatible with the
Apache License for software, as it's CC-3.0 Attribution:

as linked from their download dir:

On Sat, May 19, 2012 at 12:50 PM, Dan Brickley wrote:

> Just noticed this handy-looking dataset,
> "From Words to Concepts and Back: Dictionaries for Linking Text,
> Entities and Ideas"
> Excerpt, "How do we represent concepts? Our approach piggybacks on the
> unique titles of entries from an encyclopedia, which are mostly proper
> and common noun phrases. We consider each individual Wikipedia article
> as representing a concept (an entity or an idea), identified by its
> URL. Text strings that refer to concepts were collected using the
> publicly available hypertext of anchors (the text you click on in a
> web link) that point to each Wikipedia page, thus drawing on the vast
> link structure of the web. For every English article we harvested the
> strings associated with its incoming hyperlinks from the rest of
> Wikipedia, the greater web, and also anchors of parallel, non-English
> Wikipedia pages. Our dictionaries are cross-lingual, and any concept
> deemed too fine can be broadened to a desired level of generality
> using Wikipedia's groupings of articles into hierarchical categories.
> The data set contains triples, each consisting of (i) text, a short,
> raw natural language string; (ii) url, a related concept, represented
> by an English Wikipedia article's canonical location; and (iii) count,
> an integer indicating the number of times text has been observed
> connected with the concept's url. Our database thus includes weights
> that measure degrees of association." [...]
> I figured this should be of interest to a good few Mahout users, so
> passing it along...
> cheers,
> Dan
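
The (text, url, count) triples described in the excerpt above can be turned into per-string association weights. The following is only a minimal sketch: it assumes one tab-separated triple per line, which the excerpt does not confirm, and the function names and sample rows are made up for illustration rather than taken from the dataset.

```python
from collections import defaultdict


def parse_triples(lines):
    """Yield (text, url, count) tuples.

    Assumes one tab-separated triple per line; the real file layout
    may differ.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        text, url, count = line.split("\t")
        yield text, url, int(count)


def association_weights(triples):
    """Map each text string to {url: count / total count for that string}."""
    counts = defaultdict(lambda: defaultdict(int))
    for text, url, count in triples:
        counts[text][url] += count

    weights = {}
    for text, by_url in counts.items():
        total = sum(by_url.values())
        if total == 0:
            continue  # no observations to normalize
        weights[text] = {url: c / total for url, c in by_url.items()}
    return weights


# Hypothetical sample rows, not real data from the dataset.
sample = [
    "jaguar\thttp://en.wikipedia.org/wiki/Jaguar\t30",
    "jaguar\thttp://en.wikipedia.org/wiki/Jaguar_Cars\t10",
    "big cat\thttp://en.wikipedia.org/wiki/Jaguar\t5",
]

weights = association_weights(parse_triples(sample))
for url, weight in sorted(weights["jaguar"].items()):
    print(url, weight)
```

Normalizing per string gives a rough estimate of how often a given string points at each concept, which is one way to read the "degrees of association" the excerpt mentions. Other normalizations (for example, per URL) would be equally reasonable.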


