lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Benchmarking on GOV2
Date Mon, 29 May 2006 18:11:54 GMT

On May 29, 2006, at 10:58 AM, Andrzej Bialecki wrote:

>> Has anyone used existing categorization data associated with the  
>> Reuters corpus to build a benchmarker that measured IR precision  
>> and/or recall?
>
> That would be RCV1 or RCV2, right? AFAIK the Reuters-21578 has no  
> such information ... The use of RCV1/RCV2 is subject to a more  
> stringent license than Reuters-21578, so that few people would be  
> able to actually run the benchmarks.

Reuters-21578 does have categorization information.  Here's a snippet
from one of the SGML files (note the <TOPICS> tag):

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"  
OLDID="5562" NEWID="19">
<DATE>26-FEB-1987 15:26:54.12</DATE>
<TOPICS><D>wheat</D><D>grain</D></TOPICS>
<PLACES><D>yemen-arab-republic</D><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>
&#5;&#5;&#5;C G
&#22;&#22;&#1;f0798&#31;reute
u f BC-/BONUS-WHEAT-FLOUR-FO   02-26 0096</UNKNOWN>
<TEXT>&#2;
<TITLE>BONUS WHEAT FLOUR FOR NORTH YEMEN  -- USDA</TITLE>

I'm not sure how to use this info, though -- I'm just investigating  
whether there's prior art before I start thinking hard about it.
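
One possible way to use the topic labels, offered only as a rough sketch and not as an established benchmark method: treat each TOPICS label as a relevance judgment, so that a query for a topic counts a document as relevant when it carries that label. The helper names (`parse_record`, `precision_recall`) are hypothetical, the regexes assume the flat `<TOPICS><D>...</D></TOPICS>` layout shown in the snippet above, and documents 20 and 21 in the example are invented purely for illustration.

```python
import re

# Assumption: topics always appear as <TOPICS><D>label</D>...</TOPICS>,
# as in the record quoted above. These helpers are illustrative only.
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
D_RE = re.compile(r"<D>(.*?)</D>")
NEWID_RE = re.compile(r'NEWID="(\d+)"')


def parse_record(sgml):
    """Return (newid, set_of_topic_labels) for one <REUTERS> record."""
    newid = int(NEWID_RE.search(sgml).group(1))
    match = TOPICS_RE.search(sgml)
    topics = set(D_RE.findall(match.group(1))) if match else set()
    return newid, topics


def precision_recall(retrieved_ids, relevant_ids):
    """Set-based precision and recall for a single query."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Abbreviated copy of the record from the snippet above.
record = '''<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="5562" NEWID="19">
<TOPICS><D>wheat</D><D>grain</D></TOPICS>
<PLACES><D>yemen-arab-republic</D><D>usa</D></PLACES>'''

newid, topics = parse_record(record)

# Documents 20 and 21 are made up for this example.
topics_by_doc = {newid: topics, 20: {"crude"}, 21: {"grain"}}

# Documents labeled "grain" are taken as the relevant set for that query.
relevant = [doc for doc, labels in topics_by_doc.items() if "grain" in labels]

# Pretend a search engine returned documents 19 and 20 for the query "grain".
precision, recall = precision_recall([19, 20], relevant)
print(newid, sorted(topics), precision, recall)
```

Whether a topic label is a fair stand-in for a query-level relevance judgment is an open question; the sketch only shows that the labels are easy to extract and plug into the standard precision and recall definitions.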

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



