lucene-dev mailing list archives

From "Steven Parkes" <steven_par...@esseff.org>
Subject RE: [jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff
Date Tue, 24 Apr 2007 22:01:10 GMT
They don't seem to keep things around too long. There were more files
available when I downloaded earlier this month, but they're already
gone.

Wikipedia is supposed to only contain stuff covered by the GNU Free
Documentation License so saving it should be okay. In fact, one of the
other files you can download has all the revisions of all the documents.

The issue of different versions is a good one. I wonder how much it
matters for reasonably big datasets. Not that much of the data changes,
I suspect.

For grins, I think I'll download the newer snapshot and see if there's
any difference for the ingest tests I've done. 
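
For a fixed comparison point, a run could build the dated URL from a specific snapshot date instead of following "latest". This is a minimal sketch only: the base URL and file naming are copied from the single 20070402 URL quoted below, the helper name snapshot_url is hypothetical, and a given date may no longer exist on the server, since old dumps get removed.

```python
from datetime import date

# Assumption: base URL and file naming are taken from the one dump URL
# quoted in this thread; other dates may not be present on the server.
BASE = "http://download.wikimedia.org/enwiki"


def snapshot_url(day: date) -> str:
    """Build the pages-articles dump URL for one specific snapshot date."""
    stamp = day.strftime("%Y%m%d")
    return f"{BASE}/{stamp}/enwiki-{stamp}-pages-articles.xml.bz2"


# Print the URL for the snapshot used in the ingest tests so far.
print(snapshot_url(date(2007, 4, 2)))
```

Recording the date alongside benchmark results would at least make it clear which snapshot a given number came from, even if the file itself is later removed.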

-----Original Message-----
From: Grant Ingersoll [mailto:grant.ingersoll@gmail.com] 
Sent: Tuesday, April 24, 2007 1:50 PM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-848) Add supported for Wikipedia
English as a corpus in the benchmarker stuff

Is there a way to pick a specific day, versus "latest"?  How long  
does Wikipedia keep its archives?  Always using the latest makes  
comparisons more difficult.  I wonder if licensing terms would allow  
us to host a specific dated version on Lucene zones.  Of course, that  
may not be a good idea bandwidth-wise.  I'm open to suggestions.  Maybe  
using the latest isn't that big of a deal.



On Apr 24, 2007, at 2:45 PM, Steven Parkes (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491396 ]
>
> Steven Parkes commented on LUCENE-848:
> --------------------------------------
>
> Yeah, it takes a while to download.
>
> I added the jars since that's what we've been doing elsewhere. In  
> fact, xerces is in gdata-server too. Personally, the size isn't an  
> issue for me; don't know about others.  What might be difficult,  
> though, is trying to share the two since that would mean  
> coordinating contrib projects, and I don't know anything about the  
> gdata server. I can tell you that if you want to support both 1.4  
> and 1.5 on something as big as Wikipedia, there is sensitivity to the  
> xerces revision.
>
> Sorry about the download problem, Grant. I actually documented that  
> in a readme ... that I can no longer find. I would swear I put it in  
> the patch but obviously I didn't because it's not there. Now I have  
> to go find it.
>
> The short answer is you want to download
> http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2
> The Wikipedia download site isn't always clean and doesn't always have
> files where they "should" be. It was when I first started this, but
> isn't now.
>
>> Add supported for Wikipedia English as a corpus in the benchmarker  
>> stuff
>> ------------------------------------------------------------------------
>>
>>                 Key: LUCENE-848
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>>             Project: Lucene - Java
>>          Issue Type: New Feature
>>          Components: contrib/benchmark
>>            Reporter: Steven Parkes
>>         Assigned To: Grant Ingersoll
>>            Priority: Minor
>>             Fix For: 2.2
>>
>>         Attachments: LUCENE-848.txt, LUCENE-848.txt,  
>> LUCENE-848.txt, LUCENE-848.txt, WikipediaHarvester.java,  
>> xerces.jar, xerces.jar, xml-apis.jar
>>
>>
>> Add support for using Wikipedia for benchmarking.
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org



