lucene-dev mailing list archives

From Grant Ingersoll <grant.ingers...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff
Date Tue, 24 Apr 2007 20:50:27 GMT
Is there a way to pick a specific day, versus "latest"?  How long
does Wikipedia keep its dumps archived?  Always using the latest makes
comparisons between benchmark runs more difficult.  I wonder if
licensing terms would allow us to host a specific dated version on
Lucene zones.  Of course, that may not be a good idea bandwidth-wise.
I'm open to suggestions.  Maybe using the latest isn't that big of a
deal.
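
To make the "specific day" idea concrete, here is a minimal sketch of
building a download URL for one fixed dump date instead of "latest". It
assumes the URL layout of the 20070402 dump quoted below holds for
other dates, which may not be true given how the download site gets
reorganized. The class and method names are made up for illustration
and are not part of the patch.

```java
public class EnwikiDumpUrl {
    // Host and path prefix copied from the 20070402 URL quoted in this thread.
    // Assumption: other dump dates follow the same layout.
    private static final String BASE = "http://download.wikimedia.org/enwiki/";

    /**
     * Builds the pages-articles dump URL for a date stamp in yyyyMMdd form.
     * Hypothetical helper; it only formats a string and does not check
     * that the dump actually exists on the server.
     */
    static String forDate(String stamp) {
        if (stamp == null || !stamp.matches("\\d{8}")) {
            throw new IllegalArgumentException("expected yyyyMMdd, got: " + stamp);
        }
        return BASE + stamp + "/enwiki-" + stamp + "-pages-articles.xml.bz2";
    }

    public static void main(String[] args) {
        // Pin the benchmark corpus to one dump so runs stay comparable.
        System.out.println(forDate("20070402"));
    }
}
```

Recording the date stamp alongside benchmark results would at least let
two runs be checked for having used the same corpus.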



On Apr 24, 2007, at 2:45 PM, Steven Parkes (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491396 ]
>
> Steven Parkes commented on LUCENE-848:
> --------------------------------------
>
> Yeah, it takes a while to download.
>
> I added the jars since that's what we've been doing elsewhere. In  
> fact, xerces is in gdata-server too. Personally, the size isn't an  
> issue for me; don't know about others.  What might be difficult,  
> though, is trying to share the two since that would mean  
> coordinating contrib projects, and I don't know anything about the  
> gdata server. I can tell you that if you want to support both 1.4  
> and 1.5 on something as big as Wikipedia, there is sensitivity to the
> xerces revision.
>
> Sorry about the download problem, Grant. I actually documented that  
> in a readme ... that I can no longer find. I would swear I put it in
> the patch but obviously I didn't because it's not there. Now I have
> to go find it.
>
> The short answer is you want to download
> http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2.
> The Wikipedia download site isn't always clean: files aren't always
> where they "should" be. It was clean when I first started this, but
> it isn't now.
>
>> Add supported for Wikipedia English as a corpus in the benchmarker stuff
>> ------------------------------------------------------------------------
>>
>>                 Key: LUCENE-848
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>>             Project: Lucene - Java
>>          Issue Type: New Feature
>>          Components: contrib/benchmark
>>            Reporter: Steven Parkes
>>         Assigned To: Grant Ingersoll
>>            Priority: Minor
>>             Fix For: 2.2
>>
>>         Attachments: LUCENE-848.txt, LUCENE-848.txt,  
>> LUCENE-848.txt, LUCENE-848.txt, WikipediaHarvester.java,  
>> xerces.jar, xerces.jar, xml-apis.jar
>>
>>
>> Add support for using Wikipedia for benchmarking.
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/




