lucene-dev mailing list archives

From "Steven Parkes (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff
Date Tue, 24 Apr 2007 18:45:15 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491396 ]

Steven Parkes commented on LUCENE-848:
--------------------------------------

Yeah, it takes a while to download.

I added the jars since that's what we've been doing elsewhere. In fact, xerces is in gdata-server
too. Personally, the size isn't an issue for me; I don't know about others. What might be difficult,
though, is trying to share the jars between the two, since that would mean coordinating contrib
projects, and I don't know anything about the gdata server. I can tell you that if you want to
support both 1.4 and 1.5 on something as big as Wikipedia, there is sensitivity to the xerces revision.

Sorry about the download problem, Grant. I actually documented that in a readme ... that I
can no longer find. I would swear I put it in the patch, but obviously I didn't because it's
not there. Now I have to go find it.

The short answer is that you want to download this file:
http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2
The Wikipedia download site isn't always clean: files aren't always where they "should" be.
It was clean when I first started this, but it isn't now.
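
For anyone who wants to try a different dump date, here is a minimal sketch that rebuilds the URL
from the date. The class name EnwikiDumpUrl and the method forDate are hypothetical, and the
host and path layout are simply copied from the one URL quoted above. Given that files are not
always where they "should" be, treat the result as a guess to check, not a guaranteed location.

```java
import java.net.URI;

public class EnwikiDumpUrl {
    // Assumption: host and directory layout are taken from the single URL quoted in this message.
    private static final String BASE = "http://download.wikimedia.org/enwiki/";

    /** Builds the pages-articles dump URL for a dump date such as "20070402". */
    public static URI forDate(String yyyymmdd) {
        // Reject anything that is not exactly eight digits, so typos fail early.
        if (yyyymmdd == null || !yyyymmdd.matches("\\d{8}")) {
            throw new IllegalArgumentException("expected an 8-digit date, got: " + yyyymmdd);
        }
        return URI.create(BASE + yyyymmdd + "/enwiki-" + yyyymmdd + "-pages-articles.xml.bz2");
    }

    public static void main(String[] args) {
        // With no argument, fall back to the dump date mentioned in this message.
        String date = args.length > 0 ? args[0] : "20070402";
        System.out.println(forDate(date));
    }
}
```

This only builds the address; it does not check that the file exists on the server.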

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt,
>                      WikipediaHarvester.java, xerces.jar, xerces.jar, xml-apis.jar
>
>
> Add support for using Wikipedia for benchmarking.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

