lucene-dev mailing list archives

From "Steven Parkes (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff
Date Mon, 09 Apr 2007 18:29:32 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Parkes updated LUCENE-848:
---------------------------------

    Attachment: LUCENE-848.txt

This patch is a first cut at Wikipedia benchmark support. It downloads the current English
pages from the Wikipedia download site ... which, of course, is actually not there right now.
I'm not quite sure what's up, but you can find the files at http://download.wikimedia.org/enwiki/20070402/
right now if you want to play.

It adds ExtractWikipedia.java, which uses Xerces-J to grab the individual articles. It writes
the articles in the same format as the Reuters stuff, so a genericized ReutersDocMaker, DirDocMaker,
works.
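
A minimal sketch of the extraction step, not the code in the patch: it uses the JDK's built-in SAX parser rather than a standalone Xerces-J, and it assumes the dump nests each article as page/title and page/revision/text and that link-only entries start with "#REDIRECT". The class name WikiPageSketch and the in-memory map are hypothetical; writing the Reuters-style files is left out.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects title -> wiki text for every non-redirect page.
final class WikiPageSketch {

    // Assumption: a link-only entry is one whose text begins with "#REDIRECT".
    static boolean isRedirect(String text) {
        return text.trim().regionMatches(true, 0, "#REDIRECT", 0, 9);
    }

    static Map<String, String> extract(InputStream in) throws Exception {
        final Map<String, String> pages = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder chars = new StringBuilder();
            private String title;
            private String text;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                chars.setLength(0);            // start collecting this element's text
                if (qName.equals("page")) {    // a new article begins
                    title = null;
                    text = null;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                chars.append(ch, start, length); // may be called several times per element
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                switch (qName) {
                    case "title":
                        title = chars.toString();
                        break;
                    case "text":
                        text = chars.toString();
                        break;
                    case "page":
                        if (title != null && text != null && !isRedirect(text)) {
                            pages.put(title, text);
                        }
                        break;
                    default:
                        break;
                }
            }
        };
        // SAX streams the input, so the whole dump is never held in memory at once.
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return pages;
    }
}
```

A real extractor would write each article to disk as it is parsed instead of collecting them in a map, since the full dump will not fit in memory.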

The current size of the download file is 2.1G bzip2'd. It's supposed to contain about 1.2M
documents but I came out with 2 or 3, I think, so there may be "extra" files in there. (Some
entries are links and I tried to get rid of those, but I may have missed a particular coding
or case).

For the first pass, I copied the Reuters steps of decompressing and parsing. This creates
big temporary files. Moreover, it creates a big directory tree in the end. (The extractor
uses a fixed number of documents per directory and grows the depth of the tree logarithmically,
a lot like Lucene segments).
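
One way such a tree could be laid out, as a sketch only: the patch's actual naming scheme is not shown here, so the class DocTreeSketch, the digit-per-level directory names, and the ".txt" suffix are all assumptions. Each leaf directory holds perDir documents, each interior directory holds at most perDir subdirectories, and the depth needed grows with the logarithm (base perDir) of the document count.

```java
// Hypothetical layout helper; not the extractor from the patch.
final class DocTreeSketch {

    // Smallest depth (>= 1) whose tree can hold totalDocs documents,
    // given perDir entries per directory at every level.
    static int depthFor(long totalDocs, int perDir) {
        if (perDir < 2) {
            throw new IllegalArgumentException("perDir must be at least 2");
        }
        int depth = 1;
        long capacity = (long) perDir * perDir; // perDir leaf dirs, perDir docs each
        while (capacity < totalDocs) {
            capacity *= perDir;                 // one more level multiplies capacity
            depth++;
        }
        return depth;
    }

    // Relative path for a document: the leaf directory index (docId / perDir)
    // written as `depth` base-perDir digits, one digit per directory level.
    static String pathFor(long docId, int perDir, int depth) {
        long leaf = docId / perDir;
        String[] parts = new String[depth];
        for (int level = depth - 1; level >= 0; level--) {
            parts[level] = Long.toString(leaf % perDir);
            leaf /= perDir;
        }
        return String.join("/", parts) + "/" + docId + ".txt";
    }
}
```

With 100 documents per directory, for example, about 1.2M documents need a depth of 3. This version fixes the depth from a known total; an extractor that does not know the count up front would have to add levels as it goes.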

It's not clear how this preprocessing-to-a-directory-tree compares to on-the-fly decompression,
which would require fewer disk seeks on the input during indexing. May try that at some point
...

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Steven Parkes
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: LUCENE-848.txt, WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

