lucene-dev mailing list archives

From "Steven Parkes (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff
Date Wed, 28 Mar 2007 17:09:25 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Parkes updated LUCENE-848:
---------------------------------

      Description: Add support for using Wikipedia for benchmarking.  (was: Add support for
using Wikipedia for benchmarking. If no one is working on this, I'll start soon.)
    Lucene Fields:   (was: [New])
          Summary: Add supported for Wikipedia English as a corpus in the benchmarker stuff
 (was: Add supported for Wikipediea English as a corpus in the benchmarker stuff)

Can't leave the typo in the title. It's bugging me.

Karl, it looks like your stuff grabs individual articles, right? I'm going to have it download
the bzip2 snapshots they provide (and that they prefer you use, if you're fetching a lot).

Question (for Doron and anyone else): the file is XML and it's big, so DOM isn't going to
work. I could still use something SAX-based, but since the format is so tightly controlled,
I'm thinking regular expressions would be sufficient and have fewer dependencies. Anyone have
opinions on this?
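
For illustration only, here is a minimal sketch of what the regular-expression
approach could look like. It is not the code attached to the issue. The class
name WikiDumpRegexSketch and the parse method are hypothetical, and the sketch
assumes the dump has already been decompressed, that each <title> element sits
on a single line, and that article bodies live in <text> elements. It reads the
input line by line, so only one article is held in memory at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: streams a Wikipedia-style XML dump line by line. */
final class WikiDumpRegexSketch {
    // Assumption: <title> opens and closes on the same line.
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");
    // Opening <text ...> tag with optional attributes; the lookbehind
    // rejects self-closing tags such as <text xml:space="preserve" />.
    private static final Pattern TEXT_OPEN =
            Pattern.compile("<text(?:\\s[^>]*)?(?<!/)>");
    private static final String TEXT_CLOSE = "</text>";

    /** Returns {title, body} pairs in document order. */
    static List<String[]> parse(BufferedReader in) {
        List<String[]> pages = new ArrayList<>();
        String title = null;
        StringBuilder body = null; // non-null only while inside a <text> element
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (body == null) {
                    Matcher t = TITLE.matcher(line);
                    if (t.find()) {
                        title = t.group(1);
                        continue;
                    }
                    Matcher open = TEXT_OPEN.matcher(line);
                    if (!open.find()) {
                        continue; // not a line this sketch cares about
                    }
                    body = new StringBuilder();
                    line = line.substring(open.end()); // keep what follows the tag
                } else {
                    body.append('\n'); // continuation line of a multi-line body
                }
                int close = line.indexOf(TEXT_CLOSE);
                if (close < 0) {
                    body.append(line);
                } else {
                    body.append(line, 0, close);
                    pages.add(new String[] {title, body.toString()});
                    body = null;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pages;
    }
}
```

One trade-off of this approach: the regular expressions do not decode XML
entities (&lt;, &amp; and so on), so that step would have to be done by hand,
whereas a SAX parser handles it automatically. A real harvester would probably
also hand each article to a callback rather than collect them all in a list.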

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Steven Parkes
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

