lucene-dev mailing list archives

From Grant Ingersoll <grant.ingers...@gmail.com>
Subject Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff
Date Wed, 28 Mar 2007 17:44:08 GMT

On Mar 28, 2007, at 1:09 PM, Steven Parkes (JIRA) wrote:

>
>      [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Steven Parkes updated LUCENE-848:
> ---------------------------------
>
>       Description: Add support for using Wikipedia for  
> benchmarking.  (was: Add support for using Wikipedia for  
> benchmarking. If no one is working on this, I'll start soon.)
>     Lucene Fields:   (was: [New])
>           Summary: Add supported for Wikipedia English as a corpus  
> in the benchmarker stuff  (was: Add supported for Wikipediea  
> English as a corpus in the benchmarker stuff)
>
> Can't leave the typo in the title. It's bugging me.
>
> Karl, it looks like your stuff grabs individual articles, right?  
> I'm going to have it download the bzip2 snapshots they provide (and  
> that they prefer you use, if you're getting much).
>
> Question (for Doron and anyone else): the file is XML and it's big,  
> so DOM isn't going to work. I could still use something SAX-based,  
> but since the format is so tightly controlled, I'm thinking regular  
> expressions would be sufficient and have fewer dependencies. Anyone  
> have opinions on this?


Personally, I think SAX is the way to go, as you'll get handling of  
escape sequences, etc. out of the box.  It also seems easier to read  
and maintain.
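
A minimal sketch of the SAX approach, to make the escape-handling point
concrete. It uses only the JDK's built-in SAX parser, so it adds no
dependencies. The class and method names (WikiSaxSketch, PageHandler,
parse) are made up for illustration, and the element names page/title/text
are an assumption about the dump layout, not something taken from the
attached WikipediaHarvester.java. Reading the bzip2 snapshot itself is not
covered here; the sketch takes any InputSource.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class WikiSaxSketch {

    /** Collects {title, text} pairs while the parser streams through the file. */
    static class PageHandler extends DefaultHandler {
        final List<String[]> pages = new ArrayList<>();
        private final StringBuilder buf = new StringBuilder();
        private String title;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            // Start collecting character data afresh for each element.
            buf.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver text in several chunks, so append rather than assign.
            buf.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            // Element names below are assumed, not confirmed against a real dump.
            if ("title".equals(qName)) {
                title = buf.toString();
            } else if ("text".equals(qName)) {
                pages.add(new String[] { title, buf.toString() });
            }
        }
    }

    /** Parses the input and returns one {title, text} pair per text element. */
    static List<String[]> parse(InputSource in) {
        try {
            PageHandler handler = new PageHandler();
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
            return handler.pages;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<mediawiki><page><title>A &amp; B</title>"
                   + "<revision><text>x &lt; y</text></revision></page></mediawiki>";
        for (String[] page : parse(new InputSource(new StringReader(xml)))) {
            // Entities arrive already decoded: prints "A & B -> x < y"
            System.out.println(page[0] + " -> " + page[1]);
        }
    }
}
```

A regular-expression version would have to decode &amp;, &lt; and friends
by hand; here the parser does it. Because the handler only keeps the
current element's text in memory, the size of the file is not a problem
the way it is with DOM. For a real run the handler would pass each page
on as it is read rather than collect them all in a list.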

>
>> Add supported for Wikipedia English as a corpus in the benchmarker  
>> stuff
>> ------------------------------------------------------------------------
>>
>>                 Key: LUCENE-848
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>>             Project: Lucene - Java
>>          Issue Type: New Feature
>>          Components: contrib/benchmark
>>            Reporter: Steven Parkes
>>         Assigned To: Steven Parkes
>>            Priority: Minor
>>             Fix For: 2.2
>>
>>         Attachments: WikipediaHarvester.java
>>
>>
>> Add support for using Wikipedia for benchmarking.
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/




