lucene-dev mailing list archives

From "Steven Parkes" <steven_par...@esseff.org>
Subject RE: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff
Date Mon, 02 Apr 2007 19:41:58 GMT
I checked, and the dump does contain escape sequences. If the choice was
ever debatable, I think that tips it in favor of SAX. Should we use
Xerces? The contrib/gdata code seems to use it.

I suppose if I'm careful and creative enough, we could share a lot of
the code amongst benchmark ingesters that use XML, should there be more
... 
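
A minimal sketch of what a shared SAX-based ingester could look like. It uses the JDK's JAXP SAXParserFactory rather than naming Xerces directly, and it assumes the dump nests <title> and <text> inside <page>. The names WikiSaxSketch, PageHandler, and PageSink are hypothetical and are not taken from the attached WikipediaHarvester.java. Decompressing the bzip2 snapshot is out of scope here: the sketch expects an already-decompressed stream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class WikiSaxSketch {
    /** Hypothetical callback; a benchmark ingester would build documents here. */
    public interface PageSink {
        void page(String title, String text);
    }

    /** Streams through the XML and emits one (title, text) pair per page element. */
    public static class PageHandler extends DefaultHandler {
        private final PageSink sink;
        private final StringBuilder buf = new StringBuilder();
        private boolean capture;
        private String title;
        private String text;

        public PageHandler(PageSink sink) {
            this.sink = sink;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("page".equals(qName)) {
                title = null;
                text = null;
            } else if ("title".equals(qName) || "text".equals(qName)) {
                buf.setLength(0);
                capture = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one text node in several chunks, so accumulate.
            if (capture) {
                buf.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                title = buf.toString();
                capture = false;
            } else if ("text".equals(qName)) {
                text = buf.toString();
                capture = false;
            } else if ("page".equals(qName)) {
                sink.page(title, text);
            }
        }
    }

    /** Parses the stream incrementally; memory use does not grow with file size. */
    public static void parse(InputStream in, PageSink sink) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(in, new PageHandler(sink));
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input, not an excerpt from a real dump.
        String xml = "<mediawiki>"
            + "<page><title>A &amp; B</title><revision><text>first</text></revision></page>"
            + "<page><title>C</title><revision><text>x &lt; y</text></revision></page>"
            + "</mediawiki>";
        parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
              (title, text) -> System.out.println(title + " => " + text));
    }
}
```

Keeping the parsing in the handler and the document building behind the callback is one way other XML-based ingesters could reuse most of the code: only the element names and the sink would change.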

-----Original Message-----
From: Grant Ingersoll [mailto:grant.ingersoll@gmail.com] 
Sent: Wednesday, March 28, 2007 10:44 AM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia
English as a corpus in the benchmarker stuff


On Mar 28, 2007, at 1:09 PM, Steven Parkes (JIRA) wrote:

>
>      [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Steven Parkes updated LUCENE-848:
> ---------------------------------
>
>       Description: Add support for using Wikipedia for  
> benchmarking.  (was: Add support for using Wikipedia for  
> benchmarking. If no one is working on this, I'll start soon.)
>     Lucene Fields:   (was: [New])
>           Summary: Add supported for Wikipedia English as a corpus  
> in the benchmarker stuff  (was: Add supported for Wikipediea  
> English as a corpus in the benchmarker stuff)
>
> Can't leave the typo in the title. It's bugging me.
>
> Karl, it looks like your stuff grabs individual articles, right?
> I'm going to have it download the bzip2 snapshots they provide (and
> that they prefer you use, if you're getting much).
>
> Question (for Doron and anyone else): the file is XML and it's big,
> so DOM isn't going to work. I could still use something SAX-based,
> but since the format is so tightly controlled, I'm thinking regular
> expressions would be sufficient and have fewer dependencies. Anyone
> have opinions on this?


Personally, I think SAX is the way to go, as you'll get handling of
escape sequences, etc. out of the box. It also seems like it would be
easier to read and maintain.
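
To make the escape-sequence point concrete, here is a small sketch comparing the two approaches on one made-up title element. The class name EscapeDemo and the sample string are assumptions for illustration only. The regular expression returns the raw markup with entities intact, while the SAX parser hands back decoded characters.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EscapeDemo {
    // Made-up sample input, not an excerpt from a real dump.
    public static final String SAMPLE =
        "<page><title>AT&amp;T &lt;company&gt;</title></page>";

    /** Regex extraction: returns the text exactly as it appears in the file. */
    public static String titleViaRegex(String xml) {
        Matcher m = Pattern.compile("<title>(.*?)</title>").matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    /** SAX extraction: the parser decodes the predefined entities for us. */
    public static String titleViaSax(String xml) throws Exception {
        final StringBuilder title = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inTitle;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("title".equals(qName)) {
                    inTitle = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName)) {
                    inTitle = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inTitle) {
                    title.append(ch, start, length);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return title.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titleViaRegex(SAMPLE)); // AT&amp;T &lt;company&gt;
        System.out.println(titleViaSax(SAMPLE));   // AT&T <company>
    }
}
```

A regex-based ingester would need its own unescaping pass to match the second result, which is the extra work SAX avoids.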

>
>> Add supported for Wikipedia English as a corpus in the benchmarker  
>> stuff
>> ------------------------------------------------------------------------
>>
>>                 Key: LUCENE-848
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>>             Project: Lucene - Java
>>          Issue Type: New Feature
>>          Components: contrib/benchmark
>>            Reporter: Steven Parkes
>>         Assigned To: Steven Parkes
>>            Priority: Minor
>>             Fix For: 2.2
>>
>>         Attachments: WikipediaHarvester.java
>>
>>
>> Add support for using Wikipedia for benchmarking.
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



