lucene-dev mailing list archives

From Grant Ingersoll <grant.ingers...@gmail.com>
Subject Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff
Date Mon, 02 Apr 2007 20:01:07 GMT

On Apr 2, 2007, at 3:41 PM, Steven Parkes wrote:

> I checked and there are escape sequences in there. If it was ever
> debatable, I think that tips it in favor of SAX. xerces? The
> contrib/gdata stuff seems to use it.

Xerces should be fine, I think.
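
For illustration only, here is a minimal sketch of what the SAX approach might look like using the JAXP API that ships with the JDK (which may or may not be backed by Xerces, depending on the runtime). The class name WikiSaxSketch and the element names page, title and text are assumptions about the dump layout, not something taken from the attached harvester. The point is that the parser decodes escape sequences such as &amp;amp; before the handler ever sees the characters.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; not the LUCENE-848 implementation.
public class WikiSaxSketch {

    /** Collects {title, text} pairs. Element names are assumed, not verified against a real dump. */
    static class PageHandler extends DefaultHandler {
        final List<String[]> pages = new ArrayList<>();
        private final StringBuilder buf = new StringBuilder();
        private String title;
        private String text;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Start collecting character data afresh for each element.
            buf.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so append rather than assign.
            buf.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                title = buf.toString();
            } else if ("text".equals(qName)) {
                text = buf.toString();
            } else if ("page".equals(qName)) {
                pages.add(new String[] {title, text});
                title = null;
                text = null;
            }
        }
    }

    /** Parses an in-memory XML string and returns the pages found. */
    public static List<String[]> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        PageHandler handler = new PageHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.pages;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>AT&amp;T</title>"
                + "<revision><text>a &lt;b&gt; c</text></revision></page></mediawiki>";
        for (String[] p : parse(xml)) {
            System.out.println(p[0] + " => " + p[1]);
        }
    }
}
```

For a real snapshot one would presumably hand the parser a decompressed InputStream instead of a String and index each page as it completes rather than collecting them in a list, so memory use stays flat regardless of file size.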

>
> I suppose if I'm careful and creative enough, we could share a lot of
> the code amongst benchmark ingesters that use XML, should there be more
> ...
>

Yes, indeed.  May not be necessary initially, but we could support  
XPath or something down the road to allow us to specify what things  
we are interested in.  I wouldn't worry about generalizing too much  
to start with.  Once we have a couple collections then we can go that  
route.

> -----Original Message-----
> From: Grant Ingersoll [mailto:grant.ingersoll@gmail.com]
> Sent: Wednesday, March 28, 2007 10:44 AM
> To: java-dev@lucene.apache.org
> Subject: Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia
> English as a corpus in the benchmarker stuff
>
>
> On Mar 28, 2007, at 1:09 PM, Steven Parkes (JIRA) wrote:
>
>>
>>      [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>
>> Steven Parkes updated LUCENE-848:
>> ---------------------------------
>>
>>       Description: Add support for using Wikipedia for
>> benchmarking.  (was: Add support for using Wikipedia for
>> benchmarking. If no one is working on this, I'll start soon.)
>>     Lucene Fields:   (was: [New])
>>           Summary: Add supported for Wikipedia English as a corpus
>> in the benchmarker stuff  (was: Add supported for Wikipediea
>> English as a corpus in the benchmarker stuff)
>>
>> Can't leave the typo in the title. It's bugging me.
>>
>> Karl, it looks like your stuff grabs individual articles, right?
>> I'm going to have it download the bzip2 snapshots they provide (and
>> that they prefer you use, if you're getting much).
>>
>> Question (for Doron and anyone else): the file is xml and it's big,
>> so DOM isn't going to work. I could still use something SAX based
>> but since the format is so tightly controlled, I'm thinking regular
>> expressions would be sufficient and have fewer dependencies. Anyone
>> have opinions on this?
>
>
> Personally, I think SAX is the way to go, as you'll get handling of
> escape sequences, etc. out of the box.  And seems like it is easier
> to read/maintain?
>
>>
>>> Add supported for Wikipedia English as a corpus in the benchmarker stuff
>>> ------------------------------------------------------------------------
>>>
>>>                 Key: LUCENE-848
>>>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>>>             Project: Lucene - Java
>>>          Issue Type: New Feature
>>>          Components: contrib/benchmark
>>>            Reporter: Steven Parkes
>>>         Assigned To: Steven Parkes
>>>            Priority: Minor
>>>             Fix For: 2.2
>>>
>>>         Attachments: WikipediaHarvester.java
>>>
>>>
>>> Add support for using Wikipedia for benchmarking.
>>
>> -- 
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/




