lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: lucene functionality
Date Thu, 14 Dec 2006 13:20:15 GMT

On Dec 13, 2006, at 1:51 PM, Patrick Turcotte wrote:

> I would suggest you take a look at exist-db (http://exist-db.org/).

I really doubt eXist can handle 10M XML files.  Last time I tried it,  
it choked on 20k of them.

	Erik


>
> It is a database for XML documents that supports XQuery.
>
> We are using both products here (Lucene and exist-db), and for what you are
> looking for, exist-db seems better.
>
> Our documents are far more complex than yours (about 500 different elements
> in the structure), and even though we don't have millions, we have more than
> 53K documents.
>
> Once the documents are loaded in the database, performance is impressive
> when finding information in parts of documents (XPath), even where no index
> exists. And for your structure, you could even create indexes, which would
> boost performance even more.
>
> Don't hesitate to contact me directly if you have more questions.
>
> Patrick
>
> On 12/13/06, Mark Mei <vmslucene@gmail.com> wrote:
>>
>> At the bottom of this email is the sample XML file that we are using today.
>> We have about 10 million of these.
>>
>> We need to know whether Lucene can support the following functionality:
>> (1) Each field is searchable and indexable.
>> (2) Fields such as STARTTIME and ENDTIME need to be treated as a pair so
>> that we can apply timestamp operations, such as searching by date/time
>> ranges.
>> (3) Fields such as DMA need to be treated as numerical, so that we can use
>> math operators (>, <, =) on those fields.
>>
>> We also use Apache Commons Digester to parse the XML files. So we want to
>> know: can all of the above requirements be supported by combining Digester
>> and Lucene, or do we need other modules in order to support those
>> requirements?
>> If this functionality can be supported, please tell us about the effort
>> involved (i.e., do I need to rewrite 90% of Lucene/Digester to include
>> support for these requirements, or is it more like spending one or two
>> afternoons extending some classes?)
>>
>> <DOCUMENT>
>>   <DREREFERENCE>61926433</DREREFERENCE>
>>   <DREDBNAME>News</DREDBNAME>
>>   <SEGMENTID>61829557</SEGMENTID>
>>   <SHOWID>2051460</SHOWID>
>>   <PROGRAMID>21181</PROGRAMID>
>>   <PROGRAMNAME>Action 10 News This Morning</PROGRAMNAME>
>>   <PREFIX>wthi0600</PREFIX>
>>   <STATIONID>903</STATIONID>
>>   <STATIONNAME>WTHI-TV</STATIONNAME>
>>   <AFFILIATEID>17</AFFILIATEID>
>>   <AFFILIATENAME>CBS</AFFILIATENAME>
>>   <MARKETID>141</MARKETID>
>>   <MARKETNAME>Terre Haute</MARKETNAME>
>>   <MEDIATYPE>T</MEDIATYPE>
>>   <DMA>149</DMA>
>>   <SOURCETYPE>CC</SOURCETYPE>
>>   <STARTTIME>2005-07-04 06:00:00</STARTTIME>
>>   <ENDTIME>2005-07-04 07:00:00</ENDTIME>
>>   <STARTMETER>00:42:53</STARTMETER>
>>   <ENDMETER>00:45:02</ENDMETER>
>>   <DREDATE>2006-01-25 00:00:00</DREDATE>
>>   <DRETITLE>At we take you to break with a look at some of the fourth of
>> July fun going on around the wabash valley today.</DRETITLE>
>>   <DRECONTENT>At we take you to break with a look at some of the fourth of
>> July fun going on around the wabash valley today. This is action 10 news
>> this morning on wthi. He's been the US Attorney general for only a few
>> months. But alberto gonzales may already be in the running for a new job.
>> And not just any job, either.</DRECONTENT>
>> </DOCUMENT>
>>
>>
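
For requirements (2) and (3) in the quoted message, one common approach is to
index dates and numbers as fixed-width strings, so that plain string
comparison gives the same order as chronological or numeric comparison. The
sketch below is not Lucene or Digester code. It is a small Python illustration
of that encoding idea, assuming range queries compare indexed terms as strings.
The helper names (encode_time, encode_int, to_terms, in_range, overlaps) and
the 10-digit padding width are made up for this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Trimmed copy of the sample document from the quoted message.
SAMPLE = """<DOCUMENT>
  <DMA>149</DMA>
  <STARTTIME>2005-07-04 06:00:00</STARTTIME>
  <ENDTIME>2005-07-04 07:00:00</ENDTIME>
</DOCUMENT>"""


def encode_time(text):
    """Turn 'YYYY-MM-DD HH:MM:SS' into 'YYYYMMDDHHMMSS'.

    The result has a fixed width, so string order is chronological order.
    """
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y%m%d%H%M%S")


def encode_int(text, width=10):
    """Zero-pad a non-negative integer so string order matches numeric order."""
    value = int(text)
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return str(value).zfill(width)


def to_terms(xml_text):
    """Parse one document and return the encoded terms that would be indexed."""
    root = ET.fromstring(xml_text)
    fields = {child.tag: (child.text or "").strip() for child in root}
    return {
        "DMA": encode_int(fields["DMA"]),
        "STARTTIME": encode_time(fields["STARTTIME"]),
        "ENDTIME": encode_time(fields["ENDTIME"]),
    }


def in_range(term, low, high):
    """Inclusive range test using plain string comparison."""
    return low <= term <= high


def overlaps(terms, window_start, window_end):
    """Treat STARTTIME/ENDTIME as a pair: does the segment overlap the window?"""
    return terms["STARTTIME"] <= window_end and terms["ENDTIME"] >= window_start


terms = to_terms(SAMPLE)
print(terms)
print(in_range(terms["DMA"], encode_int("100"), encode_int("1000")))
```

Without the padding, the string "149" sorts after "1000", which is why a raw
numeric field gives wrong answers in a string-based range query. The overlaps
check shows the pair requirement as two range conditions, one on each field.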



