lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doron Cohen <DOR...@il.ibm.com>
Subject Re: lucene functionality
Date Wed, 13 Dec 2006 19:09:27 GMT
Lucene RangeQuery would do for the "time" and "numeric" reqs.

"Mark Mei" <vmslucene@gmail.com> wrote:
> At the bottom of this email is the sample xml file that we are using
today.
> We have about 10 million of these.
>
> We need to know whether Lucene can support the following functionalities.
> (1) Each field is searchable and indexable.

This is natural in Lucene.

> (2) Fields such as STARTTIME and ENDTIME need to be treated as a pair so
> that we can apply timestamp operation such as search by data time ranges

You should be able to do it with two fields - start/end - and then you have
all the flexibility for queries. Conditions can be set on either one of
these or on both. In case that both are used, i.e. a doc matches only if
its start-to-end range falls within a certain minStart maxEnd range, you
would need two open ended range queries (or range filters) - one to apply
on the start date and one to apply on the end date, because a single range
query values must have the same field name. Also notice that open ended
range queries must be created programmatically - current QueryParser does
not support this. See using DateTools for saving the time values in a
resolution that matches your needs.

> (3) Fields such as DMA need to be treated as numerical and be able to use
> math operators ( > < =) for those fields.

Same comments on range queries / filters apply. Be aware though that
comparison is lexicographic, so numeric values should be indexed as
strings. See NumberTools.

>
> We also use Apache Commons Digester to parse the xml files. So we want to
> know, can all of the above requirements be supported by combining both
> Digester and Lucene together, or do we need other modules in order for us
to
> support those requirements?
> If these functionalities can be supported, please tell us about the
effort
> involved (ie, do I need to rewrite 90% of Lucene/Digester to include
support
> for these requirements, or is it more like spending one/two afternoons
> extending some classes ? )

Apart from the XML handling (don't know about that), for the other reqs it
is just using Lucene's API.

>
> <DOCUMENT>
>   <DREREFERENCE>61926433</DREREFERENCE>
>   <DREDBNAME>News</DREDBNAME>
>   <SEGMENTID>61829557</SEGMENTID>
>   <SHOWID>2051460</SHOWID>
>   <PROGRAMID>21181</PROGRAMID>
>   <PROGRAMNAME>Action 10 News This Morning</PROGRAMNAME>
>   <PREFIX>wthi0600</PREFIX>
>   <STATIONID>903</STATIONID>
>   <STATIONNAME>WTHI-TV</STATIONNAME>
>   <AFFILIATEID>17</AFFILIATEID>
>   <AFFILIATENAME>CBS</AFFILIATENAME>
>   <MARKETID>141</MARKETID>
>   <MARKETNAME>Terre Haute</MARKETNAME>
>   <MEDIATYPE>T</MEDIATYPE>
>   <DMA>149</DMA>
>   <SOURCETYPE>CC</SOURCETYPE>
>   <STARTTIME>2005-07-04 06:00:00</STARTTIME>
>   <ENDTIME>2005-07-04 07:00:00</ENDTIME>
>   <STARTMETER>00:42:53</STARTMETER>
>   <ENDMETER>00:45:02</ENDMETER>
>   <DREDATE>2006-01-25 00:00:00</DREDATE>
>   <DRETITLE>At we take you to break with a look at some of the fourth of
> July fun going on around the wabash valley today.</DRETITLE>
>   <DRECONTENT>At we take you to break with a look at some of the fourth
of
> July fun going on around the wabash valley today. This is action 10 news
> this morning on wthi. He's been the US Attorney general for only a few
> months. But alberto gonzales may already be in the running for a new job.
> And not just any job, either.</DRECONTENT>
> </DOCUMENT>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message