lucene-java-user mailing list archives

From "Marcus Falck" <>
Subject SV: SV: Sort problematics
Date Thu, 18 May 2006 22:23:24 GMT
Hi Gunther.
We thought in terms of an index containing the search profiles, and of searching that index using
the documents as queries, but we couldn't really figure it out. We have an alert service up
and running today using Verity's implementation of alerts. So we looked at the Verity documentation
and realised that they don't handle alerts using an inverted index. So we implemented
our new alert service the same way the Verity service works today.
That seems to work nicely, but if you have a concrete solution for how to achieve an inverted
index storing fairly complex queries, you are more than welcome to share it.
What I want to accomplish is a central index for a lot of large backend systems containing
a lot of articles. For example, news polled from the web, newspapers delivered to us in
electronic form, and third-party document databases.
So what we have done is to implement a search engine using Lucene as the core. This engine
is scalable both in terms of range and round-robin/range. Fetcher applications fetch documents
from different storages, transform those documents into a common format and then
distribute them to all search machines matching that range.
The range clustering is built on date ranges. Since we are going to buy document databases
from other companies, we can't guarantee that all data will be added in date order.
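
A minimal sketch of the date-range routing described above, assuming each range is identified by its first day and served by a list of machines. The class name `DateRangeRouter`, its methods and the machine names are hypothetical and not taken from the actual system. Because the lookup uses only the document's own date, documents can arrive in any order.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical router: maps a document's date to the machines that own that date range. */
class DateRangeRouter {
    // Key = first day of a range (inclusive); the range runs until the next key.
    private final TreeMap<LocalDate, List<String>> ranges = new TreeMap<>();

    /** Registers the machines that serve the range starting at the given day. */
    void addRange(LocalDate startInclusive, List<String> machines) {
        ranges.put(startInclusive, new ArrayList<>(machines));
    }

    /** Returns every machine responsible for the given date, or an empty list if none. */
    List<String> machinesFor(LocalDate documentDate) {
        // floorEntry finds the latest range start that is not after the document date.
        Map.Entry<LocalDate, List<String>> entry = ranges.floorEntry(documentDate);
        List<String> result = new ArrayList<>();
        if (entry != null) {
            result.addAll(entry.getValue());
        }
        return result;
    }
}
```

A fetcher would call `machinesFor(date)` for each transformed document and send it to every machine in the returned list, which covers the round-robin replicas within one range.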

The volumes of data we are talking about are around 500 million news articles.
The end user, and a lot of our internal processes for value-adding services, then define
a search query for the things they want to monitor. In the end user's case this is called an "agent".
When the user logs in to the system and clicks on an agent, the matching
articles are presented in DATE order (newest first). The date order is critical.
Relevance is not important since we have value-added services such as quality control of the
So the last thing to do in order to get a fully functional proof of concept up is to fix the
date-order presentation. And since it's a lot of data and the IndexSearcher will be recreated
pretty often, we will need to change the Lucene scoring/ranking. And I can't understand why
this should be so hard? But I don't have any clue what the best practices for doing so
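
The newest-first requirement can be sketched independently of Lucene as a bounded collector that ignores relevance and keeps only the N most recent hits. This is an illustration under assumptions, not Lucene's API: `DatedHit` and `NewestFirstCollector` are hypothetical names, and the date is assumed to be available per hit as epoch milliseconds.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical hit: a document id plus its publication date as epoch milliseconds. */
class DatedHit {
    final String docId;
    final long dateMillis;

    DatedHit(String docId, long dateMillis) {
        this.docId = docId;
        this.dateMillis = dateMillis;
    }
}

/** Keeps the `limit` newest hits seen so far; every match counts equally (no relevance). */
class NewestFirstCollector {
    private final int limit;
    // Min-heap on date: the head is the oldest hit currently kept.
    private final PriorityQueue<DatedHit> heap;

    NewestFirstCollector(int limit) {
        if (limit < 1) {
            throw new IllegalArgumentException("limit must be at least 1");
        }
        this.limit = limit;
        this.heap = new PriorityQueue<>(limit, Comparator.comparingLong((DatedHit h) -> h.dateMillis));
    }

    /** Offers one matching document; it is kept only if it is among the newest so far. */
    void collect(DatedHit hit) {
        if (heap.size() < limit) {
            heap.add(hit);
        } else if (hit.dateMillis > heap.peek().dateMillis) {
            heap.poll();   // drop the oldest kept hit
            heap.add(hit);
        }
    }

    /** Returns the kept hits ordered newest first. */
    List<DatedHit> newestFirst() {
        List<DatedHit> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingLong((DatedHit h) -> h.dateMillis).reversed());
        return result;
    }
}
```

Memory stays proportional to the page size rather than to the number of matches, which matters at the data volumes mentioned above. Lucene itself also ships sort-by-field support (`Sort`, `SortField`); whether its warm-up cost is acceptable with a frequently recreated IndexSearcher would need to be measured.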


From: Günther Starnberger []
Sent: Thu 2006-05-18 23:22
Subject: Re: SV: Sort problematics

On Thu, May 18, 2006 at 10:53:23PM +0200, Marcus Falck wrote:


> The term scorer will give a higher score to documents containing both
> terms. This is a problem (in our application) since in this case we want
> the same score for documents as long as they contain 1 of the terms
> (since we are dealing with newsletter observation for companies, they
> want to get the hits ordered by date to get the complete overview).  I
> tried rewriting the TermScorer to give me the same score, with
> success. So my question is.

What exactly do you want to achieve with your application?

You speak of "immediate alerts". I understand this as: Your users
specify keywords or queries and when you receive a new document which
matches a query you alert the user.

Is this what you want to do? If so, I don't think that Lucene is useful
for this kind of real-time query. Instead of using an inverted index
it would make more sense to use a normal index which contains the
terms you search for. If you receive a new document, make a lookup on
each term of the document using the index. It _might_ be possible to
do this with Lucene by storing the search terms as documents and using
the documents which you receive as queries, but I guess it isn't
that trivial.
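
The per-term lookup suggested above could look roughly like the following sketch. It assumes the simplest case, where each search profile is a set of single terms and matching any one of them triggers the alert; complex queries are not covered. `TermAlertIndex`, its methods and the agent ids are hypothetical names.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical alert index: maps each watched term to the agents that watch it. */
class TermAlertIndex {
    private final Map<String, Set<String>> agentsByTerm = new HashMap<>();

    /** Registers an agent for every given term (terms are matched case-insensitively). */
    void register(String agentId, String... terms) {
        for (String term : terms) {
            agentsByTerm
                .computeIfAbsent(term.toLowerCase(Locale.ROOT), k -> new TreeSet<>())
                .add(agentId);
        }
    }

    /** Looks up each term of an incoming document and returns the agents to alert. */
    Set<String> match(String documentText) {
        Set<String> matched = new TreeSet<>();
        // Naive tokenizer: split on anything that is not a letter or a digit.
        for (String token : documentText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            Set<String> agents = agentsByTerm.get(token);
            if (agents != null) {
                matched.addAll(agents);
            }
        }
        return matched;
    }
}
```

The cost per incoming document depends on the number of tokens in the document, not on the number of registered agents, which is the point of indexing the search terms rather than the documents.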

If you need a combination of traditional search and real-time alerts a
hybrid solution may make sense. But using Lucene for real-time search
isn't a good idea (at least IMO).

