lucene-java-user mailing list archives

From v.se...@lombardodier.com
Subject Re: deleting with sorting and max document
Date Thu, 15 Sep 2011 12:20:30 GMT
Hi,

Our application indexes its logging events as documents. When the 
index reaches a size limit, I want to delete the oldest 1 million events. Since 
the number of events per day varies from day to day, I cannot simply 
delete, say, the oldest 3 days.
Based on your various suggestions, I decided to run a query limited to 
1 million hits, sorted by index order. I take the last document returned, 
read its timestamp, then delete with a new query that adds a criterion on 
the timestamp field. This is good enough.

Thanks, all, for your help,


Vincent


Chris Hostetter <hossman_lucene@fucit.org>
14.09.2011 22:04
Please respond to: java-user@lucene.apache.org
To: java-user@lucene.apache.org, simon.willnauer@gmail.com
Subject: Re: deleting with sorting and max document

: can you provide your query which yields all the documents that you
: want to delete? I don't understand how the sort order changes anything
: here. if you want to only delete the top N docs of that query you
: should maybe modify your query to only return those. I could imagine
: you are returning the oldest first, if so can't you do a range filter
: on top instead of sorting?

i suspect the succinct problem description is something like "i want to 
only have the X newest docs that match query Q in my index, so i want to 
execute Q, find the total number of matches N, and then delete the first 
N-X docs matching Q when sorted by field F"

Hypothetical example: a news aggregation site, with various contracts 
with other news sites that say things like "only allowed to redisplay at 
most 1000 articles from the NY Times at any one time" and the people 
running the site want to always include the 1000 newest NYT articles and 
delete the older ones.


I suspect the most efficient way to deal with this would be to give every 
document a unique id that is guaranteed to always increase.  then decide 
how many docs you need to delete, and execute a query sorting on that id 
field asc using that num docs as the size of a TopSortedDocs, and find the 
id of the "newest" doc that you want to delete, then reformulate the query 
to include a range query on the id field with that value.  if the num of 
docs to delete is too big to deal with TopSortedDocs, then paginate through 
until you get the number you need.
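
The steps above can be sketched in plain Java. This is only a toy model, not Lucene code: the "index" is an in-memory list of ids, and the names IdCutoffSketch, pageAscending, findCutoffId and deleteUpTo are made up for illustration. It assumes ids are non-negative and strictly increasing. In Lucene itself, the sorted search would presumably go through IndexSearcher with a Sort and the delete through IndexWriter.deleteDocuments(Query), but the exact calls depend on the version in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Toy in-memory stand-in for an index; all names here are hypothetical.
class IdCutoffSketch {

    // One "page" of ids in ascending order, strictly greater than afterId.
    static List<Long> pageAscending(List<Long> ids, long afterId, int pageSize) {
        return ids.stream()
                .filter(id -> id > afterId)
                .sorted()
                .limit(pageSize)
                .collect(Collectors.toList());
    }

    // Pages through the oldest ids until toDelete of them have been seen.
    // Returns the id of the "newest" doc to delete, or -1 if there is none.
    static long findCutoffId(List<Long> ids, int toDelete, int pageSize) {
        long cutoff = -1; // assumes real ids are never negative
        int seen = 0;
        while (seen < toDelete) {
            int want = Math.min(pageSize, toDelete - seen);
            List<Long> page = pageAscending(ids, cutoff, want);
            if (page.isEmpty()) {
                break; // fewer docs than requested
            }
            seen += page.size();
            cutoff = page.get(page.size() - 1);
        }
        return cutoff;
    }

    // Stand-in for "delete by range query on the id field": removes id <= cutoff.
    static int deleteUpTo(List<Long> ids, long cutoff) {
        int before = ids.size();
        ids.removeIf(id -> id <= cutoff);
        return before - ids.size();
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long i = 1; i <= 10; i++) {
            ids.add(i * 10); // N = 10 docs with ids 10, 20, ..., 100
        }
        int keep = 4;                     // X newest docs to keep
        int toDelete = ids.size() - keep; // N - X
        long cutoff = findCutoffId(ids, toDelete, 4);
        int deleted = deleteUpTo(ids, cutoff);
        System.out.println("cutoff=" + cutoff + " deleted=" + deleted
                + " remaining=" + ids);
    }
}
```

Because every id is unique, the range delete removes exactly as many docs as were counted while paging.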

(you can do the same thing w/o the unique id using a date field, but you 
run the risk of overdeleting if multiple docs have the same date)
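
To make the overdeleting risk concrete, here is a small plain-Java sketch, again a toy model rather than Lucene code. The class TimestampTieSketch and its method are hypothetical names, and the timestamps are made-up values. The cutoff is taken from the last of the oldest docs, and everything at or below that value is deleted, so docs sharing the cutoff timestamp go too.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of "find the cutoff timestamp, then delete by range".
class TimestampTieSketch {

    // Deletes every entry whose timestamp is <= the timestamp of the
    // toDelete-th oldest entry. Returns how many entries were removed.
    static int deleteOldestByTimestamp(List<Long> timestamps, int toDelete) {
        if (toDelete <= 0 || timestamps.isEmpty()) {
            return 0;
        }
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);
        long cutoff = sorted.get(Math.min(toDelete, sorted.size()) - 1);
        int before = timestamps.size();
        timestamps.removeIf(ts -> ts <= cutoff); // ties at the cutoff are removed too
        return before - timestamps.size();
    }

    public static void main(String[] args) {
        List<Long> timestamps = new ArrayList<>();
        Collections.addAll(timestamps, 100L, 200L, 300L, 300L, 300L, 400L);
        int requested = 3;
        int deleted = deleteOldestByTimestamp(timestamps, requested);
        // Three docs share timestamp 300, so 5 docs are removed instead of 3.
        System.out.println("requested=" + requested + " deleted=" + deleted
                + " remaining=" + timestamps);
    }
}
```

Whether this matters depends on the use case: for bulk trimming of log events a few extra deletions are likely acceptable, whereas a hard "keep exactly X" limit would call for the unique id.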

-Hoss





