lucene-dev mailing list archives

From "Hoss Man (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3577) rename expungeDeletes
Date Tue, 15 Nov 2011 18:59:52 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150675#comment-13150675 ]

Hoss Man commented on LUCENE-3577:
----------------------------------

bq. If there are just a few deletes in a few small segments, using optimize instead of expungeDeletes
is much more expensive?

that's what i was wondering ... 

in most incrementally updated indexes i've seen for structured content (ie: products,
news, blogs, patents, etc...) the "recent" documents are the only things likely to get updates
(ie: a news story published in the past hour has a decent chance of getting an update, a news
story published yesterday might get a typo fixed, but a news story published a year ago isn't
likely to ever get updated), so in a traditional merged segment structure the newer/smaller
segments are the only ones that tend to have deletes -- the bigger older segments are mostly
stagnant except when involved in merging.  An expungeDeletes call that only touches the small
"recent" segments is going to be a lot faster than a full optimize, correct?
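
To make the cost argument above concrete, here is a rough toy model in Java. It is only a sketch and not Lucene code: the class SegmentCostSketch, the Segment type, and the segment sizes are all hypothetical, and "cost" is simply the number of live documents that would have to be rewritten. It also assumes that every segment containing at least one delete gets rewritten, which a real merge policy may not do.

```java
import java.util.Arrays;
import java.util.List;

// Toy cost model, NOT Lucene code: every name and number here is hypothetical.
public class SegmentCostSketch {

    // A segment with a total document count and a count of deleted documents.
    static final class Segment {
        final String name;
        final int docs;
        final int deleted;

        Segment(String name, int docs, int deleted) {
            this.name = name;
            this.docs = docs;
            this.deleted = deleted;
        }

        int liveDocs() {
            return docs - deleted;
        }
    }

    // Full optimize in this model: every live document in every segment is rewritten.
    static long optimizeCost(List<Segment> segments) {
        long cost = 0;
        for (Segment s : segments) {
            cost += s.liveDocs();
        }
        return cost;
    }

    // Expunging deletes in this model: only segments that contain deletes are rewritten.
    static long expungeCost(List<Segment> segments) {
        long cost = 0;
        for (Segment s : segments) {
            if (s.deleted > 0) {
                cost += s.liveDocs();
            }
        }
        return cost;
    }

    // Two large, stagnant segments and two small, recent segments that hold all the deletes.
    static List<Segment> exampleIndex() {
        return Arrays.asList(
            new Segment("old-1", 1_000_000, 0),
            new Segment("old-2", 500_000, 0),
            new Segment("recent-1", 2_000, 150),
            new Segment("recent-2", 500, 40));
    }

    public static void main(String[] args) {
        List<Segment> index = exampleIndex();
        System.out.println("docs rewritten by optimize: " + optimizeCost(index));
        System.out.println("docs rewritten by expunge:  " + expungeCost(index));
    }
}
```

With deletes confined to the two small segments, the model rewrites a few thousand documents for the expunge case and about 1.5 million for the full optimize. If deletes were spread across the large segments as well, the two costs would converge.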

bq. Although, it doesn't really seem like an important use case (ensuring there are no deletes).

I'm constantly surprised by the number of people who are really picky about ensuring that
their tf/idf numbers are *exact* because they use them in a weird way -- it's definitely an
expert level concern, but if those people are willing to spend the time expunging deletes
and we already have the code, we might as well leave it in, right?

i think this is really just a question of naming/documentation: the method doesn't sound as
sexy as optimize, but if someone stumbles upon it and thinks "oh wow, i guess i have to call
this for my deletes to really be deleted" that's bad.  likewise the javadocs encourage/imply
that this method *should* be called, instead of just explaining that it *can* be called
and what it does.

I don't have a good suggestion for the name, but the doc is really the issue...

{quote}
...When an index has many document deletions (or updates to existing documents), it's best
to either call optimize or expungeDeletes to remove all unused data in the index associated
with the deleted documents. To see how many deletions you have pending in your index, call
IndexReader.numDeletedDocs() This saves disk space and memory usage while searching. ...
{quote}

...nothing in that description describes the downsides/cost of the method.
                
> rename expungeDeletes
> ---------------------
>
>                 Key: LUCENE-3577
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3577
>             Project: Lucene - Java
>          Issue Type: Task
>            Reporter: Robert Muir
>
> Similar to optimize(), expungeDeletes() has a misleading name.
> We already had problems with this on the user list because TieredMergePolicy
> didn't 'expunge' all their deletes.
> Also I think expunge is the wrong word, because expunge makes it seem
> like you just wrangle up the deletes and kick them out of the party and
> that it should be fast.
