lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <>
Subject Re: problems with deleteDocuments
Date Sun, 08 Jul 2007 22:18:48 GMT
First, let me say that I ran a few tests to determine the behavior, so
it's entirely possible someone who actually understands the code will
tell me I'm all wet.

The problem here is that for every scenario in which deleting on partial
field matches would be good, I can create one where it would be bad. Or
worse, disastrous <G>....

The issue isn't deleting multiple documents at once, it's deleting on
partial term matches. In your example, you could have, say, two
IDs for the attachments, one the unique ID and one the parent ID. This
would allow you to remove all the attachments in one delete, and use
a separate delete for the original. Or you could have a "group" id
for an  e-mail and it's attachments, and delete them all with a single
delete on the "group" term. Or.....

But I'd *much* rather have a system that required me to jump through
a hoop when I wanted this kind of behavior than have a system that
allowed me to shoot myself in the foot. Or blow off the whole leg. And
deleting on, say, a text field with a partial term match on "the" is just
too terrible to contemplate <G>....


On 7/8/07, Nadav Har'El <> wrote:
> On Wed, Jul 04, 2007, Erick Erickson wrote about "Re: problems with
> deleteDocuments":
> > Consider what would happen otherwise. Say you have documents
> > with the following values for a field (call it blah).
> > some data
> > some data I put in the index
> > lots of data
> > data
> >
> > Then I don't want deleting on the term blah:data to remove all
> > of them. Which seems to be what you're asking. Even if
> > you restricted things to "phrases", then deleting on the term
> > 'blah:some data' would remove two documents.
> >
> > So, while UN_TOKENIZED isn't a *requirement*, exact total term
> > matches *is* the requirement. By that, I meant that whatever
> > goes into the field must not be broken into pieces by the indexing
> > tokenizer for deletes to work as you expect.
> I disagree, and frankly, am very surprised that "exact total term matches"
> is
> actually a requirement (I never tried it, so you may be absolutely right,
> I
> just hope you aren't).
> Let me give you just one example where id fields containing multiple
> words,
> and the ability for a delete query to match several documents, are useful.
> Consider an application for indexing emails with attachments. The email
> text,
> and each document attachment, is indexed as a separate document. When an
> email is deleted, we also need to delete its attachments. How shall we do
> this? One simple implementation is to have an "id" field for each
> document;
> The email text document will have a unique id, and the attachment document
> will have two ids: its own unique id, and the containing email's id. When
> we need to remove an email and all its attachments, we just remove all
> documents that match the email's id - and this will include the main text
> and the attachments.
> By the way, the method is called "deleteDocuments" - doesn't that imply
> that it's perfectly acceptable to delete many documents with one term?
> --
> Nadav Har'El                        |      Sunday, Jul  8 2007, 22 Tammuz
> 5767
> IBM Haifa Research
> Lab              |-----------------------------------------
>                                     |I am not a complete idiot - some
> parts
>           |are missing.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message