lucene-solr-user mailing list archives

From Mr Havercamp <mrhaverc...@gmail.com>
Subject Re: Duplicate Documents
Date Sat, 12 Sep 2015 16:51:41 GMT
Unfortunately, <uniqueKey/> has never changed. The issue can take some time
to show itself, although I think there were logic issues with the way I
update documents in my index.

I first do a full purge and reindex of all items without issue.

Over time, I only index items that have changed or are new since the initial
reindex. However, I start to see duplicates appear, which is strange because
I use a combination of <uniqueKey/> plus overwrite="true", which should
guarantee uniqueness.

However, I have been using the /admin/luke lastModified date to check for
items which have been added/updated after this date but have just realized
that lastModified will only change if I a) reindex everything or b) call
optimize, so I have been retrieving items which have already been added to
the index. I think explicitly storing the last run time (in a file/db
field) will ensure I only retrieve those items which have changed since the
last index. This will also go a long way to solving the duplication issue.
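
For illustration, a minimal sketch of what I mean by storing the last run time explicitly, instead of relying on the /admin/luke lastModified value. This is only an assumption about how it could look: the file name, the JSON layout, the "modified" field on each item, and the function names are all hypothetical and not part of Solr or my existing indexer.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def load_last_run(path):
    """Return the stored last-run time, or None if no run has been recorded."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return datetime.fromisoformat(json.load(fh)["last_run"])
    except FileNotFoundError:
        return None


def save_last_run(path, when):
    """Write the run time to a temp file, then rename it over the target,
    so an interrupted write cannot leave a half-written state file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"last_run": when.isoformat()}, fh)
    os.replace(tmp, path)


def select_changed(items, last_run):
    """Keep items modified after last_run; with no recorded run, keep all."""
    if last_run is None:
        return list(items)
    return [item for item in items if item["modified"] > last_run]
```

The idea would be to note datetime.now(timezone.utc) before fetching items, index whatever select_changed returns, and call save_last_run only after the update has succeeded, so that a failed run is retried from the same point next time.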

Thanks again


Hayden

On 11 September 2015 at 19:33, Erick Erickson <erickerickson@gmail.com>
wrote:

> OK, this makes no sense whatsoever, so I'm missing something.
>
> commitWithin shouldn't matter at all, there's code to handle multiple
> updates between commits.
>
> I'm _really_ shooting in the dark here, but...
>
> > did you perhaps change the <uniqueKey> definition from the default "id"
> to "key" without blowing away the entire data directory in between?
>
> > Take a look at your schema file through the Admin/UI browser, is it what
> you expect? And did you reload/restart after the changes?
>
> > I could get _some_ duplication by changing the field that was my
> <uniqueKey> and then adding more docs. Which makes some sense since some of
> the Lucene segment files were created with one definition and some with
> another. But that doesn't explain why you _keep_ getting more and more
> duplicates.
>
> But this behavior is fundamental Solr, so I doubt it would have snuck
> through or not generated very loud howls. Which leaves us with wondering
> what is unexpected in your setup. Everything you've shown us looks good,
> so I'm puzzled.
>
> Best,
> Erick
>
>
> On Fri, Sep 11, 2015 at 9:52 AM, Mr Havercamp <mrhavercamp@gmail.com>
> wrote:
> > I'm wondering if the commitWithin is causing issues.
> >
> > On 11 September 2015 at 18:52, Mr Havercamp <mrhavercamp@gmail.com>
> wrote:
> >
> >> Thanks for the suggestions. No, not using MERGEINDEXES nor
> >> MapReduceIndexerTool.
> >>
> >> I've pasted the <add/> XML in case there is something broken there (cut
> >> down for brevity, i.e. the "..."):
> >>
> >> <add overwrite="true" commitWithin="10000"><doc><field
> >> name="handle_s">123456789/3</field><field name="title">Test
> >> Submission</field><field name="title_sort">Test Submission</field><field
> >> name="access">1</field><field name="parent_id">1</field><field
> >> name="collection_s">Test Collection</field><field name="collection_fc">test
> >> collection|||Test Collection</field><field name="collection_sort">Test
> >> Collection</field><field name="dc.contributor.author_fc">young,
> >> hayden|||Young, Hayden</field><field name="author">Young,
> >> Hayden</field><field name="dc.contributor.author_sm">Young,
> >> Hayden</field>...<field name="key">archive.item.1</field>...</doc></add>
> >>
> >> On 11 September 2015 at 18:06, Erick Erickson <erickerickson@gmail.com>
> >> wrote:
> >>
> >>> Are you by any chance using the MERGEINDEXES
> >>> core admin call? Or using MapReduceIndexerTool?
> >>>
> >>> Neither of those delete duplicates....
> >>>
> >>> This is a fundamental part of Solr though, so it's
> >>> virtually certain that there's some innocent-seeming
> >>> thing you're doing that's causing this...
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Fri, Sep 11, 2015 at 8:55 AM, Shawn Heisey <apache@elyograg.org>
> >>> wrote:
> >>> > On 9/11/2015 9:10 AM, Mr Havercamp wrote:
> >>> >> fieldType def:
> >>> >>
> >>> >>         <!-- The StrField type is not analyzed, but indexed/stored
> >>> >> verbatim. -->
> >>> >>         <fieldType name="string" class="solr.StrField"
> >>> >> sortMissingLast="true" />
> >>> >>
> >>> >> It is not SolrCloud.
> >>> >
> >>> > As long as it's not a distributed index, I can't think of any problem
> >>> > those field/type definitions might cause.  Even if it were distributed
> >>> > and you had the same document in multiple shards, duplicates should be
> >>> > removed at query time, if each shard has the same schema as the others.
> >>> >
> >>> > I don't have any further ideas.  There may be something wrong that I
> >>> > haven't thought of.
> >>> >
> >>> > Thanks,
> >>> > Shawn
> >>> >
> >>>
> >>
> >>
>
