lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Duplicate Documents
Date Fri, 11 Sep 2015 17:33:49 GMT
OK, this makes no sense whatsoever, so I'm missing something.

commitWithin shouldn't matter at all; there's code to handle multiple
updates between commits.

I'm _really_ shooting in the dark here, but...

> Did you perhaps change the <uniqueKey> definition from the default "id"
to "key" without blowing away the entire data directory in between?

> Take a look at your schema file through the Admin/UI browser, is it what
you expect? And did you reload/restart after the changes?

> I could get _some_ duplication by changing the field that was my <uniqueKey>
and then adding more docs. Which makes some sense, since some of the Lucene
segment files were created with one definition and some with another. But that
doesn't explain why you _keep_ getting more and more duplicates.
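
If it helps to check the first two points from a script rather than the
Admin UI, here is a minimal sketch. It assumes the read-only Schema API
(/schema/uniquekey) is available on your version and that the uniqueKey
field is indexed so it can be faceted on. The core URL and the helper
names are placeholders of mine, not anything from your setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: placeholder core URL; substitute your own host and core name.
SOLR_CORE_URL = "http://localhost:8983/solr/collection1"


def unique_key_url(core_url):
    """URL of the Schema API endpoint that reports the active <uniqueKey>."""
    return core_url.rstrip("/") + "/schema/uniquekey?wt=json"


def duplicate_facet_url(core_url, key_field):
    """URL of a facet query listing key values held by two or more docs."""
    params = {
        "q": "*:*",
        "rows": 0,             # only the facet counts are needed
        "wt": "json",
        "facet": "true",
        "facet.field": key_field,
        "facet.mincount": 2,   # keep only values that occur more than once
        "facet.limit": -1,     # no cap on the number of values returned
    }
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def parse_duplicates(select_response, key_field):
    """Convert the flat [value, count, value, count, ...] list to a dict."""
    flat = select_response["facet_counts"]["facet_fields"][key_field]
    return dict(zip(flat[0::2], flat[1::2]))


def fetch_json(url):
    """GET a URL and decode the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)


def report(core_url=SOLR_CORE_URL):
    """Return the running uniqueKey field and any duplicated key values."""
    key_field = fetch_json(unique_key_url(core_url))["uniqueKey"]
    select = fetch_json(duplicate_facet_url(core_url, key_field))
    return key_field, parse_duplicates(select, key_field)
```

Calling report() against the live core should show which field Solr
actually treats as the <uniqueKey> and which key values currently have
more than one document. An empty dict would mean the duplicates are
coming from somewhere other than the index itself.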

But this behavior is fundamental to Solr, so I doubt it would have snuck through
without generating very loud howls. Which leaves us wondering what is
unexpected in your setup. Everything you've shown us looks good, so I'm puzzled.

Best,
Erick


On Fri, Sep 11, 2015 at 9:52 AM, Mr Havercamp <mrhavercamp@gmail.com> wrote:
> I'm wondering if the commitWithin is causing issues.
>
> On 11 September 2015 at 18:52, Mr Havercamp <mrhavercamp@gmail.com> wrote:
>
>> Thanks for the suggestions. No, not using MERGEINDEXES nor
>> MapReduceIndexerTool.
>>
>> I've pasted the <add/> XML in case there is something broken there (cut
>> down for brevity, i.e. the "..."):
>>
>> <add overwrite="true" commitWithin="10000"><doc><field
>> name="handle_s">123456789/3</field><field name="title">Test
>> Submission</field><field name="title_sort">Test Submission</field><field
>> name="access">1</field><field name="parent_id">1</field><field
>> name="collection_s">Test Collection</field><field name="collection_fc">test
>> collection|||Test Collection</field><field name="collection_sort">Test
>> Collection</field><field name="dc.contributor.author_fc">young,
>> hayden|||Young, Hayden</field><field name="author">Young,
>> Hayden</field><field name="dc.contributor.author_sm">Young,
>> Hayden</field>...<field name="key">archive.item.1</field>...</doc></add>
>>
>> On 11 September 2015 at 18:06, Erick Erickson <erickerickson@gmail.com>
>> wrote:
>>
>>> Are you by any chance using the MERGEINDEXES
>>> core admin call? Or using MapReduceIndexerTool?
>>>
>>> Neither of those delete duplicates....
>>>
>>> This is a fundamental part of Solr though, so it's
>>> virtually certain that there's some innocent-seeming
>>> thing you're doing that's causing this...
>>>
>>> Best,
>>> Erick
>>>
>>> On Fri, Sep 11, 2015 at 8:55 AM, Shawn Heisey <apache@elyograg.org>
>>> wrote:
>>> > On 9/11/2015 9:10 AM, Mr Havercamp wrote:
>>> >> fieldType def:
>>> >>
>>> >>         <!-- The StrField type is not analyzed, but indexed/stored
>>> >> verbatim. -->
>>> >>         <fieldType name="string" class="solr.StrField"
>>> >> sortMissingLast="true" />
>>> >>
>>> >> It is not SolrCloud.
>>> >
>>> > As long as it's not a distributed index, I can't think of any problem
>>> > those field/type definitions might cause.  Even if it were distributed
>>> > and you had the same document in multiple shards, duplicates should be
>>> > removed at query time, if each shard has the same schema as the others.
>>> >
>>> > I don't have any further ideas.  There may be something wrong that I
>>> > haven't thought of.
>>> >
>>> > Thanks,
>>> > Shawn
>>> >
>>>
>>
>>
