lucene-solr-user mailing list archives

From Garth Grimm <GarthGr...@averyranchconsulting.com>
Subject Re: Different ids for the same document in different replicas.
Date Thu, 13 Nov 2014 13:07:59 GMT
OK.  So it sounds like doctorURL is a good key, but you don’t like the special characters.
 I’ve used MD5 hashes of URLs before as a way to convert unique URLs into unique alphanumeric
strings in a repeatable way.  I think most programming languages contain libraries for doing
that as you feed the data to Solr (Java certainly does).  A reversible encoding, rather than
a hash, could be used if you wanted to be able to programmatically convert from the doctorURL
to the string you want to use and back again.
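
A rough sketch of both options in Java, assuming MD5 via java.security.MessageDigest for the one-way key and URL-safe Base64 for the reversible one. The class name UrlKeys, the method names, and the example URL are made up for illustration; none of this is a Solr or SolrJ API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical helper class; not part of Solr or SolrJ.
final class UrlKeys {

    // One-way: MD5 digest of the URL, rendered as 32 lowercase hex characters.
    static String md5Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    // Reversible: URL-safe Base64 without padding (letters, digits, '-' and '_').
    static String encode(String url) {
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(url.getBytes(StandardCharsets.UTF_8));
    }

    // Inverse of encode(): recovers the original URL from the key.
    static String decode(String key) {
        return new String(Base64.getUrlDecoder().decode(key), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String doctorUrl = "http://example.com/doctors?id=42&name=J%20Smith"; // made-up URL
        System.out.println(md5Hex(doctorUrl));
        System.out.println(encode(doctorUrl));
        System.out.println(decode(encode(doctorUrl)).equals(doctorUrl));
    }
}
```

The hex form is purely alphanumeric but can't be turned back into the URL; the Base64 form can, at the cost of a longer key that may contain '-' and '_'.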

Anyway, the point there being that you have a repeatable unique key that is derived directly
from the data you’re storing.  Not a random ID value that will be different every time you
feed the same thing in.
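
To illustrate the difference, here is a small sketch using java.util.UUID: nameUUIDFromBytes derives a (type 3) UUID from the input bytes, so it is repeatable, while randomUUID produces a new value on every call. The class name and the URL are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical demo class contrasting a derived key with a random one.
final class DerivedVsRandomId {

    // Name-based UUID: the same URL always yields the same id.
    static String derivedId(String url) {
        return UUID.nameUUIDFromBytes(url.getBytes(StandardCharsets.UTF_8)).toString();
    }

    // Random UUID: a fresh id on every call, unrelated to the document.
    static String randomId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        String doctorUrl = "http://example.com/doctors?id=42"; // made-up URL
        // Re-feeding the same document produces the same derived key...
        System.out.println(derivedId(doctorUrl).equals(derivedId(doctorUrl)));
        // ...whereas two random ids for the same document will not match.
        System.out.println(randomId().equals(randomId()));
    }
}
```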

BTW, you can certainly use a custom field type to do the hashing work, but I’d suggest you
do that before feeding the data to SolrCloud.  If you do it outside of SolrCloud, then SolrCloud
can use it for routing to the correct shard.  If you try to do it solely in a field type,
the field type output won’t be available until the indexing is actually occurring, which
is too late for routing purposes.  And that means you can’t ensure that subsequent re-feeds
of the same thing will overwrite the old values since you can’t make sure they get routed
to the same shard.

> On Nov 12, 2014, at 7:50 PM, Meraj A. Khan <merajak@gmail.com> wrote:
> 
> Sorry, it's actually doctorUrl, and I don't want to use doctorUrl as a lookup
> mechanism because URLs can have special characters that can cause issues
> with Solr lookup.
> 
> I guess I should rephrase my question to: how do I auto generate the unique
> keys in the id field when using SolrCloud?
> On Nov 12, 2014 7:28 PM, "Garth Grimm" <GarthGrimm@averyranchconsulting.com>
> wrote:
> 
>> You mention you already have a unique Key identified for the data you’re
>> storing in Solr:
>> 
>>> <uniqueKey>doctorId</uniqueKey>
>> 
>> If that’s the field you’re using to uniquely identify each thing you’re
>> storing in the solr index, why do you want to have an id field that is
>> populated with some random value?  You’ll be using the doctorId field as
>> the key, and the id field will have no real meaning in your Data Model.
>> 
>> If doctorId actually isn’t unique to each item you plan on storing in
>> Solr, is there any other field that is?  If so, use that field as your
>> unique key.
>> 
>> Remember, these uniqueKeys are usually used for routing documents to shards
>> in SolrCloud, and are used to ensure that later updates of the same “thing”
>> overwrite the old one, rather than generating multiple copies.  So the keys
>> really should be something derived from the data you’re storing.  I’m not
>> sure I understand why you would want to have the key randomly generated.
>> 
>>> On Nov 12, 2014, at 6:39 PM, S.L <simpleliving016@gmail.com> wrote:
>>> 
>>> Just tried adding <uniqueKey>id</uniqueKey> while keeping id type="string";
>>> only blank ids are being generated. It looks like the id is auto generated
>>> only if the id is set to type uuid, but in the case of SolrCloud this id
>>> will be unique per replica.
>>> 
>>> Is there a way to generate a unique id in SolrCloud without using the
>>> uuid type, or without ending up with a per-replica unique id?
>>> 
>>> The uuid type in question is:
>>> 
>>> <fieldType name="uuid" class="solr.UUIDField" indexed="true" />
>>> 
>>> 
>>> On Wed, Nov 12, 2014 at 6:20 PM, S.L <simpleliving016@gmail.com> wrote:
>>> 
>>>> Thanks.
>>>> 
>>>> So the issue here is I already have a <uniqueKey>doctorId</uniqueKey>
>>>> defined in my schema.xml.
>>>> 
>>>> If along with that I also want the <id></id> field to be automatically
>>>> generated for each document, do I have to declare it as a <uniqueKey> as
>>>> well? I just tried the following setting without the uniqueKey for id,
>>>> and it is only generating blank ids for me.
>>>> 
>>>> *schema.xml*
>>>> 
>>>>       <field name="id" type="string" indexed="true" stored="true"
>>>>           required="true" multiValued="false" />
>>>> 
>>>> *solrconfig.xml*
>>>> 
>>>>     <updateRequestProcessorChain name="uuid">
>>>> 
>>>>       <processor class="solr.UUIDUpdateProcessorFactory">
>>>>           <str name="fieldName">id</str>
>>>>       </processor>
>>>>       <processor class="solr.RunUpdateProcessorFactory" />
>>>>   </updateRequestProcessorChain>
>>>> 
>>>> 
>>>> On Tue, Nov 11, 2014 at 7:47 PM, Garth Grimm <
>>>> GarthGrimm@averyranchconsulting.com> wrote:
>>>> 
>>>>> Looking a little deeper, I did find this about UUIDField
>>>>> 
>>>>> 
>>>>> 
>>>>> http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/schema/UUIDField.html
>>>>> 
>>>>> "NOTE: Configuring a UUIDField instance with a default value of "NEW" is
>>>>> not advisable for most users when using SolrCloud (and not possible if the
>>>>> UUID value is configured as the unique key field) since the result will be
>>>>> that each replica of each document will get a unique UUID value. Using
>>>>> UUIDUpdateProcessorFactory
>>>>> <http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/UUIDUpdateProcessorFactory.html>
>>>>> to generate UUID values when documents are added is recommended instead."
>>>>> 
>>>>> That might describe the behavior you saw.  And the use of
>>>>> UUIDUpdateProcessorFactory to auto generate IDs seems to be covered well
>>>>> here:
>>>>> 
>>>>> http://solr.pl/en/2013/07/08/automatically-generate-document-identifiers-solr-4-x/
>>>>> 
>>>>> Though I’ve not actually tried that process before.
>>>>> 
>>>>> On Nov 11, 2014, at 7:39 PM, Garth Grimm <GarthGrimm@averyranchconsulting.com> wrote:
>>>>> 
>>>>> “uuid” isn’t an out of the box field type that I’m familiar with.
>>>>> 
>>>>> Generally, I’d stick with the out of the box advice of the schema.xml
>>>>> file, which includes things like….
>>>>> 
>>>>> <!-- Only remove the "id" field if you have a very good reason to.
>>>>>   While not strictly required, it is highly recommended. A <uniqueKey>
>>>>>   is present in almost all Solr installations. See the <uniqueKey>
>>>>>   declaration below where <uniqueKey> is set to "id".
>>>>> -->
>>>>> <field name="id" type="string" indexed="true" stored="true"
>>>>>   required="true" multiValued="false" />
>>>>> 
>>>>> and…
>>>>> 
>>>>> <!-- Field to use to determine and enforce document uniqueness.
>>>>>    Unless this field is marked with required="false", it will be a
>>>>>    required field
>>>>> -->
>>>>> <uniqueKey>id</uniqueKey>
>>>>> 
>>>>> If you’re creating some key/value pair with uuid as the key as you feed
>>>>> documents in, and you know that the uuid values you’re creating are
>>>>> unique, just change the field name and unique key name from ‘id’ to
>>>>> ‘uuid’.  Or change the key name you send in from ‘uuid’ to ‘id’.
>>>>> 
>>>>> On Nov 11, 2014, at 7:18 PM, S.L <simpleliving016@gmail.com> wrote:
>>>>> 
>>>>> Hi All,
>>>>> 
>>>>> I am seeing interesting behavior on the replicas. I have a single
>>>>> shard and six replicas on SolrCloud 4.10.1. I only have a small
>>>>> number of documents (~375) that are replicated across the six replicas.
>>>>> 
>>>>> The interesting thing is that the same document has a different id in
>>>>> each one of those replicas.
>>>>> 
>>>>> This is causing the fq(id:xyz) type queries to fail, depending on
>>>>> which replica the query goes to.
>>>>> 
>>>>> I have specified the id field in the following manner in schema.xml.
>>>>> Is it the right way to specify an auto generated id in SolrCloud?
>>>>> 
>>>>>     <field name="id" type="uuid" indexed="true" stored="true"
>>>>>         required="true" multiValued="false" />
>>>>> 
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 
>> 
