lucene-solr-user mailing list archives

From Jay Parashar <>
Subject RE: Multiple data-config.xml in one collection?
Date Thu, 14 Apr 2016 16:55:52 GMT
Thanks a lot Daniel.

-----Original Message-----
From: Davis, Daniel (NIH/NLM) [C] [] 
Sent: Thursday, April 14, 2016 11:41 AM
Subject: RE: Multiple data-config.xml in one collection?

Jay Parashar wrote:
> One thing, and please correct me if I am wrong: I have noticed that running
> DataImport for a particular config overwrites the existing data for a
> document; that is, there is no way to preserve the existing data.
> For example, if you have a schema of 5 fields and running the
> health-topics-conf.xml DIH loads 3 of those fields of a document
> (id=XYZ), then running the encyclopedia-conf.xml DIH will overwrite
> those 3 fields for the same document id = XYZ.

Not quite so.   You're right that each RequestHandler has a *default* data config,
specified in solrconfig.xml.   As with most things in Solr, this can be overridden, but it
is still good practice.   You are also right that if one DataImport imports the same ID as
another, it will overwrite the older copy completely.   However, you can control the overlap
so that indexing is independent even into the same collection.

Suppose you have two configured request handlers:

	/dataimport/healthtopics - this uses health-topics-conf.xml
	/dataimport/encyclopedia - this uses encyclopedia-conf.xml
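
A minimal sketch of how those two handlers might be declared in solrconfig.xml. The handler names and config file names are the ones listed above; the rest is the usual Data Import Handler registration and is an assumption about this setup, not a copy of the actual file:

```xml
<!-- Sketch only: one DIH request handler per data config file. -->
<requestHandler name="/dataimport/healthtopics"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- Default data config for this handler -->
    <str name="config">health-topics-conf.xml</str>
  </lst>
</requestHandler>

<requestHandler name="/dataimport/encyclopedia"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">encyclopedia-conf.xml</str>
  </lst>
</requestHandler>
```

With a layout like this, each import is triggered on its own handler, for example by requesting /dataimport/healthtopics with command=full-import on the collection, so only the chosen config runs.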

These two files can load *completely separate records* with different ids, and they can 
have different delete queries configured.   An excerpt from my health-topics-conf.xml:

    <dataSource type="FileDataSource" name="filesrc" />

       <!-- Snipping here to omit actual file paths -->

          <!--
          This entity match runs XSL on the entire file, and then streams
          the results as needed, one Solr document at a time.   It uses
          Solr's UpdateXmlMessages XML schema and then further enhances
          this using transformers.
          -->
        <entity name="topic" dataSource="filesrc" ...>

          <field column="source" template="health-topics"/>
          <field column="group" template="Health Topics"/>
          <field column="organization_t" template="National Library of Medicine"/>

          <!-- Snipping the remainder of the XML -->
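
The per-source delete query mentioned above could look roughly like the following. This is a sketch under assumptions: the excerpt does not show which attribute is actually used, and preImportDeleteQuery is chosen here because DIH applies it on a root entity before a full-import instead of the default delete of everything:

```xml
<!-- Sketch only: attributes other than name, dataSource and the source
     field are assumptions, not taken from the real health-topics-conf.xml. -->
<entity name="topic"
        dataSource="filesrc"
        transformer="TemplateTransformer"
        preImportDeleteQuery="source:health-topics">
  <!-- Every document from this config is stamped with its source -->
  <field column="source" template="health-topics"/>
</entity>
```

Because the delete query is scoped to source:health-topics, a full import through this config would clear only its own documents and leave those indexed by encyclopedia-conf.xml in place.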

Hope this helps,

Dan Davis, Systems/Applications Architect (Contractor), Office of Computer and Communications
Systems, National Library of Medicine, NIH

-----Original Message-----
From: Jay Parashar []
Sent: Thursday, April 14, 2016 11:43 AM
Subject: RE: Multiple data-config.xml in one collection?

You have to specify which one to run. Each DIH will run only one XML file (e.g. health-topics-conf.xml).

-----Original Message-----
From: Yangrui Guo []
Sent: Tuesday, April 05, 2016 2:16 PM
Subject: Re: Multiple data-config.xml in one collection?

Hi Daniel,

So if I implement multiple DataImportHandlers and do a full import, does Solr run the import
for all handlers at once, or can I specify which handler to import? Thank you


On Tuesday, April 5, 2016, Davis, Daniel (NIH/NLM) [C] <> wrote:

> If Shawn is correct, and you are using DIH, then I have done this by 
> implementing multiple requestHandlers each of them using Data Import 
> Handler, and have each specify a different XML file for the data config.
> Instead of using data-config.xml, I've used a large number of files such as:
>         health-topics-conf.xml
>         encyclopedia-conf.xml
>         ...
> I tend to index a single-valued, required field named "source" that I
> can use in the delete query, and I use the TemplateTransformer to make this easy:
> <entity name="topic"
>     ...
>    transformer="TemplateTransformer">
>    <field column="source" template="health-topics" />
>    ...
> Hope this helps,
> -Dan
> -----Original Message-----
> From: Shawn Heisey []
> Sent: Tuesday, April 05, 2016 10:50 AM
> Subject: Re: Multiple data-config.xml in one collection?
> On 4/5/2016 8:12 AM, Yangrui Guo wrote:
> > I'm using Solr Cloud to index a number of databases. The problem is
> > there is an unknown number of databases and each database has its own
> > configuration.
> > If I create a single collection for every database the query would
> > eventually become insanely long. Is it possible to upload different
> > config to zookeeper for each node in a single collection?
> Every shard replica (core) in a collection shares the same 
> configuration, which it gets from zookeeper.  This is one of 
> SolrCloud's guarantees, to prevent problems found with old-style 
> sharding when the configuration is different on each machine.
> If you're using the dataimport handler, which you probably are since 
> you mentioned databases, you can parameterize pretty much everything 
> in the DIH config file so it comes from URL parameters on the 
> full-import or delta-import command.
> Below is a link to the DIH config that I'm using, redacted slightly.
> I'm not running SolrCloud, but the same thing should work in cloud.  
> It should give you some idea of how to use variables in your config, 
> set by parameters on the URL.
> CwIBaQ&c=uGuXJ43KPkPWEl2imVFDmZQlhQUET7pVRA2PDIOxgqw&r=bRfqJEeedEKG5nk
> p5748YxbNMFrUYT3YiNl0Ni2vUBQ&m=ps8KnPZhgym3oVyuWub8JT0eZI39W0FLsBW4fx5
> 61NY&s=k7H8l9XT7yyH_KHFtnIi793EtkLZnUvOz3lZA1mV01s&e=
> Thanks,
> Shawn
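
Shawn's suggestion of parameterizing the DIH config from URL parameters can be sketched as follows. This is not his redacted config (the link did not survive in the archive); the parameter names dbHost and dbName, the JDBC driver, and the table and column names are hypothetical, and credentials are left out. The only DIH feature relied on is that request parameters are available in the config as ${dataimporter.request.NAME}:

```xml
<dataConfig>
  <!-- Sketch only: dbHost and dbName are hypothetical request parameters;
       user and password attributes are omitted here. -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://${dataimporter.request.dbHost}/${dataimporter.request.dbName}"/>
  <document>
    <!-- Hypothetical table and columns -->
    <entity name="item" query="SELECT id, title FROM items">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>
```

A request such as command=full-import&dbHost=db1.example.com&dbName=catalog on the handler would then point the same config at a different database on each run.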