lucene-solr-user mailing list archives

From "Davis, Daniel (NIH/NLM) [C]" <daniel.da...@nih.gov>
Subject RE: Multiple data-config.xml in one collection?
Date Tue, 05 Apr 2016 21:39:43 GMT
Yangrui,

Solr will run only the one data import you invoke. You can have a script invoke more than
one, and they will run concurrently. There are some risks with that, depending on what you
are doing. If it's just pulling from a database, I think you are all right. I've even had 4
run concurrently to make Data Import Handler be "multi-threaded". My query in one case looks
like this:

	SELECT * FROM (
	    SELECT t.*, Mod(RowNum, 4) threadid
	    FROM medplus.public_topic_sites_us_v t
	) WHERE threadid = 0

And then I have 3 other queries in other DIH configurations for threadid 1,2,3.
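
As a rough sketch of how the four queries relate to each other, the snippet below generates one query per partition from the same template. The function name partition_queries is made up for illustration; the view name and the Mod(RowNum, 4) expression come from the query above. In practice each generated query would go into its own DIH configuration file.

```python
from typing import List


def partition_queries(view: str, partitions: int) -> List[str]:
    """Return one SELECT per partition, splitting rows by Mod(RowNum, partitions)."""
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    template = (
        "SELECT * FROM "
        "(SELECT t.*, Mod(RowNum, {n}) threadid FROM {view} t) "
        "WHERE threadid = {i}"
    )
    # threadid runs from 0 to partitions - 1, one query for each value.
    return [template.format(n=partitions, view=view, i=i) for i in range(partitions)]


if __name__ == "__main__":
    for query in partition_queries("medplus.public_topic_sites_us_v", 4):
        print(query)
```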

You also have to be careful with the clean parameter. Unless a specific delete query is
specified using "preImportDeleteQuery" or "postImportDeleteQuery", the clean parameter will
cause DIH to remove the index data from all data import handlers, even though you are only
refreshing one. If you configure it carefully, however, it all works.
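
A script that kicks off several imports might look roughly like the sketch below. The host, collection name, and handler paths are hypothetical placeholders; command, clean, and commit are the usual DIH request parameters. It assumes the default behaviour where a full-import request returns while the import continues in the background, so sending the requests one after another still leaves the imports running concurrently.

```python
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical values: replace with your own Solr host, collection, and handlers.
SOLR_BASE = "http://localhost:8983/solr/mycollection"
HANDLERS = ["/dataimport-health-topics", "/dataimport-encyclopedia"]


def build_import_url(base: str, handler: str, clean: bool) -> str:
    """Build the full-import URL for one DIH request handler."""
    params = {
        "command": "full-import",
        "clean": "true" if clean else "false",
        "commit": "true",
    }
    return f"{base}{handler}?{urlencode(params)}"


def start_all(base: str, handlers: List[str], clean: bool) -> List[int]:
    """Send one full-import request per handler and return the HTTP statuses."""
    statuses = []
    for handler in handlers:
        with urlopen(build_import_url(base, handler, clean)) as response:
            statuses.append(response.status)
    return statuses


if __name__ == "__main__":
    # Dry run: only print the URLs that start_all() would request.
    for handler in HANDLERS:
        print(build_import_url(SOLR_BASE, handler, clean=False))
```

Passing clean=True here is only safe if each handler's configuration limits the delete to its own documents, as described above; with clean=False nothing is deleted before the import.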

These are the use cases for the "source" field I use:

- Filter only on documents from one source for the user, by specifying fq=source:health-topics
in the query to Solr.
- Filter only documents from one source in backend processing, for instance for the preImportDeleteQuery.
- Do something different in the application that front-ends Solr depending on the "source"
field value.

There are some impacts on relevancy from combining them into one collection:

When you combine multiple sources into one collection, whether using DIH or some other mechanism,
you have to remember that Solr's relevancy calculations include documents from all sources.
Even if documents with different "source" values are queried independently (through
filter queries such as fq=source:health-topics), the frequency of a word in the entire collection
is a factor.
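
To make the effect concrete, here is a small illustration. It assumes the classic Lucene TF-IDF form idf = 1 + ln(numDocs / (docFreq + 1)); the similarity your Solr actually uses depends on version and configuration, so treat the formula as an assumption. The document counts are invented purely for the example.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style inverse document frequency (an assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical counts, for illustration only.
SMALL_SOURCE_DOCS = 1_000   # documents with source:health-topics
COMBINED_DOCS = 100_000     # documents from all sources in the one collection
TERM_DOC_FREQ = 500         # documents containing the term, all in the small source

# The same term is scored as common in a collection of its own,
# but as rare once the large corpus shares the collection.
idf_if_separate = classic_idf(TERM_DOC_FREQ, SMALL_SOURCE_DOCS)
idf_if_combined = classic_idf(TERM_DOC_FREQ, COMBINED_DOCS)

if __name__ == "__main__":
    print(f"idf in a collection of its own: {idf_if_separate:.3f}")
    print(f"idf in the combined collection: {idf_if_combined:.3f}")
```

A filter query narrows which documents come back, but under this assumption the term weight is still computed from the combined counts.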

However, you can query them together, though you may have to carefully tune the weighting of
the documents so that a large corpus doesn't dwarf a small one (unless that is appropriate).
As always, relevancy gets pretty tricky.

Hope this helps,

Dan Davis

-----Original Message-----
From: Yangrui Guo [mailto:guoyangrui@gmail.com] 
Sent: Tuesday, April 05, 2016 3:16 PM
To: solr-user@lucene.apache.org
Subject: Re: Multiple data-config.xml in one collection?

Hi Daniel,

So if I implement multiple dataimporthandlers and do a full import, does Solr perform the import
for all handlers at once, or can I just specify which handler to import? Thank you

Yangrui

On Tuesday, April 5, 2016, Davis, Daniel (NIH/NLM) [C] <daniel.davis@nih.gov>
wrote:

> If Shawn is correct, and you are using DIH, then I have done this by 
> implementing multiple requestHandlers each of them using Data Import 
> Handler, and have each specify a different XML file for the data config.
> Instead of using data-config.xml, I've used a large number of files such as:
>         health-topics-conf.xml
>         encyclopedia-conf.xml
>         ...
> I tend to index a single-valued, required field named "source" that I 
> can use in the delete query, and I use the TemplateTransformer to make this easy:
>
> <entity name="topic"
>     ...
>    transformer="TemplateTransformer">
>    <field column="source" template="health-topics" />
>    ...
>
> Hope this helps,
>
> -Dan
>
> -----Original Message-----
> From: Shawn Heisey [mailto:apache@elyograg.org]
> Sent: Tuesday, April 05, 2016 10:50 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Multiple data-config.xml in one collection?
>
> On 4/5/2016 8:12 AM, Yangrui Guo wrote:
> > I'm using Solr Cloud to index a number of databases. The problem is 
> > there is unknown number of databases and each database has its own
> > configuration.
> > If I create a single collection for every database the query would 
> > eventually become insanely long. Is it possible to upload different 
> > config to zookeeper for each node in a single collection?
>
> Every shard replica (core) in a collection shares the same 
> configuration, which it gets from zookeeper.  This is one of 
> SolrCloud's guarantees, to prevent problems found with old-style 
> sharding when the configuration is different on each machine.
>
> If you're using the dataimport handler, which you probably are since 
> you mentioned databases, you can parameterize pretty much everything 
> in the DIH config file so it comes from URL parameters on the 
> full-import or delta-import command.
>
> Below is a link to the DIH config that I'm using, redacted slightly.
> I'm not running SolrCloud, but the same thing should work in cloud.  
> It should give you some idea of how to use variables in your config, 
> set by parameters on the URL.
>
> http://apaste.info/jtq
>
> Thanks,
> Shawn
>
>