lucene-solr-user mailing list archives

From Carl Roberts <>
Subject Re: Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?
Date Fri, 23 Jan 2015 18:29:09 GMT
Hi Alex,

If I am understanding this correctly, I can define multiple entities 
like this?


How would I trigger loading certain entities during start?

How would I trigger loading other entities during update?

Is there a way to set an auto-update for certain entities so that I 
don't have to invoke an update via curl?

Where / how do I specify the preImportDeleteQuery to avoid deleting 
everything upon each update?

Is there an example or doc that shows how to do all this?



On 1/23/15, 11:24 AM, Alexandre Rafalovitch wrote:
> You can define both multiple entities in the same file and nested
> entities if your list comes from an external source (e.g. a text file
> of URLs).
> You can also trigger DIH with the name of a specific entity to load just that.
> You can even pass a DIH configuration file when you trigger the
> import, so you can have completely different files for the
> initial load and the update, though you can do the same with
> entities.
> The only thing to be aware of is that before an entity definition is
> processed, a delete command is run. By default, it's "delete all", so
> executing one entity will delete everything but then just populate
> that one entity's results. You can avoid that by defining
> preImportDeleteQuery and having a clear identifier on content
> generated by each entity (e.g. source, either extracted or manually
> added with TemplateTransformer).
> Regards,
>     Alex.
> ----
> Sign up for my Solr resources newsletter at
> On 23 January 2015 at 11:15, Carl Roberts <> wrote:
>> Hi,
>> I have the RSS DIH example working with my own RSS feed - here is the
>> configuration for it.
>> <dataConfig>
>>      <dataSource type="URLDataSource" />
>>      <document>
>>          <entity name="nvd-rss"
>>                  pk="link"
>>                  url=""
>>                  processor="XPathEntityProcessor"
>>                  forEach="/RDF/item"
>>                  transformer="DateFormatTransformer">
>>              <field column="id" xpath="/RDF/item/title" commonField="true" />
>>              <field column="link" xpath="/RDF/item/link" commonField="true" />
>>              <field column="summary" xpath="/RDF/item/description" commonField="true" />
>>              <field column="date" xpath="/RDF/item/date" commonField="true" />
>>          </entity>
>>      </document>
>> </dataConfig>
>> However, my problem is that I also have to load multiple XML feeds into the
>> same core.  Here is one example (there are about 10 of them):
>> Is there any built-in functionality that would allow me to do this?
>> Basically, the use-case is to load and index all the XML ZIP files first,
>> and then check the RSS feed every two hours and update the indexes with any
>> new ones.
>> Regards,
>> Joe

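For reference, here is a rough sketch of the setup Alexandre describes: two root entities in one data-config file, each tagging its documents with a source value and restricting its pre-import delete to that value. This is untested and rests on several assumptions: the schema has a `source` field, the second entity's name (`nvd-xml`), its `forEach` path, and its xpaths are hypothetical placeholders, and the XML it reads is already unzipped. The `url` values are left empty, as in the original message.

```xml
<dataConfig>
    <dataSource type="URLDataSource" />
    <document>
        <!-- RSS entity from the original message. The delete that runs
             before this entity is imported only removes documents
             whose source field is nvd-rss. -->
        <entity name="nvd-rss"
                pk="link"
                url=""
                processor="XPathEntityProcessor"
                forEach="/RDF/item"
                transformer="DateFormatTransformer,TemplateTransformer"
                preImportDeleteQuery="source:nvd-rss">
            <field column="id" xpath="/RDF/item/title" commonField="true" />
            <field column="link" xpath="/RDF/item/link" commonField="true" />
            <field column="summary" xpath="/RDF/item/description" commonField="true" />
            <field column="date" xpath="/RDF/item/date" commonField="true" />
            <!-- Assumed schema field: marks every document from this entity -->
            <field column="source" template="nvd-rss" />
        </entity>

        <!-- Hypothetical second entity for one of the XML feeds.
             The name, forEach path, and xpaths are placeholders. -->
        <entity name="nvd-xml"
                pk="id"
                url=""
                processor="XPathEntityProcessor"
                forEach="/nvd/entry"
                transformer="TemplateTransformer"
                preImportDeleteQuery="source:nvd-xml">
            <field column="id" xpath="/nvd/entry/@id" />
            <field column="source" template="nvd-xml" />
        </entity>
    </document>
</dataConfig>
```

With a config like this, a single entity can be loaded by adding the `entity` parameter to the import request, for example `/dataimport?command=full-import&entity=nvd-rss` on the core's DIH handler (the handler path depends on how it is registered in solrconfig.xml). Passing `clean=false` skips the pre-import delete altogether. As far as I know DIH has no built-in scheduler, so a periodic update would still need an external trigger such as a cron job calling that URL.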