lucene-solr-user mailing list archives

From Carl Roberts <carl.roberts.zap...@gmail.com>
Subject Re: Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?
Date Fri, 23 Jan 2015 18:29:09 GMT
Hi Alex,

If I understand this correctly, I can define multiple entities in one 
file, like this?

<document>
     <entity/>
     <entity/>
     <entity/>
     ...
</document>

How would I trigger loading certain entities during start?

How would I trigger loading other entities during update?

Is there a way to set an auto-update for certain entities so that I 
don't have to invoke an update via curl?

Where / how do I specify the preImportDeleteQuery to avoid deleting 
everything upon each update?

Is there an example or doc that shows how to do all this?

Regards,

Joe

On 1/23/15, 11:24 AM, Alexandre Rafalovitch wrote:
> You can define both multiple entities in the same file and nested
> entities if your list comes from an external source (e.g. a text file
> of URLs).
> You can also trigger DIH with the name of a specific entity to load just
> that one.
> You can even pass a DIH configuration file when you trigger the
> processing start, so you can have completely different files for the
> initial load and the update. Though you can do the same with
> entities.
>
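
For context on the configuration-file point: the default DIH config file is normally named in solrconfig.xml, and request parameters override the handler defaults. A minimal sketch, assuming a handler registered at /dataimport and a file named rss-data-config.xml (both names are assumptions modeled on the Solr RSS example, not taken from this thread):

```xml
<!-- solrconfig.xml fragment. The handler path and file name are assumptions. -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <!-- Default DIH configuration used when the request does not name one. -->
        <str name="config">rss-data-config.xml</str>
    </lst>
</requestHandler>
```

With a handler like this, a request carrying command=full-import and entity=nvd-rss should run only that entity, and a config parameter on the same request should point DIH at a different file for that run.
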
> The only thing to be aware of is that before an entity definition is
> processed, a delete command is run. By default, it's "delete all", so
> executing one entity will delete everything but then just populate
> that one entity's results. You can avoid that by defining
> preImportDeleteQuery and having a clear identifier on content
> generated by each entity (e.g. source, either extracted or manually
> added with TemplateTransformer).
>
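
The delete behavior described above can be sketched as a two-entity config. This is only an illustration under stated assumptions: the "source" field, the second entity's name, its URL, and its XPaths are hypothetical, and the second entity assumes the feed has already been unzipped to plain XML somewhere DIH can read it. The first entity reuses the RSS settings quoted in this thread.

```xml
<dataConfig>
    <dataSource type="URLDataSource" />
    <document>
        <!-- RSS entity: URL and XPaths copied from the config quoted in this thread. -->
        <!-- preImportDeleteQuery limits the pre-import delete to this entity's own documents. -->
        <entity name="nvd-rss"
                pk="link"
                url="https://nvd.nist.gov/download/nvd-rss.xml"
                processor="XPathEntityProcessor"
                forEach="/RDF/item"
                transformer="DateFormatTransformer,TemplateTransformer"
                preImportDeleteQuery="source:nvd-rss">
            <field column="id" xpath="/RDF/item/title" commonField="true" />
            <field column="link" xpath="/RDF/item/link" commonField="true" />
            <field column="summary" xpath="/RDF/item/description" commonField="true" />
            <field column="date" xpath="/RDF/item/date" commonField="true" />
            <!-- Assumption: the schema defines a "source" string field. -->
            <!-- TemplateTransformer stamps every document with a fixed value. -->
            <field column="source" template="nvd-rss" />
        </entity>

        <!-- Hypothetical entity for one yearly feed. The URL, forEach, and xpath -->
        <!-- values are placeholders and must be adapted to the real file. -->
        <entity name="nvd-cve-2014"
                url="http://localhost/feeds/nvdcve-2.0-2014.xml"
                processor="XPathEntityProcessor"
                forEach="/nvd/entry"
                transformer="TemplateTransformer"
                preImportDeleteQuery="source:nvd-cve-2014">
            <field column="id" xpath="/nvd/entry/@id" />
            <field column="source" template="nvd-cve-2014" />
        </entity>
    </document>
</dataConfig>
```

With this layout, a full-import restricted to entity=nvd-rss would first delete only documents matching source:nvd-rss rather than the whole index. Passing clean=false on the request skips the pre-import delete entirely, which may suit a periodic RSS refresh that should only add or overwrite documents.
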
> Regards,
>     Alex.
>
> ----
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>
>
> On 23 January 2015 at 11:15, Carl Roberts <carl.roberts.zapata@gmail.com> wrote:
>> Hi,
>>
>> I have the RSS DIH example working with my own RSS feed - here is the
>> configuration for it.
>>
>> <dataConfig>
>>      <dataSource type="URLDataSource" />
>>      <document>
>>          <entity name="nvd-rss"
>>                  pk="link"
>>                  url="https://nvd.nist.gov/download/nvd-rss.xml"
>>                  processor="XPathEntityProcessor"
>>                  forEach="/RDF/item"
>>                  transformer="DateFormatTransformer">
>>
>>              <field column="id" xpath="/RDF/item/title" commonField="true" />
>>              <field column="link" xpath="/RDF/item/link" commonField="true" />
>>              <field column="summary" xpath="/RDF/item/description" commonField="true" />
>>              <field column="date" xpath="/RDF/item/date" commonField="true" />
>>
>>          </entity>
>>      </document>
>> </dataConfig>
>>
>> However, my problem is that I also have to load multiple XML feeds into the
>> same core.  Here is one example (there are about 10 of them):
>>
>> http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2014.xml.zip
>>
>>
>> Is there any built-in functionality that would allow me to do this?
>> Basically, the use-case is to load and index all the XML ZIP files first,
>> and then check the RSS feed every two hours and update the indexes with any
>> new ones.
>>
>> Regards,
>>
>> Joe
>>
>>

