lucene-solr-user mailing list archives

From Erick Erickson
Subject Re: SolrCloud, DIH, and XPathEntityProcessor
Date Tue, 12 Jan 2016 18:22:08 GMT
Yeah, that's essentially the nature of open source: someone
gets frustrated enough with current behavior and fixes it ;)...

There's never any harm in opening a JIRA; all you need to do
is register. It's not a bad idea to open one as you _start_ writing
the code, even providing very early versions of your patch for
people to comment on or to discuss approaches. And early
comments may save you a lot of work! No guarantees of course.

If you do put up a preliminary patch, just mention the current
state in the comments.

If you haven't seen it already, here's a primer:


On Tue, Jan 12, 2016 at 7:16 AM, Tom Evans wrote:
> On Tue, Jan 12, 2016 at 3:00 PM, Shawn Heisey wrote:
>> On 1/12/2016 7:45 AM, Tom Evans wrote:
>>> That makes no sense whatsoever. DIH loads the data_import.conf from ZK
>>> just fine, or is that provided to DIH from another module that does
>>> know about ZK?
>> This is accomplished indirectly through a resource loader in the
>> SolrCore object that is responsible for config files.  Also, the
>> dataimport handler is created by the main Solr code which then hands the
>> configuration to the dataimport module.  DIH itself does not know about
>> zookeeper.
> ZkPropertiesWriter seems to know a little..
>>> Either way, it is entirely sub-optimal to have SolrCloud store "all"
>>> its configuration in ZK, but still require manually storing and
>>> updating files on specific nodes in order to influence DIH. If a
>>> server is mistakenly not updated, or manually modified locally on
>>> disk, that node would start indexing documents differently than other
>>> replicas, which sounds dangerous and scary!
>> The entity processor you are using accesses files through a Java
>> interface for mounted filesystems.  As already mentioned, it does not
>> know about zookeeper.
>>> If there is not a ZkFileDataSource, it shouldn't be too tricky to add
>>> one... I'll see how much I dislike having config files on the host...
>> Creating your own DIH class would be the only solution available right now.
>> I don't know how useful this would be in practice.  Without special
>> config in multiple places, Zookeeper limits the size of the files it
>> contains to 1MB.  It is not designed to deal with a large amount of data
>> at once.
> This is not a large amount of data; it is a 5KB XML file containing
> the configuration of which tables to query for which fields and how
> to map them into the document.
>> You could submit a feature request in Jira, but unless you supply a
>> complete patch that survives the review process, I do not know how
>> likely an implementation would be.
> We've already started an implementation, based on FileDataSource and
> using SolrZkClient, which we will deploy as an additional library
> while the review process is ongoing, or if the patch doesn't survive it.
> Cheers
> Tom
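
For readers following the thread, here is a minimal sketch of the ZkFileDataSource idea discussed above. It is not Tom's implementation and it does not use any Solr or ZooKeeper API: the class name ZkFileDataSourceSketch and the ZnodeFetcher interface are hypothetical stand-ins, the latter for whatever actually reads a znode (for example a thin wrapper around SolrZkClient). It only illustrates the shape of the approach: resolve a file name against a base path, fetch the bytes, respect the roughly 1MB znode limit Shawn mentions, and hand back a Reader.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; not a Solr class and not the patch from the thread. */
class ZkFileDataSourceSketch {

    /** Hypothetical abstraction over "read the bytes stored at this znode path". */
    interface ZnodeFetcher {
        byte[] fetch(String path) throws IOException;
    }

    /** Approximate default per-znode size limit mentioned in the thread (about 1MB). */
    static final int DEFAULT_MAX_BYTES = 1024 * 1024;

    private final ZnodeFetcher fetcher;
    private final String basePath;

    ZkFileDataSourceSketch(ZnodeFetcher fetcher, String basePath) {
        this.fetcher = fetcher;
        // Normalise so the stored base path never ends with a slash.
        this.basePath = basePath.endsWith("/")
                ? basePath.substring(0, basePath.length() - 1)
                : basePath;
    }

    /** Absolute paths are used as-is; relative names are resolved against the base path. */
    String resolve(String query) {
        return query.startsWith("/") ? query : basePath + "/" + query;
    }

    /** Returns the znode content as a character stream, decoded as UTF-8. */
    Reader getData(String query) throws IOException {
        String path = resolve(query);
        byte[] data = fetcher.fetch(path);
        if (data == null) {
            throw new IOException("No data found at " + path);
        }
        if (data.length > DEFAULT_MAX_BYTES) {
            throw new IOException("Content at " + path + " exceeds " + DEFAULT_MAX_BYTES + " bytes");
        }
        return new StringReader(new String(data, StandardCharsets.UTF_8));
    }
}
```

Keeping the fetch behind a small interface means the path handling and size check can be exercised without a running ZooKeeper ensemble; a real data source would additionally have to extend DIH's DataSource class and obtain its ZooKeeper client from the core.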
