lucene-dev mailing list archives

From Mike Sokolov
Subject Re: Solr Config XML DTD's
Date Wed, 18 May 2011 20:17:05 GMT
I looked into inserting a formal validation step in o.a.solr.core.Config 
and ran some preliminary simple tests.  The code is fairly simple; just 
a couple of gotchas:

1) To use RELAX NG Compact (RNC), my preferred validation language, we 
would need to pull in a couple of new jars, one of which is over 600K.  
Also, support for RNC in the XML world is not very widespread: it has 
drawn more interest from researchers than uptake in general use, so it 
might not be the best choice, even if it is aesthetically superior IMO.

2) The other alternatives are XML Schema and DTD.  I think DTD is a 
non-starter since it just can't allow things like arbitrary attributes 
on an element (you have to list them explicitly).  XML Schema is 
probably the best choice, all things considered: support for it is built 
into the XML tools already in use, and it is widely adopted.  The 
drawback is that it's a baroque and unwieldy syntax designed by an 
indecisive committee that loaded it down with featuritis, and someone 
will end up having to maintain this: every time a new configuration 
option is added to the Solr schema (or solrconfig, etc.), the 
schema-schema (validation schema?) will have to be updated to reflect that.
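
To make the XML Schema option concrete, here is a minimal sketch using 
only the javax.xml.validation API that ships with the JDK.  The tiny 
schema, the <config> and <dataDir> element names, and the SchemaCheck 
class are all made up for illustration; none of this is the actual Solr 
config format or code from o.a.solr.core.Config.  It shows 
xs:anyAttribute permitting arbitrary attributes (the thing a DTD can't 
do) while a misspelled element name is still rejected.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class SchemaCheck {
    // Hypothetical toy schema: a <config> root that requires one <dataDir>
    // child and tolerates any attributes on the root via xs:anyAttribute.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='config'><xs:complexType>"
      + "<xs:sequence><xs:element name='dataDir' type='xs:string'/></xs:sequence>"
      + "<xs:anyAttribute processContents='skip'/>"
      + "</xs:complexType></xs:element></xs:schema>";

    /** Returns null if the XML is valid, otherwise the validator's message. */
    static String validate(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            // With no ErrorHandler set, the validator throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Arbitrary attribute on the root: accepted.
        System.out.println(validate("<config foo='bar'><dataDir>/x</dataDir></config>"));
        // Misspelled element name: rejected with a descriptive message.
        System.out.println(validate("<config><dataDri>/x</dataDri></config>"));
    }
}
```

The maintenance cost mentioned above shows up here as the XSD string: 
each new config option would need a matching declaration in it.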

3) Finally, to get good error reporting it's important to show file name 
and line number where an error occurred.  Although you can validate a 
constructed XML tree (a DOM), it's better to run validation on a Stream 
so the line numbers are available.  Therefore it will probably be 
necessary to run two passes (one to validate, and one to construct the 
DOM), which means buffering the config.  Doesn't seem like a big deal: 
these are small files that only get loaded once, but this is a cost of 
validation, I think.
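
A rough sketch of that two-pass idea, again with only JDK classes: the 
config is buffered as a byte array, validated as a stream so the parser 
can report a line number, and then parsed a second time into a DOM.  
The TwoPassConfigLoader class, its load method, and the toy schema are 
assumptions for illustration, not existing Solr code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class TwoPassConfigLoader {
    // Hypothetical toy schema: <config> with one required <dataDir> child.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='config'><xs:complexType><xs:sequence>"
      + "<xs:element name='dataDir' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Validates the buffered config, then builds a DOM from the same bytes. */
    static Document load(byte[] buffered, String fileName) {
        try {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            try {
                // Pass 1: stream validation, so line numbers are available.
                schema.newValidator().validate(
                    new StreamSource(new ByteArrayInputStream(buffered)));
            } catch (SAXParseException e) {
                // Prefix the message with file name and line for the user.
                throw new IllegalArgumentException(
                    fileName + ":" + e.getLineNumber() + ": " + e.getMessage(), e);
            }
            // Pass 2: parse the same buffer again to construct the DOM.
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            return dbf.newDocumentBuilder().parse(new ByteArrayInputStream(buffered));
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bad = "<config>\n  <dataDri>/var/solr</dataDri>\n</config>"
            .getBytes(StandardCharsets.UTF_8);
        try {
            load(bad, "solrconfig.xml");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The only extra cost is holding the file in memory and reading it twice, 
which matches the trade-off described above.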

Of course, the benefit is that users would actually get fast-failing, 
specific, and informative error messages covering a wide variety of 
misconfigurations: I would hope we could be restrictive enough to catch 
misspelled versions of known element and attribute names, or places 
where elements are out of order.

I'd be willing to work this up, develop a preliminary schema (of 
whichever sort we choose), and send in a patch.  But other folks would 
probably end up having to maintain it from time to time if it's to have 
any value at all and not just get disabled, so I want to make sure this 
is something you all think is worthwhile before going any further.


On 05/17/2011 09:04 AM, Michael McCandless wrote:
> is a good example
> where we are failing to catch mis-configuration on startup.
> Is there some way we can baby step here?  EG use one of these XML
> validation packages, incrementally, on only sub-strings from the XML?
> (Or simpler is to just do the checking ourselves w/ custom code).
> Mike
> On Wed, May 4, 2011 at 10:50 PM, Michael Sokolov wrote:
>> I'm not sure you will find anyone wanting to put in this effort now, but
>> another suggestion for a general approach might be:
>> 1 very basic static analysis to catch what you can - this should be a pretty
>> minimal effort only given what can reasonably be achieved
>> 2 throw runtime errors as Hoss says (probably already doing this well
>> enough, but maybe some incremental improvements are needed?)
>> 3 an option to run a "configtest" like httpd provides that preloads all
>> declared handlers/plugins/modules etc, instantiates them and gives them an
>> opportunity to read their config and throw whatever errors they find.  This
>> way you can set a standard (error on unrecognized parameter, say) in some
>> core areas, and distribute the effort.  This is a hugely useful sanity check
>> to be able to run when you want to make config changes and not have your
>> server fall over when it starts (or worse - later).
>> -Mike "kibitzer" Sokolov
>> On 5/4/2011 6:55 PM, Chris Hostetter wrote:
>>> As i said: any improvements to help catch the mistakes we can identify
>>> would be great, but we should maintain perspective of the effort/gain
>>> tradeoff given that there is likely nothing we can do about the basic
>>> problem of "a string that won't be evaluated until runtime"
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail: