Message-ID: <4DD42941.8050104@ifactory.com>
Date: Wed, 18 May 2011 16:17:05 -0400
From: Mike Sokolov
To: dev@lucene.apache.org
CC: Michael McCandless, Chris Hostetter
Subject: Re: Solr Config XML DTD's

I looked into inserting a formal validation step in o.a.solr.core.Config and ran some simple preliminary tests. The code is fairly simple; there are just a few gotchas:

1) To use the RNC validation language (my preference), we would need to pull in a couple of new jars, one of which is over 600K. Also, support for RNC in the XML world is not very widespread: it has drawn more interest from researchers than uptake from practitioners, so it might not be the best choice, even if it is aesthetically superior IMO.

2) The other alternatives are XML Schema and DTD. I think DTD is a non-starter since it can't allow things like arbitrary attributes on an element (you have to list them explicitly). Schema is probably the best choice, all things considered: support for it is built into the XML tools already in use, and it is widely adopted. The drawback is that it has a baroque and unwieldy syntax, designed by an indecisive committee that loaded it down with excessive featuritis, and someone will end up having to maintain this: every time you add a new configuration option to the schema (or solrconfig, etc.), the schema-schema (validation schema?) will have to be updated to reflect that.

3) Finally, to get good error reporting it's important to show the file name and line number where an error occurred. Although you can validate a constructed XML tree (a DOM), it's better to run validation on a stream so the line numbers are available. Therefore it will probably be necessary to run two passes (one to validate, and one to construct the DOM), which means buffering the config.
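
To make the two-pass idea concrete, here is a rough sketch using only the javax.xml.validation API that ships with the JDK. It is not the actual patch: the class name TwoPassConfigLoader, the load() method, and the toy schema (a <config> root with <option name="..."/> children) are all made up for illustration, and a real Solr schema would be far larger.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of Solr: validates a buffered config, then builds a DOM.
class TwoPassConfigLoader {

    // Toy XML Schema (an assumption for this sketch): <config> holding <option name="..."/>.
    static final String TOY_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='config'><xs:complexType><xs:sequence>"
        + "<xs:element name='option' minOccurs='0' maxOccurs='unbounded'>"
        + "<xs:complexType>"
        + "<xs:attribute name='name' type='xs:string' use='required'/>"
        + "</xs:complexType></xs:element>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    static Document load(byte[] configBytes, String fileName)
            throws SAXException, IOException, ParserConfigurationException {
        Schema schema = SchemaFactory
            .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(TOY_XSD)));

        // Pass 1: validate the buffered bytes as a stream, so the parser
        // can report the line on which a problem was found.
        try {
            schema.newValidator().validate(
                new StreamSource(new ByteArrayInputStream(configBytes)));
        } catch (SAXParseException e) {
            // Prefix the message with "file:line:" for the user.
            throw new SAXException(
                fileName + ":" + e.getLineNumber() + ": " + e.getMessage(), e);
        }

        // Pass 2: re-read the same buffer to construct the DOM.
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        return dbf.newDocumentBuilder().parse(new ByteArrayInputStream(configBytes));
    }
}
```

With this toy schema, a misspelled element such as <optoin name='a'/> fails in the first pass with a message that starts with the file name and line number, and the DOM is never built.
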
The buffering doesn't seem like a big deal: these are small files that only get loaded once, but it is a cost of validation, I think.

Of course the benefit is that users would actually get fast-failing, specific, and informative error messages covering a wide variety of misconfigurations: I would hope we could be restrictive enough to catch misspelled versions of known element and attribute names, or places where elements are out of order.

I'd be willing to work this up, develop a preliminary schema (of whichever sort we choose), and send in a patch, but other folks would probably end up having to maintain it from time to time if it's to have any value at all and not just get disabled, so I just want to make sure this is something you all think is worthwhile before going any further.

-Mike

On 05/17/2011 09:04 AM, Michael McCandless wrote:
> https://issues.apache.org/jira/browse/SOLR-2119 is a good example
> where we are failing to catch mis-configuration on startup.
>
> Is there some way we can baby step here? EG use one of these XML
> validation packages, incrementally, on only sub-strings from the XML?
> (Or simpler is to just do the checking ourselves w/ custom code).
>
> Mike
>
> http://blog.mikemccandless.com
>
> On Wed, May 4, 2011 at 10:50 PM, Michael Sokolov wrote:
>
>> I'm not sure you will find anyone wanting to put in this effort now, but
>> another suggestion for a general approach might be:
>>
>> 1 very basic static analysis to catch what you can - this should be a pretty
>> minimal effort only given what can reasonably be achieved
>>
>> 2 throw runtime errors as Hoss says (probably already doing this well
>> enough, but maybe some incremental improvements are needed?)
>>
>> 3 an option to run a "configtest" like httpd provides that preloads all
>> declared handlers/plugins/modules etc, instantiates them and gives them an
>> opportunity to read their config and throw whatever errors they find.
>> This way you can set a standard (error on unrecognized parameter, say) in some
>> core areas, and distribute the effort. This is a hugely useful sanity check
>> to be able to run when you want to make config changes and not have your
>> server fall over when it starts (or worse - later).
>>
>> -Mike "kibitzer" Sokolov
>>
>> On 5/4/2011 6:55 PM, Chris Hostetter wrote:
>>
>>> As i said: any improvements to help catch the mistakes we can identify
>>> would be great, but we should maintain perspective of the effort/gain
>>> tradeoff given that there is likely nothing we can do about the basic
>>> problem of "a string that won't be evaluated until runtime"

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org