lucene-solr-dev mailing list archives

From "Mattmann, Chris A (388J)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: Namespaces in response (SOLR-1586)
Date Wed, 09 Dec 2009 15:40:21 GMT
Hi Grant,

My replies inline as well:

>>> 
>>> Discussion points:
>>> 1. If there are standard namespaces, then people can use them to do fun XML
>>> things
>> 
>> +1. This includes things like validation,
> 
> Yeah, but the rest of Solr's response doesn't have it, so...
> 

You mean the rest of SOLR's default response and the components that add to
it. My point was that, as a user of SOLR, I can introduce as many inline xmlns
attributes (and thus declare as many namespaces) as I want; nothing
precludes me from doing so.
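
To make that concrete, here is a minimal sketch of an element that declares
its namespace inline. It uses the JDK's StAX writer rather than Solr's
XMLWriter, and the class and method names are made up for illustration; only
the GeoRSS namespace URI is real.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of Solr: writes one field value as a
// namespaced element with the xmlns declaration carried on the element itself.
class InlineNamespaceSketch {
    static final String GEORSS_NS = "http://www.georss.org/georss";

    static String pointElement(String name, String value) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            // <georss:point ...> with the prefix bound on this same element
            xml.writeStartElement("georss", "point", GEORSS_NS);
            xml.writeNamespace("georss", GEORSS_NS);
            xml.writeAttribute("name", name);
            xml.writeCharacters(value);
            xml.writeEndElement();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write element", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(pointElement("point", "55.3 27.9"));
    }
}
```

The rest of the response is untouched; only the element that wants a
namespace pays for the extra declaration.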

>>> 3. The indexing side doesn't support them, so it seems odd to put in
>>> something like <field name="point">55.3 27.9</field> and get back
>>> <georss:point name="point">55.3 27.9</georss:point>.  At the same time,
>>> it seems equally weird to get back <str name="point">...</str> when there
>>> is in fact more semantic information available about this particular field
>>> that would otherwise require more work by an application to make sense of.
>> 
>> You got it. I'm not sure why it seems weird -- the translation from
>> docs/fields to external representation (via response writers or field type
>> representation) is one of the benefits of SOLR IMHO.
> 
> It's weird b/c no XML type was specified upfront, but a type was given out on
> the back end.  It's not a show stopper or anything, just an interesting point,
> I think.

I actually disagree with this. FieldTypes, if we agree on a data type
representation, e.g., georss point format, or line format, etc., define
their XML representation. So, if we have a FieldType of type georss:point,
then a type _is_ given up front, it's just defined in the standard that
defines the field element.

Imagine if you wanted to standardize on something like dublin core, for
titles, formats, etc. SOLR expects a fairly simple XML structure (Documents,
with Fields, with attributes), but the advantage of SOLR over traditional
Lucene is that via FieldTypes, you can understand what the true type of the
field you are indexing is. In other words, we can say in a schema file that
e.g., this incoming title is DublinCore, so its field type is
solr.DublinCoreAuthor, which inside of the FieldType definition, tells us
how to go from the given representation to the index representation
(#toInternal) and subsequently tells us how to go from the index
representation to the external representation (#toExternal).
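
As a rough sketch of that round trip (a simplified stand-in, not Solr's
actual FieldType class, whose method signatures differ), assume a
hypothetical point type that indexes "lat lon" input as "lat,lon" and hands
it back in the external form on the way out:

```java
// Simplified stand-in for the idea behind a field type: one place that
// knows both the index form and the external form of a value.
abstract class SketchFieldType {
    abstract String toInternal(String external);
    abstract String toExternal(String internal);
}

// Hypothetical point type; the "lat,lon" index form is an assumption
// chosen for the example, not what any Solr type necessarily stores.
class SketchPointType extends SketchFieldType {
    @Override
    String toInternal(String external) {
        String[] parts = external.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected 'lat lon': " + external);
        }
        // Parsing rejects non-numeric input before it reaches the index.
        double lat = Double.parseDouble(parts[0]);
        double lon = Double.parseDouble(parts[1]);
        return lat + "," + lon;
    }

    @Override
    String toExternal(String internal) {
        String[] parts = internal.split(",");
        return parts[0] + " " + parts[1];
    }

    public static void main(String[] args) {
        SketchPointType type = new SketchPointType();
        String indexed = type.toInternal("55.3 27.9");
        System.out.println(indexed + " -> " + type.toExternal(indexed));
    }
}
```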

I'm not advocating changing SOLR's input doc format for indexing -- I'm
arguing that what you guys have done is actually a great idea. Keeping
FieldTypes and SolrInputDocuments separate allows each to evolve
independently of the other but, at the same time, be brought back together
for purposes such as validation (see the lat/lon validation I did in
the attached patch), response writing (for plugging into external tools),
and representation in the Lucene index beyond plain ol' Strings.
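
For illustration only, a lat/lon check of the kind mentioned above could look
roughly like the following. This is a standalone sketch with made-up names
and the usual coordinate ranges assumed; it is not the code from the patch.

```java
// Hypothetical validation helper: accepts "lat lon" with latitude in
// [-90, 90] and longitude in [-180, 180], rejects everything else.
class LatLonValidationSketch {
    static boolean isValidPoint(String value) {
        String[] parts = value.trim().split("\\s+");
        if (parts.length != 2) {
            return false;
        }
        try {
            double lat = Double.parseDouble(parts[0]);
            double lon = Double.parseDouble(parts[1]);
            // NaN fails every comparison, so it is rejected here as well.
            return lat >= -90.0 && lat <= 90.0
                && lon >= -180.0 && lon <= 180.0;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidPoint("55.3 27.9"));
        System.out.println(isValidPoint("95.0 27.9"));
    }
}
```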

> 
>> 
>>> 4. If we let in other namespaces, we then are opening ourselves to longer
>>> responses, etc.  It is also likely the case that there isn't just one
>>> standard.  This likely could mean slower responses, etc.
>> 
>> How does adding in some characters (e.g., an "ns" tag and an associated URL)
>> add anything other than noise? We're talking the difference between O(n)
>> versus O(n+20) here. Also it's perfectly legit IMHO to say, well if you
>> introduce 10,000 namespaces, well, that's on you, and be prepared for
>> slower client/server interactions.
> 
> You'd be surprised how slow XML parsing often is. Especially for larger
> responses, XML processing can be quite expensive, and most of the information
> is verbose at best.   I've seen this on a number of occasions and it is why we
> switched to a binary response format in SolrJ and why I think all clients
> should speak the binary protocol.

Sure, XML parsing can be slow, but from your point above, you guys have
standardized on using a binary request/response format in things like SolrJ,
so what does the XML have to do with this anyway, and why is performance a
concern then? In the case where people want XML in their particular format,
it's up to them to parse it (and in most cases, if they are outputting a
format, there are likely already readers, etc. that exist for that format,
to which things like optimizations can be delegated).

On the other hand, let's consider XSLT, which is a big performance hit as
well, and in many cases more of a hit than simply outputting XML with the
namespaces inline. Also, let's qualify this: I'm not saying we should force
SOLR's default response (and all the Components that add to the response) to
use namespaces. However, it should definitely not be precluded.

> 
> 
>> 
>>> 5. If people wanted them, they could just do XSLT, but that is an extra step
>>> too.
>> 
>> Yep, that's an extra step, and it's not explicit, like the patch I attached
>> is. I tried to take advantage of one of SOLR's extension points in the
>> architecture to explicitly tie a representation of a Field to its external
>> and internal representation (aka, the point of a FieldType, no?)
>>> 
>>> An alternative is that we could refactor things a bit and allow the
>>> FieldType
>>> to specify the tag name instead of it being hardcoded in the writers.  This
>>> way people writing FieldTypes could define them.  For instance, we could
>>> have
>>> FieldType.getTagName() that could be overridden and clients could have tools
>>> for introspecting this.
>> 
>> This is basically what I did, right? I did an inline namespace using a
>> variant of #writePrim in XMLWriter (#writeCdata) and had the
>> FieldType#toExternal method set the tag name, which is allowed by the API.
> 
> As Hoss points out on the thread, I think the longer term goal seems to be to
> be more agnostic of the FieldType, so this would argue against my proposal.

My opinion is that if you've got all of this flow and logic going through
FieldType (see my comments on that same thread), which is similar to what
we see in databases, etc., then your proposal actually makes a lot of
sense. So, I would be +1 for your proposal, but as I mentioned, your
proposal is already possible (as shown in this patch). There is just no
explicit API for doing so like the one you suggested, with the method
signatures that you proposed.
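
For reference, the explicit API you suggested could be sketched roughly as
below. Everything here is hypothetical: getTagName() is the method name from
your proposal, not an existing Solr method, and the classes are standalone
stand-ins rather than Solr's FieldType and XMLWriter.

```java
// Stand-in field type: the default tag matches the generic <str> element.
class TagNameFieldType {
    String getTagName() {
        return "str";
    }
}

// A type that knows its standard representation overrides the tag name.
class GeoRssPointFieldType extends TagNameFieldType {
    @Override
    String getTagName() {
        return "georss:point";
    }
}

// Stand-in writer: asks the type for the tag instead of hardcoding it.
// Values are assumed to be XML-safe already; escaping is left out.
class TagNameWriterSketch {
    static String write(TagNameFieldType type, String name, String value) {
        String tag = type.getTagName();
        return "<" + tag + " name=\"" + name + "\">" + value + "</" + tag + ">";
    }

    public static void main(String[] args) {
        System.out.println(write(new TagNameFieldType(), "point", "55.3 27.9"));
        System.out.println(write(new GeoRssPointFieldType(), "point", "55.3 27.9"));
    }
}
```

A real version would also need the namespace declaration for the prefix to
appear somewhere in the response.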

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department University of
Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


