cocoon-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Re: [RT] Input Pipelines (long)
Date Mon, 23 Dec 2002 09:16:26 GMT
Hmmm, maybe deep architectural discussions are good during the holiday
season... we'll see :)

Daniel Fagerstrom wrote:
> Input Pipelines
> ===============
> 
> There is, IMO, a need for better support for input handling in
> Cocoon. I believe that the introduction of "input pipelines" can be an
> important step in this direction. In the rest of this (long) RT I will
> discuss use cases for them, a possible definition of input pipelines,
> compare them with the existing pipeline concept in Cocoon (henceforth
> called output pipelines), discuss what kinds of components would
> be useful in them, how they can be used in the sitemap and from
> flowscripts, and also relate them to the current discussion about how
> to reuse functionality ("Cocoon services") between blocks.

Cool, let's rock and roll.

> Use cases
> ---------
> 
> There is an ongoing trend of packaging all kinds of applications as web
> applications or decomposing them into sets of web services. At the same
> time web browsers are more and more becoming a universal GUI for all
> kinds of applications (e.g. XUL).
> 
> This leads to an increasing need for handling structured input data
> in web applications. SOAP might be the most important example; we also
> have XML-RPC and most certainly numerous home-brewed formats, some of
> which might even be binary non-xml legacy formats. WebDAV is another
> example of xml input, and next-generation form handling, XForms, uses
> xml as its transport format.
> 
> As people are building more and more advanced Cocoon systems there is
> also a growing need for reusing functionality in a structured way;
> there have been discussions about how to package and reuse "Cocoon
> services" in the context of blocks [1] and [2]. Here there is also a
> need for handling xml input.
> 
> The company I work for builds data warehouses, and some of our customers
> are starting to get interested in using the functionality of the data
> warehouses, not only from the web interfaces that we usually build
> but also as parts of their own webapps. This means that we want,
> besides Cocoon's flexibility in presenting data in different forms,
> also flexibility in asking for the data through different input
> formats.
> 
> There is thus a world of input beyond the request parameters, and a
> world of rapidly growing importance.

I acknowledge that and I think everybody here does.

> Does Cocoon support the abovementioned use cases? Yes and no: there
> are numerous components that implement SOAP, WebDAV, parts of XForms
> etc. But while the components designed for publishing are highly
> reusable in various contexts, this is not the case for input
> components.

Stop.

Before we go on I would like to point out that there is a *huge*
difference between poor 'reusability of components' caused by their
implementation and poor reusability caused by architectural limitations
of the component framework.

> IMO the reason for this is that Cocoon as a framework does
> not have much support for input handling.

This is obviously debatable, but I do agree with you that it's worth
challenging the very architecture of the framework and testing
its balance between input and output.

So, no matter what result this discussion will bring, it will be a good 
design challenge.

> IMO Cocoon could be as good in handling input as it currently is in
> creating output, by reusing exactly the same concept: pipelines. We
> can however not use the existing "output pipelines" as is; there are
> some asymmetries in their design that make them unsuitable for input.

I fail to see the asymmetries, but let's keep going.

> The term "input pipeline" has sometimes been used on the list, it is
> time to try to define what it could be.
> 
> What is an Input Pipeline
> -------------------------
> 
> An input pipeline typically starts by reading octet data from the
> input stream of the request object. The input data could be xml, tab
> separated data, text that is structured according to a certain
> grammar, binary legacy formats like Excel or Word or anything else
> that could be translated to xml. The first step in the input pipeline
> is an adapter from octet data to sax events. This sounds quite
> similar to a generator; we will return to this in the next section.

This sounds so similar to a generator that I fail to see any difference
from what a generator is... that is: would you need any additional method
in an interface that describes such a 'generator for input pipelines'?
I'm not being ironic, but honestly curious.
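
To make the question concrete, here is a minimal sketch in plain Java and SAX. The `SketchGenerator` interface and the class names are made up for illustration and are NOT the real Cocoon interfaces. The point is only that a generator which is handed its octet stream by whoever assembles the pipeline needs a single method, whether the stream wraps a file or the body of a request.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical, simplified generator contract (an assumption, not Cocoon's API).
interface SketchGenerator {
    void generate(DefaultHandler consumer) throws Exception;
}

// Generator whose octets are supplied by the pipeline assembler.
// The stream may wrap a file or a request body; this class cannot tell.
class StreamGenerator implements SketchGenerator {
    private final InputStream in;

    StreamGenerator(InputStream in) {
        this.in = in;
    }

    @Override
    public void generate(DefaultHandler consumer) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Parse the octets and push SAX events to the next pipeline stage.
        factory.newSAXParser().parse(in, consumer);
    }
}

// Trivial consumer that records element names; stands in for the rest of the pipeline.
class ElementCollector extends DefaultHandler {
    final StringBuilder names = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        names.append(qName).append(' ');
    }
}

class GeneratorSketch {
    // Runs the same generator class over any octet stream and returns the element names seen.
    static String run(InputStream in) throws Exception {
        ElementCollector collector = new ElementCollector();
        new StreamGenerator(in).generate(collector);
        return collector.names.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // Pretend these bytes arrived as the body of an HTTP POST.
        byte[] posted = "<order><item/></order>".getBytes(StandardCharsets.UTF_8);
        System.out.println(run(new ByteArrayInputStream(posted)));
    }
}
```

Nothing in `StreamGenerator` knows or cares where the bytes came from, which is why I keep asking what the extra interface method would be.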

> The structure of the xml from the first step in the pipeline might not
> be in a form that is suitable for the data model that we would like to
> use internally in the system. Reasons for this can be that the xml
> input is supposed to follow some standard or some customer defined
> format. Input adapters for legacy formats will probably produce xml
> that is similar to the input format and repeat all kinds of
> idiosyncrasies from that format. There is thus a need to transform the
> input xml to an xml format more suited to our application specific
> needs. One or several xslt-transformer steps would therefore be
> useful in the input pipeline.

And these sound like transformers to me, unless I'm really missing a
big piece of the puzzle.

> As a last step in the input pipeline the sax events should be adapted
> to some binary format so that e.g. the business logic in the system
> can be applied to it. The xml input could e.g. be serialized to an
> octet stream for storage in a file (as text, xml, pdf, images, ...),
> transformed to java objects for storage in the session object, be put
> into an xml db or into a relational db.

Ah, now I'm starting to get it: you want to detach the pipeline output
from the response!

Yes, I've been thinking about this a lot and I think I do have a
solution (more below).

> Isn't this exactly what an output pipeline does?
> 
> Comparison to Output Pipelines
> ------------------------------
> 
> Both an input and an output pipeline consist of an adaptor from
> a binary format to sax events followed by a (possibly empty) sequence
> of transformers that take sax events as input as well as output. The
> last step is an adaptor from sax events to a binary format. The main
> difference (and the one I will focus on) is how the binary input and
> output are connected to the pipeline.
> 
> Let us look at an example of an output pipeline:
> 
> <match pattern="*.html">
>   <generate type="xml" src="{1}.xml"/>
>   <transform type="xsl" src="foo.xsl"/>
>   <serialize type="html"/>
> </match>
> 
> The input to the pipeline is controlled from the sitemap by the src
> attribute in the generator, while the output from the serializer can't
> be controlled from the sitemap; the context in which the sitemap is
> used is responsible for directing the output to an appropriate
> place. If the pipeline is used from a servlet, the output will be
> directed to the output stream of the response object in the servlet. If
> it is used from the command line, the output will be redirected to a
> file. If it is used in the cocoon: protocol the output will be
> redirected to be used as input from the src attribute of e.g. a
> generator or a transformer (cf. Carsten's and my posts in
> [1] about the semantics of the cocoon: protocol).
> 
> Here is another example:
> 
> <match pattern="bar.pdf">
>   <generate type="xsp" src="bar.xsp"/>
>   <transform type="xsl" src="foo.xsl"/>
>   <serialize type="pdf"/>
> </match>
> 
> In this case the binary input is taken from the object model and the
> component manager in Cocoon and the input file to the generator,
> "bar.xsp" describes how to extract the input and how to structure it
> as an xml document.
> 
> If we compare a Cocoon output pipeline with a unix pipeline, it always
> ignores standard input and always writes to standard output.

Sorry, but this is plain wrong.

Cocoon already ships generators that do *NOT* ignore the request input. 
Extending those components to perform higher-level functionality is 
*NOT* an architectural problem. Or at least, I don't see why it should be.

> An input
> pipeline would be the opposite: it would always read from standard
> input and ignore standard output. In Cocoon this would mean that the
> input source would be set by the context.

What context? Do you imply that input pipelines don't work off
request parameter matching?

> In a servlet, input would be
> taken from the input stream of the request object. We could also have
> a writable cocoon: protocol where the input stream would be set by the
> user of the protocol, more about that later, (see also my post in the
> thread [1]).
> 
> An example:
> 
> <match pattern="**.xls">
>   <generate type="xls"/>
>   <transform type="xsl" src="foo.xsl"/>
>   <serialize type="xml" dest="context://repository/{1}.xml"/>
> </match>

I see two things here:

1) the current pipeline components don't seem to be asymmetric (and this
goes somewhat against what you wrote at the beginning of your email);
the asymmetry is in the fact that the serializer output is *always*
bound to the client response. Am I right in this assumption?

2) what is this pipeline returning to the requesting client? This is not 
SMTP, we have to return something. Sure, we might simply return an HTTP 
header with some error code depending on the result of the 
serialization, but then people will ask how to control that part.

> Here the generator reads an Excel document from the input stream that
> is submitted by the context, and translates it to some xml format. The
> serializer writes its xml input to the file system. I reused the names
> generator and serializer partly because I didn't find any good names
> (deserializer is the inverse of serializer, but what is the inverse of
> a generator?)

There is none, because the opposite of generation would be destruction
and you are definitely *not* destroying something, but still *generating*
it. Where the data the generator uses comes from is *not* an
architectural concern and should not modify the component's name.

>, and partly because it IMO would be the best solution if
> the generator and serializer from output pipelines can be extended to
> be usable in input pipelines as well.

I don't see the need to change anything in pipeline components. IoC 
keeps serializers totally unaware of where they are writing and 
Generators already have access to all request input.

> Several of the existing
> generators would be highly usable in input pipelines if they were
> modified in such a way that they read from "standard input" when no
> src attribute is given.

I lost you here.

> There are also some serializers that would be
> useful in the input pipelines as well; in this case the output stream
> given in the dest attribute should be used instead of the one that is
> supplied by the context. It can of course be problematic to extend the
> definition of generators and serializers as it might lead to backward
> compatibility problems.

Please, tell me what kind of changes to those interfaces you think you'd 
require to implement what you are proposing. It will be much easier to 
follow.

> Another example of an input pipeline:
> 
> <match pattern="in"/>
>   <generate type="textparser">
>     <parameter name="grammar" value="example.txt"/>
>   </generate>
>   <transform type="xsl" src="foo.xsl"/>
>   <serialize type="xsp" src="toSql.xsp"/>
> </match>
> 
> In this example the serializer modifies the content of components that
> can be found through the object model and the component manager. We use a
> hypothetical "output xsp" language to describe how to modify the
> environment. Such a language could be a little bit like xslt in the
> sense that it recursively applies templates (rules) with matching
> xpath patterns. But the template would contain custom tags that have
> side effects instead of just emitting xml. Could such a language be
> implemented in Jelly? It would be useful to have custom tags that modify
> the session object, write to sql databases, connect with business
> logic and so on.

This example is a security nightmare.

> Error Handling
> --------------
> 
> Error handling in input pipelines is even more important than in
> output pipelines: We must protect the system against non-well-formed
> input and the users must be given detailed enough information about
> what's wrong, while they in many cases have no access to log files or
> to the internals of the system.
> 
> Examples of things that can go wrong are that the input is not parsable
> or that it isn't valid with respect to some grammar or schema. If we
> want input pipelines to work in streaming mode, without unnecessary
> buffering, it is impossible to know that the input data is correct until
> all of it is processed. This means that the serializer might already have
> stored some parts of the pipeline data when an error is detected. I
> think that serializers where faulty input data would be unacceptable
> should use some kind of transactions and that they should be notified
> when something goes wrong earlier in the pipeline so that they are
> able to roll back the transaction.
> 
> I have not studied the error handling system in Cocoon, maybe there
> already are mechanisms that could be used in input pipelines as well?

It's entirely possible to have 'ValidationTransformers' that trigger an 
exception if something is wrong, and this exception will be picked up by 
the usual error handler.
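
As a rough illustration of that idea, here is a sketch built only on the JDK's SAX classes. `ValidatingFilter` and `ValidationSketch` are hypothetical names, and the set of allowed element names is a stand-in for a real schema check. The filter passes events through untouched and throws as soon as it sees something it does not accept; the code that assembled the pipeline catches the exception and decides what to return.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical validating stage: forwards SAX events, aborts on unknown elements.
// A real one would validate against a schema; the name set is an assumption.
class ValidatingFilter extends XMLFilterImpl {
    private final Set<String> allowed;

    ValidatingFilter(XMLReader parent, Set<String> allowed) {
        super(parent);
        this.allowed = allowed;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (!allowed.contains(qName)) {
            // Raised in the middle of the stream; later stages see no further events.
            throw new SAXException("invalid element: " + qName);
        }
        super.startElement(uri, localName, qName, atts);
    }
}

class ValidationSketch {
    // Plays the role of the pipeline assembler together with its error handler.
    static String process(String xml, Set<String> allowed) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        ValidatingFilter filter = new ValidatingFilter(reader, allowed);
        try {
            filter.parse(new InputSource(new StringReader(xml)));
            return "ok";
        } catch (SAXException e) {
            // This is where an error document would be produced instead.
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        Set<String> allowed = new HashSet<>(Arrays.asList("order", "item"));
        System.out.println(process("<order><item/></order>", allowed));
        System.out.println(process("<order><hack/></order>", allowed));
    }
}
```

Note that the exception arrives after some events have already been forwarded, so a downstream stage with side effects would still need the rollback you describe.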

> 
> In Sitemaps
> -----------
> 
> In a sitemap an input pipeline could be used e.g. for implementing a
> web service:
> 
> <match pattern="myservice">
>   <generate type="xml">
>     <parameter name="scheme" value="myInputFormat.scm"/>
>   </generate>
>   <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
>   <serialize type="dom-session" non-terminating="true">
>     <parameter name="dom-name" value="input"/>
>   </serialize>
>   <select type="pipeline-state">
>     <when test="success">
>       <act type="my-business-logic"/>
>       <generate type="xsp" src="collectTheResult.xsp"/>
>       <serialize type="xml"/>
>     </when>
>     <when test="non-valid">
>       <!-- produce an error document -->
>     </when>
>   </select>
> </match>
> 
> Here we have first an input pipeline that reads and validates xml
> input, transforms it to some appropriate format and stores the result
> as a dom-tree in a session attribute. A serializer normally means that
> the pipeline should be executed and thereafter an exit from the
> sitemap. I used the attribute non-terminating="true", to mark that
> the input pipeline should be executed but that there is more to do in
> the sitemap afterwards.
> 
> After the input pipeline there is a selector that selects the output
> pipeline depending on whether the input pipeline succeeded or not. This use
> of selection has some relation to the discussion about pipe-aware
> selection (see [3] and the references therein). It would solve at
> least my main use cases for pipe-aware selection, without having its
> drawbacks: Stefano considered pipe-aware selection a mix of concerns,
> since selection should be based on meta data (pipeline state) rather than on
> data (pipeline content). There were also some people who didn't like
> my use of buffering of all input to the pipe-aware selector. IMO the
> use of selectors above solves both of these issues.
> 
> The output pipeline starts with an action that takes care of the
> business logic for the application. This is IMHO a more legitimate use
> for actions than the current mix of input handling and business logic.

Wouldn't the following pipeline achieve the same functionality you want 
without requiring changes to the architecture?

  <match pattern="myservice">
   <generate type="payload"/>
   <transform type="validator">
     <parameter name="scheme" value="myInputFormat.scm"/>
   </transform>
   <select type="pipeline-state">
    <when test="valid">
     <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
     <transform type="my-business-logic"/>
     <serialize type="xml"/>
    </when>
    <otherwise>
     <!-- produce an error document -->
    </otherwise>
   </select>
  </match>

> In Flowscripts
> --------------
> 
> IIRC the discussion and examples of input for flowscripts so far have
> mainly dealt with request parameter based input. If we want to use
> flowscripts for describing e.g. web service flow, more advanced input
> handling is needed. IMO it would be an excellent SOC to use output
> pipelines for the presentation of the data used in the system, input
> pipelines for going from input to system data, java objects (or some
> other programming language) for describing business logic working on
> the data within the system, and flowscripts for connecting all this in
> an appropriate temporal order.

A while ago, I proposed the addition of a new flowscript method that 
would be something like this

  invoquePipeline(uri, parameters, outputStream)

that means that the flow will be calling the pipeline associated with 
the given URI, but the serializer will write on the given outputStream.

Since there were already too many irons in the fire, I wanted to see the 
flowscript settle down before starting to push for this again, but your 
RT brings back pressure on this concept and I think this is all we need 
to remove the asymmetry from cocoon pipelines.
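
Here is a rough sketch of the idea in plain Java. Everything in it is an assumption made for illustration: `SketchSitemap`, `SketchPipeline` and the `invokePipeline` spelling are hypothetical names, not an existing Cocoon or flowscript API. The pipeline serializes to whatever stream it is handed, and the caller picks the destination.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical pipeline contract: it writes to whatever stream it is given.
interface SketchPipeline {
    void process(Map<String, String> parameters, OutputStream out) throws IOException;
}

// Hypothetical stand-in for the sitemap: maps URIs to assembled pipelines.
class SketchSitemap {
    private final Map<String, SketchPipeline> pipelines = new HashMap<>();

    void mount(String uri, SketchPipeline pipeline) {
        pipelines.put(uri, pipeline);
    }

    // Shaped like the proposed flow method: the caller, not the pipeline,
    // decides where the serialized output goes.
    void invokePipeline(String uri, Map<String, String> parameters, OutputStream out)
            throws IOException {
        SketchPipeline pipeline = pipelines.get(uri);
        if (pipeline == null) {
            throw new IllegalArgumentException("no pipeline mounted at " + uri);
        }
        pipeline.process(parameters, out);
        out.flush();
    }
}

class InvokeSketch {
    // Invokes a pipeline by URI and captures its output in memory
    // instead of sending it to a servlet response.
    static String demo() throws IOException {
        SketchSitemap sitemap = new SketchSitemap();
        sitemap.mount("myservice", (params, out) -> {
            String xml = "<result id=\"" + params.get("id") + "\"/>";
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        });

        Map<String, String> params = new HashMap<>();
        params.put("id", "42");
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        sitemap.invokePipeline("myservice", params, buffer);
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());
    }
}
```

The buffer could just as well be a file stream or a stream into a repository; the pipeline and its serializer stay exactly as they are.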

> For Reuseability Between Blocks
> -------------------------------
> 
> There have been some discussions about how to reuse functionality
> between blocks in Cocoon (see the threads [1] and [2] for
> background). IMO (cf. my post in the thread [1]), a natural way of
> exporting pipeline functionality is by extending the cocoon pseudo
> protocol, so that it accepts input as well as produces output. The
> protocol should also be extended so that input as well as output can
> be any octet stream, not just xml.

The above flowscript method could use the URI to connect to
block-contained pipelines... but I'm not sure if this would cover the
entire solution space.

> If we extend generators so that their input can be set by the
> environment (as proposed in the discussion about input pipelines), we
> have what is needed for creating a writable cocoon protocol. The web
> service example in the section "In Sitemaps" could also be used as an
> internal service, exported from a block.
> 
> Both input and output for the extended cocoon protocol can be either
> xml or non-xml; this gives us 4 cases:
> 
> xml input, xml output: could be used from a "pipeline"-transformer,
> the input to the transformer is redirected to the protocol and the
> output from the protocol is redirected to the output of the
> transformer.
> 
> non-xml input, xml output: could be used from a generator.
> 
> xml input, non-xml output: could be used from a serializer.
> 
> non-xml input, non-xml output: could be used from a reader if the
> input is ignored, from a "writer" if the output is ignored and from a
> "reader-writer" if both are used.
> 
> Generators that accept xml should of course also accept sax-events
> for efficiency reasons, and serializers that produce xml should for the
> same reason also be able to produce sax-events.

I still can't see any difference between a reader and a writer (or an 
input-generator vs. output-generator) in terms of interface methods. 
They look totally similar to me. It's the way the sitemap uses them that 
changes their behavior. IoC should enforce that.

> Conclusion
> ----------
> 
> The ability to handle structured input (e.g. xml) in a convenient way,
> will probably be an important requirement on webapp frameworks in the
> near future.

Agreed.

> By removing the asymmetry between generators and serializers, by letting
> the input of a generator be set by the context and the output of a
> serializer be set from the sitemap, Cocoon could IMO be as good in
> handling input as it is today in producing output.

I don't understand what you mean by 'setting the input by the context'.

As for allowing the serializer to have a destination semantic in the
sitemap, I'd be against it because I see it as more harmful than useful.

I do agree that serializers should not be connected only to the servlet
output stream, but this is not a concern of the pipeline itself, but of
whoever assembles the pipeline... and, IMO, the flow logic is the
closest thing to that we have today.

> This would also make it possible to introduce a writable as well as
> readable Cocoon pseudo protocol, that would be a good way to export
> functionality from blocks.

I agree that a writeable cocoon: protocol is required, especially for
blocks, but this doesn't mean we have to change the sitemap semantics
for that.

> There are of course many open questions, e.g. how to implement those
> ideas without introducing too much backward incompatibility.

The best idea is to avoid changing what doesn't require changes and
work to minimize architectural changes from that point on.

But enough for now.

And thanks for keeping up with the input-oriented discussions :-)

-- 
Stefano Mazzocchi                               <stefano@apache.org>
--------------------------------------------------------------------



---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org

