commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peter Ansell <ansell.pe...@gmail.com>
Subject Re: Immutable builder pattern for parsers?
Date Thu, 09 Feb 2017 04:02:03 GMT
Not all fluent configuration APIs are "builder patterns". By that I
mean that you can have a "return this" convention for the builder with
mutable fields without it building or cloning new objects.

A builder pattern is characterised by a terminal method returning a
new instance of an object, avoiding the archaic multiple-constructor
pattern. It is also generally fluent, but there is no requirement that
a builder be fluent. There is also no requirement that the result of
the build() method be fully immutable. I.e., in this case, it could be
a parser object with mutable parsing state, even though its
configuration fields are likely to be immutable.

You can do all of that without a difficult to define and implement
"clone" method, and without reconfiguring multiple builder objects in
the same way.

For example, your code could change to:

     JenaRDFParserFactory factory = new JenaRDFParserFactory()
         .contentType(RDFSyntax.TURTLE)
         .target(g1);

    factory.parse(Path);

where .parse is a terminal method that creates a new parser object and
parses the given "Path" immediately, without the typical build method.
However, you could also support a "build" method. The reason for
naming it a Factory is to make it obvious that you aren't setting
methods on an actual parser, just setting up for one. In this case the
parser itself is only an internal object, and may not even be on the
API

I am not sure how the Future fits a specific use case though, as it
would not return an actual result, as the results are emitted into the
graph as directed by .target(g1). Not everyone wants parsing to spawn
off a new task/thread, particularly if they have maximum latencies to
maintain for service level agreements, and threads may not be able to
be consistently spawned within those time frames.

It may even be possible to support both forms of parsing without
changing the API, as I did for RDF4J-2.2 a few days ago without
changing the parser API, although most of the code already existed and
has been used for many years in the server components:

https://github.com/eclipse/rdf4j/pull/741

That method relies on the parser having statement "handle" methods
which can be implemented using a BlockingQueue. I am not sure how that
would apply to the examples here, which appear to be background
parsers that don't necessarily block on particular parts of the
parser, but attempt to parse all the way through without stopping.

I am just not sure why you would necessarily want to standardise on
off-thread parsing rather than the following which is also
asynchronous and can easily be implemented to be immutable from the
point of the internally created RDFParser:

ForkJoinPool.commonPool().submit(() -> factory.parse(Path));
ForkJoinPool.commonPool().submit(() -> factory.parse(Path));

RDFParser's are not inherently multi-threadable, as they generally
operate on a single concrete byte-stream, which is generally much more
efficient to do in a single thread than in multiple threads. To
multi-thread you add overhead related to splitting the byte stream at
valid split points, which isn't generally worthwhile and sometimes not
possible.

Based on my experience with RDF4J, I would recommend allowing a range
of RDFParserFactory.parse alternatives, including those that don't
require the target or content type to be set on the factory before
getting to the parse method. With those you could do the following:

JenaRDFParserFactory factory = new JenaRDFParserFactory().set(setting,
value).set(setting, value);

And then use the factory as often as necessary with whatever formats,
without having to be stuck to the original format, using:

factory.parse(Path, RDFSyntax, RDFSink);

Those are a few alternatives to think about. The current design seems
to be artificially complex and fixed to a particular use case, per its
use of Future. My main comment is that I would prefer if there were
either example code on how to use an ExecutorService to implement
asynchronous parsing, or a helper utility, rather than it being
required by the core API.

Cheers,

Peter


On 9 February 2017 at 13:38, Stian Soiland-Reyes <stain@apache.org> wrote:
> Peter Ansell raised a valid question about why in Commons RDF I made the
> RDFParser interface as an immutable factory-like builder pattern.
>
> https://commons.apache.org/proper/commons-rdf/apidocs/org/apache/commons/rdf/experimental/RDFParser.html
>
>
> Here is how it can be used today:
>
>  RDF rdf = new JenaRDF();
>  Graph g1 = rdf.createGraph();
>  Future future = new JenaRDFParser()
>          .source(Paths.get("/tmp/graph.ttl"))
>          .contentType(RDFSyntax.TURTLE)
>          .target(g1).parse();
>  future.get(30, TimeUnit.Seconds); // block
>
>
> You may notice it is not a factory as-such, as you only get a Future
> of the parser result, rather than the actual "parser". Also it is
> currently an interface, with multiple implementations (one per RDF
> implementation).
>
>
>
> Although the javadoc only suggests for it to be immutable and
> threadsafe, the abstract class AbstractRDFParser
> https://commons.apache.org/proper/commons-rdf/apidocs/org/apache/commons/rdf/simple/experimental/AbstractRDFParser.html
> https://github.com/apache/commons-rdf/blob/master/simple/src/main/java/org/apache/commons/rdf/simple/experimental/AbstractRDFParser.java
> is both, and thus every "option setter" method therefore return a cloned
> copy of the factory with the mutated fields set there. This simply uses
> .clone() at the moment.
>
> This abstract class and immutability pattern is used by each of the
> three RDF implementations of the interface.
>
>
> This means it's safe to parse many files like this:
>
>  JenaRDFParser parser = new JenaRDFParser()
>          .contentType(RDFSyntax.TURTLE)
>          .target(g1);
>
>  // Parse multiple files in parallell
>  Future future1 = parser.source(Paths.get("/tmp/graph1.ttl")).parse();
>  Future future2 = parser.source(Paths.get("/tmp/graph2.ttl")).parse();
>
>  // Wait for both to finish
>  future1.get(30, TimeUnit.Seconds);
>  future2.get(30, TimeUnit.Seconds);
>
>
> Above the second call to parser.source(..) is not a problem (even if
> it happens in another thread) as both .source() calls return separate
> cloned JenaRDFParser instances and don't mutate the "parser" instance.
>
>
> Now a valid critique of such a pattern is that it's wasteful to make
> clones that are immediately thrown away, for instance the above
> future1/future2 example would cause 7 instantiations of JenaRDFParser()
> (.parse() does some finalization in a final clone).
>
> I have not done any benchmarking, but envision that as parsing RDF files
> generally creates many thousands of RDFTerm instances, a couple of
> extra parser instantiations shouldn't make a big dent in the bill.
>
>
> The typical builder pattern is rather to return "this" and mutate fields
> directly. This is obviously not thread-safe but generally achives the
> same. Multiple threads would however instead have to create a brand new
> parser from start and set all the options themselves.  (Or they could a
> similar .clone() method from a carefully locked down "proto" instance
> that everyone has to be careful not to call any mutating methods on)
>
>
> Now multiple threads is now very common because of Java 8 streams, and
> for instance it would not be unthinkable that a parallelStream() is
> mapped as a .source() or .target() on such a parser builder. The Futures
> fit somewhat into this picture as well.   I think this would get
> complicated with having to set everything again rather than just pass
> "parser::source" as a method reference.
>
>
>
>
> As this was just thrown together as an experiment (and we've not yet
> tried writing the corresponding RDFWriter interface), I think we could
> revisit how to do this pattern, hopefully drawing on the wider
> experience of the Apache Commons developers on this list.
>
>
>
> Roughly here are my requirements:
>
> * Ability to set various options (e.g. path of file to parse, format,
>   destination) in any order
> * Avoid mega-constructors/multi-arg methods
> * Some options take multiple types, e.g a source can be given as a Path,
>   InputStream or IRI (URL) -- but would override each other
>
> Nice to have:
>
> * Reusable (e.g. set options that are common, but modify
>   source() to parse multiple files)
> * Thread-safe (e.g. immediate reuse of builder before a parsing session
>   has started or finished)
> * Easy to serialize / keep around (should not keep references to big
>   objects in memory uneccessarily)
> * Parsing in the background, e.g. return Future. (implementation manages
>   threads/queues as it pleases)
>
> Unclear:
>
> * How to select between the multiple RDFParser instances?
>   I was thinking the RDF interface could have a .makeParser() method,
>   but not all parsers can take all formats, so possibly some kind of
>   registry or "what can you handle" mechanism might be needed.
>
>
> How do you think we should proceed? Mutable or immutable? How should the
> settings be kept? Fields, map, or what?  Does it make sense with an
> interface, abstract class (keeps settings) and implementations
> (processess settings), or should we have a single ParserFactory class
> and have a new internal interface below?
>
>
> --
> Stian Soiland-Reyes
> University of Manchester
> http://www.esciencelab.org.uk/
> http://orcid.org/0000-0001-9842-9718
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message