creadur-dev mailing list archives

From Robert Burrell Donkin <>
Subject RAT: IHeaderMatcher Design
Date Fri, 12 Jul 2013 19:26:22 GMT
Rat spends a lot of effort parsing textual documents, looking for 
headers and boilerplate text. There's an extension point (of sorts) for 
the searches that can be performed, provided by IHeaderMatcher[1].

This interface has a few TODOs in it. It's used by pushing the text in 
one line at a time, after some pre-processing. As the TODO indicates, 
this may not be the most elegant design.

As an extension point, IHeaderMatcher has the advantage of flexibility. 
It would be possible to plug in radically different implementations. It 
turns out, though, that few clever new implementations have emerged. All 
the existing implementations seem to do is check for license headers.

One disadvantage of this arrangement is that it pushes some of the 
parsing outwards, toward supposedly pluggable implementations. So 
adding a new license means writing a partial parser.
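
To make the "partial parser" point concrete, here is a minimal sketch of what a push-style matcher tends to look like. LineMatcher is a hypothetical, simplified stand-in for IHeaderMatcher (no Document parameter, no checked exception), and the "Example License" wording is invented for illustration. The thing to notice is that the matcher has to do its own normalisation and carry its own state between lines.

```java
import java.util.Locale;

// Hypothetical, simplified stand-in for a push-style matcher.
// This is NOT the real IHeaderMatcher signature.
interface LineMatcher {
    void reset();
    boolean match(String line);
}

// Every new license needs a class like this: part data, part parser.
class ExampleLicenseMatcher implements LineMatcher {
    // Invented header fragments, lower-cased for comparison.
    private static final String FIRST = "licensed under the example license";
    private static final String SECOND = "version 1.0";

    // Parsing state that must survive between calls to match().
    private boolean seenFirst;

    @Override
    public void reset() {
        seenFirst = false;
    }

    @Override
    public boolean match(String line) {
        // Normalisation is repeated inside each implementation.
        String normalised = line.toLowerCase(Locale.ROOT);
        if (normalised.contains(FIRST)) {
            seenFirst = true;
        }
        // Only report a match once both fragments have been seen in order.
        return seenFirst && normalised.contains(SECOND);
    }
}

class PushMatcherDemo {
    public static void main(String[] args) {
        LineMatcher matcher = new ExampleLicenseMatcher();
        matcher.reset();
        String[] lines = {
            "/*",
            " * Licensed under the Example License,",
            " * Version 1.0",
            " */"
        };
        for (String line : lines) {
            if (matcher.match(line)) {
                System.out.println("header found at: " + line.trim());
            }
        }
    }
}
```

The license text itself is only two string constants. Everything else in the class is parsing machinery that the next license would have to duplicate.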

I wonder whether it might be more intuitive (as well as opening 
potential for faster parsing) to use immutable domain objects for 
licenses and so on, making them data rather than processors.
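
As a rough sketch of the "data rather than processors" idea, and only under my own assumptions: LicenseDefinition and HeaderScanner below are hypothetical names, not existing Rat classes, and the substring comparison is just the simplest thing that could work. A license becomes an immutable value object, and one generic scanner owns all normalisation and matching.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical immutable value object: a license is described by data only.
final class LicenseDefinition {
    private final String id;
    private final String name;
    private final String headerText;

    LicenseDefinition(String id, String name, String headerText) {
        this.id = id;
        this.name = name;
        this.headerText = headerText;
    }

    String id() { return id; }
    String name() { return name; }
    String headerText() { return headerText; }
}

// Hypothetical generic scanner: the only place that knows how to parse.
final class HeaderScanner {
    private final List<LicenseDefinition> licenses;

    HeaderScanner(List<LicenseDefinition> licenses) {
        // Defensive copy so the scanner is immutable too.
        this.licenses = Collections.unmodifiableList(new ArrayList<>(licenses));
    }

    // Lower-case and collapse every run of non-alphanumerics to one space,
    // so comment markers, punctuation and line breaks do not matter.
    static String normalise(String text) {
        return text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim();
    }

    // Returns the first license whose normalised header occurs in the text.
    Optional<LicenseDefinition> firstMatch(String documentText) {
        String haystack = normalise(documentText);
        for (LicenseDefinition license : licenses) {
            if (haystack.contains(normalise(license.headerText()))) {
                return Optional.of(license);
            }
        }
        return Optional.empty();
    }
}

class LicenseDataDemo {
    public static void main(String[] args) {
        List<LicenseDefinition> known = new ArrayList<>();
        known.add(new LicenseDefinition("EX", "Example License",
                "Licensed under the Example License, Version 1.0"));
        HeaderScanner scanner = new HeaderScanner(known);

        String source = "/*\n * Licensed under the Example License,\n * Version 1.0\n */";
        scanner.firstMatch(source)
               .ifPresent(l -> System.out.println("matched: " + l.name()));
    }
}
```

With this shape, adding a license is one more LicenseDefinition, and any speed-up (pre-normalising the headers once, say, or a smarter multi-pattern search) would only touch HeaderScanner. The cost is the flexibility mentioned above: a radically different kind of matcher no longer plugs in.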

Opinions...? Alternatives...?


[1] Excerpt from IHeaderMatcher:

/**
 * Resets this matches.
 * Subsequent calls to {@link #match} will accumulate new text.
 */
public void reset();

/**
 * Matches the text accumulated to licenses.
 * TODO probably a poor design choice - hope to fix later
 * @param subject TODO
 * @param line next line of text, not null
 * @return TODO
 */
public boolean match(Document subject, String line) throws 
