creadur-dev mailing list archives

From Robert Burrell Donkin <robertburrelldon...@blueyonder.co.uk>
Subject Easier New License Addition
Date Mon, 28 Apr 2008 21:26:47 GMT
ATM the license readers are hard coded. this just won't scale. it would
be better to read some meta-data linking a header to a URL and then to
a license URL.

i've been trying to think of more efficient ways to parse the headers as
the number of possible headers grows.

i've been thinking about creating a specialised tokeniser which strips
an extended set of whitespace characters. this set could either be
guessed from the document MIME type or hard coded (not sure which would
be best) and would include punctuation. the tokeniser would also convert
everything to upper case and produce a stream of words. it should be good
enough to ignore words which are too long (>20 characters, say), which
means a word -> number mapping can be used to reduce each word to a
fixed number of longs.
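
a rough sketch of what such a tokeniser might look like, just to make the idea concrete. the class name WordTokeniser, the 6-bit character codes and the choice of two longs per word are assumptions made for illustration, not existing code. it also hard codes "anything that is not a letter or digit" as the separator set rather than guessing it from the MIME type.

```java
import java.util.ArrayList;
import java.util.List;

// hypothetical sketch: splits text into upper-cased words and packs each
// word of at most 20 characters into two longs (10 characters x 6 bits each)
final class WordTokeniser {
    static final int MAX_WORD_LENGTH = 20; // longer words are ignored
    static final int CHARS_PER_LONG = 10;  // 10 x 6 bits = 60 bits per long

    // maps A-Z to 1..26 and 0-9 to 27..36; 0 means "separator"
    static int code(char c) {
        char u = Character.toUpperCase(c);
        if (u >= 'A' && u <= 'Z') return u - 'A' + 1;
        if (u >= '0' && u <= '9') return u - '0' + 27;
        return 0; // whitespace, punctuation and everything else
    }

    static List<long[]> tokenise(CharSequence text) {
        List<long[]> words = new ArrayList<>();
        long[] current = new long[2];
        int length = 0;
        // one extra iteration acts as a final separator to flush the last word
        for (int i = 0; i <= text.length(); i++) {
            int code = i < text.length() ? code(text.charAt(i)) : 0;
            if (code != 0) {
                if (length < MAX_WORD_LENGTH) {
                    int slot = length / CHARS_PER_LONG;
                    current[slot] = (current[slot] << 6) | code;
                }
                length++;
            } else if (length > 0) {
                if (length <= MAX_WORD_LENGTH) {
                    words.add(current); // keep the word
                }
                current = new long[2];  // start afresh, dropping over-long words
                length = 0;
            }
        }
        return words;
    }
}
```

with 37 possible codes, 6 bits per character is enough, so a 20 character limit gives exactly two longs per word. note that "2.0" comes out as two words ("2" and "0") because the full stop is treated as a separator.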

for each license, generate a state machine upon initialisation. some
limited ability to handle simple regexes would be needed to cope with
license families (? meaning one or none would be enough to start with).
it should be possible to use bitwise operations to compare words in the
document with words in the license.
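
one way the matching side could be sketched, under some assumptions: the class LicenseMatcher, its pattern syntax (a trailing ? marks an optional word) and the pack helper are all made up for illustration. the machine is simulated as a non-deterministic one with the set of active states held in a single long, rather than being compiled to a deterministic table, and the input is assumed to be already reduced to plain words separated by whitespace.

```java
import java.util.Locale;

// hypothetical sketch: one matcher per license, built once from a word pattern
final class LicenseMatcher {
    private final long[][] words;     // packed pattern words
    private final boolean[] optional; // true where the pattern word ended in '?'

    LicenseMatcher(String pattern) {
        String[] parts = pattern.trim().split("\\s+");
        if (parts.length > 62) throw new IllegalArgumentException("pattern too long");
        words = new long[parts.length][];
        optional = new boolean[parts.length];
        for (int i = 0; i < parts.length; i++) {
            optional[i] = parts[i].endsWith("?");
            String word = optional[i] ? parts[i].substring(0, parts[i].length() - 1) : parts[i];
            words[i] = pack(word);
        }
    }

    // packs up to 20 characters into two longs, 6 bits per character;
    // anything beyond 20 characters is simply cut off in this sketch
    static long[] pack(String word) {
        long[] packed = new long[2];
        String upper = word.toUpperCase(Locale.ROOT);
        for (int i = 0; i < upper.length() && i < 20; i++) {
            char c = upper.charAt(i);
            int code = (c >= 'A' && c <= 'Z') ? c - 'A' + 1
                     : (c >= '0' && c <= '9') ? c - '0' + 27 : 37;
            packed[i / 10] = (packed[i / 10] << 6) | code;
        }
        return packed;
    }

    // bit i set means "the first i pattern words have been matched";
    // an optional word may be skipped, so its successor state is active too
    private long closure(long states) {
        for (int i = 0; i < words.length; i++) {
            if ((states & (1L << i)) != 0 && optional[i]) states |= 1L << (i + 1);
        }
        return states;
    }

    boolean matches(String text) {
        long accept = 1L << words.length;
        long active = closure(1L);
        if ((active & accept) != 0) return true;
        for (String token : text.trim().split("\\s+")) {
            long[] w = pack(token);
            long next = 0;
            for (int i = 0; i < words.length; i++) {
                // two words are equal when XOR of both halves is zero
                boolean same = ((words[i][0] ^ w[0]) | (words[i][1] ^ w[1])) == 0;
                if (same && (active & (1L << i)) != 0) next |= 1L << (i + 1);
            }
            active = closure(next | 1L); // state 0 stays active: match may start anywhere
            if ((active & accept) != 0) return true;
        }
        return false;
    }
}
```

for example, new LicenseMatcher("GNU LESSER? GENERAL PUBLIC LICENSE") would accept text containing either "gnu general public license" or "gnu lesser general public license", which is the sort of license family the ? is meant to cover. whether this beats java.util.regex would need measuring.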

sounds complex, i know. is it likely to be faster than java's regex?

opinions?

- robert


