devicemap-commits mailing list archives

Subject svn commit: r1650383 - /devicemap/branches/2.0/data/README_PATTERNS
Date Thu, 08 Jan 2015 21:26:25 GMT
Author: rezan
Date: Thu Jan  8 21:26:25 2015
New Revision: 1650383

pattern spec


Added: devicemap/branches/2.0/data/README_PATTERNS
--- devicemap/branches/2.0/data/README_PATTERNS (added)
+++ devicemap/branches/2.0/data/README_PATTERNS Thu Jan  8 21:26:25 2015
@@ -0,0 +1,103 @@
+Patterns 2.0 Draft 1
+Everything in here is assumed to be UTF-8.
+It is completely valid for a client to return an initialization error
+if it cannot support the pattern file as configured.
+The pattern file has a file format version number.
+Each pattern file has a header. It defines the following attributes
+which instruct the client how to parse the input:
+-Token separators: a list of strings
+-N-gram size: an int
+-Transformers: a set of regular expressions, TODO: define this better
+The input gets tokenized using the separators. It then gets n-gram'ed. The
+default n-gram size is 1. Each n-gram is then passed through the optional
+transformers. The output of this process is a stream of pattern tokens which
+are passed into the pattern matcher as they are produced. Pattern tokens must
+be streamed in order. If an n-gram size > 1 is configured, the largest n-gram
+must be processed before moving on to the smaller ones.
+So for example, a domain can set its separator as a space, n-gram size of 2,
+and a lowercasing transformer expression. The following string:
+A 12 xyZ
+Produces the following pattern token stream:
+a12, a, 12xyz, 12, xyz
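The example above can be reproduced with a minimal Python sketch. The names tokenize and pattern_tokens are hypothetical, and transformers are modeled as plain callables because the spec leaves their exact form as a TODO. N-grams are joined without a separator, which is inferred from the example output rather than stated in the spec.

```python
import re
from typing import Callable, Iterator, List, Sequence


def tokenize(text: str, separators: Sequence[str]) -> List[str]:
    """Split text on any of the separator strings, dropping empty tokens."""
    if not separators:
        return [text] if text else []
    # Try longer separators first so that overlapping separators behave predictably.
    ordered = sorted(separators, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    return [t for t in re.split(alternation, text) if t]


def pattern_tokens(
    text: str,
    separators: Sequence[str],
    ngram_size: int = 1,
    transformers: Sequence[Callable[[str], str]] = (),
) -> Iterator[str]:
    """Yield pattern tokens in stream order, largest n-gram first per position."""
    tokens = tokenize(text, separators)
    for start in range(len(tokens)):
        for size in range(ngram_size, 0, -1):
            if start + size > len(tokens):
                continue  # not enough tokens left for an n-gram of this size
            ngram = "".join(tokens[start:start + size])
            for transform in transformers:
                ngram = transform(ngram)
            yield ngram


print(list(pattern_tokens("A 12 xyZ", [" "], ngram_size=2, transformers=[str.lower])))
# prints ['a12', 'a', '12xyz', '12', 'xyz']
```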
+Each pattern file has a set of patterns. Each pattern defines its matching
+attributes. The highest ranking pattern is returned. All patterns are matched
+using simple UTF-8 string matching. This allows for simple hashtable matching
+between a pattern token and the resulting pattern, so it is very fast and has
+the scaling properties of the underlying hashtable implementation.
+Pattern attributes:
+-PatternId: string, must be unique in the pattern set
+-RankType: string
+-Rank: int
+-PatternType: string
+-Pattern: object, its attributes are defined by the PatternType
+Also, one pattern can be configured to be the default pattern returned
+when no patterns match.
+RankType is either Strong, Weak, or None. Strong patterns ignore the Rank
+attribute and are ranked by their position in the pattern token stream.
+Therefore, when a Strong pattern is matched, it can be immediately returned
+and no more processing is required. In the absence of a Strong pattern, the
+highest ranking Weak pattern is returned. In the absence of a Strong and Weak
+pattern, the highest ranking None pattern is returned. In the case of Weak
+and None, the whole pattern token stream must be processed. If no match is
+found, the default pattern is returned. If no default is defined, a null pattern
+is returned. In all cases, just the patternId is returned.
+Rank is an integer used to rank patterns among their RankType. In the case of
+a tie, the pattern with the longest concatenated length of pattern tokens is
+returned. If that causes another tie, the first pattern found is returned.
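The ranking rules above can be sketched in Python as follows. The Candidate class and the select function are hypothetical names, the sketch assumes that a larger Rank integer means a higher ranking, and it takes a finished candidate list rather than returning early on the first Strong match as a streaming client could.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    pattern_id: str
    rank_type: str       # "Strong", "Weak", or "None"
    rank: int            # ignored for Strong patterns
    matched_length: int  # concatenated length of the matched pattern tokens


def select(candidates: List[Candidate], default_id: Optional[str] = None) -> Optional[str]:
    """Pick the winning PatternId; candidates are in the order they were found."""
    # A Strong pattern wins by position in the stream, so the first one found is returned.
    for candidate in candidates:
        if candidate.rank_type == "Strong":
            return candidate.pattern_id
    # Otherwise the best Weak pattern wins, then the best None pattern.
    for rank_type in ("Weak", "None"):
        tier = [c for c in candidates if c.rank_type == rank_type]
        if tier:
            # Ties on rank fall back to matched length; max() keeps the first
            # maximal element, so the earliest candidate wins any remaining tie.
            return max(tier, key=lambda c: (c.rank, c.matched_length)).pattern_id
    # No match: the default pattern if one is configured, else a null result.
    return default_id
```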
+The following pattern types are defined:
+SimpleOrderedAnd - This pattern is an array of strings. Each string in the array
+must appear in the pattern token stream in array index order. It is valid for
+other pattern tokens to appear in between the matched patterns, and there is no
+minimum or maximum proximity that the matched patterns must appear in.
+SimpleAnd - This pattern is an array of strings. Each string in the array must
+appear in the pattern token stream, in any order. It is OK for other pattern
+tokens to appear in between these patterns. No min or max proximity is defined.
+SimpleOr - This pattern is an array of strings. Only one of the strings must
+appear in the pattern stream for this pattern to be valid.
+Simple - This pattern is a single string. It must appear in the pattern stream
+to be valid.
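The four pattern types can be illustrated with a short Python sketch. The function name matches is hypothetical, and for simplicity it checks a fully collected list of pattern tokens instead of matching them one at a time as they stream in.

```python
from typing import List, Sequence, Union


def matches(pattern_type: str, pattern: Union[str, Sequence[str]], tokens: List[str]) -> bool:
    """Return True if the pattern is satisfied by the pattern token list."""
    if pattern_type == "Simple":
        # A single string that must appear somewhere in the stream.
        return pattern in tokens
    if pattern_type == "SimpleOr":
        # At least one of the strings must appear.
        return any(p in tokens for p in pattern)
    if pattern_type == "SimpleAnd":
        # Every string must appear, in any order.
        return all(p in tokens for p in pattern)
    if pattern_type == "SimpleOrderedAnd":
        # Every string must appear in array index order. Searching a shared
        # iterator consumes tokens, so each string is looked for only after
        # the position where the previous one was found.
        remaining = iter(tokens)
        return all(p in remaining for p in pattern)
    raise ValueError(f"unknown PatternType: {pattern_type}")


stream = ["a12", "a", "12xyz", "12", "xyz"]
print(matches("SimpleOrderedAnd", ["a", "xyz"], stream))  # prints True
print(matches("SimpleOrderedAnd", ["xyz", "a"], stream))  # prints False
```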
+Note, while the word "stream" is used to describe the set of pattern tokens,
+this is a single-threaded serial process. Think of a for loop which iterates
+over the tokens, generates the pattern tokens, finds pattern candidates, stores
+them, and then returns the winning pattern.
+When a PatternId is returned from the matching phase, it is used to look up the
+matching attributes in the attributes file. The attributes are returned as
+a key-value map.
+TODO: Null pattern?
