devicemap-commits mailing list archives

From re...@apache.org
Subject svn commit: r1650774 - /devicemap/branches/2.0/data/README_PATTERNS
Date Sat, 10 Jan 2015 17:57:58 GMT
Author: rezan
Date: Sat Jan 10 17:57:58 2015
New Revision: 1650774

URL: http://svn.apache.org/r1650774
Log:
wiki sync

Modified:
    devicemap/branches/2.0/data/README_PATTERNS

Modified: devicemap/branches/2.0/data/README_PATTERNS
URL: http://svn.apache.org/viewvc/devicemap/branches/2.0/data/README_PATTERNS?rev=1650774&r1=1650773&r2=1650774&view=diff
==============================================================================
--- devicemap/branches/2.0/data/README_PATTERNS (original)
+++ devicemap/branches/2.0/data/README_PATTERNS Sat Jan 10 17:57:58 2015
@@ -1,10 +1,11 @@
 <<TableOfContents(2)>>
 
-= Pattern Specification 2.0 =
-Draft 1, 2014-01-09
+= Data Specification 2.0 =
+Draft 1, 2014-01-10
 
 This is the DeviceMap data specification for patterns and attributes.
 
+
 === Overview ===
 
 This document goes over how DeviceMap data 2.0 (domains) are defined and how the
@@ -28,15 +29,21 @@ The following definitions are used:
  :: this is a complete pattern definition with an id, type, rank, and pattern tokens
 
  pattern tokens::
- :: these are the individual pattern strings which comprise a pattern
+ :: these are the individual pattern strings which comprise a pattern type
 
  pattern type::
- :: this defines how the pattern tokens must appear in the input string for the pattern to be valid
+ :: this defines how the pattern tokens must be matched against the token stream
 
  matched tokens::
  :: these are pattern tokens which are successfully matched in the token stream
 
-The pattern and attribute files are JSON objects. These objects will contain:
+ candidate::
+ :: this is a pattern which has successfully matched a pattern type in the token stream
+
+
+=== Format ===
+
+The pattern and attribute objects are JSON objects (TODO: better define this). These objects will contain:
 
  * Format version
  * Name
@@ -52,7 +59,7 @@ The objects will also contain the attrib
 
 This step parses the input string and creates the token stream.
 
-Each pattern file defines the input parsing rules:
+Each pattern object defines the input parsing rules:
 
  InputTransformers::
  :: Type: list of transformation steps
@@ -64,7 +71,7 @@ Each pattern file defines the input pars
  :: Optional. Default: none
 
  NgramConcatSize::
- :: Type: greater than zero integer
+ :: Type: integer, greater than zero
  :: Optional. Default: 1
 
 The input string first gets processed thru the transformers.
@@ -104,10 +111,10 @@ Ngram:         a_NUM, a, _NUMxyz, _NUM,
 
 = Pattern Matching =
 
-This step processes the token stream and picks the highest ranking pattern which
-matches on the stream.
+This step processes the token stream and returns the highest ranking pattern which
+matches on the stream (highest ranking candidate).
 
-The pattern file defines a set of patterns. Each pattern has 2 main attributes,
+The pattern object defines a set of patterns. Each pattern has 2 main attributes,
 its pattern type and its pattern rank. The pattern
 type defines how the pattern is supposed to be matched against the token stream.
 The pattern rank defines how the pattern ranks against other patterns.
@@ -116,9 +123,9 @@ If the pattern type is successfully matc
 for being returned. Candidates are ranked against each other using the pattern ranking
 and the highest ranking pattern is returned.
 
-All the pattern types are prefixed with 'Simple'. This means that each pattern token is matched
-using a plain UTF8 string comparison. No regex or other syntax is allowed in Simple patterns.
-This allows the algorithm to use simple string hashing for matching. This gives maximum performance and scaling complexity equal to a hashtable implementation. A Simple``Hash``Count attribute can be optionally defined which hints the classifier as to how many unique hashes it would need to generate to support the pattern set.
+All the pattern types in 2.0 are prefixed with 'Simple'. This means that each pattern token is matched
+using a plain byte string comparison. No regex or other syntax is allowed in Simple patterns.
+This allows the algorithm to use simple byte or string hashing for matching. This gives maximum performance and scaling complexity equal to a hashtable implementation. A Simple``Hash``Count attribute can be optionally defined which hints the classifier as to how many unique hashes it would need to generate to support the pattern set.
 
 Pattern attributes:
 
@@ -131,7 +138,7 @@ Pattern attributes:
  :: Required.
 
  RankValue::
- :: Type: integer
+ :: Type: integer, 0 to 1000
  :: Optional. Default: 0.
 
  PatternType::
@@ -142,10 +149,15 @@ Pattern attributes:
  :: Type: list of pattern token strings
  :: Required.
 
- Default::
- :: Type: boolean
- :: Optional. Default: false.
- :: Only 1 pattern can have a true value.
+Pattern set attributes:
+
+ DefaultId::
+ :: Type: string
+ :: Optional. Default: none
+
+ SimpleHashCount::
+ :: Type: integer, greater than zero
+ :: Optional. Default: none. Must be defined before the pattern set.
 
 
 == PatternType ==
@@ -159,7 +171,7 @@ The following pattern types are defined:
  :: Each pattern token must appear in the token stream. Order does not matter.
 
  Simple::
- :: Only one pattern must appear in the token stream.
+ :: Only one pattern token must appear in the token stream.
 
 
 == RankType ==
@@ -167,7 +179,7 @@ The following pattern types are defined:
 The following rank types are defined:
 
  Strong::
- :: Strong patterns are ranked higher than Weak and None. The Rank``Value is ignored and they are ranked by their position in the pattern stream. The lower the position, the higher the rank. When a Strong pattern is found, the pattern matching step can stop and this pattern can be returned without analyzing the rest of the stream. This is because its impossible for another pattern to rank higher.
+ :: Strong patterns are ranked higher than Weak and None. The Rank``Value is ignored and they are ranked by their position in the pattern stream. Specifically, the last matched token position. The lower the position, the higher the rank. When a Strong pattern is found, the pattern matching step can stop and this pattern can be returned without analyzing the rest of the stream. This is because its impossible for another pattern to rank higher.
 
  Weak::
  :: Weak patterns are ranked below Strong but above None. A Weak pattern can only be returned in the absence of a Strong pattern. Weak patterns always rank higher than None patterns, regardless of their Rank``Value. The Rank``Value is used to rank between successfully matched Weak patterns.
@@ -175,22 +187,20 @@ The following rank types are defined:
  None::
  :: None patterns are ranked below Strong and Weak. A None pattern can only be returned in the absence of successful Strong and Weak patterns. The Rank``Value is used to rank between successfully matched None patterns.
 
-In the case where 2 or more Weak or None patterns have the same Rank``Value resulting in a tie,
+In the case where 2 or more patterns have the same Rank``Type and Rank``Value resulting in a tie,
 the pattern with the longest concatenated matched pattern length is used. If that results in
-another tie, the pattern found first is returned.
+another tie, the pattern with the first matched token found is returned.
 
-If no pattern is successfully matched, the default pattern is returned. If no
-default pattern is defined, a null pattern is returned.
+If no pattern is successfully matched, the Default``Id is returned. If no
+Default``Id is defined, a null pattern is returned.
 
 === Notes ===
 
 If 2 or more patterns share the same Pattern``Id, then only 1 of their Pattern``Types
 need to match. There is an implied OR between multiple Pattern``Types with equal Pattern``Id.
 
-If more than 1 default is defined, the 1st one found in the Pattern file is used.
-
-2 or more patterns cannot have identical Rank``Type, Rank``Value, and matched tokens. Since they will be
-found at the same time, the pattern the classifier chooses is undefined.
+2 or more patterns cannot have identical Rank``Type, Rank``Value, and matched tokens. This is undefined behavior since they will be
+found at the same time. The pattern the classifier chooses can be random.
 
 
 === Examples ===
@@ -246,17 +256,15 @@ TODO: define this more
 
 If no attribute map is found, an empty map is used.
 
-The attribute map must be immutable.
-
 If a null pattern is returned from the previous step, this must be safely returned.
 TODO: how?
 
 
 
-= Patch Files =
+= Patch Objects =
 
-The pattern and attribute files can be patched with a user created pattern and
-attribute file. In this case, parsing configurations override, pattern
+The pattern and attribute objects can be patched with a user created pattern and
+attribute objects. In this case, parsing configurations override, pattern
 definitions get appended (you can override using pattern ranking), and attributes
 override using the Pattern``Id.
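
Editor's illustration (not part of the commit): the candidate ranking rules in the Rank``Type hunk above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the DeviceMap implementation. The names Candidate, sort_key and pick are hypothetical, a larger Rank``Value is assumed to rank higher, and the length and first-token tie-breaks are assumed to apply to Strong candidates as well.

```python
from dataclasses import dataclass
from typing import List, Optional

# Strong outranks Weak, which outranks None (per the RankType section).
RANK_TYPE_ORDER = {"Strong": 2, "Weak": 1, "None": 0}


@dataclass
class Candidate:
    """Hypothetical record for a pattern that matched the token stream."""
    pattern_id: str
    rank_type: str             # "Strong", "Weak", or "None"
    rank_value: int            # 0 to 1000; ignored for Strong
    matched_tokens: List[str]  # pattern tokens found in the stream
    first_pos: int             # stream index of the first matched token
    last_pos: int              # stream index of the last matched token


def sort_key(c: Candidate):
    """Build a tuple where a larger value means a higher rank."""
    matched_len = len("".join(c.matched_tokens))
    if c.rank_type == "Strong":
        # Strong ignores RankValue: a lower last matched position wins.
        primary = -c.last_pos
    else:
        # Assumption: a larger RankValue ranks higher.
        primary = c.rank_value
    # Tie-breaks: longest concatenated match, then earliest first token.
    return (RANK_TYPE_ORDER[c.rank_type], primary, matched_len, -c.first_pos)


def pick(candidates: List[Candidate],
         default_id: Optional[str] = None) -> Optional[str]:
    """Return the highest ranking pattern id, else DefaultId, else None."""
    if not candidates:
        return default_id
    return max(candidates, key=sort_key).pattern_id
```

For instance, a Weak candidate with Rank``Value 900 would be picked over a None candidate with Rank``Value 1000, since Rank``Type is compared before Rank``Value. The early exit described for Strong patterns is omitted here for brevity; the sketch ranks a fully collected candidate list instead.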
 


