devicemap-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Devicemap Wiki] Update of "DataSpec2" by rezan
Date Sun, 18 Jan 2015 08:00:39 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Devicemap Wiki" for change notification.

The "DataSpec2" page has been changed by rezan:
https://wiki.apache.org/devicemap/DataSpec2?action=diff&rev1=39&rev2=40

   domain::
   :: a versioned pattern and attribute file
  
+ 
+ 
+ = Input Parsing =
+ 
+ This step parses the input string and creates the token stream.
+ 
+ Each pattern file defines these input parsing rules:
+ 
+  InputTransformers::
+  :: Type: list of transformers
+  :: Optional. Default: none
+ 
+  TokenSeparators::
+  :: Type: list of token separator strings
+  :: Optional. Default: none
+ 
+  NgramConcatSize::
+  :: Type: integer, greater than zero
+  :: Optional. Default: 1
+ 
+ The input string is first passed through the transformers. It is then
+ tokenized using the configured separators, and finally ngram concatenation
+ is applied. The result of these 3 steps is the token stream.
+ 
+ 
+ === Notes ===
+ 
+ Empty tokens are removed during the tokenization step.
+ 
+ When a token is added to the token stream, it can be processed by the
+ pattern matching step before moving on to the next token. This algorithm is
+ pipeline safe and thread safe.
+ 
+ If the Ngram``Concat``Size is greater than 1, ngrams must be added to the token stream
+ ordered largest to smallest.
+ 
+ 
+ === Example ===
+ 
+ {{{
+ InputTransformers: Lowercase(), ReplaceAll(find: '-', replaceWith: '')
+ TokenSeparators:   [space]
+ NgramConcatSize:   2
+ 
+ Input string:  'A 12 x-yZ'
+ 
+ Transform:     'a 12 xyz'
+ 
+ Tokenization:  a, 12, xyz
+ 
+ Ngram:         a12, a, 12xyz, 12, xyz
+ }}}
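
The three steps above can be sketched in Python as follows. This is only an illustrative sketch, not a reference implementation: the function names (`tokenize`, `ngrams`, `parse_input`) are hypothetical, transformers are modeled as plain callables, and the ngram order is assumed to be "for each token position, longest concatenation first", which is what the example above shows.

```python
from typing import Callable, List

Transformer = Callable[[str], str]


def tokenize(text: str, separators: List[str]) -> List[str]:
    """Split on every separator string and drop empty tokens."""
    tokens = [text]
    for sep in separators:
        tokens = [part for tok in tokens for part in tok.split(sep)]
    return [t for t in tokens if t]


def ngrams(tokens: List[str], size: int) -> List[str]:
    """For each position, emit concatenations from largest to smallest."""
    out = []
    for i in range(len(tokens)):
        longest = min(size, len(tokens) - i)
        for n in range(longest, 0, -1):
            out.append("".join(tokens[i:i + n]))
    return out


def parse_input(text: str, transformers: List[Transformer],
                separators: List[str], ngram_size: int = 1) -> List[str]:
    """Transform, tokenize, then build the ngram token stream."""
    for transform in transformers:
        text = transform(text)
    return ngrams(tokenize(text, separators), ngram_size)


# Stand-ins for Lowercase() and ReplaceAll(find: '-', replaceWith: '').
stream = parse_input(
    "A 12 x-yZ",
    [str.lower, lambda s: s.replace("-", "")],
    [" "],
    2,
)
print(stream)  # ['a12', 'a', '12xyz', '12', 'xyz']
```

A streaming classifier would yield tokens one at a time instead of building a list; the list form is used here only to keep the sketch short.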
+ 
+ 
+ 
+ = Pattern Matching =
+ 
+ This step processes the token stream and returns the highest ranking candidate pattern.
+ 
+ The pattern file defines a pattern set. All patterns in the pattern set are evaluated to
+ find the candidates.
+ 
+ Each pattern has 2 main attributes:
+ its pattern type and its pattern rank. The pattern
+ type defines how the pattern is matched against the token stream.
+ The pattern rank defines how the pattern ranks against other patterns.
+ 
+ All the pattern types in 2.0 are prefixed with 'Simple'. This means that each pattern token
+ is matched using a plain byte or string comparison. No regex or other syntax is allowed in
+ Simple patterns. This allows the algorithm to use simple byte or string hashing for matching,
+ which gives performance and scaling complexity equal to a hashtable implementation.
+ An optional Simple``Hash``Count attribute hints to the classifier how many unique hashes
+ it would need to generate to support the pattern set.
+ 
+ Pattern attributes:
+ 
+  PatternId::
+  :: Type: string
+  :: Required.
+ 
+  RankType::
+  :: Type: string
+  :: Required.
+ 
+  RankValue::
+  :: Type: integer, -1000 to 1000
+  :: Optional. Default: 0.
+ 
+  PatternType::
+  :: Type: string
+  :: Required.
+ 
+  PatternTokens::
+  :: Type: list of pattern token strings
+  :: Required.
+ 
+ Pattern set attributes:
+ 
+  DefaultId::
+  :: Type: string
+  :: Optional. Default: none
+ 
+  SimpleHashCount::
+  :: Type: integer, greater than zero
+  :: Optional. Default: none. Must be defined before the pattern set.
+ 
+ 
+ == PatternType ==
+ 
+ The following pattern types are defined:
+ 
+  SimpleOrderedAnd::
+  :: Each pattern token must appear in the token stream in index order, as defined in the Pattern``Tokens list. It's okay for non-matched tokens to appear in between matched tokens as long as the matched tokens are still in order.
+ 
+  SimpleAnd::
+  :: Each pattern token must appear in the token stream. Order does not matter.
+ 
+  Simple::
+  :: Only one of the pattern tokens needs to appear in the token stream.
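
The three pattern types can be illustrated with a small Python sketch. It is an illustration under assumptions rather than the specified algorithm: the `matches` function is hypothetical, the token stream is held as a list instead of being hashed, and Simple is read as "at least one pattern token appears", which is consistent with pattern p3 in the examples below.

```python
from typing import List


def matches(pattern_type: str, pattern_tokens: List[str],
            stream: List[str]) -> bool:
    """Return True if the pattern tokens match the token stream."""
    if pattern_type == "Simple":
        # Any single pattern token in the stream is enough.
        return any(tok in stream for tok in pattern_tokens)
    if pattern_type == "SimpleAnd":
        # Every pattern token must be present; order is irrelevant.
        return all(tok in stream for tok in pattern_tokens)
    if pattern_type == "SimpleOrderedAnd":
        # Subsequence test: 'tok in it' consumes the iterator up to the
        # match, so later pattern tokens must appear later in the stream.
        it = iter(stream)
        return all(tok in it for tok in pattern_tokens)
    # Unsupported definitions must be rejected, per the Notes section.
    raise ValueError(f"unsupported PatternType: {pattern_type}")


stream = "one two three six five four seven".split()
print(matches("SimpleOrderedAnd", ["two", "four", "six"], stream))  # False
print(matches("SimpleAnd", ["two", "four", "six"], stream))         # True
```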
+ 
+ 
+ == RankType ==
+ 
+ The following rank types are defined:
+ 
+  Strong::
+  :: Strong patterns are ranked higher than Weak and None. The Rank``Value is ignored; Strong patterns are ranked by the position of their last matched token in the token stream. The lower the position, the higher the rank. When a Strong pattern is found, the pattern matching step can stop and return this pattern without analyzing the rest of the stream, because it is impossible for another pattern to rank higher.
+ 
+  Weak::
+  :: Weak patterns are ranked below Strong but above None. A Weak candidate can only be returned in the absence of a Strong candidate. Weak candidates always rank higher than None candidates, regardless of their Rank``Value. The Rank``Value is used to rank Weak patterns against each other.
+ 
+  None::
+  :: None patterns are ranked below Strong and Weak. A None candidate can only be returned in the absence of Strong and Weak candidates. The Rank``Value is used to rank None patterns against each other.
+ 
+ If 2 or more candidates tie on Rank``Type and Rank``Value,
+ the candidate with the longest concatenated matched pattern length is used. If that also
+ results in a tie, the candidate whose matched token was found first is returned.
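
One way to picture the ranking and tie-breaking rules is as a sort key. The sketch below is an assumption-laden illustration: the `Candidate` fields and the `best` function are hypothetical names, a larger Rank``Value is assumed to rank higher (the text above does not state the direction), and ties between Strong candidates are assumed to fall through to the same length and first-match tie-breakers.

```python
from dataclasses import dataclass
from typing import List, Optional

# Strong outranks Weak, which outranks None.
TYPE_ORDER = {"Strong": 2, "Weak": 1, "None": 0}


@dataclass
class Candidate:
    pattern_id: str
    rank_type: str
    rank_value: int
    matched_length: int  # total length of the matched pattern tokens
    first_pos: int       # stream index of the first matched token
    last_pos: int        # stream index of the last matched token


def sort_key(c: Candidate):
    if c.rank_type == "Strong":
        # RankValue is ignored; a lower last matched position ranks higher.
        primary = -c.last_pos
    else:
        # Assumption: a larger RankValue ranks higher.
        primary = c.rank_value
    # Tie-breakers: longest matched length, then earliest first match.
    return (TYPE_ORDER[c.rank_type], primary, c.matched_length, -c.first_pos)


def best(candidates: List[Candidate],
         default_id: Optional[str] = None) -> Optional[str]:
    """Return the highest ranking PatternId, or the default if none matched."""
    if not candidates:
        return default_id
    return max(candidates, key=sort_key).pattern_id


weak = Candidate("p2", "Weak", 100, 10, 1, 5)
none = Candidate("p3", "None", 1000, 3, 1, 1)
print(best([none, weak]))  # p2
```

A real classifier could return as soon as a Strong pattern matches instead of collecting every candidate first; the full sort is used here only to make the ordering explicit.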
+ 
+ 
+ === Notes ===
+ 
+ If no candidate patterns are found, the Default``Id is returned. If no
+ Default``Id is defined, a null pattern is returned.
+ 
+ 2 or more patterns may share the same Pattern``Id. These patterns function completely
+ independently of each other.
+ 
+ 2 or more patterns must not have identical Rank``Type, Rank``Value, and pattern tokens. This
+ results in undefined behavior when the patterns are candidates, since they have identical rank.
+ The classifier is free to choose any one candidate in this situation.
+ 
+ New pattern types and ranks can be introduced in future specifications. If a classifier
+ encounters a definition it cannot support, it must immediately return an initialization error.
+ 
+ 
+ === Examples ===
+ 
+ {{{
+ Pattern:
+   PatternId: p1
+   RankType: Strong
+   PatternType: Simple
+   PatternTokens: bingo, jackpot
+ 
+ Pattern:
+   PatternId: p2
+   RankType: Weak
+   RankValue: 100
+   PatternType: SimpleOrderedAnd
+   PatternTokens: two, four, six
+ 
+ Pattern:
+   PatternId: p3
+   RankType: None
+   RankValue: 1000
+   PatternType: Simple
+   PatternTokens: two, four, six
+ 
+ Token stream: one, two, three, four, five, six, seven
+ Pattern: p2
+ 
+ Token stream: one, two, three, six, five, four, seven
+ Pattern: p3
+ 
+ Token stream: one, two, three, four, five, six, bingo, seven
+ Pattern: p1
+ }}}
+ 
+ 
+ 
+ = Attribute Retrieval =
+ 
+ This step processes the result of the Pattern Matching step. The Pattern``Id is used
+ to look up the corresponding attribute map. The Pattern``Id and the attribute map
+ are returned.
+ 
+ 
+ === Attribute Parsing ===
+ 
+ An attribute map can contain attribute values which are parsed out of the input string.
+ This is done by configuring the attribute as a set of transformers. The attribute can also
+ have a default value, which is used if the transformers return an error.
+ 
+ 
+ === Notes ===
+ 
+ If no attribute map is found, an empty map is used.
+ 
+ If a null pattern is returned from the previous step, this must be properly returned to
+ the user.
+ A null pattern must be discernible from a user defined pattern.
+ 
+ 
+ 
+ = Transformers =
+ 
+ Transformers accept a string, apply an action, and then return a string.
+ If multiple transformers are defined in a set, they are chained: the output of
+ each transformer becomes the input of the next.
+ Transformers are used in the input parsing phase and the attribute retrieval phase.
+ 
+ Transformers can cause errors. Errors in input parsing are fatal: input parsing
+ is immediately stopped and an error is returned to the user. Errors in attribute retrieval
+ are not fatal. The error is written to [attribute]_error and the attribute is set to the
+ default value, if configured, or a blank value. [attribute]_error is a reserved attribute name.
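
The non-fatal error handling for attribute retrieval might look like the following Python sketch. The names `TransformError` and `parse_attribute` are hypothetical, and the digits-only check in `is_number` is an assumption, since this specification does not define what counts as a number.

```python
import re
from typing import Callable, Dict, List


class TransformError(Exception):
    """Raised by a transformer that cannot process its input."""


def is_number(text: str) -> str:
    # Assumption: a number is a non-empty run of ASCII digits.
    if not re.fullmatch(r"[0-9]+", text):
        raise TransformError(f"not a number: {text!r}")
    return text


def parse_attribute(name: str, input_string: str,
                    transformers: List[Callable[[str], str]],
                    default: str = "") -> Dict[str, str]:
    """Run the transformer chain; on error fall back to the default."""
    try:
        value = input_string
        for transform in transformers:
            value = transform(value)
        return {name: value}
    except TransformError as exc:
        # The error goes to the reserved [attribute]_error name.
        return {name: default, f"{name}_error": str(exc)}


print(parse_attribute("age", "47", [is_number]))
print(parse_attribute("age", "forty", [is_number], default="0"))
```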
+ 
+ The following transformer functions are supported:
+ 
+  Lowercase::
+  :: Description: converts the input to all lowercase
+  :: Return: the input in lowercase
+ 
+  Uppercase::
+  :: Description: converts the input to all uppercase
+  :: Return: the input in uppercase
+ 
+  ReplaceFirst::
+  :: Description: replace the first occurrence of a string with another string
+  :: Parameter - find: the substring to replace
+  :: Parameter - replaceWith: the string to replace 'find' with
+  :: Return: the string with the replacement made
+ 
+  ReplaceAll::
+  :: Description: replace all occurrences of a string with another string
+  :: Parameter - find: the substring to replace
+  :: Parameter - replaceWith: the string to replace 'find' with
+  :: Return: the string with the replacements made
+ 
+  Substring::
+  :: Description: return a substring of the input
+  :: Parameter - start: the starting index, 0 based
+  :: Parameter - maxLength: optional. If defined, the maximum number of characters to return.
+  :: Return: the specified substring
+  :: Error: if 'start' is out of bounds
+ 
+  SplitAndGet::
+  :: Description: split the input and return a part of the split
+  :: Parameter - delimiter: the delimiter to use for splitting. If not found, the entire string is part 0. Empty parts are ignored.
+  :: Parameter - get: the part of the split string to return, 0 based index. -1 is the last part.
+  :: Return: the specified part of the split string
+  :: Error: if the 'get' index does not exist
+ 
+  IsNumber::
+  :: Description: checks if the input is a number
+  :: Return: the input string
+  :: Error: if the input is not a number
+ 
+ 
+ === Notes ===
+ 
+ New transformers can be introduced in future specifications. If a classifier encounters
+ a definition it cannot support, it must immediately return an initialization error.
+ 
+ 
+ === Examples ===
+ 
+ {{{
+ Input string: 'aaa bbb 123 ccc'
+ 
+ Transformers:
+ 
+ SplitAndGet(delimiter: 'ccc', get: 0)
+ Result: 'aaa bbb 123 '
+ 
+ SplitAndGet(delimiter: ' ', get: -1)
+ Result: '123'
+ 
+ IsNumber()
+ Result: '123'
+ }}}
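
The chained example above can be reproduced with a short Python sketch. As before, this is a hedged illustration: `TransformError`, `split_and_get`, `is_number`, and `apply_all` are hypothetical names, negative `get` values other than -1 are assumed to be errors, and a number is assumed to be an optionally signed decimal.

```python
import re
from typing import Callable, List


class TransformError(Exception):
    """Raised when a transformer cannot process its input."""


def split_and_get(text: str, delimiter: str, get: int) -> str:
    # Empty parts are ignored; if the delimiter is not found, the
    # entire string is part 0.
    parts = [p for p in text.split(delimiter) if p]
    index = len(parts) - 1 if get == -1 else get
    if index < 0 or index >= len(parts):
        raise TransformError(f"part {get} does not exist")
    return parts[index]


def is_number(text: str) -> str:
    # Assumption: optional sign, digits, optional fractional part.
    if not re.fullmatch(r"[+-]?[0-9]+(\.[0-9]+)?", text):
        raise TransformError(f"not a number: {text!r}")
    return text


def apply_all(text: str, transformers: List[Callable[[str], str]]) -> str:
    """Feed the output of each transformer into the next one."""
    for transform in transformers:
        text = transform(text)
    return text


result = apply_all("aaa bbb 123 ccc", [
    lambda s: split_and_get(s, "ccc", 0),   # 'aaa bbb 123 '
    lambda s: split_and_get(s, " ", -1),    # '123'
    is_number,                              # '123'
])
print(result)  # 123
```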
+ 
+ 
+ 
+ = Patch Files =
+ 
+ The pattern and attribute files can be patched with a user created pattern and
+ attribute file. In this case, the patch's parsing configuration overrides the original,
+ its pattern set is appended to the original pattern set (existing patterns can be
+ overridden using pattern ranking), and its attributes override the original attributes
+ with the same Pattern``Id.
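
The patching rules can be sketched as a merge of two dictionaries. Since the JSON format is still undefined here, everything about the layout is an assumption: the keys `parsing`, `patterns`, and `attributes` and the function `apply_patch` are hypothetical, and a patched Pattern``Id is assumed to replace the whole attribute map rather than merge into it.

```python
from typing import Any, Dict


def apply_patch(base: Dict[str, Any], patch: Dict[str, Any]) -> Dict[str, Any]:
    """Combine a base file with a user created patch file."""
    return {
        # Parsing configuration: patch values override base values.
        "parsing": {**base.get("parsing", {}), **patch.get("parsing", {})},
        # Pattern sets: patch patterns are appended to the base patterns.
        "patterns": base.get("patterns", []) + patch.get("patterns", []),
        # Attributes: keyed by PatternId, patch entries replace base entries.
        "attributes": {**base.get("attributes", {}),
                       **patch.get("attributes", {})},
    }


base = {
    "parsing": {"NgramConcatSize": 1, "TokenSeparators": [" "]},
    "patterns": [{"PatternId": "p1"}],
    "attributes": {"p1": {"vendor": "a"}},
}
patch = {
    "parsing": {"NgramConcatSize": 2},
    "patterns": [{"PatternId": "p9"}],
    "attributes": {"p1": {"vendor": "b"}},
}
merged = apply_patch(base, patch)
print(merged["parsing"]["NgramConcatSize"])  # 2
```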
+ 
+ 
+ 
- === Format ===
+ = Format =
  
  The pattern and attribute files are JSON objects. These files will contain:
  
-  * Format version
   * Specification version
   * Type (pattern, attribute, etc)
   * Domain name
@@ -67, +356 @@

   * Description
   * Publish date
  
- The files will also contain the attributes defined below in this specification.
+ TODO: define the JSON and example 
  
- TODO: define the 1.0 JSON format spec
- 
- 
- 
- = Input Parsing =
- 
- This step parses the input string and creates the token stream.
- 
- Each pattern file defines these input parsing rules:
- 
-  InputTransformers::
-  :: Type: list of transformers
-  :: Optional. Default: none
- 
-  TokenSeparators::
-  :: Type: list of token seperator strings
-  :: Optional. Default: none
- 
-  NgramConcatSize::
-  :: Type: integer, greater than zero
-  :: Optional. Default: 1
- 
- The input string first gets processed thru the transformers.
- Then it gets tokenized using the configured seperators. Then ngram
- concatenation happens. The final result of these 3 steps is the token stream.
- 
- 
- === Notes ===
- 
- Empty tokens are removed from the tokenization step.
- 
- When a token is added to the token stream, it can be processed by the
- pattern matching step before moving on to the next token. This algorithm is pipeline
- and thread safe.
- 
- If the Ngram``Concat``Size is greater than 1, ngrams must be added to the token stream ordered
largest to smallest.
- 
- 
- === Example ===
- 
- {{{
- InputTransformers: Lowercase(), ReplaceAll(find: '-', replaceWith: '')
- TokenSeparators:   [space]
- NgramConcatSize:   2
- 
- Input string:  'A 12 x-yZ'
- 
- Transform:     'a 12 xyz'
- 
- Tokenization:  a, 12, xyz
- 
- Ngram:         a12, a, 12xyz, 12, xyz
- }}}
- 
- 
- 
- = Pattern Matching =
- 
- This step processes the token stream and returns the highest ranking candidate pattern.
- 
- The pattern file defines a pattern set. All patterns in the pattern set are evaluated to
find the candidates.
- 
- Each pattern has 2 main attributes,
- its pattern type and its pattern rank. The pattern
- type defines how the pattern is supposed to be matched against the token stream.
- The pattern rank defines how the pattern ranks against other patterns.
- 
- All the pattern types in 2.0 are prefixed with 'Simple'. This means that each pattern token
is matched
- using a plain byte or string comparison. No regex or other syntax is allowed in Simple patterns.
- This allows the algorithm to use simple byte or string hashing for matching. This gives
maximum performance and scaling complexity equal to a hashtable implementation. A Simple``Hash``Count
attribute can be optionally defined which hints the classifier as to how many unique hashes
it would need to generate to support the pattern set.
- 
- Pattern attributes:
- 
-  PatternId::
-  :: Type: string
-  :: Required.
- 
-  RankType::
-  :: Type: string
-  :: Required.
- 
-  RankValue::
-  :: Type: integer, -1000 to 1000
-  :: Optional. Default: 0.
- 
-  PatternType::
-  :: Type: string
-  :: Required.
- 
-  PatternTokens::
-  :: Type: list of pattern token strings
-  :: Required.
- 
- Pattern set attributes:
- 
-  DefaultId::
-  :: Type: string
-  :: Optional. Default: none
- 
-  SimpleHashCount::
-  :: Type: integer, greater than zero
-  :: Optional. Default: none. Must be defined before the pattern set.
- 
- 
- == PatternType ==
- 
- The following pattern types are defined:
- 
-  SimpleOrderedAnd::
-  :: Each pattern token must appear in the token stream in index order, as defined in the
Pattern``Tokens list. Its okay for non matched tokens to appear inbetween matched tokens as
long as the matched tokens are still in order.
- 
-  SimpleAnd::
-  :: Each pattern token must appear in the token stream. Order does not matter.
- 
-  Simple::
-  :: Only one pattern token must appear in the token stream.
- 
- 
- == RankType ==
- 
- The following rank types are defined:
- 
-  Strong::
-  :: Strong patterns are ranked higher than Weak and None. The Rank``Value is ignored and
they are ranked by their position in the pattern stream. Specifically, the last matched token
position. The lower the position, the higher the rank. When a Strong pattern is found, the
pattern matching step can stop and this pattern can be returned without analyzing the rest
of the stream. This is because its impossible for another pattern to rank higher.
- 
-  Weak::
-  :: Weak patterns are ranked below Strong but above None. A Weak candidate can only be returned
in the absence of a Strong candidate. Weak candidates always rank higher than None candidates,
regardless of their Rank``Value. The Rank``Value is used to rank between other Weak patterns.
- 
-  None::
-  :: None patterns are ranked below Strong and Weak. A None candidate can only be returned
in the absence of Strong and Weak candidates. The Rank``Value is used to rank between other
None patterns.
- 
- In the case where 2 or more candidates have the same Rank``Type and Rank``Value resulting
in a tie,
- the candidate with the longest concatenated matched pattern length is used. If that results
in
- another tie, the candidate with the first matched token found is returned.
- 
- 
- === Notes ===
- 
- If no candidate patterns are found, the Default``Id is returned. If no
- Default``Id is defined, a null pattern is returned.
- 
- 2 or more patterns may share the same Pattern``Id. These patterns function completely independent
of each other.
- 
- 2 or more patterns cannot have identical Rank``Type, Rank``Value, and pattern tokens. This
results in undefined behavior when the patterns are candidates since they have identical rank.
The classifier is free to choose any one candidate in this situation.
- 
- New pattern types and ranks can be introduced in future specifications. If a classifier
encounters a definition it cannot support, it must immediately return an initialization error.
- 
- 
- === Examples ===
- 
- {{{
- Pattern:
-   PatternId: p1
-   RankType: Strong
-   PatternType: Simple
-   PatternTokens: bingo, jackpot
- 
- Pattern:
-   PatternId: p2
-   RankType: Weak
-   RankValue: 100
-   PatternType: SimpleOrderedAnd
-   PatternTokens: two, four, six
- 
- Pattern:
-   PatternId: p3
-   RankType: None
-   RankValue: 1000
-   PatternType: Simple
-   PatternTokens: two, four, six
- 
- Token stream: one, two, three, four, five, six, seven
- Pattern: p2
- 
- Token stream: one, two, three, six, five, four, seven
- Pattern: p3
- 
- Token stream: one, two, three, four, five, six, bingo, seven
- Pattern: p1
- }}}
- 
- 
- 
- = Attribute Retrieval =
- 
- This step processes the result of the Pattern Matching step. The Pattern``Id is used
- to look up the corresponding attribute map. The Pattern``Id and the attribute map
- are returned.
- 
- 
- === Attribute Parsing ===
- 
- An attribute map can contain attributes values which are parsed out of the input string.
- This is done by configuring the attribute as a set of transformers. The attribute can also
- have a default value if the transformers return an error.
- 
- 
- === Notes ===
- 
- If no attribute map is found, an empty map is used.
- 
- If a null pattern is returned from the previous step, this must be properly returned to
the user.
- A null pattern must be discernible from a user defined pattern.
- 
- 
- 
- = Transformers =
- 
- Transformers accept a string, apply an action, and then return a string.
- If multiple transformers are defined in a set, the outputs and inputs are
- linked together.
- Transformers are used in the input parsing phase and the attribute retrieval phase.
- 
- Transformers can cause errors. Errors in input parsing are fatal, input parsing
- is immediately stopped and an error is returned to the user. Errors in attribute retrieval
- are okay. The error is written to [attribute]_error and the attribute is set to the default
value,
- if configured, or a blank value. [attribute]_error is a reserved attribute name.
- 
- The following transformer functions are supported:
- 
-  Lowercase::
-  :: Description: converts the input to all lowercase
-  :: Return: the input in lowercase
- 
-  Uppercase::
-  :: Description: converts the input to all uppercase
-  :: Return: the input in uppercase
- 
-  ReplaceFirst::
-  :: Description: replace the first occurrence of a string with another string
-  :: Parameter - find: the substring to replace
-  :: Parameter - replaceWith: the string to replace 'find' with
-  :: Return: the string with the replacement made
- 
-  ReplaceAll::
-  :: Description: replace all occurrences of a string with another string
-  :: Parameter - find: the substring to replace
-  :: Parameter - replaceWith: the string to replace 'find' with
-  :: Return: the string with the replacements made
- 
-  Substring::
-  :: Description: return a substring of the input
-  :: Parameter - start: the starting index, 0 based
-  :: Parameter - maxLength: optional. If defined, the maximum amount of characters to return.
-  :: Return: the specified substring
-  :: Error: if 'start' is out of bounds
- 
-  SplitAndGet::
-  :: Description: split the input and return a part of the split
-  :: Parameter - delimiter: the delimiter to use for splitting. If not found, the entire
string is part 0. Empty parts are ignored.
-  :: Parameter - get: the part of the split string to return, 0 based index. -1 is the last
part.
-  :: Return: the specified part of the split string
-  :: Error: if the 'get' index does not exist
- 
-  IsNumber::
-  :: Description: checks if the input is a number
-  :: Return: the input string
-  :: Error: if the input is not a number
- 
- 
- === Notes ===
- 
- New transformers can be introduced in future specifications. If a classifier encounters
a definition it cannot support, it must immediately return an initialization error.
- 
- 
- === Examples ===
- 
- {{{
- Input string: 'I am 47 years old.'
- 
- Transformers:
- 
- SplitAndGet(delimiter: 'years old', get: 0)
- Result: 'I am 47 '
- 
- SplitAndGet(delimiter: ' ', get: -1)
- Result: '47'
- 
- IsNumber()
- Result: '47'
- }}}
- 
- 
- 
- = Patch Files =
- 
- The pattern and attribute files can be patched with a user created pattern and
- attribute file. In this case, parsing configurations override, the pattern sets get appended
(you can override using pattern ranking), and attributes
- override using the Pattern``Id.
- 
