devicemap-commits mailing list archives

From re...@apache.org
Subject svn commit: r1650410 - /devicemap/branches/2.0/data/README_PATTERNS
Date Thu, 08 Jan 2015 22:47:50 GMT
Author: rezan
Date: Thu Jan  8 22:47:50 2015
New Revision: 1650410

URL: http://svn.apache.org/r1650410
Log:
transformers

Modified:
    devicemap/branches/2.0/data/README_PATTERNS

Modified: devicemap/branches/2.0/data/README_PATTERNS
URL: http://svn.apache.org/viewvc/devicemap/branches/2.0/data/README_PATTERNS?rev=1650410&r1=1650409&r2=1650410&view=diff
==============================================================================
--- devicemap/branches/2.0/data/README_PATTERNS (original)
+++ devicemap/branches/2.0/data/README_PATTERNS Thu Jan  8 22:47:50 2015
@@ -14,12 +14,13 @@ INPUT PARSING INTO PATTERN TOKENS
 Each pattern file has a header. It defines the following attributes
 which instruct the client how to parse the input:
 
+-Transformers: a set of regular expressions, TODO: define better
 -Token separators: a list of strings
 -N-gram size: an int
--Transformers: a set of regular expressions, TODO: define this better
 
-The input gets tokenized using the separators. It then gets n-gram'ed. The
-default n-gram size is 1. Each ngram is then passed thru optional transformers.
+The input gets transformed thru the transformers (optional). Then it gets
+tokenized using the separators. No blank tokens. It then gets n-gram'ed.
+The default n-gram size is 1.
 
 The output of this process is a stream of pattern tokens which are passed into
 the pattern matcher as they are processed. Patterns must be streamed in order.
@@ -27,13 +28,16 @@ If n-gram > 1 is configured, the largest
 moving onto the smaller ones.
 
 So for example, a domain can set its separator as a space, n-gram size of 2,
-and a lowercasing transformer expression. The following string:
+and a lowercase transformer and a number transformer: [0-9]+ => _NUM.
+The following string:
 
-A 12 xyZ
+Original: A 12 xyZ
 
-Produces the following pattern token stream:
+Post transform: a _NUM xyz
 
-a12, a, 12xyz, 12, xyz
+Tokens: a, _NUM, xyz
+
+Pattern token stream: a_NUM, a, _NUMxyz, _NUM, xyz
 
 ###
 PATTERN TOKEN MATCHING
@@ -104,6 +108,6 @@ a key value map along with the PatternId
 
 Also, at this point, we can have an optional post processing step. The attribute
 map can contain regex parsing rules which can be applied to the original string to
-extract detailed information. TODO: this needs to be defined better
+extract detailed information into new attributes. TODO: define better
 
 TODO: Null pattern needs to be defined
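
The parsing order described in the updated README (transform, then tokenize, then n-gram) can be illustrated with a minimal Python sketch. It is not the DeviceMap client code. The function names, the representation of transformers as plain callables, and the rule of emitting n-grams largest-first at each token position are assumptions, chosen so that the README's own example is reproduced. The lowercase transformer is assumed to run before the number transformer, since otherwise _NUM would itself be lowercased.

```python
import re

# Hypothetical transformers matching the README example:
# lowercase first, then replace digit runs with _NUM.
TRANSFORMERS = [
    str.lower,
    lambda s: re.sub(r"[0-9]+", "_NUM", s),
]


def transform(text, transformers):
    """Apply each transformer to the whole input, in order."""
    for fn in transformers:
        text = fn(text)
    return text


def tokenize(text, separators):
    """Split on any separator string and drop blank tokens."""
    if not separators:
        return [text] if text else []
    # Longest separators first so they win inside the alternation.
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "|".join(re.escape(s) for s in ordered)
    return [tok for tok in re.split(pattern, text) if tok]


def ngrams(tokens, n):
    """At each position, emit the largest n-gram that fits, then smaller ones."""
    stream = []
    for i in range(len(tokens)):
        for size in range(min(n, len(tokens) - i), 0, -1):
            stream.append("".join(tokens[i:i + size]))
    return stream


def pattern_tokens(text, transformers, separators, n=1):
    """Full pipeline: transform, tokenize, n-gram."""
    return ngrams(tokenize(transform(text, transformers), separators), n)


if __name__ == "__main__":
    print(pattern_tokens("A 12 xyZ", TRANSFORMERS, [" "], 2))
```

With a space separator and an n-gram size of 2, the input "A 12 xyZ" yields a_NUM, a, _NUMxyz, _NUM, xyz, which matches the pattern token stream given in the README. How n-grams should be ordered for sizes above 2 is not shown in the README, so that part of the sketch is a guess.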


