jakarta-regexp-dev mailing list archives

From "Eric Layton" <ELay...@novell.com>
Subject [BUG + PATCH] \w does not match "_"
Date Fri, 11 May 2001 17:03:09 GMT
The Javadocs for regexp 1.2 state that \w matches a word character, i.e. an alphanumeric
character or "_".  However, the implementation does not behave this way, as seen by running
the following:

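try {
    RE reTest = new RE("\\w");
    System.out.println(reTest.match("a"));  // letter
    System.out.println(reTest.match("1"));  // digit
    System.out.println(reTest.match("!"));  // punctuation
    System.out.println(reTest.match("_"));  // underscore
} catch (Exception e) {
    e.printStackTrace();  // report a bad pattern instead of swallowing it
}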


This block of code outputs the following:

true
true
false
false


Notice that the final match of "\w" on "_" fails.  Similarly, word-boundary matching is
incorrect: "\b" reports a boundary between "reg" and "_", even though both sides should
count as word characters:

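try {
    RE reTest = new RE("reg\\b");
    System.out.println(reTest.match("reg exp"));  // boundary before the space
    System.out.println(reTest.match("reg_exp"));  // no boundary expected before "_"
} catch (Exception e) {
    e.printStackTrace();  // report a bad pattern instead of swallowing it
}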


Displays:

true
true
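
For reference, the documented semantics can be checked against a different regex engine. The sketch below uses java.util.regex from the JDK, which is not the regexp package under discussion; it only illustrates the behavior the Javadocs describe. The class name and the found() helper are hypothetical, and find() is used on the assumption that RE.match() searches anywhere in the input.

```java
import java.util.regex.Pattern;

// Reference check only: java.util.regex is a separate library from Jakarta Regexp.
class ExpectedWordSemantics {
    // Returns true if the pattern is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("\\w", "_"));           // true: "_" is a word character
        System.out.println(found("reg\\b", "reg exp"));  // true: boundary before the space
        System.out.println(found("reg\\b", "reg_exp"));  // false: no boundary before "_"
    }
}
```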


Attached is a patch to RE.java that treats "_" as an alphanumeric character.  With this
patch, the final result in each of the two RE examples above is flipped:  "\w" matched
against "_" returns true, and "reg\b" matched against "reg_exp" returns false.  I'd like to
suggest that this patch be integrated into the next release of regexp.
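
Taken in isolation, the rule the patch applies can be sketched roughly as follows. This is a simplified, standalone illustration and not code from RE.java: the class and method names are hypothetical, and the start of the string stands in for the match start that the real code obtains from getParenStart(0).

```java
// Sketch only: names are hypothetical and not part of RE.java.
class WordCharSketch {
    // A word character as the Javadocs describe it: letter, digit, or underscore.
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // True when position idx sits between a word and a non-word character.
    // Positions before the start and past the end are treated as '\n'.
    static boolean isWordBoundary(String s, int idx) {
        char last = (idx == 0) ? '\n' : s.charAt(idx - 1);
        char next = (idx >= s.length()) ? '\n' : s.charAt(idx);
        return isWordChar(last) != isWordChar(next);
    }

    public static void main(String[] args) {
        System.out.println(isWordChar('_'));               // true
        System.out.println(isWordBoundary("reg exp", 3));  // true
        System.out.println(isWordBoundary("reg_exp", 3));  // false
    }
}
```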


---------------------
% diff -ub RE.java.orig RE.java >patchfile.txt
% cat patchfile.txt
--- RE.java.orig	Fri May 11 09:17:00 2001
+++ RE.java	Fri May 11 10:39:29 2001
@@ -1048,7 +1048,9 @@
                             {
                                 char cLast = ((idx == getParenStart(0)) ? '\n' : search.charAt(idx - 1));
                                 char cNext = ((search.isEnd(idx)) ? '\n' : search.charAt(idx));
-                                if ((Character.isLetterOrDigit(cLast) == Character.isLetterOrDigit(cNext)) == (opdata == E_BOUND))
+                                boolean bLast = Character.isLetterOrDigit(cLast) || cLast == '_';
+                                boolean bNext = Character.isLetterOrDigit(cNext) || cNext == '_';
+                                if ((bLast == bNext) == (opdata == E_BOUND))
                                 {
                                     return -1;
                                 }
@@ -1074,7 +1076,8 @@
                             {
                                 case E_ALNUM:
                                 case E_NALNUM:
-                                    if (!(Character.isLetterOrDigit(search.charAt(idx)) == (opdata == E_ALNUM)))
+                                    char ch = search.charAt(idx);
+                                    if (!((Character.isLetterOrDigit(ch) || ch =='_') == (opdata == E_ALNUM)))
                                     {
                                         return -1;
                                     }
@@ -1178,7 +1181,8 @@
                         switch (opdata)
                         {
                             case POSIX_CLASS_ALNUM:
-                                if (!Character.isLetterOrDigit(search.charAt(idx)))
+                                char ch = search.charAt(idx);
+                                if (!(Character.isLetterOrDigit(ch) || ch == '_'))
                                 {
                                     return -1;
                                 }
---------------------

 -- Eric


