jakarta-oro-dev mailing list archives

From "Bob Dickinson \(BSL\)" <...@brutesquadlabs.com>
Subject ORO performance tuning efforts
Date Sun, 17 Nov 2002 06:39:22 GMT
I spent the last couple of (long) days profiling and tuning ORO.  My
general impression is that the core ORO engine is pretty tight already.  It
was obviously designed with performance in mind.  I did find a couple of
potential speed-ups or areas that could use some attention.

I ran variations of the same tests in a few different ways.  I have a
RegexPerf application that I run from JBuilder7 that runs a 3k message body
through each of 583 regular expressions 1000 times.  I also ran that app
under OptimizeIt 5.0.  Additional testing was done using a specialized
configuration of our own app (Iocaine) reading 1,000 email messages from an
inbound queue, parsing them, applying the same set of 583 regexes (to
various parts of the message), and writing the messages to an outbound
queue.  The Iocaine configuration was run standalone and from within
OptimizeIt.  These were all done on a single proc dev machine (1.8GHz, 512MB
machine with JDK 1.4.1).  I did another set of tests with our full app
sending mail through it on our benchmark server (dual proc, fast, 1GB, 15k
rpm SCSI with JDK 1.4.1).  The Iocaine tests (both dev and benchmark) were
performed with a variety of processor thread count settings (the "processor"
is where our regex work happens, among other things).

Part of the reason that I had such a variety of test environments is that
tuning Java code that is already tight is a pretty touchy business.  For
example, I saw some spikes when using OptimizeIt that looked like a method
not being inlined by the JVM, so I manually inlined it, only to find that
when running standalone there was no difference (that is, the JVM did inline
it when not running under the profiler).  I found the same kind of
discrepancy between running the RegexPerf app from JBuilder and running it
standalone.
So my general approach was to find spikes using the profiler, or my own
instrumentation, fix them, observe an improvement, then test in the
standalone environments to make sure the gain was real.  Lather, rinse,
repeat.  And repeat.  And repeat.
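
The standalone check described above only means something if the JIT has
had a chance to compile the hot paths before timing starts.  A minimal
sketch of such a harness follows.  The class name StandaloneTimer, the
warm-up and run counts, and the use of System.nanoTime() (which arrived in
Java 5, so it is not available on the JDK 1.4.1 used for these tests) are
assumptions for illustration and are not part of RegexPerf.

```java
final class StandaloneTimer {
    /**
     * Runs the task warmupRuns times without timing it, then times it
     * timedRuns times and returns the fastest run in nanoseconds.
     */
    static long bestOfNanos(Runnable task, int warmupRuns, int timedRuns) {
        if (timedRuns < 1) {
            throw new IllegalArgumentException("timedRuns must be >= 1");
        }
        // Warm-up: give the JIT a chance to compile and inline hot methods.
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            task.run();
            long elapsed = System.nanoTime() - start;
            if (elapsed < best) {
                best = elapsed;
            }
        }
        return best;
    }
}
```

Taking the best of several runs is one way to reduce noise from GC pauses
and other processes; a median would be an equally reasonable choice.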

The single biggest thing that I found was the performance of the _toLower
method in Perl5Matcher.  There is a big comment there indicating that this
needs to be refactored, presumably for performance.  For case insensitive
operations, _toLower makes a copy of the input char[] and converts the copy
to lower case.  The array copy is reasonably fast, but calling
Character.isUpperCase( ) on each character takes a significant amount of
time.  It depended a lot on the test, but for case insensitive operations
_toLower was consistently 20% to 40% of the total time.  The array copy was
3-4% of this, and Character.isUpperCase( ) was most of the balance.  As per
the comment that is already there, this method really just needs to go away.
I think there are solid performance-related reasons to make that happen
sooner rather than later.
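
To make the cost concrete, here is a minimal sketch of the
copy-then-lowercase pattern described above, next to one possible
alternative that folds case only at the point of comparison.  This is not
ORO's actual code: the class CaseFoldSketch and both method names are
hypothetical, and whether the second form is actually faster would need
the same standalone measurement as everything else.

```java
final class CaseFoldSketch {
    /** The pattern described above: copy the input, then lowercase the copy. */
    static char[] toLowerCopy(char[] input) {
        char[] copy = new char[input.length];
        System.arraycopy(input, 0, copy, 0, input.length);
        for (int i = 0; i < copy.length; i++) {
            // One Character.isUpperCase call per input character.
            if (Character.isUpperCase(copy[i])) {
                copy[i] = Character.toLowerCase(copy[i]);
            }
        }
        return copy;
    }

    /**
     * Alternative: no copy.  Each input character is folded only when it is
     * compared against a literal that is already lower case.
     */
    static boolean regionMatchesIgnoreCase(char[] input, int offset,
                                           char[] lowerLiteral) {
        if (offset < 0 || offset + lowerLiteral.length > input.length) {
            return false;
        }
        for (int i = 0; i < lowerLiteral.length; i++) {
            if (Character.toLowerCase(input[offset + i]) != lowerLiteral[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The second form touches only the characters the matcher actually examines,
where the first always walks the whole input up front.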

On a related note, there was some discussion on the list about updating ORO
to use CharSequence for input, in part to eliminate the toCharArray( ) that
is done when using String inputs, which is presumably pretty common.  (Note:
The toCharArray( ) was an additional 3-4% overhead before I changed our
tests to use char[] inputs).  As has been said, with modern JVMs there is no
longer a performance benefit to using the char[].  I would be a big fan of
going to CharSequence exclusively.  It seems like a reasonably
straightforward change for someone who knows the code.  The big question is
probably whether a change that made ORO require JDK 1.4 was cool.  We
finally decided not to support 1.3 a month or so ago because there were
enough compelling things in 1.4 for us, but I don't know how the ORO
community feels about doing this.  A side benefit is that using CharSequence
will reduce the number of match/contains entrypoints significantly and I
think will completely eliminate the need for the PatternMatcherInput (though
I could be wrong about that).
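
As an illustration of how a CharSequence parameter collapses several
overloads into one, here is a minimal sketch.  The class
CharSequenceEntryPoint and its indexOf method are hypothetical stand-ins
and not part of the ORO API; the point is only that String, StringBuffer,
and a wrapped char[] can all reach the same method without a
toCharArray( ) copy.

```java
import java.nio.CharBuffer;

final class CharSequenceEntryPoint {
    /**
     * Hypothetical single entry point: returns the index of the first
     * occurrence of target in input, or -1 if it does not occur.
     */
    static int indexOf(CharSequence input, char target) {
        for (int i = 0, n = input.length(); i < n; i++) {
            if (input.charAt(i) == target) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        char[] raw = {'o', 'r', 'o'};
        // One method serves all three input kinds; CharBuffer.wrap does not
        // copy the underlying array.
        System.out.println(indexOf("jakarta", 'k'));
        System.out.println(indexOf(new StringBuffer("jakarta"), 'k'));
        System.out.println(indexOf(CharBuffer.wrap(raw), 'r'));
    }
}
```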

I found one other change that produced significantly improved performance in
my tests.  It seems that the Character static methods, which are called a
LOT, are either slower than they look like they would be, or are not being
inlined, or both.  I added an optional static table which precomputes the
character properties across each dimension for each possible Unicode
character.  I wrote the method that tests for a character type in a way that
was inline-friendly.  If the table is not initialized, then overall
performance decreases by 1% to 2%.  If the table is initialized, it will use
128k bytes and will take a little startup time (not that much really), but
yield 10% to 20% better performance.  For us, using the static table is a no
brainer.  But I have no idea how attractive this change would be to the ORO
community at large.  I wrote it in a way that doesn't hurt (and may even
help) maintainability/readability.  I have included the code at the end of
this message.

The one other thing that I found was that when using high thread counts (10
or more), there was significant time spent on Perl5Repetition object
allocation (in both places where these objects are created).  I'm doing
583,000 regexes in about a minute, so a lot of these objects get allocated.
As with anything in Java that needs to be fast, object creation can hurt
(the fact that ORO generally doesn't use objects during matching is a big
reason it beats Sun, IMO).  I'm not sure if there is a way to effectively
inline the stuff in the Perl5Repetition, or eliminate/reduce its use in
some other way, but it's definitely something to think about for apps like
ours that do a lot of regex on multiple threads.
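
One option that could be evaluated is recycling these small records instead
of allocating a new one each time.  The sketch below is only an
illustration of that idea: the Frame class is a hypothetical stand-in (its
fields are not Perl5Repetition's), FramePool is not part of ORO, and it
assumes each thread uses its own matcher and therefore its own pool, so no
synchronization is needed.  Whether pooling beats plain allocation on a
given JVM would have to be measured.

```java
import java.util.ArrayDeque;

final class FramePool {
    /** Hypothetical stand-in for a small per-match bookkeeping record. */
    static final class Frame {
        int min, max, count;

        void reset() {
            min = 0;
            max = 0;
            count = 0;
        }
    }

    // Free list owned by a single thread; deliberately not synchronized.
    private final ArrayDeque<Frame> free = new ArrayDeque<Frame>();
    private int allocated;

    /** Returns a recycled Frame if one is available, else allocates one. */
    Frame acquire() {
        Frame f = free.poll();
        if (f == null) {
            f = new Frame();
            allocated++;
        }
        return f;
    }

    /** Clears the Frame and makes it available for reuse. */
    void release(Frame f) {
        f.reset();
        free.push(f);
    }

    /** Number of Frame objects actually created by this pool. */
    int allocatedCount() {
        return allocated;
    }
}
```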

I'm not entirely sure how to move forward from here.  Comments appreciated.

Bob Dickinson
Brute Squad Labs, Inc.

----------------------------------------------------------------

The code for the charType table (Perl5Matcher.java) follows:

  // The following are bitflags used to indicate character properties
  //
  private static final short CHAR_UPPER       = 0x0001;
  private static final short CHAR_LOWER       = 0x0002;
  private static final short CHAR_WORDCHAR    = 0x0004;
  private static final short CHAR_LETTER      = 0x0008;
  private static final short CHAR_DIGIT       = 0x0010;
  private static final short CHAR_LETTERDIGIT = 0x0020;
  private static final short CHAR_SPACE       = 0x0040;
  private static final short CHAR_WHITESPACE  = 0x0080;
  private static final short CHAR_ISOCONTROL  = 0x0100;
  private static final short CHAR_PUNCTUATION = 0x0200;
  private static final short CHAR_XDIGIT      = 0x0400;
  private static final short CHAR_ASCII       = 0x0800;
  private static final short CHAR_HASLOWER    = 0x1000;

  // The charType table is indexed by char value.  If initialized,
  // it will contain a set of CHAR_* bitflags (see above) representing
  // the character properties of each Unicode character.
  //
  private static short[] charType;

  /**
   * Determine whether a given character is of the specified type, using the
   * charType table if initialized, otherwise computing the type.
   *
   * @param ch the character to test
   * @param type a CHAR_ bit flag representing the type being tested
   * @return true if the character is of the specified type, otherwise false.
   */
  private static boolean charIsType(char ch, short type)
  {
      if (charType != null)
      {
          return (charType[ch] & type) == type;
      }
      else
      {
          return charIsComputedType(ch, type);
      }
  }

  /**
   * Determine whether a given character is of the specified type by
   * computing the type.
   *
   * @param ch the character to test
   * @param type a CHAR_ bit flag representing the type being tested
   * @return true if the character is of the specified type, otherwise false.
   */
  private static boolean charIsComputedType(char ch, short type)
  {
      switch (type)
      {
          case CHAR_UPPER:
              return Character.isUpperCase(ch);

          case CHAR_LOWER:
              return Character.isLowerCase(ch);

          case CHAR_WORDCHAR:
              return Character.isLetterOrDigit(ch) || ch == '_';

          case CHAR_LETTER:
              return Character.isLetter(ch);

          case CHAR_DIGIT:
              return Character.isDigit(ch);

          case CHAR_LETTERDIGIT:
              return Character.isLetterOrDigit(ch);

          case CHAR_SPACE:
              return Character.isSpaceChar(ch);

          case CHAR_WHITESPACE:
              return Character.isWhitespace(ch);

          case CHAR_ISOCONTROL:
              return Character.isISOControl(ch);

          case CHAR_PUNCTUATION:
              switch (Character.getType(ch))
              {
                  case Character.DASH_PUNCTUATION:
                  case Character.START_PUNCTUATION:
                  case Character.END_PUNCTUATION:
                  case Character.CONNECTOR_PUNCTUATION:
                  case Character.OTHER_PUNCTUATION:
                      return true;
                  default:
                      return false;
              }

          case CHAR_XDIGIT:
              return (ch >= '0' && ch <= '9')
                  || (ch >= 'a' && ch <= 'f')
                  || (ch >= 'A' && ch <= 'F');

          case CHAR_ASCII:
              return ch < 0x80;

          case CHAR_HASLOWER:
              return Character.toLowerCase(ch) != ch;

          default:
              throw new IllegalArgumentException("Unknown type: " + type);
      }
  }

  /**
   * Initialize the static charType table of character properties to provide
   * optimized access to character properties.  Initializing the charType
   * table is optional.  If initialized, the table uses 128k of memory, but
   * can improve overall performance, depending on use, in the range of
   * 10% to 25%.
   */
  public static void initCharTypeTable()
  {
      Perl5Matcher.charType =
          new short[Character.MAX_VALUE - Character.MIN_VALUE + 1];

      char ch = Character.MIN_VALUE;
      do
      {
          if (charIsComputedType(ch, CHAR_UPPER))
              charType[ch] |= CHAR_UPPER;
          if (charIsComputedType(ch, CHAR_LOWER))
              charType[ch] |= CHAR_LOWER;
          if (charIsComputedType(ch, CHAR_WORDCHAR))
              charType[ch] |= CHAR_WORDCHAR;
          if (charIsComputedType(ch, CHAR_LETTER))
              charType[ch] |= CHAR_LETTER;
          if (charIsComputedType(ch, CHAR_DIGIT))
              charType[ch] |= CHAR_DIGIT;
          if (charIsComputedType(ch, CHAR_LETTERDIGIT))
              charType[ch] |= CHAR_LETTERDIGIT;
          if (charIsComputedType(ch, CHAR_SPACE))
              charType[ch] |= CHAR_SPACE;
          if (charIsComputedType(ch, CHAR_WHITESPACE))
              charType[ch] |= CHAR_WHITESPACE;
          if (charIsComputedType(ch, CHAR_ISOCONTROL))
              charType[ch] |= CHAR_ISOCONTROL;
          if (charIsComputedType(ch, CHAR_PUNCTUATION))
              charType[ch] |= CHAR_PUNCTUATION;
          if (charIsComputedType(ch, CHAR_XDIGIT))
              charType[ch] |= CHAR_XDIGIT;
          if (charIsComputedType(ch, CHAR_ASCII))
              charType[ch] |= CHAR_ASCII;
          if (charIsComputedType(ch, CHAR_HASLOWER))
              charType[ch] |= CHAR_HASLOWER;
      }
      while (ch++ < Character.MAX_VALUE);
  }

Note: All places in Perl5Matcher that test character types/properties were
changed to use charIsType( ).
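
A quick way to gain confidence in a table like this is to check it against
the Character methods for every char value.  The following is a
self-contained sketch of that check using only two flags; the class
CharTableCheck and its method names are hypothetical and separate from the
Perl5Matcher code above.  It also shows where the 128k figure comes from:
65,536 entries of 2 bytes each.

```java
final class CharTableCheck {
    static final short UPPER = 0x0001;
    static final short DIGIT = 0x0002;

    /** Builds a table with one short of flags per possible char value. */
    static short[] buildTable() {
        // 65,536 entries x 2 bytes per short = 131,072 bytes = 128k.
        short[] table = new short[Character.MAX_VALUE + 1];
        // An int counter avoids the wrap-around a char counter would hit.
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            char ch = (char) c;
            if (Character.isUpperCase(ch)) table[c] |= UPPER;
            if (Character.isDigit(ch)) table[c] |= DIGIT;
        }
        return table;
    }

    /** Table lookup in the same style as charIsType. */
    static boolean isType(short[] table, char ch, short type) {
        return (table[ch] & type) == type;
    }

    /** Counts char values where the table disagrees with Character. */
    static int countMismatches(short[] table) {
        int mismatches = 0;
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            char ch = (char) c;
            if (isType(table, ch, UPPER) != Character.isUpperCase(ch)) {
                mismatches++;
            }
            if (isType(table, ch, DIGIT) != Character.isDigit(ch)) {
                mismatches++;
            }
        }
        return mismatches;
    }
}
```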