Thanks for the reply.
Wes Garland wrote:
1. There is no APR equivalent for free, as it is neither
needed nor desired. Simply allocate your memory from a pool, and
destroy the pool when it is no longer needed. I would suggest making a
subpool on RE create and bury it in an opaque pointer describing your
RE, if you're actually going to go whole-hog on this. Me? I use the
OS regexec/regcomp (search only) and register an apr_pool_cleanup
handler to avoid leaking memory.
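For anyone following along, here is a minimal sketch of the OS-regex "search only" approach described above, in plain C with POSIX `regcomp`/`regexec`. The function name `regex_search` is hypothetical and there is no APR in the block. In an APR module I'd assume the `regfree()` call moves into a cleanup callback registered with `apr_pool_cleanup_register()`, so the compiled regex is released when the pool is destroyed.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Search `subject` for the first match of the POSIX extended regex
 * `pattern`. Returns 1 and fills *start / *end (byte offsets) on a
 * match, 0 on no match, -1 if the pattern fails to compile.
 * `regex_search` is a hypothetical helper name, not an APR or OS API. */
int regex_search(const char *pattern, const char *subject,
                 size_t *start, size_t *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && subject != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&re, subject, 1, m, 0);
    if (rc == 0) {
        if (start) *start = (size_t)m[0].rm_so;
        if (end)   *end   = (size_t)m[0].rm_eo;
    }

    /* Assumption: in pool-based code this is the call a registered
     * cleanup handler would make instead of freeing inline here. */
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Compiling on every call is only for brevity; a pre-compiled `regex_t` kept for the life of the server would skip the `regcomp` cost on each search.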
I'm creating a series of pre-compiled/analyzed regexes at
server startup, and doing a lot of S&R during processing. I do
create a dedicated pool for this; however, I can never destroy it: the
pre-compiled expressions are stored there and should stay there until
server shutdown. And the PCRE documentation states that I should use
one memory allocation function before first use. I will try to use
one pool for the regex creation and another for the search
part, and see if that works.
For those interested, I traced the issue to UTF-8 handling: the PCRE_UTF8
flag significantly slows down the searches. Not all my regexes need
to have UTF-8 enabled, only those dealing with embedded strings, so I
shaved a lot of time off by being more selective.
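Roughly what "being more selective" looks like, as a sketch only: each rule carries a marker saying whether it deals with embedded strings, and only those get the UTF-8 option. The struct, the field names, and `MY_UTF8_FLAG` are all hypothetical stand-ins so the block compiles without `pcre.h`. In real code the value would be `PCRE_UTF8`, passed as the options argument to `pcre_compile()`.

```c
#include <assert.h>
#include <stddef.h>

/* HYPOTHETICAL stand-in so this sketch builds without <pcre.h>.
 * Real code would use PCRE_UTF8 here. */
#define MY_UTF8_FLAG 0x1

/* One search-and-replace rule. `handles_text` marks the rules that
 * deal with embedded strings and therefore need UTF-8 mode. */
struct rule {
    const char *pattern;
    int handles_text;
};

/* Return the compile options for a rule: the UTF-8 flag only where
 * it is needed, 0 everywhere else. */
int compile_options_for(const struct rule *r)
{
    assert(r != NULL);
    return r->handles_text ? MY_UTF8_FLAG : 0;
}
```

Which rules really need the flag depends on the patterns and on the data; marking them by hand, as here, is just the simplest way to express it.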
2. Personally, I would never roll my own search and replace except
under exceptional circumstances. That said, your approach doesn't sound
unreasonable, but it's difficult to say what your problem is without
profiling the code and looking at memory consumption. Start by
consulting the literature, S&R is a well-understood problem; and
maybe google some stuff on ropes, they may serve you better than
Here's a paper on ropes which discusses concatenation,
which *should* be where you're spending your search and replace time: www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol25/issue12/spe986.pdf
Will read, thanks! But with UTF-8 out of the way,
output = apr_array_pstrcat ( subpool, strip_arr, 0 );
works perfectly fine and fast.
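For readers without APR at hand, the concatenation step amounts to something like the following plain-C sketch: measure all the pieces, allocate once, then copy. This is not APR's code. `join_pieces` is a hypothetical name, and it uses `malloc` where the APR call allocates from the pool. The third argument of `apr_array_pstrcat` is a separator character, and passing 0 as above means no separator, which is the case sketched here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Concatenate n strings into one malloc'd, NUL-terminated buffer.
 * Two passes: sum the lengths, then allocate once and copy.
 * NULL entries are skipped. The caller frees the result.
 * Returns NULL only if the allocation fails. */
char *join_pieces(const char *const *pieces, size_t n)
{
    size_t total = 0, i;
    char *out, *p;

    assert(pieces != NULL || n == 0);

    for (i = 0; i < n; i++)
        if (pieces[i])
            total += strlen(pieces[i]);

    out = malloc(total + 1);
    if (!out)
        return NULL;

    p = out;
    for (i = 0; i < n; i++) {
        if (pieces[i]) {
            size_t len = strlen(pieces[i]);
            memcpy(p, pieces[i], len);
            p += len;
        }
    }
    *p = '\0';
    return out;
}
```

A single allocation sized up front avoids the repeated reallocation and copying that makes naive append-in-a-loop concatenation slow.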
Note - if your S&R is regexp instead of strcmp, you could also be
spending most of your time in the regex state machine. Profile!
I guess I now have to deal with my UTF-8 issues... ugh. I wonder if
UTF-16 would be faster, as most chars are 2 bytes long (surrogate pairs
take 4). I'll also try memcached to cache the results so I don't have
to do the same processing on every request.