apr-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bing Swen" <bs...@pku.edu.cn>
Subject RE: apr_hash: Memory consumption
Date Tue, 15 Sep 2009 02:43:51 GMT
We'd love to see such a patch. we've been hampered by this problem for years. We use apr_hash_t
to look up several in-memory lexicons, each of them with over half a million words. The initial
slot array size expands from 2^n - 1, n = 4. That wastes too much as we need to put all the
hash lexicons in persistent pools. The easiest remedy we've been using is to add a new "API"
function like this:

APR_DECLARE(apr_hash_t *) apr_hash_make_ex(apr_pool_t *pool, unsigned int init_size)
{
    apr_hash_t *ht;
    ht = apr_palloc(pool, sizeof(apr_hash_t));
    ht->pool = pool;
    ht->free = NULL;
    ht->count = 0;
    ht->max = init_size;  // INITIAL_MAX;
    ht->array = alloc_array(ht, ht->max);
    ht->hash_func = apr_hashfunc_default;
    return ht;
}

which resembles the other two APR functions:

	APR_DECLARE(apr_table_t *) apr_table_make(apr_pool_t *p, int nelts);
	APR_DECLARE(apr_array_header_t *) apr_array_make(apr_pool_t *p, int nelts, int elt_size);

Also, we'd like very much to see the possiblity whether apr_hash_t could be extended to make
use of the "move-to-front" technique that is presented by the following papers:

	"In-memory hash tables for accumulating text vocabularies", J. Zobel, S. Heinz, and H.E.
Williams, Information Processing Letters, 80(6):271-277, December 2001. (http://www.cs.rmit.edu.au/~jz/fulltext/ipl01.pdf)
 
	"Cache-conscious collision resolution in string hash tables", N. Askitis and J. Zobel, Proceedings
of the SPIRE String Processing and Information Retrieval Symposium, November 2005, pp. 91-102.
LNCS 3772.  (http://www.cs.rmit.edu.au/~jz/fulltext/spire05.pdf)

Cheers,
Bing Swen



-----Original Message-----
From: Neil Conway [mailto:nrc@cs.berkeley.edu] 
Sent: Tuesday, September 15, 2009 3:55 AM
To: dev@apr.apache.org
Subject: apr_hash: Memory consumption

I notice that the implementation of expand_array() in
tables/apr_hash.c allocates a new bucket array, without making any
attempt to release the memory allocated for the previous bucket array.
That wastes memory: if the hash table grows exponentially, this
strategy wastes O(n) extra memory.

Is there a rational for the current behavior? If not, perhaps the
easiest fix is to allocate the bucket array in a subpool, and then
create a new subpool and destroy the old one in expand_array(). If a
patch to do this would be accepted, let me know and I'll submit one.

Neil


Mime
View raw message