lucy-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-user] input 47 too high
Date Mon, 04 Mar 2013 23:36:52 GMT
On Mon, Mar 4, 2013 at 7:11 AM, Thomas den Braber <thomas@delos.nl> wrote:
> On Mon, Mar 4, 2013 at 03:17 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
>>
>> This is a bug, which the following patch should address:
>>
>> --- a/core/Lucy/Index/IndexManager.c
>> +++ b/core/Lucy/Index/IndexManager.c
>> @@ -122,7 +122,7 @@ static uint32_t
>>  S_fibonacci(uint32_t n) {
>>      uint32_t result = 0;
>>      if (n > 46) {
>> -        THROW(ERR, "input %u32 too high", n);
>> +        return UINT32_MAX;
>>      }
>>      else if (n < 2) {
>>          result = n;
>
> I am trying to understand this code and the bug.

Thanks for the feedback.  It is useful for us to know which parts of the
codebase are easy to grok and which parts are not.  Ideally, everything would
be simple and transparent.

> Does this mean that after 47 commits without an index merge this warning is
> shown?  What was the original idea behind this exception handling ("input
> %u32 too high"), was it for testing only ?

I wrote that line.  The goal is to avoid overflowing a 32-bit integer.  I
didn't think too hard about the failure case because I didn't expect it to
trigger under normal circumstances -- and when I don't want to think too hard,
my habit is to insert an exception so that we fail noisily rather than
silently.

In this algorithm, we have an array of SegReader objects sorted by
Doc_Max(), and we're trying to figure out which ones to recycle.  The goal is
to end up with a list of segments whose sizes roughly approximate the
fibonacci series.

Here's the relevant block from IndexManager.c:

    // Find sparsely populated segments.
    for (uint32_t i = 0; i < num_candidates; i++) {
        uint32_t num_segs_when_done = num_candidates - threshold + 1;
        total_docs += I32Arr_Get(doc_counts, i);
        if (total_docs < S_fibonacci(num_segs_when_done + 5)) {
            threshold = i + 1;
        }
    }

Now I've had to think about the failure case. :)  Here's the reasoning behind
the patch:

The `threshold` variable -- an array index -- starts at 0, so
`num_segs_when_done` starts high.  As `threshold` grows, the number passed
through S_fibonacci() drops.  It's fine if we clip S_fibonacci() early in the
loop because `total_docs` will never approach `UINT32_MAX`[1] and we really
only care what happens once `num_segs_when_done` drops to a reasonable number,
late in the loop.  Until then, we'll continue to accumulate small segments
that we want to recycle.

Marvin Humphrey

[1] Doc ids are signed 32-bit integers, so even if we ignore practical
    performance considerations, we can't exceed INT32_MAX -- and UINT32_MAX is
    out of reach.

Mime
View raw message