# lucene-java-user mailing list archives

##### Site index · List index
Message view
Top
Subject RE: contains
Date Mon, 22 Jul 2002 19:08:29 GMT
```On Tue, 16 Jul 2002, Lothar Simon wrote:

> Just to correct a few points:
> - The factor would be 2 * (average no of chars per word)/2 = (average
> no of chars per word).

Actually, I made a mistake in my earlier analysis, but your factor is also
inaccurate.  Ignoring other factors and overhead, storing just the set of
strings S = {s_1, s_2, ..., s_n} with corresponding lengths L = {l_1, l_2,
..., l_n} requires

(sum_i l_i) [sum of all lengths in L]

= (mean(L) * n) space
[where mean(L) = the above sum divided by the size of L (n)].

Storing all prefixes of each string requires

(sum_i sum_j j) [where j varies from 1 to l_i]

= (sum_i (l_i(l_i + 1))/2)

> 1/2(sum_i (l_i^2)).

= (sum_i (l_i^2)) for both prefixes and suffixes

!= (l_i * sum_i l_i) [because you can't take the extra factor of l_i
outside of the sum]

The mistake that you and I both made is this: the sum of the squares of
the mean lengths is not the same as the mean length squared times n.

E.g.: L = {2,3,4,6,7,8} [mean 5]
sum_i l_i = 30
2 * (sum_i sum_j j) = 262; 262/30 = 8.73 [correct multiplicative factor]
2 * mean(L)^2 = 180; 180/30 = 6 [this is *wrong*]
and the original hypothesis was that the additional factor should have
been the mean, 5.

Some additional calculations suggest that if you assume an exponential
distribution of word lengths (cf. Zipf's Law) that basing your guess on a
mean word length will cause you to underestimate by a factor of something
like 17%.

This is all just a fancy way of demonstrating that having long strings
hurts you more than having short ones helps you, i.e., using a mean value
in place of the sum is inaccurate in general.

> - One would probably create a set of 2 * (maximum number of chars per word)
> as Fields for a document. If this could work was actually my question...
> - Most important: my proposal is exactly (and almost only) designed to solve
> the substring ("*uti*") problem !!! One field in the first group of fields
> in my example contains "utiful" and would be found by "uti*", a field in the
> other group of fields contains "itueb" and would be found by "itu*". Voila!

You are correct; I goofed.

> I still think my idea would work (given you spend the space for the index).

I still don't see how you deal with the problem that I mentioned before:

'Another problem with this is that in order to be able to get from "ful"
to "beautiful", you have to store, in the index entry for "ful", (pointers
to) every single complete word in your document set that contains "ful" as
a prefix or suffix.  Just _creating_ such an index would be extremely
time-consuming even with clever data structures, and consider how much
extra storage for pointers would be necessary for entries like "e" or
"n".'

In any case, I personally would consider the expected overhead of space to
be prohibitive.  However, so long as you address the remaining issue I
just mentioned, yes, I think that your scheme would work.

Regards,