On Thu, Mar 25, 2010 at 1:20 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
> On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:
>> >> Also, will Lucy store the original stats?
>> >
>> > These?
>> >
>> > * Total number of tokens in the field.
>> > * Number of unique terms in the field.
>> > * Doc boost.
>> > * Field boost.
>>
>> Also sum(tf). Robert can generate more :)
>
> Hmm, aren't "Total number of tokens in the field" and sum(tf) normally
> equivalent? I guess there might be analyzers for which that isn't true, e.g.
> those which perform synonym-injection?
>
> In any case, "sum(tf)" is probably a better definition, because it makes no
> ancillary claims...
Sorry, yes they are.
>> > Incidentally, what are you planning to do about field boost if it's not always
>> > 1.0? Are you going to store full 32-bit floats?
>>
>> For starters, yes.
>
> OK, how are those going to be encoded? IEEE 754? Big-endian?
>
> http://en.wikipedia.org/wiki/Endianness#Floating-point_and_endianness
For starters, I think so. Lucene's ints are bigendian today.
>> We may (later) want to make a new attr that sets
>> the #bits (levels/precision) you want... then uses packed ints to
>> encode.
>
> I'm concerned that the bit-wise entropy of floats may make them a poor match
> for compression via packed ints. We'll probably get a compressed
> representation which is larger than the original.
>
> Are there any standard algorithms out there for compressing IEEE 754 floats?
> RLE works, but only with certain data patterns.
>
> ... [ time passes ] ...
>
> Hmm, maybe not:
>
> http://stackoverflow.com/questions/2238754/compression-algorithm-for-ieee-754-data
Sorry, I was proposing a fixed-point boost, where you specify how many
levels (in bits, powers of 2) you want.
>> I was specifically asking if Lucy will allow the user to force true
>> average to be recomputed, ie, at commit time from the writer.
>
> That's theoretically possible. We'd have to implement the reader the same way
> we have DeletionsReader -- the most recent segment may contain data which
> applies to older segments.
OK.
> Here's the DeletionsReader code, which searches backwards through the segments
> looking for a particular file:
>
> /* Start with deletions files in the most recently added segments and work
>  * backwards.  The first one we find which addresses our segment is the
>  * one we need. */
> for (i = VA_Get_Size(segments) - 1; i >= 0; i--) {
>     Segment *other_seg = (Segment*)VA_Fetch(segments, i);
>     Hash *metadata
>         = (Hash*)Seg_Fetch_Metadata_Str(other_seg, "deletions", 9);
>     if (metadata) {
>         Hash *files = (Hash*)CERTIFY(
>             Hash_Fetch_Str(metadata, "files", 5), HASH);
>         Hash *seg_files_data
>             = (Hash*)Hash_Fetch(files, (Obj*)my_seg_name);
>         if (seg_files_data) {
>             Obj *count = (Obj*)CERTIFY(
>                 Hash_Fetch_Str(seg_files_data, "count", 5), OBJ);
>             del_count = (i32_t)Obj_To_I64(count);
>             del_file  = (CharBuf*)CERTIFY(
>                 Hash_Fetch_Str(seg_files_data, "filename", 8), CHARBUF);
>             break;
>         }
>     }
> }
Hmm -- similar to tombstones?  But, different in that the most
recently written file has *all* deletions for that old segment? Ie
you don't have to OR together N generations of written
deletions... only 1 file has all current deletions for the segment?
This is somewhat wasteful of disk space though? Hmm unless your
deletion policy can reclaim the now-stale deletions files from past
flushed segments?
> What we'd do is write the regenerated boost bytes for *all* segments to the
> most recent segment. It would be roughly analogous to building up an NRT
> reader.
Right, except Lucy must go through the filesystem.
>> > What's trickier is that Schemas are not normally mutable, and that they are
>> > part of the index. You don't have to supply an Analyzer, or a Similarity, or
>> > anything else when opening a Searcher -- you just provide the location of the
>> > index, and the Schema gets deserialized from the latest schema_NNN.json file.
>> > That has many advantages, e.g. inadvertent Analyzer conflicts are pretty much
>> > a thing of the past for us.
>>
>> That's nice... though... is it too rigid? Do users even want to pick
>> a different analyzer at search time?
>
> It's not common.
>
> To my mind, the way a field is tokenized is part of its field definition, thus
> the Analyzer is part of the field definition, thus the analyzer is part of the
> schema and needs to be stored with the index.
OK.
> Still, we support different Analyzers at search time by way of QueryParser.
> QueryParser's constructor requires a Schema, but also accepts an optional
> Analyzer which if supplied will be used instead of the Analyzers from the
> Schema.
Ahh OK there's an out.
>> > Maybe aggressive automatic data-reduction makes more sense in the context of
>> > "flexible matching", which is more expansive than "flexible scoring"?
>>
>> I think so. Maybe it shouldn't be called a Similarity (which to me
>> (though, carrying a heavy curse of knowledge burden...) means
>> "scoring")? Matcher?
>
> Heh. "Matcher" is taken. It's a crucial class, too, roughly combining the
> roles of Lucene's Scorer and DocIDSetIterator.
>
> The first alternative that comes to mind is "Relevance", because not only can
> one thing's relevance to another be continuously variable (i.e. score), it can
> also be binary: relevant/not-relevant (i.e. match).
>
> But I don't see why "Relevance", "Matcher", or anything else would be so much
> better than "Similarity". I think this is your hang up. ;)
Yeah OK.
>> > I'm +0 (FWIW) on search-time Sim settability for Lucene. It's a nice feature,
>> > but I don't think we've worked out all the problems yet. If we can, I might
>> > switch to +1 (FWIW).
>>
>> What problems remain, for Lucene?
>
> Storage, formatting, and compression of boosts.
>
> I'm also concerned about making significant changes to the file format when
> you've indicated they're "for starters". IMO, file format changes ought to
> clear a higher bar than that. But I expect us to dissent on that point.
I think we do dissent on this... progress not perfection ;)
I see file format as an impl detail, not as a public API. It's free
to change, and because it's easy to version, changing it isn't that
bad.
Mike
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org