incubator-lucy-dev mailing list archives

From: Michael McCandless <>
Subject: Re: Sort cache file format
Date: Sat, 11 Apr 2009 15:16:28 GMT
On Fri, Apr 10, 2009 at 9:51 PM, Marvin Humphrey <> wrote:

>> >> So for Lucy you'll move away from KS's required "compound file
>> >> format", and allow either compound or not.
>> >
>> > I think so.  It would be nice not to foreclose on the option of a
>> > discrete file format.  In some sense discrete files are actually
>> > easier to troubleshoot because you can inspect the individual binary
>> > files directly.  And of course, that extra file copy at index time
>> > goes away.
>> I'm still torn on how crucial "live" transparency is, vs running a
>> tool (CheckIndex, Luke) to see things, but yes it's a plus.
> It can be pretty powerful stuff.  Case in point: today, a colleague and I were
> examining a snapshot_NNN.json file, and we realized that we could implement
> index data truncation just by deleting all the segment-related files from the
> "entries" array.  First, we hacked it using ordinary Perl JSON editing tools,
> and now we're contemplating how to make it work for real in public KS module
> code.
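
Just to make sure I follow, the hack is roughly this?  (Filenames and
the JSON structure are guessed from your description; "seg_4"/"seg_5"
are made up.)

    use JSON qw( decode_json encode_json );
    use File::Slurp qw( read_file write_file );

    # Load the snapshot, drop every entry belonging to the doomed
    # segments, and write it back.
    my $snapshot = decode_json( scalar read_file('snapshot_NNN.json') );
    $snapshot->{entries}
        = [ grep { !m{^seg_[45]/} } @{ $snapshot->{entries} } ];
    write_file( 'snapshot_NNN.json', encode_json($snapshot) );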

But this is a supremely expert use case.  It's like me getting a
scalpel and attempting to improve how my stomach works, or something.

> I think it's really, really important to approach the Lucy file format
> specification as a public API design task.  Not so much because we expect
> ordinary users to look at it, but because something which is simple and easy
> to grok will stimulate and facilitate innovation by expert users.

I agree, encouraging experimentation/adoption by curious experts is a
good reason for transparency.  Community growth...

> The index directory structure should be as transparent as possible:
>  * Segments should be housed in individual directories, to reinforce that
>    each segment is an independent, coherent entity on its own.
>  * Segment data files should be given meaningful names.
>  * As previously discussed and provisionally resolved, snapshot and segment
>    metadata should be human-readable and easily accessible.

This is great.
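
So, concretely, something like this?  (All file names below are
invented, just to picture it.)

    snapshot_NNN.json
    seg_1/
        segmeta.json
        lexicon.dat
        postings.dat
    seg_2/
        segmeta.json
        ...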

> But then, let's consider discrete vs. compound with regards to transparency.
> When we're talking about discrete segment files, we're only talking about
> binary data -- because the metadata is all in segmeta.json.  Those binary
> files are hard to examine without a tool anyway -- hexdumping is hard core. :)
> So, transparency-wise, perhaps not so much is gained by going discrete.

You can list their sizes, and see at a glance which files are present.

>> But: I've often wondered whether that extra copy gains performance.
> Mmm.  On a fragmented disk, I suppose it might.  And search-time performance
> is more important than index-time performance.

I think on a "somewhat" fragmented disk, it could gain performance.
On a heavily fragmented disk you're just doomed.  On a barely used
drive, no matter what you do, it's great (I think).

And yes, search performance trumps indexing when the two conflict.

> I guess I'm now leaning towards requiring the compound file after all.  Are
> there any other criteria we should consider?
> I guess another thing is that creating lots of little files is reportedly
> expensive on Windows.

You should retest to confirm.  Since you are putting each segment in a
subdir, this may not matter.
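
A quick-and-dirty way to retest, as an untested sketch (file count and
sizes are arbitrary):

    use Time::HiRes qw( time );

    # Time creating many tiny files in one subdir, to see whether
    # per-file overhead still hurts on Windows.
    mkdir 'seg_test' or die "mkdir: $!";
    my $start = time();
    for my $i ( 0 .. 999 ) {
        open( my $fh, '>', "seg_test/file_$i.dat" ) or die "open: $!";
        print {$fh} 'x' x 1024;
        close $fh or die "close: $!";
    }
    printf "1000 small files: %.3f sec\n", time() - $start;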

> FWIW, in the KS prototype, compound files are handled at the Store level,
> rather than the index level.  So e.g. a HadoopFolder implementation wouldn't
> perform the extra file copy.

In Lucene they really are at the store level (CompoundFileReader
extends Directory), but live in the index package for some reason.

>> I suppose we could do a similar trick to the discrete files too.
> The only way we could do that would be to rewrite them all at the end of the
> indexing session.  I'd want to see benchmarking data proving that actually did
> something before committing.


> I guess my first priority is implementing mmap'd sort caches, and I consider
> that so important that I want to make sure there are no design compromises
> made to sort caching because we've decided that we want column-stride fields
> to do double duty as both field storage and sort caches.  The field storage
> use case is much less important, IMO.

OK, I agree.

>> E.g. maybe the app stores price & manuf only in CSF and not in the
>> "normal" stored fields... the additional cost to retrieve stored fields
>> / excerpts for each of 10 hits on the page can be quite high.
> FWIW...  I've already implemented a ByteBufDocReader proof-of-concept class in
> pure Perl; instead of serializing all fields marked as "stored", it stores one
> fixed-width byte array per document -- so doc storage is essentially a
> flatfile.  I'm also pretty close to finishing a ZlibDocReader that uses Zlib
> compression. (The "compressed" field spec flag has been removed.)
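
Neat.  If I'm reading that right, fetching a doc from the flatfile is
just an offset computation -- something like this sketch (names and
the 64-byte width are invented):

    my $RECORD_WIDTH = 64;    # made-up fixed width, bytes per doc

    sub fetch_doc_bytes {
        my ( $fh, $doc_num ) = @_;
        seek( $fh, $doc_num * $RECORD_WIDTH, 0 ) or die "seek: $!";
        defined read( $fh, my $buf, $RECORD_WIDTH ) or die "read: $!";
        return $buf;
    }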

How can doc storage be fixed width?  (Text fields have different
lengths.)

So you removed "compressed" from FieldSpec, and instead the user swaps
out the DocReader component?  I wonder how compression compares if you
did column-stride body text vs row-stride body text plus all other
fields.

> Lucy's plugin API will allow people to add their own column-stride field
> storage if they so choose.  If we make the sort cache writer pluggable as
> well, then someone could make something that performs double duty, if that's
> important to them.
> To my mind, it's more important to expose a plugin API that makes this kind of
> thing possible than it is to provide support for column-stride fields in core
> Lucy.


>> Well if the terms are "largely" unique (e.g. a "title" field), you lose
>> by treating them as enumerated (as Lucene does today) since the extra
>> deref only adds waste.
> That's true, but it's not important for sorted search.  The vast majority of
> comparisons will occur within the segment using only the ords.  We only need
> to retrieve the values for resolving inter-segment comparisons, and the extra
> cost won't matter too much there because it scales with the number of segments
> rather than the number of documents in the index.
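
Right -- so the comparator is presumably something like this?  (Sketch,
all names invented; fetch_value() is hypothetical.)

    sub compare_hits {
        my ( $hit_a, $hit_b ) = @_;
        if ( $hit_a->{seg} == $hit_b->{seg} ) {
            # Same segment: ords alone decide the order.
            return $hit_a->{ord} <=> $hit_b->{ord};
        }
        # Different segments: fall back to the real values.  This
        # path scales with segment count, not doc count.
        return $hit_a->{reader}->fetch_value( $hit_a->{ord} )
            cmp $hit_b->{reader}->fetch_value( $hit_b->{ord} );
    }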


> Are there other uses for a field cache where the extra deref would
> meaningfully impede performance?

Simple retrieval, but that's low on the totem pole.

>> But then, if you are going to sort by that field, you need the ord
>> [deref] anyway, so... yeah maybe we only implement "enumerated"
>> values, at least for the sort cache.
>> If I just want to retrieve values, and the values are mostly unique,
>> it's more efficient to inline the values.
>> And yes at indexing time we know precisely all stats on the field so
>> we could pick and choose between the "hey it's obviously enumerated"
>> vs the "hey most values are unique" cases, if we want to do both.
> My inclination is to implement the enumerated version for now and add code to
> support the "hey most values are unique" case if necessary later.

Sounds good.
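
For the record, here's roughly how I picture the enumerated build per
segment (sketch, invented names):

    # Each doc ends up storing only the ord of its value; the sorted
    # unique values become the ord -> value dictionary.
    sub build_sort_cache {
        my @doc_values = @_;    # one value per doc
        my %seen;
        my @uniq   = sort grep { !$seen{$_}++ } @doc_values;
        my %ord_of = map { ( $uniq[$_] => $_ ) } 0 .. $#uniq;
        my @ords   = map { $ord_of{$_} } @doc_values;
        return ( \@ords, \@uniq );
    }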

