lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Flexible indexing
Date Wed, 14 Mar 2007 02:41:14 GMT

On Mar 12, 2007, at 5:08 PM, Grant Ingersoll wrote:

> I can see having storage at:
> Index
> Document/Field  //already exists
> Token

I hadn't thought of it that way, as a logical extension outwards at  
all levels.

If I understand you correctly, it's a clever point, but the thing is,  
it's cake for someone to add arbitrary index-level data on their own,  
just by adding their own file.  We'd have to come up with and support  
an infrastructure for handling this kind of data, and whatever we  
invented would be unlikely to suit all needs.  Ergo, I think it makes  
sense for us to focus on the Token and Document/Field levels.

I think we can do much better with regard to opening up Document/Field
retrieval.  Under global field semantics, the fieldbits byte is
no longer needed.  Go one step beyond that, and change the field
number to a field name string, and documents can be handled as
monolithic blobs when merging segments.  Document storage becomes
simply a combination of fixed-width storage and (optional)
variable-width storage, and the possibilities for subclassing break wide
open.  Extended thoughts below.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Begin forwarded message:
From: Marvin Humphrey <marvin@rectangular.com>
Date: February 26, 2007 1:26:00 PM PST
To: KinoSearch discussion forum <kinosearch@rectangular.com>
Subject: [KinoSearch] Subclassing DocWriter/DocReader
Reply-To: KinoSearch discussion forum <kinosearch@rectangular.com>

Greets,

The file format changes in the new KS have opened up possibilities  
for subclassing DocWriter/DocReader, the classes responsible for  
storage/retrieval of serialized documents.

Here are some potential features that subclasses could implement:

   * storage of arbitrary data (e.g. arrayref values)
   * different field values for display and searching
   * complete document recovery
   * arbitrary compression algo choice
   * lazy loading
   * optimized external document storage (e.g. in SQL DB)

Anything else?  The more ideas we dream up now and consider how to  
support, the better the design will be.

Right now, there are two files, _XXX.ds and _XXX.dsx, with .ds being  
"document storage", and .dsx being "document storage index".  .ds is  
a stack of variable width records -- serialized documents -- stored  
end to end.  .dsx is a stack of fixed width records: 64-bit pointers  
into the variable-width .ds file.  (For a more extensive explanation,
see <http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Docs/FileFormat.html>)
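
To make the two-file layout concrete, here is a minimal sketch that builds both "files" as in-memory strings and looks a document up through the fixed-width index.  The big-endian byte order, the fetch_serialized helper, and the trick of using the next pointer to find a record's end are assumptions for illustration, not details taken from the KS file format.  The 'Q>' pack template needs a Perl 5.10+ built with 64-bit integer support.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for the .ds and .dsx files.
my @docs = ( 'first doc', 'second, longer doc', 'third' );
my ( $ds, $dsx ) = ( '', '' );

for my $serialized (@docs) {
    # Record where this document starts in .ds as a 64-bit big-endian
    # integer (byte order is an assumption), then append the document.
    $dsx .= pack( 'Q>', length($ds) );
    $ds  .= $serialized;
}

# Fetch document $doc_num.  Its start pointer sits at a fixed offset in
# .dsx; its end is the next document's pointer, or the end of .ds for
# the last document.
sub fetch_serialized {
    my $doc_num = shift;
    my $start   = unpack( 'Q>', substr( $dsx, $doc_num * 8, 8 ) );
    my $next    = ( $doc_num + 1 ) * 8;
    my $end     = $next < length($dsx)
        ? unpack( 'Q>', substr( $dsx, $next, 8 ) )
        : length($ds);
    return substr( $ds, $start, $end - $start );
}

print fetch_serialized(1), "\n";    # prints "second, longer doc"
```

Because every .dsx record has the same width, finding document N is a single multiplication and one read, no matter how large the variable-width file grows.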

The fixed width file, I intend to monkey with myself, because I'm  
going to start storing document boost as a 32-bit float within it.  
(That's what's driving this development track -- I need a place to  
put these doc boosts.)

My thinking is, why not add more than that?  So long as the  
additional data is fixed width, we can still index into the .dsx file  
quickly.
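
As a rough illustration of why extra fixed-width data keeps lookups cheap, the sketch below widens each index record to hold a 64-bit pointer plus a 32-bit float boost.  The record layout, byte order, and the read_record helper are all assumptions made up for this example; the only point is that the record for document N still lives at N times the record size.  The 'Q>' template needs a Perl 5.10+ built with 64-bit integer support.

```perl
use strict;
use warnings;

# Hypothetical record layout: 64-bit pointer, then 32-bit float boost,
# both big-endian (an assumption, not the actual KS layout).
my $TEMPLATE    = 'Q> f>';
my $RECORD_SIZE = length pack( $TEMPLATE, 0, 0 );    # 12 bytes

# Build an in-memory stand-in for the index: one record per document.
my $dsx = '';
$dsx .= pack( $TEMPLATE, @$_ ) for ( [ 0, 1.0 ], [ 120, 2.5 ], [ 431, 0.5 ] );

# Read the record for document $doc_num by seeking to a computed offset.
sub read_record {
    my $doc_num = shift;
    my $record  = substr( $dsx, $doc_num * $RECORD_SIZE, $RECORD_SIZE );
    my ( $pointer, $boost ) = unpack( $TEMPLATE, $record );
    return ( $pointer, $boost );
}

my ( $pointer, $boost ) = read_record(1);
print "pointer=$pointer boost=$boost\n";    # prints "pointer=120 boost=2.5"
```

Adding another fixed-width field only changes the template and the record size; the offset arithmetic stays the same.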

The variable width .ds file is up for grabs.  Right now, docs are  
serialized using a scheme derived from Lucene which isn't really  
optimal for KS and doesn't need to be as complicated as it is.  So  
long as we can recover a hash from the serialized data, we're fine.

Rough sketch example subclasses implementing storage of arbitrary  
data and external storage in a DB are below.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

#--------------------------------------------------------------------

package ArbitraryDataDocWriter;
use base qw( KinoSearch::Index::DocWriter );
use Storable qw( nfreeze );

sub store_doc {
     my ( $self, $doc ) = @_;
     my %ret_hash = ( var_width_data => nfreeze($doc) );
     return \%ret_hash;
}

package ArbitraryDataDocReader;
use base qw( KinoSearch::Index::DocReader );
use Storable qw( thaw );

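sub fetch_doc {
     my ( $self, %args ) = @_;
     my $serialized;
     # read_var_width fills $serialized through the reference passed in.
     $self->read_var_width( \$serialized, $args{var_width_bytes} );
     # $serialized is a plain scalar, not a reference, so thaw it directly.
     return thaw($serialized);
}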


#--------------------------------------------------------------------

package DBDocWriter;
use base qw( KinoSearch::Index::DocWriter );
use DBI;

sub fixed_width_data_size { 8 }

sub store_doc {
     my ( $self, $doc ) = @_;
     $self->store_in_db($doc);
     my %ret_hash = ( fixed_width_data => $doc->{primary_key} );
     return \%ret_hash;
}

package DBDocReader;
use base qw( KinoSearch::Index::DocReader );
use DBI;

sub fixed_width_data_size { 8 }

sub fetch_doc {
     my ( $self, %args ) = @_;
     return $self->fetch_from_db( $args{fixed_width_data} );
}




