lucene-dev mailing list archives

From Robert Kirchgessner <>
Subject Re: Lucene Index backboned by DB
Date Wed, 16 Nov 2005 00:53:46 GMT

A discussion in

might be of interest to you.

Have you thought about storing the large pieces of the documents
in a database to reduce the size of the Lucene index?

I think there are good reasons to add support for
storing fields in separate files:

1. One could define a binary field of fixed length, store it
in a separate file, and then load that file into memory for fast
access to the field contents (see the packed-date sketch further below).

A use case might be: store a calendar date (YYYY-MM-DD)
in three bytes, using 4 bits for the month, 5 bits for the day,
and up to 15 bits for the year. If you want to retrieve hits sorted
by date, you can load a fields file of (3 * number of documents in
the index) bytes and support sorting by date without ever accessing
the hard drive to read dates.

2. One could store the document contents in a separate
file, and store small fields like the title and some metadata
the way they are stored now. That could speed up access to
those fields. It would be interesting to know whether you gain
significant performance by leaving the big chunks out, i.e.
not storing them in the index (a sketch follows this list).
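
For case 2, here is a minimal sketch of how the split could look from
the application side, using the Lucene 1.4 field API. The JDBC URL and
the contents table are made up for illustration: the ID and the small
fields go to Lucene, the large body goes to the database keyed by the
same ID, and the body is indexed but deliberately not stored.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SplitStorage {
    public static void main(String[] args) throws Exception {
        IndexWriter writer =
            new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        // Hypothetical database holding the large bodies, keyed by id.
        Connection db = DriverManager.getConnection("jdbc:postgresql:docs");

        String id = "doc-1";
        String body = "... the large document content ...";

        Document doc = new Document();
        doc.add(Field.Keyword("id", id));           // stored, not tokenized
        doc.add(Field.Text("title", "Some title")); // stored and tokenized
        doc.add(Field.UnStored("body", body));      // indexed only, NOT stored
        writer.addDocument(doc);
        writer.close();

        // The big chunk lives outside the Lucene index.
        PreparedStatement ps = db.prepareStatement(
            "INSERT INTO contents (id, body) VALUES (?, ?)");
        ps.setString(1, id);
        ps.setString(2, body);
        ps.executeUpdate();
        db.close();
    }
}

At search time you read only the small stored fields from the index and
fetch the body from the database for the few hits you actually display,
so the bulk of the stored content never has to live in the Lucene files.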

In my opinion, case 1 is the most interesting: storing some
binary fields (dates, prices, lengths, any numeric metrics of
documents) would enable *really* fast sorting of hits, as the
sketch below illustrates.
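
To make case 1 concrete, here is a minimal sketch of the packing
arithmetic, assuming the hypothetical separate fields file has simply
been read into one byte[] with three bytes per document (no such file
format exists in Lucene today):

public class PackedDates {
    // Pack a date into 24 bits: 15 bits year, 4 bits month, 5 bits day.
    static int pack(int year, int month, int day) {
        return (year << 9) | (month << 5) | day;
    }

    static int year(int packed)  { return packed >>> 9; }
    static int month(int packed) { return (packed >>> 5) & 0xF; }
    static int day(int packed)   { return packed & 0x1F; }

    // One contiguous array, 3 bytes per document, indexed by doc number.
    // In the proposal this would be read once from the fields file.
    static int dateOf(byte[] fieldsFile, int docNum) {
        int off = docNum * 3;
        return ((fieldsFile[off] & 0xFF) << 16)
             | ((fieldsFile[off + 1] & 0xFF) << 8)
             |  (fieldsFile[off + 2] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] fieldsFile = new byte[2 * 3]; // two documents
        int[] dates = { pack(2005, 11, 15), pack(1999, 1, 31) };
        for (int doc = 0; doc < 2; doc++) {
            fieldsFile[doc * 3]     = (byte) (dates[doc] >>> 16);
            fieldsFile[doc * 3 + 1] = (byte) (dates[doc] >>> 8);
            fieldsFile[doc * 3 + 2] = (byte) dates[doc];
        }
        // true: doc 0 (2005-11-15) is later than doc 1 (1999-01-31)
        System.out.println(dateOf(fieldsFile, 0) > dateOf(fieldsFile, 1));
        System.out.println(year(dateOf(fieldsFile, 1))); // 1999
    }
}

Because the year occupies the high bits, comparing two packed values
compares dates chronologically, so sorting hits by date only ever
touches this in-memory array.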

Any thoughts about this?



We have a similar problem.

On Tuesday, 15 November 2005 at 23:23, Karel Tejnora wrote:
> Hi all,
>     in our testing application we are using Lucene 1.4.3. Thank you
> guys for that great job.
> Our index is about 12 GiB in a single (merged) file. Retrieving hits
> takes a nicely small amount of time, but reading the stored fields
> takes 10-100 times longer, I think because all of the fields are read.
> I would like to try implementing the Lucene index files as tables in
> a database, with some lazy loading of fields. Searching the web, I
> have found only an implementation of store.Directory (bdb), but it
> only holds the data as binary streams. That technique will not help
> much, because BLOB operations do not perform well. On the other hand
> I would lose some of the freedom that comes with variable document
> fields, but I could omit a lot of the skipping and many of the open
> files. IndexWriter could also get document/term locking granularity.
> So I think that path leads to extending IndexWriter / IndexReader and
> writing my own implementation of the index.Segment* classes. Is that
> the best way, or am I missing something about how to achieve this?
> If it is a bad idea, I will be happy to hear about other possibilities.
> I would also like to join the development of Lucene. Are there some
> pointers on how to start?
> Thanks for reading this,
> and sorry if I made some mistakes.
> Karel

