lucene-dev mailing list archives

From "Otis Gospodnetic (JIRA)" <>
Subject [jira] Resolved: (LUCENE-196) [PATCH] Added support for segmented field data files and cached directories
Date Thu, 15 Jun 2006 19:28:30 GMT
Otis Gospodnetic resolved LUCENE-196:

    Resolution: Duplicate
     Assign To:     (was: Lucene Developers)

Thanks Christian.  I think LUCENE-545 provided the solution to selective field loading now.

> [PATCH] Added support for segmented field data files and cached directories
> ---------------------------------------------------------------------------
>          Key: LUCENE-196
>          URL:
>      Project: Lucene - Java
>         Type: Improvement

>   Components: Index
>     Versions: CVS Nightly - Specify date in submission
>  Environment: Operating System: All
> Platform: All
>     Reporter: Christian Kohlschütter
>     Priority: Minor
>  Attachments: docStore-patch.txt, docStore-test-patch.txt, docStore-test-patch.txt, docStore-test-patch.txt,
>               newDocStore-patch.txt, newDocStore-test-patch.txt
> Hello, 
> I would like to contribute the following enhancement, hoping that it would be 
> as useful for you as it is for me. 
> For one of my applications, it was necessary to reprocess the Documents 
> returned by a search in a Lucene index according to some Field values (for 
> applying an "edit distance" function on unindexed fields, in my case). 
> Because Lucene has to load every possibly relevant document (*all* fields, 
> including the ones which are irrelevant for the algorithm) from disk into 
> memory for this operation, doing so is extremely time-consuming. 
> As far as I can see, currently, there is no satisfying solution to improve 
> this situation except buffering all data in RAM using a RAMDirectory. 
> But what if the field data is just too big to fit in RAM? 
> My patch will handle this by splitting the monolithic "*.fdt"-Field data file 
> into several "data store" files .fdt, .fd1, .fd2 and so on. 
> These "data store" files are connected as a linked-list which permits you to 
> load only the part of the field data that is relevant for the current 
> operation. 
> So, you can load all field data (as in the current implementation), or the 
> fields from a specific interval [0;n] of data stores. Store 0 represents the 
> data in the ".fdt" file, all data stores with ids > 0 are represented by files 
> ".fd1", ".fd2", and so on. 
> In my case, I would then simply cache the ".fdt" (data store 0) file in RAM 
> (using a symbolic link to shm-/tmp), but leave all other .fd* files on 
> hard disk. The .fdt file contains only the field relevant to my algorithm 
> (and therefore remains quite small); all the other fields are stored in the 
> rather big ".fd1" file. So, accessing Fields in .fdt requires no disk I/O, 
> which speeds things up remarkably. 
> You can compare this feature with having multiple tables in a relational 
> database that are linked with 1..1 cardinality instead of having one big 
> table. 
> My proposed enhancement requires some API additions, which I will try to 
> explain now. 
> To specify the desired data store for a Field, simply call the new method 
> "Field setDataStore(int)" (docstore 0 is the default): 
> doc.add(Field.Keyword("fieldA", "this is in docstore 0")); 
> doc.add(Field.Keyword("fieldB", "this is in docstore 1").setDataStore(1)); 
> In this example, fieldA would be stored in ".fdt"; fieldB in ".fd1". 
> When you retrieve the Document object (example docId = 123) using an 
> IndexReader, you have the following options: 
> "indexReader.document(123)" would load all fields from all data stores. 
> "indexReader.document(123, 0)" would load only the fields from data store 0. 
> "indexReader.document(123, 1)" would explicitly load only the fields from data 
> stores 0 and 1. 
> The method "IndexReader.document(int n, int k)" is defined to fetch all fields 
> from all data stores *at least* up to ID k. That way, existing IndexReader 
> subclasses do not have to be modified, as I provide an overridable method in 
> IndexReader which simply calls document(int n). 
> A more concrete example is attached to this feature request as a 
> JUnit-Testcase, as well as the patch itself. 
> Have fun with it! 
> Best regards, 
> Christian Kohlschuetter
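
Editor's sketch (not part of the original message or the attached patch): the data store scheme described above can be modelled in a few lines of plain Java. Everything here is an assumption made for illustration: DataStoreSketch, StoredField, extensionFor and load are hypothetical names and not Lucene classes. The sketch only mirrors the stated rules: store 0 maps to ".fdt", store k > 0 maps to ".fd" + k, and a load with limit k returns the fields from stores 0 through k.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only: none of these types are Lucene classes.
class DataStoreSketch {

    /** A stored field tagged with the data store it belongs to (0 is the default). */
    static final class StoredField {
        final String name;
        final String value;
        final int dataStore;

        StoredField(String name, String value, int dataStore) {
            if (dataStore < 0) {
                throw new IllegalArgumentException("dataStore must be >= 0");
            }
            this.name = name;
            this.value = value;
            this.dataStore = dataStore;
        }
    }

    /** Store 0 maps to ".fdt"; store k > 0 maps to ".fd" + k, as the message describes. */
    static String extensionFor(int dataStore) {
        return dataStore == 0 ? ".fdt" : ".fd" + dataStore;
    }

    /** Returns the fields of one document that live in data stores 0..maxStore. */
    static Map<String, String> load(List<StoredField> doc, int maxStore) {
        Map<String, String> loaded = new LinkedHashMap<>();
        for (StoredField f : doc) {
            if (f.dataStore <= maxStore) {
                loaded.put(f.name, f.value);
            }
        }
        return loaded;
    }

    /** Loads every field, regardless of data store. */
    static Map<String, String> loadAll(List<StoredField> doc) {
        return load(doc, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        List<StoredField> doc = new ArrayList<>();
        doc.add(new StoredField("fieldA", "this is in docstore 0", 0));
        doc.add(new StoredField("fieldB", "this is in docstore 1", 1));

        System.out.println(load(doc, 0).keySet());   // prints [fieldA]
        System.out.println(load(doc, 1).keySet());   // prints [fieldA, fieldB]
        System.out.println(extensionFor(0) + " " + extensionFor(1)); // prints .fdt .fd1
    }
}
```

The message defines the two-argument load as returning *at least* stores 0 through k; this sketch returns exactly that range, which is one permitted behaviour under that definition.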

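Editor's sketch (not taken from the patch): the quoted message says existing IndexReader subclasses need no changes because the base class supplies an overridable document(int n, int k) that simply calls document(int n). The pattern might look roughly like the following. ReaderSketch, LegacyReader and StoreAwareReader are hypothetical stand-ins, and the returned strings merely label which code path ran.

```java
// Hypothetical stand-ins; these are not Lucene's IndexReader classes.
abstract class ReaderSketch {
    /** Loads every stored field of document n (the pre-existing contract). */
    abstract String document(int n);

    /** Loads at least stores 0..k; the default falls back to loading everything. */
    String document(int n, int k) {
        return document(n);
    }
}

/** An existing subclass that knows nothing about data stores keeps working. */
class LegacyReader extends ReaderSketch {
    @Override
    String document(int n) {
        return "doc " + n + ": all stores";
    }
}

/** A subclass that opts in by overriding the two-argument method. */
class StoreAwareReader extends ReaderSketch {
    @Override
    String document(int n) {
        return "doc " + n + ": all stores";
    }

    @Override
    String document(int n, int k) {
        return "doc " + n + ": stores 0.." + k;
    }
}

class ReaderSketchDemo {
    public static void main(String[] args) {
        System.out.println(new LegacyReader().document(123, 0));     // prints doc 123: all stores
        System.out.println(new StoreAwareReader().document(123, 0)); // prints doc 123: stores 0..0
    }
}
```

Loading everything still satisfies an "at least up to ID k" contract, which is why the fallback is safe for subclasses that never override it.
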
This message is automatically generated by JIRA.
