lucene-dev mailing list archives

From "Otis Gospodnetic (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (LUCENE-196) [PATCH] Added support for segmented field data files and cached directories
Date Thu, 15 Jun 2006 19:28:30 GMT
     [ http://issues.apache.org/jira/browse/LUCENE-196?page=all ]
     
Otis Gospodnetic resolved LUCENE-196:
-------------------------------------

    Resolution: Duplicate
     Assign To:     (was: Lucene Developers)

Thanks, Christian.  I think LUCENE-545 now provides the solution for selective field loading.

> [PATCH] Added support for segmented field data files and cached directories
> ---------------------------------------------------------------------------
>
>          Key: LUCENE-196
>          URL: http://issues.apache.org/jira/browse/LUCENE-196
>      Project: Lucene - Java
>         Type: Improvement

>   Components: Index
>     Versions: CVS Nightly - Specify date in submission
>  Environment: Operating System: All
> Platform: All
>     Reporter: Christian Kohlschütter
>     Priority: Minor
>  Attachments: docStore-patch.txt, docStore-test-patch.txt, docStore-test-patch.txt, docStore-test-patch.txt,
>               newDocStore-patch.txt, newDocStore-test-patch.txt
>
> Hello, 
>  
> I would like to contribute the following enhancement, hoping that it will be 
> as useful to you as it is to me. 
>  
> For one of my applications, it was necessary to reprocess the Documents 
> returned by a search in a Lucene index according to some Field values (for 
> applying an "edit distance" function on unindexed fields, in my case). 
>  
> Lucene has to load every possibly relevant document (*all* fields, 
> including the ones which are irrelevant for the algorithm) from disk into 
> memory for this operation, which is extremely time-consuming. 
>  
> As far as I can see, there is currently no satisfactory way to improve 
> this situation except buffering all data in RAM using a RAMDirectory. 
>  
> But what if the field data is just too big to fit in RAM? 
>  
> My patch handles this by splitting the monolithic "*.fdt" field data file 
> into several "data store" files: .fdt, .fd1, .fd2 and so on. 
>  
> These "data store" files are connected as a linked list, which permits you to 
> load only the part of the field data that is relevant for the current 
> operation. 
>  
> So, you can load all field data (as in the current implementation), or the 
> fields from a specific interval [0;n] of data stores. Store 0 represents the 
> data in the ".fdt" file, all data stores with ids > 0 are represented by files 
> ".fd1", ".fd2", and so on. 
>  
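The naming rule described above can be sketched as follows. This is only an illustration of the scheme as stated in the description, not code from the attached patch; the class and method names (DataStoreFiles, extensionFor, filesUpTo) and the segment name "_1" are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Lucene or of the attached patch.
class DataStoreFiles {

    // Store 0 keeps the classic ".fdt" extension; store k > 0 uses ".fd" + k.
    static String extensionFor(int storeId) {
        if (storeId < 0) {
            throw new IllegalArgumentException("store id must be >= 0");
        }
        return storeId == 0 ? ".fdt" : ".fd" + storeId;
    }

    // Names of the files needed to read the stores in the interval [0; maxStoreId].
    static List<String> filesUpTo(String segment, int maxStoreId) {
        List<String> files = new ArrayList<>();
        for (int id = 0; id <= maxStoreId; id++) {
            files.add(segment + extensionFor(id));
        }
        return files;
    }

    public static void main(String[] args) {
        // "_1" is an assumed segment name, used only for illustration.
        System.out.println(filesUpTo("_1", 2));
    }
}
```
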
> In my case, I would then simply cache the ".fdt" (data store 0) file in RAM 
> (using a symbolic link to shm-/tmp), but leave all other .fd* files on 
> hard disk. The .fdt file only contains the relevant field for my algorithm 
> (and therefore remains quite small); all the other fields are stored in the 
> rather big ".fd1" file. So, accessing Fields in .fdt requires no disk I/O, 
> which speeds things up remarkably. 
>  
> You can compare this feature with having multiple tables in a relational 
> database that are linked with 1..1 cardinality instead of having one big 
> table. 
>  
> My proposed enhancement requires some API additions, which I will try to 
> explain now. 
>  
> To specify the desired data store for a Field, simply call the new method 
> "Field setDataStore(int)" (docstore 0 is the default): 
> doc.add(Field.Keyword("fieldA", "this is in docstore 0")); 
> doc.add(Field.Keyword("fieldB", "this is in docstore 1").setDataStore(1)); 
>  
> In this example, fieldA would be stored in ".fdt"; fieldB in ".fd1". 
>  
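For readers without the patch at hand, the chained call above works because the setter returns the field itself. A minimal sketch of that pattern, using a simplified stand-in class (SketchField is hypothetical and is not Lucene's Field):

```java
// Simplified stand-in for a stored field; NOT Lucene's Field class.
final class SketchField {
    final String name;
    final String value;
    private int dataStore = 0; // data store 0 is the default, as in the proposal

    SketchField(String name, String value) {
        this.name = name;
        this.value = value;
    }

    // Returns this, so the call can be chained inside an add(...) call.
    SketchField setDataStore(int id) {
        if (id < 0) {
            throw new IllegalArgumentException("data store id must be >= 0");
        }
        this.dataStore = id;
        return this;
    }

    int getDataStore() {
        return dataStore;
    }

    public static void main(String[] args) {
        SketchField a = new SketchField("fieldA", "this is in docstore 0");
        SketchField b = new SketchField("fieldB", "this is in docstore 1").setDataStore(1);
        System.out.println(a.name + " -> store " + a.getDataStore());
        System.out.println(b.name + " -> store " + b.getDataStore());
    }
}
```
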
> When you retrieve the Document object (example docId = 123) using an 
> IndexReader, you have the following options: 
> "indexReader.document(123)" would load all fields from all data stores. 
> "indexReader.document(123, 0)" would load only the fields from data store 0. 
> "indexReader.document(123, 1)" would explicitly load only the fields from data 
> stores 0 and 1. 
>  
> The method "IndexReader.document(int n, int k)" is defined to fetch all fields 
> from all data stores *at least* up to ID k. That way, existing IndexReader 
> subclasses do not have to be modified, as I provide an overridable default 
> implementation in IndexReader which simply calls document(int n). 
>  
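The "at least up to ID k" contract is what keeps old subclasses working: a default that loads everything still satisfies it. A rough sketch of that idea, assuming a much simplified reader that holds a single document as a list of per-store field maps (SketchReader and StoreAwareReader are hypothetical names, not classes from Lucene or from the patch):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for IndexReader; names and types are illustrative only.
abstract class SketchReader {

    // Loads all fields of document n from all data stores.
    abstract Map<String, String> document(int n);

    // Contract: return the fields from all data stores at least up to id k.
    // The default loads everything, which satisfies "at least",
    // so subclasses that only implement document(int) keep working.
    Map<String, String> document(int n, int k) {
        return document(n);
    }
}

// A store-aware reader that overrides the two-argument method to skip stores > k.
final class StoreAwareReader extends SketchReader {

    // stores.get(s) maps field name to value for data store s.
    // For brevity this sketch holds one document, so n is ignored.
    private final List<Map<String, String>> stores;

    StoreAwareReader(List<Map<String, String>> stores) {
        this.stores = stores;
    }

    @Override
    Map<String, String> document(int n) {
        return document(n, stores.size() - 1);
    }

    @Override
    Map<String, String> document(int n, int k) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (int s = 0; s <= k && s < stores.size(); s++) {
            fields.putAll(stores.get(s));
        }
        return fields;
    }
}
```

With two stores holding fieldA and fieldB respectively, document(123, 0) on the store-aware reader returns only fieldA, while a reader that never overrides the two-argument method returns both fields for the same call.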
> A more concrete example is attached to this feature request as a 
> JUnit-Testcase, as well as the patch itself. 
>  
> Have fun with it! 
>  
>  
> Best regards, 
>  
> Christian Kohlschuetter

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

