lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 27743] - [PATCH] Added support for segmented field data files and cached directories
Date Thu, 18 Mar 2004 21:07:54 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=27743>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=27743

[PATCH] Added support for segmented field data files and cached directories





------- Additional Comments From ck@rrzn.uni-hannover.de  2004-03-18 21:07 -------
Doug, 
 
thanks for your reply. I think that I should explain some background of this 
patch. 
 
The main reason for writing this patch was to provide support for applying 
functions to field values, where the functions are independent of any upstream 
index but dependent on the entered query. 
 
In my application, I use an index (accessed through TermEnum/TermDocs) to 
reduce the number of returned documents K to a fraction of all documents N. The 
returned set (possibly multiple terms per document) then needs to be 
reprocessed against the entered query (which may consist of multiple terms as 
well). After reprocessing, the resulting set R is much smaller than the set of 
initially returned documents K (|R| << |K|), where R is a subset of K. 
 
This procedure is comparable to the following in the SQL world: 
SELECT TextValue FROM table1 WHERE IndexValue = 'FOOBAR' AND 
DISTANCE_FUNCTION(TextValue, 'Query String') < 0.4 
 
An index-based solution would exist if DISTANCE_FUNCTION depended only on 
stored columns ("functional indexes", as in PostgreSQL), but in this case I see 
no way around applying some function to every returned document (a cost of 
roughly O(K)). 
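The two-phase scheme described above can be sketched in plain Java. This is an illustration only: the distance function here is a toy (normalized length difference) standing in for the real, application-specific DISTANCE_FUNCTION, and the posting list stands in for the K doc ids an index lookup (TermEnum/TermDocs) would produce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoPhaseSketch {

    // Toy stand-in for DISTANCE_FUNCTION(TextValue, 'Query String'):
    // normalized difference of string lengths, in [0, 1].
    static double distance(String fieldValue, String query) {
        int max = Math.max(fieldValue.length(), query.length());
        return max == 0 ? 0.0
                        : Math.abs(fieldValue.length() - query.length()) / (double) max;
    }

    // Phase 1 is assumed done: postingList holds the K doc ids matched by the
    // index lookup. Phase 2 scans only those K candidates, applying the
    // query-dependent distance function to shrink K down to R -- a cost of O(K).
    static List<Integer> rescore(Map<Integer, String> fieldData,
                                 List<Integer> postingList,
                                 String query, double threshold) {
        List<Integer> result = new ArrayList<>();
        for (int docId : postingList) {
            if (distance(fieldData.get(docId), query) < threshold) {
                result.add(docId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> fieldData = new HashMap<>();
        fieldData.put(1, "foobar");         // same length as the query: distance 0.0
        fieldData.put(2, "foobarbazquux");  // much longer: distance > 0.4
        List<Integer> r = rescore(fieldData, Arrays.asList(1, 2), "foobaz", 0.4);
        System.out.println(r);              // prints "[1]"
    }
}
```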
 
Unfortunately, my initial dataset (the monolithic .FDT file) was far too big 
(gigabytes) to fit into a RAMDirectory, so seeking and reading from the hard 
disk must be factored into the cost. 
 
So I came up with the idea of "partitioning" the field data file: partition 
("dataStore") 0 would be small enough to fit into RAM (no seek time when 
skipping from one document to another). Partition 1 only fits on my slow hard 
disk, but from this partition I only need the data belonging to the documents 
in R, not to all of K. 
 
That way, the cost is still linear, but without seeking, and this gives a 
remarkable speedup in search time when you have to look at, for example, 
200,000 entries (K) out of 5,000,000 entries (N) just to get some 100 entries 
(R) as the final result, and this per user, with lots of simultaneous requests. 
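The access pattern behind that speedup can be sketched as follows. This is a hypothetical illustration, not the patch's actual classes: plain maps stand in for partition 0 (RAM-resident) and partition 1 (slow disk), and the point is which partition is consulted how often.

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionedFieldData {
    // Partition 0: small values, RAM-resident, no seek cost; read for all K candidates.
    private final Map<Integer, String> partition0 = new HashMap<>();
    // Partition 1: large values; stand-in for the slow disk, read only for the
    // |R| << |K| final hits.
    private final Map<Integer, String> partition1 = new HashMap<>();

    void add(int docId, String smallValue, String largeValue) {
        partition0.put(docId, smallValue);
        partition1.put(docId, largeValue);
    }

    // Cheap lookup, used while scanning every candidate in K.
    String fastField(int docId) { return partition0.get(docId); }

    // Expensive lookup, performed only for documents that made it into R.
    String slowField(int docId) { return partition1.get(docId); }

    public static void main(String[] args) {
        PartitionedFieldData fields = new PartitionedFieldData();
        fields.add(42, "short summary", "full document body");
        System.out.println(fields.fastField(42)); // prints "short summary"
        System.out.println(fields.slowField(42)); // prints "full document body"
    }
}
```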
 
Perhaps something like this could also be implemented using the new TermVector 
support, but I have not thought about that in detail yet. 
 
Regarding compression: I would say there is no benefit in compressing the field 
data values, as they are not traversed sequentially. 
 
 
However, another application for "partitioned" field data files would simply be 
the shared storage of document information. You could have partition 0 in a 
RAMDirectory, partition 1 in a local FSDirectory, and partition 2 in an 
NFS-mounted FSDirectory, for example. A partition would then only be accessed 
when necessary, e.g. if the user wants more detailed information about a 
document, such as the original HTML document along with all the other 
information, as in the Google Cache. 
 
Currently, I would store that data outside Lucene and use a filename or URI as 
a field value pointing to the data (a "CLOB"). With the new feature, a simple 
indexReader.document(docNo, 2).get("FieldName") would be enough.
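The contrast between the two approaches can be sketched like this. Everything here is a hypothetical stand-in, not a Lucene API: ExternalBlobStore models data kept outside Lucene behind a URI, and Partition models direct retrieval from a slow field-data partition (e.g. partition 2 above).

```java
import java.util.HashMap;
import java.util.Map;

public class ClobWorkaroundSketch {

    // Current workaround: the stored field holds only a URI; the payload
    // lives outside Lucene and must be fetched through an indirection.
    static class ExternalBlobStore {
        final Map<String, String> byUri = new HashMap<>();
        String fetch(String uri) { return byUri.get(uri); }
    }

    // With partitioned field data: one lookup against the partition itself,
    // keyed directly by document number.
    static class Partition {
        final Map<Integer, String> byDoc = new HashMap<>();
        String get(int docNo) { return byDoc.get(docNo); }
    }

    public static void main(String[] args) {
        ExternalBlobStore store = new ExternalBlobStore();
        store.byUri.put("file:///cache/doc42.html", "<html>cached page</html>");
        String viaUri = store.fetch("file:///cache/doc42.html"); // two steps: field -> URI -> data

        Partition p2 = new Partition();
        p2.byDoc.put(42, "<html>cached page</html>");
        String direct = p2.get(42);                              // one step: docNo -> data

        System.out.println(viaUri.equals(direct)); // prints "true"
    }
}
```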

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

