From: Shai Erera
Date: Fri, 18 Oct 2013 08:08:59 +0300
Subject: Re: external file stored field codec
To: "java-user@lucene.apache.org"

> The codec intercepts merges in order to clean up files that are no longer
> referenced

What happens if a document is deleted while there's a reader open on the
index, and the segments are merged? Maybe I misunderstand what you meant by
this statement, but if the external file is deleted, since the document is
"pruned" from the index, how will the reader be able to read the stored
fields from it? How do you track references to the external files?

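To make the concern concrete, here is one way such tracking could work: count the open readers that still reference each external file, and defer the physical delete until the last one is closed. This is only a sketch in plain Java under my own assumptions. The class ExternalFileRefs and its methods are hypothetical names, the "deletion" is just recorded in a set, and nothing here is taken from the FSFieldCodec implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model: an external file stays alive while any open reader references it. */
class ExternalFileRefs {
    private final Map<Integer, Integer> refCounts = new HashMap<>(); // file number -> open references
    private final Set<Integer> pendingDelete = new HashSet<>();      // dropped by a merge, still in use
    private final Set<Integer> deleted = new HashSet<>();            // stands in for removal from disk

    /** A reader starts using the given file. */
    synchronized void incRef(int fileNumber) {
        refCounts.merge(fileNumber, 1, Integer::sum);
    }

    /** A reader stops using the file; delete it if a merge already dropped it. */
    synchronized void decRef(int fileNumber) {
        Integer count = refCounts.get(fileNumber);
        if (count == null) {
            throw new IllegalStateException("no reference held on file " + fileNumber);
        }
        if (count == 1) {
            refCounts.remove(fileNumber);
            if (pendingDelete.remove(fileNumber)) {
                deleted.add(fileNumber);
            }
        } else {
            refCounts.put(fileNumber, count - 1);
        }
    }

    /** A merge no longer needs the file: delete now if unreferenced, otherwise defer. */
    synchronized void mergedAway(int fileNumber) {
        if (refCounts.containsKey(fileNumber)) {
            pendingDelete.add(fileNumber);
        } else {
            deleted.add(fileNumber);
        }
    }

    synchronized boolean isDeleted(int fileNumber) {
        return deleted.contains(fileNumber);
    }

    public static void main(String[] args) {
        ExternalFileRefs refs = new ExternalFileRefs();
        refs.incRef(1);                        // an open reader still sees the document in file 1
        refs.mergedAway(1);                    // document deleted, segments merged
        System.out.println(refs.isDeleted(1)); // prints: false
        refs.decRef(1);                        // reader closed
        System.out.println(refs.isDeleted(1)); // prints: true
    }
}
```

With something like this, a reader opened before the merge would keep its files until it is closed, at the cost of leaving stale files on disk for as long as old readers stay open.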
Since you write that all tests in the o.a.l.index package pass, I assume you
handle this, but here's a simple testcase I have in mind:

    IndexWriter writer = new IndexWriter(dir, configWithNewCodec());
    writer.addDocument(addDocWithStoredFields("doc1"));
    writer.addDocument(addDocWithStoredFields("doc2"));
    writer.commit();
    writer.addDocument(addDocWithStoredFields("doc3"));
    writer.addDocument(addDocWithStoredFields("doc4"));
    // open a reader before the deletes, so it still sees doc1 and doc4
    IndexReader reader = writer.getReader();
    writer.deleteDocuments("doc1");
    writer.deleteDocuments("doc4");
    writer.forceMerge(1);
    writer.close();
    // the old reader should still be able to load both documents
    System.out.println(reader.document("doc1"));
    System.out.println(reader.document("doc4"));

Does this test pass?

Shai

On Fri, Oct 18, 2013 at 7:14 AM, Michael Sokolov
<msokolov@safaribooksonline.com> wrote:

> On 10/13/13 8:09 PM, Michael Sokolov wrote:
>
>> On 10/13/2013 1:52 PM, Adrien Grand wrote:
>>
>>> Hi Michael,
>>>
>>> I'm not aware enough of operating system internals to know what
>>> exactly happens when a file is open but it sounds to me like having
>>> separate files per document or field adds levels of indirection when
>>> loading stored fields, so I would be surprised if it actually proved
>>> to be more efficient than storing everything in a single file.
>>
>> That's true, Adrien, there's definitely a cost to using files. There
>> are some gnarly challenges in here (mostly to do with the large number
>> of files, as you say, and with cleaning up after deletes - deletion is
>> always hard). I'm not sure it's going to be possible to both clean up
>> and maintain files for stale commits; this will become problematic in
>> the way that having index files on NFS mounts is problematic.
>>
>> I think the hope is that there will be countervailing savings during
>> writes and merges (mostly) because we may be able to cleverly avoid
>> copying the contents of stored fields being merged.
>> There may also be savings when querying due to reduced RAM
>> requirements since the large stored fields won't be paged in while
>> performing queries. As I said, some simple tests do show improvements
>> under at least some circumstances, so I'm pursuing this a bit further.
>> I have a preliminary implementation as a codec now, and I'm learning a
>> bit about Lucene's index internals. BTW SimpleTextCodec is a great
>> tool for learning and debugging.
>>
>> The background for this is a document store with large files (think
>> PDFs, but lots of formats) that have to be tracked, and have
>> associated metadata. We've been storing these externally, but it would
>> be beneficial to have a single data management layer: i.e. to push
>> this down into Lucene, for a variety of reasons. For one, we could
>> rely on Solr to do our replication for us.
>>
>> I'll post back when I have some measurements.
>>
>> -Mike
>
> This idea actually does seem to be working out pretty nicely. I compared
> time to write and then to read documents that included a couple of small
> indexed fields and a binary stored field that varied in size. Writing to
> external files, via the FSFieldCodec, was 3-20 times faster than writing
> to the index in the normal way (using MMapDirectory). Reading was
> sometimes faster and sometimes slower. I also measured time for a
> forceMerge(1) at the end of each test: this was almost always nearly
> zero when binaries were external, and grew larger with more data in the
> normal case. I believe the improvements we're seeing here result largely
> from removing the bulk of the data from the merge I/O path.
>
> As with any performance measurements, a lot of factors can affect the
> measurements, but this effect seems pretty robust across the conditions
> I measured (different file sizes, numbers of files, and frequency of
> commits, with lots of repetition).
> One oddity is a large difference between the Mac SSD filesystem
> (15-20x writing, 0.6x reading, via FSFieldCodec) and the Linux ext4 HD
> filesystem (3-4x writing, 1.5x reading).
>
> The codec works as a wrapper around another codec (like the compressing
> codecs), intercepting binary and string stored fields larger than a
> configurable threshold, and storing a file number as a reference in the
> main index which then functions kind of like a symlink. The codec
> intercepts merges in order to clean up files that are no longer
> referenced, taking special care to preserve the ability of the other
> codecs to perform bulk merges. The codec passes all the Lucene unit
> tests in the o.a.l.index package.
>
> The implementation is still very experimental, with lots of details to
> be worked out: for example, I haven't yet measured the performance
> impact of deletions, which could be pretty significant. It would be
> really great if someone with intimate knowledge of Lucene's indexing
> internals were able to review it: I'd be happy to share the code and my
> list of TODOs and questions if there's any interest, but at least I
> thought it would be interesting to know that the approach does seem to
> be worth pursuing.
>
> -Mike
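
For readers following the thread, the symlink-like scheme described in the quoted message can be illustrated with a small sketch. This is plain Java written under stated assumptions, not the FSFieldCodec source (which is not shown in this thread) and not a Lucene codec: the class name ExternalFieldStore, the ".bin" file suffix, and the in-memory list that stands in for the main index are all hypothetical. Values larger than the threshold are written to a numbered file and only the file number is kept; smaller values stay inline.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: large values go to numbered files; the "index" keeps only the number. */
class ExternalFieldStore {
    /** What the main index would hold for one stored field. */
    private static final class Entry {
        final byte[] inline;    // non-null for values at or below the threshold
        final long fileNumber;  // meaningful only when inline == null
        Entry(byte[] inline, long fileNumber) {
            this.inline = inline;
            this.fileNumber = fileNumber;
        }
    }

    private final Path dir;
    private final int threshold;
    private final List<Entry> entries = new ArrayList<>(); // list position stands in for the doc ID
    private long nextFileNumber = 0;

    ExternalFieldStore(Path dir, int threshold) {
        this.dir = dir;
        this.threshold = threshold;
    }

    /** Convenience factory that keeps external files in a fresh temporary directory. */
    static ExternalFieldStore inTempDir(int threshold) {
        try {
            return new ExternalFieldStore(Files.createTempDirectory("extfields"), threshold);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stores a value and returns its doc ID; values larger than the threshold go to a file. */
    int add(byte[] value) {
        if (value.length > threshold) {
            long fileNumber = nextFileNumber++;
            try {
                Files.write(pathFor(fileNumber), value);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            entries.add(new Entry(null, fileNumber));
        } else {
            entries.add(new Entry(value.clone(), -1L));
        }
        return entries.size() - 1;
    }

    /** Loads a value, following the file-number reference when the value is external. */
    byte[] get(int docId) {
        Entry e = entries.get(docId);
        if (e.inline != null) {
            return e.inline.clone();
        }
        try {
            return Files.readAllBytes(pathFor(e.fileNumber));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    boolean isExternal(int docId) {
        return entries.get(docId).inline == null;
    }

    Path pathFor(long fileNumber) {
        return dir.resolve(fileNumber + ".bin");
    }

    public static void main(String[] args) {
        ExternalFieldStore store = inTempDir(16);
        int small = store.add("title".getBytes());
        int large = store.add(new byte[1024]);
        System.out.println(store.isExternal(small) + " " + store.isExternal(large)); // prints: false true
    }
}
```

Because the index side only ever holds a small number per large field, a merge in this model would copy those numbers and never touch the file contents, which is consistent with the near-zero forceMerge(1) times reported above. The sketch leaves out cleanup of unreferenced files entirely.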