Date: Sat, 11 Apr 2009 18:10:19 +0300
Message-ID: <786fde50904110810t3e1e98c4ma45e12f78cf93a34@mail.gmail.com>
Subject: (Benchmark) Split DocMaker into DocCollector and DocMaker
From: Shai Erera
To: java-dev@lucene.apache.org

Hi

I would like to propose some refactoring to the benchmark package. Today, DocMaker has two roles: collecting documents from a collection and preparing a Document object. I think these two should actually be split up into DocCollector and DocMaker, which will use a DocCollector instance.

DocCollector will implement the collection-related methods of today's DocMaker, such as getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ LUCENE-1591, by having a basic DocCollector that offers input stream services, and wraps a file (for example) with a bzip or gzip stream etc.

DocMaker will implement the makeDocument methods, reusing DocState etc.
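
To make the split concrete, here is a minimal sketch of the two roles. It is only an illustration under assumptions: RawDoc, InMemoryDocCollector and SimpleDocMaker are hypothetical names, the interface signatures are guesses rather than the benchmark package's actual API, and a plain Map stands in for a Lucene Document so the snippet compiles on its own.

```java
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical raw record handed from a collector to a maker. */
final class RawDoc {
    final String title;
    final String body;
    RawDoc(String title, String body) { this.title = title; this.body = body; }
}

/** Role 1 (assumed shape): read raw records and track the raw bytes read. */
interface DocCollector {
    RawDoc getNextDocData();   // returns null once the collection is exhausted
    long getRawBytes();
}

/** Role 2 (assumed shape): turn the next raw record into a document. */
interface DocMaker {
    Map<String, String> makeDocument();   // Map is a stand-in for Document
}

/** Hypothetical collector over an in-memory list, for illustration only. */
final class InMemoryDocCollector implements DocCollector {
    private final Iterator<RawDoc> it;
    private long rawBytes;

    InMemoryDocCollector(List<RawDoc> docs) { this.it = docs.iterator(); }

    public RawDoc getNextDocData() {
        if (!it.hasNext()) return null;
        RawDoc d = it.next();
        // Count the UTF-8 size of what was read, independent of document creation.
        rawBytes += d.title.getBytes(StandardCharsets.UTF_8).length
                  + d.body.getBytes(StandardCharsets.UTF_8).length;
        return d;
    }

    public long getRawBytes() { return rawBytes; }
}

/** Hypothetical maker that knows nothing about where the data comes from. */
final class SimpleDocMaker implements DocMaker {
    private final DocCollector collector;

    SimpleDocMaker(DocCollector collector) { this.collector = collector; }

    public Map<String, String> makeDocument() {
        RawDoc raw = collector.getNextDocData();
        if (raw == null) return null;
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", raw.title);
        doc.put("body", raw.body);
        return doc;
    }
}
```

The maker only ever sees the DocCollector interface, so swapping the in-memory collector for a file-backed one would not touch the document-building code.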

The idea is that collecting the Enwiki documents, for example, should be the same whether I create documents using DocState, add payloads or index additional metadata. Same goes for Trec and Reuters collections, as well as LineDocMaker.
In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 99% the same and 99% different. Most of their differences lie in the way they read the data, while most of the similarity lies in the way they create documents (using DocState).
That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker (just the reuse of DocState). Also, other DocMakers do not use that DocState today.

So by having an EnwikiDocCollector, a ReutersDocCollector and others (TREC, Line, Simple), I can write several DocMakers, such as DocStateMaker, ConfigurableDocMaker (one which accepts all kinds of config options) and custom DocMakers (payload, facets, sorting), and pass each of them a DocCollector instance (much like we do today w/ DocMaker). That lets us reuse the same DocMaking algorithm with many document collections, as well as the same document collection algorithm with many DocMaker implementations.
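
A rough sketch of that mix-and-match idea follows. Everything in it is hypothetical: the one-method interfaces, ListDocCollector, and the "plain" and "payload" makers are stand-ins I made up for illustration, and a String plays the role of both the raw record and the finished document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Assumed minimal collector: returns raw text, or null when exhausted. */
interface DocCollector {
    String getNextDocData();
}

/** Assumed minimal maker: returns a "document" (a String here), or null. */
interface DocMaker {
    String makeDocument();
}

/** Hypothetical collector over a fixed list of raw records. */
final class ListDocCollector implements DocCollector {
    private final Iterator<String> it;

    ListDocCollector(String... docs) { this.it = Arrays.asList(docs).iterator(); }

    public String getNextDocData() { return it.hasNext() ? it.next() : null; }
}

final class MixAndMatchSketch {
    /** Maker strategy 1: pass the raw text through unchanged. */
    static DocMaker plainMaker(DocCollector c) {
        return c::getNextDocData;
    }

    /** Maker strategy 2: decorate each document, standing in for payloads. */
    static DocMaker payloadMaker(DocCollector c) {
        return () -> {
            String raw = c.getNextDocData();
            return raw == null ? null : raw + "|payload";
        };
    }

    /** Pull documents from a maker until it reports exhaustion. */
    static List<String> drain(DocMaker maker) {
        List<String> out = new ArrayList<>();
        for (String d = maker.makeDocument(); d != null; d = maker.makeDocument()) {
            out.add(d);
        }
        return out;
    }
}
```

Any maker can be paired with any collector, e.g. `MixAndMatchSketch.drain(MixAndMatchSketch.payloadMaker(new ListDocCollector("r1", "r2")))`, which is the N collectors times M makers reuse described above.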

This will also give us the opportunity to perf test document collection alone (i.e., compare bzip, gzip and regular input streams), w/o the overhead of creating a Document object.
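
As a sketch of what such a collection-only measurement could look like, the snippet below drains a hypothetical line-based collector over a plain stream and over a gzip stream without ever building a document. LineStreamCollector and CollectionOnlyBench are made-up names, the data is in memory rather than a real collection file, and only gzip is shown because the JDK has no built-in bzip stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical collector: one raw document per line of the wrapped stream. */
final class LineStreamCollector {
    private final BufferedReader reader;
    private long rawBytes;

    LineStreamCollector(InputStream in) {
        this.reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    /** Returns the next line, or null at end of stream. */
    String getNextDocData() {
        try {
            String line = reader.readLine();
            if (line != null) rawBytes += line.getBytes(StandardCharsets.UTF_8).length;
            return line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    long getRawBytes() { return rawBytes; }
}

final class CollectionOnlyBench {
    /** Compresses a byte array so the same data can be read back via gzip. */
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Opens a decompressing stream over gzip-compressed bytes. */
    static InputStream gunzip(byte[] compressed) {
        try {
            return new GZIPInputStream(new ByteArrayInputStream(compressed));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Drains the collector only; returns {docCount, rawBytes, elapsedNanos}. */
    static long[] run(InputStream in) {
        LineStreamCollector c = new LineStreamCollector(in);
        long start = System.nanoTime();
        long docs = 0;
        while (c.getNextDocData() != null) docs++;
        return new long[] { docs, c.getRawBytes(), System.nanoTime() - start };
    }
}
```

Calling `CollectionOnlyBench.run` once per stream type gives comparable document counts, byte counts and elapsed times; a real benchmark would of course need warm-up and much larger inputs before the timings mean anything.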

I've already done so in my code environment (I extend the benchmark package for my application's purposes) and I like the flexibility I have. I think this can be a nice contribution to the benchmark package, which can result in some code cleanup as well.

What do you think? I can open an issue and work out a patch.

Shai