Message-ID: <1557053501.1244638927351.JavaMail.jira@brutus>
Date: Wed, 10 Jun 2009 06:02:07 -0700 (PDT)
From: "Shai Erera (JIRA)"
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
In-Reply-To: <1279253617.1239463995255.JavaMail.jira@brutus>

     [ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1595:
-------------------------------
    Attachment: LUCENE-1595.patch

Some updates:
# Added a log.step config parameter to PerfTask, and implemented the logging of messages in tearDown. Also introduced getLogMessage(int recsCount), which subclasses can override.
#* Overrode getLogMessage in the relevant tasks that logged messages, such as AddDocTask, DeleteDocTask, WriteLineDocTask ... I also removed the logging code from these tasks.
# Added a ConsumeContentSource task together with a readContent.Source.alg - this can be used to simply read from a content source, if we want to measure the performance of a particular impl.
# Removed the "xerces" class name from EnwikiContentSource (read more below).

I changed EnwikiContentSource to not specifically request a Xerces SAXParser. However, the default is to use the JRE's SAXParser, which is Xerces. I wanted to remove the Xerces .jar, but when I attempted to read enwiki-20090306-pages-articles.xml, it failed w/ an AIOOBE (ArrayIndexOutOfBoundsException), so I don't think we can remove the .jar yet.

BTW, in LUCENE-1591 I reported that I am not able to parse that particular enwiki version, w/ and w/o Xerces, however Mike succeeded. So I don't know if this enwiki version is defective, or if it's a problem on Windows. Anyway, the bottom line is we cannot remove the Xerces .jar.

I think this patch is ready for commit. All benchmark tests pass.

> Split DocMaker into ContentSource and DocMaker
> ----------------------------------------------
>
>                 Key: LUCENE-1595
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1595
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>            Assignee: Mark Miller
>             Fix For: 2.9
>
>         Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch
>
>
> This issue proposes some refactoring to the benchmark package. Today, DocMaker has two roles: collecting documents from a collection and preparing a Document object.
> These two should actually be split up into ContentSource and DocMaker, which will use a ContentSource instance.
> ContentSource will take over the content-reading methods of today's DocMaker, like getNextDocData, tracking of the raw size in bytes etc. This can actually fit well w/ LUCENE-1591, by having a basic ContentSource that offers input stream services, and wraps a file (for example) with bzip or gzip streams etc.
> DocMaker will implement the makeDocument methods, reusing DocState etc.
> The idea is that collecting the Enwiki documents, for example, should be the same whether I create documents using DocState, add payloads or index additional metadata. Same goes for Trec and Reuters collections, as well as LineDocMaker.
> In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 99% the same and 99% different. Most of their differences lie in the way they read the data, while most of the similarity lies in the way they create documents (using DocState).
> That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker (just for the reuse of DocState). Also, other DocMakers do not use that DocState today, something they could have gotten for free with the proposed refactoring.
> So by having an EnwikiContentSource, a ReutersContentSource and others (TREC, Line, Simple), I can write several DocMakers, such as DocStateMaker, ConfigurableDocMaker (one which accepts all kinds of config options) and custom DocMakers (payload, facets, sorting), pass a ContentSource instance to them, and reuse the same DocMaking algorithm with many content sources, as well as the same ContentSource algorithm with many DocMaker implementations.
> This will also give us the opportunity to perf test content sources alone (i.e., compare bzip, gzip and regular input streams), w/o the overhead of creating a Document object.
> I've already done so in my code environment (I extend the benchmark package for my application's purposes) and I like the flexibility I have.
> I think this can be a nice contribution to the benchmark package, which can result in some code cleanup as well.
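
To make the split described in the issue concrete, here is a minimal sketch of the two roles. It is not taken from the patch: the class ContentSourceSketch, the simplified DocData, the LineContentSource input format (title, a tab, then the body), the getBytesCount name, the consume helper and the Map that stands in for Lucene's Document are all assumptions made for illustration, and the signatures in the real benchmark package may differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ContentSourceSketch {

    /** Raw content of one document, before any index-specific processing. */
    static final class DocData {
        final String title;
        final String body;

        DocData(String title, String body) {
            this.title = title;
            this.body = body;
        }
    }

    /** Role 1: reads raw documents from a collection and tracks the bytes read. */
    interface ContentSource {
        /** Returns the next raw document, or null when the source is exhausted. */
        DocData getNextDocData();

        /** Returns the raw size in bytes of everything read so far. */
        long getBytesCount();
    }

    /** Hypothetical source: one document per line, formatted as title, tab, body. */
    static final class LineContentSource implements ContentSource {
        private final Iterator<String> lines;
        private long bytes;

        LineContentSource(List<String> lines) {
            this.lines = lines.iterator();
        }

        @Override
        public DocData getNextDocData() {
            if (!lines.hasNext()) {
                return null;
            }
            String line = lines.next();
            bytes += line.getBytes(StandardCharsets.UTF_8).length;
            int tab = line.indexOf('\t');
            // A line without a tab is treated as a title with an empty body.
            String title = tab < 0 ? line : line.substring(0, tab);
            String body = tab < 0 ? "" : line.substring(tab + 1);
            return new DocData(title, body);
        }

        @Override
        public long getBytesCount() {
            return bytes;
        }
    }

    /** Role 2: builds a document from whatever ContentSource it is given. */
    static class DocMaker {
        private final ContentSource source;

        DocMaker(ContentSource source) {
            this.source = source;
        }

        /** A Map stands in for Lucene's Document to keep the sketch self-contained. */
        Map<String, String> makeDocument() {
            DocData data = source.getNextDocData();
            if (data == null) {
                return null;
            }
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("title", data.title);
            doc.put("body", data.body);
            return doc;
        }
    }

    /** Drains a source without building documents, to measure the source alone. */
    static int consume(ContentSource source) {
        int count = 0;
        while (source.getNextDocData() != null) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("First\tsome text", "Second\tmore text");

        // The same DocMaker works with any ContentSource implementation.
        DocMaker maker = new DocMaker(new LineContentSource(lines));
        System.out.println(maker.makeDocument());

        // The source can also be exercised on its own, with no document creation.
        ContentSource source = new LineContentSource(lines);
        System.out.println(consume(source) + " docs, " + source.getBytesCount() + " bytes");
    }
}
```

Because DocMaker only depends on the ContentSource interface, a different source (Enwiki, Reuters, TREC) or a different DocMaker (payloads, facets) can be swapped in without touching the other side, which is the reuse argued for above.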