Date: Mon, 6 Aug 2007 17:53:59 -0700 (PDT)
From: "Steven Parkes (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article
Message-ID: <32210286.1186448039351.JavaMail.jira@brutus>
In-Reply-To: <7143942.1185916072798.JavaMail.jira@brutus>

     [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Parkes updated LUCENE-971:
---------------------------------

    Attachment: LUCENE-971.patch.txt

Okay. Here's an update to the patch.

Changes:

1) EnwikiDocMaker replaces ExtractWikipedia

2) A sample algorithm is provided (and used by the build.xml file, which could be removed if desired)

3) A bug in LineDocMaker is fixed (it was storing both the title and date in the title field); small enough that it doesn't need its own JIRA(?)

4) LineDocMaker was made derivable-from

Much of the code in LineDocMaker is useful in EnwikiDocMaker, so I made it a base class (it's inheritance for implementation, not abstraction, so it could be changed, of course)

5) Made LineDocMaker and WriteLineDocTask multicharacter safe

Or at least I tried to. Wikipedia has non-ASCII characters in it. To make LineDocMaker work as a base class, I made it use an explicit FileInputStream, which is required so that SAX can extract the encoding correctly. I made WriteLineDocTask always write UTF-8 so that I can get non-ASCII in the output file. Seems like UTF-8 is the best encoding for line files?

At the same time, I made LineDocMaker assume UTF-8 (unless told otherwise by a derived class like EnwikiDocMaker) so that the line files created by EnwikiDocMaker/WriteLineDocTask can be read by LineDocMaker w/o loss.

> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-971
>                 URL: https://issues.apache.org/jira/browse/LUCENE-971
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>         Attachments: LUCENE-971.patch.txt, LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.
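
For readers who want to see the UTF-8 point from item 5 concretely, here is a minimal sketch of a line-per-article file that is written and read back with an explicit UTF-8 charset, keeping title and date in separate fields. It is not the code from the patch: the class name LineFileSketch, its methods, the tab separator, and the title/date/body field order are all assumptions made for illustration, and it uses current java.nio APIs rather than anything available to Lucene in 2007.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not LineDocMaker or WriteLineDocTask.
final class LineFileSketch {
    // Assumed field separator for this sketch.
    static final char SEP = '\t';

    // Build one line per article: title, date, body, in that (assumed) order.
    static String toLine(String title, String date, String body) {
        return clean(title) + SEP + clean(date) + SEP + clean(body);
    }

    // Replace characters that would break the one-line-per-article layout.
    private static String clean(String s) {
        return s.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ');
    }

    // Split a line back into exactly three fields, keeping title and date apart.
    static String[] parseLine(String line) {
        String[] parts = line.split("\t", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 3 tab-separated fields: " + line);
        }
        return parts;
    }

    // Always write UTF-8, regardless of the platform default encoding.
    static void write(Path file, List<String[]> docs) {
        try (BufferedWriter w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String[] d : docs) {
                w.write(toLine(d[0], d[1], d[2]));
                w.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Always read UTF-8, so non-ASCII text survives the round trip.
    static List<String[]> read(Path file) {
        List<String[]> docs = new ArrayList<>();
        try (BufferedReader r = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                docs.add(parseLine(line));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("enwiki-lines", ".txt");
        List<String[]> docs = new ArrayList<>();
        // "Z\u00fcrich" is a title with a non-ASCII character, written as an escape.
        docs.add(new String[] {"Z\u00fcrich", "2007-08-06", "Largest city\nin Switzerland"});
        write(file, docs);
        for (String[] d : read(file)) {
            System.out.println(d[0] + " | " + d[1] + " | " + d[2]);
        }
        Files.delete(file);
    }
}
```

Passing the charset explicitly on both sides is the design choice that matters here: if either the writer or the reader fell back to the platform default, a file produced on one machine could be misread on another.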