Subject: svn commit: r791528 - in /lucene/java/trunk/contrib/benchmark: ./ src/java/org/apache/lucene/benchmark/byTask/feeds/ src/test/org/apache/lucene/benchmark/byTask/feeds/
Date: Mon, 06 Jul 2009 15:56:40 -0000
To: java-commits@lucene.apache.org
From: markrmiller@apache.org
Message-Id: <20090706155640.5029F238887A@eris.apache.org>

Author: markrmiller
Date: Mon Jul  6 15:56:39 2009
New Revision: 791528

URL: http://svn.apache.org/viewvc?rev=791528&view=rev
Log:
LUCENE-1730: Fix TrecContentSource to use ISO-8859-1 when reading the TREC files,
unless a different encoding is specified. Additionally, ContentSource now supports
a content.source.encoding parameter in the configuration file.

Modified:
    lucene/java/trunk/contrib/benchmark/CHANGES.txt
    lucene/java/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ContentSource.java
    lucene/java/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/LineDocSource.java
    lucene/java/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecContentSource.java
    lucene/java/trunk/contrib/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds/TrecContentSourceTest.java

Modified: lucene/java/trunk/contrib/benchmark/CHANGES.txt
URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/benchmark/CHANGES.txt?rev=791528&r1=791527&r2=791528&view=diff
==============================================================================
--- lucene/java/trunk/contrib/benchmark/CHANGES.txt (original)
+++ lucene/java/trunk/contrib/benchmark/CHANGES.txt Mon Jul  6 15:56:39 2009
@@ -4,6 +4,12 @@
 
 $Id:$
 
+7/6/2009
+  LUCENE-1730: Fix TrecContentSource to use ISO-8859-1 when reading the TREC files,
+  unless a different encoding is specified. Additionally, ContentSource now supports
+  a content.source.encoding parameter in the configuration file.
+  (Shai Erera via Mark Miller)
+
 6/26/2009
   LUCENE-1716: Added the following support:
   doc.tokenized.norms: specifies whether to store norms

Modified: lucene/java/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ContentSource.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ContentSource.java?rev=791528&r1=791527&r2=791528&view=diff
==============================================================================
--- lucene/java/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ContentSource.java (original)
+++ lucene/java/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ContentSource.java Mon Jul  6 15:56:39 2009
@@ -34,9 +34,9 @@
 /**
  * Represents content from a specified source, such as TREC, Reuters etc. A
  * {@link ContentSource} is responsible for creating {@link DocData} objects for
- * its documents to be consumed by {@link ToDeleteDocMaker}. It also keeps track of
- * various statistics, such as how many documents were generated, size in bytes
- * etc.
+ * its documents to be consumed by {@link ToDeleteDocMaker}. It also keeps track
+ * of various statistics, such as how many documents were generated, size in
+ * bytes etc.
  * <p>
  * Supports the following configuration parameters:
  * <ul>
@@ -44,6 +44,9 @@
  * forever (default=true).
  * <li>content.source.verbose - specifies whether messages should be
  * output by the content source (default=false).
+ * <li>content.source.encoding - specifies which encoding to use when
+ * reading the files of that content source. Certain implementations may define
+ * a default value if this parameter is not specified. (default=null).
  * <li>content.source.log.step - specifies for how many documents a
  * message should be logged. If set to 0 it means no logging should occur.
  * NOTE: if verbose is set to false, logging should not occur even if
@@ -71,6 +74,7 @@
   protected boolean forever;
   protected int logStep;
   protected boolean verbose;
+  protected String encoding;
 
   private CompressorStreamFactory csFactory = new CompressorStreamFactory();
 
@@ -196,6 +200,7 @@
     forever = config.get("content.source.forever", true);
     logStep = config.get("content.source.log.step", 0);
     verbose = config.get("content.source.verbose", false);
+    encoding = config.get("content.source.encoding", null);
   }
 
 }

Modified: lucene/java/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/LineDocSource.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/LineDocSource.java?rev=791528&r1=791527&r2=791528&view=diff
==============================================================================
--- lucene/java/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/LineDocSource.java (original)
+++ lucene/java/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/LineDocSource.java Mon Jul  6 15:56:39 2009
@@ -39,6 +39,7 @@
  * Config properties:
  * <ul>
  * <li>docs.file=<path to the file>
+ * <li>content.source.encoding - default to UTF-8.
  * </ul>
  */
 public class LineDocSource extends ContentSource {
@@ -54,7 +55,7 @@
         reader.close();
       }
       InputStream is = getInputStream(file);
-      reader = new BufferedReader(new InputStreamReader(is, "UTF-8"), BUFFER_SIZE);
+      reader = new BufferedReader(new InputStreamReader(is, encoding), BUFFER_SIZE);
     } catch (IOException e) {
       throw new RuntimeException(e);
     }
@@ -111,6 +112,9 @@
       throw new IllegalArgumentException("docs.file must be set");
     }
     file = new File(fileName).getAbsoluteFile();
+    if (encoding == null) {
+      encoding = "UTF-8";
+    }
   }
 
 }

Modified: lucene/java/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecContentSource.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecContentSource.java?rev=791528&r1=791527&r2=791528&view=diff
==============================================================================
--- lucene/java/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecContentSource.java (original)
+++ lucene/java/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecContentSource.java Mon Jul  6 15:56:39 2009
@@ -47,10 +47,11 @@
  * (default=trec).
  * <li>html.parser - specifies the {@link HTMLParser} class to use for
  * parsing the TREC documents content (default=DemoHTMLParser).
+ * <li>content.source.encoding - if not specified, ISO-8859-1 is used.
  * </ul>
  */
 public class TrecContentSource extends ContentSource {
-  // TODO (3.0): change StringBuffer to StringBuffer
+  // TODO (3.0): change StringBuffer to StringBuilder
 
   private static final class DateFormatInfo {
     DateFormat[] dfs;
@@ -181,8 +182,8 @@
         System.out.println("opening: " + f + " length: " + f.length());
       }
       try {
-        GZIPInputStream zis = new GZIPInputStream(new FileInputStream(f), 1 << 16);
-        reader = new BufferedReader(new InputStreamReader(zis), 1 << 16);
+        GZIPInputStream zis = new GZIPInputStream(new FileInputStream(f), BUFFER_SIZE);
+        reader = new BufferedReader(new InputStreamReader(zis, encoding), BUFFER_SIZE);
         return;
       } catch (Exception e) {
         retries++;
@@ -334,6 +335,9 @@
       // Should not get here. Throw runtime exception.
       throw new RuntimeException(e);
     }
+    if (encoding == null) {
+      encoding = "ISO-8859-1";
+    }
   }
 
 }

Modified: lucene/java/trunk/contrib/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds/TrecContentSourceTest.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds/TrecContentSourceTest.java?rev=791528&r1=791527&r2=791528&view=diff
==============================================================================
--- lucene/java/trunk/contrib/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds/TrecContentSourceTest.java (original)
+++ lucene/java/trunk/contrib/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds/TrecContentSourceTest.java Mon Jul  6 15:56:39 2009
@@ -40,7 +40,7 @@
       this.forever = forever;
     }
     
-    protected void openNextFile() throws NoMoreDataException, IOException {
+    void openNextFile() throws NoMoreDataException, IOException {
       if (reader != null) {
         if (!forever) {
           throw new NoMoreDataException();
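
The behaviour this commit introduces (a configured encoding that falls back to a per-source default when it is null) can be sketched outside of Lucene. What follows is a minimal, self-contained illustration and not the committed code: the class `EncodingFallbackSketch` and its methods `resolveEncoding` and `firstLine` are hypothetical names chosen for this example. Only the `encoding == null` fallback and the `InputStreamReader(stream, encoding)` call mirror the patch.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a content source; not part of Lucene.
final class EncodingFallbackSketch {

  // Mirrors the "if (encoding == null) { encoding = ...; }" blocks in the diff:
  // the configured value wins, otherwise the source-specific default is used.
  static String resolveEncoding(String configured, String sourceDefault) {
    return configured == null ? sourceDefault : configured;
  }

  // Decodes the first line of the given bytes with the resolved encoding.
  static String firstLine(byte[] data, String configured, String sourceDefault) {
    String encoding = resolveEncoding(configured, sourceDefault);
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new ByteArrayInputStream(data), encoding))) {
      return reader.readLine();
    } catch (IOException e) {
      // Wrapped so callers need not declare a checked exception.
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // "caf" followed by 0xE9, which is e-acute in ISO-8859-1 but malformed as UTF-8.
    byte[] latin1 = { 'c', 'a', 'f', (byte) 0xE9 };

    // No encoding configured: falls back to ISO-8859-1, as TrecContentSource now does.
    String viaDefault = firstLine(latin1, null, "ISO-8859-1");
    System.out.println(viaDefault.equals("caf\u00e9"));

    // Explicit UTF-8 overrides the default and mis-decodes the last byte.
    String viaUtf8 = firstLine(latin1, "UTF-8", "ISO-8859-1");
    System.out.println(viaUtf8.equals("caf\u00e9"));
  }
}
```

In a benchmark configuration file the override would presumably be a single property line such as `content.source.encoding=UTF-8`; that exact line is an assumption based on the parameter name in the log message, not something shown in this commit.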