From: Owen Densmore
To: lucene-user@jakarta.apache.org
Subject: Re: Lucene Unicode Usage
Date: Fri, 11 Feb 2005 15:19:16 -0700

Bingo! I used the InputStreamReader and that fixed the index. Boy,
tough to catch all the holes through which unicode leaks occur!

Owen

From: aurora
Date: February 9, 2005 11:04:35 PM MST
To: lucene-user@jakarta.apache.org
Subject: Re: Lucene Unicode Usage

So you got a utf8 encoded text file. But how do you read the file into
Java? The default encoding of Java is likely to be something other than
utf8. Make sure you specify the encoding like:

    InputStreamReader( new FileInputStream(filename), "UTF-8");

From: Andrzej Bialecki
Date: February 10, 2005 2:54:56 AM MST
To: Lucene Users List
Subject: Re: Lucene Unicode Usage

Owen Densmore wrote:
> I'm building an index from a FileMaker database by dumping the data to
> a tab-separated file. Because the FileMaker output is encoded in
> MacRoman, and uses Mac line separators, I run a script across the tab
> file to clean it up:
>   tr '\r\v' '\n ' | iconv -f MAC -t UTF-8
> This basically converts the Mac \r's to \n's, replaces FileMaker's
> vtabs (for inter-field CRs) with blanks, and runs a character
> converter to build utf-8 data for Java to use. Looks fine in jEdit
> and BBEdit, both of which understand UTF.

However, it matters how you have read in the files in your Java
application. Did you use InputStreamReader with the default platform
encoding (probably 8859-1), or did you specify UTF-8 explicitly?

> BUT -- when I look at the indexes created in Lucene using Luke, I get
> unprintable letters! Writing programs to dump the terms (using Writer

By default Luke uses the standard platform-specific font "dialog". On
Windows this font doesn't support Unicode glyphs, so you will see just
blanks (or rectangles). In the upcoming release you will be able to
select the display font.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
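
For anyone finding this thread later, the InputStreamReader point can be illustrated with a small sketch. It is not the code Owen actually ran; the class name Utf8ReadDemo, the helper readAll, and the sample string are made up for illustration. It also uses StandardCharsets, which needs a newer JDK than the one current when this thread was written; on an older JDK the "UTF-8" string form shown in the thread does the same job. The sketch writes UTF-8 bytes to a temporary file, then decodes them once as UTF-8 and once as ISO-8859-1 to show how the wrong charset silently corrupts the text.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8ReadDemo {

    // Hypothetical helper: read a whole file into a String, decoding the
    // bytes with the charset the caller names instead of the platform default.
    static String readAll(String filename, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(filename), charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("utf8-demo", ".txt");
        try {
            // Sample tab-separated text with two non-ASCII letters,
            // written to disk as UTF-8 bytes.
            String original = "caf\u00e9\tna\u00efve";
            Files.write(tmp, original.getBytes(StandardCharsets.UTF_8));

            // Decoding with the charset the file was written in.
            String good = readAll(tmp.toString(), StandardCharsets.UTF_8);
            // Decoding the same bytes as ISO-8859-1: each two-byte UTF-8
            // sequence turns into two unrelated characters.
            String bad = readAll(tmp.toString(), StandardCharsets.ISO_8859_1);

            System.out.println("UTF-8 round trip intact: " + good.equals(original));
            System.out.println("ISO-8859-1 decode intact: " + bad.equals(original));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The mis-decoded string comes back longer than the original because every accented letter becomes two characters, and those are the terms that would end up in the index. Naming the charset at the reader is what keeps the result independent of the platform default.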