From: Chas Emerick
Subject: Re: pdf in Chinese
Date: Wed, 8 Sep 2004 10:19:40 -0400
To: "Lucene Users List"

I'm not aware of any Java library that can reliably extract Chinese text from PDF documents.
We're planning on supporting Chinese, Japanese, and Korean in version 2 of PDFTextStream, but there's no doubt that it's a huge challenge.

Chas Emerick | cemerick@snowtide.com
PDFTextStream: fast PDF text extraction for Java applications
http://snowtide.com/home/PDFTextStream/

On Sep 8, 2004, at 5:58 AM, WuDG@infoPro.cn wrote:

> It is not about the analyzer; I need to read the text from the PDF file first.
>
> ----- Original Message -----
> From: "Chandan Tamrakar"
> To: "Lucene Users List"
> Sent: Wednesday, September 08, 2004 4:15 PM
> Subject: Re: pdf in Chinese
>
>> Which analyzer are you using to index Chinese PDF documents?
>> I think you should use CJKAnalyzer.
>>
>> ----- Original Message -----
>> From: "WuDG@infoPro.cn"
>> To:
>> Sent: Wednesday, September 08, 2004 11:27 AM
>> Subject: pdf in Chinese
>>
>>> Hi all,
>>> I use PDFBox to parse PDF files into Lucene documents. When I parse
>>> Chinese PDF files, PDFBox does not always succeed.
>>> Does anyone have any advice?
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org