From: "Chandan Tamrakar"
To: "Lucene Users List"
Subject: Re: Searching Microsoft Word , Excel and PPT files for Japanese
Date: Sun, 23 May 2004 19:15:26 +0545
Message-ID: <000901c440ca$1eb4bfb0$564c33ca@neplaptop>
References: <200405211146.i4LBkJF30040@plain.rackshack.net>

Hi,

I used a library from textmining.org, which uses Apache POI, to extract
the text from Word documents. Its API is fairly simple. I wrote the
extracted text to some.txt with UTF-16 encoding (UTF-16LE in the code
below) before indexing. It should work. Let me know about your progress.

    // Extract the text of the Word document (fileToindex is the file being indexed).
    org.textmining.text.extraction.WordExtractor extractor =
        new org.textmining.text.extraction.WordExtractor();
    String s = extractor.extractText(new FileInputStream(fileToindex));

    // Write the extracted text to some.txt as UTF-16LE.
    String encoding = "UTF-16LE";
    OutputStreamWriter out = new OutputStreamWriter(
        new FileOutputStream(new File("some.txt")), encoding);
    out.write(s);
    out.close();

Regards

----- Original Message -----
From: "Ankur Goel"
To: "'Chandan Tamrakar'"; "'Lucene Users List'"
Sent: Friday, May 21, 2004 4:17 PM
Subject: RE: Searching Microsoft Word , Excel and PPT files for Japanese

> Thanks Chandan.
> I tried using POI for text extraction. I used the
> WordDocument.writeAllText method, but it didn't work for Japanese.
> Is there any other way to extract the Japanese text?
> Regards,
> Ankur
>
> -----Original Message-----
> From: Chandan Tamrakar [mailto:chandan@ccnep.com.np]
> Sent: Friday, May 21, 2004 3:51 PM
> To: Lucene Users List; ankurg@brickred.com
> Subject: Re: Searching Microsoft Word , Excel and PPT files for Japanese
>
> For Microsoft Word and Excel documents, use the POI APIs from Apache
> Jakarta. First you need to extract the text and convert it into a
> suitable encoding before you put it into Lucene for indexing.
> It worked for me.
>
> ----- Original Message -----
> From: "Ankur Goel"
> To: "'Lucene Users List'"
> Sent: Thursday, May 20, 2004 10:55 PM
> Subject: Searching Microsoft Word , Excel and PPT files for Japanese
>
> > Hi,
> >
> > I am using the CJK Tokenizer to search Japanese documents. I am able
> > to search Japanese documents that are text files, but I am not able
> > to search Microsoft Word and Excel files with Japanese content.
> >
> > Can you tell me how I can search Japanese content in Microsoft Word,
> > Excel and PPT files?
> >
> > Thanks,
> > Ankur
> >
> > -----Original Message-----
> > From: Ankur Goel [mailto:ankurg@brickred.com]
> > Sent: Sunday, April 04, 2004 1:36 AM
> > To: 'Lucene Users List'
> > Subject: RE: Boolean Phrase Query question
> >
> > Thanks Erik for the solution. I have to keep the fileName field
> > searchable because I have to let the end user search on the file
> > name as well. That's why I am using Text for the file name too.
> >
> > "By using true on the finalQuery.add calls, you have said that both
> > fields must have the word "temp" in them. Is that what you meant?
> > Or did you mean an OR type of query?"
> >
> > I need an OR type of query. I mean the word can be in the file name
> > or in the contents of the file. But I am not able to do this. Can
> > you tell me how to do it?
> >
> > Regards,
> > Ankur
> >
> > -----Original Message-----
> > From: Erik Hatcher [mailto:erik@ehatchersolutions.com]
> > Sent: Sunday, April 04, 2004 1:27 AM
> > To: Lucene Users List
> > Subject: Re: Boolean Phrase Query question
> >
> > On Apr 3, 2004, at 12:13 PM, Ankur Goel wrote:
> > >
> > > Hi,
> > > I have to provide functionality that searches both the file
> > > name and the contents of the file.
> > >
> > > For indexing I use the following code:
> > >
> > > org.apache.lucene.document.Document doc =
> > >     new org.apache.lucene.document.Document();
> > > doc.add(Field.Keyword("fileId", "" + document.getFileId()));
> > > doc.add(Field.Text("fileName", fileName));
> > > doc.add(Field.Text("contents", new FileReader(new File(fileName))));
> >
> > I'm not sure what you plan on doing with the fileName field, but you
> > probably want to use a Keyword field for it.
> >
> > And you may want to glue the file name and contents together into a
> > single field to facilitate searches that span both. (Be sure to put
> > a space in between if you do this.)
> >
> > > To search for a text, say "temp", I use the following code to look
> > > in both the file name and the contents of the file:
> > >
> > > BooleanQuery finalQuery = new BooleanQuery();
> > > Query titleQuery = QueryParser.parse("temp", "fileName", analyzer);
> > > Query mainQuery = QueryParser.parse("temp", "contents", analyzer);
> > >
> > > finalQuery.add(titleQuery, true, false);
> > > finalQuery.add(mainQuery, true, false);
> > >
> > > Hits hits = is.search(finalQuery);
> >
> > By using true on the finalQuery.add calls, you have said that both
> > fields must have the word "temp" in them. Is that what you meant?
> > Or did you mean an OR type of query?
> >
> > Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
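
A note on reading the converted file back for indexing: the indexing code quoted in this thread opens the file with FileReader, which decodes with the platform default encoding, so a file written as UTF-16LE would likely come out garbled unless the reader names the same encoding. The following is only a minimal sketch using the standard Java I/O classes. The class name Utf16RoundTrip and the helpers writeText, openReader and roundTrip are hypothetical names chosen for illustration and do not come from the thread or from any library.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf16RoundTrip {
    // Assumption: the same encoding is used on the writing and the reading side.
    static final Charset ENCODING = StandardCharsets.UTF_16LE;

    // Write already-extracted text to a file in the chosen encoding.
    static void writeText(Path file, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(Files.newOutputStream(file), ENCODING)) {
            out.write(text);
        }
    }

    // Open the file with an explicit encoding instead of FileReader's platform default.
    static Reader openReader(Path file) throws IOException {
        return new InputStreamReader(Files.newInputStream(file), ENCODING);
    }

    // Write the text to a temporary file, read it back, and return what was read.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("extracted", ".txt");
            try {
                writeText(file, text);
                StringBuilder sb = new StringBuilder();
                try (Reader in = openReader(file)) {
                    int c;
                    while ((c = in.read()) != -1) {
                        sb.append((char) c);
                    }
                }
                return sb.toString();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes so the source file encoding does not matter.
        String japanese = "\u65e5\u672c\u8a9e";
        System.out.println(japanese.equals(roundTrip(japanese)));
    }
}
```

If this matches your setup, the Reader returned by openReader could be passed to Field.Text("contents", ...) in place of the FileReader. Whether that is enough for the CJK tokenizer to match the Japanese terms is something to verify against your own index.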