Return-Path: Delivered-To: apmail-jakarta-lucene-user-archive@www.apache.org Received: (qmail 63224 invoked from network); 10 Nov 2004 17:24:32 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (209.237.227.199) by minotaur-2.apache.org with SMTP; 10 Nov 2004 17:24:32 -0000 Received: (qmail 51861 invoked by uid 500); 10 Nov 2004 17:24:07 -0000 Delivered-To: apmail-jakarta-lucene-user-archive@jakarta.apache.org Received: (qmail 51833 invoked by uid 500); 10 Nov 2004 17:24:06 -0000 Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm Precedence: bulk List-Unsubscribe: List-Subscribe: List-Help: List-Post: List-Id: "Lucene Users List" Reply-To: "Lucene Users List" Delivered-To: mailing list lucene-user@jakarta.apache.org Received: (qmail 51785 invoked by uid 99); 10 Nov 2004 17:24:04 -0000 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests= X-Spam-Check-By: apache.org Received-SPF: pass (hermes.apache.org: local policy) Received: from [81.255.81.93] (HELO serveur1.itldev.info) (81.255.81.93) by apache.org (qpsmtpd/0.28) with ESMTP; Wed, 10 Nov 2004 09:24:02 -0800 Received: from localhost (localhost.localdomain [127.0.0.1]) by serveur1.itldev.info (Postfix) with ESMTP id AFFD527E1B for ; Wed, 10 Nov 2004 18:23:52 +0100 (CET) Received: from serveur1.itldev.info ([127.0.0.1]) by localhost (serveur1.itldev.info [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 23843-05 for ; Wed, 10 Nov 2004 18:23:52 +0100 (CET) Received: from thierry (unknown [192.168.2.1]) by serveur1.itldev.info (Postfix) with ESMTP id 788E627E19 for ; Wed, 10 Nov 2004 18:23:52 +0100 (CET) Message-ID: <00b101c4c74a$099259b0$0102a8c0@thierry> From: "Thierry Ferrero" To: "Lucene Users List" References: <20041110165406.41443.qmail@web12704.mail.yahoo.com> <031e01c4c746$9abca930$7703d00a@hypermedia.com> Subject: Re: Indexing MS Files Date: Wed, 10 Nov 2004 18:23:47 +0100 MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2800.1437 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1441 X-Virus-Scanned: by amavisd-new at itldev.info X-Virus-Checked: Checked X-Spam-Rating: minotaur-2.apache.org 1.6.2 0/1000/N I used OpenOffice API to convert all Word and Excel version. For me it's "the solution" for complex Word and Excel document. http://api.openoffice.org/ Good luck ! // UNO API import com.sun.star.bridge.XUnoUrlResolver; import com.sun.star.uno.XComponentContext; import com.sun.star.uno.UnoRuntime; import com.sun.star.frame.XComponentLoader; import com.sun.star.frame.XStorable; import com.sun.star.beans.PropertyValue; import com.sun.star.beans.XPropertySet; import com.sun.star.lang.XComponent; import com.sun.star.lang.XMultiComponentFactory; import com.sun.star.connection.NoConnectException; import com.sun.star.io.IOException; /** This class implements a http servlet in order to convert an incoming document * with help of a running OpenOffice.org and to push the converted file back * to the client. */ public class DocConverter { private String stringHost; private String stringPort; private Xcontext xcontext; private Xbase xbase; public DocConverter(Xbase xbase,Xcontext xcontext,ServletContext sc) { this.xbase=xbase; this.xcontext=xcontext; stringHost=ApplicationUtil.getParameter(sc,"openoffice.oohost"); stringPort=ApplicationUtil.getParameter(sc,"openoffice.ooport"); } public synchronized String convertToTxt(String namedoc, String pathdoc, String stringConvertType, String stringExtension) { String stringConvertedFile = this.convertDocument(namedoc, pathdoc, stringConvertType, stringExtension); return stringConvertedFile; } /** This method converts a document to a given type by using a running * OpenOffice.org and saves the converted document to the specified * working directory. * @param stringDocumentName The full path name of the file on the server to be converted. * @param stringConvertType Type to convert to. * @param stringExtension This string will be appended to the file name of the converted file. * @return The full path name of the converted file will be returned. * @see stringWorkingDirectory */ private String convertDocument(String namedoc, String pathdoc, String stringConvertType, String stringExtension ) { String tagerr=""; String stringUrl=""; String stringConvertedFile = ""; // Converting the document to the favoured type try { tagerr="0"; // Composing the URL - suppression de l'extension stringUrl = pathdoc+"/"+namedoc; stringUrl=stringUrl.replace( '\\', '/' ); /* Bootstraps a component context with the jurt base components registered. Component context to be granted to a component for running. Arbitrary values can be retrieved from the context. */ XComponentContext xcomponentcontext = com.sun.star.comp.helper.Bootstrap.createInitialComponentContext( null ); /* Gets the service manager instance to be used (or null). This method has been added for convenience, because the service manager is a often used object. */ XMultiComponentFactory xmulticomponentfactory = xcomponentcontext.getServiceManager(); tagerr="2"; /* Creates an instance of the component UnoUrlResolver which supports the services specified by the factory. */ Object objectUrlResolver = xmulticomponentfactory.createInstanceWithContext( "com.sun.star.bridge.UnoUrlResolver", xcomponentcontext ); // Create a new url resolver XUnoUrlResolver xurlresolver = ( XUnoUrlResolver ) UnoRuntime.queryInterface( XUnoUrlResolver.class, objectUrlResolver ); // Resolves an object that is specified as follow: // uno:;; Object objectInitial = xurlresolver.resolve( "uno:socket,host=" + stringHost + ",port=" + stringPort + ";urp;StarOffice.ServiceManager" ); // Create a service manager from the initial object xmulticomponentfactory = ( XMultiComponentFactory ) UnoRuntime.queryInterface( XMultiComponentFactory.class, objectInitial ); // Query for the XPropertySet interface. XPropertySet xpropertysetMultiComponentFactory = ( XPropertySet ) UnoRuntime.queryInterface( XPropertySet.class, xmulticomponentfactory ); // Get the default context from the office server. Object objectDefaultContext = xpropertysetMultiComponentFactory.getPropertyValue( "DefaultContext" ); // Query for the interface XComponentContext. xcomponentcontext = ( XComponentContext ) UnoRuntime.queryInterface( XComponentContext.class, objectDefaultContext ); /* A desktop environment contains tasks with one or more frames in which components can be loaded. Desktop is the environment for components which can instanciate within frames. */ XComponentLoader xcomponentloader = ( XComponentLoader ) UnoRuntime.queryInterface( XComponentLoader.class, xmulticomponentfactory.createInstanceWithContext( "com.sun.star.frame.Desktop", xcomponentcontext ) ); // Preparing properties for loading the document PropertyValue propertyvalue[] = new PropertyValue[ 1 ]; // Setting the flag for hidding the open document propertyvalue[ 0 ] = new PropertyValue(); propertyvalue[ 0 ].Name = "Hidden"; propertyvalue[ 0 ].Value = new Boolean(true); // Loading the wanted document Object objectDocumentToStore = xcomponentloader.loadComponentFromURL( stringUrl, "_blank", 0, propertyvalue ); // Getting an object that will offer a simple way to store a document to a URL. XStorable xstorable = ( XStorable ) UnoRuntime.queryInterface( XStorable.class, objectDocumentToStore ); // Preparing properties for converting the document propertyvalue = new PropertyValue[2]; // Setting the flag for overwriting propertyvalue[0] = new PropertyValue(); propertyvalue[0].Name = "Overwrite"; propertyvalue[0].Value = new Boolean(true); // Setting the filter name propertyvalue[1] = new PropertyValue(); propertyvalue[1].Name = "FilterName"; propertyvalue[1].Value = stringConvertType; // Appending the favoured extension to the origin document name //if(stringUrl.lastIndexOf(".")!=0){ //stringUrl=stringUrl.substring(0,stringUrl.lastIndexOf(".")); //} if(namedoc.lastIndexOf(".")!=-1){ namedoc=namedoc.substring(0,namedoc.lastIndexOf(".")); } //stringConvertedFile = stringUrl + "." + stringExtension; stringConvertedFile=xbase.getAlias("local")+"/oo_tmp/"+namedoc+"."+stringExt ension; stringConvertedFile=stringConvertedFile.replace( '\\', '/' ); // Storing and converting the document xstorable.storeToURL( stringConvertedFile, propertyvalue ); // Getting the method dispose() for closing the document XComponent xcomponent = ( XComponent ) UnoRuntime.queryInterface( XComponent.class, xstorable ); // Closing the converted document xcomponent.dispose(); } catch(NoConnectException ex ) { return( "" ); } catch( IOException ex ) { return( "" ); } catch( Exception ex ) { return( "" ); } // Returning the name of the converted file return( stringConvertedFile ); } ----- Original Message ----- From: "Luke Shannon" To: "Lucene Users List" Sent: Wednesday, November 10, 2004 5:59 PM Subject: Re: Indexing MS Files > Thanks Otis. I am looking forward to this book. Any idea when it may be > released? > > ----- Original Message ----- > From: "Otis Gospodnetic" > To: "Lucene Users List" > Sent: Wednesday, November 10, 2004 11:54 AM > Subject: Re: Indexing MS Files > > > > That's one place to start. The other one would be textmining.org, at > > least for Word files. > > I used both POI and Textmining API in Lucene in Action, and the latter > > was much simpler to use. You can also find some comments about both > > libs in lucene-user archives. People tend to like Textmining API > > better. > > > > Otis > > > > --- Luke Shannon wrote: > > > > > I need to index Word, Excel and Power Point files. > > > > > > Is this the place to start? > > > > > > http://jakarta.apache.org/poi/ > > > > > > Is there something better? > > > > > > Thanks, > > > > > > Luke > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org > > For additional commands, e-mail: lucene-user-help@jakarta.apache.org > > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org > For additional commands, e-mail: lucene-user-help@jakarta.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org For additional commands, e-mail: lucene-user-help@jakarta.apache.org