Date: Thu, 30 May 2002 16:46:42 -0400
From: "Andrew C. Oliver"
To: Lucene Users List
Subject: Re: MS Word Search ??

Bruce Altner wrote:

> This is a good lead but it prompts me to ask this: if tools like
> openoffice and others (like Acrobat Distiller) know how to reformat
> Excel, PowerPoint, and Word it means that the data formats of these
> files, as streams, must be public knowledge. If so, where do you get
> this information? I would use it to try to build my own parser with
> JavaCC if I knew what bytes in the input stream I needed to tokenize.

I'm not sure what JavaCC has to do with word files.

> Obviously it's not that easy, otherwise we wouldn't need projects like
> POI. But is this correct, at least conceptually?

POI is a bit more ambitious than that. We look to create full APIs for
read/write and modify, not just convert to ASCII.

> Disclaimer: I'm new to JavaCC but the topic interests me so I'm
> willing to put in the time to learn it.

It would be easier to take Ryan's early code for POI::HDF and create a
parser from it. Once HDF is in Beta or so I plan to do so.

-Andy

> Bruce
>
> At 10:50 AM 5/30/2002 -0500, you wrote:
>
>> This might be worth looking into for those who need to parse Word,
>> Excel, PowerPoint, or other Microsoft file types.
>>
>> openoffice - www.openoffice.org knows how to parse all of the microsoft
>> formats (at least all that I've tried so far) - and then, you can do a
>> save as, and write out the open office format, which is a couple of xml
>> files zipped together. So, this makes me think of two possible ways that
>> you could get at the content of the MS files in a text form you can
>> index (neither of which I have tried or even looked to see if they are
>> possible)
>>
>> #1 - get the code for openoffice - it is open source - and use it for
>> parsing the MS documents into xml which could then be indexed
>>
>> #2 - if open office is programmatically drivable (which I don't know if
>> it is), fire up a copy of open office and use it to convert the files
>> as necessary.
>>
>> Just some suggestions. Does anyone know much more about openoffice? I
>> would be interested in knowing if either of these would be feasible.
>>
>> Dan
>>
>> -----Original Message-----
>> From: Ewout Prangsma [mailto:e.prangsma@daisysoftware.com]
>> Sent: Wednesday, May 29, 2002 1:00 PM
>> To: Lucene Users List
>> Subject: Re: MS Word Search ??
>>
>> On Wednesday 29 May 2002 11:56, Karl Øie wrote:
>> > b: convert the documents to something that is accessible through java
>> > like xml, etc...
>>
>> We're using wvWare (wvware.com) to convert word to html (or text) and
>> index that and xpdf for converting PDF to text and index that. Any
>> links on indexing using POI converters (or other java converters) are
>> very welcome!
>>
>> Ewout
>>
>> > the best way is to convert as the java api's for MSOffice documents
>> > still are under development
>> >
>> > Regards, Karl Øie
>> >
>> > On Wednesday 29 May 2002 11:48, Rama Krishna wrote:
>> > > Hi,
>> > >
>> > > I am trying to build a search engine which searches in MS Word,
>> > > excel, ppt and adobe pdf. I am not sure whether I can use Lucene
>> > > for this or not. pl. help me out in this regard.
>> > >
>> > > Regards,
>> > > Ramakrishna
>>
>> --
>> Ewout Prangsma, Director
>> Daisy Software
>> Telephone/fax: +31-77-3270305/3270306
>> Email: e.prangsma@daisysoftware.com
>> Website: www.daisysoftware.com
>> KvK Venlo nr. 12046144
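
For reference, Dan's suggestion #1 in the quoted thread rests on the saved
OpenOffice file being a few XML files zipped together. A rough sketch of
pulling indexable plain text out of such a file might look like the
following. It is only an illustration under stated assumptions: the member
name "content.xml" and the tiny sample document are assumptions made for
the example, the function names are hypothetical, and nobody in the thread
reports having tried this. Python is used here purely for brevity.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def extract_text(zipped_doc, member="content.xml"):
    """Return the text of one XML member of a zip archive, space-joined.

    `member` defaults to "content.xml", which is an assumption about where
    the document body lives inside the saved file.
    """
    with zipfile.ZipFile(zipped_doc) as archive:
        root = ET.fromstring(archive.read(member))
    # itertext() walks the tree in document order, yielding text and tails.
    pieces = (chunk.strip() for chunk in root.itertext())
    return " ".join(piece for piece in pieces if piece)


def make_sample():
    """Build a tiny in-memory stand-in for a saved document (made-up content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "content.xml",
            "<doc><p>Quarterly report</p><p>Revenue <b>rose</b> in May.</p></doc>",
        )
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(extract_text(make_sample()))
```

The returned string is the kind of plain text that would then be handed to
the indexer. Getting from a Word file to the zipped XML in the first place
(the "save as" step) is the part that still needs OpenOffice itself.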