lucene-java-user mailing list archives

From "Chandan Tamrakar" <chan...@ccnep.com.np>
Subject Re: Searching Microsoft Word , Excel and PPT files for Japanese
Date Sun, 23 May 2004 13:30:26 GMT
Hi,
  I used the library from textmining.org, which is built on Apache POI, to
extract text from Word documents. It has a fairly simple API. I then wrote the
extracted text to some.txt in UTF-16 encoding before indexing.

 It should work. Let me know about your progress.



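 // Needs java.io.* plus the textmining.org extractor jar (built on Apache POI).
 org.textmining.text.extraction.WordExtractor extractor =
     new org.textmining.text.extraction.WordExtractor();
 FileInputStream in = new FileInputStream(fileToindex);
 String s = extractor.extractText(in);  // extracted text as a Java String
 in.close();
 // Write the text as UTF-16LE; whatever reads some.txt at index time
 // must decode it with the same encoding.
 String encoding = "UTF-16LE";
 OutputStreamWriter out = new OutputStreamWriter(
     new FileOutputStream(new File("some.txt")), encoding);
 out.write(s);
 out.close();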
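
To check the encoding step on its own, without POI or Lucene, here is a minimal sketch of the same write-then-read round trip. It assumes a current JDK (try-with-resources, StandardCharsets) and keeps everything in memory instead of using some.txt. The class name EncodingRoundTrip and the sample string are made up for illustration. The point is that the Reader you hand to the indexer has to name the same charset that was used for writing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {
    // Encode text through an OutputStreamWriter, as the snippet above does,
    // but into a byte array instead of some.txt.
    public static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decode bytes through an InputStreamReader with an explicit charset.
    public static String decode(byte[] data, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample text: six Japanese characters.
        String japanese = "\u65e5\u672c\u8a9e\u306e\u6587\u66f8";
        byte[] data = encode(japanese, StandardCharsets.UTF_16LE);
        // Same charset on both sides: the text survives.
        System.out.println(decode(data, StandardCharsets.UTF_16LE).equals(japanese));
        // Mismatched charset on the reading side: the text is garbled.
        System.out.println(decode(data, StandardCharsets.ISO_8859_1).equals(japanese));
    }
}
```

This prints true and then false. If searches on the converted files return nothing, a mismatch like the second case is worth ruling out first.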


regards


----- Original Message ----- 
From: "Ankur Goel" <ankurg@brickred.com>
To: "'Chandan Tamrakar'" <chandan@ccnep.com.np>; "'Lucene Users List'"
<lucene-user@jakarta.apache.org>
Sent: Friday, May 21, 2004 4:17 PM
Subject: RE: Searching Microsoft Word , Excel and PPT files for Japanese


> Thanks, Chandan.
> I tried using POI for text extraction. I used the
> WordDocument.writeAllText method, but it didn't work for Japanese.
> Is there any other way to extract the Japanese text?
> Regards,
> Ankur
>
> -----Original Message-----
> From: Chandan Tamrakar [mailto:chandan@ccnep.com.np]
> Sent: Friday, May 21, 2004 3:51 PM
> To: Lucene Users List; ankurg@brickred.com
> Subject: Re: Searching Microsoft Word , Excel and PPT files for Japanese
>
>    For Microsoft Word and Excel documents, use the POI APIs from Apache
> Jakarta.
>    First you need to extract the text and convert it into a suitable encoding
> before you put it into Lucene for indexing.
>    It worked for me.
>
>
> ----- Original Message -----
> From: "Ankur Goel" <ankurg@brickred.com>
> To: "'Lucene Users List'" <lucene-user@jakarta.apache.org>
> Sent: Thursday, May 20, 2004 10:55 PM
> Subject: Searching Microsoft Word , Excel and PPT files for Japanese
>
>
> > Hi,
> >
> > I am using the CJK Tokenizer for searching Japanese documents. I am able
> > to search Japanese documents that are text files, but I am not able to
> > search Microsoft Word and Excel files with Japanese content.
> >
> > Can you tell me how I can search Japanese content in Microsoft Word,
> > Excel and PPT files?
> >
> > Thanks,
> > Ankur
> >
> > -----Original Message-----
> > From: Ankur Goel [mailto:ankurg@brickred.com]
> > Sent: Sunday, April 04, 2004 1:36 AM
> > To: 'Lucene Users List'
> > Subject: RE: Boolean Phrase Query question
> >
> > Thanks Erik for the solution. I need the fileName field because I have to
> > let the end user search on file name as well. That's why I am using Text
> > for the file name too.
> >
> > "By using true on the finalQuery.add calls, you have said that both fields
> > must have the word "temp" in them.  Is that what you meant?  Or did you
> > mean an OR type of query?"
> >
> > I need an OR type of query. I mean the word can be in the file name or in
> > the contents of the file. But I am not able to do this. Can you tell me
> > how to do it?
> >
> > Regards,
> > Ankur
> >
> > -----Original Message-----
> > From: Erik Hatcher [mailto:erik@ehatchersolutions.com]
> > Sent: Sunday, April 04, 2004 1:27 AM
> > To: Lucene Users List
> > Subject: Re: Boolean Phrase Query question
> >
> > On Apr 3, 2004, at 12:13 PM, Ankur Goel wrote:
> > >
> > > Hi,
> > > I have to provide a functionality which provides search on both file
> > > name and contents of the file.
> > >
> > > For indexing I use the following code:
> > >
> > >
> > > org.apache.lucene.document.Document doc =
> > >     new org.apache.lucene.document.Document();
> > > doc.add(Field.Keyword("fileId", "" + document.getFileId()));
> > > doc.add(Field.Text("fileName", fileName));
> > > doc.add(Field.Text("contents", new FileReader(new File(fileName))));
> >
> > I'm not sure what you plan on doing with the fileName field, but you
> > probably want to use a Keyword field for it.
> >
> > And you may want to glue the file name and contents together into a single
> > field to facilitate searches to span both.  (be sure to put a space in
> > between if you do this)
> >
> > > For searching a text say  "temp" I use the following code to look both
> > > in file Name and contents of the file:
> > >
> > > BooleanQuery finalQuery = new BooleanQuery();
> > > Query titleQuery = QueryParser.parse("temp","fileName",analyzer);
> > > Query mainQuery = QueryParser.parse("temp","contents",analyzer);
> > >
> > > finalQuery.add(titleQuery, true, false);
> > > finalQuery.add(mainQuery, true, false);
> > >
> > > Hits hits = is.search(finalQuery);
> >
> > By using true on the finalQuery.add calls, you have said that both fields
> > must have the word "temp" in them.  Is that what you meant?  Or did you
> > mean an OR type of query?
> >
> > Erik
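
To make the required-versus-optional distinction concrete, here is a small plain-Java model of how the two boolean flags in the finalQuery.add calls above combine. It is only a sketch: BooleanClauseDemo, Clause and accepts are hypothetical names, not Lucene classes, and the model ignores scoring. It assumes, as Erik's reply indicates, that the second argument marks a clause as required and the third as prohibited. Under that reading, passing false for both flags on both clauses should give the OR behaviour asked about.

```java
import java.util.Arrays;
import java.util.List;

public class BooleanClauseDemo {
    /** Hypothetical stand-in for one clause of a boolean query; not a Lucene class. */
    public static final class Clause {
        final boolean matches;    // does the clause's term occur in the document?
        final boolean required;   // second argument of finalQuery.add(...)
        final boolean prohibited; // third argument of finalQuery.add(...)

        public Clause(boolean matches, boolean required, boolean prohibited) {
            this.matches = matches;
            this.required = required;
            this.prohibited = prohibited;
        }
    }

    /** Returns true if a document with the given per-clause matches is accepted. */
    public static boolean accepts(List<Clause> clauses) {
        boolean sawRequired = false;
        boolean optionalHit = false;
        for (Clause c : clauses) {
            if (c.prohibited) {
                if (c.matches) return false;   // a prohibited clause matched
            } else if (c.required) {
                sawRequired = true;
                if (!c.matches) return false;  // a required clause is missing
            } else if (c.matches) {
                optionalHit = true;            // optional clause matched
            }
        }
        // With no required clauses, at least one optional clause must match.
        return sawRequired || optionalHit;
    }

    public static void main(String[] args) {
        // A document where "temp" occurs in fileName but not in contents.
        List<Clause> bothRequired = Arrays.asList(
                new Clause(true, true, false), new Clause(false, true, false));
        List<Clause> bothOptional = Arrays.asList(
                new Clause(true, false, false), new Clause(false, false, false));
        System.out.println(accepts(bothRequired)); // AND semantics
        System.out.println(accepts(bothOptional)); // OR semantics
    }
}
```

This prints false and then true: with both clauses required the document is rejected, and with both optional a hit in either field is enough.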
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >
> >
>
>
>




