lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ankur Goel" <ank...@brickred.com>
Subject RE: Searching Microsoft Word , Excel and PPT files for Japanese
Date Thu, 20 May 2004 17:18:26 GMT
Hi 

Can you tell me how to convert the Windows-1252 characters/entities to
unicode (UTF-8 or UTF-16). Sorry I am new to this.

Looks like first I will have to parse the text out of these files. I tried
Jakarta PI also But for japans it was also not very good.

Regards,
Ankur 

-----Original Message-----
From: wallen@Cyveillance.com [mailto:wallen@Cyveillance.com] 
Sent: Thursday, May 20, 2004 10:43 PM
To: lucene-user@jakarta.apache.org
Subject: RE: Searching Microsoft Word , Excel and PPT files for Japanese

I believe MS apps store non-ascii characters as entities internally instead
of using unicode.  You can see evidence of this if you save your file as an
HTML file and look at the source.  You will have to adjust your parser to
convert the Windows-1252 characters/entities to unicode (UTF-8 or UTF-16).

-Will

-----Original Message-----
From: Ankur Goel [mailto:ankurg@brickred.com]
Sent: Thursday, May 20, 2004 1:10 PM
To: 'Lucene Users List'
Subject: Searching Microsoft Word , Excel and PPT files for Japanese


Hi,

I am using CJK Tokenzier for searching the Japanese documents.  I am able to
search japanese documents which are text files. But I am not able to search
from Microsoft word, excel files with content in Japanese. 

Can you tell me how can search on Japanese content for Microsoft word, excel
and ppt files.

Thanks,
Ankur  

-----Original Message-----
From: Ankur Goel [mailto:ankurg@brickred.com]
Sent: Sunday, April 04, 2004 1:36 AM
To: 'Lucene Users List'
Subject: RE: Boolean Phrase Query question

Thanks Eric for the solution. I have to filename field as I have to give the
end user facility to search on File Name also. That's   why I am using TEXT
for file Name also.

"By using true on the finalQuery.add calls, you have said that both fields
must have the word "temp" in them.  Is that what you meant?  Or did you mean
an OR type of query?"

I need an OR type of query. I mean the word can be in the filename or in the
contents of the filename. But i am not able to do this. Can you tell me how
to do it?

Regards,
Ankur 

-----Original Message-----
From: Erik Hatcher [mailto:erik@ehatchersolutions.com]
Sent: Sunday, April 04, 2004 1:27 AM
To: Lucene Users List
Subject: Re: Boolean Phrase Query question

On Apr 3, 2004, at 12:13 PM, Ankur Goel wrote:
>
> Hi,
> I have to provide a functionality which provides search on both file 
> name and contents of the file.
>
> For indexing I use the following code:
>
>
> org.apache.lucene.document.Document doc = new org.apache.
> lucene.document.Document();
> doc.add(Field.Keyword("fileId","" + document.getFileId())); 
> doc.add(Field.Text("fileName",fileName);
> doc.add(Field.Text("contents", new FileReader(new File(fileName)));

I'm not sure what you plan on doing with the fileName field, but you
probably want to use a Keyword field for it.

And you may want to glue the file name and contents together into a single
field to facilitate searches to span both.  (be sure to put a space in
between if you do this)

> For searching a text say  "temp" I use the following code to look both 
> in file Name and contents of the file:
>
> BooleanQuery finalQuery = new BooleanQuery(); Query titleQuery = 
> QueryParser.parse("temp","fileName",analyzer);
> Query mainQuery = QueryParser.parse("temp","contents",analyzer);
>
> finalQuery.add(titleQuery, true, false); finalQuery.add(mainQuery, 
> true, false);
>
> Hits hits = is.search(finalQuery);

By using true on the finalQuery.add calls, you have said that both fields
must have the word "temp" in them.  Is that what you meant?  Or did you mean
an OR type of query?

	Erik



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message