lucene-java-user mailing list archives

From Girish Naik <gir...@neevtech.com>
Subject Re: Help in Arabic Analysers with Lucene on Windows
Date Mon, 29 Dec 2008 16:25:05 GMT
Thanks, Grant. I will check this out.

BTW, as for the Lucene version: I checked out Lucene from svn and 
built it myself, and the build reports its version as 2.9 :). Luke is 
version 0.9.1.


------------------------------------------------------------------------

Regards,

*Girish Naik*
Development Lead

*Neev Information Technologies Pvt Ltd* <http://www.neevtech.com>
Bangalore, Karnataka India

*Mobile:* 91 09740091638
*Email:* girish.naik@gmail.com <mailto:girish.naik@gmail.com>
*IM:* girish.naik (Skype)
*http://www.linkedin.com/in/girishnaik*


On 12/29/2008 9:19 PM, Grant Ingersoll wrote:
>
> On Dec 29, 2008, at 9:59 AM, Girish Naik wrote:
>
>> FIELD_BODY is defined as
>> public static final String FIELD_BODY = "AVS_FIELD_BODY";
>> and it's indexed as
>>  ParsedDoc webdoc = ParsedDoc.getDoc(page);
>> ...
>> document.add(new Field(Constants.FIELD_BODY, webdoc.getContents(), 
>> Field.Store.NO, Field.Index.ANALYZED));
>>
>> -x-x-x-
>> public static ParsedDoc getDoc(Page page) {
>>         try {
>>             // Drop any parameters (e.g. "; charset=...") from the content type
>>             String raw = page.getContentType();
>>             int semicolon = raw.indexOf(";");
>>             if (semicolon > -1) {
>>                 raw = raw.substring(0, semicolon);
>>             }
>>             MimeType mime = new MimeType(raw);
>>             String contentType = mime.getBaseType();
>>             // Look up a parser for the full type, then for the primary type
>>             String classname = getImpl(contentType);
>>             if (classname == null) {
>>                 classname = getImpl(mime.getPrimaryType());
>>                 if (classname == null) {
>>                     return null;
>>                 }
>>             }
>>             Class<?> webdocClass = Class.forName(classname);
>>             ParsedDoc webdoc = (ParsedDoc) webdocClass.newInstance();
>>             webdoc.page = page;
>>             webdoc.contentType = contentType;
>>             return webdoc;
>>         } catch (Exception e) {
>>             _log.error("Error while parsing file: " + page.toURL());
>>             throw new SysException(e.getMessage());
>>         }
>> }
>> -x-x-x-
>>
>> And Luke is currently not able to open the index files written by 
>> Lucene. On my colleague's system it did open, but no Arabic content 
>> was found; instead, characters like 
>> &#1575;&#1604;&#1575;&#1587;&#1578;&#1585;&#1575;&#1578;&#1610;&#1580; .. etc. 
>> were found.
>
> I am now pretty sure it is an encoding issue.  I'm guessing that 
> however you are getting the page, it is not in the right encoding.  
> How do you obtain the Page object?  I'm guessing you are crawling.  
> You need to make sure you are getting the encoding of the file and 
> opening it with that encoding.
>
> Something like:
> Reader reader = new InputStreamReader(new FileInputStream(file), 
> encoding);
>
> where encoding is the encoding of the file.
>
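A minimal sketch of the suggestion above, assuming the encoding can be taken from the charset parameter of the page's Content-Type value. The class name EncodingSketch and the helpers charsetOf and readAll are hypothetical names used only for illustration; they are not part of Lucene or of the code in this thread. Only the InputStreamReader approach comes from the message itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class EncodingSketch {

    // Hypothetical helper: extract the charset from a Content-Type value such as
    // "text/html; charset=UTF-8". Returns the fallback when the parameter is
    // missing or names a charset this JVM does not support.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Decode the whole stream with an explicit charset instead of the platform default.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Ten Arabic letters, written as escapes so the source file encoding does not matter.
        String arabic = "\u0627\u0644\u0627\u0633\u062A\u0631\u0627\u062A\u064A\u062C";
        byte[] bytes = arabic.getBytes(StandardCharsets.UTF_8);

        Charset declared = charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1);
        String right = readAll(new ByteArrayInputStream(bytes), declared);
        String wrong = readAll(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(arabic)); // true
        System.out.println(wrong.equals(arabic)); // false: bytes decoded with the wrong charset
    }
}
```

Note that code which cuts the content type at the first semicolon discards exactly the parameter this sketch reads, so the charset would need to be captured before that step.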
>
>>
>> On my local machine it now gives 'Unknown format version: -8', the 
>> same error my colleague got when trying to open an index from a 
>> Linux system where search was working fine.
>
> What version of Lucene are you using and what version of Luke?  This 
> usually happens, I believe, when the Luke version is older than the 
> Lucene version used to create the index.
>
