lucene-java-user mailing list archives

From Girish Naik <gir...@neevtech.com>
Subject Re: Help in Arabic Analysers with Lucene on Windows
Date Mon, 29 Dec 2008 14:59:05 GMT
FIELD_BODY is defined as

public static final String FIELD_BODY = "AVS_FIELD_BODY";

and it's indexed as

  ParsedDoc webdoc = ParsedDoc.getDoc(page);
...
document.add(new Field(Constants.FIELD_BODY, webdoc.getContents(), Field.Store.NO, Field.Index.ANALYZED));

-x-x-x-
public static ParsedDoc getDoc(Page page) {
		try {
			// Drop any parameters (e.g. "; charset=...") from the content type.
			String raw = page.getContentType();
			int semicolon = raw.indexOf(";");
			if (semicolon > -1) {
				raw = raw.substring(0, semicolon);
			}
			MimeType mime = new MimeType(raw);
			String contentType = mime.getBaseType();
			// Look up a parser class for the full type, then for the primary type.
			String classname = getImpl(contentType);
			if (classname == null) {
				classname = getImpl(mime.getPrimaryType());
				if (classname == null) {
					return null;
				}
			}
			Class<?> webdocClass = Class.forName(classname);
			ParsedDoc webdoc = (ParsedDoc) webdocClass.newInstance();
			webdoc.page = page;
			webdoc.contentType = contentType;
			return webdoc;
		} catch (Exception e) {
			_log.error("Error while parsing file: " + page.toURL());
			throw new SysException(e.getMessage());
		}
	}
-x-x-x-
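
A note on getDoc: it cuts the content type at the semicolon, which is where a charset parameter (if the server sends one) would sit. Whether the charset is picked up elsewhere is not shown here. As a minimal sketch only, assuming the page body is available as raw bytes and assuming Java 7+ for StandardCharsets, the charset could be read from the content type and used for decoding like this. The class and method names (ContentTypeCharset, charsetOf, decode) are made up for illustration and are not part of the application:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {

    /** Returns the charset named in a Content-Type value, or the fallback if none is usable. */
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by semicolons.
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Decodes the raw body with the declared charset (UTF-8 if none is declared). */
    public static String decode(byte[] body, String contentType) {
        return new String(body, charsetOf(contentType, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // U+0627 U+0644 are the first two letters of the Arabic definite article.
        byte[] body = "\u0627\u0644".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(body, "text/html; charset=UTF-8"));
    }
}
```

The point of the sketch is only that the bytes are decoded with an explicitly named charset instead of the platform default, which can differ between a Linux and a Windows machine.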


Luke is currently not able to open the index files written by Lucene. On 
my colleague's system it did open them, but no Arabic content was found; 
instead, characters like اÙ"استراتÙS(ج .. etc. were shown.
On my local machine it now gives 'Unknown format version: -8', the same 
error my colleague got when trying to open an index from a Linux system 
where search was working fine.
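
Characters of that kind are typical of UTF-8 bytes being decoded with a single-byte Western charset. A small sketch to illustrate the effect, not a diagnosis of this particular index: it encodes two Arabic letters as UTF-8, decodes the bytes as ISO-8859-1, and then reverses the mistake. The class name MojibakeDemo is made up, and ISO-8859-1 is used here as a stand-in; on Windows the charset actually involved is more likely windows-1252, which gives similar-looking output.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    /** Encodes text as UTF-8, then (wrongly) decodes those bytes as ISO-8859-1. */
    public static String misdecode(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Reverses misdecode: ISO-8859-1 maps every byte to one char, so no bytes are lost. */
    public static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String arabic = "\u0627\u0644"; // alef + lam, UTF-8 bytes D8 A7 D9 84
        String garbled = misdecode(arabic);
        // Each UTF-8 byte becomes its own character, so 2 letters turn into 4 chars.
        System.out.println(garbled.length());
        System.out.println(repair(garbled).equals(arabic));
    }
}
```

If the terms in the index look like the output of misdecode, the text was most likely decoded with the wrong charset before it ever reached the analyzer, so changing the analyzer would not help.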




------------------------------------------------------------------------

Regards,

Please do not print this email unless it is absolutely necessary.
*Girish Naik*
Development Lead

*Neev Information Technologies Pvt Ltd* <http://www.neevtech.com>
Bangalore, Karnataka India

*Mobile:* 91 09740091638
*Email:* girish.naik@gmail.com <mailto:girish.naik@gmail.com>
*IM:* girish.naik (Skype)
*http://www.linkedin.com/in/girishnaik*


------------------------------------------------------------------------
The information contained in this electronic message and any attachments 
to this message are intended for the exclusive use of the addressee(s) 
and may contain proprietary, confidential or privileged information. If 
you are not the intended recipient, you should not disseminate, 
distribute or copy this e-mail. Please notify the sender immediately and 
destroy all copies of this message and any attachments.
*WARNING:* Computer viruses can be transmitted via email. The recipient 
should check this email and any attachments for the presence of viruses. 
The company accepts no liability for any damage caused by any virus 
transmitted by this email.
------------------------------------------------------------------------

On 12/29/2008 7:46 PM, Grant Ingersoll wrote:
> What does the FIELD_BODY look like?  Your search is apparently going 
> against that Field, but you don't show how it is indexed.
>
> Have you looked at your index in Luke yet?  http://www.getopt.org/luke?
>
>
>
> On Dec 29, 2008, at 8:19 AM, Girish Naik wrote:
>
>> Sorry for that,
>>
>> Here is how the Analyzer is Selected:
>>  public static Analyzer getAnalyzerInstance(String localeKey) {
>>     Analyzer analyzer = null;
>>     if (localeKey == null || localeKey.trim().equals("")) {
>>         localeKey = AppContext.getSetting("defaultLocale");
>>         System.out.println("<><><>><><><><><Locale key taken as Default ");
>>     } else {
>>         // localeKey may be a csv of locales, in which case pick the
>>         // first one.
>>         localeKey = StringUtils.split(localeKey, ",")[0].trim();
>>         System.out.println("<><><>><><><><><Locale key is trimmed");
>>     }
>>     System.out.println("<><><>><><><><><Locale is " + localeKey);
>>     String name = (String) _analyzerMap.get(localeKey);
>>     System.out.println("<><><>><><><><><Name from Locale is " + name);
>>     if (name == null) {
>>         analyzer = new StandardAnalyzer();
>>     } else {
>>         // if (name.equalsIgnoreCase("Arabic")) {
>>         // analyzer = new ArabicAnalyzer();
>>         // } else {
>>         analyzer = new SnowballAnalyzer(name);
>>         // }
>>     }
>>     return analyzer;
>>  }
>>
>>
>> While indexing, some fields are analyzed and some are not:
>>  document.add(new Field(FIELD_DOCUMENT_CREATED_ON, LocaleUtils
>>             .convert8859_6ToUTF8(com.aurigalogic.activesite.field.Field
>>                 .indexableDate(avsDoc.getCreatedOn())),
>>             Field.Store.YES, Field.Index.NOT_ANALYZED));
>> ...
>> document.add(new Field(FIELD_CONTENT_TYPE, LocaleUtils
>>             .convert8859_6ToUTF8(version.getDocument()
>>                 .getContentDescriptor().getName()),
>>             Field.Store.YES, Field.Index.ANALYZED));
>> Currently the method LocaleUtils.convert8859_6ToUTF8 does nothing but 
>> return its parameter as is.
>>
>> While searching, the QueryParser etc. are created like this:
>> Analyzer analyzer = AnalyzerSelector.getAnalyzerInstance(locale);
>> ...
>> QueryParser qparser = new QueryParser(Constants.FIELD_BODY, analyzer);
>> ...
>>
>>
>> So posting the form with an Arabic word does not fetch any 
>> results. An English word does work, though!
>>
>> I would be more than happy to provide anything else that is required.
>>
>>
>>
>>
>> Regards,
>>
>> Girish Naik
>> Development Lead
>>
>> On 12/29/2008 6:16 PM, Grant Ingersoll wrote:
>>>
>>> Hi Girish,
>>>
>>> Can you provide some sample code and info about what isn't working?  
>>> All you have said so far is that the Arabic Analyzer doesn't work 
>>> for you, but you have said nothing about how you are actually using 
>>> it.  Are you getting exceptions?  Do the tokens not look right?  Are 
>>> no results coming back?  Have you looked at your index in Luke?
>>>
>>> I'm going to take a wild stab in the dark and guess that you are not 
>>> reading in the input in the right encoding.
>>>
>>> -Grant
>>>
>>> On Dec 29, 2008, at 7:19 AM, Girish Naik wrote:
>>>
>>>> Hi,
>>>>     I am having a hard time indexing Arabic content and 
>>>> searching it via Lucene. I have used the Arabic Analyzer 
>>>> from the Lucene package but had no luck. I have also used the 
>>>> snowball jar, but it doesn't contain an Arabic stemmer, so I put 
>>>> the Lucene Arabic stemmer into the snowball jar (with modifications  :-X 
>>>> ) but still have had no luck so far.
>>>>
>>>>     Moreover, when I don't use any stemmer/analyzer, the search 
>>>> works perfectly on Linux systems, but on a Windows system the 
>>>> search does not work at all =-O
>>>>
>>>> If anybody has any kind of solution or ideas, please send them 
>>>> across. I will be very happy to implement and test them.
>>>>
>>>> Thanks in Advance.
>>>>
>>>> -- 
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Girish Naik
>>>> Development Lead
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>> --------------------------
>>> Grant Ingersoll
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
