jackrabbit-users mailing list archives

From "Hamid Reza Sahlolbey" <sahlol...@gmail.com>
Subject RE: Lucene Analyzerr
Date Sun, 17 Feb 2008 05:32:10 GMT

-----Original Message-----
From: Jukka Zitting [mailto:jukka.zitting@gmail.com] 
Sent: 2008/02/17 01:38 AM
To: users@jackrabbit.apache.org
Subject: Re: Lucene Analyzerr


2008/2/15 Hamid Reza Sahlolbey <sahlolbey@gmail.com>:
> First I used StandardAnalyzer, but when I looked at the workspace index files I
> realized that it doesn't index Persian text, so I changed to
> SimpleAnalyzer. Now it seems to index Persian text correctly, but searches don't
> find it (note that the query is the same for MS Word and PDF files).

Could there be some character encoding confusion somewhere? You may
want to check that the Unicode character stream produced by the text
extractor looks valid.


Jukka Zitting
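
One way to run the check suggested above is to count how many extracted characters fall into the Arabic Presentation Forms blocks instead of the basic Arabic block. The following is only a sketch: the class and method names are hypothetical, the sample strings are made up, and it assumes Java 8 or later for String.codePoints().

```java
// Hypothetical helper, not part of Jackrabbit or PDFBox.
final class ExtractedTextCheck {

    /** Counts code points in the Arabic Presentation Forms-A and -B blocks. */
    static long countPresentationForms(String text) {
        return text.codePoints().filter(cp -> {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            return block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
        }).count();
    }

    public static void main(String[] args) {
        // Made-up sample: SEEN in its "initial form" followed by the plain letter SEEN.
        String extracted = "\uFEB3\u0633";
        // Print each code point in hex so the stream can be inspected by eye.
        extracted.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        System.out.println("presentation forms: " + countPresentationForms(extracted));
    }
}
```

A non-zero count for text that should contain ordinary Persian letters would suggest that the extractor is emitting glyph-specific code points, which a query typed with standard letters will not match.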

Hi Jukka,
Yes, you are right. Last night I found that pdfbox extracts my text
incorrectly. We could not make sense of it at first, because there are two
sets of Persian characters in the Unicode character map. The characters should
be below \u06dc (what browsers treat as standard Persian characters), but
pdfbox extracts characters above \uFB50. I don't understand why pdfbox does
not return the standard characters that are common on the web. Is there any
way to define what pdfbox returns so that I get the correct result (I mean
changing something like the glyph list in pdfbox)? Does anybody know about this?

Thanks in advance,
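
A possible workaround that leaves pdfbox itself unchanged is to post-process the extracted text with Unicode compatibility normalization (NFKC), which maps the presentation forms above \uFB50 back to the base letters. This is a minimal sketch, not a pdfbox feature: the class and method names are hypothetical, and it assumes java.text.Normalizer (Java 6 or later) is available and that the extracted characters are already in logical order.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Jackrabbit or PDFBox.
final class PresentationFormNormalizer {

    /** Maps compatibility characters (including Arabic presentation forms) to base letters. */
    static String toBaseLetters(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB56 is PEH in its isolated presentation form; U+067E is the base letter PEH.
        String normalized = toBaseLetters("\uFB56");
        System.out.println(normalized.equals("\u067E"));
    }
}
```

Note that NFKC also rewrites other compatibility characters (for example Latin ligatures and full-width forms), so the same normalization would probably have to be applied consistently to the indexed text and to the query text.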
