uima-user mailing list archives

From Jaroslaw Cwiklik <cwik...@apache.org>
Subject Re: Problem in running DUCC Job for Arabic Language
Date Tue, 06 Nov 2018 14:54:53 GMT
Forgot to mention that if you have a shared file system, the best practice
is not to serialize your content (SOFA)
from the JD to the service. Instead, in the CR add a path to the file
containing the Subject of Analysis to the CAS, and have
the CM in the pipeline read the content from the shared file system.
-jerry
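A minimal, self-contained sketch of that pattern (the names `SharedFsDemo`, `pathForCas`, and `readSofa` are illustrative, not UIMA APIs; a real CR/CM would implement UIMA's CollectionReader and CAS multiplier interfaces): the CR stores only the file path in the CAS, and the CM reads the actual SOFA text from the shared file system with an explicit charset, so the JVM's default encoding never touches the content.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SharedFsDemo {
    // CR side: put only the path into the CAS (here, simply return it).
    static String pathForCas(Path doc) {
        return doc.toAbsolutePath().toString();
    }

    // CM side: read the Subject of Analysis from the shared file system,
    // decoding with an explicit charset so the JVM default cannot corrupt it.
    static String readSofa(String pathFromCas) throws Exception {
        return new String(Files.readAllBytes(Paths.get(pathFromCas)),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("sofa", ".txt");
        Files.write(tmp, "استعرض المتحدث".getBytes(StandardCharsets.UTF_8));
        System.out.println(readSofa(pathForCas(tmp)));
        Files.delete(tmp);
    }
}
```

Since only a short path string crosses the JD-to-service HTTP hop, the encoding of that hop no longer matters for the document content.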


On Tue, Nov 6, 2018 at 9:37 AM Jaroslaw Cwiklik <cwiklik@apache.org> wrote:

> Can you try setting -Dfile.encoding=ISO-8859-1 for the service (job)
> process and -Djavax.servlet.request.encoding=ISO-8859-1
> -Dfile.encoding=ISO-8859-1 for the JD process?
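In a DUCC job these options could go in the job specification; a sketch, assuming DUCC's `driver_jvm_args` (JD) and `process_jvm_args` (job process) options — check the flag names against your DUCC version's documentation:

```
--driver_jvm_args  "-Djavax.servlet.request.encoding=ISO-8859-1 -Dfile.encoding=ISO-8859-1"
--process_jvm_args "-Dfile.encoding=ISO-8859-1"
```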
>
> The JD actually uses Jetty webserver to serve service requests over HTTP.
> I went as far as extracting Jetty server code from JD into a simple http
> server process and also extracted HttpClient related code from the service
> into a simple client process to be able to test.
>
> So on the server side I have:
> String text = new String(
>     "استعرض المتحدث باسم قوات «التحالف العربي لدعم".getBytes("UTF-8"),
>     "ISO-8859-1");
> response.setHeader("content-type", "text/xml");
> String body = marshall(text);   // XStream serialization
> response.getWriter().write(body);
>
> On the client side:
>       System.out.println("Default Locale:   " + Locale.getDefault());
>       System.out.println("Default Charset:  " + Charset.defaultCharset());
>       System.out.println("file.encoding:    " +
> System.getProperty("file.encoding"));
>
>       HttpResponse response = httpClient.execute(postMethod);
>       HttpEntity entity = response.getEntity();
>       String content = EntityUtils.toString(entity);
>       String result = (String) unmarshall(content); // XStream unmarshall
>       String o = new String(result.getBytes());
>       System.out.println(o);
>
> When I run with the above -D settings the client console shows:
> Default Locale:   en_US
> Default Charset:  ISO-8859-1
> file.encoding:    ISO-8859-1
>
> استعرض المتحدث باسم قوات «التحالف العربي لدعم
>
> Without the -D's I don't see Arabic text; instead I see garbage on the
> console.
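The reason this round-trips losslessly is that ISO-8859-1 maps each of the 256 byte values to exactly one char, so re-encoding on the receiving side recovers the original UTF-8 bytes exactly. A small standalone demo (my own illustration, not code extracted from the JD):

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        String original = "استعرض المتحدث باسم قوات «التحالف العربي لدعم";
        // Sender: reinterpret the UTF-8 bytes as ISO-8859-1 (one char per byte,
        // no byte value is unmappable, so nothing is replaced with '?').
        String wire = new String(original.getBytes(StandardCharsets.UTF_8),
                                 StandardCharsets.ISO_8859_1);
        // Receiver: reverse the mapping to recover the UTF-8 bytes, then decode.
        String restored = new String(wire.getBytes(StandardCharsets.ISO_8859_1),
                                     StandardCharsets.UTF_8);
        System.out.println(original.equals(restored)); // prints "true"
    }
}
```

The same trip through a default charset like US-ASCII would fail, because every non-ASCII byte would be replaced rather than preserved — which matches the `???` symptom reported below.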
>
> On Fri, Jul 6, 2018 at 3:00 AM rohit14csu173@ncuindia.edu <
> rohit14csu173@ncuindia.edu> wrote:
>
>> Yes, if I run the AE as a DUCC UIMA-AS service and send it CASes from a
>> UIMA-AS client it works fine.
>> In fact the environment, i.e. the LANG argument, is the same for the
>> UIMA-AS service and the DUCC job.
>>
>> Environ[3] = LANG=en_IN
>>
>> And if I change to LANG=ar, then by the time the data arrives in the JD
>> the Arabic text is already replaced with ??? (question marks), and the
>> encoding of the data coming into the JD or CR shows as ASCII.
>> I don't understand why this is happening.
>>
>> Best
>> Rohit
>>
>>
>> On 2018/07/05 13:35:11, Eddie Epstein <eaepstein@gmail.com> wrote:
>> > So if you run the AE as a DUCC UIMA-AS service and send it CASes from
>> some
>> > UIMA-AS client it works OK? The full environment for all processes that
>> > DUCC launches is available via ducc-mon under the Specification or
>> > Registry tab for that job or managed reservation or service. Please see
>> if
>> > the LANG setting for the service is different from the LANG setting for
>> the
>> > job.
>> >
>> > One can also see the LANG setting for a linux process-id by doing:
>> >
>> > cat /proc/<pid>/environ
>> >
>> > The LANG to be used for a DUCC process can be set by adding
>> > "LANG=xxx" to the --environment argument as needed.
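For example (illustrative, using the current process's own environ; the file is NUL-separated, so translate NULs to newlines before grepping):

```shell
# /proc/<pid>/environ is NUL-separated; make it line-oriented, then pick LANG.
# "self" refers to the current process; substitute a real DUCC process id.
tr '\0' '\n' < /proc/self/environ | grep '^LANG=' || echo "LANG not set"
```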
>> >
>> > Thanks,
>> > Eddie
>> >
>> >
>> >
>> > On Thu, Jul 5, 2018 at 6:47 AM, rohit14csu173@ncuindia.edu <
>> > rohit14csu173@ncuindia.edu> wrote:
>> >
>> > > Hey,
>> > >  Yeah, you got it right: the first snippet is in the CR, before the
>> > > data goes into the CAS.
>> > > And the second snippet is in the first annotator or analysis engine
>> > > (AE) of my aggregate descriptor.
>> > > I am pretty sure this is an issue with the CAS used by DUCC, because
>> > > if I use a DUCC service, where we are supposed to send a CAS and
>> > > receive the same CAS back with added features, I get accurate results.
>> > >
>> > > But the only problem comes in submitting a job, where the CAS is
>> > > generated by DUCC.
>> > > This could also be an issue with the environment (language) of DUCC,
>> > > because the default language is English.
>> > >
>> > > Best Regards
>> > > Rohit
>> > >
>> > > On 2018/07/03 13:11:50, Eddie Epstein <eaepstein@gmail.com> wrote:
>> > > > Rohit,
>> > > >
>> > > > Before sending the data into jcas if i force encode it :-
>> > > > >
>> > > > > String content2 = null;
>> > > > > content2 = new String(content.getBytes("UTF-8"), "ISO-8859-1");
>> > > > > jcas.setDocumentText(content2);
>> > > > >
>> > > >
>> > > > Where is this code, in the job CR?
>> > > >
>> > > >
>> > > >
>> > > > >
>> > > > > And when i go in my first annotator i force decode it:-
>> > > > >
>> > > > > String content = null;
>> > > > > content = new String(jcas.getDocumentText().getBytes("ISO-8859-1"),
>> > > > > "UTF-8");
>> > > > >
>> > > >
>> > > > And is this in the first annotator of the job process, i.e. the CM?
>> > > >
>> > > > Please be as specific as possible.
>> > > >
>> > > > Thanks,
>> > > > Eddie
>> > > >
>> > >
>> >
>>
>
