uima-user mailing list archives

From Jaroslaw Cwiklik <cwik...@apache.org>
Subject Re: Problem in running DUCC Job for Arabic Language
Date Tue, 06 Nov 2018 14:37:13 GMT
Can you try setting -Dfile.encoding=ISO-8859-1 for the service (job)
process, and both -Djavax.servlet.request.encoding=ISO-8859-1 and
-Dfile.encoding=ISO-8859-1 for the JD process?

The JD uses a Jetty web server to serve service requests over HTTP. To test
this in isolation, I extracted the Jetty server code from the JD into a simple
HTTP server process, and the HttpClient-related code from the service into a
simple client process.

So on the server side I have:
String text = new String("استعرض المتحدث باسم قوات «التحالف العربي لدعم".getBytes("UTF-8"), "ISO-8859-1");
response.setHeader("content-type", "text/xml");
String body = marshall(text);   // XStream serialization
response.getWriter().write(body);
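
As an aside on the first line above: decoding UTF-8 bytes as ISO-8859-1 loses nothing, because Java's ISO-8859-1 maps each of the 256 byte values to exactly one char. Below is a minimal, standalone sketch of that round trip. The class name Latin1RoundTrip and the helpers wrap/unwrap are hypothetical names used for illustration, not DUCC code, and the sample string is a single word from the text above ("العربي"), written with Unicode escapes so the source file encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {
    // Reinterpret the UTF-8 bytes of s as ISO-8859-1 chars (one char per byte).
    static String wrap(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverse of wrap: recover the bytes via ISO-8859-1, then decode them as UTF-8.
    static String unwrap(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Six Arabic letters; each takes two bytes in UTF-8.
        String original = "\u0627\u0644\u0639\u0631\u0628\u064A";
        String wrapped = wrap(original);

        System.out.println(wrapped.length());                 // prints 12
        System.out.println(unwrap(wrapped).equals(original)); // prints true
    }
}
```

The wrapped string contains only chars in the range U+0000 to U+00FF, so it survives any transport that is at least ISO-8859-1 clean, as long as the receiver applies the reverse step.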

On the client side:
System.out.println("Default Locale:   " + Locale.getDefault());
System.out.println("Default Charset:  " + Charset.defaultCharset());
System.out.println("file.encoding;    " + System.getProperty("file.encoding"));

HttpResponse response = httpClient.execute(postMethod);
HttpEntity entity = response.getEntity();
String content = EntityUtils.toString(entity);
String result = (String) unmarshall(content); // XStream unmarshall
String o = new String(result.getBytes());
System.out.println(o);

When I run with the above -D settings the client console shows:
Default Locale:   en_US
Default Charset:  ISO-8859-1
file.encoding;    ISO-8859-1

استعرض المتحدث باسم قوات «التحالف العربي لدعم

Without the -D's I don't see Arabic text; I see garbage on the
console instead.
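
One plausible reading of this result, offered as a sketch and not as a confirmed diagnosis: the string printed by the client is still the ISO-8859-1 "wrapped" form, and what the terminal shows depends on which charset the JVM uses to turn it into bytes. The example assumes that System.out encodes with the default charset (as on Java 8) and that the terminal interprets bytes as UTF-8. The class name ConsoleBytesSketch and the helper emitted are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ConsoleBytesSketch {
    // Bytes that a stream encoding with consoleCharset would emit for s.
    static byte[] emitted(String s, Charset consoleCharset) {
        return s.getBytes(consoleCharset);
    }

    public static void main(String[] args) {
        // Four Arabic letters, written as escapes to avoid source-encoding issues.
        String original = "\u0639\u0631\u0628\u064A";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        // Same transformation as on the server side: UTF-8 bytes read as ISO-8859-1.
        String wrapped = new String(utf8, StandardCharsets.ISO_8859_1);

        // ISO-8859-1 output: the original UTF-8 bytes reach the terminal unchanged.
        System.out.println(Arrays.equals(emitted(wrapped, StandardCharsets.ISO_8859_1), utf8)); // prints true

        // UTF-8 output: each wrapped char is encoded again, doubling the byte count.
        System.out.println(emitted(wrapped, StandardCharsets.UTF_8).length); // prints 16

        // US-ASCII output: every unmappable char is replaced with '?'.
        System.out.println(new String(emitted(wrapped, StandardCharsets.US_ASCII), StandardCharsets.US_ASCII)); // prints ????????
    }
}
```

If this reading is right, the US-ASCII case would also match the ??? symptom reported earlier in the thread. Naming the charset explicitly on every getBytes and new String call, instead of relying on the default, removes the dependency on -Dfile.encoding.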

On Fri, Jul 6, 2018 at 3:00 AM rohit14csu173@ncuindia.edu <
rohit14csu173@ncuindia.edu> wrote:

> Yes, if I run the AE as a DUCC UIMA-AS service and send it CASes from a
> UIMA-AS client, it works fine.
> In fact the environment, i.e. the LANG setting, is the same for the UIMA-AS
> service and the DUCC job.
>
> Environ[3] = LANG=en_IN
>
> And if I change to LANG=ar, then the data arriving in the JD already has the
> Arabic text replaced with ??? (question marks), and the encoding of the data
> arriving in the JD or CR shows as ASCII.
> I don't understand why this is happening.
>
> Best
> Rohit
>
>
> On 2018/07/05 13:35:11, Eddie Epstein <eaepstein@gmail.com> wrote:
> > So if you run the AE as a DUCC UIMA-AS service and send it CASes from some
> > UIMA-AS client it works OK? The full environment for all processes that
> > DUCC launches is available via ducc-mon under the Specification or
> > Registry tab for that job or managed reservation or service. Please see if
> > the LANG setting for the service is different from the LANG setting for the
> > job.
> >
> > One can also see the LANG setting for a Linux process ID by doing:
> >
> > cat /proc/<pid>/environ
> >
> > The LANG to be used for a DUCC process can be set by adding to the
> > --environment argument "LANG=xxx" as needed
> >
> > Thanks,
> > Eddie
> >
> >
> >
> > On Thu, Jul 5, 2018 at 6:47 AM, rohit14csu173@ncuindia.edu <
> > rohit14csu173@ncuindia.edu> wrote:
> >
> > > Hey,
> > > Yes, you got it right: the first snippet runs in the CR, before the data
> > > goes into the CAS.
> > > The second snippet is in the first annotator or analysis engine (AE) of
> > > my aggregate descriptor.
> > > I am pretty sure this is an issue with the CAS used by DUCC, because if I
> > > use a DUCC service, where we send the CAS and receive the same CAS back
> > > with added features, I get accurate results.
> > >
> > > The problem only occurs when submitting a job, where the CAS is generated
> > > by DUCC.
> > > This could also be an issue with the environment (language) of DUCC,
> > > because the default language is English.
> > >
> > > Best regards,
> > > Rohit
> > >
> > > On 2018/07/03 13:11:50, Eddie Epstein <eaepstein@gmail.com> wrote:
> > > > Rohit,
> > > >
> > > > > > Before sending the data into the JCas, if I force-encode it:
> > > > >
> > > > > String content2 = null;
> > > > > content2 = new String(content.getBytes("UTF-8"), "ISO-8859-1");
> > > > > jcas.setDocumentText(content2);
> > > > >
> > > >
> > > > Where is this code, in the job CR?
> > > >
> > > >
> > > >
> > > > >
> > > > > > And when I get to my first annotator, I force-decode it:
> > > > >
> > > > > String content = null;
> > > > > > content = new String(jcas.getDocumentText().getBytes("ISO-8859-1"),
> > > > > "UTF-8");
> > > > >
> > > >
> > > > And is this in the first annotator of the job process, i.e. the CM?
> > > >
> > > > Please be as specific as possible.
> > > >
> > > > Thanks,
> > > > Eddie
> > > >
> > >
> >
>
