manifoldcf-user mailing list archives

From Yossi Nachum <>
Subject Re: Problem with directories in hebrew and jcifs
Date Tue, 21 May 2013 05:14:18 GMT
Thanks again for your help.
I used the metadata option that you suggested and it works fine. Now people
know where the documents that they find in Solr are located. The URL does
not work when I click on it, but at least we know where the docs are.
Sorry about the late response; I was on vacation.
On May 6, 2013 3:36 PM, "Karl Wright" <> wrote:

> Hi Yossi,
> I looked into this further over the weekend, to try and recall some of the
> thinking that went into how our file IRI's are constructed.
> (1) There is a constraint, which comes from certain output connectors, and
> which may no longer be valid, that all file IRI's must be legal URI's.  If
> that is still true, it REQUIRES us to %-encode non-ASCII characters.  The
> standard URI encoding is UTF-8, which is why we use that encoding.
> (2) For other characters that cannot be legally put in a URI, such as "+"
> and " " and "#", browsers I have access to behave as follows:
> <?xml version="1.0" encoding="utf-8"?>
> <html>
> <body>
>     <a href="file:///c:/test/test.html">click here for test</a><br/>
>     <a href="file:///c:/test/hi#there.html">click here for
> hi#there</a>(works on IE 8, but not Firefox - base form on IE8)<br/>
>     <a href="file:///c:/test/hi%23there.html">click here for
> hi%23there</a>(works on both, base form on Firefox)<br/>
>     <a href="file:///c:/test/hi<there.html">click here for
> hi&lt;there</a>(works on both but represents a file that can't be
> loaded)<br/>
>     <a href="file:///c:/test/hi%3cthere.html">click here for
> hi%3cthere</a>(works on both but represents a file that can't be loaded -
> base form on both)<br/>
>     <a href="file:///c:/test/hi there.html">click here for hi
> there</a>(works on both, base form on both)<br/>
>     <a href="file:///c:/test/hi%20there.html">click here for
> hi%20there</a>(works on both)<br/>
> </body>
> </html>
> As you can see, there's some common ground, but always the common ground
> requires more encoding rather than less.
> (3) Even assuming we relax the URI requirement, non-encoded, non-ASCII
> characters are interpreted in the encoding of the document they are
> embedded in.  So, for instance, if you wanted to include Hebrew characters
> in a file IRI, you will have to have a web page that is encoded in
> something that can represent Hebrew characters.
> Karl
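Point (1) above can be illustrated with a minimal Java sketch of UTF-8 percent-encoding for a path. This is not the JCIFS connector's code: the class name `IriPercentEncoder`, the method `encodePath`, and the choice of which characters to leave unescaped (the RFC 3986 unreserved set plus `/`) are assumptions made for this example only.

```java
import java.nio.charset.StandardCharsets;

public class IriPercentEncoder {
    // Assumption for this sketch: leave the RFC 3986 "unreserved" characters
    // and the path separator '/' untouched, and escape everything else.
    private static final String SAFE =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~/";

    // Percent-encode a path, writing each non-safe character as the
    // %XX form of its UTF-8 bytes.
    public static String encodePath(String path) {
        StringBuilder sb = new StringBuilder();
        for (byte b : path.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80 && SAFE.indexOf((char) v) >= 0) {
                sb.append((char) v);
            } else {
                sb.append('%').append(String.format("%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The file name is the Hebrew word "shalom", written with Unicode
        // escapes so the source file encoding does not matter.
        String path = "/share/hi there#1/\u05E9\u05DC\u05D5\u05DD.txt";
        // Prints: /share/hi%20there%231/%D7%A9%D7%9C%D7%95%D7%9D.txt
        System.out.println(encodePath(path));
    }
}
```

Each Hebrew letter becomes two escaped bytes, which is why the ID and URL fields look unreadable in Solr even though they still identify the same file.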
> On Fri, May 3, 2013 at 10:05 AM, Karl Wright <> wrote:
>> I should clarify.  IF you can propose a better IRI form than the one the
>> connector generates, AND it will work for all languages/encodings and most
>> modern browsers, we should consider changing the connector code.
>> Karl
>> On Fri, May 3, 2013 at 8:32 AM, Karl Wright <> wrote:
>>> Here is the code in the JCIFS connector:
>>>               String pathAttributeValue = documentIdentifier;
>>>               // 3/13/2008
>>>               // In looking at what comes into the path metadata attribute by default, and cogitating a bit, I've concluded that
>>>               // the smb:// and the server/domain name at the start of the path are just plain old noise, and should be stripped.
>>>               // This changes a behavior that has been around for a while, so there is a risk, but a quick back-and-forth with the
>>>               // SE's leads me to believe that this is safe.
>>>               if (pathAttributeValue.startsWith("smb://"))
>>>               {
>>>                 int index = pathAttributeValue.indexOf("/","smb://".length());
>>>                 if (index == -1)
>>>                   index = pathAttributeValue.length();
>>>                 pathAttributeValue = pathAttributeValue.substring(index);
>>>               }
>>>               // Now, translate
>>>               pathAttributeValue = matchMap.translate(pathAttributeValue);
>>>               pack(sb,pathAttributeValue,'+');
>>>             }
>>>             else
>>>               sb.append('-');
>>> Since the JCIFS connection determines the server name, the document
>>> identifier does not need to repeat that information.  If you need to send
>>> the server name to Solr for some reason, you can certainly do that on a
>>> per-job basis by putting in yet another bit of metadata, via the "Forced
>>> Metadata" tab in your job.  If you have a really strong reason for
>>> including the server name in the same path, it would also be possible to
>>> add another feature to the JCIFS connector to do it based on a checkbox or
>>> some such; but this would complicate further an already very complicated
>>> user interface.
>>> It looks, however, like you are trying to construct an IRI, which the
>>> JCIFS connector is supposed to be doing.  Can you explain what your needs
>>> are here?  What do you believe is the correct form of an IRI?
>>> Karl
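The prefix-stripping step in the quoted connector fragment can be tried in isolation with a small standalone sketch. The class name `PathAttribute` and the method `stripServer` are hypothetical; the later `matchMap.translate` and `pack` steps from the fragment are not reproduced here because their definitions are not shown in the thread.

```java
public class PathAttribute {
    // Remove the leading "smb://" and the server/domain name, keeping only
    // the path inside the server. Mirrors the logic of the quoted fragment.
    public static String stripServer(String documentIdentifier) {
        String prefix = "smb://";
        if (!documentIdentifier.startsWith(prefix)) {
            return documentIdentifier;
        }
        // Find the first '/' after the server name.
        int index = documentIdentifier.indexOf('/', prefix.length());
        if (index == -1) {
            // No path after the server name: nothing is left.
            index = documentIdentifier.length();
        }
        return documentIdentifier.substring(index);
    }

    public static void main(String[] args) {
        // Prints: /share/dir/file.txt
        System.out.println(stripServer("smb://server.domain/share/dir/file.txt"));
    }
}
```

This is why the path attribute that reaches Solr contains only the path inside the server, without the server name.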
>>> On Fri, May 3, 2013 at 7:49 AM, Yossi Nachum <> wrote:
>>>> That is working. I created a path field in my schema and used the "path
>>>> attribute".
>>>> I have one problem: I don't see the name of the CIFS server, just the
>>>> path inside it.
>>>> I tried to use "Match Regexp" in the metadata tab with the following
>>>> values:
>>>> Match regexp: "(.*)"
>>>> Replace string: "file:////server_name/$(1)"
>>>> but it did not work. I still see only the path.
>>>> What am I doing wrong? How can I add my server name to the path?
>>>> Thanks
>>>> Yossi
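The transformation that the regexp and replace string above are meant to express can be sketched in plain Java with `java.util.regex`. This only illustrates the intended result. It does not reproduce ManifoldCF's match map, whose `$(1)` syntax and behavior are not shown in this thread, and the names `PathRewriteSketch` and `prependServer` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathRewriteSketch {
    // Anchored so the whole path is captured exactly once as group 1.
    private static final Pattern WHOLE = Pattern.compile("^(.*)$");

    // Build "file:////<serverName>/<path>", the shape of the replace string
    // "file:////server_name/$(1)" from the message above.
    public static String prependServer(String serverName, String path) {
        Matcher m = WHOLE.matcher(path);
        if (!m.matches()) {
            // '.' does not match line terminators, so leave such input as-is.
            return path;
        }
        // Note: if the path already starts with '/', the result contains "//".
        return "file:////" + serverName + "/" + m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(prependServer("server_name", "share/docs/a.txt"));
    }
}
```

In Java, an unanchored `(.*)` passed to `String.replaceAll` also matches the empty string at the end of the input, so the replacement is applied twice; anchoring the pattern avoids that.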
>>>> On Wed, May 1, 2013 at 4:10 PM, Yossi Nachum <> wrote:
>>>>> Thanks I will try that
>>>>> On May 1, 2013 3:54 PM, "Karl Wright" <> wrote:
>>>>>> There is also a different way to do this entirely - there is a path
>>>>>> attribute you can send as metadata to Solr.  Just include the entire
>>>>>> path and put it into a different field that you declare in your schema.
>>>>>> See "path attribute" in the end-user documentation for the JCIFS connector.
>>>>>> On Wed, May 1, 2013 at 8:52 AM, Karl Wright <> wrote:
>>>>>>> IE 6 is extremely old and I believe we developed for IE 7 at
>>>>>>> minimum (there were two different versions with different functionality
>>>>>>> we had to support there), and made further changes for IE 8 when
>>>>>>> it came out.
>>>>>>> I have no idea what IE 9 or IE 10 do.
>>>>>>> The only way to change the encoding of the IRI is to modify the
>>>>>>> JCIFS connector code.  But please bear in mind that unless you can show
>>>>>>> your modifications will work across a wide variety of browsers, we are
>>>>>>> unlikely to accept these changes back into the code base.
>>>>>>> The alternative is, since the encoding IS deterministic and
>>>>>>> reversible, you could readily write a Tika plugin that would modify at
>>>>>>> least the URL field in the manner you desire.  But you could not modify the
>>>>>>> ID field since ManifoldCF uses this to delete documents that
>>>>>>> disappeared.
>>>>>>> Karl
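Because the encoding is deterministic and reversible, a display-only copy of the URL can be decoded back into readable text. The following is a minimal sketch of that reversal, assuming the escapes are UTF-8 bytes as described elsewhere in this thread. The class `IriPercentDecoder` and its `decode` method are hypothetical, and no Tika plugin wiring is shown.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class IriPercentDecoder {
    // Turn %XX escapes back into bytes and interpret the result as UTF-8.
    // Unlike java.net.URLDecoder, '+' is left untouched, since it is a
    // literal character in a path.
    public static String decode(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < encoded.length()) {
            if (encoded.charAt(i) == '%' && i + 2 < encoded.length()) {
                int hi = Character.digit(encoded.charAt(i + 1), 16);
                int lo = Character.digit(encoded.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 3;
                    continue;
                }
            }
            // Not a valid escape: copy the code point through unchanged.
            int cp = encoded.codePointAt(i);
            byte[] raw = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            out.write(raw, 0, raw.length);
            i += Character.charCount(cp);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Prints the path with the Hebrew file name restored.
        System.out.println(decode("/share/%D7%A9%D7%9C%D7%95%D7%9D.txt"));
    }
}
```

Only a display copy should be decoded this way; as noted above, the ID field has to stay exactly as ManifoldCF generated it.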
>>>>>>> On Wed, May 1, 2013 at 8:45 AM, Yossi Nachum <> wrote:
>>>>>>>> The IRI is not working in my IE. I am using an old version of
>>>>>>>> SP3.
>>>>>>>> But what I really want is to display the correct name of the
>>>>>>>> with Hebrew characters.
>>>>>>>> If I understand you right, then I need to change the representation
>>>>>>>> of the IRI. How can I do that?
>>>>>>>> On May 1, 2013 3:14 PM, "Karl Wright" <> wrote:
>>>>>>>>> Right, that is exactly what I would expect.
>>>>>>>>> ManifoldCF uses a URL (which is constructed by the connector) as
>>>>>>>>> the primary key for every document as indexed in the search engine.  The
>>>>>>>>> URL has two purposes: first, it is supposed to be unique, and second, it is
>>>>>>>>> supposed to allow someone who browses to that result to locate the
>>>>>>>>> document.  In the case of JCIFS, the environment is presumed to be the
>>>>>>>>> local Active Directory domain(s), and the "URL" generated is really a file
>>>>>>>>> IRI, usually of the form "file://///server.domain/path/filename".  You thus
>>>>>>>>> should be able to paste the "URL" of the document from Solr into a browser
>>>>>>>>> on a machine in the domain, and see the document load.
>>>>>>>>> As I said before, however, there are already certain problems with
>>>>>>>>> this because each version of IE differs somewhat in how it deals with
>>>>>>>>> non-ASCII characters.  IRI legal character rules are somewhat different
>>>>>>>>> than URL rules, but IRI's are still nevertheless escaped in various ways.
>>>>>>>>> There are also multiple equivalent ways of representing the same file path
>>>>>>>>> with different IRI's.
>>>>>>>>> It is not typical that the ID and URL fields of a document are
>>>>>>>>> presented to the user in any meaningful way, so your question is
>>>>>>>>> academic in most settings.  If you have a problem with the IRI's not
>>>>>>>>> actually working in a browser, that's of more immediate interest.  Please
>>>>>>>>> let us know if that's the case.
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>> On Wed, May 1, 2013 at 8:04 AM, Yossi Nachum <> wrote:
>>>>>>>>>> Thanks for your response.
>>>>>>>>>> I am seeing these characters in Solr when I search for these files.
>>>>>>>>>> I am using the Solr example site and these characters show up in
>>>>>>>>>> the ID field and URL field.
>>>>>>>>>> BTW I am running Solr and MCF on a Linux server.
>>>>>>>>>>  On May 1, 2013 1:11 PM, "Karl Wright" <>
>>>>>>>>>> wrote:
>>>>>>>>>>> Where are you seeing these characters?  Are you talking about
>>>>>>>>>>> the file IRI's that the JCIFS connector generates?  Those IRI's are
>>>>>>>>>>> supposed to be constructed so that your browser would find them if you
>>>>>>>>>>> paste them into the browser URL window.  Unfortunately, there is no good
>>>>>>>>>>> standard, and people follow IE's behavior, and IE has changed multiple
>>>>>>>>>>> times in how it deals with non-latin-1 characters.
>>>>>>>>>>> Please provide a bit more information so that we can provide a
>>>>>>>>>>> better answer.
>>>>>>>>>>> Karl
>>>>>>>>>>> On Wed, May 1, 2013 at 3:11 AM, Yossi Nachum
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> Hello,
>>>>>>>>>>>> I installed a search server with Solr and ManifoldCF.
>>>>>>>>>>>> I want to index my NetApp files over CIFS and I have a problem
>>>>>>>>>>>> with Hebrew files and directories.
>>>>>>>>>>>> When I search for these files in Solr I see
>>>>>>>>>>>> instead of the directory path that contains Hebrew characters.
>>>>>>>>>>>> I tried to run the Java process with "-Djcifs.encoding=cp1255"
>>>>>>>>>>>> but it didn't help.
>>>>>>>>>>>> Can anyone help and tell me how I can index
>>>>>>>>>>>> in Hebrew?
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Yossi
