manifoldcf-user mailing list archives

From Karl Wright <>
Subject Re: Problem with directories in hebrew and jcifs
Date Fri, 03 May 2013 12:32:42 GMT
Here is the code in the JCIFS connector:

              String pathAttributeValue = documentIdentifier;
              // 3/13/2008
              // In looking at what comes into the path metadata attribute by
              // default, and cogitating a bit, I've concluded that the smb://
              // and the server/domain name at the start of the path are just
              // plain old noise, and should be stripped.  This changes a
              // behavior that has been around for a while, so there is a risk,
              // but a quick back-and-forth with the SE's leads me to believe
              // that this is safe.

              if (pathAttributeValue.startsWith("smb://"))
              {
                // The right-hand side is missing from the archived message;
                // the search for the first "/" after the server name is
                // reconstructed from the comment above.
                int index = pathAttributeValue.indexOf("/","smb://".length());
                if (index == -1)
                  index = pathAttributeValue.length();
                pathAttributeValue = pathAttributeValue.substring(index);
              }
              // Now, translate
              pathAttributeValue = matchMap.translate(pathAttributeValue);

Since the JCIFS connection determines the server name, the document
identifier does not need to repeat that information.  If you need to send
the server name to Solr for some reason, you can certainly do that on a
per-job basis by putting in yet another bit of metadata, via the "Forced
Metadata" tab in your job.  If you have a really strong reason for
including the server name in the same path, it would also be possible to
add another feature to the JCIFS connector to do it based on a checkbox or
some such; but this would complicate further an already very complicated
user interface.
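
As a rough, standalone illustration of the stripping step described above
(this is not the connector's actual class; the class name
PathAttributeSketch, the method name stripServer, and the sample identifier
are all hypothetical), the effect on a document identifier can be sketched
like this:

```java
// Hypothetical standalone sketch; it only mirrors the logic quoted above.
class PathAttributeSketch {
    private static final String SMB_PREFIX = "smb://";

    // Removes "smb://" and the server/domain name, leaving the path inside the server.
    static String stripServer(String documentIdentifier) {
        if (!documentIdentifier.startsWith(SMB_PREFIX)) {
            return documentIdentifier;
        }
        // Find the first "/" that follows the server name.
        int index = documentIdentifier.indexOf('/', SMB_PREFIX.length());
        if (index == -1) {
            // No path after the server name: nothing is left once it is stripped.
            index = documentIdentifier.length();
        }
        return documentIdentifier.substring(index);
    }

    public static void main(String[] args) {
        // prints "/share/dir/file.txt"
        System.out.println(stripServer("smb://server.domain/share/dir/file.txt"));
    }
}
```

An identifier that has no path after the server name comes back as an empty
string, and anything that does not start with "smb://" is passed through
unchanged.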

It looks, however, like you are trying to construct an IRI, which the JCIFS
connector is supposed to be doing.  Can you explain what your needs are
here?  What do you believe is the correct form of an IRI?
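
For reference when comparing IRI forms: the "%D7%91..." sequences reported
at the start of this thread look like percent-escaped UTF-8 bytes.  That is
an assumption, but %D7%91 is the UTF-8 encoding of the Hebrew letter bet
(U+05D1).  Under that assumption, a minimal sketch of recovering a
displayable path from an escaped one might look like the following.  The
class and method names are hypothetical, and java.net.URLDecoder is used
only for brevity: it also turns "+" into a space, which a proper IRI
decoder would not do.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical sketch; assumes the escapes are UTF-8 bytes.
class IriDecodeSketch {
    // Turns percent-escaped UTF-8 sequences back into characters.
    static String decodePath(String escapedPath) {
        try {
            return URLDecoder.decode(escapedPath, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "%D7%91" decodes to the single character U+05D1.
        System.out.println(decodePath("/share/%D7%91"));
    }
}
```

Such a step would only be suitable for a display field; the ID field has to
stay exactly as the connector generated it.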


On Fri, May 3, 2013 at 7:49 AM, Yossi Nachum <> wrote:

> That is working. I created a path field in my schema and use the "path
> attribute".
> I have one problem: I don't see the name of the CIFS server, just the path
> inside it.
> I tried to use "Match Regexp" in the metadata tab with the following values:
> Match regexp: "(.*)"
> Replace string: "file:////server_name/$(1)"
> but it did not work. Still seeing the path only.
> What am I doing wrong? How can I add my server name to the path?
> Thanks
> Yossi
> On Wed, May 1, 2013 at 4:10 PM, Yossi Nachum <> wrote:
>> Thanks I will try that
>> On May 1, 2013 3:54 PM, "Karl Wright" <> wrote:
>>> There is also a different way to do this entirely - there is a path
>>> attribute you can send as metadata to Solr.  Just include the entire path,
>>> and put it into a different field that you declare in your schema.  See
>>> "path attribute" in the end-user documentation for the JCIFS connector.
>>> On Wed, May 1, 2013 at 8:52 AM, Karl Wright <> wrote:
>>>> IE 6 is extremely old and I believe we developed for IE 7 at a minimum
>>>> (there were two different versions with different functionality we had to
>>>> support there), and made further changes for IE 8 when it came out.  I have
>>>> no idea what IE 9 or IE 10 do.
>>>> The only way to change the encoding of the IRI is to modify the JCIFS
>>>> connector code.  But please bear in mind that unless you can show your
>>>> modifications will work across a wide variety of browsers, we are unlikely
>>>> to accept these changes back into the code base.
>>>> The alternative is, since the encoding IS deterministic and reversible,
>>>> you could readily write a Tika plugin that would modify at least the URL
>>>> field in the manner you desire.  But you could not modify the ID field
>>>> since ManifoldCF uses this to delete documents that have disappeared.
>>>> Karl
>>>> On Wed, May 1, 2013 at 8:45 AM, Yossi Nachum <> wrote:
>>>>> The IRI is not working in my IE. I am using an old version of IE, V6 SP3.
>>>>> But what I really want is to display the correct name of the path with
>>>>> Hebrew characters.
>>>>> If I understand you correctly, I need to change the representation of
>>>>> the IRI. How can I do that?
>>>>> On May 1, 2013 3:14 PM, "Karl Wright" <> wrote:
>>>>>> Right, that is exactly what I would expect.
>>>>>> ManifoldCF uses a URL (which is constructed by the connector) as the
>>>>>> primary key for every document as indexed in the search engine.  It
>>>>>> has two purposes: first, it is supposed to be unique, and second, it is
>>>>>> supposed to allow someone who browses to that result to locate the
>>>>>> document.  In the case of JCIFS, the environment is presumed to be
>>>>>> local active directory domain(s), and the "URL" generated is really a file
>>>>>> IRI, usually of the form "file://///server.domain/path/filename".  You thus
>>>>>> should be able to paste the "URL" of the document from Solr into a browser
>>>>>> on a machine in the domain, and see the document load.
>>>>>> As I said before, however, there are already certain problems with
>>>>>> this because each version of IE differs somewhat in how it deals with
>>>>>> non-ASCII characters.  IRI legal character rules are somewhat different
>>>>>> than URL rules, but IRI's are still nevertheless escaped in various ways.
>>>>>> There are also multiple equivalent ways of representing the same file path
>>>>>> with different IRI's.
>>>>>> It is not typical that the ID and URL fields of a document are
>>>>>> presented to the user in any meaningful way, so your question is
>>>>>> academic in most settings.  If you have a problem with the IRI's
>>>>>> actually working in a browser, that's of more immediate interest.  Please
>>>>>> let us know if that's the case.
>>>>>> Thanks,
>>>>>> Karl
>>>>>> On Wed, May 1, 2013 at 8:04 AM, Yossi Nachum <> wrote:
>>>>>>> Thanks for your response.
>>>>>>> I am seeing these characters in Solr when I search for these files.
>>>>>>> I am using the Solr example site, and these characters show up in the
>>>>>>> ID field and URL field.
>>>>>>> BTW, I am running Solr and MCF on a Linux server.
>>>>>>> On May 1, 2013 1:11 PM, "Karl Wright" <> wrote:
>>>>>>>> Where are you seeing these characters?  Are you talking about the
>>>>>>>> file IRI's that the JCIFS connector generates?  Those IRI's are supposed to
>>>>>>>> be constructed so that your browser would find them if you paste them into
>>>>>>>> the browser URL window.  Unfortunately, there is no good standard, and
>>>>>>>> people follow IE's behavior, and IE has changed multiple times in how it
>>>>>>>> deals with non-latin-1 characters.
>>>>>>>> Please provide a bit more information so that we can provide a
>>>>>>>> better answer.
>>>>>>>> Karl
>>>>>>>> On Wed, May 1, 2013 at 3:11 AM, Yossi Nachum <> wrote:
>>>>>>>>> Hello,
>>>>>>>>> I installed a search server with Solr and ManifoldCF.
>>>>>>>>> I want to index my NetApp files over CIFS, and I have a problem
>>>>>>>>> with Hebrew files and directories.
>>>>>>>>> When I search for these files in Solr I see "%D7%91%D7%..."
>>>>>>>>> instead of the directory path that contains Hebrew characters.
>>>>>>>>> I tried to run the java process with "-Djcifs.encoding=cp1255" but
>>>>>>>>> it didn't help.
>>>>>>>>> Can anyone help and tell me how I can index directories/files in
>>>>>>>>> Hebrew?
>>>>>>>>> Thanks
>>>>>>>>> Yossi
