manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Problem with directories in hebrew and jcifs
Date Fri, 03 May 2013 14:05:55 GMT
I should clarify.  IF you can propose a better IRI form than the one the
connector generates, AND it will work for all languages/encodings and most
modern browsers, we should consider changing the connector code.

Karl
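
For readers weighing an alternative IRI form: the "%D7%91%D7%..." strings reported at the start of this thread are percent-encoded UTF-8 bytes of Hebrew letters, so they can be turned back into readable text for display. The following is a minimal sketch under that assumption, not connector code; the class name, method name, and sample path are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class IriDecodeSketch {
    // Decode percent-escapes as UTF-8 for display purposes only.
    // URLDecoder treats '+' as a space, so literal plus signs are
    // protected first by rewriting them as %2B.
    static String decodeForDisplay(String encoded) {
        try {
            return URLDecoder.decode(encoded.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample value in the form discussed in this thread.
        String encoded = "file://///server.domain/%D7%91%D7%93/a+b.txt";
        System.out.println(decodeForDisplay(encoded));
    }
}
```

A decoded value like this is only suitable for a display field. The ID field has to stay exactly as the connector produced it, since ManifoldCF uses it to delete documents that have disappeared.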


On Fri, May 3, 2013 at 8:32 AM, Karl Wright <daddywri@gmail.com> wrote:

> Here is the code in the JCIFS connector:
>
>               String pathAttributeValue = documentIdentifier;
>               // 3/13/2008
>               // In looking at what comes into the path metadata attribute by default,
>               // and cogitating a bit, I've concluded that the smb:// and the
>               // server/domain name at the start of the path are just plain old noise,
>               // and should be stripped.  This changes a behavior that has been around
>               // for a while, so there is a risk, but a quick back-and-forth with the
>               // SE's leads me to believe that this is safe.
>
>               if (pathAttributeValue.startsWith("smb://"))
>               {
>                 int index = pathAttributeValue.indexOf("/","smb://".length());
>                 if (index == -1)
>                   index = pathAttributeValue.length();
>                 pathAttributeValue = pathAttributeValue.substring(index);
>               }
>               // Now, translate
>               pathAttributeValue = matchMap.translate(pathAttributeValue);
>               pack(sb,pathAttributeValue,'+');
>             }
>             else
>               sb.append('-');
>
> Since the JCIFS connection determines the server name, the document
> identifier does not need to repeat that information.  If you need to send
> the server name to Solr for some reason, you can certainly do that on a
> per-job basis by putting in yet another bit of metadata, via the "Forced
> Metadata" tab in your job.  If you have a really strong reason for
> including the server name in the same path, it would also be possible to
> add another feature to the JCIFS connector to do it based on a checkbox or
> some such; but this would complicate further an already very complicated
> user interface.
>
> It looks, however, like you are trying to construct an IRI, which the
> JCIFS connector is supposed to be doing.  Can you explain what your needs
> are here?  What do you believe is the correct form of an IRI?
>
> Karl
>
>
>
> On Fri, May 3, 2013 at 7:49 AM, Yossi Nachum <nachum234@gmail.com> wrote:
>
>> That is working. I created a path field in my schema and used the "path
>> attribute".
>> I have one problem: I don't see the name of the CIFS server, just the
>> path inside it.
>> I tried to use "Match Regexp" in the metadata tab with the following values:
>> Match regexp: "(.*)"
>> Replace string: "file:////server_name/$(1)"
>>
>> But it did not work. I am still seeing the path only.
>>
>> What am I doing wrong? How can I add my server name to the path?
>>
>> Thanks
>> Yossi
>>
>>
>>
>> On Wed, May 1, 2013 at 4:10 PM, Yossi Nachum <nachum234@gmail.com> wrote:
>>
>>> Thanks, I will try that.
>>> On May 1, 2013 3:54 PM, "Karl Wright" <daddywri@gmail.com> wrote:
>>>
>>>> There is also a different way to do this entirely - there is a path
>>>> attribute you can send as metadata to Solr.  Just include the entire path,
>>>> and put it into a different field that you declare in your schema.  See
>>>> "path attribute" in the end-user documentation for the JCIFS connector.
>>>>
>>>>
>>>>
>>>> On Wed, May 1, 2013 at 8:52 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> IE 6 is extremely old and I believe we developed for IE 7 at a minimum
>>>>> (there were two different versions with different functionality we had to
>>>>> support there), and made further changes for IE 8 when it came out.  I have
>>>>> no idea what IE 9 or IE 10 do.
>>>>>
>>>>> The only way to change the encoding of the IRI is to modify the JCIFS
>>>>> connector code.  But please bear in mind that unless you can show your
>>>>> modifications will work across a wide variety of browsers, we are unlikely
>>>>> to accept these changes back into the code base.
>>>>>
>>>>> The alternative is, since the encoding IS deterministic and
>>>>> reversible, you could readily write a Tika plugin that would modify at
>>>>> least the URL field in the manner you desire.  But you could not modify the
>>>>> ID field since ManifoldCF uses this to delete documents that have
>>>>> disappeared.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 1, 2013 at 8:45 AM, Yossi Nachum <nachum234@gmail.com> wrote:
>>>>>
>>>>>> The IRI is not working in my IE. I am using an old version of IE, V6 SP3.
>>>>>> But what I really want is to display the correct name of the path with
>>>>>> Hebrew characters.
>>>>>> If I understand you right, then I need to change the representation
>>>>>> of the IRI. How can I do that?
>>>>>> On May 1, 2013 3:14 PM, "Karl Wright" <daddywri@gmail.com> wrote:
>>>>>>
>>>>>>> Right, that is exactly what I would expect.
>>>>>>>
>>>>>>> ManifoldCF uses a URL (which is constructed by the connector) as the
>>>>>>> primary key for every document as indexed in the search engine.  The URL
>>>>>>> has two purposes: first, it is supposed to be unique, and second, it is
>>>>>>> supposed to allow someone who browses to that result to locate the
>>>>>>> document.  In the case of JCIFS, the environment is presumed to be the
>>>>>>> local active directory domain(s), and the "URL" generated is really a file
>>>>>>> IRI, usually of the form "file://///server.domain/path/filename".  You thus
>>>>>>> should be able to paste the "URL" of the document from Solr into a browser
>>>>>>> on a machine in the domain, and see the document load.
>>>>>>>
>>>>>>> As I said before, however, there are already certain problems with
>>>>>>> this because each version of IE differs somewhat in how it deals with
>>>>>>> non-ASCII characters.  IRI legal character rules are somewhat different
>>>>>>> than URL rules, but IRI's are still nevertheless escaped in various ways.
>>>>>>> There are also multiple equivalent ways of representing the same file path
>>>>>>> with different IRI's.
>>>>>>>
>>>>>>> It is not typical that the ID and URL fields of a document are
>>>>>>> presented to the user in any meaningful way, so your question is usually
>>>>>>> academic in most settings.  If you have a problem with the IRI's not
>>>>>>> actually working in a browser, that's of more immediate interest.  Please
>>>>>>> let us know if that's the case.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 1, 2013 at 8:04 AM, Yossi Nachum <nachum234@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks for your response.
>>>>>>>> I am seeing these characters in Solr when I search for these files.
>>>>>>>> I am using the Solr example site and these characters show up in
>>>>>>>> the ID field and URL field.
>>>>>>>> BTW, I am running Solr and MCF on a Linux server.
>>>>>>>>  On May 1, 2013 1:11 PM, "Karl Wright" <daddywri@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Where are you seeing these characters?  Are you talking about the
>>>>>>>>> file IRI's that the JCIFS connector generates?  Those IRI's are supposed to
>>>>>>>>> be constructed so that your browser would find them if you paste them into
>>>>>>>>> the browser URL window.  Unfortunately, there is no good standard, and
>>>>>>>>> people follow IE's behavior, and IE has changed multiple times in how it
>>>>>>>>> deals with non-latin-1 characters.
>>>>>>>>>
>>>>>>>>> Please provide a bit more information so that we can provide a
>>>>>>>>> better answer.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, May 1, 2013 at 3:11 AM, Yossi Nachum <nachum234@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>> I installed a search server with Solr and ManifoldCF.
>>>>>>>>>> I want to index my NetApp files over CIFS and I have a problem
>>>>>>>>>> with Hebrew files and directories.
>>>>>>>>>> When I search for these files in Solr I see "%D7%91%D7%..."
>>>>>>>>>> instead of the directory path that contains Hebrew characters.
>>>>>>>>>> I tried to run the Java process with "-Djcifs.encoding=cp1255" but
>>>>>>>>>> it didn't help.
>>>>>>>>>> Can anyone help and tell me how I can index directories/files in
>>>>>>>>>> Hebrew?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Yossi
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>
>
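
The prefix stripping in the JCIFS connector snippet quoted earlier in this thread can be tried in isolation, which shows why the path attribute carries no server name. This is a minimal sketch: the class and method names (`PathAttributeSketch`, `stripServer`) and the sample identifiers are hypothetical, only the string logic mirrors the quoted code, and the later `matchMap.translate` and `pack` steps are left out.

```java
final class PathAttributeSketch {
    private static final String SMB_PREFIX = "smb://";

    // Mirrors the quoted logic: drop "smb://" plus the server/domain name,
    // keeping everything from the first '/' after the prefix onward.
    static String stripServer(String documentIdentifier) {
        if (!documentIdentifier.startsWith(SMB_PREFIX)) {
            return documentIdentifier;
        }
        int index = documentIdentifier.indexOf("/", SMB_PREFIX.length());
        if (index == -1) {
            // No path after the server name: nothing is left to keep.
            index = documentIdentifier.length();
        }
        return documentIdentifier.substring(index);
    }

    public static void main(String[] args) {
        // Hypothetical document identifier, used only for illustration.
        System.out.println(stripServer("smb://server.domain/share/dir/file.txt"));
    }
}
```

Because the server name is already gone by the time the match map runs, adding it back is a job for separate metadata, such as the "Forced Metadata" tab mentioned above.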
