manifoldcf-dev mailing list archives

From Alessandro Benedetti <benedetti.ale...@gmail.com>
Subject Re: [Windows Shares Connector] Un-expected removal of all documents
Date Tue, 07 Apr 2015 12:42:50 GMT
Hi Karl,
Just getting back to this issue: I think I solved it in a quick,
minimally intrusive way:

org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector#getDocumentVersions
org/apache/manifoldcf/crawler/connectors/sharedrive/SharedDriveConnector.java:706

...

catch ( jcifs.smb.SmbAuthException e )
{
    Logging.connectors.warn(
        "JCIFS: Authorization exception reading version information for "
            + documentIdentifier + " - skipping" );
    if ( e.getMessage().equals( "Logon failure: unknown user name or bad password." ) )
        throw new ManifoldCFException( "SmbAuthException thrown: " + e.getMessage(), e );
    else
        rval[i] = null;
}

...

With this change the exception message is checked. If it is a logon
failure, we throw a ManifoldCFException and break out of the iteration
(a logon failure means that no documents will be accessible, but we must
not delete them).

If it is any other authorization exception (for example, permissions
changed on the folder/file), the behaviour is the same as before.
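
To make the intended behaviour concrete, here is a minimal standalone
sketch of the same decision. It is not the connector code and uses no
JCIFS or ManifoldCF classes: AuthException and AbortJobException are
hypothetical stand-ins for SmbAuthException and ManifoldCFException,
getVersions is a simplified imitation of the version loop, and the
"Access is denied." text is only an assumed example of a non-logon
authorization message. The comparison is written null-safe, which the
snippet above is not.

```java
import java.util.Arrays;

/** Standalone sketch of "abort on logon failure, skip on anything else". */
public class AuthFailureSketch {

    // Message text quoted in this thread for the bad-credentials case.
    static final String LOGON_FAILURE =
        "Logon failure: unknown user name or bad password.";

    /** Hypothetical stand-in for jcifs.smb.SmbAuthException. */
    static class AuthException extends Exception {
        AuthException(String message) { super(message); }
    }

    /** Hypothetical stand-in for ManifoldCFException: stops the whole pass. */
    static class AbortJobException extends Exception {
        AbortJobException(String message, Throwable cause) { super(message, cause); }
    }

    /** True only for the logon-failure message; safe if the message is null. */
    static boolean isLogonFailure(Exception e) {
        return LOGON_FAILURE.equals(e.getMessage());
    }

    /**
     * Imitates the version loop. failures[i] is the message of the
     * authorization exception raised for document i, or null if the
     * document could be read.
     */
    static String[] getVersions(String[] failures) throws AbortJobException {
        String[] rval = new String[failures.length];
        for (int i = 0; i < failures.length; i++) {
            try {
                if (failures[i] != null) {
                    throw new AuthException(failures[i]);
                }
                rval[i] = "v" + i; // placeholder version string
            } catch (AuthException e) {
                if (isLogonFailure(e)) {
                    // Bad credentials: nothing will be readable, so stop here
                    // instead of marking every document as gone.
                    throw new AbortJobException(
                        "SmbAuthException thrown: " + e.getMessage(), e);
                }
                rval[i] = null; // any other authorization problem: skip, as before
            }
        }
        return rval;
    }

    public static void main(String[] args) throws AbortJobException {
        String[] mixed = { null, "Access is denied.", null };
        System.out.println(Arrays.toString(getVersions(mixed)));
        try {
            getVersions(new String[] { null, LOGON_FAILURE });
        } catch (AbortJobException e) {
            System.out.println("aborted: " + e.getMessage());
        }
    }
}
```

The point of the sketch is only the branch inside the catch block: one
specific message aborts the pass, every other authorization failure
leaves a null entry for that single document.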

I think this should be enough to be safe. What do you think?

Is any other method affected by this problem?

I think it should be limited to the versions check.


Cheers


2015-04-02 16:32 GMT+01:00 Alessandro Benedetti <benedetti.alex85@gmail.com>:

>
>
> 2015-04-02 15:58 GMT+01:00 Karl Wright <daddywri@gmail.com>:
>
>> Hi Alessandro,
>>
>> Yes, you interpreted my reply correctly.
>>
>> I think we therefore have to perform any checking operations on the actual
>> file being accessed.  This is actually pretty easy to do without
>> sacrificing performance.  All you need to do is the following:
>>
>> try {
>>   ... do the file access operation ...
>> } catch (SmbException e) {
>>   ... figure out from the exception whether to throw a ManifoldCFException
>> or a ServiceInterruption ...
>>   ... If the exception does not include enough to distinguish between bad
>> credentials and insufficient privs, then do a check RIGHT HERE for bad
>> credentials ...
>> }
>>
>> What do you think?  The new code would only ever be called if the document
>> cannot be read.
>>
>
> I think we can proceed as you said. I am currently investigating the
> details returned with the exception (to understand whether there is any
> difference between wrong credentials and access denied).
> If we detect the "wrong credentials" case, we have to throw the exception
> and stop the iteration (this will happen the very first time, assuming no
> one is changing things on the server side).
> In this way we save the time of checking all the files (with wrong
> credentials, none of them will be accessible).
>
> Another way would be to do this credential check at the beginning and stop
> only if the credentials are wrong (leaving the permission check file by
> file).
>
> Quite a confusing scenario, but we can sort this out with small changes :)
>
>
>
>>
>> Karl
>>
>>
>> On Thu, Apr 2, 2015 at 10:42 AM, Alessandro Benedetti <
>> benedetti.alex85@gmail.com> wrote:
>>
>> > OK, I am currently working on that, and I will work on it next Tuesday
>> > as well.
>> > But what about point 2:
>> > "(2) the check itself is
>> > specific to the ROOT of the tree, which the user may not have access to."
>> >
>> > I think I understand the problem: a possible scenario is one where you
>> > configure the repository connector with a user that is not able to
>> > access the root but is able to access the directories we want to crawl.
>> > In such a case the repository connector will appear unable to
>> > connect, while crawling will still be possible if you configure the
>> > accessible directories in the job.
>> > If this is correct, the situation is more complicated ...
>> >
>> > Cheers
>> >
>> >
>> > 2015-03-31 16:44 GMT+01:00 Karl Wright <daddywri@gmail.com>:
>> >
>> > > Hi Alessandro,
>> > >
>> > > Your code snippet has two problems: (1) it doesn't distinguish between
>> > > service interruptions and bad credentials,
>> >
>> >
>> > Shouldn't that be the difference between the IOException and the SmbException?
>> >
>> >
>> > > and (2) the check itself is
>> > > specific to the ROOT of the tree, which the user may not have access to.
>> > >
>> >
>> >
>> >
>> > > In check() we can get away with this but if you wire up the check() logic
>> > > into the crawl processing it will break some people.
>> > >
>> > > The first problem, (1), is exactly what we need to figure out anyway.
>> > >
>> > > Karl
>> > >
>> > >
>> > > On Tue, Mar 31, 2015 at 11:30 AM, Alessandro Benedetti <
>> > > benedetti.alex85@gmail.com> wrote:
>> > >
>> > > > Hi Karl, comments follow:
>> > > >
>> > > > 2015-03-31 16:18 GMT+01:00 Karl Wright <daddywri@gmail.com>:
>> > > >
>> > > > > Hi Alessandro,
>> > > > >
>> > > > > There are situations where the check() method does not succeed but you can
>> > > > > still crawl.  So I would not do it that way, since it fundamentally changes
>> > > > > the contract.
>> > > > >
>> > > >
>> > > > Am I wrong, or should we assume that the check() method works as it was
>> > > > built to?
>> > > > I mean, if in some cases this method is wrongly implemented, that should
>> > > > not break another assumption.
>> > > >
>> > > > >
>> > > > > My proposal is to have processDocuments ABORT the job when it finds bad
>> > > > > credentials.  That's very fast and will not permit a job to run for a long
>> > > > > time.
>> > > > >
>> > > > > The trick is to determine if there are bad credentials WITHOUT doing any
>> > > > > more work in the processDocuments pathway than we currently are.  An
>> > > > > exception will be thrown either way, but we need to figure out whether
>> > > > > there is any information in the exception that we can use to decide between
>> > > > > bad credentials and no access permissions.
>> > > > >
>> > > > > You can help provide that by doing a simple experiment on your client's
>> > > > > hardware (or yours, if you have such hardware in house).  Change the
>> > > > > credential to an invalid one and see what the exception details are.  Then
>> > > > > change to valid credentials and try to crawl a directory that is not
>> > > > > visible to the credentialed user you supplied, and make a note of the
>> > > > > exception details in that case too.
>> > > > >
>> > > >
>> > > > I was thinking of slightly modifying the getSession() method by adding a
>> > > > file-exists check, something like this:
>> > > >
>> > > > ...
>> > > >
>> > > > try
>> > > > {
>> > > >     // use NtlmPasswordAuthentication so that we can reuse credential for DFS support
>> > > >     pa = new NtlmPasswordAuthentication( domain, username, password );
>> > > >     SmbFile smbconnection = new SmbFile( "smb://" + server + "/", pa );
>> > > >     smbconnectionPath = getFileCanonicalPath( smbconnection );
>> > > >     smbconnection.exists();
>> > > > }
>> > > > catch ( MalformedURLException e )
>> > > > {
>> > > >     Logging.connectors.error(
>> > > >         "Unable to access SMB/CIFS share: " + "smb://" + ( ( domain == null ) ? "" : domain ) + ";"
>> > > >             + username + ":<password>@" + server + "/\n" + e );
>> > > >     throw new ManifoldCFException( "Unable to access SMB/CIFS share: " + server, e,
>> > > >         ManifoldCFException.REPOSITORY_CONNECTION_ERROR );
>> > > > } catch (SmbException e) {
>> > > >     Logging.connectors.error(
>> > > >         "Unable to access SMB/CIFS share: Credential not valid - "
>> > > >             + "smb://" + ((domain == null) ? "" : domain) + ";"
>> > > >             + username + ":<password>@" + server + "/\n" + e);
>> > > >     throw new ManifoldCFException( "Unable to access SMB/CIFS share: Credential not valid - " + server, e,
>> > > >         ManifoldCFException.REPOSITORY_CONNECTION_ERROR );
>> > > > }
>> > > >
>> > > > Catching the SmbException should do the trick.
>> > > > Anyway, I will look into it in more detail.
>> > > >
>> > > > Cheers
>> > > >
>> > > >
>> > > > > Karl
>> > > > >
>> > > > >
>> > > > > On Tue, Mar 31, 2015 at 10:50 AM, Alessandro Benedetti <
>> > > > > benedetti.alex85@gmail.com> wrote:
>> > > > >
>> > > > > > Currently we are checking each of the String[] oldVersions, trying to
>> > > > > > access it ...
>> > > > > > So in the scenario I described the current performance is quite bad...
>> > > > > > We need to avoid scanning the old docs at all if we know the
>> > > > > > provided credentials are no longer valid.
>> > > > > >
>> > > > > > Let me be extreme: what about not allowing the job to start at all if
>> > > > > > the Repository Connector is currently broken (i.e. the connection is not
>> > > > > > working, and we know that from the check method)?
>> > > > > > In this way we avoid destroying existing indexes and we simply
>> > > > > > show a message in the job saying that the job cannot start
>> > > > > > because the Repository Connector is currently offline (and showing the
>> > > > > > explanation).
>> > > > > >
>> > > > > > Does this make sense ?
>> > > > > >
>> > > > > > 2015-03-31 15:30 GMT+01:00 Karl Wright <daddywri@gmail.com>:
>> > > > > >
>> > > > > > > Hi Alessandro,
>> > > > > > >
>> > > > > > > If you put a check in the processDocuments method, it will be called for
>> > > > > > > every group of documents.  That's fine, but if you structure it as a
>> > > > > > > separate call it would impact performance.  That is why I suggest just
>> > > > > > > doing a better job of interpreting the existing exceptions.
>> > > > > > >
>> > > > > > > Karl
>> > > > > > >
>> > > > > > >
>> > > > > > > On Tue, Mar 31, 2015 at 10:27 AM, Alessandro Benedetti <
>> > > > > > > benedetti.alex85@gmail.com> wrote:
>> > > > > > >
>> > > > > > > > In addition, this should be quite simple: do not proceed with the
>> > > > > > > > processDocuments method if the RepositoryConnector is not able to
>> > > > > > > > connect (the check method does not return a proper message).
>> > > > > > > >
>> > > > > > > > Right?
>> > > > > > > > I am wondering where the proper place to add this is :)
>> > > > > > > >
>> > > > > > > > Cheers
>> > > > > > > >
>> > > > > > > > 2015-03-31 14:59 GMT+01:00 Alessandro Benedetti <benedetti.alex85@gmail.com>:
>> > > > > > > >
>> > > > > > > > > Yes Karl,
>> > > > > > > > > I was thinking exactly that: first check whether the credentials are
>> > > > > > > > > valid, before scanning all the documents.
>> > > > > > > > > This is because per-file permissions depend on users/groups, but the
>> > > > > > > > > current scenario does not invalidate the user; it invalidates that
>> > > > > > > > > user's access.
>> > > > > > > > >
>> > > > > > > > > An error must be thrown, but the docs must not be deleted (not even
>> > > > > > > > > scanned).
>> > > > > > > > >
>> > > > > > > > > Furthermore, what will happen if the server is down?
>> > > > > > > > > Are we safe in that scenario?
>> > > > > > > > >
>> > > > > > > > > Cheers
>> > > > > > > > >
>> > > > > > > > > 2015-03-31 14:42 GMT+01:00 Karl Wright <daddywri@gmail.com>:
>> > > > > > > > >
>> > > > > > > > >> This is actually pretty standard behavior across our connector family, and
>> > > > > > > > >> has been true since Day One.  The behavior comes from the basic broad
>> > > > > > > > >> requirement that the crawler should keep going and skip the document when
>> > > > > > > > >> the permissions do not allow it to be fetched.  With the Windows Share
>> > > > > > > > >> connector, it's sometimes the case (when DFS is used a lot) that whole
>> > > > > > > > >> subtrees of documents are not fetchable using the credentials supplied.  So
>> > > > > > > > >> it is not so easy to just check for valid credentials at the beginning.
>> > > > > > > > >>
>> > > > > > > > >> For a solution, I'd be inclined to look for a way to figure out if the
>> > > > > > > > >> credentials are actually *invalid*, and abort the job if so.  This is
>> > > > > > > > >> distinct from the case where the credentials are valid but the connector
>> > > > > > > > >> doesn't have permissions to read the document.  It will take some
>> > > > > > > > >> experimentation to see if we get back different exception text in the two
>> > > > > > > > >> situations.
>> > > > > > > > >>
>> > > > > > > > >> Karl
>> > > > > > > > >>
>> > > > > > > > >>
>> > > > > > > > >> On Tue, Mar 31, 2015 at 9:30 AM, Alessandro Benedetti <abenedetti@apache.org> wrote:
>> > > > > > > > >>
>> > > > > > > > >> > Hi guys,
>> > > > > > > > >> > playing with the Windows Shares Connector in ManifoldCF 1.8 I encountered
>> > > > > > > > >> > this problem:
>> > > > > > > > >> >
>> > > > > > > > >> > *Scenario*
>> > > > > > > > >> > *1)* Indexing a Windows Shares server
>> > > > > > > > >> > *2)* Indexing finishes successfully with N docs indexed
>> > > > > > > > >> > *3)* Offline, while no indexing is happening, the Administrator password
>> > > > > > > > >> > changes on the Shares server side
>> > > > > > > > >> > *4)* The Repository Connector is not able to connect anymore (of course,
>> > > > > > > > >> > because the password has changed)
>> > > > > > > > >> > *5)* In the next indexing cycle, ALL docs are removed from the index.
>> > > > > > > > >> >
>> > > > > > > > >> > *Expected Behaviour*
>> > > > > > > > >> > As a user I would like to see an error message that will let me understand
>> > > > > > > > >> > the issue, without losing all my N indexed docs.
>> > > > > > > > >> >
>> > > > > > > > >> > *Reason*
>> > > > > > > > >> > Taking a look at the code, the problem seems to be in
>> > > > > > > > >> > org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector#getDocumentVersions
>> > > > > > > > >> > where it tries to access each document individually through Samba,
>> > > > > > > > >> > removing them one by one if they are no longer reachable.
>> > > > > > > > >> >
>> > > > > > > > >> > *Solution*
>> > > > > > > > >> > Before scanning each document, we have to be sure the connection is
>> > > > > > > > >> > working.
>> > > > > > > > >> > If not, the scan is only harmful.
>> > > > > > > > >> >
>> > > > > > > > >> > I will continue investigating, but I would like your opinion as well
>> > > > > > > > >> > Cheers
>> > > > > > > > >> >
>> > > > > > > > >> >
>> > > > > > > > >> >
>> > > > > > > > >> >
>> > > > > > > > >> >
>> > > > > > > > >> >
>> > > > > > > > >> > --
>> > > > > > > > >> > --------------------------
>> > > > > > > > >> >
>> > > > > > > > >> > Benedetti Alessandro
>> > > > > > > > >> > Visiting card : http://about.me/alessandro_benedetti
>> > > > > > > > >> >
>> > > > > > > > >> > "Tyger, tyger burning bright
>> > > > > > > > >> > In the forests of the night,
>> > > > > > > > >> > What immortal hand or eye
>> > > > > > > > >> > Could frame thy fearful symmetry?"
>> > > > > > > > >> >
>> > > > > > > > >> > William Blake - Songs of Experience -1794 England
>> > > > > > > > >> >
>> > > > > > > > >>
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > >
>> >
>> >
>> >
>> >
>>
>
>
>
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
