manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: [Windows Shares Connector] Un-expected removal of all documents
Date Tue, 31 Mar 2015 15:44:40 GMT
Hi Alessandro,

Your code snippet has two problems: (1) it doesn't distinguish between
service interruptions and bad credentials, and (2) the check itself is
specific to the ROOT of the tree, which the user may not have access to.
In check() we can get away with this but if you wire up the check() logic
into the crawl processing it will break some people.

The first problem, (1), is exactly what we need to figure out anyway.

Karl
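
One way the distinction in (1) could be drawn is sketched below. It is only an illustration: it assumes the exception thrown by the SMB library exposes an NT status code, which is exactly what the experiment proposed in this thread would need to confirm. The class name, the Action enum and the choice of status codes are hypothetical and not part of ManifoldCF; the numeric values are the standard Windows NTSTATUS codes.

```java
/** Hypothetical helper, not part of ManifoldCF: maps an NT status code to a crawl action. */
final class SmbFailureClassifier {

    /** What the caller could do with the current job or document. */
    enum Action { ABORT_JOB, SKIP_DOCUMENT, RETRY_LATER }

    // Standard Windows NTSTATUS values.
    static final int STATUS_ACCESS_DENIED      = 0xC0000022;
    static final int STATUS_WRONG_PASSWORD     = 0xC000006A;
    static final int STATUS_LOGON_FAILURE      = 0xC000006D;
    static final int STATUS_PASSWORD_EXPIRED   = 0xC0000071;
    static final int STATUS_ACCOUNT_DISABLED   = 0xC0000072;
    static final int STATUS_ACCOUNT_LOCKED_OUT = 0xC0000234;

    private SmbFailureClassifier() {}

    /**
     * Assumption: ntStatus was read from the exception raised while fetching a document.
     * Anything unrecognised is treated as a service interruption, so that no
     * already-indexed document is deleted on the strength of an unknown error.
     */
    static Action classify(int ntStatus) {
        switch (ntStatus) {
            case STATUS_WRONG_PASSWORD:
            case STATUS_LOGON_FAILURE:
            case STATUS_PASSWORD_EXPIRED:
            case STATUS_ACCOUNT_DISABLED:
            case STATUS_ACCOUNT_LOCKED_OUT:
                // The credentials themselves are bad: stop the job quickly.
                return Action.ABORT_JOB;
            case STATUS_ACCESS_DENIED:
                // Valid credentials, but no permission on this item: keep crawling.
                return Action.SKIP_DOCUMENT;
            default:
                return Action.RETRY_LATER;
        }
    }

    public static void main(String[] args) {
        int[] samples = { STATUS_LOGON_FAILURE, STATUS_ACCESS_DENIED, 0xC00000CC };
        for (int status : samples) {
            System.out.println(String.format("0x%08X -> %s", status, classify(status)));
        }
    }
}
```

The classification works on the exception that processDocuments already receives, so it adds no extra round trip to the server, and it never touches the root of the tree.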


On Tue, Mar 31, 2015 at 11:30 AM, Alessandro Benedetti <
benedetti.alex85@gmail.com> wrote:

> Hi Karl, comments follow:
>
> 2015-03-31 16:18 GMT+01:00 Karl Wright <daddywri@gmail.com>:
>
> > Hi Alessandro,
> >
> > There are situations where the check() method does not succeed but you can
> > still crawl.  So I would not do it that way, since it fundamentally changes
> > the contract.
> >
>
> Am I wrong, or should we assume that the check() method works as designed?
> I mean, if in some cases this method is wrongly implemented, that cannot
> break another assumption.
>
> >
> > My proposal is to have processDocuments ABORT the job when it finds bad
> > credentials.  That's very fast and will not permit a job to run for a long
> > time.
> >
> > The trick is to determine if there are bad credentials WITHOUT doing any
> > more work in the processDocuments pathway than we currently are.  An
> > exception will be thrown either way, but we need to figure out whether
> > there is any information in the exception that we can use to decide between
> > bad credentials and no access permissions.
> >
> > You can help provide that by doing a simple experiment on your client's
> > hardware (or yours, if you have such hardware in house).  Change the
> > credential to an invalid one and see what the exception details are.  Then
> > change to valid credentials and try to crawl a directory that is not
> > visible to the credentialed user you supplied, and make a note of the
> > exception details in that case too.
> >
>
> I was thinking of slightly modifying the getSession() method by adding a
> file-exists check, something like this:
>
> ...
>
> try
> {
>     // Use NtlmPasswordAuthentication so that we can reuse the credential for DFS support.
>     pa = new NtlmPasswordAuthentication( domain, username, password );
>     SmbFile smbconnection = new SmbFile( "smb://" + server + "/", pa );
>     smbconnectionPath = getFileCanonicalPath( smbconnection );
>     // Force a round trip to the server; exists() throws SmbException on failure.
>     smbconnection.exists();
> }
> catch ( MalformedURLException e )
> {
>     Logging.connectors.error(
>         "Unable to access SMB/CIFS share: " + "smb://" + ( ( domain == null ) ? "" : domain ) + ";"
>             + username + ":<password>@" + server + "/\n" + e );
>     throw new ManifoldCFException( "Unable to access SMB/CIFS share: " + server, e,
>         ManifoldCFException.REPOSITORY_CONNECTION_ERROR );
> }
> catch ( SmbException e )
> {
>     Logging.connectors.error(
>         "Unable to access SMB/CIFS share: Credential not valid - " + "smb://" + ( ( domain == null ) ? "" : domain ) + ";"
>             + username + ":<password>@" + server + "/\n" + e );
>     throw new ManifoldCFException( "Unable to access SMB/CIFS share: Credential not valid - " + server, e,
>         ManifoldCFException.REPOSITORY_CONNECTION_ERROR );
> }
>
> Catching the SmbException should do the trick.
> Anyway, I will look into it in more detail.
>
> Cheers
>
>
> > Karl
> >
> >
> > On Tue, Mar 31, 2015 at 10:50 AM, Alessandro Benedetti <
> > benedetti.alex85@gmail.com> wrote:
> >
> > > Currently we check each of the String[] oldVersions by trying to
> > > access it, so in the scenario I described the current performance is
> > > quite bad.  We need to avoid scanning the old documents altogether if we
> > > know the provided credentials are no longer valid.
> > >
> > > Let me be extreme: what about not allowing the job to start at all if
> > > the Repository Connector is currently broken (i.e. the connection is not
> > > working, and we know that from the check() method)?
> > > That way we avoid destroying existing indexes, and we simply show a
> > > message on the job saying that it cannot start because the Repository
> > > Connector is currently offline (along with the explanation).
> > >
> > > Does this make sense?
> > >
> > > 2015-03-31 15:30 GMT+01:00 Karl Wright <daddywri@gmail.com>:
> > >
> > > > Hi Alessandro,
> > > >
> > > > If you put a check in the processDocuments method, it will be called for
> > > > every group of documents.  That's fine, but if you structure it as a
> > > > separate call it would impact performance.  That is why I suggest just
> > > > doing a better job of interpreting the existing exceptions.
> > > >
> > > > Karl
> > > >
> > > >
> > > > On Tue, Mar 31, 2015 at 10:27 AM, Alessandro Benedetti <
> > > > benedetti.alex85@gmail.com> wrote:
> > > >
> > > > > In addition, this should be quite simple: do not proceed with the
> > > > > processDocuments method if the RepositoryConnector is not able to
> > > > > connect (i.e. the check method does not return a success message).
> > > > >
> > > > > Right?
> > > > > I am wondering where the proper place to add this is :)
> > > > >
> > > > > Cheers
> > > > >
> > > > > 2015-03-31 14:59 GMT+01:00 Alessandro Benedetti <benedetti.alex85@gmail.com>:
> > > > >
> > > > > > Yes Karl,
> > > > > > I was thinking exactly that: first check whether the credentials are
> > > > > > valid, before scanning all the documents.
> > > > > > This is because per-file permissions depend on users/groups, but the
> > > > > > current scenario does not invalidate the user; it invalidates that
> > > > > > user's access.
> > > > > >
> > > > > > An error must be thrown, but the docs must not be deleted (not even
> > > > > > scanned).
> > > > > >
> > > > > > Furthermore, what happens if the server is down?
> > > > > > Are we safe in that scenario?
> > > > > >
> > > > > > Cheers
> > > > > >
> > > > > > 2015-03-31 14:42 GMT+01:00 Karl Wright <daddywri@gmail.com>:
> > > > > >
> > > > > >> This is actually pretty standard behavior across our connector family,
> > > > > >> and has been true since Day One.  The behavior comes from the basic
> > > > > >> broad requirement that the crawler should keep going and skip the
> > > > > >> document when the permissions do not allow it to be fetched.  With the
> > > > > >> Windows Share connector, it's sometimes the case (when DFS is used a
> > > > > >> lot) that whole subtrees of documents are not fetchable using the
> > > > > >> credentials supplied.  So it is not so easy to just check for valid
> > > > > >> credentials at the beginning.
> > > > > >>
> > > > > >> For a solution, I'd be inclined to look for a way to figure out if the
> > > > > >> credentials are actually *invalid*, and abort the job if so.  This is
> > > > > >> distinct from the case where the credentials are valid but the
> > > > > >> connector doesn't have permissions to read the document.  It will take
> > > > > >> some experimentation to see if we get back different exception text in
> > > > > >> the two situations.
> > > > > >>
> > > > > >> Karl
> > > > > >>
> > > > > >>
> > > > > >> On Tue, Mar 31, 2015 at 9:30 AM, Alessandro Benedetti <abenedetti@apache.org> wrote:
> > > > > >>
> > > > > >> > Hi guys,
> > > > > >> > playing with the Windows Shares Connector in ManifoldCF 1.8 I
> > > > > >> > encountered this problem:
> > > > > >> >
> > > > > >> > *Scenario*
> > > > > >> > *1)* Indexing a Windows Shares server
> > > > > >> > *2)* Indexing finishes successfully with N docs indexed
> > > > > >> > *3)* Offline, while no indexing is happening, the Administrator
> > > > > >> > password changes on the Shares server side
> > > > > >> > *4)* The Repository Connector is no longer able to connect (of
> > > > > >> > course, because the password has changed)
> > > > > >> > *5)* On the next indexing cycle, ALL docs are removed from the index.
> > > > > >> >
> > > > > >> > *Expected Behaviour*
> > > > > >> > As a user I would like to see an error message that lets me
> > > > > >> > understand the issue, without losing all my N indexed docs.
> > > > > >> >
> > > > > >> > *Reason*
> > > > > >> > Looking at the code, the problem seems to be in
> > > > > >> > org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector#getDocumentVersions
> > > > > >> > where it tries to access each document individually through Samba,
> > > > > >> > removing them one by one if they are no longer reachable.
> > > > > >> >
> > > > > >> > *Solution*
> > > > > >> > Before scanning each document, we have to be sure the connection
> > > > > >> > is working.
> > > > > >> > If it is not, scanning is only harmful.
> > > > > >> >
> > > > > >> > I will continue investigating, but I would like your opinion as well.
> > > > > >> > Cheers
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > --
> > > > > >> > --------------------------
> > > > > >> >
> > > > > >> > Benedetti Alessandro
> > > > > >> > Visiting card : http://about.me/alessandro_benedetti
> > > > > >> >
> > > > > >> > "Tyger, tyger burning bright
> > > > > >> > In the forests of the night,
> > > > > >> > What immortal hand or eye
> > > > > >> > Could frame thy fearful symmetry?"
> > > > > >> >
> > > > > >> > William Blake - Songs of Experience -1794 England
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > >
> >
>
>
>
>
