manifoldcf-dev mailing list archives

From Alessandro Benedetti <benedetti.ale...@gmail.com>
Subject Re: [Windows Shares Connector] Un-expected removal of all documents
Date Thu, 02 Apr 2015 15:32:56 GMT
2015-04-02 15:58 GMT+01:00 Karl Wright <daddywri@gmail.com>:

> Hi Alessandro,
>
> Yes, you interpreted my reply correctly.
>
> I think we therefore have to perform any checking operations on the actual
> file being accessed.  This is actually pretty easy to do without
> sacrificing performance.  All you need to do is the following:
>
> try {
>   ... do the file access operation ...
> } catch (SmbException e) {
>   ... figure out from the exception whether to throw a ManifoldCFException
>   or a ServiceInterruption ...
>   ... If the exception does not include enough to distinguish between bad
>   credentials and insufficient privs, then do a check RIGHT HERE for bad
>   credentials ...
> }
>
> What do you think?  The new code would only ever be called if the document
> cannot be read.
>

I think we can proceed as you suggest. I am currently investigating the
details returned in the exception, to understand whether anything
distinguishes wrong credentials from access denied.
If we detect wrong credentials, we have to throw the exception and stop
the iteration (this will happen on the very first file, assuming no one is
changing things on the server side in the meantime).
That way we save the time spent checking all the files (with wrong
credentials none of them will be accessible).
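
To make the idea concrete, here is a minimal sketch in plain Java of the classification step. It assumes the SMB failure carries an NT status code that differs between a rejected logon and a denied file access, which is exactly what still has to be confirmed by the experiment. The names SmbFailureClassifier, Outcome and classify are made up for illustration and are not part of ManifoldCF or jCIFS; the numeric values are the standard Windows NT status codes. Treating every unknown status as "retry later" is also an assumption, chosen so that nothing gets deleted on an error we do not understand.

```java
/** Hypothetical helper: maps an NT status code to the action the crawler should take. */
final class SmbFailureClassifier {

    /** What the caller should do with the failure. */
    enum Outcome { ABORT_JOB, SKIP_DOCUMENT, RETRY_LATER }

    // Standard Windows NT status codes.
    static final int NT_STATUS_ACCESS_DENIED      = 0xC0000022;
    static final int NT_STATUS_WRONG_PASSWORD     = 0xC000006A;
    static final int NT_STATUS_LOGON_FAILURE      = 0xC000006D;
    static final int NT_STATUS_PASSWORD_EXPIRED   = 0xC0000071;
    static final int NT_STATUS_ACCOUNT_DISABLED   = 0xC0000072;
    static final int NT_STATUS_ACCOUNT_LOCKED_OUT = 0xC0000234;

    private SmbFailureClassifier() {
        // static helper, not meant to be instantiated
    }

    /** Decides between "bad credentials", "no permission on this file" and "anything else". */
    static Outcome classify(int ntStatus) {
        switch (ntStatus) {
            case NT_STATUS_WRONG_PASSWORD:
            case NT_STATUS_LOGON_FAILURE:
            case NT_STATUS_PASSWORD_EXPIRED:
            case NT_STATUS_ACCOUNT_DISABLED:
            case NT_STATUS_ACCOUNT_LOCKED_OUT:
                // The account itself is unusable: no file will be readable, stop the job.
                return Outcome.ABORT_JOB;
            case NT_STATUS_ACCESS_DENIED:
                // The logon worked but this file is off limits: keep crawling.
                return Outcome.SKIP_DOCUMENT;
            default:
                // Assumption: treat everything else as transient so no document is removed.
                return Outcome.RETRY_LATER;
        }
    }
}
```

In the connector, ABORT_JOB would correspond to throwing a ManifoldCFException and RETRY_LATER to a ServiceInterruption. If I read the jCIFS API correctly the code is available through SmbException.getNtStatus(), but please take that as an assumption until it is checked against the version we ship.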

Another option is to do the credential check once at the beginning and stop
only if the credentials are wrong (leaving the permission check to be done
file by file).
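
A rough sketch of this second option, again in plain Java and not tied to the real connector classes: one credential probe before the loop, then the usual per-file test. UpFrontCheckCrawl, BadCredentialsException and findDocumentsToRemove are hypothetical names, and the two callbacks stand in for whatever SMB calls we end up using. Keep in mind the point you raised earlier: a probe against the root can fail for a user who can still read the configured directories, so the probe would have to target something that user is known to reach.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.Predicate;

/** Hypothetical crawl skeleton: one credential check up front, permission checks per file. */
final class UpFrontCheckCrawl {

    /** Thrown when the logon is rejected; nothing has been scanned or removed at that point. */
    static final class BadCredentialsException extends Exception {
        BadCredentialsException(String message) {
            super(message);
        }
    }

    private UpFrontCheckCrawl() {
        // static helper, not meant to be instantiated
    }

    /**
     * @param paths            documents known from the previous crawl
     * @param credentialsValid hypothetical probe, false when the logon is rejected
     * @param readable         hypothetical per-file test, true when the user may read the path
     * @return the paths to drop from the index (only the unreadable ones)
     */
    static List<String> findDocumentsToRemove(List<String> paths,
                                              BooleanSupplier credentialsValid,
                                              Predicate<String> readable)
            throws BadCredentialsException {
        // Fail fast: with bad credentials every file would look unreachable.
        if (!credentialsValid.getAsBoolean()) {
            throw new BadCredentialsException("Credentials rejected; index left untouched");
        }
        List<String> toRemove = new ArrayList<>();
        for (String path : paths) {
            // Credentials are fine, so an unreadable file is a genuine permission issue.
            if (!readable.test(path)) {
                toRemove.add(path);
            }
        }
        return toRemove;
    }
}
```

The trade-off compared with interpreting the exception is one extra round trip per job, in exchange for never entering the per-file loop with a broken account.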

Quite a confusing scenario, but we can sort it out with a few small changes :)



>
> Karl
>
>
> On Thu, Apr 2, 2015 at 10:42 AM, Alessandro Benedetti <
> benedetti.alex85@gmail.com> wrote:
>
> > OK, I am currently working on that, and I will work on it next Tuesday as
> > well.
> > But what about point 2:
> > "(2) the check itself is
> > specific to the ROOT of the tree, which the user may not have access to."
> >
> > I think I see the problem: the repository connector may be configured
> > with a user that cannot access the root but can access the directories
> > we want to crawl.
> > In that case the repository connector will appear unable to connect,
> > while crawling is still possible if you configure the accessible
> > directories in the job.
> > If this is correct, the situation is more complicated ...
> >
> > Cheers
> >
> >
> > 2015-03-31 16:44 GMT+01:00 Karl Wright <daddywri@gmail.com>:
> >
> > > Hi Alessandro,
> > >
> > > Your code snippet has two problems: (1) it doesn't distinguish between
> > > service interruptions and bad credentials,
> >
> >
> > Shouldn't that be the difference between the IOException and the SmbException?
> >
> >
> > > and (2) the check itself is
> > > specific to the ROOT of the tree, which the user may not have access to.
> > >
> >
> >
> >
> > > In check() we can get away with this but if you wire up the check() logic
> > > into the crawl processing it will break some people.
> > >
> > > The first problem, (1), is exactly what we need to figure out anyway.
> > >
> > > Karl
> > >
> > >
> > > On Tue, Mar 31, 2015 at 11:30 AM, Alessandro Benedetti <
> > > benedetti.alex85@gmail.com> wrote:
> > >
> > > > Hi karl comments follow :
> > > >
> > > > 2015-03-31 16:18 GMT+01:00 Karl Wright <daddywri@gmail.com>:
> > > >
> > > > > Hi Alessandro,
> > > > >
> > > > > There are situations where the check() method does not succeed but you can
> > > > > still crawl.  So I would not do it that way, since it fundamentally changes
> > > > > the contract.
> > > > >
> > > >
> > > > Am I wrong, or should we assume the check() method works as it was built to?
> > > > I mean, if in some cases this method is wrongly implemented, that should not
> > > > break another assumption.
> > > >
> > > > >
> > > > > My proposal is to have processDocuments ABORT the job when it finds bad
> > > > > credentials.  That's very fast and will not permit a job to run for a long
> > > > > time.
> > > > >
> > > > > The trick is to determine if there are bad credentials WITHOUT doing any
> > > > > more work in the processDocuments pathway than we currently are.  An
> > > > > exception will be thrown either way, but we need to figure out whether
> > > > > there is any information in the exception that we can use to decide between
> > > > > bad credentials and no access permissions.
> > > > >
> > > > > You can help provide that by doing a simple experiment on your client's
> > > > > hardware (or yours, if you have such hardware in house).  Change the
> > > > > credential to an invalid one and see what the exception details are.  Then
> > > > > change to valid credentials and try to crawl a directory that is not
> > > > > visible to the credentialed user you supplied, and make a note of the
> > > > > exception details in that case too.
> > > > >
> > > >
> > > > I was thinking of slightly modifying the getSession() method by adding a
> > > > file-exists check, something like this:
> > > >
> > > > ...
> > > >
> > > > try
> > > > {
> > > >     // use NtlmPasswordAuthentication so that we can reuse credential for DFS support
> > > >     pa = new NtlmPasswordAuthentication( domain, username, password );
> > > >     SmbFile smbconnection = new SmbFile( "smb://" + server + "/", pa );
> > > >     smbconnectionPath = getFileCanonicalPath( smbconnection );
> > > >     smbconnection.exists();
> > > > }
> > > > catch ( MalformedURLException e )
> > > > {
> > > >     Logging.connectors.error(
> > > >         "Unable to access SMB/CIFS share: " + "smb://" + ( ( domain == null ) ? "" : domain ) + ";"
> > > >             + username + ":<password>@" + server + "/\n" + e );
> > > >     throw new ManifoldCFException( "Unable to access SMB/CIFS share: " + server, e,
> > > >         ManifoldCFException.REPOSITORY_CONNECTION_ERROR );
> > > > }
> > > > catch ( SmbException e )
> > > > {
> > > >     Logging.connectors.error(
> > > >         "Unable to access SMB/CIFS share: Credential not valid - " + "smb://" + ( ( domain == null ) ? "" : domain ) + ";"
> > > >             + username + ":<password>@" + server + "/\n" + e );
> > > >     throw new ManifoldCFException( "Unable to access SMB/CIFS share: Credential not valid - " + server, e,
> > > >         ManifoldCFException.REPOSITORY_CONNECTION_ERROR );
> > > > }
> > > >
> > > > Catching the SmbException should do the trick.
> > > > Anyway, I will look into it in more detail.
> > > >
> > > > Cheers
> > > >
> > > >
> > > > > Karl
> > > > >
> > > > >
> > > > > On Tue, Mar 31, 2015 at 10:50 AM, Alessandro Benedetti <
> > > > > benedetti.alex85@gmail.com> wrote:
> > > > >
> > > > > > Currently we are checking each of the String[] oldVersions, trying to
> > > > > > access it ...
> > > > > > So in the scenario I described the current performance is quite bad...
> > > > > > We would need to avoid scanning the oldDocs at all if we know the
> > > > > > provided credentials are not valid anymore.
> > > > > >
> > > > > > Let me be extreme, but what about not allowing the job to start at all if
> > > > > > the Repository Connector is currently broken (i.e. the connection is not
> > > > > > working, and we know that because of the check method)?
> > > > > > In this way we avoid destroying already existing indexes and we simply
> > > > > > show a message in the job advising that the job cannot start
> > > > > > because the Repository Connector is currently offline (and showing the
> > > > > > explanation).
> > > > > >
> > > > > > Does this make sense?
> > > > > >
> > > > > > 2015-03-31 15:30 GMT+01:00 Karl Wright <daddywri@gmail.com>:
> > > > > >
> > > > > > > Hi Alessandro,
> > > > > > >
> > > > > > > If you put a check in the processDocuments method, it will be called for
> > > > > > > every group of documents.  That's fine, but if you structure it as a
> > > > > > > separate call it would impact performance.  That is why I suggest just
> > > > > > > doing a better job of interpreting the existing exceptions.
> > > > > > >
> > > > > > > Karl
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Mar 31, 2015 at 10:27 AM, Alessandro Benedetti <
> > > > > > > benedetti.alex85@gmail.com> wrote:
> > > > > > >
> > > > > > > > As an addition, this should be quite simple: do not proceed with the
> > > > > > > > processDocuments method if the RepositoryConnector is not able to
> > > > > > > > connect (i.e. the check method does not return a proper message).
> > > > > > > >
> > > > > > > > Right?
> > > > > > > > Wondering where the proper point to add this is :)
> > > > > > > >
> > > > > > > > Cheers
> > > > > > > >
> > > > > > > > 2015-03-31 14:59 GMT+01:00 Alessandro Benedetti <
> > > > > > > > benedetti.alex85@gmail.com>
> > > > > > > > :
> > > > > > > >
> > > > > > > > > Yes Karl,
> > > > > > > > > I was thinking exactly that: first check whether the credentials are
> > > > > > > > > valid, before scanning all the documents.
> > > > > > > > > This is because per-file permissions depend on users/groups, but the
> > > > > > > > > current scenario is not invalidating the user, but invalidating the
> > > > > > > > > access of that user.
> > > > > > > > >
> > > > > > > > > An error must be thrown, but the docs must not be deleted (not even
> > > > > > > > > scanned).
> > > > > > > > >
> > > > > > > > > Furthermore, what will happen if the server is down?
> > > > > > > > > Are we safe in that scenario?
> > > > > > > > >
> > > > > > > > > Cheers
> > > > > > > > >
> > > > > > > > > 2015-03-31 14:42 GMT+01:00 Karl Wright <daddywri@gmail.com>:
> > > > > > > > >
> > > > > > > > >> This is actually pretty standard behavior across our connector family, and
> > > > > > > > >> has been true since Day One.  The behavior comes from the basic broad
> > > > > > > > >> requirement that the crawler should keep going and skip the document when
> > > > > > > > >> the permissions do not allow it to be fetched.  With the Windows Share
> > > > > > > > >> connector, it's sometimes the case (when DFS is used a lot) that whole
> > > > > > > > >> subtrees of documents are not fetchable using the credentials supplied.  So
> > > > > > > > >> it is not so easy to just check for valid credentials at the beginning.
> > > > > > > > >>
> > > > > > > > >> For a solution, I'd be inclined to look for a way to figure out if the
> > > > > > > > >> credentials are actually *invalid*, and abort the job if so.  This is
> > > > > > > > >> distinct from the case where the credentials are valid but the connector
> > > > > > > > >> doesn't have permissions to read the document.  It will take some
> > > > > > > > >> experimentation to see if we get back different exception text in the two
> > > > > > > > >> situations.
> > > > > > > > >>
> > > > > > > > >> Karl
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> On Tue, Mar 31, 2015 at 9:30 AM, Alessandro Benedetti <
> > > > > > > > >> abenedetti@apache.org> wrote:
> > > > > > > > >>
> > > > > > > > >> > Hi guys,
> > > > > > > > >> > playing with the Windows Shares Connector in ManifoldCF 1.8 I encountered
> > > > > > > > >> > this problem:
> > > > > > > > >> >
> > > > > > > > >> > *Scenario*
> > > > > > > > >> > *1)* Indexing Windows Shares server
> > > > > > > > >> > *2)* Indexing successfully finished with N docs indexed
> > > > > > > > >> > *3)* Offline, while no indexing is happening, the Administrator password
> > > > > > > > >> > changes on the Shares server side
> > > > > > > > >> > *4)* Repository Connector is not able to connect anymore (of course,
> > > > > > > > >> > because the password has changed)
> > > > > > > > >> > *5)* Next indexing cycle, ALL docs are removed from the index.
> > > > > > > > >> >
> > > > > > > > >> > *Expected Behaviour*
> > > > > > > > >> > As a user I would like to see an error message that will let me understand
> > > > > > > > >> > the issue, without losing all my N indexed docs.
> > > > > > > > >> >
> > > > > > > > >> > *Reason*
> > > > > > > > >> > Taking a look into the code, the problem seems to be in:
> > > > > > > > >> > org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector#getDocumentVersions
> > > > > > > > >> > where it tries to access each document individually through Samba,
> > > > > > > > >> > removing them one by one if they are not reachable anymore.
> > > > > > > > >> >
> > > > > > > > >> > *Solution*
> > > > > > > > >> > Before scanning each document, we have to be sure the connection is
> > > > > > > > >> > working.
> > > > > > > > >> > If not, scanning is only harmful.
> > > > > > > > >> >
> > > > > > > > >> > I will continue investigating, but I would like your opinion as well
> > > > > > > > >> >
> > > > > > > > >> > Cheers
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > --
> > > > > > > > >> > --------------------------
> > > > > > > > >> >
> > > > > > > > >> > Benedetti Alessandro
> > > > > > > > >> > Visiting card : http://about.me/alessandro_benedetti
> > > > > > > > >> >
> > > > > > > > >> > "Tyger, tyger burning bright
> > > > > > > > >> > In the forests of the night,
> > > > > > > > >> > What immortal hand or eye
> > > > > > > > >> > Could frame thy fearful symmetry?"
> > > > > > > > >> >
> > > > > > > > >> > William Blake - Songs of Experience -1794 England
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> >
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
