brooklyn-commits mailing list archives

Subject [2/4] incubator-brooklyn git commit: rejig titles of troubleshooting sections
Date Tue, 28 Jul 2015 15:45:21 GMT
diff --git a/docs/guide/ops/troubleshooting/ b/docs/guide/ops/troubleshooting/
deleted file mode 100644
index 07874c0..0000000
--- a/docs/guide/ops/troubleshooting/
+++ /dev/null
@@ -1,143 +0,0 @@
-layout: website-normal
-title: Troubleshooting Server Connectivity Issues in the Cloud
-toc: /guide/toc.json
-A common problem when setting up an application in the cloud is getting the basic connectivity
right - how
-do I get my service (e.g. a TCP host:port) publicly accessible over the internet?
-This varies a lot - e.g. Is the VM public or in a private network? Is the service only accessible
-through a load balancer? Should the service be globally reachable or only to a particular CIDR?
-This guide gives some general tips for debugging connectivity issues, which are applicable
to a 
-range of different service types. Choose those that are appropriate for your use-case.
-## VM reachable
-If the VM is supposed to be accessible directly (e.g. from the public internet, or if in
a private network
-then from a jump host)...
-### ping
-Can you `ping` the VM from the machine you are trying to reach it from?
-However, ping is over ICMP. If the VM is unreachable, it could be that the firewall forbids
ICMP but still
-lets TCP traffic through.
-### telnet to TCP port
-You can check if a given TCP port is reachable and listening using `telnet <host> <port>`,
such as
-`telnet 80`, which gives output like:
-    Trying
-    Connected to
-    Escape character is '^]'.
-If this is very slow to respond, it can be caused by a firewall blocking access. If it is
fast, it could
-be that the server is just not listening on that port.
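The slow-versus-fast distinction above can also be checked from a script. The following is a minimal Python sketch and not part of the original guide; the function name `check_tcp_port` and the 5-second default timeout are assumptions chosen for illustration.

```python
import socket


def check_tcp_port(host, port, timeout=5.0):
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"      # something accepted the connection
    except socket.timeout:
        return "timeout"       # no answer in time: often a firewall dropping packets
    except ConnectionRefusedError:
        return "refused"       # quick rejection: probably nothing listening on that port
    except OSError:
        return "error"         # e.g. name resolution failure or no route to host
```

Calling `check_tcp_port("<host>", 80)` from the machine you are connecting from gives roughly the same signal as the `telnet` check: a "timeout" result suggests a firewall, while "refused" suggests the server is not listening.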
-### DNS and routing
-If using a hostname rather than IP, then is it resolving to a sensible IP?
-Is the route to the server sensible? (e.g. one can hit problems with proxy servers in a corporate
-network, or ISPs returning a default result for unknown hosts).
-The following commands can be useful:
-* `host` is a DNS lookup utility. e.g. `host`.
-* `dig` stands for "domain information groper". e.g. `dig`.
-* `traceroute` prints the route that packets take to a network host. e.g. `traceroute`.
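As a rough scripted counterpart to `host` and `dig`, the sketch below asks the operating system's resolver which IP addresses a hostname maps to. It is an illustration only; the helper name `resolve` is an assumption, and it reports what the local resolver returns rather than querying a specific DNS server.

```python
import socket


def resolve(hostname):
    """Return the sorted, de-duplicated IP addresses that hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

If the addresses returned are not the ones you expect (for example, an ISP's default result for an unknown host), the problem is in name resolution rather than in the service itself.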
-## Service is listening
-### Service responds
-Try connecting to the service from the VM itself. For example, `curl http://localhost:8080`
for a
-On dev/test VMs, don't be afraid to install the utilities you need such as `curl`, `telnet`,
-etc. Cloud VMs often have a very cut-down set of packages installed. For example, execute
-`sudo apt-get update; sudo apt-get install -y curl` or `sudo yum install -y curl`.
-### Listening on port
-Check that the service is listening on the port, and on the correct NIC(s).
-Execute `netstat -antp` (or on OS X `netstat -antp TCP`) to list the TCP ports in use (or
-`-anup` for UDP). You should expect to see something like the output below for a service.
-Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program
-tcp        0      0 :::8080                     :::*                        LISTEN      8276/java
-In this case a Java process with pid 8276 is listening on port 8080. The local address `:::8080`
-format means all NICs (in IPv6 address format). You may also see `` for IPv4
-If it says then your service will most likely not be reachable externally.
-Use `ip addr show` (or the obsolete `ifconfig -a`) to see the network interfaces on your
-For `netstat`, run with `sudo` to see the pid for all listed ports.
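To make the reading of the `netstat` row above concrete, here is a small Python sketch that splits one LISTEN row into its parts and flags a loopback-only binding. The function name `parse_listen_line` and the returned dictionary keys are assumptions for illustration; it expects the Linux `netstat -antp` column layout shown above.

```python
LOOPBACK_ADDRESSES = {"127.0.0.1", "::1"}


def parse_listen_line(line):
    """Parse one LISTEN row of `netstat -antp` output into a dict."""
    fields = line.split()
    # Columns: Proto Recv-Q Send-Q Local-Address Foreign-Address State PID/Program
    local_address = fields[3]
    state = fields[5]
    # Split on the last colon so IPv6 forms such as ":::8080" are handled.
    address, _, port = local_address.rpartition(":")
    pid, _, program = fields[6].partition("/") if len(fields) > 6 else ("", "", "")
    return {
        "address": address,
        "port": int(port),
        "state": state,
        "pid": pid,
        "program": program,
        "loopback_only": address in LOOPBACK_ADDRESSES,
    }
```

A row whose `loopback_only` value is true describes a service that is bound only to the local machine, so remote clients will not be able to connect to it.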
-## Firewalls
-On Linux, check if `iptables` is preventing the remote connection. On Windows, check the
Windows Firewall.
-If it is acceptable (e.g. it is not a server in production), try turning off the firewall
-and testing connectivity again. Remember to re-enable it afterwards! On CentOS, this is
-`sudo service iptables stop`. On Ubuntu, use `sudo ufw disable`. On Windows, press the Windows key and
-type 'Windows Firewall with Advanced Security' to open the firewall tools, then click
-'Windows Firewall Properties' and set the firewall state to 'Off' in the Domain, Public and Private profiles.
-If you cannot temporarily turn off the firewall, then look carefully at the firewall settings. For
-example, execute `sudo iptables -n --list` and `sudo iptables -t nat -n --list`.
-## Cloud firewalls
-Some clouds offer a firewall service, where ports need to be explicitly listed to be reachable.
-For example, security groups for EC2-classic
-have rules for the protocols and ports to be reachable from specific CIDRs.
-Check these settings via the cloud provider's web-console (or API).
-## Quick test of a listener port
-It can be useful to start listening on a given port, and to then check if that port is reachable.
-This is useful for testing basic connectivity when your service is not yet running, or to
-try a different port to compare behaviour, or to compare with another VM in the network.
-The `nc` netcat tool is useful for this. For example, `nc -l 8080` will listen on
-TCP 8080 on all network interfaces. On another server, you can then run `echo hello from client
-| nc <hostname> 8080`. If all works well, this will send "hello from client" over the
TCP port 8080,
-which will be written out by the `nc -l` process before exiting.
-Similarly for UDP, you use `-lu`.
-You may first have to install `nc`, e.g. with `sudo yum install -y nc` or `sudo apt-get install
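Where `nc` cannot be installed, the same listen-then-send test can be approximated with Python's standard library. This is a minimal sketch under assumptions: the names `listen_once` and `send_message`, and the optional `ready` event used to signal that the listener is bound, are hypothetical helpers and not anything provided by Brooklyn or netcat.

```python
import socket


def listen_once(port, host="0.0.0.0", ready=None):
    """Accept one TCP connection on host:port and return the bytes received."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        if ready is not None:
            ready.set()            # tell a waiting caller that the port is now open
        conn, _addr = srv.accept()
        with conn:
            chunks = []
            while True:
                data = conn.recv(4096)
                if not data:       # the client closed the connection
                    break
                chunks.append(data)
    return b"".join(chunks)


def send_message(host, port, message, timeout=5.0):
    """Connect to host:port, send the given bytes, then close the connection."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(message)
```

Run `listen_once(8080)` on the server and `send_message("<hostname>", 8080, b"hello from client\n")` on the other machine; if the listener returns the message, basic TCP connectivity on that port works.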
-### Cloud load balancers
-For some use-cases, it is good practice to use the load balancer service offered by the cloud provider
-(e.g. ELB in AWS or the CloudStack Load Balancer).
-The VMs can all be isolated within a private network, with access only through the load balancer.
-Debugging techniques here include ensuring connectivity from another jump server within the private
-network, and careful checking of the load-balancer configuration from the Cloud Provider's web-console.
-### DNAT
-Use of DNAT is appropriate for some use-cases, where a particular port on a particular VM
is to be
-made available.
-Debugging connectivity issues here is similar to the steps for a cloud load balancer. Ensure
-connectivity from another jump server within the private network. Carefully check the NAT
rules from
-the Cloud Provider's web-console.
-### Guest wifi
-It is common for guest wifi to restrict access to only specific ports (e.g. 80 and 443, restricting
-ssh over port 22 etc).
-Normally your best bet is then to abandon the guest wifi (e.g. to tether to a mobile phone instead).
-There are some unconventional workarounds such as configuring sshd to listen on port 80 so you can
-use an ssh tunnel.
-However, the firewall may well inspect traffic so sending non-http traffic over port 80 may
still fail.
diff --git a/docs/guide/ops/troubleshooting/ b/docs/guide/ops/troubleshooting/
deleted file mode 100644
index c343762..0000000
--- a/docs/guide/ops/troubleshooting/
+++ /dev/null
@@ -1,88 +0,0 @@
-layout: website-normal
-title: Troubleshooting Deployment
-toc: /guide/toc.json
-This guide describes common problems encountered when deploying applications.
-## YAML deployment errors
-The error `Invalid YAML: Plan not in acceptable format: Cannot convert ...` means that the
text is not 
-valid YAML. Common reasons include that the indentation is incorrect, or that there are non-matching
-The error `Unrecognized application blueprint format: no services defined` means that the `services:`
-section is missing.
-An error like `Deployment plan item[name=<null>,description=<null>,serviceType=com.acme.Foo,characteristics=[],customAttributes={}]
cannot be matched` means that the given entity type (in this case com.acme.Foo) is not in
the catalog or on the classpath.
-An error like `Illegal parameter for 'location' (aws-ec3); not resolvable: java.util.NoSuchElementException:
Unknown location 'aws-ec3': either this location is not recognised or there is a problem with
location resolver configuration` means that the given location (in this case aws-ec3) 
-was unknown. This means it does not match any of the named locations in,
nor any of the
-clouds enabled in the jclouds support, nor any of the locations added dynamically through
the catalog API.
-## VM Provisioning Failures
-There are many stages at which VM provisioning can fail! An error `Failure running task provisioning`

-means there was some problem obtaining or connecting to the machine.
-An error like `... Not authorized to access cloud ...` usually means the wrong identity/credential
was used.
-An error like `Unable to match required VM template constraints` means that a matching image
(e.g. AMI in AWS terminology) could not be found. This 
-could be because an incorrect explicit image id was supplied, or because the match-criteria
could not
-be satisfied using the given images available in the given cloud. The first time this error is
-encountered, a listing of all images in that cloud/region will be written to the debug log.
-Failure to form an ssh connection to the newly provisioned VM can be reported in several
different ways, 
-depending on the nature of the error. This breaks down into failures at different points:
-* Failure to reach the ssh port (e.g. `... could not connect to any ip address port 22 on
node ...`).
-* Failure to do the very initial ssh login (e.g. `... Exhausted available authentication
methods ...`).
-* Failure to ssh using the newly created user.
-There are many possible reasons for this ssh failure, which include:
-* The VM was "dead on arrival" (DOA) - sometimes a cloud will return an unusable VM. One
can work around
-  this using the `machineCreateAttempts` configuration option, to automatically retry with
a new VM.
-* Local network restrictions. On some guest wifis, external access to port 22 is forbidden.
-  Check by manually trying to reach port 22 on a different machine that you have access to.
-* NAT rules not set up correctly. On some clouds that have only private IPs, Brooklyn can
-  create NAT rules to provide access to port 22. If this NAT rule creation fails for some reason,
-  then Brooklyn will not be able to reach the VM. If NAT rules are being created for your
cloud, then
-  check the logs for warnings or errors about the NAT rule creation.
-* ssh credentials incorrectly configured. The Brooklyn configuration is very flexible in
how ssh
-  credentials can be configured. However, if a more advanced configuration is used incorrectly (e.g.
-  the wrong login user, or invalid ssh keys) then this will fail.
-* Wrong login user. The initial login user to use when first logging into the new VM is inferred from
-  the metadata provided by the cloud provider about that image. This can sometimes be incomplete, so
-  the wrong user may be used. This can be explicitly set using the `loginUser` configuration option.
-  An example of this is with some Ubuntu VMs, where the "ubuntu" user should be used. However,
on some clouds
-  it defaults to trying to ssh as "root".
-* Bad choice of user. By default, Brooklyn will create a user with the same name as the user
running the
-  Brooklyn process; the choice of user name is configurable. If this user already exists
on the machine, 
-  then the user setup will not behave as expected. Subsequent attempts to ssh using this
user could then fail.
-* Custom credentials on the VM. Most clouds will automatically set the ssh login details
(e.g. in AWS using  
-  the key-pair, or in CloudStack by auto-generating a password). However, with some custom
images the VM
-  will have hard-coded credentials that must be used. If Brooklyn's configuration does not
match that,
-  then it will fail.
-* Guest customisation by the cloud. On some clouds (e.g. vCloud Air), the VM can be configured
to do
-  guest customisation immediately after the VM starts. This can include changing the root password.
-  If Brooklyn is not configured with the expected changed password, then the VM provisioning
may fail
-  (depending on whether Brooklyn connects before or after the password is changed!).
-A very useful debug configuration is to set `destroyOnFailure` to false. This will allow
ssh failures to
-be more easily investigated.
-## Timeout Waiting For Service-Up
-A common generic error message is that there was a timeout waiting for service-up.
-This just means that the entity did not get to service-up in the pre-defined time period
(the default is 
-two minutes, and can be configured using the `start.timeout` config key; the timer begins
after the 
-start tasks are completed).
-See the guide on [runtime errors](troubleshooting-runtime-errors.html) for where to find
additional information, especially the section on
-"Entity's Error Status".
diff --git a/docs/guide/ops/troubleshooting/ b/docs/guide/ops/troubleshooting/
deleted file mode 100644
index 8b657fc..0000000
--- a/docs/guide/ops/troubleshooting/
+++ /dev/null
@@ -1,116 +0,0 @@
-layout: website-normal
-title: Troubleshooting Runtime Errors
-toc: /guide/toc.json
-This guide describes sources of information for runtime errors.
-Whether you're customizing out-of-the-box blueprints, or developing your own custom blueprints,
you will
-inevitably have to deal with entity failure. Thankfully Brooklyn provides plenty of information
to help 
-you locate and resolve any issues you may encounter.
-## Web-console Runtime Error Information
-### Entity Hierarchy
-The Brooklyn web-console includes a tree view of the entities within an application. Errors
within the
-application are represented visually, showing a "fire" image on the entity.
-When an error causes an entire application to be unexpectedly down, the error is generally
propagated to the
-top-level entity - i.e. marking it as "on fire". To find the underlying error, one should
expand the entity
-hierarchy tree to find the specific entities that have actually failed.
-### Entity's Error Status
-Many entities have some common sensors (i.e. attributes) that give details of the error status:
-* `service.isUp` (often referred to as "service up") is a boolean, saying whether the service
is up. For many 
-  software processes, this is inferred from whether the "service.notUp.indicators" is empty.
It is also
-  possible for some entities to set this attribute directly.
-* `service.notUp.indicators` is a map of errors. This often gives much more information than
the single 
-  `service.isUp` attribute. For example, there may be many health-check indicators for a component:
-  is the root URL reachable, is the management api reporting healthy, is the process running, etc.
-* `service.problems` is a map of namespaced indicators of problems with a service.
-* `service.state` is the actual state of the service - e.g. CREATED, STARTING, RUNNING, STOPPING,
-* `service.state.expected` indicates the state the service is expected to be in (and when
it transitioned to that).
-  For example, is the service expected to be starting, running, stopping, etc.
-These sensor values are shown in the "sensors" tab - see below.
-### Sensors View
-The "Sensors" tab in the Brooklyn web-console shows the attribute values of a particular entity.
-This gives lots of runtime information, including about the health of the entity - the 
-set of attributes will vary between different entity types.
-[![Sensors view in the Brooklyn debug console.](images/jmx-sensors.png)](images/jmx-sensors-large.png)
-Note that null (or not set) sensors are hidden by default. You can click on the `Show/hide
empty records` 
-icon (highlighted in yellow above) to see these sensors as well.
-The sensors view is also tabulated. You can configure the number of sensors shown per page

-(at the bottom). There is also a search bar (at the top) to filter the sensors shown.
-### Activity View
-The activity view shows the tasks executed by a given entity. The top-level tasks are the effectors
-(i.e. operations) invoked on that entity. This view allows one to drill into the task, to

-see details of errors.
-Select the entity, and then click on the `Activities` tab.
-In the table showing the tasks, each row is a link - clicking on the row will drill into
the details of that task, 
-including sub-tasks:
-[![Task failure error in the Brooklyn debug console.](images/failed-task.png)](images/failed-task-large.png)
-For ssh tasks, this allows one to drill down to see the env, stdin, stdout and stderr. That
is, you can see the
-commands executed (stdin) and environment variables (env), and the output from executing
that (stdout and stderr). 
-For tasks that did not fail, one can still drill into the tasks to see what was done.
-It's always worth looking at the Detailed Status section as sometimes that will give you
the information you need.
-For example, it can show the exception stack trace in the thread that was executing the task
that failed.
-## Log Files
-Brooklyn's logging is configurable, for the files created, the logging levels, etc. 
-See [Logging docs](/guide/ops/logging.html).
-With out-of-the-box logging, `` and `brooklyn.debug.log` files are created.
These are by default 
-rolling log files: when the log reaches a given size, it is compressed and a new log file
is started.
-Therefore check the timestamps of the log files to ensure you are looking in the correct
file for the 
-time of your error.
-With out-of-the-box logging, info, warnings and errors are written to the ``
file. This gives
-a summary of the important actions and errors. However, it does not contain full stacktraces
for errors.
-To find the exception, we'll need to look in Brooklyn's debug log file. By default, the debug
log file
-is named `brooklyn.debug.log`. You can use your favourite tools for viewing large text files.

-One possible tool is `less`, e.g. `less brooklyn.debug.log`. We can quickly find the last exception
-by navigating to the end of the log file (using `Shift-G`), then performing a reverse-lookup
by typing `?Exception` 
-and pressing `Enter`. Sometimes an error results in multiple exceptions being logged (e.g.
first for the
-entity, then for the cluster, then for the app). If you know the text of the error message
(e.g. copy-pasted
-from the Activities view of the web-console) then one can search explicitly for that text.
-The `grep` command is also extremely helpful. Useful things to grep for include:
-* The entity id (see the "summary" tab of the entity in the web-console for the id).
-* The entity type name (if there are only a small number of entities of that type). 
-* The VM IP address.
-* A particular error message (e.g. copy-pasted from the Activities view of the web-console).
-* The word WARN etc, such as `grep -E "WARN|ERROR"`.
-Grep'ing for particular log messages is also useful. Some examples are shown below:
-* INFO: "Started application", "Stopping application" and "Stopped application"
-* INFO: "Creating VM "
-* DEBUG: "Finished VM "
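The `less` and `grep` searches described above can be sketched in Python for use in a script. This is an illustration only: the helper names `grep_lines` and `last_match` are assumptions, and the code simply applies a regular expression to lines of text.

```python
import re


def grep_lines(lines, pattern):
    """Return the lines matching the regular expression, similar to `grep -E`."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def last_match(lines, pattern):
    """Return (line_number, line) for the last matching line, or None if no line matches."""
    regex = re.compile(pattern)
    found = None
    for number, line in enumerate(lines, start=1):
        if regex.search(line):
            found = (number, line)   # keep overwriting so the final match wins
    return found
```

For example, after reading the log with `lines = open("brooklyn.debug.log").read().splitlines()`, calling `last_match(lines, "Exception")` locates the most recent exception and `grep_lines(lines, "WARN|ERROR")` lists the warnings and errors.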
diff --git a/docs/guide/ops/troubleshooting/ b/docs/guide/ops/troubleshooting/
deleted file mode 100644
index a09f902..0000000
--- a/docs/guide/ops/troubleshooting/
+++ /dev/null
@@ -1,50 +0,0 @@
-layout: website-normal
-title: Troubleshooting SoftwareProcess Entities
-toc: /guide/toc.json
-The [guide for troubleshooting runtime errors](troubleshooting-runtime-errors.html) in Brooklyn gives
-information on how to find more information about errors.
-If that doesn't give enough information to diagnose, fix or work around the problem, then
-it may be necessary to log in to the machine to investigate further. This guide applies to entities that are
-types of "SoftwareProcess" in Brooklyn, or that follow those conventions.
-## VM connection details
-The ssh connection details for an entity are published to a sensor `host.sshAddress`. The
-credentials will depend on the Brooklyn configuration. The default is to use the `~/.ssh/id_rsa`

-or `~/.ssh/id_dsa` on the Brooklyn host (uploading the associated `~/.ssh/` to
the machine's 
-authorised_keys). However, this can be overridden (e.g. with specific passwords etc) in the

-location's configuration.
-For Windows, there is a similar sensor with the name `host.winrmAddress`. (TODO sensor for
-## Install and Run Directories
-For ssh-based software processes, the install directory and the run directory are published
as sensors
-`install.dir` and `run.dir` respectively.
-For some entities, files are unpacked into the install dir; configuration files are written
to the
-run dir along with log files. For some other entities, these directories may be mostly empty
-e.g. if installing RPMs, and that software writes its logs to a different standard location.
-Most entities have a sensor `log.location`. It is generally worth checking this, along with
other files
-in the run directory (such as console output).
-## Process and OS Health
-It is worth checking that the process is running, e.g. using `ps aux` to look for the desired process.
-Some entities also write the pid of the process to `pid.txt` in the run directory.
-It is also worth checking if the required port is accessible. This is discussed in the guide

-"Troubleshooting Server Connectivity Issues in the Cloud", including listing the ports in use:
-execute `netstat -antp` (or on OS X `netstat -antp TCP`) to list the TCP ports in use (or
-`-anup` for UDP).
-It is also worth checking the disk space on the server, e.g. using `df -m`, to check that there
-is sufficient space on each of the required partitions.
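A scripted health check can read the same figure that `df -m` reports. The sketch below uses Python's standard library; the function name `free_megabytes` is an assumption for illustration.

```python
import shutil


def free_megabytes(path="/"):
    """Return the free space, in whole megabytes, on the filesystem containing path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (all in bytes)
    return usage.free // (1024 * 1024)
```

Call it once per partition of interest, passing for instance the value of the entity's `run.dir` sensor as `path`.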
