Addons, Plugins, Tweaks & Customizations

Below are lists of Nagios Addons/Plugins as well as some common Nagios Tweaks / Customizations I have tried with my various Nagios installations.  If you have some ideas, suggestions, etc. please register and post comments.  Thanks.

Checking Drupal Status with Nagios and WebInject

Summary

A few weeks back I found a post on the Drupal forums about monitoring the status report page with Nagios and WebInject.  Having lots of practice with Nagios and WebInject, I knew this was possible, but no one had provided an example.  So I finally got around to creating the WebInject script today and posted it.  Below is the complete process including the Nagios info I used.

 

WebInject XML Script

<testcases repeat="1">
<testvar varname="BASE_URL">http://www.domain.com/</testvar>
<testvar varname="LOGIN1">username</testvar>
<testvar varname="PASSWD1">password</testvar>
<case
id="1"
description1="Connecting to Login Page"
method="get"
url="${BASE_URL}?q=user"
verifypositive="Enter the password that accompanies your username"
errormessage="Unable to load login page"
/>
<case
id="2"
description1="Authentication"
method="post"
url="${BASE_URL}?q=user"
postbody="name=${LOGIN1}&pass=${PASSWD1}&form_id=user_login&op=Log+in"
verifypositive="${BASE_URL}\?q=users/${LOGIN1}"
errormessage="Login Post Problem"
/>
<case
id="3"
description1="Status Report Page"
method="get"
url="${BASE_URL}?q=admin/reports/status"
verifynegative="Out of date"
verifypositive="Drupal core update status"
errormessage="Status Page Alert!"
/>
</testcases>

Nagios Command File Entry

define command {
  command_name webinject
  command_line /usr/local/nagios/webinject/webinject.pl -c nagios/$ARG1$ nagios/$ARG2$
}

Nagios Check Entry

define service {
  use template1
  host_name server
  service_description  status-report
  check_command webinject!nagios.xml!drupal_status.xml
}

Thoughts

The WebInject script looks for anything marked "Out of date" on the Status Report page, and Nagios will alert based on your configuration.  The first test case is not strictly required, but it helps in troubleshooting when the site's login page is not loading correctly and is preventing the rest of the check from running.
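
To make the pass/fail logic of test case 3 concrete, here is a rough Python sketch of the same idea.  It is not part of WebInject: the function name is made up for illustration, it uses plain substring matching where WebInject uses regular expressions, and the return codes simply follow the usual Nagios plugin convention (0 OK, 2 CRITICAL).

```python
from typing import Tuple


def evaluate_status_page(body: str) -> Tuple[int, str]:
    """Mimic test case 3: one string must be present, another must be absent."""
    # verifypositive: proves we really reached the status report page
    if "Drupal core update status" not in body:
        return 2, "Status Page Alert! expected text is missing"
    # verifynegative: any "Out of date" entry should raise an alert
    if "Out of date" in body:
        return 2, "Status Page Alert! something is out of date"
    return 0, "Status report looks current"
```

Checking for the positive string first matters: without it, a login failure or an "Access denied" page would contain no "Out of date" text and would look healthy.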

AddOn - NRPE / NSClient

NRPE and NSClient allow you to remotely execute either pre-configured tasks or custom scripts to trigger alerts or as the result of an event (e.g. an EventHandler).

NRPE Plugin
http://sourceforge.net/project/showfiles.php?group_id=26589

NRPE allows you to remotely execute Nagios plugins on other Linux/Unix machines. This allows you to monitor remote machine metrics (disk usage, CPU load, etc.). NRPE can also communicate with some of the Windows agent addons, so you can execute scripts and check metrics on remote Windows machines as well. A Windows utility called NSClient is also available to accomplish the same thing on Windows hosts.

NSClient Plugin
http://trac.nakednuns.org/nscp/downloads
NSClient++, aka NSCP, aims to be a simple yet powerful and secure monitoring daemon for Windows operating systems. It is built for Nagios, but nothing in the daemon is actually Nagios specific and could probably, with little or no change, be integrated into any monitoring software that supports running user tools for polling.

ERROR: CHECK_NRPE: Socket timeout after 10 seconds.

Several conditions can trigger this error with your Nagios checks.  Many of them are obvious, but this one had me stumped for a while.

Problem

All my Nagios checks using NRPE against a given host were failing with the "CHECK_NRPE: Socket timeout after 10 seconds." message.  I logged into the host and made sure NRPE was running, and even restarted it.  I double-checked the firewall rules to make sure the port was open.  From my Nagios server I ran NSLOOKUP, PING and TELNET to the port to ensure I was resolving the correct IP address and could connect.  The machine in question was a Virtual Private Server (VPS), so it does sometimes become sluggish and unresponsive, but poking around, it all seemed fine.  Running the check from the command line of my Nagios server gave the same result.

Solution

What got me looking in the right direction was pinging my Nagios server from the host.  It worked fine, but I noticed it took a few seconds to resolve the name.  So I checked the DNS servers configured on my Linux VPS.  The first server listed was not pingable.  I quickly swapped the order of the servers in resolv.conf and VOILA!  The command-line check from my Nagios server now succeeded.
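
If you suspect the same thing, timing a name lookup on the monitored host is a quick test.  Below is a small Python sketch of that idea; the function name is my own invention and the resolver is passed in as a parameter only so the function can be exercised without a network.

```python
import socket
import time


def time_lookup(hostname, resolver=socket.gethostbyname):
    """Return (seconds_taken, address) for one lookup; address is None on failure."""
    start = time.monotonic()
    try:
        address = resolver(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        address = None
    return time.monotonic() - start, address

# Example use on the monitored host (the name is a placeholder):
#   elapsed, address = time_lookup("nagios.example.com")
#   print("resolved to %s in %.2fs" % (address, elapsed))
```

A lookup that takes several seconds while an unreachable DNS server is listed first in resolv.conf is a strong hint that name resolution is eating into the 10 second NRPE timeout.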

ERROR: Could not fetch information from server

While setting up several new servers and installing NSCLIENT, I ran into the following error message:

could not fetch information from server

The most logical first step is to re-verify the Nagios server config file.  Check to make sure DNS resolution is correct.  Second, take a look at the NSC.log on the client system.  In my case, I saw:

2009-03-30 10:52:23: error:.\NSClientListener.cpp:307: Unauthorized access from: 172.20.16.182

Well, that could definitely be a problem.  My fault this time was in editing the NSC.ini after installation.  The allowed_hosts line of:

allowed_hosts=172.20.16/23

needed to be like:

allowed_hosts=172.20.16.0/23
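
To illustrate why the full dotted-quad form matters, here is a Python sketch using the standard ipaddress module.  It is only an analogy: NSClient has its own parser, and the is_allowed helper is made up for this example.

```python
import ipaddress


def is_allowed(client_ip, allowed_network):
    """True if client_ip falls inside allowed_network; False if the network text is malformed."""
    try:
        network = ipaddress.ip_network(allowed_network, strict=False)
    except ValueError:
        # "172.20.16/23" lands here: the address part has only three octets
        return False
    return ipaddress.ip_address(client_ip) in network
```

With the corrected entry, is_allowed("172.20.16.182", "172.20.16.0/23") is true, since a /23 starting at 172.20.16.0 covers 172.20.16.0 through 172.20.17.255.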

AddOn - Nagios Event Log aka NagEventLog

NagEventLog is a Windows agent that examines the EventLog, filters it, and forwards passive alerts to Nagios via NSCA. Now with encryption support! Supports Windows 2000 and later.

More information can be found here:

NagEventLog allows you to have Windows event log entries filtered and passed back to your Nagios server.  Two methods I have used are:

  • Report ALL errors in ALL logs and filter select EventIDs we don't need to worry about.
  • Report a -specific- error that we use to trigger an event script, e.g. a "cleanup and restart" process upon a service failure.

Updating NagEventLog Filters via GPO

When you have a lot of Windows Servers and would like to add an EventID to the Filter, it is a real pain to update on a server by server basis.  By using a GPO, you can control the filters directly from a policy without having to manually update each individual server.

Assumptions

  • You install NagEventLog in a consistent fashion on all servers
  • You want to filter the same items across ALL your servers
  • All your servers are members of the local domain

Instructions

  1. Create a custom administrative policy template.  Below is the "nageventlog.adm" file I used to filter out select Event IDs.
    ; nageventlog.adm
    ;;;;;;;;;;;;;;;;;;;;;
    CLASS MACHINE  ;;;;;;
    ;;;;;;;;;;;;;;;;;;;;;
     
    CATEGORY !!nagiosfilter
    KEYNAME "SOFTWARE\Wow6432Node\Cheshire Cat\Nagios\Filter0"
        POLICY !!changenagiosfilter
            PART !!NotEventID CHECKBOX
                VALUENAME "notID"
                VALUEON NUMERIC 1
                VALUEOFF NUMERIC 0
            END PART
            PART !!ChangeFilter0IDs EDITTEXT REQUIRED
                VALUENAME "ID"
                DEFAULT !!filterdefault    
            END PART
            PART !!changefilter0IDstext TEXT END PART
        END POLICY
    END CATEGORY

    [STRINGS]
    nagiosfilter="Nagios Filtering"
    changenagiosfilter="Change Nagios Filter0"
    ChangeFilter0IDs="Event IDs that are ignored by Nagios"
    changefilter0IDstext="Comma separated list of Event IDs to exclude"
    filterdefault="21293,21248,26020,26009"

  2. Add the new nageventlog.adm file to C:\windows\inf folder of your domain controller.
  3. Next, we need to add the template to our default policy.  Launch the GPO Editor by clicking Start > Run > mmc.   Add the "Group Policy Object Editor" Snap-in, click Browse, and choose the Default Domain Policy.
  4. Right-click "Administrative Templates" and choose Add/Remove templates.  Select the template file, nageventlog.adm, we created.
  5. You should now see an item appear as "Nagios Filtering".  If you select it and the "Change Nagios Filter0" does not appear, click View > Filtering and DE-select the "Only show policy settings that can be fully managed".
  6. Select the "Enabled" option, click the checkbox to enable the EXCLUSION of the IDs and enter the comma delimited list of EventIDs.
  7. Servers will update automatically with their regular policy refresh.  To force a policy update, you can use "gpupdate" from the command line.

You can use the technique above to do a variety of things and tweak things from a central location across the domain environment.


Windows Server 2008 NagEventLog Compatibility

While the 64bit version of NagEventLog v1.9.1 installed on my 64bit Windows 2008 server, I was unable to use the GUI to configure the filters.  However, if you visit Steve Shipway's NagEventLog site directly, you can download replacement executables that allow it to run properly on Server 2008.  I replaced the files, restarted the service, and then the GUI tool worked correctly.

AddOn - Nagios Passive Checks with NSCA

Using Nagios with NSCA, you can configure some complex scripts / tasks to output status codes and messages to be sent to your Nagios server for collection / reporting.  To start, you will need to install the NSCA package on your Nagios server and configure the listening server as outlined in the documentation.

NOTE: You will need libmcrypt and libmcrypt-devel packages installed to compile successfully.

You will most likely want to create a template or two to use with your passive checks.  Below is the example template I created for testing passive checks...

define service{
        name                    passive-service
        use                     generic-service
        check_freshness         1
        passive_checks_enabled  1
        active_checks_enabled   0
        is_volatile             0
        flap_detection_enabled  0
        notification_options    w,u,c,s
        freshness_threshold     57600     ; 16 hours, in seconds (43200 would be 12 hours)
}

Then configure a service like so...

define service{
     use                     passive-service   
     host_name               localhost
     service_description     test
     check_command           check_dummy!3!"No Data Received"
}

On the remote server, you will need to do the same to compile the components.  You will only need the send_nsca binary and the send_nsca.cfg file.  You will need to tweak your send_nsca config file to match the information you configured on your NSCA server.

Now the fun begins: you can create/modify scripts to send these passive check results to Nagios via the NSCA server.  I used the simple Perl script below for my testing.

#!/usr/bin/perl
#############################################################
# RETURN CODES:
# 0-OK, 1-WARNING, 2-CRITICAL, 3-UNKNOWN
#############################################################
#CONFIG FILES
#$debug=1;
$config="/usr/local/nagios/etc/send_nsca.cfg";
# LOCAL SYSTEM CONFIG OPTIONS
$nsca_host="nagios.hubteam.com";
$host="host_name";
$service="service_name";
# DEFAULT RETURNS
$code=3;
$result="WHAT THE HECK?";
# COMMAND LINE
$send_nsca="/usr/local/nagios/bin/send_nsca -c $config -H $nsca_host";
# Start
# INSERT YOUR FUN CODE HERE, Setting a $code and $result value
# End
if ($debug) {print "SENDING:  $host\t$service\t$code\t$result\n";}
open(SEND,"|$send_nsca") || die "Could not run $send_nsca: $!\n";
print SEND "$host\t$service\t$code\t$result\n";
close SEND;

There are several points to consider.

  • If the script takes < 10 seconds, you may also consider running checks via NRPE and custom command definitions.
  • You can have multiple scripts send passive results back to the SAME host/service combo, e.g. running various nightly jobs and directing any errors to a single "nightly-jobs" monitor.
  • Read the Nagios documentation on passive checks and freshness.
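
For anyone who prefers Python to Perl, the same idea can be sketched as below.  It formats the tab-separated line exactly as the Perl script above prints it and pipes it to send_nsca with the same -H and -c options.  The function names are my own, and the default paths are copied from the Perl script, so adjust them for your install.

```python
import subprocess


def build_nsca_line(host, service, code, result):
    """Format one passive service result: host, service, code and text, tab separated."""
    if code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN)")
    return f"{host}\t{service}\t{code}\t{result}\n"


def send_result(line, nsca_host,
                config="/usr/local/nagios/etc/send_nsca.cfg",
                send_nsca="/usr/local/nagios/bin/send_nsca"):
    """Pipe the formatted line to send_nsca on its standard input."""
    subprocess.run([send_nsca, "-H", nsca_host, "-c", config],
                   input=line.encode(), check=True)
```

Keeping the formatting separate from the sending makes it easy to test the line your job produces before pointing it at a real NSCA server.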

Nagios Custom Object Variables

In large Nagios environments, configuring everything at the host level can be cumbersome.  Nagios has nice grouping / templating features that make deploying checks a lot faster as well as easier to manage.  Sometimes you may need to "customize" the check for a specific host, for example to specify the name of the database to query on an individual database server.  This is where Nagios "Custom Object Variables" come into play.

As always, you can find some very useful information in the Nagios documentation.

In my case, we will start by defining the custom object variable on the host object, adding a line to the "define host {" block like so:

define host{
        use             server-template
        host_name       dbserver1
        alias           DB Server 1
        address         dbserver1.domain.local
        _DATABASE1              DB01
        }

I have a hostgroup definition for "Database Servers" and a list of common checks for each database server.  Below you can see how my Nagios check uses the custom variable in the service definition applied to that hostgroup.

define hostgroup{
        hostgroup_name  database-servers
        alias           Database Servers
        members         dbserver1,dbserver2,dbserver3
        }
define service{
        use                     template
        hostgroup_name          database-servers
        service_description     database-test
        check_command           check_mssql!username!password!-p 1433 -D \
                                $_HOSTDATABASE1$ -w 3 -c 5 -q "exec \
                                $_HOSTDATABASE1$.dbo.sp_test" -s -W 10 -C 20
        }

Note that the backslashes are only for readability here, and the check is a single line in my definition.

Remember that custom variable names always start with an underscore where they are defined.  When you reference one as a macro, the name is prefixed with the object type (HOST, SERVICE, CONTACT, etc.), so _DATABASE1 on a host becomes $_HOSTDATABASE1$.
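
The naming rule is easy to get wrong, so here is a tiny Python sketch that derives the macro name from the variable as written in the object definition.  The helper is purely illustrative and is not part of Nagios.

```python
def custom_macro(object_type, variable_name):
    """Build the macro for a custom variable, e.g. ("host", "_DATABASE1") -> "$_HOSTDATABASE1$"."""
    if not variable_name.startswith("_"):
        raise ValueError("custom variable names must start with an underscore")
    # Nagios upper-cases custom variable names, so do the same here
    return "$_" + object_type.upper() + variable_name[1:].upper() + "$"
```

For the host definition above, custom_macro("host", "_DATABASE1") gives the $_HOSTDATABASE1$ macro used in the check_command line.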

Nagios Event Handler - Restart Remote Service

I wrote a few quick posts on using Nagios Event_Handlers to restart a service on the local system.  Mostly I followed the example from the Nagios documentation, but it was a little tricky using SUDO to restart a service.  Once I solved that, the logical next step was to be able to restart a service on a REMOTE system with the event_handlers and NRPE.

NAGIOSSVR runs Nagios and monitors itself and WEBSVR.  I use the "check_linux_procs" script, which is also known as "check_system_procs".  On the remote server WEBSVR, the script configuration lines look something like:

# Processes to check
PROCLIST_RED="httpd sendmail nrpe"
PROCLIST_YELLOW="crond"

# Ports to check
PORTLIST="25 80 5666"

The check_linux_procs is executed on the remote server via NRPE.  We can use NRPE to remotely execute event handlers as well as service checks.  Setup is a bit more complex than a local host configuration. 

Proper SUDO configuration is required on the remote system, WEBSVR.   Read my other post, Nagios Event Handler - Restarting a Local Service, for more information on the SUDO settings.

On WEBSVR I created a very simple script that uses sudo to restart the services.  Something like:

#!/bin/sh
#
/usr/bin/sudo /sbin/service httpd restart
/usr/bin/sudo /sbin/service sendmail restart
exit 0

NRPE is not listed because... well, if NRPE crashes the event_handler cannot run, since it uses NRPE to connect and execute the script.  Do not forget to add your script to your nrpe.cfg file like below and restart NRPE on WEBSVR.

command[remote_restart]=/usr/local/nagios/libexec/eventhandlers/remote-restart

On NAGIOSSVR, this service check has a max_check_attempts of 3.  So I had to tweak the script I used before.  The trick here is passing the right variables through.  In my Nagios commands.cfg I added the $HOSTADDRESS$ value to the end of the line like so:

command_line    $USER1$/eventhandlers/restart-services-remote \
$SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$

And the local "event_handler" script on NAGIOSSVR looks similar to the localhost example, but the "sudo" restart command is replaced with:

/usr/local/nagios/libexec/check_nrpe -H $4 -c remote_restart

Don't forget to change the case logic if you need to adjust for a different max_check_attempts value in your config.
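
My handler is a shell script, but the decision it makes can be sketched in a few lines of Python.  Treat this as an illustration under assumptions rather than the actual script: the function names and the soft_attempt parameter are hypothetical, and the default of 2 is just one reasonable choice when max_check_attempts is 3.

```python
def should_restart(state, state_type, attempt, soft_attempt=2):
    """Decide whether the handler should fire, given the three macros passed on the command line."""
    if state != "CRITICAL":
        return False              # OK, WARNING and UNKNOWN: do nothing
    if state_type == "HARD":
        return True               # all retries used up: restart
    # SOFT state: act once, on the chosen retry, before the problem goes HARD
    return state_type == "SOFT" and int(attempt) == soft_attempt


def restart_command(host_address):
    """Build the check_nrpe call that runs remote_restart on the remote host."""
    return ["/usr/local/nagios/libexec/check_nrpe",
            "-H", host_address, "-c", "remote_restart"]
```

The four arguments map directly to $SERVICESTATE$, $SERVICESTATETYPE$, $SERVICEATTEMPT$ and $HOSTADDRESS$ from the command_line above.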

NOTES: 

  • You could break this into two checks and event_handlers, but I just restart both services to keep it as simple as possible. 
  • You may also try using key-based SSH w/o a password as an alternative to NRPE.  That may be my next tweak to work around NRPE itself crashing.

HINTS: 

  • Always double check script ownership and permissions.  I had forgotten to make the script executable on WEBSVR, and that held me up for a while as I tried to sort it out.

Nagios Event Handler - Restarting a Local Service

Using Nagios Event Handlers you can perform an action based on the results of a Nagios check.  A very straightforward example would be to restart a service.  However it is not as simple as you might think.

I use the "check_system_procs" on the localhost of my Nagios server itself to check a few services and restart them all should one no longer be running.  Since my Nagios server is a VPS with limited resources, it sometimes runs out of memory and well... things die.

We need to configure the check and the check's event-handler like so:

define service{
        use                             local-service
        host_name                       localhost
        service_description             daemons
        check_command                   check_nrpe!check_daemons
        event_handler                   restart-services
}

In your nagios.cfg, make sure you have "enable_event_handlers=1" to enable the event handlers.  There are several other values in the config file you may wish to alter, such as the event_handler_timeout.

In your commands.cfg file, make sure you have event_handler defined something like:

define command{
        command_name    restart-services
        command_line    /usr/local/nagios/libexec/eventhandlers/restart-services \
        $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
        }

The problem we have is that the event_handler runs as the Nagios user, which typically will not be able to restart a service.  To test this, just "su - nagios" and try to restart sendmail or apache.  We can work around this by using SUDO.   Edit the SUDOERS file (visudo) and add something like the lines below to the end of the file.

User_Alias NAGIOS = nagios,nagcmd
Cmnd_Alias NAGIOSCOMMANDS = /sbin/service
Defaults:NAGIOS !requiretty
NAGIOS    ALL=(ALL)    NOPASSWD: NAGIOSCOMMANDS

Essentially we're defining users and commands that can be run via SUDO, without a password, and without a session. 

Attached is the script I use (found it on the web) for the scenario described above.  Do not forget to make sure the script has the appropriate ownership / permissions.  Try executing the script as the nagios user to test it prior to setting up the event_handler in Nagios.

Attachment                     Size
event_handler_script.txt       2.98 KB

Nagios, NagVis and PNP4Nagios Example


A vanilla out of the box example of the Nagios/NagVis/PNP4Nagios integration. There were the usual installation pains from all the dependencies the packages require, but setup was not too difficult following the documentation. I created a simple hardware diagram in Visio for this example. I added icons for HOST status and the CPU Load and Root Partition service checks. I updated the "hover" template for NagVis to show the PNP4Nagios graphs for the services.

As you can see, you have the ability to create some slick visuals. You can create a high level dashboard and drill down to more detailed maps. Of course this all works much better when your hardware and logical layouts are relatively static. In a very dynamic environment Nagios can be an administrative pain and this only increases the complexity.

Nagios - Switch Interface Traffic

I recently wanted to start monitoring some ports on my switch stack. Specifically several uplink ports and several trunk ports. Doing a little research I found the best plugin was the "check_iftraffic3" plugin available from the Nagios Plugin Exchange. Ref: http://exchange.nagios.org/directory/Plugins/Network-Connections%2C-Stats-and-Bandwidth/check_iftraffic3/details

I modified the Perl script slightly to format the output a bit differently. The biggest trick is determining the interface ID. Using SNMPWALK on my Nagios server I was able to look at the various interfaces in my switching environment.

snmpwalk -v 2c -c public aaa.bbb.ccc.ddd ifTable

Configure a new check command in the standard fashion and off you go! Oh, I had tweaked the output slightly and created a PNP4NAGIOS template to better display the IN/OUT data on the same graph vs. individual graphs where the "scale" of the graph could be misleading. I'll attach that info as a TXT file.

Attachment                     Size
check_traffic3_php.txt         1.38 KB

Plugin: check_dns_secondary - Checking NS Servers

Ref: http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F1948.html;d=1

NOTE: May require installation of additional perl modules.

I renamed the check to "check_ns_servers" on my install to be a little more obvious as to its function.  After several DNS hosting provider outages which would manifest a wide array of errors over odd periods of time thanks to DNS caching, I wanted a Nagios plugin to check to make sure our DNS was working correctly.  Sure there is "check_dns" which I also use to check the resolution of a name to correct IP, but I wanted something a bit more powerful.

"check_dns_secondary" will query for the name servers of the provided domain.  Each NS server is queried individually for the SOA record of the domain.  An error is generated if any server is not functioning, or not authoritative.  A warning is generated if any server lags the others in serial-number.

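The comparison step is simple enough to sketch in Python.  This is not the plugin's code: the function name is hypothetical, the serials are assumed to have been collected already (one SOA query per name server), mapping an "error" to CRITICAL is my assumption, and serial number wraparound is ignored.

```python
def evaluate_serials(serials):
    """serials maps each name server to its SOA serial, or None if it gave no authoritative answer."""
    if not serials:
        return 3, "UNKNOWN: no name servers found"
    failed = sorted(ns for ns, serial in serials.items() if serial is None)
    if failed:
        return 2, "CRITICAL: no authoritative answer from " + ", ".join(failed)
    newest = max(serials.values())
    lagging = sorted(ns for ns, serial in serials.items() if serial < newest)
    if lagging:
        return 1, "WARNING: serial behind on " + ", ".join(lagging)
    return 0, "OK: all %d servers at serial %d" % (len(serials), newest)
```

A lagging secondary is only a warning because it normally catches up at the next zone transfer, whereas a server that is down or not authoritative is already causing failed lookups for some clients.
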
Plugin: check_http_requisites - Page Size, Files, and Loadtime

Summary

A Nagios module written in Python that downloads the page and embedded elements using 'wget' to measure a more realistic total page load time value.  The total number and size of elements, as well as the time it took to load them, is returned.  A warn/critical alert can be triggered by the total load time.

Usage Example

Ideally you want your total page load time to be less than a few seconds.  This means making sure your images are sized correctly and in the correct format, e.g. a JPEG vs. a large BMP file.  It also means making sure any "embedded" objects, like externally referenced image/media files, do not slow your site down.  Or perhaps your website is under load and just not responding in a timely fashion.

Using this plugin we can monitor the relative load time of select sites and/or pages within a site.  Note that the check is somewhat dependent on the system executing the check and its network bandwidth.  While you are unlikely to be running Nagios over dialup, bandwidth limitations and other traffic could definitely affect the total page load time.  The check can also alert you to sub-optimal routing, latency, and/or packet loss issues.  What I like to call the "TII", aka Transient Internet Issue.

Adding to NagiosGraph

Of course graphs are always visually pleasing and allow you to make your point about what happens when Marketing uploads 1 MB BMP files instead of the recommended JPEGs.  Here is the NagiosGraph map file entry I added.


# Service type: check_http_complete
# output: OK - Downloaded: 149K bytes in 8 files in 0.83 seconds
# perfdata: time=0.83;size=149K;number=8
/perfdata:.*time=([.0-9]+);size=(\d+)K;number=(\d+)/ and
push @s, [ http_complete,
[ sec, GAUGE, $1 ],
[ KB, GAUGE, $2 ],
[ files, GAUGE, $3 ] ];

Available Here : http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F1352.html;d=1
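
If you want to test the pattern from the map entry above outside of NagiosGraph, the same regular expression can be tried in Python.  The function name is made up, and the returned keys simply reuse the data source names from the map entry.

```python
import re

# Same capture groups as the map entry: seconds, kilobytes, file count
PERFDATA_RE = re.compile(r"time=([.0-9]+);size=(\d+)K;number=(\d+)")


def parse_http_perfdata(perfdata):
    """Return the three values as a dict, or None if the perfdata does not match."""
    match = PERFDATA_RE.search(perfdata)
    if match is None:
        return None
    return {"sec": float(match.group(1)),
            "KB": int(match.group(2)),
            "files": int(match.group(3))}
```

Note that the pattern expects the size to end in "K", so output with a different unit suffix would not match and would not be graphed.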

Plugin: check_mem - Linux Memory Usage

A plugin written in Perl to monitor and check thresholds for memory based on the output of the 'free -mt' command.

Ref: Nagios Exchange - check_mem page

Installation

  1. Copy the file to the /usr/local/nagios/libexec directory of the host you are monitoring.
  2. Set the file mode to 755.
  3. Add a line like the following to your nrpe.cfg file and restart the NRPE service:
    command[check_mem]=/usr/local/nagios/libexec/check_mem -w 80,20 -c 95,50
  4. Add the NRPE check to the appropriate configuration file on your Nagios server like:
define service{
        use                     servicetemplate2   
        hostgroup_name          linux-servers
        service_description     memory
        check_command           check_nrpe!check_mem
}

Command Line Syntax

# /usr/local/nagios/libexec/check_mem -w 50,20 -c 80,50
<b>WARNING: Memory Usage (W> 50, C> 80): 72% <br>Swap Usage (W> 20, C> 50): 0%</b> \
|MemUsed=72%;50;80 SwapUsed=0%;20;50

Display the Data

To add this to your "map" file for NagiosGraph, append the code to capture the data like below:

# Service type: check_mem
#   check command: check_nrpe!check_mem -w 50,10 -c 80,25
#   output: <b>CRITICAL: Memory Usage (W> 80, C> 95): 100% <br>Swap Usage (W> 20, C> 50): 0%</b>
#   perfdata: MemUsed=100%;80;95 SwapUsed=0%;20;50
/perfdata:.*MemUsed=(\d+)%;(\d+);(\d+).*?SwapUsed=(\d+)%;(\d+);(\d+)/
and push @s, [ memory,
       [ ramuse, GAUGE, $1 ],
       [ swapuse, GAUGE, $4 ] ];

Dumping Linux Buffer Cache

TOP screenshot

A useful command to check Linux system resources is "top".  However, with the buffer cache you may see almost no available memory (see attached screenshot).  But how can that be?  All you have running may be a java app, apache, and a few other services.  There is no way that should be using ALL of that RAM.  In my case, I have the Nagios "check_mem" plugin querying for available memory and throwing alerts quite regularly.

The "free -mt" command can show you how much memory is cached.  That eases my mind a bit.  While googling about buffer cache, I stumbled upon this article:

http://devcs.blogspot.com/2007/12/linux-buffer-cache-how-to-disable-it.html

This article gives a good rundown of what the buffer cache is all about.  Also it mentions a nice little trick to dump the entire cache.

echo 1 > /proc/sys/vm/drop_caches

VOILA!  Cache dumped and Nagios is happy.  Just dumping the cache shouldn't be taken lightly as it MAY have some adverse effects depending on your server.  However the cache should slowly start to build back up. 

Looking into the actual check_mem script, you can have it exclude the buffers in the calculation of free memory.  Check this line and make sure the value is "1":

my $DONT_INCLUDE_BUFFERS = 1;
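
The effect of that setting is easiest to see with numbers.  The Python sketch below shows the arithmetic only; it is not the plugin's code, the function name is made up, and the four inputs correspond to the total, free, buffers and cached columns that "free -m" reports in megabytes.

```python
def memory_used_percent(total, free, buffers, cached, exclude_buffers=True):
    """Percentage of RAM in use, optionally treating buffers and cache as available."""
    used = total - free
    if exclude_buffers:
        used -= buffers + cached   # the kernel can reclaim this memory on demand
    return round(100.0 * used / total)
```

For a box with 1000 MB total, 50 MB free, 100 MB of buffers and 600 MB cached, this reports 95% with buffers counted as used but only 25% with them excluded, which is the difference between a constant alert and a quiet check.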

Plugin: check_sql - Check MSSQL and MYSQL servers

Ref: http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F1435.html;d=1

Written in Perl.  Requires FreeTDS to be installed.

This plugin can query a Microsoft SQL Server or a MySQL Server. The plugin can also execute specific queries or stored procedures and return the results. The results can then be compared via thresholds for numeric values or via regular expressions for string values.

A good example of how I used the plugin is in my blog post "Count Log Entries Stored Procedure and check by Nagios".

Plugin: check_svn - Check Subversion

Summary

Check_svn is a Nagios check written in Python which will check the availability of your SVN repository from your Nagios server.

Ref: http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F1554.html;d=1

ERROR: CHECK_SVN - Error Connecting

Summary

The "check_svn" plugin worked from the command line.  However when I attempted to configure the Nagios check, an error occurred.

Ref: http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F1554.html;d=1

SVN CRITICAL: Error connecting to svn server - Can't open file '/root/.subversion/servers': Permission denied .

Solutions

Apparently this is caused by a Nagios environment issue.  The nagiosexchange page recommends one solution.  Alternatively, you can modify the check_svn script.

  1. Edit the command line in the nagios commands.cfg file to export the HOME variable:
    command_line export HOME=/home/nagios && $USER1$/check_svn
  2. Edit the "check_svn" script to pass the configuration directory on the command line.
    Add a variable like:  self.confdir    = "/home/nagios"
    and extend the "cmd" string the script builds, after the if statements for username/password:
            if self.confdir:
                cmd += " --config-dir=%s" % self.confdir

All should work well now.  I also make sure to use the "-T" option to output the test execution time, which I can now graph with NagiosGraph.

Tweak - Nagios Jabber / XMPP Notifications


I wanted to add the ability to send Nagios alerts as instant messages via Jabber, aka the XMPP protocol.  Our office uses the Openfire platform for corporate instant messaging.  Some googling found this:

I modified the server connection and user variables to fit my Openfire installation.  However after running some simple command line tests, I could not get it to work.  The unable to connect errors were easy enough to understand and fix, but then I got an unauthorized error.

ERROR: Authorization failed: error - not-authorized

The link below has information on the solution to the error.  OK, so now we assume you CAN send a test message via Jabber/XMPP.  Let's configure it.

There are two primary scenarios for configuring the Jabber/XMPP notifications. 

Scenario 1 - User wants to be contacted for ALL Nagios alerts via Jabber/XMPP. 

Define the contact command something like:

# 'notify-by-jabber' command definition
define command{
        command_name    notify-by-jabber
        command_line    /usr/local/nagios/bin/notify_via_jabber \
             $CONTACTADDRESS1$ "$NOTIFICATIONTYPE$ $HOSTNAME$ \
             $SERVICEDESC$ $SERVICESTATE$ $SERVICEOUTPUT$ $LONGDATETIME$"
        }

*note - I'm using the backslash for readability only, actually only one line.

Notice how I used "$CONTACTADDRESS1$" in the command line.  Now check out my Contact definition...

define contact{
        contact_name                    jdoe
        use                             generic-contact
        alias                           John Doe
        email                           jdoe@company.com
        address1                        jdoe
        service_notification_commands   +notify-by-jabber
        }

This will use the generic contact info for the email method -and- add the jabber contact method.  Note the "address1" line I am using for the appropriate jabber_id for the user.  Alternatively you could create a contact template called "jabber-contact" using the notify-by-jabber command and then apply both templates to the user contact definition.

Scenario 2 - User only needs specific Nagios service/host alerts via Jabber/XMPP.

Unfortunately this is not as simple.  You would need to define a second contact entirely to assign to the specific host/service.  Create a "jdoe-jabber" contact so that on your Nagios host/service definition you would have a line like:

contacts          jdoe,jdoe-jabber

Maybe with an add-on or future version of Nagios we could define a user/contact method along the lines of:

contacts           jdoe:notify-by-jabber,jdoe:notify-by-email

Anyone listening?  Feel free to send me a note or comment.

Attachment                     Size
jabber-xmpp-notications.txt    2.45 KB

ERROR - Nagios XMPP Notification with Openfire

ERROR: Authorization failed: error - not-authorized

Obviously this means I am connected and communicating with the server.  So I flipped the TLS variable on/off a few times.  I also turned on debug logging on my Openfire server and could see the connections.  With TLS "on" or set to "1", I received this error:

Can't use an undefined value as a HASH reference at /usr/lib/perl5/site_perl/5.8.8\
/XML/Stream.pm line 1165.

So I turned TLS off and started working on the not-authorized error.  I googled around and found a fix that worked.  I had to edit the Protocol.pm Perl module to fix the authentication error.  I found my file here...

/usr/lib/perl5/site_perl/5.8.8/Net/XMPP/Protocol.pm

and just commented out the line:

return $self->AuthSASL(%args);

Now I can annoy myself with all my Nagios alerts via IM as well as email!  Attached is the file with the text used for the commands.cfg entry and the perl script to send the notifications.

Tweak - Nagios SMS Messaging

Want to send SMS messages from Nagios?  Are SMS messages sometimes blocked when you send them via <phone#>@provider.com?  Sending a lot of SMS messages?  Sounds like you need an SMS Gateway Provider.

In our corporate environment we wanted a more reliable and consistent SMS messaging system to work with our Nagios monitoring environment.  A little research quickly led us to Clickatell.  To keep things as simple as possible, we set up Nagios to use the SMS Gateway via its SMTP API.

Now we had to configure Nagios.  First off, we needed to create a notification method/command.  Here is the command.cfg entry we created:

define command{
        command_name    notify-service-by-sms
        command_line    /usr/bin/printf "%b" "api_id:<API_ID> \nuser:<USERNAME> \npassword:<PASSWORD> \nto:$CONTACTPAGER$\nreply:<REPLY_ADDY> \ntext:$NOTIFICATIONTYPE$ $HOSTALIAS$-$SERVICEDESC$ $SERVICESTATE$\ntext:Address-$HOSTADDRESS$\ntext:Additional Info-$SERVICEOUTPUT$" | /bin/mail -s "$HOSTALIAS$-$SERVICEDESC$ is $SERVICESTATE$" $CONTACTEMAIL$
        }

The items in angle brackets must be specified based on your environment.  Notice how we had to significantly strip down the info sent via SMS text message.  We found the above was simple and communicated the required info.
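
To preview what the printf in the command above produces before wiring it into Nagios, here is a minimal Python sketch.  The field names (api_id, user, password, to, reply, text) are taken from the command definition; the helper name build_sms_body and all of the sample values are made up for illustration and are not part of Nagios or of Clickatell's tooling.

```python
def build_sms_body(api_id, user, password, to, reply, lines):
    """Return the e-mail body: one 'key:value' field per line.

    Mirrors the layout of the printf in the notify-service-by-sms command.
    build_sms_body is a hypothetical helper, not a Nagios or Clickatell API.
    """
    fields = [
        ("api_id", api_id),
        ("user", user),
        ("password", password),
        ("to", to),
        ("reply", reply),
    ]
    # Each message line becomes its own "text:" field, as in the command.
    fields += [("text", line) for line in lines]
    return "\n".join(f"{key}:{value}" for key, value in fields)


# Sample values standing in for the Nagios macros (all hypothetical).
body = build_sms_body(
    api_id="<API_ID>",
    user="<USERNAME>",
    password="<PASSWORD>",
    to="16665551234,16665554321",
    reply="<REPLY_ADDY>",
    lines=[
        "PROBLEM db1-MSSQLinfo CRITICAL",
        "Address-10.0.0.5",
        "Additional Info-timeout",
    ],
)
print(body)
```

Printing the body this way makes it easy to see how much text ends up in the message and to trim it further if needed.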

Then let's create a contact to use...

define contact {
        contact_name    sms
        use             generic-contact
        alias           SMS Alert
        email           sms@messaging.clickatell.com
        pager           16665551234,16665554321
        service_notification_commands   notify-service-by-sms
}

The phone numbers to SMS are just a comma-separated list.  It is important to note that the phone numbers must have a "1" before them.

Now simply add the "sms" contact in the service definitions you want to alert by SMS text messages.    Reload Nagios and you should be off and running.

Tweak - check_file_age to check_file_modified

Out of the box the Nagios Plugins package has a check_file_age plugin.  Well, that only checks to see if the file has been modified in the last specified time period and alerts if the file has NOT changed recently.  I needed the exact opposite, to check to see if the file has been modified. 

The "reversal" can be accomplished by changing two ">" symbols to "<" in the comparisons of the function.  Of course, I changed the few occurrences of the plugin's name to a new name as well.  This is a good sanity check: when any critical config file that I designate is changed, it alerts the IT team.  Great for catching developers with sudo access touching key config files they should not be, such as the http.conf file.
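
The reversed logic can be sketched in a few lines of Python.  This is not the actual check_file_age code (that plugin ships with the Nagios Plugins package); the function name, the threshold values and the sample path below are assumptions chosen for illustration.  The only idea carried over is the flipped comparison: alert when the file's age is less than the threshold instead of greater.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_file_modified(path, warn_age=3600, crit_age=300, now=None):
    """Alert when a file was modified recently (reverse of a file-age check).

    warn_age and crit_age are in seconds and are illustrative defaults only.
    A file younger than crit_age is CRITICAL, younger than warn_age is WARNING.
    """
    try:
        mtime = os.stat(path).st_mtime
    except OSError as exc:
        return UNKNOWN, f"FILE_MODIFIED UNKNOWN - {exc}"

    current = time.time() if now is None else now
    age = int(current - mtime)

    # The "<" comparisons are the reversal: a small age means a recent change.
    if age < crit_age:
        return CRITICAL, f"FILE_MODIFIED CRITICAL - {path} changed {age} seconds ago"
    if age < warn_age:
        return WARNING, f"FILE_MODIFIED WARNING - {path} changed {age} seconds ago"
    return OK, f"FILE_MODIFIED OK - {path} is {age} seconds old"


# Hypothetical path; a missing file simply reports UNKNOWN.
status, message = check_file_modified("/etc/httpd/conf/httpd.conf")
print(status, message)
```

A real plugin would exit with the status code so that Nagios can pick it up.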

Tweak: Using NagiosGraph's SHOW.CGI

Using NagiosGraph with Nagios can provide valuable information about your environment.  At times, you may want to show something on a different scale or limit the data seen within the graph.  Below are a few basics for manipulating NagiosGraph's show.cgi to customize the graph you are viewing.

Example

We use the "check_mssql_monitor 0.9.0" with our Microsoft SQL Clusters and graph the results.  My NagiosGraph "map" file graphs the CPU, IO, IDLE, and response time of the checks, and the Nagios notes_url for the check links to the default graph.  However, the scale of the values typically renders the IO line nearly flat, yet a small change in IO can be significant.  Now we want to generate IO-only graphs for each of our clusters so we can compare them to each other.

Here's the default URL of the graph we are working with: 

http://nagios.domain.com/nagiosgraph/show.cgi?host=servername&service=MSSQLinfo

[Image: mssql_monitor NagiosGraph graph, before]

By manipulating the URL, we can do some handy tweaks.  Let's start with only graphing the IO.  To do that, we need to add the appropriate options to the URL.  Add the datasource name and value name to the URL like this:

http://nagios.domain.com/nagiosgraph/show.cgi?host=servername&service=MSSQLinfo&db=mssql_monitor,io

My favorite option is to make the graph bigger!  After all, bigger is better, right?  Just add the geometry option to the URL like so:

http://nagios.domain.com/nagiosgraph/show.cgi?host=servername&service=MSSQLinfo&db=mssql_monitor,io&geom=700x200

[Image: mssql_monitor NagiosGraph graph, after]

Now we have a much larger graph and are able to see the IO response increasing during the given time period.  Next step is to figure out why!

HINT:  for the db source name, check out the default graph and look for the name immediately under the graph.  Then add the value as it appears in the legend.
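
If you need these URLs for several clusters, building them by hand gets tedious.  Here is a small Python sketch that assembles the same show.cgi URL from its parts.  The parameter names (host, service, db, geom) are the ones used in the URLs above; the function name show_cgi_url is just something I made up for the example.

```python
from urllib.parse import urlencode


def show_cgi_url(base, host, service, db=None, geom=None):
    """Build a NagiosGraph show.cgi URL; db and geom are optional."""
    params = {"host": host, "service": service}
    if db:
        params["db"] = db        # "datasource,valuename", e.g. "mssql_monitor,io"
    if geom:
        params["geom"] = geom    # "WIDTHxHEIGHT", e.g. "700x200"
    # safe="," keeps the comma in the db value readable instead of %2C.
    return f"{base}?{urlencode(params, safe=',')}"


url = show_cgi_url(
    "http://nagios.domain.com/nagiosgraph/show.cgi",
    host="servername",
    service="MSSQLinfo",
    db="mssql_monitor,io",
    geom="700x200",
)
print(url)
```

Looping over a list of cluster host names with the same db and geom values gives one comparable IO graph link per cluster.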

Tweak: check_mssql - Allow decimal values

While building another stored procedure that I execute with check_mssql in Nagios, I noticed a little hiccup.  My stored procedure was returning a value like "85.67".  When I executed check_mssql on the command line to run the procedure, I got a strange error...

# ./check_mssql -H mssqlclus1 -U username -P password -p 1433 -D database -w 20 -c 35 \
> -q "exec database.dbo.sp_GetDatabaseFileMetrics database,Used,Log,1" -W 90.00 -C 95.00 -s
CHECK_MSSQL CRITICAL - Result is not numeric with result threshold defined (0.089992 seconds) \
| time=0.089992s;20;35

Now that does not make sense at all.  I removed the -W and -C constraints and got:

CHECK_MSSQL OK - SQL Server result: 98.10 (0.122316 seconds) | time=0.122316s;20;35

I do not know about you, but "98.10" looks numeric to me.  So I opened up the Perl for check_mssql and looked for the conditions that triggered the error.  This was the regular expression it was evaluating to determine whether the value returned was numeric rather than a string.

$result =~ /^[-+]?\d+$/

Well, that does not do the trick if I have a value like "98.10".  A value of "98" would have been fine.  I freely admit I am no "code guru" by any means, but I figured I should be able to come up with a fix for this.  I copied the plugin to 'check_mssql2' and went to work.  I created an OR condition to look for the integer regular expression or a decimal regular expression.  There may be a better way, but this worked for me.  I changed this:

!($result =~ /^[-+]?\d+$/)) {

to this:

!(($result =~ /^[-+]?\d+$/) || ($result =~ /^[-+]?\d+\.\d+$/))) {

I ran my tests and it worked great!  A really useful link I found was this Regular Expression Validator at http://www.sweeting.org/mark/html/revalid.php.  It also has some very handy reference info on the bottom which I found useful since I do not have the pleasure of writing them on a daily basis.
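
For anyone who wants to experiment with the pattern outside of the plugin, here is a quick Python sketch of the same check.  It folds the two alternatives into one expression by making the decimal part optional; that is my own variation, not what check_mssql ships with, and the name is_numeric is only for this example.

```python
import re

# Optional sign, one or more digits, then an optional ".digits" part.
NUMERIC = re.compile(r"^[-+]?\d+(?:\.\d+)?$")


def is_numeric(result):
    """Return True for integers and simple decimals such as "98" or "98.10"."""
    return NUMERIC.match(result) is not None


for value in ["98", "98.10", "-3.5", "98.", ".5", "abc"]:
    print(repr(value), is_numeric(value))
```

Like the two-pattern version above, this still rejects values such as "98." or ".5", since digits are required on both sides of the decimal point.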