Below are lists of Nagios Addons/Plugins as well as some common Nagios Tweaks / Customizations I have tried with my various Nagios installations. If you have some ideas, suggestions, etc. please register and post comments. Thanks.
A few weeks back I found a post on the Drupal forums about monitoring the status report page with Nagios and Webinject. Having lots of practice with Nagios and Webinject, I knew this was possible but no one had provided an example. So I finally got around to creating the Webinject script today and posted it. Below is the complete process including the Nagios info I used.
<testcases repeat="1">
<testvar varname="BASE_URL">http://www.domain.com/</testvar>
<testvar varname="LOGIN1">username</testvar>
<testvar varname="PASSWD1">password</testvar>
<case
id="1"
description1="Connecting to Login Page"
method="get"
url="${BASE_URL}?q=user"
verifypositive="Enter the password that accompanies your username"
errormessage="Unable to load login page"
/>
<case
id="2"
description1="Authentication"
method="post"
url="${BASE_URL}?q=user"
postbody="name=${LOGIN1}&pass=${PASSWD1}&form_id=user_login&op=Log+in"
verifypositive="${BASE_URL}\?q=users/${LOGIN1}"
errormessage="Login Post Problem"
/>
<case
id="3"
description1="Status Report Page"
method="get"
url="${BASE_URL}?q=admin/reports/status"
verifynegative="Out of date"
verifypositive="Drupal core update status"
errormessage="Status Page Alert!"
/>
</testcases>
define command {
command_name webinject
command_line /usr/local/nagios/webinject/webinject.pl -c nagios/$ARG1$ nagios/$ARG2$
}
define service {
use template1
host_name server
service_description status-report
check_command webinject!nagios.xml!drupal_status.xml
}
The webinject script looks for something that is "Out of date" on the Status Report and will alert appropriately based on your Nagios configuration. The first step is not necessarily required, but it helps in troubleshooting if the login page for the site is not loading correctly and preventing the check from executing correctly.
NRPE and NSClient allow you to remotely execute either pre-configured tasks or custom scripts, either to trigger alerts or in response to an event (e.g. an event handler).
NRPE Plugin
http://sourceforge.net/project/showfiles.php?group_id=26589
NRPE allows you to remotely execute Nagios plugins on other Linux/Unix machines. This allows you to monitor remote machine metrics (disk usage, CPU load, etc.). NRPE can also communicate with some of the Windows agent addons, so you can execute scripts and check metrics on remote Windows machines as well. A Windows utility called NSClient is also available to accomplish the same thing on Windows hosts.
NSClient Plugin
http://trac.nakednuns.org/nscp/downloads
NSClient++, aka NSCP, aims to be a simple yet powerful and secure monitoring daemon for Windows operating systems. It is built for Nagios, but nothing in the daemon is actually Nagios specific and could probably, with little or no change, be integrated into any monitoring software that supports running user tools for polling.
Several conditions can trigger the "CHECK_NRPE: Socket timeout after 10 seconds." error with your Nagios checks. Many of them are obvious, but this one had me stumped for a while.
All my nagios checks with NRPE to a given host were failing with the "CHECK_NRPE: Socket timeout after 10 seconds." message. I logged into the host and made sure NRPE was running, even restarted it. Double checked the firewall rules to make sure the port was open. I went to my nagios server, did an NSLOOKUP, PING and TELNET to the port to ensure I was resolving the correct IP address and could connect. The machine in question was a Virtual Private Server (VPS) so it does sometimes become sluggish and non-responsive, but poking around it all seemed fine. I tested from the command line of my Nagios server and got the same results.
What got me looking in the right direction was when I pinged my Nagios server from my host. It worked fine, but I noticed it took a few seconds to resolve the host name. So then I checked the DNS servers of my Linux VPS. The first server listed was not pingable. I quickly swapped the order of the servers in my resolv.conf and VOILA! My command-line check from my Nagios server worked again.
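If you want to script the same sanity check, here is a minimal Python sketch. It is not something I ran at the time, the function names are made up for illustration, and it only covers the two things that gave the problem away: the order of the resolvers and how long a lookup takes.

```python
import socket
import time


def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses in the order the resolver tries them."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop trailing comments, then split the line into fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def time_lookup(hostname):
    """Return the seconds spent resolving hostname, or None if it fails."""
    start = time.monotonic()
    try:
        socket.gethostbyname(hostname)
    except OSError:
        return None
    return time.monotonic() - start
```

Feed the contents of /etc/resolv.conf to parse_nameservers() to see which server gets asked first, and call time_lookup() with the name of your Nagios server. A lookup that takes several seconds is a strong hint that the first resolver is not answering.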
While setting up several new servers and installing NSCLIENT, I ran into the following error message:
could not fetch information from server
The most logical first step is to re-verify the Nagios server config file. Check to make sure DNS resolution is correct. Second, take a look at the NSC.log on the client system. In my case, I saw:
2009-03-30 10:52:23: error:.\NSClientListener.cpp:307: Unauthorized access from: 172.20.16.182
Well, that could definitely be a problem. My fault this time was in editing the NSC.ini after installation. The allowed_hosts line of:
allowed_hosts=172.20.16/23
needed to be like:
allowed_hosts=172.20.16.0/23
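A quick way to double check a network entry before putting it in the config is Python's standard ipaddress module. This is only a sketch: the is_allowed() helper is hypothetical, and Python's parsing rules are not necessarily the same as NSClient's. It does confirm that the client address from my log falls inside the corrected network, and it also happens to reject the shorthand form I had typed.

```python
import ipaddress


def is_allowed(client_ip, allowed_hosts):
    """Return True if client_ip falls inside the allowed_hosts network."""
    network = ipaddress.ip_network(allowed_hosts)
    return ipaddress.ip_address(client_ip) in network


# The address from the NSC.log entry against the corrected network.
print(is_allowed("172.20.16.182", "172.20.16.0/23"))

# The shorthand form is not a valid network for Python's parser.
try:
    is_allowed("172.20.16.182", "172.20.16/23")
except ValueError as error:
    print("rejected:", error)
```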
NagEventLog is a Windows agent that examines the EventLog, filters it, and forwards passive alerts to Nagios via NSCA. Now with encryption support! Supports Windows 2000 and later.
More information can be found here:
NagEventLog allows you to have Windows event log entries filtered and passed back to your Nagios server. Two methods I have used are:
When you have a lot of Windows servers and would like to add an EventID to the filter, it is a real pain to update on a server-by-server basis. So using a GPO, you can control the filters directly from a policy without having to manually update each individual server.
Create a custom administrative policy template. Below is the "nageventlog.adm" file I used to filter out select Event IDs.
; nageventlog.adm
;;;;;;;;;;;;;;;;;;;;;
CLASS MACHINE ;;;;;;
;;;;;;;;;;;;;;;;;;;;;
CATEGORY !!nagiosfilter
KEYNAME "SOFTWARE\Wow6432Node\Cheshire Cat\Nagios\Filter0"
POLICY !!changenagiosfilter
PART !!NotEventID CHECKBOX
VALUENAME "notID"
VALUEON NUMERIC 1
VALUEOFF NUMERIC 0
END PART
PART !!ChangeFilter0IDs EDITTEXT REQUIRED
VALUENAME "ID"
DEFAULT !!filterdefault
END PART
PART !!changefilter0IDstext TEXT END PART
END POLICY
END CATEGORY
[STRINGS]
nagiosfilter="Nagios Filtering"
changenagiosfilter="Change Nagios Filter0"
ChangeFilter0IDs="Event IDs that are ignored by Nagios"
changefilter0IDstext="Comma separated list of Event IDs to exclude"
filterdefault="21293,21248,26020,26009"
You can use the technique above to do a variety of things and tweak things from a central location across the domain environment.
While the 64bit version of NagEventLog v1.9.1 installed on my 64bit Windows 2008 server, I was unable to use the GUI to configure the filters. However, if you visit Steve Shipway's NagEventLog site directly, you can download replacement executables that allow it to properly run in Server2008. I replaced the files, restarted the service and then the GUI tool worked correctly.
Using Nagios with NSCA, you can configure some complex scripts / tasks to output status codes and messages to be sent to your Nagios server for collection / reporting. To start, you will need to install the NSCA package on your Nagios server and configure the listening server as outlined in the documentation.
NOTE: You will need libmcrypt and libmcrypt-devel packages installed to compile successfully.
You will most likely want to create a template or two to use with your passive checks. Below is the example template I created for testing passive checks...
define service{
name passive-service
use generic-service
check_freshness 1
passive_checks_enabled 1
active_checks_enabled 0
is_volatile 0
flap_detection_enabled 0
notification_options w,u,c,s
freshness_threshold 57600 ;16hr (use 43200 for 12hr)
}
Then configure a service like so...
define service{
use passive-service
host_name localhost
service_description test
check_command check_dummy!3!"No Data Received"
}
On the remote server, you will need to do the same to compile the components. You will only need the send_nsca binary and the send_nsca.cfg file. You will need to tweak your send_nsca config file to match the information you configured on your NSCA server.
Now the fun begins where you can create/modify scripts to send these passive check results to Nagios via the NSCA server. I used the simple Perl script below for my testing.
#!/usr/bin/perl
#############################################################
# RETURN CODES:
# 0-OK, 1-WARNING, 2-CRITICAL, 3-UNKNOWN
#############################################################
#CONFIG FILES
#$debug=1;
$config="/usr/local/nagios/etc/send_nsca.cfg";
# LOCAL SYSTEM CONFIG OPTIONS
$nsca_host="nagios.hubteam.com";
$host="host_name";
$service="service_name";
# DEFAULT RETURNS
$code=3;
$result="WHAT THE HECK?";
# COMMAND LINE
$send_nsca="/usr/local/nagios/bin/send_nsca -c $config -H $nsca_host";
# Start
# INSERT YOUR FUN CODE HERE, Setting a $code and $result value
# End
if ($debug) {print "SENDING: $host\t$service\t$code\t$result\n";}
open(SEND,"|$send_nsca") || die "Could not run $send_nsca: $!\n";
print SEND "$host\t$service\t$code\t$result\n";
close SEND;
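If your own scripts are not in Perl, the same idea carries over easily. Below is a rough Python equivalent, offered as a sketch rather than something I run in production. The function names are hypothetical; the paths and the -H / -c options are the ones from the Perl script above.

```python
import subprocess


def format_passive_result(host, service, code, message):
    """Build the tab-separated line that send_nsca reads on standard input."""
    return "{}\t{}\t{}\t{}\n".format(host, service, code, message)


def send_passive_result(line, nsca_host,
                        config="/usr/local/nagios/etc/send_nsca.cfg",
                        binary="/usr/local/nagios/bin/send_nsca"):
    """Pipe one result line to send_nsca and return its exit status."""
    completed = subprocess.run(
        [binary, "-H", nsca_host, "-c", config],
        input=line.encode(),
        check=False,
    )
    return completed.returncode
```

The host and service names in the line have to match the host_name and service_description of the passive service on the Nagios server, otherwise the result is dropped.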
There are several points to consider.
In large Nagios environments, configuring everything at the host level can be cumbersome. Nagios has nice grouping / templating features that make deploying checks a lot faster as well as easier to manage. Sometimes you may need to "customize" the check for a specific host. For example, specifying the database name to query on an individual database server. This is where Nagios "Custom Object Variables" come into play.
As always, you can find some very useful information in the Nagios documentation.
In my case, we will start with defining the custom object variable on the host object by adding a line in the "define host {" block like so:
define host{
use server-template
host_name dbserver1
alias DB Server 1
address dbserver1.domain.local
_DATABASE1 DB01
}
I have a hostgroup definition for "Database Servers" and a list of common checks for each database server. You can see how I have my Nagios check configured to use the local variable in the hostgroup definition....
define hostgroup{
hostgroup_name database-servers
alias Database Servers
members dbserver1,dbserver2,dbserver3
}
define service{
use template
hostgroup_name database-servers
service_description database-test
check_command check_mssql!username!password!-p 1433 -D \
$_HOSTDATABASE1$ -w 3 -c 5 -q "exec \
$_HOSTDATABASE1$.dbo.sp_test" -s -W 10 -C 20
}
Note that the backslashes are only for readability here, and the check is a single line in my definition.
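Remember that custom variables are always defined with a leading underscore, and when you reference them as macros the name is prefixed with the type of object... HOST, SERVICE, CONTACT, etc. So _DATABASE1 defined on a host is referenced as $_HOSTDATABASE1$.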
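The naming rule is easy to get wrong, so here it is spelled out as a tiny Python helper. The helper itself is hypothetical and only there to illustrate the rule; it also upper-cases the name, since Nagios treats custom variable names as case-insensitive and converts them to upper case.

```python
OBJECT_PREFIXES = {"host": "HOST", "service": "SERVICE", "contact": "CONTACT"}


def custom_macro(object_type, variable_name):
    """Return the macro used to reference a custom object variable."""
    if not variable_name.startswith("_"):
        raise ValueError("custom variable names must start with an underscore")
    prefix = OBJECT_PREFIXES[object_type]
    # Drop the leading underscore of the definition and add "_<TYPE>" instead.
    return "$_{}{}$".format(prefix, variable_name[1:].upper())


print(custom_macro("host", "_DATABASE1"))
```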
I wrote a few quick posts on using Nagios Event_Handlers to restart a service on the local system. Mostly I followed the example from the Nagios documentation, but it was a little tricky using SUDO to restart a service. Once I solved that, the logical next step was to be able to restart a service on a REMOTE system with the event_handlers and NRPE.
NAGIOSSVR runs nagios and monitors itself and WEBSVR. I use the "check_linux_procs" script which is also known as "check_system_procs". On the remote server WEBSVR, the script configuration lines look something like:
# Processes to check
PROCLIST_RED="httpd sendmail nrpe"
PROCLIST_YELLOW="crond"
# Ports to check
PORTLIST="25 80 5666"
The check_linux_procs is executed on the remote server via NRPE. We can use NRPE to remotely execute event handlers as well as service checks. Setup is a bit more complex than a local host configuration.
Proper SUDO configuration is required on the remote system, WEBSVR. Read my other post on the Nagios Local Service Restart with Event_Handlers for more information on the SUDO settings.
On WEBSVR I created a very simple script that uses sudo to restart the services. Something like:
#!/bin/sh
#
/usr/bin/sudo /sbin/service httpd restart
/usr/bin/sudo /sbin/service sendmail restart
exit 0
NRPE is not listed because... well, if NRPE crashes the event_handler cannot run since it uses NRPE to connect and execute the script. Do not forget to add your script to your nrpe.cfg file like below and restart NRPE on WEBSVR.
command[remote_restart]=/usr/local/nagios/libexec/eventhandlers/remote-restart
On NAGIOSSVR, this service check has a max_check_attempts of 3. So I had to tweak the script I used before. The trick here is passing the right variables through. In my Nagios commands.cfg I added the $HOSTADDRESS$ value to the end of the line like so:
command_line $USER1$/eventhandlers/restart-services-remote \
$SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$
And the local "event_handler" script on NAGIOSSVR looks similar to the localhost example, but the "sudo" restart command is replaced with:
/usr/local/nagios/libexec/check_nrpe -H $4 -c remote_restart
Don't forget to change the case logic if you need to adjust for a different max_check_attempts value in your config.
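For anyone who wants to see the decision logic in one place, here is a hedged Python sketch of what such a handler does with its four arguments. It is not my actual script (that one is a shell script modelled on the Nagios documentation example). The function names are hypothetical, and the rule "act on the SOFT attempt just before max_check_attempts" is an assumption that mirrors the documentation example, where the restart fires on attempt 3 of 4.

```python
CHECK_NRPE = "/usr/local/nagios/libexec/check_nrpe"


def should_restart(state, state_type, attempt, max_check_attempts=3):
    """Decide whether the handler should trigger the remote restart."""
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        # Retries are exhausted, so try the restart one more time.
        return True
    # SOFT state: act on the last retry before the state turns HARD (assumption).
    return state_type == "SOFT" and int(attempt) == max_check_attempts - 1


def restart_command(host_address):
    """Build the check_nrpe call that runs remote_restart on the remote host."""
    return [CHECK_NRPE, "-H", host_address, "-c", "remote_restart"]
```

The arguments arrive in the order they appear on the command_line: $SERVICESTATE$, $SERVICESTATETYPE$, $SERVICEATTEMPT$ and $HOSTADDRESS$. Pass your own max_check_attempts if it is not 3.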
NOTES:
HINTS:
Using Nagios Event Handlers you can perform an action based on the results of a Nagios check. A very straightforward example would be to restart a service. However it is not as simple as you might think.
I use the "check_system_procs" on the localhost of my nagios server itself to check a few services and restart them all should one no longer be running. Since my nagios server is a VPS with limited resources, it sometimes runs out of memory and well... things die.
We need to configure the check and the check's event-handler like so:
define service{
use local-service
host_name localhost
service_description daemons
check_command check_nrpe!check_daemons
event_handler restart-services
}
In your nagios.cfg, make sure you have "enable_event_handlers=1" to enable the event handlers. There are several other values in the config file you may wish to alter such as the event_handler_timeout.
In your commands.cfg file, make sure you have event_handler defined something like:
define command{
command_name restart-services
command_line /usr/local/nagios/libexec/eventhandlers/restart-services \
$SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}
The problem we have is that the event_handler runs as the Nagios user, which typically will not be able to restart a service. To test this, just "su - nagios" and try to restart sendmail or apache. We can work around this by using SUDO. Edit the SUDOERS file (visudo) and add something like the lines below to the end of the file.
User_Alias NAGIOS = nagios,nagcmd
Cmnd_Alias NAGIOSCOMMANDS = /sbin/service
Defaults:NAGIOS !requiretty
NAGIOS ALL=(ALL) NOPASSWD: NAGIOSCOMMANDS
Essentially we're defining the users and commands that can be run via SUDO without a password and without requiring a terminal (the "!requiretty" default).
Attached is the script I use (found it on the web) for the scenario described above. Do not forget to make sure the script has the appropriate ownership / permissions. Try executing the script as the nagios user to test it prior to setting up the event_handler in Nagios.
| Attachment | Size |
|---|---|
| event_handler_script.txt | 2.98 KB |
A vanilla out of the box example of the Nagios/NagVis/PNP4Nagios integration. The usual installation pains of all the dependencies required for the packages. Setup was not too difficult following the documentation. I created a simple hardware diagram in Visio in this example. I added icons for HOST status and the CPU Load and Root Partition service checks. I updated the "hover" template for NagVis to show the PNP4Nagios graphs for the services.
As you can see, you have the ability to create some slick visuals. You can create a high level dashboard and drill down to more detailed maps. Of course this all works much better when your hardware and logical layouts are relatively static. In a very dynamic environment Nagios can be an administrative pain and this only increases the complexity.
I recently wanted to start monitoring some ports on my switch stack. Specifically several uplink ports and several trunk ports. Doing a little research I found the best plugin was the "check_iftraffic3" plugin available from the Nagios Plugin Exchange. Ref: http://exchange.nagios.org/directory/Plugins/Network-Connections%2C-Stats-and-Bandwidth/check_iftraffic3/details
I modified the Perl script slightly to format the output a bit differently. The biggest trick is determining the interface ID. Using SNMPWALK on my Nagios server I was able to look at the various interfaces in my switching environment.
snmpwalk -v 2c -c public aaa.bbb.ccc.ddd ifTable
Configure a new check command in the standard fashion and off you go! Oh, I had tweaked the output slightly and created a PNP4NAGIOS template to better display the IN/OUT data on the same graph vs. individual graphs where the "scale" of the graph could be misleading. I'll attach that info as a TXT file.
| Attachment | Size |
|---|---|
| check_traffic3_php.txt | 1.38 KB |
Ref: http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F1948.html;d=1
NOTE: May require installation of additional perl modules.
I renamed the check to "check_ns_servers" on my install to be a little more obvious as to its function. After several DNS hosting provider outages which would manifest a wide array of errors over odd periods of time thanks to DNS caching, I wanted a Nagios plugin to check to make sure our DNS was working correctly. Sure there is "check_dns" which I also use to check the resolution of a name to correct IP, but I wanted something a bit more powerful.
"check_dns_secondary" will query for the name servers of the provided domain. Each NS server is queried individually for the SOA record of the domain. An error is generated if any server is not functioning, or not authoritative. A warning is generated if any server lags the others in serial-number.
A Nagios module written in Python that downloads the page and embedded elements using 'wget' to measure a more realistic total page load time value. The total number and size of elements as well as the time it took to load them is returned. A warn/critical alert can be triggered by the total load time.
Ideally you want your total page load time to be less than a few seconds. This means making sure your images are sized correctly and in the correct format, e.g. a JPEG vs. a large BMP file. It also means making sure any "embedded" objects like externally referenced image/media files do not slow your site down. Or perhaps your website is under load and just not responding in a timely fashion.
Using this plugin we can monitor the relative load time of select sites and/or pages within a site. Note that the check is somewhat dependent on the system executing the check and its network bandwidth. While you are unlikely to be running Nagios over dialup, bandwidth limitations and other traffic could definitely affect the total page load time. Also this could alert you in the case of a sub-optimal routing, latency, and/or packet loss issue. What I like to call the "TII", aka Transient Internet Issue.
Of course graphs are always visually pleasing and allow you to make your point about what happens when Marketing uploads 1meg BMP files instead of the recommended JPEGs. Here is the NagiosGraph map file entry I added.
# Service type: check_http_complete
# output: OK - Downloaded: 149K bytes in 8 files in 0.83 seconds
# perfdata: time=0.83;size=149K;number=8
/perfdata:.*time=([.0-9]+);size=(\d+)K;number=(\d+)/ and
push @s, [ http_complete,
[ sec, GAUGE, $1 ],
[ KB, GAUGE, $2 ],
[ files, GAUGE, $3 ] ];
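If you want to test the pattern outside of NagiosGraph, the same regular expression works unchanged in Python. This is just a sketch for experimenting; the function name is made up, and the dictionary keys simply reuse the data source names from the map entry above.

```python
import re

PERFDATA_PATTERN = re.compile(r"time=([.0-9]+);size=(\d+)K;number=(\d+)")


def parse_http_complete(perfdata):
    """Extract load time, size and file count, or None if nothing matches."""
    match = PERFDATA_PATTERN.search(perfdata)
    if match is None:
        return None
    return {
        "sec": float(match.group(1)),
        "KB": int(match.group(2)),
        "files": int(match.group(3)),
    }


print(parse_http_complete("time=0.83;size=149K;number=8"))
```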
Available Here : http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F1352.html;d=1
A plugin written in Perl to monitor and check thresholds for memory based on the output of the 'free -mt' command.
Ref: Nagios Exchange - check_mem page
Installation
command[check_mem]=/usr/local/nagios/libexec/check_mem -w 80,20 -c 95,50
define service{
use servicetemplate2
hostgroup_name linux-servers
service_description memory
check_command check_nrpe!check_mem
}
Command Line Syntax
# /usr/local/nagios/libexec/check_mem -w 50,20 -c 80,50
<b>WARNING: Memory Usage (W> 50, C> 80): 72% <br>Swap Usage (W> 20, C> 50): 0%</b> \
|MemUsed=72%;50;80 SwapUsed=0%;20;50
Display the Data
To add this to your "map" file for NagiosGraph, append the code to capture the data like below:
# Service type: check_mem
# check command: check_nrpe!check_mem -w 50,10 -c 80,25
# output: <b>CRITICAL: Memory Usage (W> 80, C> 95): 100% <br>Swap Usage (W> 20, C> 50): 0%</b>
# perfdata: MemUsed=100%;80;95 SwapUsed=0%;20;50
/perfdata:.*MemUsed=(\d+)%;(\d+);(\d+).*?SwapUsed=(\d+)%;(\d+);(\d+)/
and push @s, [ memory,
[ ramuse, GAUGE, $1 ],
[ swapuse, GAUGE, $4 ] ];
A useful command to check linux system resources is "top". However with the buffer cache you may see almost no available memory (see attached screenshot). But how can that be? All you have running may be a java app, apache, and a few other services. There is no way that should be using ALL of that RAM. In my case, I have the Nagios "check_mem" plugin querying for available memory and throwing alerts quite regularly.
The "free -mt" command can show you how much memory is cached. That eases my mind a bit. While googling about buffer cache, I stumbled upon this article:
http://devcs.blogspot.com/2007/12/linux-buffer-cache-how-to-disable-it.html
This article gives a good rundown of what the buffer cache is all about. Also it mentions a nice little trick to dump the entire cache.
echo 1 > /proc/sys/vm/drop_caches
VOILA! Cache dumped and Nagios is happy. Just dumping the cache shouldn't be taken lightly as it MAY have some adverse effects depending on your server. However the cache should slowly start to build back up.
Looking into the actual check_mem script, you can have it exclude the buffers in the calculation of free memory. Check this line and make sure the value is "1":
my $DONT_INCLUDE_BUFFERS = 1;
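To see why that setting makes such a difference, here is a small Python sketch of the arithmetic. It does not reproduce the plugin's code. The function name and the sample numbers are made up, and it assumes the older "free -m" column layout (total, used, free, shared, buffers, cached); newer versions of free print different columns.

```python
SAMPLE = """\
             total       used       free     shared    buffers     cached
Mem:          2048       1990         58          0        120       1400
Swap:         1024          0       1024
"""


def used_percent(free_output, exclude_cache=True):
    """Return memory usage in percent, optionally ignoring buffers and cache."""
    for line in free_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "Mem:":
            total, used = int(fields[1]), int(fields[2])
            buffers, cached = int(fields[5]), int(fields[6])
            if exclude_cache:
                # Buffers and cache are reclaimable, so do not count them as used.
                used -= buffers + cached
            return round(100.0 * used / total)
    raise ValueError("no 'Mem:' line found")


print(used_percent(SAMPLE, exclude_cache=False))
print(used_percent(SAMPLE, exclude_cache=True))
```

With the made-up numbers above, the same box looks almost full when the cache is counted and mostly idle when it is not, which is exactly the difference between a paging Nagios and a quiet one.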
Ref: http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F1435.html;d=1
Written in Perl. Requires FreeTDS to be installed.
This plugin can query a Microsoft SQL Server or a MySQL Server. The plugin can also execute specific queries or stored procedures and return the results. The results can then be compared via thresholds for numeric values or via regular expressions for string values.
Here is a good example of how I used the plugin in my blog. Count Log Entries Stored Procedure and check by Nagios.
Check_svn is a nagios check written in Python which will check the availability of your SVN repository from your Nagios server.
Ref: http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F1554.html;d=1
The "check_svn" plugin worked from the command line. However when I attempted to configure the Nagios check, an error occurred.
SVN CRITICAL: Error connecting to svn server - Can't open file '/root/.subversion/servers': Permission denied .
Solutions
Apparently this is caused by a Nagios environment issue: svn looks for its configuration under /root instead of the nagios user's home directory. The nagiosexchange page recommends one solution. Alternatively you can modify the check_svn script.
command_line export HOME=/home/nagios && $USER1$/check_svn
if self.confdir:
    cmd += " --config-dir=%s" % self.confdir
All should work well now. I also make sure to use the "-T" option to output the test execution time which I can now graph with nagiosgraph.
I wanted to add the feature to send nagios alerts via Jabber, aka XMPP protocol, instant messages. Our office uses the Openfire platform for corporate instant messaging. Some googling found this:
I modified the server connection and user variables to fit my Openfire installation. However after running some simple command line tests, I could not get it to work. The unable to connect errors were easy enough to understand and fix, but then I got an unauthorized error.
ERROR: Authorization failed: error - not-authorized
The link below has information on the solution to the error. OK, so now we assume you CAN send a test message via Jabber/XMPP. Let's configure it.
There are two primary scenarios for configuring the Jabber/XMPP notifications.
Scenario 1 - User wants to be contacted for ALL Nagios alerts via Jabber/XMPP.
Define the contact command something like:
# 'notify-by-jabber' command definition
define command{
command_name notify-by-jabber
command_line /usr/local/nagios/bin/notify_via_jabber \
$CONTACTADDRESS1$ "$NOTIFICATIONTYPE$ $HOSTNAME$ \
$SERVICEDESC$ $SERVICESTATE$ $SERVICEOUTPUT$ $LONGDATETIME$"
}
*note - I'm using the backslash for readability only, actually only one line.
Notice how I used "$CONTACTADDRESS1$" in the command line. Now check out my Contact definition...
define contact{
contact_name jdoe
use generic-contact
alias John Doe
email jdoe@company.com
address1 jdoe
service_notification_commands +notify-by-jabber
}
This will use the generic contact info for the email method -and- add the jabber contact method. Note the "address1" line I am using for the appropriate jabber_id for the user. Alternatively you could create a contact template called "jabber-contact" using the notify-by-jabber command and then apply both templates to the user contact definition.
Scenario 2 - User only needs specific Nagios service/host alerts via Jabber/XMPP.
Unfortunately this is not as simple. You would need to define a second contact entirely to assign to the specific host/service. Create a "jdoe-jabber" contact so on your Nagios host/service definition you would have a line like:
contacts jdoe,jdoe-jabber
Maybe with an add-on or future version of Nagios we could define a user/contact method along the lines of:
contacts jdoe:notify-by-jabber,jdoe:notify-by-email
Anyone listening? Feel free to send me a note or comment.
| Attachment | Size |
|---|---|
| jabber-xmpp-notications.txt | 2.45 KB |
ERROR: Authorization failed: error - not-authorized
Obviously this means I am connected and communicating with the server. So I flipped the TLS variable on/off a few times. I also turned on debug logging on my Openfire server and could see the connections. With TLS "on" or set to "1", I received this error:
Can't use an undefined value as a HASH reference at /usr/lib/perl5/site_perl/5.8.8/XML/Stream.pm line 1165.
So I turned TLS off and started working with the not-authorized error. Googled around and found a fix that worked. I had to edit the Protocol.pm Perl module to fix the authentication error. I found my file here...
/usr/lib/perl5/site_perl/5.8.8/Net/XMPP/Protocol.pm
and just commented out the line:
return $self->AuthSASL(%args);
Now I can annoy myself with all my Nagios alerts via IM as well as email! Attached is the file with the text used for the commands.cfg entry and the perl script to send the notifications.
Want to send SMS messages from Nagios? Are SMS messages sometimes blocked when you send them via <phone#>@provider.com? Sending a lot of SMS messages? Sounds like you need an SMS Gateway Provider.
In our corporate environment we wanted a more reliable/consistent SMS messaging system to work with our Nagios monitoring environment. A little research quickly led us to Clickatell. To keep things as simple as possible, we set up Nagios to use the SMS Gateway via its SMTP API.
Now we had to configure Nagios. First off, we needed to create a notification method/command. Here is the commands.cfg entry we created:
define command{
command_name notify-service-by-sms
command_line /usr/bin/printf "%b" "api_id:<API_ID> \nuser:<USERNAME> \npassword:<PASSWORD> \nto:$CONTACTPAGER$\nreply:<REPLY_ADDY> \ntext:$NOTIFICATIONTYPE$ $HOSTALIAS$-$SERVICEDESC$ $SERVICESTATE$\ntext:Address-$HOSTADDRESS$\ntext:Additional Info-$SERVICEOUTPUT$" | /bin/mail -s "$HOSTALIAS$-$SERVICEDESC$ is $SERVICESTATE$" $CONTACTEMAIL$
}
The items in angle brackets must be filled in for your environment. Notice how we had to significantly strip down the info sent via SMS text message. We found the above was simple and communicated the required info.
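Since the whole command has to sit on one line in the Nagios config, the layout of the email body is hard to see. Here is a short Python sketch that builds the same body so the structure is visible. The function is hypothetical and only mirrors the field names used in the command above (api_id, user, password, to, reply, text); check Clickatell's own documentation before relying on it.

```python
def build_sms_body(api_id, user, password, pager, reply, text_lines):
    """Return the email body: one 'key:value' pair per line, text lines last."""
    fields = [
        ("api_id", api_id),
        ("user", user),
        ("password", password),
        ("to", pager),
        ("reply", reply),
    ]
    # Each part of the message gets its own "text:" line, as in the command.
    fields.extend(("text", line) for line in text_lines)
    return "\n".join("{}:{}".format(key, value) for key, value in fields)


print(build_sms_body("<API_ID>", "<USERNAME>", "<PASSWORD>",
                     "16665551234", "<REPLY_ADDY>",
                     ["PROBLEM web-HTTP CRITICAL", "Address-10.0.0.5"]))
```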
Then let's create a contact to use...
define contact {
contact_name sms
use generic-contact
alias SMS Alert
email sms@messaging.clickatell.com
pager 16665551234,16665554321
service_notification_commands notify-service-by-sms
}
The phone numbers to SMS are just a comma separated list. It is important to note the phone numbers must have a "1" before them.
Now simply add the "sms" contact in the service definitions you want to alert by SMS text messages. Reload Nagios and you should be off and running.
Out of the box the Nagios Plugins package has a check_file_age plugin. Well, that only checks to see if the file has been modified in the last specified time period and alerts if the file has NOT changed recently. I needed the exact opposite, to check to see if the file has been modified.
The "reversal" can be accomplished by changing two ">" symbols to "<" in the comparisons of the function. Of course, I also changed the few occurrences of the plugin name to a new name. This is a good sanity check: when any critical config file that I designate is changed, it will alert the IT team. Great for catching developers with sudo access touching key config files they should not be, such as the httpd.conf file.
Using NagiosGraph with Nagios can provide valuable information about your environment. At times, you may want to show something on a different scale or limit the data seen within the graph. Below are a few basics for manipulating NagiosGraph's show.cgi to customize the graph you are viewing.
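To make the flipped comparison concrete, here is a small Python sketch of the reversed logic. It is not the plugin itself (that one is Perl), and the function names and example thresholds are made up. The point is simply that a YOUNG file is now the problem, so the critical window is the shorter one.

```python
import os
import time


def classify_age(age_seconds, warn_seconds, crit_seconds):
    """Return a Nagios status code: alert when the file changed recently."""
    if age_seconds < crit_seconds:
        return 2  # CRITICAL: modified within the critical window
    if age_seconds < warn_seconds:
        return 1  # WARNING: modified within the warning window
    return 0      # OK: untouched for longer than both windows


def file_age(path, now=None):
    """Return the number of seconds since path was last modified."""
    now = time.time() if now is None else now
    return int(now - os.stat(path).st_mtime)
```

For example, classify_age(file_age("/etc/httpd/conf/httpd.conf"), 3600, 600) would go CRITICAL for ten minutes after someone touches the file and stay at WARNING for the rest of the hour. The path and the thresholds here are placeholders only.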
We use the "check_mssql_monitor 0.9.0" with our Microsoft SQL Clusters and graph the results. My NagiosGraph "map" file graphs the CPU, IO, IDLE, and response time of checks and Nagios notes_url for the check links to the default graph. However the scale of values typically render the IO nearly flat, yet a small change in IO can be significant. Now we want to just generate IO graphs of each of our clusters to compare them to each other.
Here's the default URL of the graph we are working with:
http://nagios.domain.com/nagiosgraph/show.cgi?host=servername&service=MSSQLinfo

By manipulating the URL, we can do some handy tweaks. Let's start with only graphing the IO. To do that, we need to add the appropriate options to the URL. Adding the datasource name and valuename to the URL like this:
http://nagios.domain.com/nagiosgraph/show.cgi?host=servername&service=MSSQLinfo&db=mssql_monitor,io
My favorite option is to make the graph bigger! After all, bigger is better, right? Just add the geometry option to the URL like so:
http://nagios.domain.com/nagiosgraph/show.cgi?host=servername&service=MSSQLinfo&db=mssql_monitor,io&geom=700x200

Now we have a much larger graph and are able to see the IO response increasing during the given time period. Next step is to figure out why!
HINT: for the db source name, check out the default graph and look for the name immediately under the graph. Then add the value as it appears in the legend.
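If you need these URLs for a whole list of clusters, it is easier to generate them than to edit them by hand. Here is a minimal Python sketch; the graph_url() helper is hypothetical, and it only knows about the host, service, db and geom options used in the examples above.

```python
from urllib.parse import urlencode


def graph_url(base, host, service, db=None, geom=None):
    """Build a show.cgi URL, adding the db and geom options when given."""
    params = [("host", host), ("service", service)]
    if db:
        params.append(("db", db))
    if geom:
        params.append(("geom", geom))
    # Keep the comma in "datasource,value" readable instead of encoding it.
    return base + "?" + urlencode(params, safe=",")


print(graph_url("http://nagios.domain.com/nagiosgraph/show.cgi",
                "servername", "MSSQLinfo",
                db="mssql_monitor,io", geom="700x200"))
```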
While building another stored procedure that I execute by check_mssql in Nagios, I noticed a little hiccup. My stored procedure was returning a value like "85.67". When I executed check_mssql on the command line to run the procedure, I got a strange error...
# ./check_mssql -H mssqlclus1 -U username -P password -p 1433 -D database -w 20 -c 35 \
> -q "exec database.dbo.sp_GetDatabaseFileMetrics database,Used,Log,1" -W 90.00 -C 95.00 -s
CHECK_MSSQL CRITICAL - Result is not numeric with result threshold defined (0.089992 seconds) \
| time=0.089992s;20;35
Now that does not make sense at all. I removed the -W and -C constraints and got:
CHECK_MSSQL OK - SQL Server result: 98.10 (0.122316 seconds) | time=0.122316s;20;35
I do not know about you, but "98.10" looks like a numeric to me. So I opened up the perl for the check_mssql and looked for the conditions that triggered the error. This was the regular expression it was evaluating to determine if the value returned was a numeric instead of a string.
$result =~ /^[-+]?\d+$/
Well, that does not do the trick if I have a value like "98.10". A value of "98" would have been fine. I freely admit I am no "code guru" by any means, but I figured I should be able to come up with a fix for this. I copied the plugin to 'check_mssql2' and went to work. I created an OR condition to look for the integer regular expression or a decimal regular expression. There may be a better way, but this worked for me. I changed this:
!($result =~ /^[-+]?\d+$/)) {
to this:
!(($result =~ /^[-+]?\d+$/) || ($result =~ /^[-+]?\d+\.\d+$/))) {
I ran my tests and it worked great! A really useful link I found was this Regular Expression Validator at http://www.sweeting.org/mark/html/revalid.php. It also has some very handy reference info on the bottom which I found useful since I do not have the pleasure of writing them on a daily basis.
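If you do not have a validator handy, a few lines of Python are enough to try both patterns against sample values. This is only a sketch for experimenting, not part of the plugin. The second pattern folds my two conditions into one expression with an optional decimal part, which should behave the same as the OR version.

```python
import re

INTEGER_ONLY = re.compile(r"^[-+]?\d+$")
INTEGER_OR_DECIMAL = re.compile(r"^[-+]?\d+(\.\d+)?$")


def is_numeric(value, pattern=INTEGER_OR_DECIMAL):
    """Return True if value matches the given numeric pattern."""
    return pattern.match(value) is not None


for sample in ["98", "98.10", "-3.5", "abc"]:
    print(sample, is_numeric(sample, INTEGER_ONLY), is_numeric(sample))
```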