Thursday 5 April 2007

Nagios on CentOS - a grudging union

CentOS is one of the great network operating systems. It was developed by a group of people who saw that Red Hat Enterprise Linux version 4 had become super reliable but slightly bloaty, so they exercised their rights under the GNU General Public License, got the source of RHEL4, put it on a stairmaster and gave it to the people.

Likewise Nagios is a great open source network monitoring system. If you are a Linux user and run a network, chances are you will have come across Nagios; short of forking out about £1000 it is in fact pretty much your only option. About 12 months ago I installed Nagios on Fedora and it was a breeze. Even though Nagios is a very comprehensive system requiring lots of fiddly configuration, on Fedora if you follow the instructions you will be up and running in about an hour.

Unfortunately, even though CentOS and Fedora have a common ancestry and are very similar, trying to install Nagios on CentOS will drive you up the wall. Unless I have done something stupid without realising it, installing the system from the RHEL4 RPMs seems to scatter the files from one end of the disk to the other, and it takes lots of patience to track them all down and link everything up. My advice would be to follow the instructions to the letter, but don't be surprised if you don't find the files you are looking for. See the main Nagios site for the install guide; this post is not a guide to installing Nagios on CentOS, just an amendment to the install guide based upon my rather frustrating experience.

Just in case I forget or anyone else trips over this, the locations are as follows:

Config CFG files - /etc/nagios
Web interface files - /usr/share/nagios
Log files - /var/log/
CGI files - /usr/lib/nagios/cgi
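
If you want to confirm where things landed on your own box, the list above can be checked with a few lines of Python. This is only a rough sketch: the four paths are the ones from my install, while the names `NAGIOS_PATHS` and `check_paths` are made up for illustration, and your RPMs may well put things somewhere else.

```python
import os

# Directories reported in this post for the RHEL4 RPM install on CentOS.
# Other packages or versions may use different locations.
NAGIOS_PATHS = {
    "config": "/etc/nagios",
    "web interface": "/usr/share/nagios",
    "logs": "/var/log/",
    "cgi": "/usr/lib/nagios/cgi",
}


def check_paths(paths, exists=os.path.isdir):
    """Map each label to True/False depending on whether its directory exists."""
    return {label: exists(path) for label, path in paths.items()}


if __name__ == "__main__":
    for label, found in check_paths(NAGIOS_PATHS).items():
        status = "found" if found else "MISSING"
        print(f"{label:15} {NAGIOS_PATHS[label]:25} {status}")
```

Anything reported as MISSING is a hint that your package scattered the files differently from mine.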

A guy called Dag (??) has done some CentOS RPMs but I couldn't subscribe to his repository; if you can, it is quite possible that he has reworked the install to follow the instructions. Couldn't resist doing my 'Google Images' thing for Dag, turns out this Swedish guy is also comfortable going by the name Dag. There are some great translations for Dag on Wikipedia: in Swedish it means 'Day' and in Turkish it refers to a 'Mountain'.




So now the dust has settled after our mammoth network rewire last week and Nagios is running sweetly, I feel quite satisfied with everything. As expected we have had a few static routes crawl out of the woodwork, and we have renewed our efforts to use DNS rather than IP addresses for routing around the network. It turns out reversing the VPN connections was not all that it promised and we have moved them all back again. It also seems that having Nagios running is actually very good for the stability of the VPN, as the frequent pinging seems to keep the routers awake and the tunnels in good repair.

One job left to complete is to define a custom status map for Nagios. As we have over a hundred nodes on the network being monitored, the auto-generated map is a bit of a mess, so I have to define the map by hand, which is a bit of a pain. That said, it will look very nice, as we have purchased an icon library for our software development and their networking set is very sweet. See left for a sneak peek; note however that our main managed switches are not down, it is just that they are missing a firmware upgrade Netgear have issued. One day I would love to do a more comprehensive Flash front-end to Nagios, but frankly right now I have better things to do.

Another job I think would pay dividends would be to set up a secondary DNS server at Manchester. It is probably quite straightforward, but I think I will let the dust settle before attempting this one.

23 comments:

Anonymous said...

I just finished a case study article for our site on a Nagios user that basically replaced it with a third-party, out-of-the-box solution from Hyperic. Sounds like things are going pretty well for you however.

I'm not clear though -- are you using CentOS as your mission critical OS? Where do you get your support from, if any? Did you have RHEL at one time? And, is Nagios historically buggy with Fedora/CentOS/RH, or is this a case that is specific to your environment?

Thanks!

Jack Loftus, News Writer
SearchEnterpriseLinux

Kieron Williams said...

Jack,

Thanks for the comment, nice to know a real person is reading and not just an army (ok at 25 it's more of a platoon) of RSS readers :o)

One of the reasons we chose Nagios is that our network is not that much of a critical part of our operation. Without wanting to understate the importance of smooth IT and administration we are a pub company and so if the beer stops flowing we have a problem, if the email is off for 30 minutes it is simply an inconvenience. That said we wanted to be proactive and have a network monitor so that out of hours we can fix minor issues but what I was not going to do was dip into the IT budget and wreck a grand on it.

Reading the summary of your article on Nagios, it seems your case study was an internet-based company, in which case of course network down time equates to hard cash lost very quickly. Also our requirements are quite basic in terms of monitoring services: essentially I want to know that the pubs are online and the routers are responding, so we just ping everything. Looking forward we might extend things to monitoring UPS units and a very limited series of services like DNS, but we are not a power user by any stretch.

As for operating systems, our use of CentOS is not mission critical just yet, although I have no objection to it becoming so. Our main systems are running on Windows but we were getting frustrated by the licensing constraints when we were developing new systems. For example, if we want a new database installation to experiment with new applications I do not want to run out and buy a new Microsoft SQL license, so to keep this side of the law we started using MySQL and were very pleasantly surprised. We had the same issue when we were looking at fail-over servers: we do not want to go out and buy 2 or 3 grand's worth of software for it to sit in a corner ticking over 'just in case DNS fails', so we began to experiment with Fedora and again were very pleased with what we found.

There were 2 reasons for using CentOS instead of Fedora. Firstly, our data centre operators recommended it to us as they were big fans. Secondly, we wanted a cut-down installation of Fedora, as we wanted a virtual server running 4 OS's and didn't need a nice GUI. We have been running CentOS now for 5 months whilst we bring on our next generation of applications and (find me a big piece of wood) have found it very reliable, even having squeezed 4 installs on a VMWare base.

As far as support goes, we simply use the Internet. We paid a lot for Microsoft SBS and get no support from them, so we don't really miss it! I would say that having used Nagios on Fedora in the past, installation was fine; for some reason with CentOS it just littered files from one end of the file system to the other, but it could easily have been my fault.

I hope this gives you a bit more information.

Cheers
Kieron

JBC said...

Pretty slick Kieron.

(re: http://www.brunningandprice.co.uk/)

Not a bad gig you have going there. If you don't mind I'm going to keep your blog in mind and might shoot you some more Nagios/Linux related comments/questions in the future for SEL articles.

Cheers, and thanks again.

Kieron Williams said...

Please do, glad you found the Blog of interest.

Cheers
Kieron

Leo Martinez said...

Hi Kieron,

I am in the process of revamping the laboratory for my teams. I lead a couple of tech-support teams for enterprise products. Our test servers are spread across several OS versions and there's no way for me to know if they are really being used efficiently.

I came across Nagios and tried to read the content of the webpage, but I was really looking for an actual implementation, so I am glad to have come across your blog. I am looking at monitoring all of the test servers in our laboratory (which are owned by more than 20 tech-support engineers).

I have a couple of questions though:
1. Can it monitor CPU, file and memory usage?
2. Can it monitor M$ OS performance (as in question 1) as well?
3. What routers does it support? I am assuming it's supporting Cisco.
4. I am amazed to see switches being monitored. I am assuming these are managed switches. Is this correct?

If you have time, I hope you can answer my questions.

Thanks,
Maenard

Kieron Williams said...

Richard,

I would suggest that if you want to really start using the more powerful features of Nagios you need to visit Nagios Exchange (http://www.nagiosexchange.org). All of the more exotic monitoring functions are accomplished using a very flexible plugin architecture and a brief visit will give you a flavour of what has been produced.

I can see in the OS category under Windows there is indeed a plugin to monitor CPU usage and Memory usage. I must confess that we use none of these more advanced features and the only function we monitor on the switches (which are Netgear managed switches) is their ability to accept a ping request. It turns out they needed a firmware upgrade even to accomplish this :o)

Hope this helps
Cheers
Kieron

Anonymous said...

Hi Kieron,

Just wanted to say nice post - I'm currently evaluating Nagios for my company so was interesting to read about your experiences.

Cheers

Robbie

Anonymous said...

Do you have multiple nagios nodes in a distributed architecture - reporting back to a central dashboard? or is that just one Nagios node doing all the checks?

What solution did you choose in the end for the VPN connection between the sites?

thanks (looking to do the same thing myself very soon)

Andy

Kieron Williams said...

Andy,

We only have one Nagios installation doing all the work, but I must say we have recently been investigating Spiceworks, which is the dog's danglies. Platform requirements are different and it's a bit Windows orientated, but it goes a lot further into auditing systems and software than Nagios. I think if things keep going as they are we might switch, so check it out.

Cheers
Kieron

Anonymous said...

g6luky Your blog is great. Articles is interesting!

Anonymous said...

We do use Nagios in a mission-critical 24x7 availability environment. It is installed on CentOS as well. We have several hundred servers being monitored, with 4-5 checks on average for each server. CPU/MEM/DISK/HP SIM are checked on all Windows servers (most of our app servers provide us with .aspx?health or .dll?health pages with error codes), and then there are checks that specific services are running, HTTP response, SSL certificate expiry etc.
Linux servers are a similar thing: all servers have load, mem and disk checks, and then specific server-role stuff like web server, SSL for web, hit xyz URL and check for a specific string (it can even follow 'links').

It checks which servers are on the load balancer, checks the VIPs (no MS SQL specific checks were set up for some reason)... you get the idea.

If hosts and services are grouped properly, very complex escalations can be set up with ease. Our setup pages all IT members during 9-6 (M-F work hours). During 10-6 (M-F) it pages the night guy; if he does not answer with an ack within 1-3 notifications, the on-call person starts getting paged as well, and then the rest of the team if neither contact acks in time. In the pockets of time between work hours and the night guy's shift it pages the on-call person, and again if he does not ack then everyone gets paged.

The timeperiods concept is also used for services, so for example during the night (backup hours) we don't raise alerts for high CPU/disk usage, but during the day we expect to get alerted if it exceeds thresholds for a certain amount of time.
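
To make the timeperiod idea concrete, here is a minimal sketch in Python rather than actual Nagios configuration. The backup window of 22:00 to 06:00 and the names `in_window` and `should_alert` are assumptions for illustration only (the real hours are not given above), and the "for a certain amount of time" part is left out to keep it short.

```python
from datetime import time

# Hypothetical backup window; the actual hours are not stated in this thread.
BACKUP_START = time(22, 0)
BACKUP_END = time(6, 0)


def in_window(now, start, end):
    """True if `now` is within [start, end), including windows that wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def should_alert(usage_pct, threshold_pct, now):
    """Alert on high usage only when we are outside the backup window."""
    over_threshold = usage_pct > threshold_pct
    return over_threshold and not in_window(now, BACKUP_START, BACKUP_END)
```

With these assumed hours, 95% disk usage against a 90% threshold would alert at midday but stay quiet at 3am while the backups run.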

Other than Linux/Windows servers we also have servers that monitor temp/humidity in different parts of the datacenter, and Nagios (indirectly) checks those. We also have checks for Linux servers where Nagios looks at log files for certain error codes and sends alerts.

Just as a backup, we have a secondary Nagios server with 'rsynced' configs. This server is in a sleep state but has a shell script checking that the primary Nagios is alive every few minutes. As soon as the primary goes down the secondary wakes up with exactly the same config (only retention data is lost: comments, acknowledgements etc.). This setup has been live in production as our primary monitoring solution for a few years. We do have licenses for MOM (Microsoft Operations Manager) and its successor SCOM, but find Nagios more efficient and reliable. MOM/SCOM is good for retrieving data such as cpu/mem/graphs from Windows hosts. SCOM requires an agent to be installed on each server, while Nagios does all the Windows data gathering without agents on the end machines (it uses a windows wsc proxy though).
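
The sleeping-secondary arrangement described above could look roughly like the sketch below. It is written in Python rather than as the shell script mentioned, and several things are assumptions for illustration: the function names, the three-failures-in-a-row limit before waking, and the secondary going back to sleep once the primary answers again. A ping also only proves the host is up, not that the Nagios process itself is running.

```python
import subprocess


def primary_is_alive(host):
    """Send one ping (Linux `ping` options) and report whether it was answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def next_state(active, failures, alive, max_failures=3):
    """Return (active, failures) for the secondary after one check of the primary.

    active   -- whether the secondary is currently doing the monitoring
    failures -- consecutive failed checks seen so far
    alive    -- result of the latest check of the primary
    """
    if alive:
        # Assumption: once the primary answers, the secondary sleeps again.
        return False, 0
    failures += 1
    return (active or failures >= max_failures), failures
```

A scheduler such as cron would call `primary_is_alive` every few minutes, feed the result into `next_state`, and start or stop the local Nagios service whenever `active` changes.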

Karl said...

Hi Kieron.. really nice article...
how did you manage the graphical map??
this is really nice

Anonymous said...

I enjoyed reading your blog. Keep it that way.

Anonymous said...

Sweet website, I had not come across beerbytes.blogspot.com before during my searches!
Continue the excellent work!

Anonymous said...

Hello there,

This is a question for the webmaster/admin here at beerbytes.blogspot.com.

May I use some of the information from your blog post right above if I give a link back to your site?

Thanks,
Thomas

Kieron Williams said...

Of course - fill your boots

Anonymous said...

Hello there,

Thanks for sharing the link - but unfortunately it seems to be down? Does anybody here at beerbytes.blogspot.com have a mirror or another source?


Thanks,
John

Anonymous said...

Hey,

I have a question for the webmaster/admin here at beerbytes.blogspot.com.

May I use part of the information from your post right above if I give a backlink back to your website?

Thanks,
Peter

Kieron Williams said...

No problem

Anonymous said...

Hello there,

This is a message for the webmaster/admin here at beerbytes.blogspot.com.

May I use part of the information from your blog post above if I give a backlink back to your website?

Thanks,
Peter

Kieron Williams said...

No problem

Anonymous said...

Hi there,

This is a question for the webmaster/admin here at beerbytes.blogspot.com.

Can I use some of the information from your post above if I give a link back to your site?

Thanks,
Mark

Anonymous said...

Hello,

I have a message for the webmaster/admin here at beerbytes.blogspot.com.

May I use some of the information from your blog post right above if I give a backlink back to your site?

Thanks,
Daniel

A view from the rack is the personal blog of an IT manager who works for a pub company - hence beer