So, the announcement of FreeBSD 9.2 came out on Monday [September 30th], which I missed because I was focused on my UNMC thing. But, once it appeared, I knew that I was going to want to upgrade to it sooner than later.
From its highlights, the main items that caught my attention were:
But, I did start this upgrade on October 4th....where for an unknown reason, I launched the
freebsd-update process on cbox, the busier of the two headless servers. I suspect I went with doing the upgrade on my headless servers, because they are entirely running on SSD and would likely see the benefit of lz4 compression. And, perhaps I did cbox, because it was the system that could most gain from lz4.
It took a couple iterations through
freebsd-update, before I got an upgrade scenario that could proceed. And, it took a long time given the high load that is cbox.
That is cbox is an Atom D2700 (2.13GHz, dual core) processor. And, cacti (especially with the inefficient, processor/memory intensive percona monitoring scripts -- might help if only scrpt server support worked, and wasn't just a left over from what it was based on.) being the main source of load. That is usually in the 11.xx area, except during certain other events (like, since 3.5, when
cf-agent fires...cbox is set to run at a lower frequency than my other systems.) or when the majority of logs get rotated and bzip'd. And, there's also some impact when zen connects to
rsyncd each day for
backuppc. But, these spikes weren't that significant. Though the high load would cause
cf-agent runs to take orders of magnitude longer than other systems, including its 'twin' dbox.
Also ran into a problem (again?) where a lot of the differences that
freebsd-update needed resolved were differences in revision tags....some as silly as '9.2' vs '9.1', others had new time stamps or usernames, but seldom any changes to the contents of the file. Which I then discovered a problem from having some of these files under
cfengine would revert these files back to having '9.1' revision strings, which confused the
freebsd-update. I ended up updating all the files in
cfengine to have the 9.2 versioning, though I thought about just removing/replacing it with something else entirely, though wasn't sure the impact that would have on current/future
Though it did seem to cause problem with the other two upgrades, where it would say that some of these files were now removed and asked if I wanted to remove these. Which doesn't make sense, since it didn't say that with the first upgrade. It was probably just angry that these files already claimed to be from FreeBSD 9.2.
It also didn't like that I use
sendmail, therefore my sendmail configs are specific to my configuration, or that I use
printercap is the one auto-generated by cups, etc.
But, once it got to where it would let me run my first "
freebsd-update install". I ran it, rebooted, ran it again, rebooted, updated stuff (though it didn't complain as much, perhaps because some of the troublesome kernel mod ports had corrected the problem of installing into
/boot/kernel, or perhaps enough stayed the same between 9.1 and 9.2, that things didn't freak out like before. And, this includes the virtualbox kernel mod, when I did the upgrade on zen, and later mew. But, I re-installed these ports and lsof. I did a quick check of other services, and then upgraded the 'zroot' zpool to have feature flags (which now means it no longer has a version, apparently instead of jumping the numbers to distinguish from Sun/Oracle it has eliminated having version numbers (for beyond 28) and having flags for the features added since. Wonder if the flags capture all has changed since 28, since I thought there have been other improvements internal that aren't described by version numbers. Namely, I seem to recall that there have been improvements in recoverability....namely it had been suggested, when I was trying to recover a corrupt 'zroot' on
mew, to try finding a v5000 ZFS live CD. Which I don't think I ever found, and gave up anyways when I concluded the level of corruption was too great for any hope of recovery and that I needed to resort to a netbackup restore, before the last successful full get's expired. Though being that it was nearly 90 days old, the other two month fulls didn't exist due to system instability that eventually caused the corrupted zpool (eventually found to be a known bad revision of the Cougar Point chipset and a bad DIMM...things seem to finally be stable from using a SiI3132 SATA controller instead of the on board, and getting that bad DIMM replaced....was weird that it was a Dell Optiplex 990, purchased new over a year after the problem had been identified and a newer revision of the chipset was released. I did eventually convince Dell support to send me a new motherboard and replace the DIMM. The latter was good, since I had to use DIMMs from another Dell that had been upgraded, so I had less memory for a while. But, while at first I did use the onboard SATA again, eventually I started having problems that would result in losing a disk from the mirrored zpool, to eventually causing a reboot where they would both be present again [though gmirror would need manual intervention]....and moving back to the SiI3132 has finally gotten things stable again. Though the harddrives in mew are SATA-III, so it would've been desirable to have stayed on the SATA-III onboard ports, where it was these ports that were the main source of problems in the prior defective version. Perhaps the fact that the prior version had a heatsink and the new version didn't, wasn't because they didn't need it to try to compensate for the problems caused by over-driving the silicon for the SATA-III portion. But, an oversight with the newer revision motherboard. The problem did tend to occur in the early morning hours on the weekend, when not only is there a lot of daily disk activity, but there is also a lot of weekly disk activity, etc. Oh well.)
So, after upgrading the zpool, and reinstalling the boot block/code. I then rebooted the system again. I had already identified the zfs filesystems where I had 'compression=on', so had written a script to change all these to 'compression=lz4'. Which I now ran.
And, then I turned my attention to doing dbox.
Pages: 1· 2
There are Unix servers at work that have uptimes in the >1000 days, there are even servers with updates in the >2000 days, in fact there are servers that have now exceeded 2500 days (I'm looking at one with 2562+ days.)
On one hand there are SAs that see this as a badge of honor or something to have had a system stay up this long. OTOH, its a system of great dread.
A while back this system was having problems....its Solaris and somebody had filled up /tmp....fortunately, I was able to clean things up and recover before another SA resorted to hard rebooting it.
The problem with these long running servers, especially in a ever changing, multi-admin shop, is that you can't be sure that the system will come back up correctly after a reboot.
We've lost a few systems at work due to a reboot. Some significant ones as simple as replacing a root disks under vxvm and forgetting to update the sun partition table, or a zpool upgrade and forgetting to reinstall the boot. To more significant ones, where a former SA had temporarily changed the purpose of an existing system all by command line and running out of /tmp...so that after its been up for 3+ years and he's been gone over a year....patching and rebooting makes it disappear.... the hardware that the system was supposed to be on needed repair, but he had never gotten around to it.
It'll be interesting to see what happens should the system ever get rebooted.
So, what brought this post one?
And, its the same time suck....cacti.
Last weekend got away from me, because I to make another attempt to improve
cacti performance. I had tried adding 3 more devices to it, and that sent it over the limit.
I tried the
boost plugin....but it didn't help, and only made things more complicated and failure prone. Evidently, updating rrd files is not a constraint on my
cacti server. Probably because of running on an SSD.
I made another stab at getting the percona monitoring scripts to actually work under script server, but that failed. I suspect the scripts aren't reentrant, because of their use of global variables and relying on 'exit' to cleanup things it allocates or opens.
I had blown some previous weekend when I had tried to build the most recent version of hiphop to maybe compile the scripts, but after all the work in figuring out how to compile the latest 2.0x version...it would SEGV, just as the older lang/hiphop-php did after resolving the problem of building with the current boost (a template had changed to need a static method, meaning old code won't link with newer boost libraries without a definition of this.) And, this is beyond what I have in my wheelhouse to try to fix.
During the week, I had come across some more articles on tuning FreeBSD, namely a discussion of kern.hz for desktop vs servers. Where it being 1000 by default is good for desktops, but the historical setting of 100 being what to use for servers. Though IIRC, ubuntu uses 250 HZ for desktops and 100 HZ for servers, it also doesn't do preemption in its server kernel along with other changes (wonder if some of those would apply to FreeBSD?) Though modern kernels have been moving to be tickless. Which I thought was in for FreeBSD 9, though the more correct term is dynamic tick mode...and which is more about not doing unnecessary work when things are idle. Which isn't the case with 'cbox'. So, perhaps, fiddling with kern.hz and other sysctls might still be relevant. Though haven't really found anything detailed/complete on what would apply to my situation.
So, I thought I would give kern.hz=100 a shot.
At first it seemed to make a difference....no improvement in how long to complete a poll, but the load was lower. Until I realized that a service had failed to start after reboot. I had only run the rc script by hand, I hadn't tested it in a reboot situation. And, its not an rc script....it was used to be a single line in rc.local that worked on ubuntu and FreeBSD (except on one of the Ubuntu systems it results in a ton of zombie processes, so making it an init.d script that I could call restart on happened.
So, I spent quite a lot of time reworking it into what will hopefully be an accept rc script. One thing I had changed was that instead of using a pipe ('|') which was causing the process after the pipe to respawn and turn the previous process into a zombie each time the log file was rotated and "tail -F" announced the switch. And, this was while I was moving the service to FreeBSD (and management under cfengine 3.)
Though looking at my cacti graphs later....while the service had failed to start after reboot, it turned out to have been running for sometime, until I had broken it completely in trying to rc-ify the init script. Will, duh....I had cfengine set to promise that the process was running, and it had repaired that it hadn't started after the reboot.
Another thing I had done with I had init-ified the startup of this service, was I switched from using pipe ('|') to using a fifo, which addressed the respawning and zombie problem and eliminated the original reason to have an init.d script....
While the init.d script had worked on FreeBSD...it was just starting the two processes with '&' on the end then exiting. FreeBSD's rc subroutines do a bit more than that. So things weren't working. The problem was that even though I was using daemon instead of '&', so that daemon would capture the pid and make a pidfile. seems daemon wants the process it manages to be fully working before it'll detach. But, the process is blocked until there's a sink on the other end of the fifo. (does sink fit was the name for the fifo's reader?) I first wonder if I could just flip the two around, but I suspect starting the read process first would be just as blocked until the write process is started. So, I cheated by doing a prestart of the writing process and only tracking the reading process.
Though it took a bit more work to get the 'status' action to work....eventually found I needed to define 'interpreter' since the reading process is a perl script. And, the check_pidfile does more than just check to see if there's a process at the pid, but that its the right process. And, it distinguishes between arg0 and the rest.
Pretty slick...guess I need to do a more thorough reading of the various FreeBSD handbooks, etc. Of course, it has been 13+ years between when I first played with FreeBSD to its take over of my life now.
As for the tuning....it had made a small difference, but no improvement on cacti system stats. Basically the load average fluctuates a bit more and the CPU utilization seems to be a bit lower...though it could because the 4 lines of the cacti graph aren't so close to each other now.
Meanwhile...I noticed that one of the block rules in my firewall had a much higher count than I would expect, so I think I was about to get logging configured to see what that's about.....(which I was working on when I remembered that I hadn't rebooted after making the kern.hz change to /boot/loader.conf yesterday...the commit also picked up files that I had touched while working on moving the one remaining application on 'box', though that may get delayed to another weekend....perhaps the 4 day one coming up.)
I had set cf-execd's schedule to be really infrequent (3 times an hour), because I was doing a lot of testing and cf-agent collisions are messy....messier than they were in cfengine 2 (in 2 it usually just failed to connect and aborted, in 3 it would keep trying and splatter bits and pieces everywhere....which is bad when there are parts using single copy nirvana. resulting in services getting less specific configs, until the next run.
But, I sort of brought back dynamic bundle sequences.... but key off of "from_cfexecd", so I can test my new promise with less problems of colliding with established promises. Though there are other areas where things still get messy.... need to clean up some of the promises I had based on how things were done at work, so that the promises are more standalone.
Kind of weird using my home cfengine 3 setup, and other admin activities, as the means to break the bad habits I had picked up at work....
Latest Poopli Updaters -- http://lkc.me/poop
|<< <||> >>|
«air purifier» progressive zen orac b2evolution raid «powersource 400» cox upgrade «hd movie» cpap mdadm 10.04lts ups «sans digital» lhaven raid1 woot tv «tivo hd» appletv cfengine3 prescription usb netflix boinc «watch instantly» eyeglasses «windows xp» «amazon prime» box ubuntu ebay tardis virtualbox dsl replaytv tivo twitter voip «doctor who» dvd «chicago tardis» quicken freebsd «windows 7» «instant streaming» linux backuppc amazon.com