devopsdays Hamburg 2010

Monday, 10. 18. 2010  –  Category: conference

Recently back from devopsdays in Hamburg. Here’s a quick braindump whilst it’s all still fresh.

People

I went with a bunch of colleagues from the BBC, hooked up with some people I’d not seen for ages, unexpectedly met in real life people I’d previously only worked with online, and met a whole load of talented and interesting engineers.

Talks

The conference was split between set talks and open space sessions. The talk I got the most out of was Stephen Nelson-Smith’s talk on his experience of bringing devops-shaped thinking to a large government site refresh. His thoughts on what worked and what didn’t were interesting and well put.

John Willis gave an excellent presentation on Chef. I’ve monkeyed a bit with Chef but am only using Puppet in real-world situations so far. John’s a powerful evangelist and definitely got me enthused about Chef. I deliberately didn’t ask about Chef’s lack of a dry-run mode since I’ve banged on about it on the chef-users list twice now ((and, for the record, I understand why such a mode would be imperfect and not aligned with Chef’s overall philosophy)), and because I wanted to see if anyone else flagged it as a concern – they did. All that aside, I am certainly going to get it up in a lab soon.

Link dump

  • Marionette Collective (mcollective) – message bus framework for mass host control. I’m really interested by this. Seems similar to Engine Yard‘s Vertebrae thing, but perhaps more immediately usable for me.
  • Fabric – a more traditional (ie: centralised) mass control mechanism. We’ve all rolled our own variant of this at some point, right?
  • Zookeeper – mentioned in passing, need to check this out.
  • cucumber-puppet – not currently doing anything like this, probably should!
  • Circonus – flagged in passing, monitoring.
  • Web Operations – seems to be a good complement to Release It!, will grab and read.

Après-ski

The conference laid on a boat trip around Hamburg’s canals and harbour on the Friday night, with beer and soup. Brilliant! Afterwards we maxed out the lounge area of the Chilli Club. After sufficient beers and cocktails we headed off to the Bunker club for an epic all-nighter ((but no, I didn’t quite last the whole night!)). Clearly we didn’t know about the school-gym dress theme, but hey, that didn’t matter. The utterly bizarre story that unfolded on stage throughout the night was hilarious. House of Pain MC’d by a large sweaty man in a baby costume? Oh yes! (more outrageous moments omitted…)

Saturday night included more Brownian motion around the Reeperbahn, plus Mike and I hanging out in a bar 0wned by St Pauli supporters. I don’t know much about football but these guys are clearly very dedicated to their cause! Edgy. More drinks into the dawn, then tactical sleep before flying home.

Hamburgers

I was consistently amazed at how friendly everyone was. Wandering around Hamburg you could strike up a conversation with pretty much anyone and find something of interest to jib about. At several points strangers would change the course of their evening just to steer our slightly cat-herdesque group of geeks towards fun.

Thanks

Huge thanks to everyone who helped make this an excellent weekend, especially Marcel and Patrick.

Getting paranoid about ssh-agent

Wednesday, 09. 1. 2010  –  Category: vague

A colleague asked me about my SSH setup, which uses different SSH agents for each set of keys that I use (I tend to use a different keypair for each client I work with) and also makes ssh-agent confirm with me each time a key is used.

What’s the point of all that? Because it’s trivially easy to take over someone else’s SSH agent if you have root on a box they’re forwarding to:

$ ssh-add -l
1024 c7:ba:59:92:98:40:f4:53:75:e3:7f:03:fc:0e:3b:bd /Volumes/key/ssh/id_dsa-zomo-bbc (DSA)
$ sudo -i
# ls -ld /tmp/ssh-*
drwx------ 2 victim admins 4096 Aug 27 16:20 /tmp/ssh-bsKJhM8501
drwx------ 2 me  admins 4096 Sep  1 09:25 /tmp/ssh-NpAJW14419
# SSH_AUTH_SOCK=/tmp/ssh-bsKJhM8501/agent.8501 ssh-add -l
1024 7a:0a:df:bb:ab:cd:af:e1:04:97:cd:05:34:8c:b4:68 /home/victim/.ssh/id_dsa (DSA)

By setting SSH_AUTH_SOCK to their agent’s forwarding socket you can gain use of their agent for onward logins. Laws may apply.

Update: To be clear, the victim and the attacker here are both logged onto a remote host over SSH and using SSH agent forwarding. This isn’t a discussion of the risks of someone having root privilege on a machine where your SSH agent process runs (and your private SSH keys reside).

To mitigate this risk, I use a collection of scripts that do two things:

  1. Run different SSH agents for different keys, so that a compromised agent
    has only limited use (eg: root on client A’s hosts can’t use it to access
    client B’s hosts).
  2. Require ssh-agent to prompt for confirmation before it uses a key, so that
    a compromised agent stands less chance of being exploited (if I’m away, or
    I decline the request, nothing happens).

They’re here: http://github.com/zomo/ssh-bits. No points for elegance, but they scratch the itch.
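In outline, the two measures look something like this. This is a hypothetical sketch, not the actual ssh-bits scripts: the `client_agent` function name and the env-file location are invented for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of the two measures above -- not the actual
# ssh-bits scripts. The env-file location is invented.
# Usage: client_agent <client-name> <key-file>

client_agent() {
    client="$1"; key="$2"
    env_file="$HOME/.ssh-agent-$client.env"

    # 1. A dedicated agent per client: start it and record its
    #    environment so other shells can source the same agent
    ssh-agent -s > "$env_file"
    . "$env_file" > /dev/null

    # 2. Add the key with -c: the agent then asks for confirmation
    #    (via SSH_ASKPASS) every time the key is used, so a hijacked
    #    forwarded socket still needs my say-so for each signature
    ssh-add -c "$key"
}
```

Sourcing the right per-client env file before connecting keeps each client’s keys in their own compartment; a hijacked forwarded agent then exposes at most one client’s key, and even that only with per-use confirmation.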

cron

Wednesday, 02. 24. 2010  –  Category: sw

Obviously cron jobs are abundantly useful for so many things, all the way from basic housekeeping up to big application functionality.

They’re also the source of plenty of flail. What do I mean?

  • They are neither code nor data, so often get overlooked, or shonkily installed, by application deployment tools
  • They run with a minimal environment that can catch out the unwary: scripts that work in an interactive shell sometimes don’t from cron
  • The default behaviour of mailing output to the cronjob owner generates large amounts of mail that gets ignored, filtered or bounced
  • Jobs can fail silently and no-one notices until, say, you need to restore from that backup that hasn’t run for the last six months
  • Jobs that helpfully append their output to a log commonly don’t rotate that log
  • It’s easy to have jobs overlapping if they get stuck or take longer than expected to complete. This is a splendid way of wedging a machine.
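The environment point in particular is worth spelling out in the crontab itself. An illustrative fragment (the paths and address are made up, not from any system mentioned here):

```
# Illustrative crontab fragment: spell out the environment rather than
# relying on whatever an interactive shell would have provided
PATH=/usr/local/bin:/usr/bin:/bin
MAILTO=ops@example.com

# Nightly backup at 02:30 -- absolute paths, output appended to a log
# (which then needs rotating, as noted above)
30 2 * * *  /usr/local/bin/backup.sh >> /var/log/backup.log 2>&1
```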

The mail aspect is a particular peeve. In some jobs my mailbox has enjoyed several thousand cron-generated mails a day, and there’s no way I can accurately look at each one and react to it. Mostly they contain expected output from successful job execution, so they’re easy to skip. But I don’t trust my eyes to get that right all the time.

One approach to this is to arrange for jobs to only send mail on error. This is an improvement, but it can lead you into thinking that a job is happily succeeding when in fact it’s either not running at all or the only-on-error logic is bust. Since cron jobs often cover essential system tasks like backing up, syncing data around and reporting, it’s vital that they don’t fail silently.
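The only-on-error arrangement is typically a small wrapper along these lines (a sketch; the function name and mail address are made up). Note the failure mode just described: if cron never runs this at all, nobody gets told.

```shell
#!/bin/sh
# Sketch of an only-on-error wrapper: capture all output, and mail it
# only when the job exits non-zero. Function name and address are
# illustrative. If cron never runs this at all, no mail is sent --
# exactly the silent-failure trap described above.

run_quiet() {
    out=$(mktemp) || return 1
    if "$@" > "$out" 2>&1; then
        status=0                  # success: stay silent
    else
        status=$?                 # capture the job's exit status
        mail -s "cron failure: $1" ops@example.com < "$out"
    fi
    rm -f "$out"
    return "$status"
}
```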

I’ve worked somewhere that tackled this by collating cron-generated mails from diverse systems into a system mailbox and pattern-matching them for signs of failure. This seems slightly dubious — it’s fragile and labour-intensive — but at least the system also flagged when expected mails failed to arrive, and it got our inboxes tamed.

To tackle these problems I find myself writing wrappers for cron jobs. I’ve written several variants to meet different situations’ needs. Unhelpfully, I call them all cronwrap. These wrappers set out to:

  • Engage the amazingly useful lockrun utility to guard against multiple execution of stuck crons
  • Place cron output into timestamped logs that can be both aged out and made available to interested parties
  • Hook into local monitoring systems:
    1. On execution, update a run counter (SNMP data or some simple text file)
    2. On failure, send an SNMP trap or leave some bait for Nagios. Also, update a fail counter
    3. If lockrun has prevented a job running owing to overlap, send an SNMP trap or similarly bait Nagios
  • If required, send output by mail somewhere (sometimes this is necessary, even with the concerns listed above)
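A stripped-down version of such a wrapper might look like this. It’s a sketch only: it uses flock(1) in place of lockrun, and the directory layout and counter files are invented, not the real cronwrap.

```shell
#!/bin/sh
# Stripped-down cronwrap-style sketch. Uses flock(1) rather than
# lockrun; directories and counter files are invented, not the real
# cronwrap. Usage: cronwrap <job-name> <command> [args...]

cronwrap() {
    name="$1"; shift
    base="${CRONWRAP_DIR:-/var/lib/cronwrap}/$name"
    mkdir -p "$base"

    # Guard against overlap: fail straight away if the previous run
    # still holds the lock, and leave a marker for monitoring to find
    exec 9> "$base/lock"
    if ! flock -n 9; then
        date +%s >> "$base/overlapped"
        return 1
    fi

    # Timestamped log per run, so output can be aged out and inspected
    log="$base/$(date +%Y%m%d-%H%M%S).log"
    "$@" > "$log" 2>&1
    status=$?

    # Run and fail counters for the monitoring hooks to pick up
    date +%s >> "$base/runs"
    [ "$status" -ne 0 ] && date +%s >> "$base/failures"

    return "$status"
}
```

A real version would swap the counter files for SNMP updates or Nagios bait, and optionally mail the log somewhere, as per the list above.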

So, nothing surprising there. Using such wrappers helps keep cron jobs tamed and reliable, and it monitors them close to where the action occurs, rather than mediating via SMTP.

This is hardly invention either; there’s plenty of prior art, with different nuances in behaviour to meet the needs of different environments. Perhaps I’ll merge the variants of my efforts and publish too.

What’s curious is that this functionality isn’t available inside the cron daemon ((To be clear, I’m talking about the BSD cron written by Paul Vixie. None of the variants I’ve seen address these concerns either. I’d love to know if there are any I’ve missed.)) itself. It is perfectly placed to catch exit status, divert output and know if a job has overrun, and doing it there would remove the need for all this additional monkeying to make jobs reliable and well behaved. If my C weren’t just read-only I’d have a crack at it!

There, I’ve finally condensed all my cron rant into one sustained piece.

Update: I posted a cron wrapper at https://github.com/zomo/cronwrap.