A Solution to Noisy Bots

It’s pretty common across the software industry for teams that use chat tools like Slack or HipChat to configure various automated systems to post updates into chat rooms. People want to know how their build is doing, whether that deploy is done, who’s primary on call today, what alerts are going off… and that can be a lot of text. Over time, the tendency is to add detail and expand the length and number of messages a given bot pushes to your chat room.

But different teams will have different preferences about how much detail they want to see. One team might want a separate chat room with a message about each of the seven steps of the build process for each of their projects, while another wants messages in their home chat room only if a build fails. Until recently, I thought this was just one of those things you had to argue about and work through as an engineering organization.

Then, the other day, I realized that there already exists a solution to configuring the amount and detail of status-related output from software based on different environments. We commonly call these logging levels. If your build process passes a step without failure, that might be INFO level. If the step about style linting fails, maybe that’s only WARNING. If the tests are failing, though, that’s clearly ERROR. In this way, different teams can configure different systems to be more or less chatty to their taste and you don’t have to live with a compromise no one likes.
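To sketch the idea (everything here is hypothetical, not taken from any real tooling), the filtering decision is the same comparison any logger makes:

def LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR"]

// Only post a bot message if its level clears the channel's configured threshold.
def shouldPost = { configured, message ->
  LEVELS.indexOf(message) >= LEVELS.indexOf(configured)
}

assert !shouldPost("ERROR", "INFO")  // A quiet team skips build-step chatter.
assert shouldPost("INFO", "ERROR")   // Failing tests always get through.

The continuous deployment example below is just this comparison, wired into our deploy tooling.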

An Example: Continuous Deployment

We use Jenkins at Under Armour for continuous integration, as has been mentioned on this blog previously. We’ve extended Jenkins with some scripts that enable a project to opt into continuous deployment by calling a few functions in the Jenkinsfile. For example, a setup where any merge to master with a successful build gets pushed to production might look like this:

if (env.BRANCH_NAME == "master" && currentBuild.result in ["SUCCESS", null]) {
  uacf_deploy.register_instances([service: "service-name", instance: "prod"])
  uacf_deploy {
    slack_channel = "team-channel"
  }
}

The specifics of how this works and why aren’t super important, but what’s going on here is that we’re identifying which machines the deploy should target and then triggering the deploy with the name of a Slack channel.

In collaboration with one of our infrastructure engineers (the team that maintains our CI infrastructure), I recently added a new argument to the uacf_deploy call: log_level.
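In the Jenkinsfile, the opt-in from the earlier example might then look like this (a sketch; the "ERROR" setting is just one choice a team that only wants to hear about problems might make):

if (env.BRANCH_NAME == "master" && currentBuild.result in ["SUCCESS", null]) {
  uacf_deploy.register_instances([service: "service-name", instance: "prod"])
  uacf_deploy {
    slack_channel = "team-channel"
    log_level = "ERROR" // Only post to Slack when something goes wrong.
  }
}

In order to make that work, I created a new abstraction in our Jenkins Groovy scripts for talking to Slack. It's not super complex, so here it is: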

package com.ua

import groovy.json.JsonOutput
import groovy.json.StringEscapeUtils

class SlackChannel implements Serializable {
  String channel_name
  Integer log_level
  def context // Need this to be able to reference steps like node, sh and env.

  // Ordering matters here for comparison in can_log.
  private LOG_LEVELS = ["DEBUG", "INFO", "WARN", "ERROR"]

  SlackChannel(context, channel_name, log_level) {
    assert LOG_LEVELS.contains(log_level)
    this.context = context
    this.channel_name = channel_name
    this.log_level = LOG_LEVELS.indexOf(log_level)
  }

  def debug(message, color="#6299e5") {
    log(":bug: ${message}", color, "DEBUG")
  }

  def info(message, color="good") {
    log(":information_source: ${message}", color, "INFO")
  }

  def warn(message, color="warning") {
    log(":warning: ${message}", color, "WARN")
  }

  def error(message, color="danger") {
    log(":x: ${message}", color, "ERROR")
  }

  def log(message, color, level) {
    if (can_log(level)) {
      context.node('linux'){
        context.sh "curl -s -d \"payload=${payload(message, color)}\" \"${context.env.SLACK_WEBHOOK_URL}\""
      }
    }
  }

  def can_log(level) {
    this.log_level <= LOG_LEVELS.indexOf(level) && this.channel_name
  }

  def payload(message, color) {
    StringEscapeUtils.escapeJava(JsonOutput.toJson([
      channel: "#${this.channel_name}",
      username: "Jenkins (${context.env.JENKINS_SLUG})",
      icon_emoji: ":jenkins:",
      attachments: [[
        text: message,
        color: color
      ]]
    ]))
  }
}

You can see that it implements a pretty common logger interface: you instantiate it with a log level, and then each logging method writes its output only if its level clears that configured threshold. I also played around with emoji and colors to make it look nice and to make it more obvious at a glance what kind of output any given line is. The emoji are there to help out folks for whom differentiating based solely on color might be hard.
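For instance (a hypothetical snippet, with a made-up channel name and messages), a pipeline script that configures the channel at WARN would post only the last two of these:

def slack = new SlackChannel(this, "team-channel", "WARN")

slack.debug("Fetched the git repository.")  // Suppressed: DEBUG is below WARN.
slack.info("Build step 3 of 7 passed.")     // Suppressed: INFO is below WARN.
slack.warn("Style linting failed.")         // Posted to #team-channel.
slack.error("Tests are failing!")           // Posted to #team-channel.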

In our continuous deployment script, usage looks like this (edited liberally for simplicity’s sake):

import com.ua.SlackChannel

def uacf_deploy(slack_channel, log_level) {
  def slack = new SlackChannel(this, slack_channel, log_level)
  try {
    if (!(currentBuild.result in ["SUCCESS", null])){
      slack.error("[<${BUILD_URL}|${JOB_NAME}>] Deploy skipped for ${env.BRANCH_NAME} (${env.SHORT_GIT_COMMIT}) because build result was: `${currentBuild.result}`")
    } else if (UACF_Deploy_Globals.Instances.size() == 0){
      slack.warn("[<${BUILD_URL}|${UACF_TAG}>] Deploy skipped for ${env.BRANCH_NAME} (${env.SHORT_GIT_COMMIT}) because there were no targeted instances found.")
    } else {
      slack.info("[<${BUILD_URL}|${UACF_TAG}>] Deploying ${env.BRANCH_NAME} (${env.SHORT_GIT_COMMIT})...")
      // Deploy here.
      slack.info("[<${BUILD_URL}|${UACF_TAG}>] Deploy finished for ${env.BRANCH_NAME} (${env.SHORT_GIT_COMMIT}) :tada:")
    }
  } catch (err) {
    slack.error("[<${BUILD_URL}|${UACF_TAG}>] Deploy of ${env.BRANCH_NAME} (${env.SHORT_GIT_COMMIT}) had an error: ${err}")
  }
}

There are some references in there to things I haven’t addressed, but you can see that it handles a few different situations and logs each one at an appropriate level. This lets one team set all their projects to log level INFO and get messages when deploys start and finish, as well as for null deploys, broken deploys, and broken builds, while another team can set all their projects to log level ERROR and only hear about problems.

It also lets a team treat different projects differently. Maybe a new project is still getting its CI and infrastructure ironed out, so you want to see all the details about its deploys until you’re confident it’s stable. Meanwhile, that rickety old codebase you inherited, which no one understands super well yet, might stay on INFO permanently just because it’s so problematic.

The specific implementation above isn’t super important, though feel free to take and adapt it if it’s useful (provided AS IS, no guarantees, etc.). The important thing is to think of the output from your bots to your chat rooms as logging and to let different teams set levels on a granular basis. This is still a new idea at UA, and we haven’t converted most of our other tooling to support this kind of thinking, but so far it’s showing promise.
