net-snmp API and connection error handling

net-snmp has a strange API that does not seem to allow us to detect errors while trying to connect to the master snmpd instance.

To reproduce this with python-netsnmpagent, create a copy of run_simple_agent.sh named test.sh and change the agentXsocket setting in it as follows:

agentXsocket tcp:localhost:9000

or similar. Deliberately leave the python simple_agent.py line unchanged, so that the agent still tries to connect to the original socket path.

Running test.sh will give you an output such as:

* Starting the simple example agent...
Warning: Failed to connect to the agentx master agent (/tmp/simple_agent.deBvcPS2kl/snmpd-agentx.sock): 
simple_agent.py: Registered SNMP objects in Context "": 
[...]
simple_agent.py: Serving SNMP requests, press ^C to terminate

So the only indication that our simple_agent.py could not in fact connect to the master snmpd's AgentX socket is the Warning: line, which we only see because python-netsnmpagent currently enables logging to stderr (the comment in the code says stdout, which is wrong):

# FIXME: log errors to stdout for now
libnsa.snmp_enable_stderrlog()

Actual connection establishment is triggered within python-netsnmpagent’s start() method. You might be tempted to believe that I simply forgot some error handling here:

def start(self):
    """ Starts the agent. Among other things, this means connecting
        to the master agent, if configured that way. """
    self._started = True
    libnsa.init_snmp(self.AgentName)

But look at net-snmp itself. From include/net-snmp/library/snmp_api.h:

NETSNMP_IMPORT
void            init_snmp(const char *);

So init_snmp returns void: it has no way to report error conditions to its caller. Why did they design their API like this?
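Seen from the Python side, this means a ctypes binding has nothing to inspect after the call. The following is a minimal sketch, not python-netsnmpagent's actual code: fake_init_snmp is a hypothetical stand-in with the same void (const char *) signature, built with ctypes.CFUNCTYPE so that it runs without net-snmp installed, and its body merely pretends that the connection failed.

```python
import ctypes

# Same C signature as init_snmp: void f(const char *). A restype of None
# tells ctypes that the function returns nothing.
INIT_SNMP_PROTO = ctypes.CFUNCTYPE(None, ctypes.c_char_p)

logged = []

def _fake_init_snmp_impl(agent_name):
    # Hypothetical body: pretend the connection attempt failed. A void
    # function can only log the problem, it cannot report it to its caller.
    logged.append("Warning: Failed to connect (%s)" % agent_name.decode())

# Hypothetical stand-in for libnsa.init_snmp
fake_init_snmp = INIT_SNMP_PROTO(_fake_init_snmp_impl)

result = fake_init_snmp(b"simple_agent")
print(result)   # None, whether or not the connection succeeded
print(logged)   # the log is the only trace of the failure
```

Whatever happens inside the call, the Python caller gets None back, so a check like "if result != 0" would have nothing to work with.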

Analyzing further, if you look at the implementation in snmplib/snmp_api.c (line 808 for current net-snmp 5.7.2), you will see various function calls for which no error handling whatsoever can be found. And mind you, this is C, so we have no exception system.

In our case, all we got was the error message logged to stderr. Grepping the net-snmp sources for it will lead you to agent/mibgroup/subagent.c line 856 (for net-snmp 5.7.2). This is from the subagent_open_master_session function:

agentx_socket = netsnmp_ds_get_string(NETSNMP_DS_APPLICATION_ID,
                                      NETSNMP_DS_AGENT_X_SOCKET);
t = netsnmp_transport_open_client("agentx", agentx_socket);
if (t == NULL) {
    /*
     * Diagnose snmp_open errors with the input
     * netsnmp_session pointer.  
     */
    if (!netsnmp_ds_get_boolean(NETSNMP_DS_APPLICATION_ID,
                                NETSNMP_DS_AGENT_NO_CONNECTION_WARNINGS)) {
        char buf[1024];
        snprintf(buf, sizeof(buf), "Warning: "
                 "Failed to connect to the agentx master agent (%s)",
                 agentx_socket ? agentx_socket : "[NIL]");
        if (!netsnmp_ds_get_boolean(NETSNMP_DS_APPLICATION_ID,
                                    NETSNMP_DS_AGENT_NO_ROOT_ACCESS)) {
            netsnmp_sess_log_error(LOG_WARNING, buf, &sess);
        } else {
            snmp_sess_perror(buf, &sess);
        }
    }
    return -1;
}

So whatever we originally passed in as mastersocket ends up here as agentx_socket. If t == NULL, the connect failed (i.e. invalid mastersocket or snmpd not running). Then, unless the NETSNMP_DS_AGENT_NO_CONNECTION_WARNINGS flag was set, we generate the error message and use either netsnmp_sess_log_error or snmp_sess_perror to make it visible. And, importantly, we return -1. So at this level the connection failure is detected.

However, looking further at who calls subagent_open_master_session, we’ll eventually end up here (line 96):

int
subagent_startup(int majorID, int minorID,
                 void *serverarg, void *clientarg)
{
    DEBUGMSGTL(("agentx/subagent", "connecting to master...\n"));
    /*
     * if a valid ping interval has been defined, call agentx_reopen_session
     * to try to connect to master or setup a ping alarm if it couldn't
     * succeed. if no ping interval was set up, just try to connect once.
     */
    if (netsnmp_ds_get_int(NETSNMP_DS_APPLICATION_ID,
                           NETSNMP_DS_AGENT_AGENTX_PING_INTERVAL) > 0)
        agentx_reopen_session(0, NULL);
    else {
        subagent_open_master_session();
    }
    return 0;
}

Depending on whether an AgentX ping interval was configured or not, it will either let agentx_reopen_session retry forever or just call subagent_open_master_session itself once. But as you can see: no checking of return codes, no further error handling.
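To make the effect concrete, here is a small Python model of that control flow. It is only a sketch: the function names mirror the C functions above, but the bodies are simplified assumptions (in particular, reopen_session only hints at the retry alarm), and master_reachable is a hypothetical parameter that stands in for the real connection attempt.

```python
def open_master_session(master_reachable):
    """Model of subagent_open_master_session: 0 on success, -1 on failure."""
    return 0 if master_reachable else -1

def reopen_session(master_reachable):
    """Simplified model of agentx_reopen_session: try to connect and, on
    failure, a retry would be scheduled. Nothing is reported upwards."""
    if open_master_session(master_reachable) != 0:
        pass  # a ping alarm for a later retry would be set up here

def startup(ping_interval, master_reachable):
    """Model of subagent_startup: both branches drop the result."""
    if ping_interval > 0:
        reopen_session(master_reachable)
    else:
        open_master_session(master_reachable)  # a -1 is discarded here
    return 0

# Every combination of ping interval and reachability yields the same result.
results = [startup(ping, reachable)
           for ping in (0, 15)
           for reachable in (True, False)]
print(results)  # [0, 0, 0, 0]
```

From the caller's point of view, a reachable and an unreachable master are indistinguishable.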

What’s the context of subagent_startup itself? subagent_init, which itself does return code checking, registers it as a callback function in line 158 so that it is executed after the SNMP configs have been read:

snmp_register_callback(SNMP_CALLBACK_LIBRARY,
                       SNMP_CALLBACK_POST_READ_CONFIG,
                       subagent_startup, NULL);

Of course, if subagent_startup returned an error code, who would be the one to take action on it? Its direct caller is merely generic callback code. Yet the question remains why the authors had to defer calling subagent_startup through the callback system at all, i.e. why not trigger config reading and then call it directly?
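The problem with routing this through generic callbacks can be sketched in a few lines of Python. This is a model under stated assumptions, not net-snmp's implementation: register_callback, call_callbacks and init are hypothetical names, the dispatcher is assumed to ignore callback return values, and startup is a hypothetical variant that does report the failure.

```python
callbacks = {}

def register_callback(event, func):
    """Hypothetical registry: remember func for the given event."""
    callbacks.setdefault(event, []).append(func)

def call_callbacks(event):
    """Generic dispatcher: it knows nothing about what the callbacks do,
    so it has no sensible way to interpret their return values."""
    for func in callbacks.get(event, []):
        func()  # return value dropped (assumption of this model)

def startup():
    """Hypothetical variant of subagent_startup that reports a failure."""
    return -1

def init(agent_name):
    """Model of a void init function: read configs, then fire the event."""
    call_callbacks("post_read_config")
    # no return statement: like a void C function, nothing reaches the caller

register_callback("post_read_config", startup)
print(init("simple_agent"))  # None, although startup() returned -1
```

Even with a well-behaved startup function, the error code gets lost one level up, which is why fixing subagent_startup alone would not be enough.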

In either case, the way it has been implemented so far, it seems to be impossible for subagents to detect connection failures :(
