Incomplete config backup outputs with Nornir

Hello @dbarrosop, @ktbyers

I am running Nornir to back up device configs for approximately 10,000 devices. While the majority of the devices get backed up perfectly fine, there are still a lot of devices that return a partial running config, or whose saved .txt file is empty. These devices are accessible, because when I log in to them manually via PuTTY and execute show run, I don’t see any delays or latency in producing the output.

What could I do to make this more reliable, so that instead of producing partial outputs of show run, Nornir produces complete show run outputs?

While I think this has little to do with Nornir itself and more with the underlying NAPALM or Netmiko, I wanted the help of experts to overcome the situation I am running into, because partial backups are useless.

Thanks
Gurpreet Kochar

Hi @kochargs

I had the same problem (and a few more) when I wrote some proof-of-concept code last year. To make it reliable I ended up with a try/except block within a loop, popping results and retrying when there was a problem. My code relied on Netmiko.
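A minimal sketch of that retry pattern, assuming Nornir 2.x and its bundled netmiko_send_command task (the function name, command, and retry count here are illustrative, not the actual script):

    from nornir.core.exceptions import NornirSubTaskError
    from nornir.plugins.tasks.networking import netmiko_send_command

    def get_config(task, retries=3):
        for attempt in range(retries):
            try:
                task.run(
                    task=netmiko_send_command,
                    command_string="show running-config",
                )
                break  # success, stop retrying
            except NornirSubTaskError:
                # Pop the failed subtask result so it doesn't pollute the output
                task.results.pop()
                if attempt == retries - 1:
                    raise  # out of retries, let the failure surface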

Good news: I just tried that code twice with the latest versions of Nornir and Netmiko - perfect, the except blocks are no longer triggered and the backups are complete. Are you running the latest versions?

Anthony

Hello Anthony,

Yes, I am using the latest versions of Netmiko, NAPALM, and Nornir. I am also using try/except, but the actual issue is not exceptions; it’s that the code does log into the devices but produces either no output at all, or only part of the output.

Could you share your code so that I can run it against my set of devices?

I’ve put it up on GitHub.

https://raw.githubusercontent.com/no-such-anthony/nornir-play/master/gather-configs.py

I was talking to @ktbyers on a separate forum and he feels this is not a code issue, but rather an AAA issue that causes significant delays, leading to this behavior. I am willing to keep trying all options, but I just can’t try without AAA :slight_smile:

Interesting. I also recently upgraded our TACACS from ISE 2.4 to ISE 2.6 Patch 3. Maybe that helped.

If I run a job for some devices, I will randomly get “search pattern never detected” for some of the hosts. These are all Cisco routers. The best guess is that the device has a delay in producing the output to the SSH channel, but if I log in manually and run the same command, it immediately prints the output on the screen.

More details on this issue here:

The most likely cause of this (in this specific case) is that AAA is responding with significant slowness under concurrent load. Possible options to fix this (see the sketch after the list):

  1. Reduce the number of workers
  2. Set global_delay_factor to 2 or 4 (or potentially larger to allow for more variability in delay).
  3. See if there is a way on AAA to improve the concurrent performance.
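For the first two options, a minimal sketch assuming Nornir 2.x (where the worker count lives under core.num_workers) and Netmiko’s global_delay_factor passed through connection_options extras; the config file name and the values are illustrative:

    from nornir import InitNornir
    from nornir.core.inventory import ConnectionOptions

    nr = InitNornir(
        config_file="config.yaml",   # placeholder config file
        core={"num_workers": 20},    # option 1: fewer concurrent workers
    )

    # Option 2: slow Netmiko down for every host; "extras" is passed
    # straight through to Netmiko's ConnectHandler.
    for host in nr.inventory.hosts.values():
        host.connection_options["netmiko"] = ConnectionOptions(
            extras={"global_delay_factor": 4}
        )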

You could also do more debugging on Netmiko using Netmiko logs to try to get more data. You would need to do this in a way where you receive a separate log file or Netmiko session_log per Nornir thread. In the case of the Netmiko session_log, you might need to add timestamps in some way. This might give us more data on the issue and what is happening.
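One way to get a separate Netmiko session_log per host, assuming Nornir 2.x and Netmiko’s standard session_log argument (file naming is illustrative):

    import logging

    from nornir import InitNornir
    from nornir.core.inventory import ConnectionOptions

    # Netmiko's debug logging goes through the "netmiko" logger.
    logging.basicConfig(filename="netmiko_debug.log", level=logging.DEBUG)

    nr = InitNornir(config_file="config.yaml")  # placeholder config

    # Give every host its own session_log file so concurrent threads
    # don't interleave their output in a single file.
    for name, host in nr.inventory.hosts.items():
        host.connection_options["netmiko"] = ConnectionOptions(
            extras={"session_log": f"session_{name}.txt"}
        )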

Looking at the code I provided, you could possibly add the following under “except NornirSubTaskError as e:”

    if isinstance(e.result.exception, IOError):
      if 'search pattern' in e.result[0].result:
        # Search-pattern timeout: log it and retry, unless we are out of retries
        print(f'{host} hit error {e.result.exception}')
        if retries == 2:
          raise e
      else:
        # Some other IOError: re-raise immediately
        raise e

I do like the idea of increasing global_delay_factor; I don’t think I tried that when I encountered the problems.

Did some further testing. As mentioned, the environment has the latest Netmiko & Nornir and a clean install/upgrade of the AAA server. The inventory has 1374 hosts.

20 threads 26min 25sec with 0 exceptions
30 threads 25min 13sec with 0 exceptions
100 threads 27min 49sec with 61 exception retries (plus one new exception I hadn’t coded in)

Modified code to close connection within the task.
100 threads 9min 2sec with only 2 exception retries (on two firewalls with long configs)
200 threads 9min 20sec with 0 exceptions

I am fine with closing the connections within the task if I get better performance. Have you tried that @kochargs?


Not yet, got caught up with BAU work.

Did you also update the code on GitHub? I will use that same code, give it a try later tonight, and send you an update. I am kinda new to Python and coding, so I take a little longer to work through it :smiley:

No hurry. BAU work always gets in the way of programming. I updated my code to close connections within the task and to handle a couple more exceptions.

You could try adding task.host.close_connections() to your existing code, at the end of the task function. With that and a num_workers of 50, I think your script would be fine. That’s what I’ve seen in my experimentation anyway; others may well get different results.
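A minimal sketch of what that looks like, assuming Nornir 2.x and its bundled netmiko_send_command task (the task name and the elided write-to-disk step are illustrative):

    from nornir.plugins.tasks.networking import netmiko_send_command

    def backup_config(task):
        result = task.run(
            task=netmiko_send_command,
            command_string="show running-config",
        )
        # ... write result[0].result to disk here ...
        # Drop the SSH session before this worker moves on to the next
        # host, so idle connections don't pile up.
        task.host.close_connections()
        return result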

When I don’t close the connections, after a while the time to get the running-config of an individual device appears to increase to up to 3 minutes. I have no idea why. Most of the exceptions are ‘Failed to enter enable mode. Please ensure you pass the ‘secret’ argument to ConnectHandler’.

Most recent results:
60 threads without closing connection in task - 24m 38s (48 exceptions)
60 threads with closing connection in task - 7m 47s (0 exceptions)

For those curious about the eventual slowdown that I see, looking at the accounting logs:

When I don’t close the connection in the task - example of an eventual slow connection
9s between passing authentication & terminal length 0
16s between terminal length 0 & terminal width 511
14s between terminal width 511 & enable
34s between enable & the password being entered at the > prompt (enable times out after 30s)
59s between the enable password being entered at the > prompt & retrying enable (due to the exception retry in the code)
37s between second enable attempt and show running-config
Summary: 2m 49s from authentication to show running-config

When I do close the connection in the task - example of a normal connection (same device as above)
4s between passing authentication & terminal length 0
1s between terminal length 0 & terminal width 511
2s between terminal width 511 & enable
3s between enable & show running-config
1s between show running-config & exit
Summary: 11s to complete

Hi Anthony,

I tried your code and I am almost always getting these error messages now:

test-134-ce01 hit error Search pattern never detected in send_command_expect
: test-134-ce01#
test-135-ce01 hit error Search pattern never detected in send_command_expect
:test-135-ce01#
test-020-vpn1 hit error Search pattern never detected in send_command_expect
: test-020-vpn1#
test-129-ce01 hit error Search pattern never detected in send_command_expect
: test-129-ce01#

It doesn’t matter what value of global_delay_factor I use… it’s always the same… but if I log in to these devices myself, there is no issue at all… again, something that probably has to do with AAA.

What is your num_workers? Also, I had forgotten to break out of the while loop, so it was getting the config more than once. Just add the break in the code after the length check.

Sure… I tried num_workers with 50 and with 20.
The script eventually errored out because the path for the diff was not a valid path, so it never wrote anything to disk… I didn’t fix that yet because I was still getting “search pattern never detected” with multiple combinations of num_workers and global_delay_factor… even a factor of 8 doesn’t make any difference to the error message.

When you get a chance, just to rule it out, could you add the break in and try again:

...
  if len(data.result) < 100:
    raise ValueError('Config unexpectedly short')

  # Success: leave the retry loop so the config isn't fetched again
  break

except NornirSubTaskError as e:
...

And modify the Search pattern check to

...
if isinstance(e.result.exception, OSError):
  if 'Search pattern' in e.result[0].result:
    print(f'{host} hit error {e.result.exception}')
    temp_net_connect = task.host.get_connection("netmiko", task.nornir.config)
    print(f'{host} prompt = {temp_net_connect.find_prompt()}')
...

I started the week experimenting with process pools instead of thread pools. The Netmiko script had fast_cli=True for the login and fast_cli=False after the login. It also had an insane 300 processes. Time to complete: 1m 9s with no exceptions, gathering the running-config from the same inventory of 1374 devices; 1m 32s without the fast_cli login. Definitely no AAA issues here.
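A minimal sketch of the process-pool plus fast_cli-at-login idea, assuming plain Netmiko; the device dicts are placeholders, and toggling fast_cli off after login is my assumption about the mechanics, not the exact script:

    from multiprocessing import Pool
    from netmiko import ConnectHandler

    def fetch_config(device):
        conn = ConnectHandler(**device, fast_cli=True)  # fast login handshake
        conn.fast_cli = False  # back to conservative timing for the long output
        output = conn.send_command("show running-config")
        conn.disconnect()
        return output

    if __name__ == "__main__":
        devices = [
            {"device_type": "cisco_ios", "host": "10.0.0.1",
             "username": "backup", "password": "secret"},
            # ... the rest of the inventory ...
        ]
        with Pool(processes=300) as pool:
            configs = pool.map(fetch_config, devices)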

Going back to a more sensible number of 60 processes, the script ran in 3m 8s, compared to the 7m using Nornir and 60 threads. I also wrote a third script, using threading and Netmiko but without fast_cli; it also ran in around 7m with 60 threads.

Continuing this learning adventure, I ran across gevent monkey patching and decided to try that on the Nornir script and the threading Netmiko script. The time to complete went from 7m down to around 4m 30s for both scripts with 60 threads. There seems to be no stopping the monkey.

Nornir, no exceptions for:
120 threads 3m 5s
200 threads 2m 58s
300 threads 2m 35s
500 threads 2m 47s

Seeing the success there, I decided to try 100 threads and go back to leaving the connections open.
100 threads 3m 30s, no exceptions. What the… that used to take 25m with 50 exceptions.

By simply entering - from gevent import monkey; monkey.patch_all() - as the very first line of code, boom, no more issues with decreasing performance/reliability as the number of open SSH connections grew. I have no idea what it is doing, but I like it.
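For anyone wanting to try it, the patch has to run before anything else imports socket, ssl, or threading, hence the very-first-line placement; everything after the first line below is a placeholder:

    # Must be the very first line, before any other imports.
    from gevent import monkey; monkey.patch_all()

    from nornir import InitNornir

    nr = InitNornir(config_file="config.yaml")  # placeholder config
    # ... the rest of the backup script is unchanged ...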

To keep me off the streets, and hopefully help others understand my ramblings, I did more work on this.

The environment
Python 3.6.9 running on Red Hat 6
Netmiko “show running-config”
either 100 threads or 100 processes
1369 devices (mix of ios, ios-xe, asa, nexus, asr)
Code will retry a host on particular known exceptions
Four scripts (gevent, multiprocessing, concurrent.futures, Nornir)

multiprocessing Pool, fast_cli - 1m 12s
multiprocessing Pool - 2m 13s

gevent Pool, gevent monkey patch, fast_cli login - 2m 40s
gevent Pool, gevent monkey patch - 3m 16s

concurrent.futures ThreadPoolExecutor, gevent monkey patch, fast_cli login - 2m 43s
concurrent.futures ThreadPoolExecutor, gevent monkey patch - 3m 23s

Nornir, gevent monkey patch - 3m 19s

All good up to here: no exceptions, full config, fast, and no need to close the connection within the task or reduce the number of threads. Now let’s remove the gevent monkey patch and watch things slow down.

  • Connections closed within task

concurrent.futures ThreadPoolExecutor - 6m 45s - 6 exception retries (from ASAs)
Nornir - 6m 50s - 9 exception retries (from ASAs)

  • Connections not closed within task, so we can reuse in later tasks

concurrent.futures ThreadPoolExecutor - 26m - 176 exception retries
Nornir - 26m - 165 exception retries

Just for the fun of it, what was the fastest I could get the config from 1369 devices? 39s with 300 processes - multiprocessing Pool, fast_cli login.

I am hoping someone smarter than me can work out the best way to deal with the thread I/O blocking that appears to be happening with SSH. Until then I will just use the monkey patch on any Nornir scripts.

(Edited to fix some times)