In machine learning, especially in deep learning, long-running processes are quite common. Just yesterday, I finished running an optimisation process that ran for the best part of four days – and that’s on a 4-core machine with an Nvidia GRID K2, letting me crunch my data on 3,072 GPU cores! Of course, I did not want to babysit the whole process. Least of all did I want to have to do so from my laptop. There’s a reason we have tools like Sentry, which can be easily adapted from webapp monitoring to letting you know how your model is doing.
One solution is to spin up another virtual machine, `ssh` into that machine, then from that `ssh` into the machine running the code, so that if you drop the connection to the first machine, it will not drop the connection to the second. There is also `nohup`, which makes sure that the process is not killed when you ‘hang up’ the `ssh` connection. You will, however, not be able to get back into the process again. There are also reparenting tools like `reptyr`, but the need they meet is somewhat different. Enter terminal multiplexers.
Terminal multiplexers are old. They date from the era of things like time-sharing systems and other antiquities whose purpose was to allow a large number of users to get their time on a mainframe designed to serve hundreds, even thousands of users. With the advent of personal computers that had decent computational power on their own, terminal multiplexers remained the preserve of universities and other weirdos still using mainframe architectures. Fortunately for us, two great terminal multiplexers, GNU Screen and tmux, are still being actively developed, and are almost certainly available for your *nix of choice. This gives us a convenient tool to sneak a peek at what’s going on with our long-suffering process. Here’s how.
`ssh` into your remote machine, and launch `screen`. You may need to do this with `sudo` if you encounter the error where `screen`, instead of starting up a new shell, returns `[screen is terminating]` and quits. If `screen` has started up correctly, you should see a slightly different shell prompt (and if you started it with `sudo`, you will now be logged in as root).
In some scenarios, you may want to ‘name’ your `screen` session. Typically, this is the case when you want to share your screen with another user, e.g. for pair programming. To create a named screen, invoke `screen` using the session name parameter `-S`, as in e.g. `screen -S my_shared_screen`.
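As a rough sketch of the above, the function below wraps the named-session launch in two sanity checks. The name `start_named_screen` is made up for illustration; `$STY` is the variable `screen` sets inside a running session, which lets the sketch refuse to start one session inside another.

```shell
# start_named_screen NAME (hypothetical helper, not part of screen itself)
# Start a new screen session called NAME.
start_named_screen() {
    # screen sets STY inside a session; bail out rather than nest sessions.
    if [ -n "${STY:-}" ]; then
        echo "already inside screen session $STY" >&2
        return 1
    fi
    if ! command -v screen >/dev/null 2>&1; then
        echo "screen is not installed" >&2
        return 127
    fi
    screen -S "$1"
}

# Example, using the session name from the text:
#   start_named_screen my_shared_screen
```

The checks are optional; plain `screen -S my_shared_screen` does the same job when you know you are in an ordinary shell.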
If you are using `virtualenv` (as you ought to!), activate the environment now using `source /<virtualenv folder>/<virtualenv name>/bin/activate`, replacing `<virtualenv folder>` by the name of the folder where your `virtualenv`s live (for me, that’s the `environments` folder; often enough it’s something like […]) and `<virtualenv name>` by the name of your virtualenv (in my case, `research`). You have to activate your virtualenv even if you have already done so outside of `screen` (starting `screen` means you’re in an entirely new shell, with all environment configurations, settings, aliases &c. gone)!
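Here is a minimal sketch of the activation step, assuming the standard virtualenv layout of `<folder>/<name>/bin/activate`. The helper name `activate_env` and the `$HOME/environments` location in the usage comment are assumptions for illustration, so substitute your own paths.

```shell
# activate_env ENV_ROOT ENV_NAME (hypothetical helper)
# Source the activate script of the virtualenv ENV_NAME stored under ENV_ROOT.
activate_env() {
    activate_script="$1/$2/bin/activate"
    if [ ! -f "$activate_script" ]; then
        echo "no virtualenv found at $1/$2" >&2
        return 1
    fi
    # "." is the portable spelling of "source"
    . "$activate_script"
}

# Example (the folder location is an assumption):
#   activate_env "$HOME/environments" research
#   echo "$VIRTUAL_ENV"    # set by virtualenv's activate script
```

Because the script is sourced, not executed, the environment changes apply to the shell running inside `screen`.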
With your `virtualenv` activated, launch your script as normal; there is no need to launch it in the background. Indeed, one of the big advantages of `screen` is the ability to see verbose mode progress indicators. If your script does not have a progress logger to `stdout` but logs to a logfile, you can start it using `nohup`, then put it into the background (`bg`) and track progress using `tail -f logfile.log` (where `logfile.log` is, of course, to be substituted by the filename of the logfile).
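The logfile variant might look something like the sketch below. It starts the job in the background directly with `&` instead of suspending it and calling `bg`, and it redirects both output streams to the logfile. `run_logged` and `train.py` are placeholder names, not anything from a real project.

```shell
# run_logged LOGFILE COMMAND [ARGS...] (hypothetical helper)
# Start COMMAND under nohup in the background, send stdout and stderr
# to LOGFILE, and print the PID of the background job.
run_logged() {
    logfile="$1"
    shift
    nohup "$@" >"$logfile" 2>&1 &
    echo "$!"
}

# Example (train.py is a placeholder script name):
#   run_logged logfile.log python train.py
#   tail -f logfile.log    # Ctrl-C stops tail, not the job
```

If your script already writes its own logfile, point `tail -f` at that file instead and treat the redirected output as a catch-all for stray messages.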
Press `Ctrl-A`, then `Ctrl-D`, to detach from the current screen. This will take you back to your original shell after noting the address of the screen you’re detaching from. These always follow the format `<identifier>.<session id>.<hostname>`, where `<hostname>` is, of course, the hostname of the computer from which the `screen` session was started, `<session id>` stands for the name you gave your screen, if any, and `<identifier>` is an autogenerated numeric socket identifier, typically 4-6 digits long (it is the process ID of the `screen` session). In general, as long as you are on the same machine, the socket identifier or the session name will be sufficient; the full canonical name is only necessary when trying to access a screen on another host.
To see a list of all screens running under your current username, enter `screen -list`. Refer to that listing or the address echoed when you detached from the `screen` to reattach to the process using `screen -r <socket identifier>[.<session identifier>.<hostname>]`. This will return you to the script, which keeps executing in the background.
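If you have several sessions, a small filter can pull the addresses out of the listing. This is a sketch that assumes the usual listing layout, in which each session sits on an indented line beginning with its numeric identifier; `session_ids` is a made-up helper name.

```shell
# session_ids (hypothetical helper)
# Read `screen -list` output on stdin and print one session address per line.
# Session lines are indented and begin with the numeric identifier and a dot.
session_ids() {
    awk '/^[ \t]+[0-9]+\./ { print $1 }'
}

# Examples:
#   screen -list | session_ids
#   screen -r "$(screen -list | session_ids | head -n 1)"   # first session listed
```

The header and the closing "Socket" line of the listing are not indented, so the filter skips them.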
A common problem is `screen` immediately closing, with the message `[screen is terminating]`, upon invoking `screen` as a non-privileged user.
There are generally two ways to resolve this issue.
- Use a privileged user account and always invoke `screen` with `sudo`.
- As a privileged user, change the permissions of the `screen` binary using `sudo chmod 2775 $(which screen)`. The leading `2` sets the setgid bit, so `screen` runs with the privileges of the group that owns the binary, which means that repeated `sudo`ing will not be necessary.
The overall effect of both solutions is the same. Notably, both may be undesirable from a security perspective. As always, weigh risks against utility.
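Before changing anything, it may be worth checking how the binary is currently set up. The leading `2` in mode `2775` is the setgid bit, which the shell can test with `-g`. The helper name `has_setgid` is hypothetical.

```shell
# has_setgid FILE (hypothetical helper)
# Succeed if FILE has the setgid bit set.
has_setgid() {
    [ -g "$1" ]
}

# Examples:
#   ls -l "$(command -v screen)"    # an "s" in the group permissions means setgid
#   has_setgid "$(command -v screen)" && echo "setgid is already set"
```

If the bit is already set and `screen` still terminates, the permissions on the binary are probably not the cause.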
Do you prefer `screen` to staying logged in? Do you have any other cool hacks for monitoring a machine learning process that takes considerable time to run? Let me know in the comments!