How to share your multi-GPU machine efficiently?
We have run into this scenario many times: sharing a multi-GPU machine gets difficult once you have five or six people developing and prototyping ideas rapidly.
If you are a small team like ours, exploring deep learning and prototyping quickly, you will often find yourselves arguing over GPU time.
We started by maintaining a Google Sheet. Its best feature was simplicity: it needed no technical maintenance at all, and it served us well for a long time. The major problem is that, by default, TensorFlow grabs all the memory on every visible device unless you configure it correctly. If someone forgets to set tf.GPUOptions, nobody else can use the machine at all, because TensorFlow has reserved everything for that one job.
The spreadsheet approach has some deeper flaws, however:
1. Unrestricted access to critical infrastructure - what if we want to stop inexperienced users from occupying the whole machine?
2. Environment management - how do you manage environments efficiently on a shared system when one model needs TensorFlow 1.12 while another needs TensorFlow 1.14?
To resolve these problems we have now moved to containers. We are not talking about Docker here; we are advocating Singularity containers. Singularity is one of the best ways to manage an HPC system.
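To make the environment-management point concrete, here is a minimal sketch of a Singularity definition file that pins one TensorFlow version per container. The file name and the choice of base image are illustrative assumptions, not part of our actual setup.

```
# tf112.def - hypothetical definition file pinning TensorFlow 1.12
Bootstrap: docker
From: tensorflow/tensorflow:1.12.0-gpu

%runscript
    exec python "$@"
```

You would build it once with `singularity build tf112.sif tf112.def` and run jobs with `singularity exec --nv tf112.sif python train.py` (the `--nv` flag exposes the host's NVIDIA GPUs to the container). A second definition file based on a 1.14 image gives the other team member their environment, and the two never conflict.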
We believe it is the best option out there. Follow us for more quick articles like this one.