Measuring and Improving Performance in the Umbra Cloud


In the grand scheme of things, Umbra is a productivity tool and our goal is to help our customers work more efficiently. We respect our customers' time and strive to not keep them waiting, as doing so would be counterproductive to our goal. Umbra would simply not be useful if it weren't so snappy!

We are all engineers, so we want to take an engineer's approach to the problem. Throwing darts and guessing what needs improving won't get us anywhere. We need to be able to measure what is taking the most time right now, and we need to be able to measure the impact of any changes we make.

So what do we need to measure? The Umbrafication process divides the problem into thousands of smaller sub-tasks and works on them all independently. This allows us to use the cloud to throw enormous amounts of computing power at the problem! It also means we have two things to measure: how time is spent inside each task, and how well the tasks are scheduled as a whole.


Firstly, we need to be able to measure what is taking the most time in each task. Umbra does this by stopping occasionally to ask, "what am I doing right now?", and keeping a tally. If we take enough samples, we can make an accurate measurement of how much time each step takes. For example, if we catch ourselves in the mesh decimation code twice as often as in the texture resampling code, we can say that we spent twice as much time on mesh decimation as we did on texture resampling.
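To make the tallying idea concrete, here is a minimal sketch in Python. It is not Umbra's actual profiler: the function name, the step names, and the sample counts are all made up for illustration, and it assumes the samples have already been collected.

```python
from collections import Counter


def estimate_time_shares(samples):
    """Convert a list of sampled step names into each step's share of all samples."""
    tally = Counter(samples)
    total = sum(tally.values())
    return {step: count / total for step, count in tally.items()}


# Hypothetical samples: each entry names the step we were in when we stopped to look.
samples = ["mesh_decimation"] * 40 + ["texture_resampling"] * 20 + ["other"] * 40

shares = estimate_time_shares(samples)
# Ratio of time spent in mesh decimation versus texture resampling.
ratio = shares["mesh_decimation"] / shares["texture_resampling"]
```

With these made-up counts, mesh decimation shows up in twice as many samples as texture resampling, so the estimated time ratio is two to one. The estimate only becomes trustworthy with enough samples.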

This is a topic worthy of its own blog post, but not this one. Instead, we are going to look at the bigger picture and consider Umbrafication performance as a whole. Even though we have all the resources of the cloud, we still have to make sure we are using them efficiently.

Picture two servers (or two CPU cores on the same server!) with a shared set of tasks to do, each taking some varying amount of time to complete. If they're both allowed to work on different tasks at the same time, they can actually complete the tasks faster by doing them in the correct order!

As a rather extreme example, say one of the tasks takes significantly longer than all the others. If they leave that task until last, whoever takes the slow task will be stuck working while the other machine has to sit idle. If we were to draw a timeline of their progress, with the slow task shaded red, it might look like this:


[Figure: timeline of the two servers with the slow task left until last]

If, instead, they tackle the hard task first and then complete the easy tasks, they will both finish their last task at around the same time and take less time overall!

[Figure: timeline of the two servers with the slow task done first]

In this second case, they have been able to better utilize the resources that are available to them.
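The two-server example can be simulated in a few lines of Python. This is only a sketch under simple assumptions: two workers pull tasks from a shared queue in a fixed order, and the task durations (six easy tasks of 1 unit and one slow task of 5 units) are invented for illustration.

```python
import heapq


def makespan(durations, workers=2):
    """Simulate workers pulling tasks from a shared queue in the given order.

    Returns the time at which the last worker finishes.
    """
    # Each heap entry is the time at which one worker next becomes free.
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # the earliest-free worker takes the next task
        heapq.heappush(free_at, start + duration)
    return max(free_at)


easy_tasks = [1.0] * 6  # hypothetical short tasks
slow_task = 5.0         # hypothetical long task

slow_last = makespan(easy_tasks + [slow_task])   # slow task left until the end
slow_first = makespan([slow_task] + easy_tasks)  # slow task tackled first
```

With these numbers, leaving the slow task until last finishes at time 8, while starting with it finishes at time 6. The total amount of work is the same; only the idle time changes.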

By measuring and visualizing a timeline of the Umbrafication process, we were able to see that we were indeed ordering our tasks suboptimally. By estimating how long each might take before launching them, we are able to sort them so the more costly tasks are done first. With this change, we were able to improve our utilization of the cloud and make Umbrafications about 5% faster overall.
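As a rough sketch of both halves of this idea, the Python below sorts tasks by an estimated cost and computes utilization from a recorded timeline. The `Task` class, the `estimated_cost` field, and the interval data are assumptions made for illustration, not a description of how Umbra's scheduler is written.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    estimated_cost: float  # hypothetical estimate, in arbitrary time units


def order_for_launch(tasks):
    """Return the tasks sorted so the most costly estimates are launched first."""
    return sorted(tasks, key=lambda t: t.estimated_cost, reverse=True)


def utilization(intervals, workers):
    """Fraction of available worker time spent busy.

    intervals: (start, end) pairs, one per completed task, across all workers.
    """
    if not intervals:
        return 0.0
    begin = min(start for start, _ in intervals)
    finish = max(end for _, end in intervals)
    if finish == begin:
        return 0.0
    busy = sum(end - start for start, end in intervals)
    return busy / (workers * (finish - begin))


tasks = [Task("easy_a", 1.0), Task("slow", 5.0), Task("easy_b", 1.0)]
launch_order = [t.name for t in order_for_launch(tasks)]

# Made-up timelines for two workers: one slow task (5 units) and six easy tasks (1 unit).
slow_last = [(0, 1), (1, 2), (2, 3), (3, 8), (0, 1), (1, 2), (2, 3)]
slow_first = [(0, 5), (5, 6), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

util_slow_last = utilization(slow_last, workers=2)
util_slow_first = utilization(slow_first, workers=2)
```

In this toy timeline, utilization rises from roughly 69% to roughly 92% just by reordering. Real estimates are imperfect, so the gain in practice depends on how well the estimated costs track the actual ones.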


When Umbra improves, all of our customers win. Not only that, but all of our future customers win too! The 5% performance improvement we made translates to hundreds of hours saved across all of our customers, and only took a couple of days of my time to implement. I would say that's worth the investment - test it out for yourself!
