Developing a GPU Streaming solution for Unreal Engine 4’s LiDAR Point Cloud plugin

Tech Analysis 23 February 2021

Some background for context

In early 2018, I started experimenting with point cloud technology – I was especially interested in different ways to efficiently render the datasets on screen. This eventually led me to create a freely available LiDAR Point Cloud plugin for Unreal Engine 4. Fast-forward to today: the plugin has been acquired by Epic Games and is now fully integrated into the main engine branch! Over that time, it has come a very long way, and I would like to share some of the highlights of that journey.

The problem

Before the introduction of GPU streaming, the largest limitation of the plugin was its reliance on storing all the data in the memory of the graphics card – the VRAM. To demonstrate why it is such a limiting factor, let us run through the process.

Since the vast majority of people prefer to visualize their cloud data as camera-facing quads – or Sprites – this is the scenario I will concentrate on.

To render a single sprite we need 4 vertices and each vertex requires:

  • Location – 12 bytes
  • Color – 4 bytes
  • Normal – 3 bytes
  • Metadata – 1 byte

Put together, this gives us a memory requirement of 4 x (12 + 4 + 3 + 1) = 80 bytes per point. In other words, rendering 100,000,000 points takes approximately 7.5 GB of VRAM just to store the cloud, on top of everything else required for the scene.
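To make the arithmetic concrete, here is a quick back-of-the-envelope sketch in C++ (the struct and its field names are illustrative only, not the plugin’s actual types):

    #include <cstdint>
    #include <cstdio>

    #pragma pack(push, 1)
    struct FSpriteVertex
    {
        float   Position[3]; // Location - 12 bytes
        uint8_t Color[4];    // Color    -  4 bytes
        uint8_t Normal[3];   // Normal   -  3 bytes
        uint8_t Meta;        // Metadata -  1 byte
    };                       // 20 bytes per vertex
    #pragma pack(pop)

    int main()
    {
        constexpr uint64_t PointCount    = 100'000'000ull;
        constexpr uint64_t BytesPerPoint = 4 * sizeof(FSpriteVertex); // 4 vertices = 80 bytes
        const double Gigabytes = double(PointCount * BytesPerPoint) / (1024.0 * 1024.0 * 1024.0);
        std::printf("%llu points -> %.2f GB of VRAM\n",
                    (unsigned long long)PointCount, Gigabytes); // ~7.45 GB
        return 0;
    }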

And while most people looking into Point Cloud visualizations in a serious manner will probably own top-level hardware (the likes of Titans or Quadros), it will only help so much. What happens when you need a cloud with 1,000,000,000 points or more? One of our projects had us visualize over 16 billion points in a single level – to put that into perspective, you would need at least 1192 GB of VRAM for that – staggering!

The above did not even touch on another, rather important matter – the loading times. Pre-buffering that amount of data may cause very long delays and stuttering – both of which will degrade the end-user’s experience. With the streaming approach, only the relevant portion of the cloud is being transmitted, and there is no need for pre-loading large data sets.

Finally, many users need to efficiently update the data at runtime – be it for live data feeds from things like Kinect and LiDAR scanners or to execute some form of CPU-driven animation. Using streaming makes this process significantly easier, more maintainable, and expandable.

Instancing to the rescue

As some of you might have noticed in the example shown above, the sprite uses 4 vertices to display a single data point of the cloud – this means we are effectively duplicating the same data 4 times.

This is where Hardware Instancing comes in. In essence, it allows us to generate only a single sprite, then instruct the graphics card to render it any number of times, at the given locations. As a result, we no longer waste precious memory on unnecessary data.

This, in theory, would allow for much smoother operation and more point updates between individual frames.
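To illustrate the idea outside of Unreal’s RHI abstraction, here is roughly what an instanced sprite draw looks like in raw OpenGL. The buffer handles and the FPointRecord type are made up purely for the sake of the example:

    #include <cstdint>
    // Requires an OpenGL 3.3+ function loader (e.g. glad or GLEW).

    // Illustrative per-instance record: one entry per point, instead of 4 full vertices.
    struct FPointRecord
    {
        float    Position[3];
        uint32_t PackedColor;
    };

    void DrawPointSprites(GLuint QuadVAO, GLuint InstanceVBO,
                          const FPointRecord* Points, GLsizei NumPoints)
    {
        glBindVertexArray(QuadVAO); // the single, shared quad

        // Upload the per-point data into the instance stream.
        glBindBuffer(GL_ARRAY_BUFFER, InstanceVBO);
        glBufferData(GL_ARRAY_BUFFER, NumPoints * sizeof(FPointRecord),
                     Points, GL_DYNAMIC_DRAW);

        // Attribute 2 = instance location, advanced once per instance rather than per vertex.
        glEnableVertexAttribArray(2);
        glVertexAttribPointer(2, 3, GL_FLOAT, GL_FALSE, sizeof(FPointRecord), nullptr);
        glVertexAttribDivisor(2, 1);

        // 6 indices (2 triangles) describe the quad; it is drawn NumPoints times.
        glDrawElementsInstanced(GL_TRIANGLES, 6, GL_UNSIGNED_INT, nullptr, NumPoints);
    }

The quad geometry lives in VRAM exactly once; only the small per-point records are duplicated per point.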

So, what is the problem?

The benchmarks showed a performance drop of around 6x compared to the non-streaming approach! At the time, I had a few potential ideas, but nothing really concrete yet.

Initially, I thought (well, hoped, really) that my implementation was just horribly incorrect and needed some corrections. However, after many hours of mundane comparisons with Epic’s existing code and some consultations, I started entertaining the idea that the true problem lay somewhere else.

I quickly ruled out memory bandwidth as the bottleneck, since the problem persisted even when there were no data updates.

Polycount was next on the suspect list, so I ran the same test using different polygonal representations – from a simple triangle to an octagon (6 triangles). Unfortunately, the variance between the extremes was 10% at most, meaning I had to look elsewhere.

The first signs of success appeared after I began experimenting with batching instances (rendering more than 1 sprite per instance). It seemed like the system did not like rendering millions of copies of the same object, which made sense, as there aren’t really many scenarios where you would need that many instances. Strangely though, the results were not very consistent.

Finally, during one of the tests, I forgot to transfer the color data stream – and this resulted in a significant performance increase. Following this discovery, I identified what seemed to be the culprit – merely iterating over the instance data buffer (without even using the data itself) appeared to be responsible for nearly the whole performance loss.

Take 2 – Structured Buffers

I had been experimenting with different approaches to optimize the performance, and one of the solutions I found was to use Structured Buffers instead of Instance Streams to pass the instance data. This proved to be superior – “only” 4x slower than the non-streaming approach – a step in the right direction, but still a far cry from the original performance.
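To sketch the difference, again outside of the engine (a Shader Storage Buffer is the OpenGL analogue of a D3D Structured Buffer): instead of feeding per-instance vertex streams, the vertex shader fetches each point’s data itself, indexed by the instance ID. All names here are illustrative:

    // GLSL vertex shader (requires OpenGL 4.3+ for SSBO support).
    static const char* PointSpriteVS = R"(
        #version 430
        layout(location = 0) in vec3 CornerOffset;   // one of the 4 sprite corners

        struct FPointData { vec4 PositionAndSize; vec4 Color; };
        layout(std430, binding = 0) buffer PointBuffer { FPointData Points[]; };

        uniform mat4 ViewProj;
        uniform vec3 CameraRight;
        uniform vec3 CameraUp;
        out vec4 VertexColor;

        void main()
        {
            // Fetch this instance's point directly from the structured buffer.
            FPointData P = Points[gl_InstanceID];

            // Expand the shared corner into a camera-facing quad around the point.
            vec3 WorldPos = P.PositionAndSize.xyz
                          + (CameraRight * CornerOffset.x + CameraUp * CornerOffset.y)
                          * P.PositionAndSize.w;

            gl_Position = ViewProj * vec4(WorldPos, 1.0);
            VertexColor = P.Color;
        }
    )";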

Average performance cost of 1 million visible points

With regards to VRAM requirements, both streaming options consumed, on average, ~100 MB per 1,000,000 visible points, which is an improvement of several orders of magnitude.

There is also an accidental advantage of the Structured Buffer approach. Because it is bottlenecked by the number of instances and not the polycount, it is only marginally affected by the complexity of the sprite. In other words, you could use more complex-shaped sprites for essentially no extra cost.

Experimental hybrid solution

While experimenting with some Level-of-Detail improvement ideas, I accidentally came up with a potentially viable GPU streaming solution. Conceptually, it was based around a mixture of the non-streaming and Structured Buffer approaches and is referred to here as a Hybrid.

The idea was to reserve space for, and pre-generate, a relatively small pool of vertices in VRAM, then dynamically populate a Structured Buffer with the visible points’ data. Finally, render the requested number of sprites from the pre-generated pool, applying the relevant portions of the Buffer to them. Think of it as a dynamic, 3D Displacement Mapping.

Point data displacement

This allowed me to mostly avoid the bottleneck of Hardware Instancing while retaining the ability to transfer only a single copy of the relevant data – a win-win.
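Below is a conceptual sketch of that displacement idea, using the same illustrative OpenGL framing as before: a pool of sprites is generated once, only the visible points’ data is refreshed each frame, and a plain, non-instanced draw covers just the first VisibleCount sprites of the pool.

    struct FPointData { float PositionAndSize[4]; float Color[4]; }; // matches the shader-side struct

    void DrawHybrid(GLuint PoolVAO, GLuint PointSSBO,
                    const FPointData* VisiblePoints, GLsizei VisibleCount)
    {
        // Refresh only the data that is actually visible this frame.
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, PointSSBO);
        glBufferSubData(GL_SHADER_STORAGE_BUFFER, 0,
                        VisibleCount * sizeof(FPointData), VisiblePoints);

        // One ordinary draw call over the pre-generated sprite pool,
        // with no hardware instancing involved.
        glBindVertexArray(PoolVAO);
        glDrawElements(GL_TRIANGLES, VisibleCount * 6, GL_UNSIGNED_INT, nullptr);
    }

    // Shader side: the point index is recovered from the vertex index instead
    // of the instance ID, e.g.  uint PointIndex = uint(gl_VertexID) / 4u;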

As you can see from the graph below, the hybrid method provides a significant performance increase over the other two approaches and is only moderately slower than the non-streaming solution – not bad at all when you consider the amount of flexibility gained in the process.

Average performance cost of 1 million visible points

The Hybrid solution was used in plugin versions released for Unreal Engine 4.22, 4.23, and 4.24.

Adding cache to the mix

One of the problems of the Hybrid solution was its per-frame update of the buffer content. Since the point cloud data generally stays unchanged for the duration of its usage, we could save quite a bit of performance by storing the render data in a temporary cache instead.

This idea led to the introduction of a shared LOD Manager system. One of its purposes was to keep track of the nodes in use, handle their storage streaming, and build, update, and release their render data cache.

Centralizing these processes allowed us to introduce a global point budget system, which I will describe in another blog post – stay tuned!

The way the caching works is actually quite simple. Instead of immediately disposing of the data once we are done rendering it, we store it for a pre-defined amount of time (5s by default), refreshing its expiry timestamp with every request to keep the data alive for as long as it is needed. Only once the node’s lifetime expires is it physically removed from memory. This saves time by skipping redundant VRAM transfers.

Average performance cost of 1 million visible points

To make sure that what is rendered on the screen stays consistent with the data itself, the cache is flagged as invalid and forced to rebuild whenever the contents of the node have been updated.
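Here is a minimal sketch of that lifetime-and-invalidation logic. The types are deliberately simplified and illustrative, not the plugin’s actual LOD Manager:

    #include <cstdint>
    #include <unordered_map>

    // Illustrative cache entry - in practice this would own the node's GPU buffers.
    struct FCachedNodeRenderData
    {
        double LastUsedTime = 0.0;
        bool   bInvalid     = false; // set whenever the node's points change
    };

    class FNodeRenderCache
    {
    public:
        explicit FNodeRenderCache(double InLifetime = 5.0) : Lifetime(InLifetime) {}

        // Called whenever a node is about to be rendered.
        FCachedNodeRenderData& GetOrBuild(uint64_t NodeId, double Now)
        {
            FCachedNodeRenderData& Entry = Cache[NodeId];
            if (Entry.bInvalid)
            {
                // Rebuild the GPU-side render data here, then clear the flag.
                Entry.bInvalid = false;
            }
            Entry.LastUsedTime = Now; // refresh the timestamp to keep the data alive
            return Entry;
        }

        // Called when the node's contents have been modified at runtime.
        void Invalidate(uint64_t NodeId) { Cache[NodeId].bInvalid = true; }

        // Called periodically; frees entries that have not been requested recently.
        void ReleaseExpired(double Now)
        {
            for (auto It = Cache.begin(); It != Cache.end(); )
            {
                if (Now - It->second.LastUsedTime > Lifetime)
                {
                    It = Cache.erase(It); // lifetime expired - release the render data
                }
                else
                {
                    ++It;
                }
            }
        }

    private:
        double Lifetime; // 5 seconds by default, matching the description above
        std::unordered_map<uint64_t, FCachedNodeRenderData> Cache;
    };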

The caching was added for Unreal Engine 4.25 and 4.26.

Coming full circle

While updating the rendering components for 4.27, I had a moment of enlightenment – what if I took the best elements of each technique described here and combined them into a single system?

So, we only read the node from storage when it is relevant for the current view, initialize its render data when requested, and cache it to avoid redundant transfers. But what if, instead of using Structured Buffers, we stored the results as many small, static buffers – the same type used in the non-streaming approach?

Yes, this would mean the vertex data needs to be quadrupled again, resulting in 4 times the VRAM usage. But because the whole cloud is no longer pre-buffered upfront, we neither risk overrunning the VRAM capacity nor suffer excessively long loading times.
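As a sketch of what that per-node build step might look like (still in the illustrative OpenGL framing, reusing the FSpriteVertex and FPointData types from the earlier sketches, with MakeSpriteCorner standing in as a hypothetical helper): each point is expanded into its 4 static vertices once, at cache-build time, so the per-frame cost is reduced to the draw call itself.

    #include <vector>

    // Hypothetical helper: builds one of the 4 corner vertices for a given point.
    FSpriteVertex MakeSpriteCorner(const FPointData& Point, int Corner);

    GLuint BuildNodeVertexBuffer(const FPointData* Points, GLsizei NumPoints)
    {
        // Expand each point into 4 sprite corners on the CPU, once per cache build.
        std::vector<FSpriteVertex> Verts;
        Verts.reserve(size_t(NumPoints) * 4);
        for (GLsizei i = 0; i < NumPoints; ++i)
        {
            for (int Corner = 0; Corner < 4; ++Corner)
            {
                Verts.push_back(MakeSpriteCorner(Points[i], Corner));
            }
        }

        // Upload once as a static buffer; it is then reused every frame until
        // the node's cache entry expires - no per-frame buffer traffic.
        GLuint VBO = 0;
        glGenBuffers(1, &VBO);
        glBindBuffer(GL_ARRAY_BUFFER, VBO);
        glBufferData(GL_ARRAY_BUFFER, Verts.size() * sizeof(FSpriteVertex),
                     Verts.data(), GL_STATIC_DRAW);
        return VBO;
    }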

We know that the static buffers are processed the fastest, so how does the new method compare? Have a look below:

Average performance cost of 1 million visible points

Mind-blowing, right? I will be stress-testing this approach for the next few weeks to make sure there are no unwanted surprises, but so far, so good.

Conclusions

The plugin has definitely come a long way since its early days. After rebuilding the rendering system several times over the last few years, I think I am finally happy with where the code is performance-wise. At least for now…

If everything goes well, I will be making this the default render mode, with the old (Cached) method as an optional toggle – for those wanting to record super-high-budget cinematics. When recording non-real-time takes, runtime performance is not really important, but the significant increase in VRAM requirements may introduce an unwanted cap on the maximum fidelity of those shots.

 

BY MICHAL CIECIURA