
Level of detail control is generally practiced in order to spare computing resources and time in rendering entities of a graphics model. For example, if a game character is very far from the player’s viewport (and thus appears small on the screen), it is not beneficial to use very complex shaders or geometry to render that character. Fancy effects and sharp details wouldn’t be very visible from that distance.
We would still need to render something to depict the character, because the player would definitely notice the character suddenly popping into view when approaching it. Only if the character were smaller than one pixel on the screen could we skip rendering it altogether without the player noticing a thing.
A level of detail control flow usually involves removing detail from a renderable entity. This is accomplished by removing geometric primitives, by reducing the precision of the elements that the entity consists of, or by using smaller textures and/or simpler shaders to render said entity.
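The decision behind such a control flow can be sketched on the CPU side in a few lines of C++. This is only a minimal illustration: the bounding-sphere approximation, the function names and the pixel thresholds are assumptions made for this example and are not taken from any Direct3D API.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical detail levels for this sketch.
enum class DetailLevel { Skip, Low, Medium, High };

// Approximate on-screen diameter, in pixels, of an object's bounding sphere.
// radius and distance are in world units; fovY is the vertical field of
// view in radians; viewportHeight is in pixels.
inline float projectedDiameterPixels(float radius, float distance,
                                     float fovY, float viewportHeight)
{
    assert(distance > 0.0f && fovY > 0.0f);
    // Height of the view frustum at the object's distance:
    const float frustumHeight = 2.0f * distance * std::tan(fovY * 0.5f);
    return (2.0f * radius / frustumHeight) * viewportHeight;
}

// Map the projected size to a detail level. The thresholds are arbitrary
// example values and would be tuned per application.
inline DetailLevel selectDetailLevel(float pixels)
{
    if (pixels < 1.0f)   return DetailLevel::Skip;   // smaller than one pixel
    if (pixels < 32.0f)  return DetailLevel::Low;
    if (pixels < 256.0f) return DetailLevel::Medium;
    return DetailLevel::High;
}
```

An application could evaluate selectDetailLevel(projectedDiameterPixels(...)) once per object and frame and choose the mesh, textures and shaders accordingly.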
In older graphics architectures, the geometry-related level of detail adjustments almost always required CPU intervention, thus incurring extra overhead to the operation that is supposed to reduce the system load.
The generalized shader pipeline architecture of Direct3D 10 enables us to conditionally draw and remove geometry at both the object and the primitive level, compress the data and dynamically control shader flow. Together, these let us implement all the methods of controlling level of detail in hardware, without direct involvement of the CPU (except for the initial programming of the graphics processor). Next, we look at these methods in detail, starting from object-level culling and finishing with conditional flow implemented in pixel shaders.
Occlusion culling is a popular technique for reducing the drawing workload of complex 3D scenes. In legacy implementations, it works basically as follows:
On the other hand, if the number of the pixels that passed the depth test was zero, the object’s bounding geometry (and therefore the actual object) was definitely fully behind the geometry rendered in step 1. Therefore, we can ignore the drawing of the current object and move on to draw other objects as applicable.
Upon trying to download the occlusion data, the graphics card may report that the data isn’t available yet. In this case, we have to either wait for the data to become available or just proceed with rendering the detailed object.
There is a delicate balance of performance between giving up on the query and just rendering the object, and being optimistic and waiting in the hope that the query, when it does successfully return the data, confirms that we don’t need to render the object. For small to moderate numbers of primitives in the detail geometry, it is statistically not worth waiting for the results beyond one occlusion data download attempt; instead, it is preferable to just render the detail geometry, because drawing it would take less time than waiting for the data combined with the time it takes to actually call the download. Of course, the balance depends on the combination of the scene, the graphics hardware performance and the performance of the rest of the system, including the CPU.
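The "one attempt, then draw" policy described above can be sketched as follows. The QueryPoll structure is a hypothetical stand-in for the outcome of one non-blocking read of an occlusion query; it is not part of any Direct3D interface, and a real application would fill it from whatever query mechanism its API provides.

```cpp
#include <cassert>

// Hypothetical result of ONE non-blocking attempt to read an occlusion query.
struct QueryPoll {
    bool     available;      // false if the GPU has not finished the query yet
    unsigned visiblePixels;  // meaningful only when available is true
};

enum class Decision { DrawDetail, SkipDetail };

// Policy from the text: poll once and never wait.
inline Decision decideAfterSinglePoll(const QueryPoll& poll)
{
    if (!poll.available)
        return Decision::DrawDetail;  // give up on the query; drawing beats waiting
    // The result is ready: skip the detail geometry only if no pixel of the
    // bounding geometry passed the depth test.
    return poll.visiblePixels == 0 ? Decision::SkipDetail
                                   : Decision::DrawDetail;
}
```

Note that the policy errs on the side of drawing: an unavailable result is treated exactly like a visible object, so the worst case is wasted rendering work rather than a missing object.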
Before Direct3D 10, occlusion culling was quite bothersome to set up and manage, and despite some potentially significant savings in rasterizer utilization, more often than not the overhead of actually performing the occlusion culling routine outweighed its benefits to overall application performance.
It takes a certain time to route a graphics call all the way from the application to the graphics processing unit. This time varies between specific calls, but it is significant compared to the time it takes to perform simple to moderately complex calculations locally on either the CPU or the GPU. The delay has many causes, but the primary one is the transition of the call from the application layer to the kernel layer in the operating system and the CPU.
Another significant delay can be caused by the latency and bandwidth of the graphics bus, but this is no longer a very large bottleneck thanks to the massive bus bandwidths offered by modern system board designs and memory electronics. Yet another potential delay arises when the data required to fulfill a given graphics call isn’t ready by the time the call is executed; the system then has to wait until the data becomes available before processing can continue.
In order to gather the occlusion data in the old way, we need quite a few extra graphics calls in addition to the calls used to normally render the geometry:
In Direct3D 9, we could call the query to retrieve the number of occluded pixels asynchronously, so that we wouldn’t have to wait for the results of the occlusion test and could just draw the geometry in question. However, all the extra drawing calls necessary to perform the query in the first place take a lot of CPU time that could be spent on something else, such as the physics or artificial intelligence simulations that games commonly require. This causes a considerable general delay in the application. Unless the geometry that was excluded from drawing based on the query is very complex in its actual shading logic, using occlusion queries in this style would generally only slow the entire application down instead of introducing the intended performance increase.
There are a variety of other culling methods available, such as spatial analysis, which essentially determines whether a given graphics object is in the line of sight with regard to the viewport, and distance analysis, which modifies the drawing of objects based on their distance from the viewpoint. However, we will concentrate on occlusion culling in this chapter since it maps easily to Direct3D 10 paradigms.
Direct3D 10 introduces the ID3D10Predicate interface that is designed to remove most of the overhead that plagues the legacy implementation of occlusion culling. The interface can be created using a predicate hint flag that stores the results of the occlusion query – which can be directly used in drawing – without ever needing to wait and download the results to the CPU in order to determine if the detailed version of the potentially occluded geometry should be rendered or not.
The following code shows the basics of using the ID3D10Predicate interface to issue and act upon the results of an occlusion query:
    // Create the predicate object:
    ID3D10Predicate* g_pPredicate = NULL;
    D3D10_QUERY_DESC qdesc;
    qdesc.Query = D3D10_QUERY_OCCLUSION_PREDICATE;
    qdesc.MiscFlags = D3D10_QUERY_MISC_PREDICATEHINT;
    pd3dDevice->CreatePredicate( &qdesc, &g_pPredicate );

    // Render the test (proxy) geometry inside the query:
    g_pPredicate->Begin();
    // render here
    g_pPredicate->End();

    // Use the occlusion data:
    // Render only if the predicate indicates the proxy was not occluded.
    // Pass the predication object for the device to determine when
    // to render:
    pd3dDevice->SetPredication( g_pPredicate, FALSE );

    // Draw intensive stuff here based on the predicate result.
    // After the SetPredication call, the hardware uses the occlusion
    // information automatically to suppress the drawing operations.
    // This way, the CPU doesn't need to wait for the results to be
    // downloaded, and the benefits of occlusion culling can be used
    // more effectively than in legacy approaches.

    // Finally, reset the predication state so that the device can
    // draw unconditionally:
    pd3dDevice->SetPredication( NULL, FALSE );

    // The device draws unconditionally now.
As a programmer, you can utilize one predicate object for many objects, or create more predicates if you want to control the rendering conditions in a hierarchical fashion. If you have many discrete objects, you will want to use multiple predicate objects in order to separately test the occlusion conditions for each of them. The occlusion objects are relatively light-weight, and this way you don’t have to separately render the test pass for each object.
The predicate hints still behave asynchronously as in the legacy implementation. However, as we don’t need to explicitly download the results to CPU in order to act upon them in drawing the detail geometry, the process of using the occlusion data becomes lightning fast and is much more likely to provide performance increases in Direct3D 10 than in the older graphics programming interfaces.
If you do need to actually get the predicate state to the CPU for some reason, you can call the GetData method that ID3D10Predicate inherits from ID3D10Asynchronous to download the result as a Boolean value which indicates the outcome of the test phase. (ID3D10Device::GetPredication, by contrast, returns the predicate object and comparison value currently set on the device, not the result of the test.) Even though this isn’t the optimal way of using the predicate objects, it still improves on the legacy way; before Direct3D 10 it was necessary to download the actual figures of the test to the CPU and calculate the drawing condition flag there.
The predicate interface has limitations on which calls can be skipped with it. All the graphics commands that draw geometry, as well as some data fill and copy operations, honor the current predication status. These have one thing in common: they require an internal lock on the graphics memory to succeed, and they move potentially large amounts of data around. This makes them potentially heavy operations.
The commands that don’t react to predicate conditions have to do with geometry layout, geometry manipulation, resource view creation, the Present method and resource creation. They will succeed when called even if the predicate condition was met.
The separation between those calls that can use the predicate flag and those that do not is mainly due to a practical architectural decision. Lightweight state changes take less time to switch than the predication saves, and excluding these selected states from the predication system reduces the complexity of the graphics driver – which results in potential performance increases and stability.
Furthermore, the nature of some states does not accommodate predicates well; setting the input assembler layout is one example. Most programmers do not really want to control this particular state change based on the results of drawing another object, and were it an option, it could invite bugs in applications (such as inadvertently not specifying the input layout at all).
For a complete list of commands that either do or do not honor the predicates, please see the chapter in the DirectX SDK called “Draw Predicated Sample” (as of February 2007 SDK).

The fastest graphics primitive is a graphics primitive that doesn’t get drawn. In graphics architectures before Direct3D 10, there was no way for the hardware to cull individual primitives from a graphics stream inside the drawing pipeline; this means that culling was best done at the per-object level before the drawing of the object even began.
The culling criteria might be the distance from the eye point, custom clipping geometry or clipping planes, or even the equipment parameters of a game character; that is, whether the character has a weapon or armor at its disposal.
Common per-object culling can be wasteful in open scenes because the visibility of the object must be calculated on the CPU and the drawing of the said object must be initiated based on these calculations.
The design of Direct3D 10 takes this into account and provides a way to conditionally pass data through the pipeline, discarding data per-primitive on the geometry shader based on dynamically calculated parameters or geometry elements passed to it from the original geometry via the vertex shader.
The following code sample shows a simple geometry shader which removes triangles from a mesh based on the calculated normal of the current triangle:
    // This geometry shader discards triangles based on the x component
    // of their geometric normal. GSIn contains world-space positions
    // in addition to ordinary vs output.
    [MaxVertexCount(3)]
    void ConditionalDiscardGS(triangle GSIn input[3],
                              inout TriangleStream<PSIn> output)
    {
        // Take the cross product of two different edges of the input
        // triangle in world space (both edges start at vertex 0):
        float3 xProd = cross(normalize(input[1].wsPos - input[0].wsPos),
                             normalize(input[2].wsPos - input[0].wsPos));

        // Then, if the x component of the triangle normal is positive,
        // discard the current triangle altogether by simply not
        // outputting it to the rasterizer:
        if (xProd.x > 0.0f)
            return;
        else
        {
            // We now output the triangles that didn't face
            // world-space "right":
            PSIn outVtx = (PSIn)0;

            // Iterate over the three vertices of the current triangle:
            for (int i = 0; i < 3; i++)
            {
                // For the output triangle, use the
                // perspective-space positions:
                outVtx.pos = float4(input[i].pos, 1.0);

                // OMITTED: pass the rest of the vertex elements to output.

                // TriangleStream::Append() sends the vertices on
                // to the rasterizer:
                output.Append(outVtx);
            }
        }
    }
By replacing the test against x with a test against the z component of the normal, and computing the normal from the projected positions instead of the world-space ones, the code does essentially the same thing as the hardware does when performing back-face culling. To understand the practicality of this, consider the following: depending on your vertex shader and geometry shader combination, you may need to disable the default culling render state and do this operation yourself; for example, when rendering both front-facing and back-facing polygons in a single pass. For ordinary opaque rendering without any special needs, however, the default back-face culling, done automatically by the graphics hardware in the rasterizer stage, usually performs better and doesn’t depend on the geometry shader.
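The arithmetic behind such a facing test can be checked on the CPU with a small C++ sketch. The Vec3 type and the helper names are made up for this example, and the sign convention is an assumption: with +x pointing right and +y pointing up, a counterclockwise triangle yields a normal with a positive z component. Which winding counts as front-facing in a real application depends on its rasterizer state and coordinate conventions.

```cpp
#include <cassert>

// Minimal vector type for this sketch (hypothetical, not a Direct3D type).
struct Vec3 { float x, y, z; };

inline Vec3 sub(const Vec3& a, const Vec3& b)
{
    return { a.x - b.x, a.y - b.y, a.z - b.z };
}

inline Vec3 cross(const Vec3& a, const Vec3& b)
{
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}

// Unnormalized geometric normal of a triangle. The two edges must differ:
// the cross product of an edge with itself is always the zero vector.
inline Vec3 triangleNormal(const Vec3& v0, const Vec3& v1, const Vec3& v2)
{
    return cross(sub(v1, v0), sub(v2, v0));
}

// True if the projected triangle is wound counterclockwise
// (assuming +x is right and +y is up).
inline bool isCounterClockwise(const Vec3& v0, const Vec3& v1, const Vec3& v2)
{
    return triangleNormal(v0, v1, v2).z > 0.0f;
}
```

Only the sign of one component is needed for the decision, so the normal does not have to be normalized for this test.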
We can selectively render subsets of the geometry by specifying, in the vertex buffers, to which subset a particular triangle belongs and comparing this to a specified subset index passed to the geometry shader. To specify the triangle subset identifiers, we can use a second stream of geometry in addition to the basic geometry data.
The following IA layout describes a data access pattern in which the ordinary geometry is specified on the first vertex buffer, and the subset indices on a second one:
    // This input assembler layout describes geometry from two vertex
    // buffers. The first buffer contains the positions, normals and
    // texture coordinates of the geometry as usual, and the second
    // buffer contains the triangle subset indices.
    // The semantic name and index pairs ("TEXTURE", 0) and ("TEXTURE", 1)
    // correspond to the HLSL semantics TEXTURE0 and TEXTURE1.
    D3D10_INPUT_ELEMENT_DESC layout[] =
    {
        { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0,  0, D3D10_INPUT_PER_VERTEX_DATA, 0 },
        { "NORMAL",   0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 12, D3D10_INPUT_PER_VERTEX_DATA, 0 },
        { "TEXTURE",  0, DXGI_FORMAT_R32G32_FLOAT,    0, 24, D3D10_INPUT_PER_VERTEX_DATA, 0 },
        { "TEXTURE",  1, DXGI_FORMAT_R32_UINT,        1,  0, D3D10_INPUT_PER_VERTEX_DATA, 0 },
    };
The input element descriptor structure members used in the previous code are defined as follows:
LPCSTR SemanticName: The HLSL semantic name for this element.
UINT SemanticIndex: If more than one element uses the same semantic, this denotes the index for that semantic to disambiguate the elements from each other.
DXGI_FORMAT Format: The data type of the element.
UINT InputSlot: The index of the input assembler slot (that is, the vertex buffer binding) from which this element is read. 16 input slots are available.
UINT AlignedByteOffset: Optional parameter that specifies the offset, in bytes, of this element from the start of the vertex in the current input slot. The constant D3D10_APPEND_ALIGNED_ELEMENT specifies that this element is placed directly after the previous one in the current input slot, including any padding that alignment requires.
D3D10_INPUT_CLASSIFICATION InputSlotClass: Defines whether this element is per-vertex or per-instance data. Instancing is a way of reusing vertex data for multiple similar objects, while some elements vary per instance of the object being rendered. We don’t touch instancing further in this chapter, but the DirectX SDK contains information on how to take advantage of this feature.
UINT InstanceDataStepRate: The number of instances to draw with this element before the stream pointer is advanced to the next element in the current slot. It must be 0 for per-vertex data. This corresponds to the stream source frequency in Direct3D 9.
The first vertex buffer contains the ordinary geometry data, and the second one contains the corresponding subset indices as unsigned integers.
To enable setting multiple vertex buffers as stream sources, the device’s method IASetVertexBuffers() accepts an array of vertex buffer interfaces. Before rendering, we set the vertex buffers using this method. Each vertex buffer takes one input slot – in our case, we need two.
At render time, we load the subset indices from the secondary stream by using the vertex shader. Then we pass the subset id, along with ordinary vertex data, to the geometry shader. The geometry shader reads in the subset id and acts by appending the current triangle to the output stream or discarding it based on a reference subset id value set in the effect.
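Because the second stream is read per vertex, every vertex of a triangle has to carry that triangle's subset id. As a minimal sketch of how the contents of the second vertex buffer could be prepared on the CPU, the following C++ function expands one id per triangle into one id per vertex. It assumes a non-indexed triangle list in which every three consecutive vertices form one triangle; the function name is hypothetical, and uploading the resulting array into a Direct3D buffer is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Expands one subset id per triangle into one id per vertex.
// Assumes a non-indexed triangle list: vertices 3*t, 3*t+1 and 3*t+2
// belong to triangle t. Each id is a 32-bit unsigned integer, matching
// a DXGI_FORMAT_R32_UINT element.
inline std::vector<std::uint32_t>
buildSubsetStream(const std::vector<std::uint32_t>& triangleSubsets)
{
    std::vector<std::uint32_t> perVertex;
    perVertex.reserve(triangleSubsets.size() * 3);
    for (std::uint32_t id : triangleSubsets)
    {
        // Repeat the triangle's id for each of its three vertices:
        perVertex.insert(perVertex.end(), std::size_t(3), id);
    }
    return perVertex;
}
```

With indexed geometry, a vertex shared by triangles from different subsets would presumably have to be duplicated so that each copy can carry its own id.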
The following code shows the combination of the vertex shader and a geometry shader that takes into account the subset id of the incoming geometry:
    // This vertex shader passes the ordinary geometry data and the
    // per-triangle subset index data to the geometry shader.
    // GSIn contains the same elements as the parameters of this
    // vertex shader. PSIn geometry shader output structure is
    // identical to it in this sample.
    // Do note that we don't need any extra processing here
    // despite the geometry originating from multiple streams:
    GSIn SubsetVS(float3 pos : POSITION, float3 nrm : NORMAL,
                  float2 tcoord : TEXTURE0, uint subsetId : TEXTURE1)
    {
        GSIn ret;
        ret.pos = pos;
        ret.nrm = nrm;
        ret.tcoord = tcoord;
        ret.subsetId = subsetId;
        return ret;
    }

    // This geometry shader culls away all triangles in which the
    // subset ID is exactly two (for example's sake). We could use
    // more elaborate logic for this, but it is out of the scope of
    // this sample:
    [MaxVertexCount(3)]
    void SubsetGS(triangle GSIn input[3],
                  inout TriangleStream<PSIn> output)
    {
        // The subset id is the same on each input vertex at this point,
        // so we need only to check the first vertex for the value:
        if (input[0].subsetId == 2)
            return;
        else
        {
            // If the subset id wasn't two, write the triangle out for
            // further processing:
            output.Append(input[0]);
            output.Append(input[1]);
            output.Append(input[2]);
        }
    }
Using geometry shaders for the purpose of culling geometry has some performance advantages over using the legacy style of culling per-object, because the drawing of objects and subsets can be driven by using shader parameters which are relatively lightweight to set as compared to data stream binding and separate drawing calls for each object or subset.
Also, as a completely new feature, the technique allows us to control rendering with values calculated entirely on the graphics processor, which makes it possible to implement custom dynamic culling rules; for example, to toggle the drawing of a particle based on its age, as computed by a large-scale particle simulation running entirely on the graphics processor without CPU intervention.

Pixel shaders can become very complex with the new shading architectures, which allow an almost unlimited number of instructions in the shader logic, a vast number of inputs for sampling textures and geometry values, and flow control systems. The recent advances in real-time graphics hardware potentially let a single pixel shader invocation perform very long calculations that rival, in complexity, the shaders used in off-line rendering of graphics for feature films.
However, computing time is still at a premium. It is not practical to render entire scenes with super-complex pixel shaders, as the runtime of the shader directly affects the time it takes to render the current pixel. When the effective runtime of a complex pixel shader is multiplied by the common resolutions of today’s monitors, the frame rate usually suffers, not to mention that most applications render many other intermediate images of the scene besides the frame buffer image. Fortunately, the new pixel shader architecture of Direct3D 10 allows us to control the flow of a pixel shader in minute detail.
Conditional flow statements have been a capability of pixel shaders since Direct3D 9. A conditional flow allows the shader code to branch and loop during its invocation, based on either external variables like number of lights of a particular effect or variables calculated inside the pixel shader itself, such as the terminating condition of a fractal-based algorithm implemented in the shader.
Direct3D 10 gives us new tools to take advantage of the dynamic execution capabilities of the modern hardware. Integer calculations and data are fully supported in all shader types, unlike in Direct3D 9 cards where integers were primarily emulated using floating-point arithmetic – or if integer operations were implemented, they would be lacking in features such as precise discrete indexing of data and bitwise operations like OR, AND and XOR.
By passing in subset values from the geometry streams as integers to the pixel shader, we can control per-primitive pixel shader invocation to allow for material selection inside the shader. If some part of the object needs to be rendered in a complex shader while the rest of the object doesn’t, we can use a complex code path for the marked areas and a simple code path for the rest of the primitives. We can even define material indices; this makes it possible that the pixel shader can take more branches than just two by evaluating the code path for the desired material via a switch statement.
The following code is a simple pixel shader that can change its code flow based on the material index passed to it as an integer along with the rest of the geometry data. It continues where the last code listing left off, using the subset ID of the geometry to change its behavior. In addition to the elements in the last listing, the PSIn structure is assumed to contain a diffuse color calculated in the vertex shader:
    // This pixel shader changes its behavior dynamically based
    // on the incoming subset ID:
    float4 SubsetPS(PSIn input) : SV_Target
    {
        float4 ret = (float4)0;
        switch (input.subsetId)
        {
            case 0:
                // In case of subset 0, render the geometry red:
                ret = input.color * float4(1.0, 0.0, 0.0, 1.0);
                break;
            case 1:
                // We want to render the subset 1 as blue:
                ret = input.color * float4(0.0, 0.0, 1.0, 1.0);
                break;
            default:
                // OMITTED: For the rest of the subsets, do whatever needed.
                break;
        }
        return ret;
    }
A good performance booster for any scene is to render distant objects with simpler shaders than the ones used when the objects are close to the observer.
As we are not limited to the values from the geometry or textures input to the pipeline, we can control the flow of pixel shaders directly with values calculated in the vertex or geometry shader, or even inside the same pixel shader. The size and depth of a screen-space primitive can be determined by the co-operation of the vertex and geometry shader, and this information can be passed to the pixel shader to determine the appropriate complexity of the code flow for shading that primitive.
In addition to shading the primitive differently based on depth, we can also discard the pixels using the same criteria. Consider that if a screen-space triangle is smaller than a couple of pixels on the screen, the user may not even notice if it is missing. Should an array of almost indistinguishable pixels be rendered with a very complex shader, it could affect the rendering speed of the object significantly. Similarly, pixels that are facing away from the light do not need to evaluate the full lighting equation, which can be very complex on the Direct3D 10 architecture.
It is worth noting that the dynamic flow control can slow down the performance of pixel rendering. This is due to the fact that in case of potentially random memory access patterns, the hardware has to drop various assumptions about the instruction branches and texels to be used in the drawing. However, if you use dynamic flow control to toggle between extremely simple and moderately to extremely complex shader paths AND the majority of resulting pixels fall into the simple category, using dynamic path selection will almost certainly yield much better performance than the naïve approach of rendering all the pixels in full detail.
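The trade-off can be made concrete with a deliberately simplified cost model, sketched below in C++. All names and numbers are hypothetical and the costs are in arbitrary units; the model assumes each pixel pays a fixed overhead for the branch plus the cost of the path it takes, and it ignores the fact that graphics hardware shades pixels in groups, so pixels that take different paths within one group can make the whole group pay for both paths.

```cpp
#include <cassert>

// Expected per-pixel cost of a shader that branches between a simple and
// a complex path. simpleFraction is the share of pixels (0..1) that take
// the simple path; all costs are in the same arbitrary unit.
inline float expectedBranchedCost(float simpleFraction, float simpleCost,
                                  float complexCost, float branchOverhead)
{
    assert(simpleFraction >= 0.0f && simpleFraction <= 1.0f);
    return branchOverhead
         + simpleFraction * simpleCost
         + (1.0f - simpleFraction) * complexCost;
}

// Branching pays off when it is cheaper than always running the complex path.
inline bool branchingPaysOff(float simpleFraction, float simpleCost,
                             float complexCost, float branchOverhead)
{
    return expectedBranchedCost(simpleFraction, simpleCost,
                                complexCost, branchOverhead) < complexCost;
}
```

With example values of 10 units for the simple path, 200 for the complex path and 5 for the branch, 90 percent simple pixels give an expected cost of about 34 units instead of 200, whereas with no simple pixels at all the branch only adds its overhead.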
The following code sample illustrates how to change the flow of the pixel shader based on the depth and size information that is input to the shader from the vertex shader and geometry shader. The PSIn structure contains, among the ordinary geometry data, the area of the current triangle in screen space calculated by the geometry shader:
    // This pixel shader takes into account the screen-space size
    // of the current primitive when deciding rendering strategy:
    float4 TriangleSizePS(PSIn input) : SV_Target
    {
        // If the triangle is too small, do not render this pixel:
        if (input.triArea < 2.0)
        {
            clip(-1.0);
            return (float4)0;
        }

        // If the triangle is medium sized, execute a simple code path:
        if (input.triArea < 20.0)
        {
            // Apply simple processing here.
            // Return the results here.
        }

        // If we got this far, the triangle is moderately large.
        // Execute the most complex rendering logic here.
    }
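The screen-space area that the listing above reads from triArea is plain 2D geometry. The following C++ sketch shows one way the value could be computed; the Vec2 type and the function names are made up for this example, and the mapping assumes normalized device coordinates in the range -1 to 1 with the y axis pointing up, while pixel coordinates have y pointing down. A geometry shader would perform the same arithmetic on the projected vertex positions.

```cpp
#include <cassert>
#include <cmath>

// Minimal 2D vector type for this sketch (hypothetical).
struct Vec2 { float x, y; };

// Maps normalized device coordinates (-1..1, y up) to pixel coordinates
// (0..width, 0..height, y down).
inline Vec2 ndcToPixels(const Vec2& ndc, float width, float height)
{
    return { (ndc.x * 0.5f + 0.5f) * width,
             (1.0f - (ndc.y * 0.5f + 0.5f)) * height };
}

// Area of a triangle given in pixel coordinates: half the magnitude of
// the 2D cross product of two of its edges.
inline float triangleAreaPixels(const Vec2& a, const Vec2& b, const Vec2& c)
{
    const float twiceSignedArea =
        (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
    return 0.5f * std::fabs(twiceSignedArea);
}
```

The absolute value makes the result independent of the winding order, which is what we want here because only the size of the triangle matters for choosing the code path.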
For most interactive applications, it is important to render just as much as is needed, without rendering objects that aren’t going to be visible anyway, to keep the general performance of the application as high as possible. The new shader pipeline of Direct3D 10 enables us to implement this very efficiently and flexibly.
It is relatively easy for developers using Direct3D 10 to take advantage of the hardware-based culling and level-of-detail control methods outlined in this chapter to get the most performance out of their applications.
Of course, if you can efficiently determine objects’ visibility before they enter the graphics pipeline, it is usually the most effective method of accelerating your graphics engine. For this popular approach, spatial-partitioning techniques such as quadtrees and octrees are tried and true.
On the other hand, if you’re not sure about the visibility without jumping through a lot of hoops in the CPU (and consuming considerable performance in the visibility testing), it is best to let the graphics hardware handle the testing and acting upon the resulting data as described in the previous subchapters.