Chapter 24. Job System

24.1. Overview

The job system allows the additional cores in a multicore system to execute code and calculate rendering data just in time. It also moves the issuing of D3D commands onto its own core; the main thread simply records the commands for execution in the next frame.

The rendering of a frame is divided into blocks. Blocks are rendered in order. Each block begins only after the previous block has finished.

Each block consists of any D3D rendering commands and associated vertex and index buffers, textures, shader constants or any other data. The block can be produced as a combination of conventional D3D rendering from the main thread and output from jobs running in parallel on cores allocated to run jobs.

Any number of jobs can produce the input for a block. Since only one block is rendered at a time and jobs execute in parallel, each block should have at least as many jobs as there are job cores; otherwise some cores will sit idle. Jobs within a block can finish in any order, but the block will not be rendered until all of its jobs are complete. While the jobs are executing, D3D executes the previous block. Thus the job cores and the D3D core are constantly working, while at the same time the main thread is preparing another frame of blocks and their corresponding jobs.

24.2. Under the Hood

All rendering commands and jobs are stored in a command buffer. This is accomplished by wrapping D3D. All D3D function calls go to the wrapper which records them to be executed on the next frame while at the same time the commands from the previous frame are executed in another thread on the D3D core.
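The record-now, execute-next-frame idea can be illustrated with a minimal sketch. `CommandRecorder` and its methods are hypothetical names invented for this illustration, not part of the wrapper, and the sketch runs on one thread, whereas the real wrapper executes the previous frame's commands on the D3D core.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the D3D wrapper: calls are recorded, not executed.
class CommandRecorder
{
public:
    // Record a command for execution on the next frame.
    void record( std::function<void()> cmd )
    {
        assert( cmd );  // an empty std::function cannot be executed later
        recording_.push_back( std::move( cmd ) );
    }

    // Frame boundary: the recorded commands become the buffer to execute,
    // and recording starts again from an empty buffer.
    void flip()
    {
        executing_.swap( recording_ );
        recording_.clear();
    }

    // Run the previous frame's commands in the order they were recorded.
    void executePrevious()
    {
        for ( auto& cmd : executing_ )
            cmd();
        executing_.clear();
    }

private:
    std::vector<std::function<void()>> recording_;
    std::vector<std::function<void()>> executing_;
};
```

Commands recorded before `flip()` run on the next `executePrevious()`, while anything recorded after `flip()` waits for the following frame.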

When the command buffer is flushed, the D3D core first stalls, waiting for the results of the first block's jobs. Meanwhile, each core in the job cluster starts grabbing jobs from the block. There is no central dispatch mechanism. Each core grabs a job and, when it has finished writing the job's results, atomically decrements a counter for that block. Note that jobs are grabbed in order but do not necessarily finish in order. For example, if the first job takes a long time, the second might finish first, and that core can then begin another job. This means that the output is not guaranteed to be complete until the last job finishes and decrements the counter to zero.

Only then can the D3D core begin to process the results of the first block, while the cluster begins to operate on the second block, outputting the results into another buffer.
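The grab-and-decrement scheme described above can be sketched with standard C++ atomics and threads. This is only an illustration under stated assumptions: `Block`, `workerLoop` and `runBlock` are hypothetical names, jobs are modelled as `std::function` objects, and the waiting loop stands in for the D3D core stalling on the block's results.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical block: a list of jobs plus two atomic counters.
struct Block
{
    std::vector<std::function<void()>> jobs;
    std::atomic<std::size_t> nextJob{ 0 };    // index of the next job to grab
    std::atomic<std::size_t> remaining{ 0 };  // jobs that have not finished yet
};

// Body of each job core: grab jobs in order until none are left.
inline void workerLoop( Block& block )
{
    for (;;)
    {
        const std::size_t i = block.nextJob.fetch_add( 1 );
        if ( i >= block.jobs.size() )
            return;
        block.jobs[i]();
        // Decrement only after the job's output has been written.
        block.remaining.fetch_sub( 1, std::memory_order_release );
    }
}

// Run every job of a block on nCores threads; return once the counter is zero.
inline void runBlock( Block& block, unsigned nCores )
{
    assert( nCores > 0 );
    block.nextJob = 0;
    block.remaining = block.jobs.size();

    std::vector<std::thread> cores;
    for ( unsigned c = 0; c < nCores; ++c )
        cores.emplace_back( workerLoop, std::ref( block ) );

    // Stand-in for the D3D core stalling until the block's output is complete.
    while ( block.remaining.load( std::memory_order_acquire ) != 0 )
        std::this_thread::yield();

    for ( auto& core : cores )
        core.join();
}
```

No dispatcher hands out work here: each thread claims the next index with `fetch_add`, so jobs start in order but may finish in any order.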

If the D3D core finishes first, it retires its buffer and stalls until its next buffer is ready. If the cluster finishes first, it stalls until the D3D core retires the buffer it is consuming, so that the buffer can receive the cluster's next output.

24.3. Wrapper API

In addition to wrapping D3D the wrapper has a small API to control its behaviour.

DX::newBlock(): Starts a new block. All rendering from the previous block will finish before this one begins, and all job output that was used to render the previous block will no longer be accessible. This means that all jobs producing output for this block must be allocated before the next call to DX::newBlock().

DX::setWrapperFlags() and DX::getWrapperFlags(): These functions get and set flags that control the behaviour of the wrapper.

IMMEDIATE_LOCK: Flush the command buffer and then execute the lock in the main thread. This is required if you are not going to fill in the entire locked region. It is an extremely expensive way to lock and should be avoided where possible.

DEFERRED_LOCK: Use this flag to lock a buffer that will be filled in by a job. The pointer returned by the lock must not be dereferenced on the main thread; it can only be stored in a job and accessed when that job executes. The actual lock occurs in the next frame, when the job executes.

24.4. Job System API

The Job System API is accessed through the JobSystem singleton. It is obtained with JobSystem::instance().

allocJob(): This method places a job into the list of jobs for the current block and returns an instance of a user-implemented class derived from Job. The class has a virtual function called execute(), which actually performs the job. The derived class stores anything necessary for the execution of the job. Note that the jobs within a block are started in order but may finish in any order.

allocOutput(): The job will need to produce output and this is allocated up front with allocOutput(). It is called from the main thread and its result is placed into the Job object. The memory for the output is not actually allocated at this time, but it will be made ready at the time that the job is executed next frame.

Within a block you can have any mix of job and output allocations. For example, you may allocate one output and divide it between many jobs or vice versa or any combination you like. The only rule is that the output and the jobs must all come from the same block.
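As an illustration of dividing one output between many jobs, the following sketch computes which contiguous slice of a shared output each job would write. `Slice` and `divideOutput` are hypothetical helpers made up for this example; they are not part of the Job System API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One job's share of a shared output allocation, measured in elements.
struct Slice
{
    std::size_t first;  // index of the first element this job writes
    std::size_t count;  // number of elements this job writes
};

// Split nElements into nJobs contiguous slices.
// Earlier slices absorb any remainder, one extra element each.
inline std::vector<Slice> divideOutput( std::size_t nElements, std::size_t nJobs )
{
    assert( nJobs > 0 );
    std::vector<Slice> slices;
    slices.reserve( nJobs );
    const std::size_t base = nElements / nJobs;
    const std::size_t extra = nElements % nJobs;
    std::size_t first = 0;
    for ( std::size_t j = 0; j < nJobs; ++j )
    {
        const std::size_t count = base + ( j < extra ? 1 : 0 );
        slices.push_back( Slice{ first, count } );
        first += count;
    }
    return slices;
}
```

For 4096 elements and 8 jobs this yields eight slices of 512 elements each, starting at offsets 0, 512, 1024 and so on.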

24.5. An Example

Let us imagine that we are updating and rendering a simple particle system. This will be done in one block. The particle system consists of 4096 points, each of which will be updated and generate one vertex into a vertex buffer for rendering.

Our block will consist of setting a vertex buffer and calling DrawPrimitive on it. The vertex buffer will be filled using jobs. It will be divided into 8 parts of 512 vertices each, and each part will be filled in by one job.

Main thread:

  • New Block

  • Lock vertex buffer of 4096 points using the DEFERRED_LOCK flag

  • Set 8 jobs, each filling in 512 points

  • Set rendering states

  • DrawPrimitive

  • Reset wrapper flags

None of the above steps is executed immediately; each is recorded for execution in the next frame.

Jobs and D3D core:

During the next frame, the default block renders until the new block for the particles is reached. At the same time, the 8 jobs that fill the vertex buffer are executed. When the new block is reached and its jobs are complete, the particles are rendered. At the same time, the jobs for the next block are executed.

24.6. Implementing it

Now that we understand how this example works we can go through the steps of implementing it.

First we need to implement our job object.

class PointSpriteParticleJob : public Job
{
public:
    void set( Particle* particles, Moo::VertexXYZDP* pVertex, uint nPoints );
    virtual void execute();

private:
    Particle* particles_;
    Moo::VertexXYZDP* pVertex_;
    uint nPoints_;
};

The job class overrides the virtual execute() method, which is called in the next frame, just before the output of the job is needed by the D3D core.

We also implement a set() method which gets called from the main thread and stores all the information that will be needed in execute().

The execute() method will do two things. It will update the positions of the particles and output the new positions into a vertex buffer. Each given job object will do this for only a part of the particle system and vertex buffer so that the entire task can be divided into several jobs and execute in parallel on several cores.
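A minimal sketch of what such a job might look like follows. Everything here is an assumption made for illustration: the `Job`, `Particle` and `Vertex` types are simplified stand-ins for the engine's own types, the particle update is plain position-plus-velocity integration, and the extra `dTime` parameter of set() is not in the declaration above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins; the engine's Job, Particle and vertex types differ.
struct Job
{
    virtual ~Job() {}
    virtual void execute() = 0;
};

struct Particle { float pos[3]; float vel[3]; };
struct Vertex   { float pos[3]; };

class PointSpriteParticleJob : public Job
{
public:
    // Called on the main thread: only stores what execute() will need.
    void set( Particle* particles, Vertex* pVertex, unsigned nPoints, float dTime )
    {
        particles_ = particles;
        pVertex_ = pVertex;
        nPoints_ = nPoints;
        dTime_ = dTime;
    }

    // Called on a job core: update this job's particles, then write one
    // vertex per particle into this job's part of the vertex buffer.
    virtual void execute()
    {
        assert( particles_ && pVertex_ );
        for ( unsigned i = 0; i < nPoints_; ++i )
        {
            for ( int axis = 0; axis < 3; ++axis )
            {
                particles_[i].pos[axis] += particles_[i].vel[axis] * dTime_;
                pVertex_[i].pos[axis] = particles_[i].pos[axis];
            }
        }
    }

private:
    Particle* particles_ = nullptr;
    Vertex* pVertex_ = nullptr;
    unsigned nPoints_ = 0;
    float dTime_ = 0.0f;
};
```

Because each job touches only its own range of particles and vertices, the jobs can run in any order, or in parallel, without synchronising with one another.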

Now we are ready to use the job class in our rendering code.

We begin this with a new block.

DX::newBlock();

Now we are ready to queue our rendering commands.

First we need to lock the vertex buffer with a deferred lock. To do this, we set the appropriate wrapper flag before locking.

Normally when locking you get a pointer that you can use immediately to write vertex data. However our vertex data will be calculated in the next frame by jobs, so the lock actually has to occur at that time. Therefore we use a deferred lock which returns a pointer now but does not perform the lock until required.

uint32 oldFlags = DX::getWrapperFlags();
DX::setWrapperFlags( DX::WRAPPER_FLAG_DEFERRED_LOCK );

Moo::DynamicVertexBufferBase2<Moo::VertexXYZDP>& vb = Moo::DynamicVertexBufferBase2<Moo::VertexXYZDP>::instance();
Moo::VertexXYZDP* pVertex = vb.lock2( 4096 );

Now we are ready to allocate and set up our jobs. The pointer from the lock is used to set up the jobs.

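JobSystem& jobSystem = JobSystem::instance();

for ( uint i = 0; i < 8; i++ )
{
    // Each job updates and outputs one 512-point part of the buffer.
    // Assumes allocJob() returns a pointer to the newly allocated job.
    PointSpriteParticleJob* job = jobSystem.allocJob<PointSpriteParticleJob>();
    job->set( particles + i*512, pVertex + i*512, 512 );
}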

Finally we can unlock the buffer, reset the wrapper flags and render.

At this point we can render as if the jobs that we have allocated and set are complete, since the following rendering commands will not be executed until that time.

vb.unlock();
uint32 lockIndex = vb.lockIndex();

DX::setWrapperFlags( oldFlags );

vb.set( 0 );
Moo::rc().drawPrimitive( D3DPT_POINTLIST, lockIndex, 4096 );