A look into: AMD ISA from HLSL

After my last post on the Shader Analyser, which outputs AMD ISA (Instruction Set Architecture), I thought it was worth doing a little write-up on what exactly that is and why it may be worthwhile to take the generated code into consideration when doing low-level optimisation.

So to get started we will look at a simple example pixel shader. In all of these examples we will be using shader model 5.0 and building for the Hawaii architecture.

struct PS_INPUT
{
    float4 pos : SV_POSITION;
    float4 tex : TEXCOORD0;
};

float4 psMain(PS_INPUT input) : SV_TARGET
{
	return float4(1,1,1,1);
}

OK, so here we have our very basic pixel shader. It takes an input structure from the vertex shader containing a position and a texture coordinate, uses neither of them, and just writes 1.0 into each channel of the output.

So, we can look at this at three levels: the ASM that is generated by DirectX, the AMD ISA, which is the instructions actually executed by the GPU, and the AMD IL (Intermediate Language), which sits between the two. So let's take a look at each of these:

DirectX ASM

dcl_globalFlags refactoringAllowed
dcl_output o0.xyzw
mov o0.xyzw, l(1.000000,1.000000,1.000000,1.000000)

Here we can see that it is declaring an output register (o0.xyzw), then copying 1.0 into each channel of that register and returning. This is as straightforward as you can get. It makes no use of the position or texcoord data passed through to the shader, as we haven't accessed them at all in the HLSL shader.

AMD ISA

shader psMain

  v_mov_b32     v0, 1.0                                     // 00000000: 7E0002F2
  v_cvt_pkrtz_f16_f32  v0, v0, v0                           // 00000004: 5E000100
  s_nop         0x0000                                      // 00000008: BF800000
  exp           mrt0, v0, v0, v0, v0 done compr vm          // 0000000C: F8001C0F 00000000
  s_endpgm                                                  // 00000014: BF810000

So now things are looking more complex, but don't worry, it is still very simple when you break it down! All of the instructions are described in AMD's ISA documentation. So let's take a look at this line by line.

Our first instruction is v_mov_b32, which the documentation describes as:

Single operand move instruction. Allows denorms in and out, regardless of denorm mode, in
both single and double precision designs.

This means the instruction is just a move, pretty much like the "mov" in the DirectX ASM. Here it is moving the value 1.0 into the register v0.
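The hexadecimal word in the comment next to each instruction is its machine encoding, and pulling it apart shows the same move. The Python sketch below is only an illustration: it assumes the 32-bit VOP1 field layout (source in bits 8:0, opcode in bits 16:9, destination in bits 24:17), that opcode 1 is v_mov_b32 and that source value 242 is the inline constant for 1.0. The name decode_vop1 is made up for this post.

```python
def decode_vop1(word: int) -> dict:
    """Split a 32-bit VOP1 instruction word into its fields (assumed layout)."""
    return {
        "encoding": (word >> 25) & 0x7F,  # 0x3F is assumed to mark a VOP1 instruction
        "vdst": (word >> 17) & 0xFF,      # destination vector register index
        "op": (word >> 9) & 0xFF,         # opcode; 1 is assumed to be v_mov_b32
        "src0": word & 0x1FF,             # source operand; 242 is assumed to be inline 1.0
    }


# The encoding shown next to v_mov_b32 in the listing above.
fields = decode_vop1(0x7E0002F2)
print(fields)
```

Under those assumptions the destination field comes out as 0 (v0) and the source field as 242 (1.0), which matches the disassembly.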

Next we have the more intimidatingly named v_cvt_pkrtz_f16_f32, but not to worry, again we will go to the documentation to see what this is:

Convert two float 32 numbers into a single register holding two packed 16-bit floats.

So in the first instruction we stored a 32-bit value of 1.0 in the register v0. Now we are going to store two 16-bit values in that same register, and both of them are going to be the value we stored in v0 initially, converted to 16 bits. This gives us a register which holds two 16-bit values of 1.0. This is a little strange, but things tend to get a little bit strange the further down you go, as you start seeing things the compiler has done to make the code run more optimally on its hardware.
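To make the packing concrete, here is a small Python sketch of what the conversion does to the bits. It is an approximation rather than a model of the hardware: it assumes the first source lands in the low 16 bits and the second in the high 16 bits, it takes "rtz" to mean round toward zero, and it only handles values that fit in a half float. Both function names are made up for this example.

```python
import struct


def f32_to_f16_bits_rtz(value: float) -> int:
    """Return the 16-bit half-float pattern for value, rounded toward zero."""
    # struct's "e" format rounds to nearest, so step back one unit in the
    # last place whenever that rounding moved the value away from zero.
    bits = struct.unpack("<H", struct.pack("<e", value))[0]
    rounded = struct.unpack("<e", struct.pack("<H", bits))[0]
    if abs(rounded) > abs(value):
        bits -= 1
    return bits


def cvt_pkrtz_f16_f32(src0: float, src1: float) -> int:
    """Pack two floats into one 32-bit value (assumed: src0 low, src1 high)."""
    return (f32_to_f16_bits_rtz(src1) << 16) | f32_to_f16_bits_rtz(src0)


# v_cvt_pkrtz_f16_f32 v0, v0, v0 with v0 holding 1.0
print(hex(cvt_pkrtz_f16_f32(1.0, 1.0)))  # 0x3c003c00
```

A half-float 1.0 is 0x3C00, so the register ends up holding that pattern twice.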

Our next instruction is "s_nop". If you have worked with assembly before this may be familiar: it means no operation. The description from the documentation:

Do nothing. Repeat NOP 1..8 times based on SIMM16[2:0]. 0 = 1 time, 7 = 8 times.
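The repeat count in that description is simple arithmetic, sketched here in Python. The function name is made up, and treating the low 16 bits of the encoded word as the SIMM16 field is an assumption about the encoding.

```python
def s_nop_repeat_count(simm16: int) -> int:
    """Number of no-ops issued: the low three bits of SIMM16, plus one."""
    return (simm16 & 0x7) + 1


# Assumed: SIMM16 is the low 16 bits of the instruction word BF800000.
simm16 = 0xBF800000 & 0xFFFF
print(s_nop_repeat_count(simm16))  # 1
```

So the "s_nop 0x0000" in our shader is a single no-op.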

Now this is even more odd than before, you must be thinking. Why on earth would a shader want to waste an instruction doing nothing? Well, this calls for us to dig further into the documentation, where we find this little bit of information:

Must add an S_NOP between two consecutive S_SETREG to the
same register.

S_SETREG is an instruction that writes data to an internal hardware register, so one reading is that the compiler is inserting a required s_nop because the next instruction is going to write to the same register as the instruction above. However, in this case I believe the s_nop is there to pad this shader to four instructions before the end of the program, and that it has been placed before the export instead of after it as an optimisation.

The next instruction is "exp", which our documentation tells us is the export instruction for this shader program. This is where the shader writes to the render targets. This line is a little more complex than the others, so we can look at it one bit at a time.

The first bit to make sense of is the words at the end: "done compr vm". Each of these is an individual flag. "done" indicates that this is the last output to a render target from this program, "compr" tells the GPU that the data is 16 bits per component rather than 32 bits, and "vm" says that the execution mask for the wavefront is valid; it must be set at least once per pixel shader. I will go into wavefronts and what this means in more detail in a later post.

The next part of this line to look at is "mrt0", which tells the program to write into the first render target. This comes from our HLSL shader, where we set the output of psMain to write to SV_TARGET.
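The flags and the target can also be read straight out of the first encoded word of the export, F8001C0F. The Python sketch below assumes the export field layout (enable mask in bits 3:0, target in bits 9:4, then one bit each for compr, done and vm in bits 10, 11 and 12) and that target 0 is mrt0. The name decode_exp is made up for this post.

```python
def decode_exp(word: int) -> dict:
    """Split the first word of an export instruction (assumed layout)."""
    return {
        "en": word & 0xF,                 # which of the four sources are enabled
        "target": (word >> 4) & 0x3F,     # 0 is assumed to be mrt0
        "compr": bool(word & (1 << 10)),  # 16-bit packed data
        "done": bool(word & (1 << 11)),   # last export from this program
        "vm": bool(word & (1 << 12)),     # valid mask
    }


print(decode_exp(0xF8001C0F))
```

With that layout all three flags come out set, all four sources are enabled and the target is 0, which lines up with "mrt0 ... done compr vm" in the disassembly.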

The last part of the line is the repeated register "v0". This tells the program which register supplies the value for each channel. "v0" currently holds two packed 16-bit values of 1.0. Because of the "compr" flag, as I understand it, only the first two of the four source operands are actually read, each supplying two packed 16-bit components, so all four channels end up as 1.0.
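As a rough check of what the render target receives, this Python sketch unpacks a register holding two half floats. It assumes a compressed export takes red and green from the first source operand and blue and alpha from the second, low half first; that ordering and the helper name unpack_half2 are assumptions for illustration.

```python
import struct


def unpack_half2(register: int) -> tuple:
    """Return the (low, high) half floats packed in a 32-bit register value."""
    low = struct.unpack("<e", struct.pack("<H", register & 0xFFFF))[0]
    high = struct.unpack("<e", struct.pack("<H", (register >> 16) & 0xFFFF))[0]
    return low, high


v0 = 0x3C003C00                  # two packed half floats; 0x3C00 is 1.0
red, green = unpack_half2(v0)    # assumed: first source operand
blue, alpha = unpack_half2(v0)   # assumed: second source operand
print((red, green, blue, alpha))  # (1.0, 1.0, 1.0, 1.0)
```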

Finally, the last line of the AMD ISA is "s_endpgm", which is obviously the instruction to end the program, but to be consistent here is the description from the documentation:

End of program; terminate wavefront.

This tells the GPU to end this program and terminate the wavefront. Pretty straightforward.

So you can see that the ISA is just a lower-level version of the DirectX ASM. It is doing the same things but is a little more explicit about how, and we are beginning to see the quirks of the GPU come through.


The AMD IL will be covered in the next post!