Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assembly optimization for walls #143

Open
Awesoken2 opened this issue Apr 29, 2023 · 21 comments
Open

Assembly optimization for walls #143

Awesoken2 opened this issue Apr 29, 2023 · 21 comments
Labels
enhancement New feature or request

Comments

@Awesoken2
Copy link

Awesoken2 commented Apr 29, 2023

Not really an issue, but an opportunity!

For comparison, here is the existing FastDoom code for texture mapping walls, reordered for simplicity, and showing only the math for texture mapping:

   xor ebx, ebx       ;386:2cc, 486:1cc
   shld ebx, edx, 7   ;386:3cc, 486:2cc
   add edx, ecx       ;386:2cc, 486:1cc
   .. (use ebx as texture offset)

You can easily eliminate 1 clock cycle (cc) on 486 (-2cc on 386!) using 'lea' to perform both an 'add' and 'mov':

   lea edx, [ebx+ecx] ;386:2cc, 486:1cc
   shr ebx, 25        ;386:3cc, 486:2cc
   .. (use ebx as texture offset)

   lea ebx, [edx+ecx] ;386:2cc, 486:1cc
   shr edx, 25        ;386:3cc, 486:2cc
   .. (use edx as texture offset)

You must write out 2 iterations in your unrolling macro, as ebx & edx are being swapped on each iteration. In the setup code, write out an extra pixel so the count is always even.

-Ken S.

@Awesoken2
Copy link
Author

Awesoken2 commented Apr 30, 2023

A similar trick using LEA works for ceilings & floors:

Existing code:

   shld ebx, ecx, 22  ;386:3cc, 486:2cc, ebx:[xxxxxxxxxxUUUUUUuuuuuuuuuuVVVVVV]
   shld ebx, ecx, 6   ;386:3cc, 486:2cc, ebx:[xxxxUUUUUUuuuuuuuuuuVVVVVVUUUUUU]
   and ebx, ebp       ;386:2cc, 486:1cc, ebx:[00000000000000000000VVVVVVUUUUUU]
   add ecx, edx       ;386:2cc, 486:1cc

New code:

   lea edx, [ecx+ebp] ;386:2cc, 486:1cc, ecx:[VVVVVVvvvvvvvvvvUUUUUUuuuuuuuuuu]
   shr ecx, 26        ;386:3cc, 486:2cc, ecx:[00000000000000000000000000VVVVVV]
   shld cx, dx, 6     ;386:3cc, 486:2cc, ecx:[00000000000000000000VVVVVVUUUUUU]
   .. ;use ecx

   lea ecx, [edx+ebp] ;386:2cc, 486:1cc
   shr edx, 26        ;386:3cc, 486:2cc
   shld dx, cx, 6     ;386:3cc, 486:2cc
   .. ;use edx

Notes:

  • I changed the increment register from edx to ebp to make the 2 alternating position registers (ecx & edx) equivalent when it comes to addressing (ebp has some weird cases).
  • The hi&lo 16 bits of the U-V registers (pos & inc) are swapped from the existing code. As a quick hack, use: "rol ecx, 16" and "rol ebp, 16" to compensate.
  • The shld in the new code is using the "new" bits of the low half of the texture position, so that must be compensated for in setup code. Simply subtract the low bits once in setup like this: "sub cx, bp"
  • Warning: on newer CPU's (especially Pentium Pro and above), reading a 32-bit register after writing its 8/16-bit sub register can cause a big penalty known as a partial register stall. I do not believe this affects the 386/486, which is what you care most about. Nevertheless, it would be a good idea to test and compare speed on various CPUs to be absolutely sure.

@RamonUnch
Copy link
Contributor

Wow, it is amazing to see this kind of optimizations after 30 years of DOOM!

I have a Pentium III @ 700MHz with Windows 98SE I can make some benchmarks tests with DOOM in dos mode if needed.
I do not have a 386 nor 486 though. On a Pentium III however it would really not practically matter because we would have max FPS anyway, unless FastDoom started to raise all vanilla limits and allowed loading large wads.

By the way to better display the asm code on GitHub you can use this sequence,

```asm
asm codes gores here
```

ie:

xor ebx, ebx ;386:2cc, 486:1cc
shld ebx, edx, 7 ;386:3cc, 486:2cc
add edx, ecx ;386:2cc, 486:1cc

You can do the same with other languages replacing asm with c cpp etc.
It is really unimportant but it took me ages to figure this out by myself, so I am telling :)

@Awesoken2
Copy link
Author

Thanks for the asm formatting hint. I edited my posts above and they look much better now!

If you want to test frame rate for the above code, it really has to be on a true 386 or 486. A Pentium III will run the wall loops about the same, and the ceiling loops a lot slower due to the partial register stalls.

If you're optimizing for a Pentium or above, here are some suggestions:

  1. The 'shld' instruction is slow and should be replaced with 'rol' (and 'lea').
  2. Unrolling all the way is no longer advantageous. Newer CPUs prefer tight assembly loops that fit in the instruction cache.
  3. Do aligned 32-bit writes to video memory. Easy for ceilings. Yes it's possible for walls (which are rendered in the vertical direction) too! The trick is: you render 4 vertical wall lines simultaneously. To free up registers, use self-modifying code for the constants, especially the 4 v-increments.

Curiously, if you actually did all of the above, you would basically have the Build Engine which uses all of the above techniques. (It's strange how I would know that? ;-P)

I don't know if FastDoom is autodetecting the CPU at startup. If not, I found this simple way to do that in Watcom C (DOS):

static int getcpu (void)
{
   union REGS r;
   r.x.eax = 0x0400; //DPMI get version
   int386(0x31,&r,&r);
   return((r.x.ecx&0xff)*100+86); //Returns: (286),386,486,586,686
}

^ Much easier than installing an invalid opcode handler ; P

Curiously, I never bothered to look at Doom's assembly loops in real detail until recently. I was shocked to find this line which was present in both inner loops (walls & ceilings/floors):

dec [loopcount]

That's a read/modify/write instruction, which takes 6cc on 386, or 3cc on 486. It would have been so easy to free up a register and knock off a few cycles: -4cc on 386, or -2cc on 486 (per iteration). I see FastDoom gets around this by unrolling all the way - a great strategy for 386 & 486. For the Pentium or above though, it's generally better to write tight loops with self-modifying code if you can't fit everything into registers.

@viti95
Copy link
Owner

viti95 commented May 2, 2023

Thanks for the ideas @Awesoken2 ! Pretty interesting optimization using LEA instruction, I'll try it for sure. As for register stalls, the 486 has 1 cycle penalty when writing to a lower part of a register (AL for example) and then using it afterwards fully (EAX). Also there is 1 cycle penalty when decoding an instruction with an immediate value and either an index offset or an immediate displacement.

Right now FastDoom only focuses on low end 486s and 386 cpu's so don't care much about Pentium or similar cpu's. I'll use for sure the getcpu function, I know a trick to use the CR2 register as an scratchpad on 386 processors (386SX and 386DX) that allows 32-bit data copy that is faster compared to memory copies.

BTW I changed the original code from

mov  ebx,edx
...
shr  ebx,25

to

xor  ebx,ebx
...
shld ebx,edx,7

because Texas Instruments 486 CPU's datasheet describe SHLD/SHRD to be 1 cycle faster than SHL/SHR. I still have to try this, as Texas Instruments 486 share same core as Cyrix and ST 486s, and the datasheet of those describe the instruction latency to be the same in both cases.

And one question, the Build engine used a backbuffer for rendering? or did it use direct rendering to VRAM?

@viti95 viti95 added the enhancement New feature or request label May 2, 2023
@Awesoken2
Copy link
Author

The penalty for write AL immediately followed by read EAX can be avoided by putting an unrelated instruction between them. Remember to re-order instructions when putting them back into the real code.

That stinks about the 1cc penalty for using both index & base. I see it in the 486 manual now. I totally missed that one. It basically ruins the lea trick for 486. Oh well, there's still the 386, I guess.

Some good news: the operand size prefix (66h) for the "shld cx, dx, 6", which usually has a 1cc penalty, can be overlapped with the previous instruction if it is multi-cycle (i.e.: shr ecx, 26).

I've never heard of this faster memcpy on 386 using CR2.. how does it work?

I know very little about the non-Intel processors from back then. It's crazy that SHLD would ever be faster than SHL as it's more complicated.

The Build Engine from 1995 or before had all kinds of crazy video modes, including unchained modes with support for every resolution I could find working values for. I would render direct to VRAM when possible - including unchained mode. The code sure was ugly though. Horizontal lines had to be done in 4 passes since using outp() to change the map mask was crazy slow.

Before VESA was prevalent, I made some special modes for Tseng ET4000, S3, and Paradise cards. These modes were all 320x200, but could triple buffer using a simple linear format (like mode 0x13). Supporting these cards required reverse engineering to figure out how to change the 64K window, as well as the high bits of the start address register.

By 1996, VESA had taken over - and it was equal or better than all the other modes, especially once linear memory (usually by using UNIVBE) was supported (for hi-res modes). With some prompting from Apogee, I removed the older modes, most of which were slower anyway. I kept a fallback mode to mode 0x13. The only modes to use a memcpy were this fallback mode, and the hi-res VESA modes when linear wasn't supported.

@viti95
Copy link
Owner

viti95 commented May 3, 2023

The 386SX (and some 386DXs) suffers from huge memory bottleneck, specially if the system doesn't have any cache. Reading and writing to the CR2/CR3 register is faster than doing from memory, even though the number of cycles of the MOV instruction is a bit higher:

mov r32, CR2 ; 6 cycles
mov r32, CR3 ; 6 cycles
mov r32, mem ; 4 cycles (+ mem latency)
mov CR2, r32 ; 4 cycles
mov CR3, r32 ; 5 cycles
mov mem, r32 ; 2 cycles (+ mem latency)

In FastDoom CR2/CR3 registers aren't used, so it's possible to use them as a scratchpad (for example to access dc_source or ds_source on rendering functions). I did some small tests and proved to be faster on my 386SX and cacheless 386DX, but much slower on any other 486 based system . A similar trick can be used with Cyrix based 386/486 cpu's with the debug registers, but it's not always faster as those cpu's have integrated L1 cache.

@Awesoken2
Copy link
Author

I have a new wall loop for 386 & 486. This one has no base+index addressing penalties:

      ;eax:&pal, ebx:Vhi, ecx:Vlo, edx:Vinclo, esi:&tex, edi:&scr, ebp:Vinchi
   add ecx, edx       ;386:2cc, 486:1cc, ecx:[vvvvvvvv vvvvvvvv vvvvvvvv v0000000]
   adc ebx, ebp       ;386:2cc, 486:1cc, ebx:[00000000 00000000 00000000 xVVVVVVV]
   and ebx, 3fh       ;386:2cc, 486:1cc, ebx:[00000000 00000000 00000000 0VVVVVVV]
   mov al, [esi+ebx]
   mov al, [eax]
   mov [edi-(LINE-1)*SCREENWIDTH], al

If you don't mind texels 128-255 being shifted one column to the right, or wasting 2x memory by repeating columns, you can remove the 'and' and change 'adc' to: 'adc bl, dl' (swapping edx & ebp in the process) to keep the upper 24 bits zeroed.

As usual, be sure to interleave instructions in the actual implementation to avoid penalties.

@Awesoken2
Copy link
Author

Awesoken2 commented May 5, 2023

..and here is a new ceiling/floor loop for 386 & 486:

   ;eax:&pal
   ;esi:&tex
   ;edi:&scr
   ;ecx:[  ulo  ..   vhi    ..  ]
   ;edx:[uinclo ..   vinchi ..  ]
   ;ebx:[   0    0   vhi    uhi ]
   ;ebp:[   0    0    0   uinchi]
add ecx, edx     ;386:2cc, 486:1cc, ecx:[uuuuuuuu uu000000 VVVVVVvv vvvvvvvv]
adc ebx, ebp     ;386:2cc, 486:1cc, ebx:[00000000 00000000 xxxxxx00 0xUUUUUU]
mov bh, ch       ;386:2cc, 486:1cc, ebx:[00000000 00000000 VVVVVVvv 0xUUUUUU]
and ebx, 0xfc3f  ;386:2cc, 486:1cc, ebx:[00000000 00000000 VVVVVV00 00UUUUUU]
mov al, [esi+ebx]
mov al, [eax]
mov [edi+PLANE+PCOL], al

Notes:

  • Ceiling&floor textures require a special cache, where textures are stored with a pitch of 1024 (instead of 64). The way it works is: you would allocate a 64KBy block for every group of 16 textures (assuming they're 64x64, which they are in Doom).
  • To support 16-bit writes, you can replace EDX with ESP to free up DL & DH. Then disable interrupts!

@viti95
Copy link
Owner

viti95 commented May 5, 2023

Is it possible to use the ESP register as a additional register if the interrupts are disabled? I've been trying to use it but I get random lockups.

@Awesoken2
Copy link
Author

There are 2 ways to use ESP:

  1. Without disabling interrupts, as a small iteration counter:
espbak: dd 0

   ..
   mov dword ptr ds:[espbak], esp
   mov dword ptr ds:[endp-4], (esp-num_iters*4) ;pseudocode!
   jmp short beg                                ;for self-modifying code

beg:  ..
      mov ?, [esp+0x88888888]                   ;optionally address an array
      ..
      sub esp, 4
      cmp esp, 0x88888888                       ;self-modified end pointer
endp: jnc short beg

   mov esp, dword ptr ds:[espbak]
   ..

Leave ESP as is for a starting value. You can then subtract 4 from it per iteration as your counter. Compare ESP to a self-modified value (assuming the entire purpose is you're out of registers) to decide when to terminate the loop. Finally, restore to the original value. You don't have to disable interrupts here since ESP always points to a valid clobber-able stack. Just be careful not to count too high, or else you'll run out of stack space. A few hundred iterations should be safe.

  1. Disabling interrupts, as a general purpose register:
espbak: dd 0
..
   cli 
   mov dword ptr ds:[espbak], esp
   
   mov esp, (whatever) ;safely clobber esp here

   mov esp, dword ptr ds:[espbak]
   sti

Notes:

  • By disabling interrupts, you will affect the timing of various things, so make sure the loop never runs for very long. A single texture-mapped line should be ok for most things, however it will ruin the quality of any sound that relies on a high speed interrupt, such as doing PWM on the PC Speaker. For this reason, I try to avoid using it whenever possible.
  • You can't use any PUSH/POP (in the normal way) inside the block as these instructions rely on ESP pointing to your stack.
  • You must backup/restore ESP to/from a global address. Using "PUSH ESP/POP ESP" fails for obvious reasons.

Curiously, in modern Windows code, ESP can be used freely - i.e. interrupts do not need to be disabled. I think it's because hardware interrupts are running in a different thread.

@Awesoken2
Copy link
Author

While reading my trusty "i486 Microprocessor Programmer's Reference Manual", page 844 (G-8), I noticed this stinker under "PREFIX OPCODES":

"On either processor, all prefix opcodes, including OFh, segment override, operand size/addressing, bus-lock, repeat, etc. require an additional clock to decode. This clock can be overlapped with the execution of the previous instruction if it takes more than one clock to execute."

That means your favorite instructions - SHLD & SHRD - are probably taking 1cc longer than you think, at least on Intel processors.

Too bad I don't have a real 486 to actually test this with. The ancient 486SX-20 laptop I thought I had just lost its backlight - and apparently the VGA connector doesn't work either. Boo. So I installed both PCEm and 86Box, assuming they'd be more accurate than DosBox. Well, I found that they really aren't. They just stupidly do simulated delays based on the instruction pneumonic, completely ignoring things like whether parameters are register or memory form. Forget about simulating penalties and they even say they don't simulate cache behavior. Oh well.

@RamonUnch
Copy link
Contributor

Boo. So I installed both PCEm and 86Box, assuming they'd be more accurate than DosBox. Well, I found that they really aren't. They just stupidly do simulated delays based on the instruction pneumonic, completely ignoring things like whether parameters are register or memory form. Forget about simulating penalties and they even say they don't simulate cache behavior. Oh well.

Sorry for your hardware loss.
PCem is better than DOSBox but far from perfect indeed. The point of those emulators is to have a wide software support and DOSBox does not care at all about accurate details emulation. PCem on the other hand is more focused on accuracy (at least from what they claim) but is still far form its goal especially in cache handling as you said, I saw this by testing with SpeedSys and you only get a flat line on the memory bandwidth test.

Speedsys

Hopefully the situation will improve for now I see PCem as a tool to use old software not really to develop new software for old hardware.

@viti95
Copy link
Owner

viti95 commented May 9, 2023

While reading my trusty "i486 Microprocessor Programmer's Reference Manual", page 844 (G-8), I noticed this stinker under "PREFIX OPCODES":

"On either processor, all prefix opcodes, including OFh, segment override, operand size/addressing, bus-lock, repeat, etc. require an additional clock to decode. This clock can be overlapped with the execution of the previous instruction if it takes more than one clock to execute."

That means your favorite instructions - SHLD & SHRD - are probably taking 1cc longer than you think, at least on Intel processors.

Too bad I don't have a real 486 to actually test this with. The ancient 486SX-20 laptop I thought I had just lost its backlight - and apparently the VGA connector doesn't work either. Boo. So I installed both PCEm and 86Box, assuming they'd be more accurate than DosBox. Well, I found that they really aren't. They just stupidly do simulated delays based on the instruction pneumonic, completely ignoring things like whether parameters are register or memory form. Forget about simulating penalties and they even say they don't simulate cache behavior. Oh well.

I've tried on a Intel 486DX4 75MHz and the difference between MOV+SHR and XOR+SHLD is basically none. Still have to try on a Cyrix based 486 to see if the datasheets are right. Also I have to test any 386 to see if there is any performance difference.

@viti95
Copy link
Owner

viti95 commented May 14, 2023

@Awesoken2 I've been able to use your idea to render columns using interleaved code, the code is in this branch: https://github.com/viti95/FastDoom/tree/ken-silverman-optimizations

For now only is implemented for high detail and Mode Y executables (LEA+SHR), but this will allow me to check on real hardware the performance difference.

@viti95
Copy link
Owner

viti95 commented May 14, 2023

Some benchmarks (only for R_DrawColumn high detail):

UMC Green 486 (50MHz):
Old code: 42.732 fps
New code (LEA+SHR): 43.307 fps

Intel 486DX2 (66MHz):
Old code: 37.651 fps
New code (LEA+SHR): 37.819 fps

ST 486 DX2 80MHz (Cyrix):
Old code: 41.397
New code (LEA+SHR): 40.785

@Awesoken2
Copy link
Author

That's +1.3%, +0.4%, and -1.5% (respectively). Are those fps numbers consistent across runs? If you're using the 140Hz interrupt counter and doing a multi-minute test, that's kind of slow and crappy. With a more precise timer, you could get a number instantly and not have to wait minutes per test. If the RDTSC instruction were available, that would be perfect. Too bad it doesn't exist on a 486. A good alternative is to use the 1.193MHz PIT, for example:

   //Use PIT (1193181Hz) channel 2 for precise timing. Notes:
   //* will interfere with PC Speaker code that uses PIT timer 2.
   //* counter is 16-bit, so FPS < 18.2Hz requires some extra code to resolve.
   //* PIT counter decrements - remember to negate!
static void hitim_start(void) { outp(0x61,inp(0x61)|1); outp(0x43,0xb4); outp(0x42,0); outp(0x42,0); } //period=65536
static int  hitim_get  (void) { int i; outp(0x43,0x84); i = (int)inp(0x42); return (int)inp(0x42)*256 + i; }
static void hitim_stop (void) { outp(0x61,inp(0x61)&0xfc); }

You can use the 140Hz interrupt counter to resolve the upper 16 bits like this:

   //int gametics = 0;
   //interrupt IRQ8_timer_handler() { .. gametics++; .. } //140Hz

   int hitic0 = tim_get(); int gametic0 = gametics; //read 1st time stamp (or use previous frame's time stamp)
   .. /*stuff to be timed*/ ..
   int hitic1 = tim_get(); int gametic1 = gametics; //read 2nd time stamp

   int dt = (gametic1-gametic0)*(1193181/140); //rough approx; multiplier should match period of PIT channel 0
   dt -= 32768; dt += (((hitic0-hitic1) - dt)&0xffff); //quantize to best 16-bit interval. Note: PIT counts backwards!
   //printf("%.3f fps",1193181.f/(float)dt); //show immediate fps; better to fifo dt's & display median. ; )
}

BTW, I posted a 2nd wall loop (about halfway down the thread) that has the potential to be even faster, and should definitely be tested as well. Perhaps we should call them:

  1. xor/shld (original)
  2. lea/shr (flipflop)
  3. add/adc/and (noshift)

@viti95
Copy link
Owner

viti95 commented May 15, 2023

Yes I repeat the benchmarks multiple times, so the result is pretty much stable. And yeah benchmarks are slow on Doom, as every frame has to be rendered to get the final result.

BTW there is an issue with the later wall optimization (that's why I haven't posted results with it). EBP register is used to jump to the unrolled texture mapping, so it cannot store an initial value:

jmp  [scalecalls+4+ebp*4]

Maybe it's possible to patch the instruction in realtime in order to free the EBP register, I'm still trying.

@viti95
Copy link
Owner

viti95 commented May 15, 2023

Some unexpected results for LEA+SHR. I've tested on my 386SX@33 MHz on potato detail (same code as high detail) and this is the result:

original: 14.224
flipflop: 14.085

So also a bit slower. I think is this because the 386 has this two penalties:


DECODE-AFTER-JUMP OVERHEAD

When a jump is performed, a 386 spends some extra time decoding
the next instruction to execute and flushing the pipeline.

THE FIRST INSTRUCTION requires exactly ONE EXTRA CPU CYCLE
for every COMPONENT ( prefixes, opcode, mod/rm , sib ,offset, immediate value )
of the instruction. Notice i said COMPONENT!!!
An 8bit displacement or a 32bit one, count always as an 1 extra cycle.


SHORT INSTRUCTIONS

Short instructions are loaded and decoded faster than longer ones
and since 386 has no internal cache and less memory access bandwidth
than a plain 486, this helps a lot.

@Awesoken2
Copy link
Author

Awesoken2 commented May 15, 2023

Ah. I wasn't thinking about that when allocating registers. Here are some easy solutions:

  1. Combine texture increments (edx:Vinc_lo & ebp:Vinc_hi) into the same register since their bits don't overlap. The extra amount being added to the low end of Vinc_lo (starting at bit 7) will never be noticed. You could use a 'rol edx, 7' to initialize the increment nicely.

  2. Store jump target on the stack:

   ..
   lea ebx, [scalecalls+4+ebp*4]
   push ebx

   mov ebx, edx
   shr ebx, 25
   jmp dword ptr [esp]

..

vscale1:
   add esp, 4
   (pop, etc.)
   ret

I don't see how a jump penalty would cause a noticeable slowdown, as there are no jumps other than the initial one into the big unrolled loop. Perhaps there's a code alignment problem? I wonder if sticking nop's after the align 4 would help.

Yeah, the lea instruction may have a 1cc penalty for using an index, so it wouldn't surprise me if it doesn't run any faster. For this reason, I think the 'noshift' loop is a lot more promising, especially on a 386.

@viti95
Copy link
Owner

viti95 commented May 15, 2023

I've solved the EBP issue for now storing it in a variable in ram and jumping using that variable. Now I'm trying to add the new optimized wall loop, but I don't understand well how it works, or at least well what each variable means.

I guess &pal is the palette ptr, &tex the texture ptr and &scr the VRAM ptr, but the other ones no idea 😅

@Awesoken2
Copy link
Author

Here's a picture of the texture registers I use in the noshift loop:

               31...24  23...16  15.....8 7......0
ecx: V_lo:    [vvvvvvvv vvvvvvvv vvvvvvvv v-------]
ebx: V_hi:    [-------- -------- -------- -VVVVVVV]

edx: vInc_lo: [iiiiiiii iiiiiiii iiiiiiii i-------]
ebp: vInc_hi: [-------- -------- -------- -IIIIIII]

key:
   lower case: lower bits / fractional part
   upper case: upper bits / your 7-bit texel coordinate
   -: don't care (could be 0 or garbage before being masked off)

The idea is the same as adding a 64-bit number using two 32-bit registers:

add eax, ecx  ;add lo 32 bits
adc ebx, edx  ;add hi 32 bits, using carry flag from previous instruction

The only difference is I'm using 32 bits in the middle (bits 7 to 38).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants