Code Gems

Welcome to Code Gems! Here follow some nice tricks for the Intel processor series, based on my articles in the Imphobia diskmagazine. My aim is to introduce those few-byte-long jewels of assembly coding which undoubtedly prove:

"There's always a better way."

This page contains the revised versions of my Imphobia articles. I left out some obviously boring topics and corrected some others. If you still find errors, feel free to contact me.

A respectable part of this was debugged out of various products, while other parts were worked out experimentally by me or by one of my friends; and I'm pretty sure that You have known many of them before. So there are no unambiguous credits for this.


Here's the menu (internal links to this page):

Loop using ECX in 16-bit code
Rejecting JUMPS
Nested loops
Multi-segment REP STOS/MOVS
Optimizing with ESP
REP zeroes CX
Puzzle
A nice tweaked mode operation
XCHG instead of MOV and PUSH/POP
Short compare
Checking if a register contains 8000h
Tiny Little Instructions
Writing a two-digit number
HLT
Macro fun
Optimizing for coding time
Puzzle Solution
LEA strikes back
Aligned fill (Kicking out conditional jumps, part I)
Use ES
Introducing a new video resolution
Fast way to clear the screen
Uncle Ben's wisdoms of the month
Intel processor bug???
Calculating the absolute value of AX
Pixel drawing in protected mode
Simple recursive calls
Hardware scroll with one page
Gouraud shading - 2 instructions/pixel
Demand paging on SVGA boards in PM
Filling a register with the Carry flag
BSWAP with 16-bit register
Operand sizes
Rounding
Penalties on the 486
Aligning inner loops
Aligning a memory pointer
Kicking out conditional jumps, part ][

(The Crab)
PHEW!
Dis is what I call a menu!


Loop using ECX in 16-bit code

By default, in 16-bit code the LOOP instruction uses CX for counting. But it's possible to switch to ECX: just put an address-size prefix (a 67h byte) before the LOOP. The same goes for JCXZ. Turbo Assembler has built-in instructions for this: LOOPW and LOOPD. The first one always uses CX regardless of the current segment size, the other uses ECX. Of course, all versions like LOOPDE, LOOPDNE and JECXZ are also available.


Rejecting JUMPS

It's nice that the 386 knows long conditional jumps - but there is no long LOOP. When a long LOOP is needed (and JUMPS is on), TASM compiles this:


	loop	cycle_temp
	jmp	cycle_end
cycle_temp:
	jmp	cycle_start
cycle_end:
From the optimization point of view (both size & speed) this isn't good. What I do is keep JUMPS turned on until the final version, then compile without JUMPS and fix the remaining LOOPs by hand. Be careful with using JUMPS in 286-compatible code too. With a little brainwork another dozen bytes can be saved. Take a look at this piece of initializing code:

test_config:
	[VGA checking code]
	jne	bad_config
	[286 checking code]
	jne	bad_config
	[mouse checking code]
	jne	bad_config
	[soundcard checking code]
	jne	bad_config
	...
If bad_config is too far from this code, every conditional jump will be expanded into two instructions. So if we put a

bad_config_collector:
	jmp	bad_config
instruction close enough to TEST_CONFIG and replace all JNE BAD_CONFIG with JNE BAD_CONFIG_COLLECTOR, then we save another few bytes. Of course, this pays off only when BAD_CONFIG can't be brought any closer.

Nested loops

Sometimes there's a need for little nested loops. One solution:


	mov	cl,outer_cycle_num
outer_cycle:
	[outer cycle code]
	mov	ch,inner_cycle_num
inner_cycle:
	[inner cycle code]
	dec	ch
	jne	inner_cycle
	loop	outer_cycle
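This is two bytes shorter than the DEC CL / JNE combination. It works because CH is always zero when the inner loop ends, so CX equals CL when the LOOP executes. It was invented by TomCat / AbaddoN while developing a boot sector intro.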

Multi-segment REP STOS/MOVS

In flat 16-bit real mode it's possible to use multi-segment block movement. It uses ECX for counting and ESI / EDI for addressing.


ESTOSD  macro
	db	67h
	stosd
endm
For example:

	xor	eax,eax
	mov	ecx,100000h
	mov	edi,200000h	; STOS writes to ES:EDI
	rep	estosd
This code clears four megabytes of memory. It's possible to write the appropriate macros for stosb, stosw, movsb, scasb, cmpsb,...

Optimizing with ESP

Using ESP as a general-purpose register isn't common, because normally the interrupts would have to be disabled. But check this out: in real mode (and sometimes in protected mode) the stack operations ignore the upper word of ESP. (Except if a protected mode program forgot to reload the segment rights & limits. This is why I recommend restoring the default real mode settings.) So, when SP=0000, the first word to be pushed will be placed at SS:FFFE. Ergo if we initialize ESP to 00010000h, we have a wonderful 32-bit number, stack operations refer to the top of the stack segment, and interrupts can stay enabled. For example, let's assume we want a big nested loop (ECX is the only free register & ESP=00010000h):


	mov	ecx,(outer_num)*10000h
outer_cycle:
	[outer cycle code]
	mov     cx,inner_num
inner_cycle:
	[inner cycle code]
	loop    inner_cycle
	sub     ecx,esp
	jne     outer_cycle
This can be combined with the other nested loop method. It is also a possible technique for using the upper words of the 32-bit registers without a couple of SHRs. The disadvantage is that CX must be zero when the SUB ECX,ESP occurs. But if we restrict the usage of the upper word to 15 bits, the lower word can be anything. Example:

	mov	ebx,(cyclenum-1)*10000h
cycle:
	[cycle code]
	sub	ebx,00010000h
	jns	cycle
BX can contain any value; it won't be touched. Cyclenum-1 can be max. 8000h. Another small thing concerning the stack: on the 386+ every instruction which modifies SS disables the interrupts for the next instruction. So we can save that CLI/STI pair.

REP zeroes CX

Usual problem: mem->mem copy. DS:SI, ES:DI, and CX are prepared but


	rep	movsb
is slow... And

	shr	cx,1
	jnb	copyeven
	movsb
copyeven:
	je	copyready
	rep	movsw
copyready:
is also slow... Then comes the light:

	shr	cx,1
	rep	movsw
	adc	cx,cx
	rep	movsb
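sounds good: REP MOVSW leaves CX zero and doesn't touch the flags, so ADC CX,CX turns the Carry from the SHR into a count of 0 or 1 for the final REP MOVSB. I found it in the SSI Spring '94 Software demo by Future Crew. Yes, I debugged! And it was worth it... Remember, LOOP also zeroes (E)CX.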
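
The flow of the Carry through the four-instruction version can be traced with a small C model. This is only a sketch: the function name copy_words_then_byte is made up for illustration, and the two loops merely stand in for REP MOVSW and REP MOVSB.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of: shr cx,1 / rep movsw / adc cx,cx / rep movsb.
   Returns the number of bytes copied. */
size_t copy_words_then_byte(uint8_t *dst, const uint8_t *src, uint16_t cx)
{
    size_t copied = 0;
    unsigned carry = cx & 1u;      /* SHR CX,1 moves the low bit into Carry  */
    cx >>= 1;                      /* CX = number of whole words             */

    while (cx != 0) {              /* REP MOVSW: the flags stay untouched    */
        dst[copied] = src[copied];
        dst[copied + 1] = src[copied + 1];
        copied += 2;
        cx--;
    }

    /* REP always ends with CX = 0, so ADC CX,CX gives 0 + 0 + Carry. */
    cx = (uint16_t)(cx + cx + carry);

    while (cx != 0) {              /* REP MOVSB: copies the odd byte, if any */
        dst[copied] = src[copied];
        copied++;
        cx--;
    }
    return copied;
}
```

With a count of 7 the model copies three words and then one byte; with an even count the second loop never runs.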

Puzzle

Let's assume that EAX contains 0, except for the lowest 8 bits (AL). How many instructions are needed to fill the upper 3 bytes of EAX with AL (without any precalculated table)? E.g. if EAX=000000e3, it should be transformed to e3e3e3e3. If you wish to work it out yourself, think about it a little before you read further.


A nice tweaked mode operation

Let's put one line of a raw picture into tweaked-mode video memory, switching the write plane for every pixel:

	mov	al,2
	mov	dx,3c4h
	out	dx,al
	inc	dx
	mov	al,11h
	mov	cx,linelength

onerow:
	out	dx,al
	movsb
	rol	al,1
	adc	di,0ffffh
	loop	onerow
Actually this is a very slow way to do it; I just wanted to show its philosophy.
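
The interesting part is the ROL / ADC pair: the plane mask in AL walks through 11h, 22h, 44h, 88h, and only when it wraps around does the Carry let DI keep the increment made by MOVSB. A minimal C model of one loop iteration, assuming a made-up function name tweaked_step (SI and the actual port writes are left out):

```c
#include <stdint.h>

/* Hypothetical model of one iteration of the 'onerow' loop.
   mask is AL (the plane mask sent to the port), di is the video offset. */
void tweaked_step(uint8_t *mask, uint16_t *di)
{
    unsigned carry;

    *di = (uint16_t)(*di + 1);                 /* MOVSB increments DI        */
    carry = (*mask >> 7) & 1u;                 /* ROL AL,1: bit 7 goes to CF */
    *mask = (uint8_t)((*mask << 1) | carry);
    *di = (uint16_t)(*di + 0xFFFFu + carry);   /* ADC DI,0FFFFh              */
}
```

So DI steps back after the first three pixels of each group and moves on to the next byte only after the fourth one.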

XCHG instead of MOV and PUSH/POP

Note that


	xchg	reg,ax
takes only 1 byte, and TASM compiles XCHG ANYREG,AX even if we write XCHG AX,ANYREG. And

	xchg	ax,ax
is nothing else but the good old NOP. (Protected mode freaks may think of XCHG EAX,EAX :-) Sometimes, when there is a free register, a double XCHG is better than a PUSH/POP pair.

Short compare

If a routine returns its result in a register and the value is 0 on error and 1 on success, the ordinary


	or	ax,ax
	je	error
can be replaced with the shorter

	dec	ax
	jne	error
For example, the XMS driver reports errors this way. Is it fair that testing whether a register is 0 is longer than testing whether it is 1 or -1? :-(

Checking if a register contains 8000h

(or 80000000h):


        neg     reg
        jo      it_was_8000

Tiny Little Instructions

CBW/CWD/CDQ/CWDE are short ways to clear/set AH/DX/EDX/the upper word of EAX when the MSB of AL/AX/EAX/AX is known in advance.


Writing a two-digit number

Let's suppose we want to write out a two-digit number in any numerical system with a base in [2..10]. The value comes in AL, the base of the numerical system in AH.


WRITE2DIGIT     macro
local	zeros,convert

	mov	byte ptr aam_operand,ah
	call	convert
zeros	db	'00$'

convert:
aam_operand	equ $+1
	aam
	xchg	ah,al
	add	word ptr zeros,ax
	mov	ah,9
	pop	dx
	int	21h
endm
See it from a debugger ;-)
When we want to write decimal numbers only, it's unnecessary to rewrite the AAM's operand. Morals of the macro:
1. AAM has an operand, which is 0ah by default. It can be redefined.
2. Don't forget to purge the prefetch queue with a JMP or CALL when using self-modifying code near IP.
3. See Ralf Brown's Interrupt List for more juicy undocumented functions.
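
What the patched AAM does for the macro can be written down in a few lines of C. This is a sketch under assumptions: the name two_digits is made up, and the value is assumed to be smaller than base*base so that it really fits in two digits.

```c
#include <stdint.h>

/* Hypothetical model of AAM with an arbitrary operand 'base' (2..10):
   AH = AL / base, AL = AL % base. The two ASCII digits go to out[0..1]. */
void two_digits(uint8_t value, uint8_t base, char out[3])
{
    uint8_t ah = (uint8_t)(value / base);  /* AAM: quotient goes to AH  */
    uint8_t al = (uint8_t)(value % base);  /* AAM: remainder goes to AL */

    out[0] = (char)('0' + ah);             /* high digit, printed first */
    out[1] = (char)('0' + al);             /* low digit                 */
    out[2] = '\0';
}
```

The macro gets the same byte order with XCHG AH,AL before adding AX to the '00' string: the quotient has to land in the lower byte, which is the first character.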

HLT

HLT can be useful! It's very easy to make timings with it:


	mov	cx,18
	hlt
	loop	$-1
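This was a roughly 1-second delay: HLT waits for the next interrupt, and the timer interrupt arrives about 18.2 times per second (other interrupts shorten the delay).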

Macro fun

Let me ask You one of my favourite questions! Which programming language is this:


	Writeln('Hello, world!');
Pascal...?
No.
This is assembly.
Here's the solution:

WRITELN macro   _string
local   _afterstring

	call	_afterstring

db	 _string,13,10,'$'

_afterstring:
	mov	ah,9
	pop	dx
	int	21h
endm
This idea was taken from Silent's Planet Zzyqxhaycom BBS advert.

Optimizing for coding time

Look at this:


b	equ	byte ptr
w	equ	word ptr
o	equ	offset
...
E.g.

	movzx	cx,b ds:[80h]
With these abbreviations some typing time can be saved.

Puzzle Solution

A possible solution for the riddle:


	imul	eax,01010101h
(Now you can kill me.)
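
For the doubters, the multiplication can be checked in C. A minimal sketch, with replicate_byte being a made-up name: multiplying by 01010101h adds up four copies of AL, shifted by 0, 8, 16 and 24 bits, and since AL is below 100h the copies never overlap.

```c
#include <stdint.h>

/* Hypothetical model of IMUL EAX,01010101h with EAX = 000000xxh. */
uint32_t replicate_byte(uint8_t al)
{
    return (uint32_t)al * 0x01010101u;
}
```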

LEA strikes back

Most of us love the LEA instruction because of its wide usability. One of its possible uses is multiplication; this one multiplies EBX by five:


	LEA	EBX,[EBX+EBX*4]
Is it the fastest way in real mode when the upper word of EBX is useless anyway? No. The machine code of this instruction is 66,67,8d,1c,9b. As You can see, it contains TWO prefixes, 66 and 67: both the operand- and the address-size prefix. What happens if the first one is missing, like in

	LEA BX,[EBX+EBX*4] (67,8d,1c,9b) ?
- the upper word of EBX isn't changed
- instruction is shorter by one byte
- TASM understands this form too :-)
- ...and it's faster!
Let's take a look from the other side:

	LEA	EAX,[SI]
This does the same as

	MOVZX	EAX,SI !!!
And it's shorter & quicker in both 16- and 32-bit code... What a pity that only a few register combinations fit between the brackets: [BX], [BP], [SI], [DI], [BX/BP+SI/DI+displacement]. In 16-bit code there's a similar trick, LEA with a 16-bit displacement. For example,

	LEA EAX,[1234h]
which clears the upper word of EAX too but is shorter than

            MOV EAX,1234h.
Hell, TASM doesn't understand the immediate LEA; it must be hardcoded each time.

Aligned fill (Kicking out conditional jumps, part I)

Back to the stone age: filled vectors. The dirtiest solution for filling a horizontal line (instead of a rep stosb) is probably this:


	test	cl,1
	je	_one
	stosb
_one:
	test	cl,2
	je	_two
	stosw
_two:
	shr	cx,2
	rep	stosd
Generally it's a really time-wasting way. A doubleword written to memory may take some extra cycles if it isn't aligned on a dword boundary. Writing ONE doubleword MISALIGNED may take as much time as writing TWO doublewords ALIGNED! So here follows a horizontal line filler which writes everything completely aligned without any conditional jumps (eax = color, cx = number of bytes to fill, es:di -> target):

mov	bx,cx	; Save CX

xor	cx,cx	; Put 1 to CX if it
test	bx,bx	; wasn't 0, else leave
setne	cl	; it zero

and	cx,di	; Leave CX 1 if DI is
sub	bx,cx	; odd, else clear it and adjust BX
rep	stosb	; Fill one byte if DI was odd
cmp     bx,2    ; Put 2 to CX if we
setnb	cl	; need to fill two or
add	cx,cx	; more bytes, else 0

and	cx,di	; Clear CX if DI is on
sub	bx,cx	; dword boundary, else leave it & adjust BX
shr	cx,1	; Fill one word (if CX isn't 0)
rep	stosw

mov	cx,bx	; Put the number of
shr	cx,2	; remaining bytes to
rep	stosd	; CX and fill dwords

and	bx,3	; Fill the rest
mov	cx,bx
shr	cx,1
rep	stosw
adc	cx,cx
rep	stosb
Is it really faster than a rep stosb? Not always. Only when enough bytes have to be filled - from around 10 upwards. And of course it can be even quicker with conditional jumps. But without those it's so nice, eh...?
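
To see how the jump-free filler splits a line, here is a C model of its counter arithmetic only (no memory is written). It is a sketch: fill_plan and plan_fill are made-up names, and the model returns how many bytes, words and dwords each REP STOS of the routine would store.

```c
#include <stdint.h>

typedef struct {
    unsigned head_byte;   /* 0 or 1: count of the first REP STOSB */
    unsigned head_word;   /* 0 or 1: count of the first REP STOSW */
    unsigned dwords;      /* count of the REP STOSD               */
    unsigned tail_word;   /* 0 or 1: count of the final REP STOSW */
    unsigned tail_byte;   /* 0 or 1: count of the final REP STOSB */
} fill_plan;

/* Hypothetical model: di = target offset, count = bytes to fill (CX). */
fill_plan plan_fill(uint16_t di, uint16_t count)
{
    fill_plan p;
    uint16_t bx = count;
    uint16_t cx;

    cx = (bx != 0);            /* XOR CX,CX / TEST BX,BX / SETNE CL  */
    cx &= di;                  /* stays 1 only if DI is odd          */
    bx -= cx;
    p.head_byte = cx;
    di += cx;                  /* REP STOSB advances DI              */

    cx = (bx >= 2) ? 2 : 0;    /* CMP BX,2 / SETNB CL / ADD CX,CX    */
    cx &= di;                  /* stays 2 only if DI is not on dword */
    bx -= cx;
    p.head_word = cx >> 1;     /* SHR CX,1                           */
    di += cx;                  /* REP STOSW advances DI              */

    p.dwords = bx >> 2;        /* MOV CX,BX / SHR CX,2               */
    bx &= 3;                   /* AND BX,3                           */
    p.tail_word = bx >> 1;     /* SHR CX,1, the low bit goes to CF   */
    p.tail_byte = bx & 1;      /* ADC CX,CX                          */
    return p;
}
```

For example, 10 bytes starting at an odd offset become one byte, one word, one dword, one word and one byte.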

Use ES

Are You bored of writing


	MOVS	WORD PTR ES:[DI],ES:[SI]
when you want to redefine the source segment register? You can use

	SEGES
	MOVSW
instead... SEGDS, SEGES, etc. are built-in macros in TASM; they are compiled as DS:, ES:, etc. prefixes.

Introducing a new video resolution

In 256-color modes using two screen pages was always a big pain. We had to deal with chain-4 mode or VESA or SVGA registers. Basically, on a 'standard' VGA card in 'chunky' mode there's no way to reach the card's memory above 64k. So what can we do with the usable 64k? If we want to use it for two screen pages, one page eats 32k; which resolution is that enough for? Well, a resolution of 256*128 is a possible choice (256*128 = 32768 bytes). With this we can handle two pages... How to initialize this mode:


	mov	ax,13h
	int	10h
	mov	cx,11
	mov	dx,03d4h
	mov	si,offset tweaker
	rep	outsw

radix	16
tweaker	dw	0e11,6300,3f01,4002
	dw	8603,5004,0d07,5810
	dw	0ff12,2013,0915
After this there's a small window in the middle of the screen with 256*128 pixel dimensions. Pixel drawing will be quite easy because of the horizontal length :-) This resolution is fully laptop-incompatible and some weird monitors probably won't accept it (however, it never happened in my tests); also, some VGA cards won't love the rep outsw (Hi Jmagic ;-) This mode was used in Technomancer's UnderWater demo and Fish intro. (Thanks for detecting the bug, pal!)

The next topic was written by LEM - jeanmarc.leang@ping.be

Fast way to clear the screen

(by using less colors in 256c mode)

Let's say You use 32 colors: the first 32, and put black into the remaining registers. In the next frame put black into the first 32 registers and set the 32-color palette in the next 32 registers, and so on... The problem: after 8 frames you'll reach the end of the palette and go back to the first 32 colors (if You use 1 color/frame you'll reach the end of the palette after 256 frames, of course), so you'll see the old crap you left 8 frames earlier...

Hehehe,
DID YOU REALLY THINK THAT WAS ALL ???

You can actually clear the screen with that technique!
Let's say You want to draw points in 1 color (it works for 1 -> xx colors !!!). You take 200 colors for the trick, which leaves 56 colors; use them for a logo, for example, with a size of 64*200 (any size, this is just to show you it works) and put it somewhere on the screen (let's say on the left). Now You've got 200 colors to play with, so you can display points during 200 frames before having garbage on the screen, RIGHT?

WRONG!!! You can display points forever without garbage!


frame   0 : col   0 on,                  clear line   0
frame   1 : col   1 on,  col   0 black,  clear line   1
...
frame 199 : col 199 on,  col 198 black,  clear line 199
Go back to color 0: now You have actually cleared 200 lines in 200 frames! Remember, color 0 was used only during the 1st frame, so when you're going to use it again, there won't be any color 0 on the screen anymore (there will be colors 1 -> 199, but they're all black). Of course you can use 2, 10, x colors (and clear the screen in 200/x times; the problem is that you have to set/reset a lot of colors every frame) AND keep some colors for a nice smoothed picture (of course You CANNOT draw on it with the tricky technique.)

Uncle Ben's wisdoms of the month

Remember, a NEG + DEC pair equals a NOT. And DEC/INC don't change the Carry flag. You may need it some day...


Intel processor bug???

I've faced (again...) an interesting problem during the development of a 32-bit interface. Let's say we don't need interrupts 0..7, so there are no interrupt gates in the memory where the IDT starts. The first valid int gate is No. 8, which is at the IDT's base+40h. The problem is that when an interrupt occurs in 32-bit protected mode and the interrupt handler's code is at the beginning of the IDT (in those unused 40h bytes), the processor shuts down, but when the int handler's code is somewhere else, everything is okay. Do you have any ideas...?


Calculating the absolute value of AX

This wonderful 'gem' was developed by Laertis / Nemesis.


	cwd
	xor	ax,dx
	sub	ax,dx
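
Why this works: CWD fills DX with the sign of AX, so for a negative number the XOR inverts every bit and the SUB of -1 adds one, which is exactly the two's complement negation; for a positive number both steps do nothing. A minimal C model, with abs16 being a made-up name:

```c
#include <stdint.h>

/* Hypothetical model of CWD / XOR AX,DX / SUB AX,DX.
   The result is returned as the unsigned 16-bit pattern left in AX. */
uint16_t abs16(int16_t ax)
{
    uint16_t dx = (ax < 0) ? 0xFFFFu : 0u;   /* CWD: DX = sign of AX */
    uint16_t r = (uint16_t)((uint16_t)ax ^ dx);   /* XOR AX,DX       */

    r = (uint16_t)(r - dx);                  /* SUB AX,DX            */
    return r;
}
```

Note that 8000h (-32768) comes out as 8000h again, since +32768 doesn't fit into a signed word.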

Pixel drawing in protected mode

Here comes a 'routine' which sets a pixel to the given value in 256-color mode: (parameters: EAX=X coordinate, EBX=Y coordinate, CL=color)


	add	eax,table[ebx*4]
	mov	[eax],cl
The only difference from the real mode method is that the TABLE doesn't contain the 0, 320, 640, etc. values. It contains (a0000-base of DS), (a0000-base of DS+320), ... There's another version which doesn't change EAX:

	mov	edx,table[ebx*4]
	mov	[eax+edx],cl

Simple recursive calls

Sometimes we have to call one subroutine many times like this:


	mov	cx,4
	call	waitraster
	loop	$-3
But this requires a register as cycle counter ;-) There's another way:

	call	waitraster4
	...

waitraster4:
	call	waitraster2
waitraster2:
	call	waitraster
waitraster:
	mov	dx,03dah
	...
	ret
Well, this is not really interesting. It just works :-) Now a more useful example: loading instrument data to the AdLib card.

;Load AdLib instrument. Inputs:
;ds:si: register values (5 words;
;	lower byte: data for operator
;	1, higher byte: data for
;	operator 2)
;al:	adlib port (0,1,2,8,9,a,10h,11h,12h)
loadinstr:
	mov	dx,388h
	add	al,0e0h
	call	double_load

	sub	al,0c6h
	call	double_double
	add	al,1ah
double_double:
	call	double_load
	add	al,1ah
double_load:
	call	final_load
final_load:
	mov	ah,[si]
	inc	si
	out	dx,al
	call	adlib_address_delay
	xchg	al,ah
	out	dx,al
	call	adlib_data_delay
	mov	al,ah
	add	al,3
	ret

Hardware scroll with one page

First a few words about vertical hardware scrolling. The 'standard' scroll requires at least two pages. In the beginning the first page is visible, and it's black. Then the screen goes up one row - the first row of the second page appears at the bottom. Now this row is copied to the 1st row of the 1st page (which is now invisible). This process continues until the 2nd page is entirely visible. At this point the two pages are identical. Now the 1st page is displayed again and the whole process starts from the beginning.

The problem with it is the memory requirement, which is too big. With this method it's impossible to make a 640*480 scroll since one page occupies more than 128k of video memory. But why do we need two pages? Because the video memory is not 'circular'. I mean if we scrolled the screen up by one pixel, the 1st row of the video memory, which was at the top of the screen, would now be at the bottom. With this kind of video memory we could do a smooth vertical scroll with a single page: in the beginning, the screen is black. Now wait for a vertical retrace, then change the 1st row, and shift the screen up by one row so that the previously modified row appears at the bottom. Perfect, eh?

The question is how we can make 'circular' memory... It's a well-known fact that there's a certain problem with the hardware scroll on TSENG cards: every second page contains some 'noise' instead of the scroll we expect to see... The cause of this is the 'memory display start' register (3d4/0c,0d), which works a bit differently than on other cards. On other cards always only the first 256k of the video memory will be displayed on the screen, even if the memory display start register is set close to the end of the 256k. These cards handle this 256k memory as a circular buffer, but the TSENG boards do not:

Normal VGA & TSENG are different!

                                       Normal VGA              TSENG VGA
Memory Display Start Register = XXXX   Video memory:           Video memory:
                                       XXXX-3ffff              XXXX-3ffff
                                       00000-                  40000-
                                       Display memory          Displaying the video
                                       wraps to zero!          memory continues!

So what we can do is 'emulate' the standard VGA circular buffer with the LINE COMPARE REGISTER (3d4/18h). The function of this register is pretty simple: if the scanline counter reaches this value, the display address wraps to 0, the beginning of the video memory:

Normal VGA & TSENG are the same!

                                       Normal VGA              TSENG VGA
Memory Display Start Register = XXXX   Video memory:           Video memory:
                                       XXXX-3ffff              XXXX-3ffff
Line Compare Register                  00000-                  00000-
                                       Display memory          Display memory
                                       wraps to zero!          wraps to zero!

The *big* advantage is that it's possible to emulate circular video memory shorter than 256k! It should work on all VGA cards. The most elegant way is to add Line Compare Register changer code to the Memory Display Start register modifier routine. With this the existing 'standard' scrollers can be fixed for TSENG cards too. Remember, the line compare register is 10 bits wide: the low eight bits are in 3d4/18h, and the highest two bits are located in bit 4 of 3d4/7 and bit 6 of 3d4/9.
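
Since the 10-bit line compare value is scattered over three CRTC registers, a small helper for splitting it may be handy. The following C sketch only does the bit arithmetic described above; the names line_compare_bits and split_line_compare are made up, and the actual port writes (which must preserve the other bits of registers 7 and 9) are left to the caller.

```c
#include <stdint.h>

typedef struct {
    uint8_t line_compare;   /* value for 3d4/18h: bits 0..7 of the line     */
    uint8_t overflow_bit4;  /* 10h or 0: bit 8 of the line, goes to 3d4/7   */
    uint8_t max_scan_bit6;  /* 40h or 0: bit 9 of the line, goes to 3d4/9   */
} line_compare_bits;

/* Hypothetical helper: split a 10-bit scanline number into register bits. */
line_compare_bits split_line_compare(uint16_t scanline)
{
    line_compare_bits b;

    b.line_compare  = (uint8_t)(scanline & 0xFFu);
    b.overflow_bit4 = (uint8_t)(((scanline >> 8) & 1u) << 4);
    b.max_scan_bit6 = (uint8_t)(((scanline >> 9) & 1u) << 6);
    return b;
}
```

The two single-bit fields are already shifted into position, so the caller can clear the old bit in the register and OR the new value in.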


Gouraud shading - 2 instructions/pixel

The main goal of this example is not really to show a G-shading with two instructions ;-) It's rather an example of how to pull down the upper words of 32-bit registers without shifting. There's often a need for calculating with fixed-point numbers: a doubleword's upper word is the whole part, the lower is the fractional part. The problem is that the upper words of the 32-bit registers are hard to reach. For example, after ADD EAX,EBX how do we get EAX's upper word? There's no (quick) way :-( The idea behind the trick is swapping the upper & lower words, and using ADC instead of ADD:


; EAX & EBX are fixed-point numbers
	ror	eax,16
	ror	ebx,16
cycle:
	...
	adc	eax,ebx
	stosw
	...
	loop	cycle
The whole part of the fixed-point numbers will be in the lower words :-) It's very important to preserve the Carry flag between the ADCs to get the appropriate result. Now the Gouraud shading. The following piece of code is only a horizontal shaded line drawer routine, not the whole poly-filler. Colors are expected to be fixed-point numbers presented as doublewords with an 8-bit whole part in the highest byte (this value will appear on the screen) and a 24-bit fractional part. (24 bits may seem to be a lot, but they are surely more accurate than 8 bits ;-)

;In:	eax:	end color
;	ebx:	start color
;	ecx:	line length
;	es:edi:	destination
;!!!	32-bit PM version   !!!

gou_line:
	sub	eax,ebx

;Fill edx with the carry flag
	sbb	edx,edx
	idiv	ecx

;Pull down the upper parts of dwords
	rol	eax,8
	rol	ebx,8
	xchg	ebx,eax

;Calculate the address of the entry
;point in the linearized code
	neg	ecx
	lea	ecx,[ecx*2+ecx+320*3+offset gou_linearized]
	jmp	ecx

gou_linearized:
rept	320
	stosb
	adc	eax,ebx
endm
	ret
Variations: if You want to use it in real mode, then You have to modify the linearized-code entry point calculation, because there the length of a stosb/adc pair is four bytes (the ADC gets an operand-size prefix):

	neg	cx
	shl	cx,2
	add	cx,320*4+offset gou_linearized
	jmp	cx
486-optimization fans may think of some indexed linearized code instead of stosb :-) In this case take care to set up the linearized code correctly, because the lengths of 'mov [edi+0],al', 'mov [edi+1],al' and 'mov [edi+200h],al' are different, so a rept won't give equal-length instructions.

Demand paging on SVGA boards in PM

(This is going to be deep protected mode system coding, be prepared...) It would be very nice to 'map' the video memory into the linear address space so we could reach it as a one megabyte long array. Some cards support it, the rest don't: on those cards only the 'bank switching' routines allow access to the entire video memory. Our goal is to reduce the number of bank switches as far as possible. Several techniques have been developed, but many of them have a big problem: the routine which determines whether a bank switch is necessary must run very many times. The next method solves this problem. It maps a 1MB long memory area to the video memory on any SVGA card, and a bank switch will occur only if necessary. It works in protected and flat virtual mode only, NOT in (flat) real mode. Essentially it's a kind of 'virtual memory' technique based on PAGING. Let's set up the 4k-page table reserving one megabyte above the highest physical memory address (let's say from 800000h to 8fffffh) and map it to a0000h in 64k steps:
000000-7fffff mapped to 000000-7fffff, it's the normal mapping
Logical address    Physical address
000000-000fff      000000-000fff
001000-001fff      001000-001fff
002000-002fff      002000-002fff
...                ...
7ff000-7fffff      7ff000-7fffff

800000-80ffff mapped to a0000-affff
Logical address    Physical address
800000-800fff      a0000-a0fff
801000-801fff      a1000-a1fff
...                ...
80f000-80ffff      af000-affff

810000-81ffff mapped to a0000-affff
Logical address    Physical address
810000-810fff      a0000-a0fff
811000-811fff      a1000-a1fff
...                ...
81f000-81ffff      af000-affff

820000-82ffff mapped to a0000-affff
...

8f0000-8fffff mapped to a0000-affff
Logical address    Physical address
8f0000-8f0fff      a0000-a0fff
8f1000-8f1fff      a1000-a1fff
...                ...
8ff000-8fffff      af000-affff

Great. From 8 to 9 megabytes we can address the a0000 - affff segment sixteen times. Now comes the TRICK. Mark all 4k pages in 810000-8fffff as 'NOT PRESENT' and the pages in 800000-80ffff as 'PRESENT', and hook interrupt 0e (the 'page fault' exception). If a page fault occurs, it means that a bank switch is needed - mark the appropriate pages as 'PRESENT' and the old ones as 'NOT PRESENT', do the bank switch, and return from the exception. The fault handler looks like this:


	push	eax edx
	mov	eax,cr2		; Get page fault address (it's in CR2):
	sub	eax,800000h	; Subtract starting address
	shr	eax,16		; Put bank's number to AL

; SVGA bankswitch
	mov	dx,svga_switch_port
	out	dx,al

; Mark pages present/absent
	(not too difficult to do :-)
; Bye
	pop	edx eax

; Return from the fault
	...
This example assumes that the video memory can be browsed in 64k banks and the bank-switch is simple :-) Now let's observe the problems...
1. Paging must be enabled. This causes some slowdown. But paging can be disabled when no video operations are in use.
2. EMM compatibility. No problem with VCPI, the only difference is that paging may not be disabled at virtual-mode callbacks. The big problem is DPMI, which doesn't allow modifying the page table.
3. Reading the video memory. Some SVGA cards have separate write and read bank registers, but that makes no difference. The page fault handler has to determine whether it was a read or a write operation. So it will (occasionally) switch banks TWICE in a single MOVS instruction! This means that reading the video memory should be avoided as far as possible.
4. Words and doublewords written across bank boundaries. This is the roughest problem. When no special code is inserted to handle this case, the system will fall into an infinite exception cycle :-( Let's take a look at an example: a STOSD occurs at 80fffe. It causes a page fault since the pages from 810000 are not present. The fault handler enables the pages in 810000-81ffff and disables 800000-80ffff, and returns to the instruction which caused the exception. But then immediately a new fault is generated because the 80fffe address is in an absent page... Of course this can be avoided by
a) writing bytes only,
b) writing words to even addresses and dwords to dword boundaries.
If neither of these conditions can be satisfied, the fault handler must decode the instruction which caused the exception and emulate it. This can be pretty simple if only one or two kinds of instructions access the desired area. Of course it's enough to check only those instructions which caused a page fault above the address fffc:

	mov	eax,cr2
	cmp	ax,0fffch
	ja	possible_bank_override
normal_fault:
	...
; Return from the fault
	...

; Check the instruction which caused
; the page fault (STOSD's code is AB)

possible_bank_override:
	mov	edx,[esp+4]	; Faulting EIP; [esp] holds the error code
				; (assuming nothing else was pushed yet)
	cmp	byte ptr[edx],0abh
	jne	normal_fault

; Now emulate a STOSD
	...
; Return from the fault
	...
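
The address arithmetic behind the handler is simple enough to be checked in C. A minimal sketch, assuming the 800000h window used in the text; the function names bank_of, physical_of and crosses_bank are made up for illustration and no real page table is touched.

```c
#include <stdint.h>

#define WINDOW_BASE 0x800000u   /* start of the 1 MB window (from the text) */
#define VIDEO_BASE  0x0A0000u   /* physical address of the 64k video bank   */

/* Bank which must be selected for an address inside the window
   (SUB EAX,800000h / SHR EAX,16 in the handler). */
uint8_t bank_of(uint32_t linear)
{
    return (uint8_t)((linear - WINDOW_BASE) >> 16);
}

/* Physical address the page table maps this window address to. */
uint32_t physical_of(uint32_t linear)
{
    return VIDEO_BASE + (linear & 0xFFFFu);
}

/* Nonzero if an access of 'size' bytes at 'linear' crosses a bank boundary,
   which is the case that needs emulation. */
int crosses_bank(uint32_t linear, uint32_t size)
{
    return (linear & 0xFFFFu) + size > 0x10000u;
}
```

A STOSD at 80fffe crosses from bank 0 into bank 1, while one at 80fffc still fits into bank 0.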

Filling a register with the Carry flag

Sometimes we need to fill a register with the carry flag: put 0 to the desired reg if Carry=0 or put ffffffff if Carry=1. Probably the fastest way to do it (for example, with EDX):


	SBB	EDX,EDX
In practice it can be used to prepare EDX before a division. (Before dividing a signed number by an unsigned number.) By the way, the Carry won't change after this operation. I pondered over it and came to the conclusion that SBB EDX,EDX is faster and in many cases more appropriate than CDQ. What's more, there's an undocumented 1-byte instruction called SETALC which fills AL with the Carry; its code is 0d6h. (Hi Rod!)

BSWAP with 16-bit register

Yeah, BSWAP with a word register operand is a valid instruction. At least on the genuine Intel processors. Some specifications recommend against using it for better compatibility with Intel clones. Anyway, if it works, it brings down the upper word of the doubleword register, byte-swapped, into the lower word without affecting the upper 16 bits. For example, if EAX=ab120000 then


	BSWAP	AX
results in EAX=ab1212ab. Flags won't change of course.

Operand sizes

At instructions where the operand is small (like 'ADD EBX,12'), the machine code is shorter since the operand is stored in 8 bits and will be sign-extended to 32 bits when the instruction actually runs. This means that all operand values in the [-128..127] range (in hex [ffffff80..0000007f]) save 3 bytes per instruction. There's only one trick concerning it. When using 'ADD EBX,80h', the operand takes 4 bytes, since 00000080h falls out of the range. But the


	SUB	EBX,0FFFFFF80h
will do the same as 'ADD EBX,80h' but three bytes shorter. The same goes for 'SUB EBX,80h' / 'ADD EBX,0ffffff80h' too.

Note: the dedicated accumulator opcodes (like 05h for ADD EAX,imm32) always take a full 32-bit operand which can't be shrunk. This comes from the 8086 era, when the 16-bit registers ruled: in 16-bit code most instructions which affect AX have such shorter accumulator forms. The general sign-extending form (opcode 83h) accepts EAX as well, so 'ADD EAX,12' eats five bytes only if the assembler picks the accumulator opcode. If yours does, the good old LEA helps ;-) 'LEA EAX,[EAX+12]' costs three bytes. Of course, only in 32-bit code! In real mode 'LEA EAX,[EAX+12]' has TWO prefix bytes (slowing the 1-clock LEA down to THREE clocks).

I've got one more addressing-mode trick. Instead of 'MOV EAX,[EBX*2]', 'MOV EAX,[EBX+EBX]' is better. The first one's code is seven bytes, the second's is three ;-) Generally, every scaled addressing mode without a base register, like [EBX*2], has a 32-bit displacement, so [EBX*2] is in fact [EBX*2+00000000h]. But if another register is present, the opcode can be shorter; for example, 'MOV EAX,[ESI+EBX*4+10h]' is a 4-byte instruction. Nevertheless, on the 486 every instruction which uses two registers in addressing, or one scaled register, takes one extra cycle! So 'MOV EAX,[EBX+EBX]' takes two cycles, and so does 'MOV EAX,[EBX*2]', and 'LEA EAX,[EBP+EDX*8+1234h]' too.
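
The ADD/SUB swap rests on plain 32-bit wraparound arithmetic, which can be checked in C. A minimal sketch with made-up names (fits_imm8, add_80h, sub_ffffff80h); it models only the resulting register value.

```c
#include <stdint.h>

/* Nonzero if the 32-bit operand can be stored as a sign-extended byte,
   i.e. it lies in [ffffff80..0000007f]. */
int fits_imm8(uint32_t imm)
{
    return imm <= 0x7Fu || imm >= 0xFFFFFF80u;
}

/* ADD EBX,80h: needs a 4-byte operand. */
uint32_t add_80h(uint32_t ebx)
{
    return ebx + 0x80u;
}

/* SUB EBX,0FFFFFF80h: the operand fits into one sign-extended byte. */
uint32_t sub_ffffff80h(uint32_t ebx)
{
    return ebx - 0xFFFFFF80u;
}
```

Both functions return the same value for every input. The model doesn't show the flags: the Carry after the SUB is the opposite of the Carry after the ADD, so the swap is safe only when the following code doesn't depend on it.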

Rounding

Let's say we want to divide a number by a power of two (2, 4, 8, 16, etc.) and round the result to the nearest integer. In this case


	SAR	EAX,4
	ADC	EAX,0
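will work perfectly: SAR leaves the last bit shifted out (the 'one half' bit) in the Carry, and ADC adds it to the result. The credit for this one goes to Good Old Dave :-)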
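
A C model makes it easy to try a few values, negative ones included. This is a sketch with a made-up function name, div16_round; the arithmetic shift is written out as a division rounded towards minus infinity so that the model doesn't depend on how the compiler shifts negative numbers.

```c
#include <stdint.h>

/* Hypothetical model of SAR EAX,4 / ADC EAX,0. */
int32_t div16_round(int32_t eax)
{
    int64_t x = eax;
    /* SAR EAX,4: division by 16, rounded towards minus infinity */
    int64_t q = (x >= 0) ? x / 16 : -((-x + 15) / 16);
    /* Carry = the last bit shifted out, i.e. bit 3 of the original value */
    unsigned carry = ((uint32_t)eax >> 3) & 1u;

    return (int32_t)(q + carry);   /* ADC EAX,0 */
}
```

So 23 (1.4375 * 16) gives 1 and 24 (1.5 * 16) gives 2; exact halves always go upwards, which means -24 (-1.5 * 16) gives -1.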

Penalties on the 486

In most cases, the 486 is free from flow-dependence penalties, which means that an instruction which uses the result of the previous instruction will not cause a slowdown:


        ADD EAX,EBX
        ADD ECX,EAX
takes two cycles. On a Pentium it takes two cycles too, but the

        ADD EAX,EBX
        ADD ECX,EBX
takes one cycle because the second instruction doesn't use the result of the first, so they can be 'paired'. These situations are quite well described in the opt32.doc application note released by Intel; I just want to point out one interesting thing. (By the way, there's a new version out with Pentium Pro optimization tricks. Check www/ftp .intel.com!) Generally, the 486 has two types of flow-dependence penalties:
1) immediately using a register after its 8-bit subregister was modified (so this applies to (e)ax, (e)bx, (e)cx, (e)dx after al, bh, etc. has been changed);
2) using a register in addressing immediately after it was modified. (This is valid for all registers, but beware, the LEA is an addressing instruction, so try to avoid using it if its operand was modified by the previous instruction). For example, how many cycles does the following code sequence eat (in protected mode assuming 100% cache hit):

	ADD	ECX,EBP
	ADC	BL,DL
	MOV	AL,[EBX]
On a 486 the add is one, the adc is another one, but the mov takes three even if the operand is already in the cache! Why? There is a double penalty: one clock for using a register in addressing immediately after it was modified (Address Generation Interlock, AGI); and another cycle for using a register immediately after its subregister was modified (Flow Break). So this innocent MOV instruction costs THREE cycles... Hey, I'm a smart coder, I'm gonna put an instruction between the ADC and the MOV, and the problem is solved! Really? The

	ADD	ECX,EBP
	ADC	BL,DL
	SUB	ESI,EBP
	MOV	AL,[EBX]
sequence takes 5 clocks: the add, adc and sub take three, but the mov takes two, because ONE cycle inserted BETWEEN the ADC and the MOV can save only ONE penalty, not TWO. So for a perfect one-clock-per-instruction ratio at least TWO instructions have to be inserted. Or one two-cycle instruction like shr, or even a prefixed one like add ax,bx in 32-bit code.
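The bookkeeping above boils down to one subtraction. The sketch below is only a toy model of the accounting in this section, not a 486 simulator; stalled_cost and its parameters are names invented for the illustration.

```c
/* Toy model of the cycle accounting described in the text.
   base_cycles: cost of the instruction without any stall (1 for the MOV).
   penalties:   stalls it suffers right after the modifying instruction
                (2 for the MOV: AGI plus subregister use).
   gap_cycles:  cycles of unrelated instructions placed in between. */
unsigned stalled_cost(unsigned base_cycles, unsigned penalties, unsigned gap_cycles)
{
    unsigned left = penalties > gap_cycles ? penalties - gap_cycles : 0;
    return base_cycles + left;   /* each gap cycle hides one penalty */
}
```

With no gap the MOV costs 3, with the SUB in between it costs 2, and only two gap cycles bring it down to 1.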

Aligning inner loops

Aligning the first instruction of an 'inner' loop to a 16-byte boundary may increase performance: the 486's and the Pentium Pro's instruction prefetcher (or whatever) just loves aligned addresses. But the straightforward solution:


	JMP	_INNERLOOP
	ALIGN	16
	_INNERLOOP:
sounds quite dirty from many points of view: it may waste both space and speed. Certainly a more elegant method is needed, one which won't put in a JMP when the _INNERLOOP label is already on a paragraph boundary, and which, when only 1-3 bytes remain before the next aligned position, puts in some NOPs instead of a JMP:

CODEALIGN MACRO
LOCAL _NOW,_THEN
_NOW:
ALIGN	16					; Only used to find the boundary
_THEN:
IF (_THEN-_NOW)					; Any padding needed?
	ORG _NOW				; Set position back
	IF (_THEN-_NOW) LE 3			; Only 1,2,3 bytes remain?
		DB (_THEN-_NOW) DUP (90h)	; put NOPs
	ELSE
		JMP _THEN			; Jump to the boundary
		ALIGN	16			; Fill the rest of the gap
	ENDIF
ENDIF
ENDM
Simply put 'codealign' before the top of the inner loop:

(...Inner loop preparation code...)
codealign
_INNERLOOP:
(...Inner loop...)
This one is not fully optimized for speed: instead of two NOPs (when two bytes remain) one 'MOV ESP,ESP' would be faster, and instead of three NOPs an 'ADD ESP,0' (which, unlike the NOPs, changes the flags).
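The decision the macro makes at assembly time can be modelled in a few lines of C. This is just a sketch of the logic; pad_bytes, pad_strategy and the pad_kind enum are hypothetical names, and addr stands for the offset where 'codealign' is placed.

```c
#include <stdint.h>

typedef enum { PAD_NONE, PAD_NOPS, PAD_JMP } pad_kind;

/* Bytes from addr up to the next 16-byte boundary (0 if already aligned). */
unsigned pad_bytes(uint32_t addr)
{
    return (unsigned)((0u - addr) & 15u);
}

/* Same choice as the macro: nothing, a few NOPs, or a JMP over the gap. */
pad_kind pad_strategy(uint32_t addr)
{
    unsigned n = pad_bytes(addr);
    if (n == 0)
        return PAD_NONE;                 /* label is already aligned    */
    return n <= 3 ? PAD_NOPS : PAD_JMP;  /* short gap: NOPs, else jump  */
}
```

For example, at offset 10Dh three bytes remain, so NOPs are used; at offset 101h fifteen bytes remain and a JMP is cheaper.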

Aligning a memory pointer

After allocating a few doublewords there's no guarantee that the array's starting address is on a dword boundary, so we have to adjust it:


	ADD	EBX,3
	AND	EBX,0fffffffch
With this technique it's possible to align to any power of two. This idea came from Technomancer's PM kernel.
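Generalized to any power of two, the ADD/AND pair looks like this in C. A minimal sketch: align_up is a name chosen for the illustration, and the function does not check that align really is a power of two.

```c
#include <stdint.h>

/* Round addr up to the next multiple of align.
   align must be a power of two (4 gives the ADD 3 / AND 0fffffffch pair). */
uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    return (addr + (align - 1)) & ~(align - 1);
}
```

An address that is already aligned comes back unchanged, since adding align-1 never reaches the next boundary.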
Another (space-saving) trick applies when flags need to be stored with pointers; for example, in 3D code one pointer belongs to every face (triangle). At runtime we need one bit to indicate whether the face is visible or not. It would be too expensive to allocate another byte for every triangle. But where can that one bit be stored? Right in the pointer. This requires that the faces' structures always start on even addresses; in this case the lowest bit of their addresses, and so of the pointers, is always zero. This lowest bit can be used as a flag: set by the appropriate routine (e.g. 'int face_visible(face *)') and checked by other routines. Just take care to clear it before accessing the face's structure.
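A small C sketch of the flag-in-the-pointer idea follows. The face layout and the face_ref_* helpers are hypothetical, made up for this example; the only real requirement is the one stated above, that every face starts on an even address.

```c
#include <stdint.h>

typedef struct face { int vertex[3]; } face;   /* hypothetical layout */

/* A face pointer with the visibility flag packed into bit 0. */
typedef uintptr_t face_ref;

face_ref face_ref_make(face *f, int visible)
{
    return (uintptr_t)f | (visible ? 1u : 0u);
}

int face_ref_visible(face_ref r)
{
    return (int)(r & 1u);
}

/* Clear the flag bit before touching the structure. */
face *face_ref_ptr(face_ref r)
{
    return (face *)(r & ~(uintptr_t)1);
}
```

Storing the tagged value as an integer type rather than as a 'face *' makes it harder to dereference it by accident with the flag still set.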

Kicking out conditional jumps, part ][

When optimizing for the Pentium (or for the Pentium Pro with respect to compatibility) it might save some cycles to change conditional jumps into non-branching code. For example, the next code sequence implements this IF construction:


IF (eax = ebx) THEN ecx=x;
ELSE ecx=y;

In asm:
	SUB	EAX,EBX
	CMP	EAX,1
	SBB	ECX,ECX
	AND	ECX,x-y
	ADD	ECX,y
It also reveals how to copy the Zero flag to the Carry in one cycle ('CMP EAX,1' ;-) On the Pentium Pro, however, the next version should be preferred:

	MOV	ECX,x
	MOV	EDX,y			; CMOVcc takes no immediate operand
	CMP	EAX,EBX
	CMOVNE	ECX,EDX
Yeah, the Pro has been blessed with conditional move instructions like cmovs, cmovne, cmovg, and all the rest. As always, the optimization techniques of the Pro are completely different from the other members of the Intel family. For example, using a 32-bit register after its subregister was modified may cause a 6-7 clock penalty! Can you believe it? Linear texturemapping inner loops using the classic 8-bit U/V coordinates suck... Some other notes: aligning loops to a paragraph boundary now boosts performance again, since the Pro always loads 16 bytes of code in one step. The formerly banned instructions CDQ and MOVZX are also faster than the Pentium's mov/shift-crafted substitutes.
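The SUB / CMP / SBB / AND / ADD sequence can be followed step by step in C. This is a sketch for checking the logic only: select_eq is an invented name, the carry is modelled with an unsigned comparison, and a compiler is of course free to turn the source into whatever instructions it likes.

```c
#include <stdint.h>

/* Returns x if a == b, otherwise y, mirroring the asm line by line. */
uint32_t select_eq(uint32_t a, uint32_t b, uint32_t x, uint32_t y)
{
    uint32_t diff  = a - b;         /* SUB EAX,EBX : zero only when equal    */
    uint32_t carry = diff < 1u;     /* CMP EAX,1   : borrow only if diff = 0 */
    uint32_t mask  = 0u - carry;    /* SBB ECX,ECX : 0FFFFFFFFh or 0         */
    return (mask & (x - y)) + y;    /* AND ECX,x-y / ADD ECX,y               */
}
```

The unsigned wrap-around in x-y is harmless: when the mask is all ones, adding y back yields exactly x.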

(The Crab)

If I had knuwn dat kodin' o' de Intel family is sooo diffiqlt, I'd have stayed at mine ZX Spektrum for ever!