Welcome to Code Gems! Here are some nice tricks for the Intel processor series, based on my articles in the Imphobia diskmagazine. My aim is to introduce those few-byte-long jewels of assembly coding which undoubtedly prove:
"There's always a better way."
This page contains the revised versions of my Imphobia articles. I left out some obviously boring topics and corrected some others. If you still find errors, feel free to contact me.
A respectable part of this was debugged out of various products, while other parts were worked out experimentally by me or by one of my friends; and I'm pretty sure that You have known many of them before. So there are no unambiguous credits for this.
Here's the menu (internal links to this page):
Loop using ECX in 16-bit code
Rejecting JUMPS
Nested loops
Multi-segment REP STOS/MOVS
Optimizing with ESP
REP zeroes CX
Puzzle
A nice tweaked mode operation
XCHG instead of MOV and PUSH/POP
Short compare
Checking if a register contains 8000h
Tiny Little Instructions
Writing a two-digit number
HLT
Macro fun
Optimizing for coding time
Puzzle Solution
LEA strikes back
Aligned fill (Kicking out conditional jumps, part I)
Use ES
Introducing a new video resolution
Fast way to clear the screen
Uncle Ben's wisdoms of the month
Intel processor bug???
Calculating the absolute value of AX
Pixel drawing in protected mode
Simple recursive calls
Hardware scroll with one page
Gouraud shading - 2 instructions/pixel
Demand paging on SVGA boards in PM
Filling a register with the Carry flag
BSWAP with 16-bit register
Operand sizes
Rounding
Penalties on the 486
Aligning inner loops
Aligning a memory pointer
Kicking out conditional jumps, part ][
PHEW!
Dis is what I call a menu!
Loop using ECX in 16-bit code
By default, in 16-bit code the LOOP instruction uses CX for counting. It's possible to switch to ECX: just put an address-size prefix (a 67h byte) before the LOOP. The same goes for JCXZ. Turbo Assembler has built-in instructions for this: LOOPW and LOOPD. The first always uses CX regardless of the current segment size, the other uses ECX. Of course, all versions like LOOPDE, LOOPDNE and JECXZ are also available.
Rejecting JUMPS
It's nice that the 386 knows the long conditional jumps, but there is no long form of LOOP. When a long LOOP is needed (and JUMPS is on), TASM compiles this:

        loop cycle_temp
        jmp cycle_end
    cycle_temp:
        jmp cycle_start
    cycle_end:
    test_config:
        [VGA checking code]
        jne bad_config
        [286 checking code]
        jne bad_config
        [mouse checking code]
        jne bad_config
        [soundcard checking code]
        jne bad_config
        ...

    bad_config_collector:
        jmp bad_config
Nested loops
Sometimes there's a need for little nested loops. One solution:

        mov cl,outer_cycle_num
    outer_cycle:
        [outer cycle code]
        mov ch,inner_cycle_num
    inner_cycle:
        [inner cycle code]
        dec ch
        jne inner_cycle
        loop outer_cycle
Multi-segment REP STOS/MOVS
In flat 16-bit real mode it's possible to use multi-segment block movement. It uses ECX for counting and ESI / EDI for addressing.

    ESTOSD macro
        db 67h
        stosd
    endm

        xor eax,eax
        mov ecx,100000h
        mov edi,200000h     ; STOS writes to ES:[EDI]
        rep estosd
Optimizing with ESP
Using ESP as a general-purpose register is uncommon, because interrupts usually have to be disabled while it holds something other than a stack pointer. But check this out: in real mode (and sometimes in protected mode) the stack operations ignore the upper word of ESP. (Except if a protected mode program forgot to reload the segment rights & limits. This is why I recommend restoring the default real mode settings.) So, when SP=0000, the first word to be pushed will be placed at SS:FFFE. Ergo, if we initialize ESP to 00010000h, we have a wonderful 32-bit number, stack operations refer to the top of the stack segment, and interrupts can stay enabled. For example, let's assume we want a big nested loop (ECX is the only free register & ESP=00010000h):

        mov ecx,(outer_num)*10000h
    outer_cycle:
        [outer cycle code]
        mov cx,inner_num
    inner_cycle:
        [inner cycle code]
        loop inner_cycle
        sub ecx,esp
        jne outer_cycle

        mov ebx,(cyclenum-1)*10000h
    cycle:
        [cycle code]
        sub ebx,00010000h
        jns cycle
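REP zeroes CX
Usual problem: mem->mem copy. DS:SI, ES:DI, and CX are prepared, but

        rep movsb

copies just one byte per step. The usual word-based solution:

        shr cx,1
        jnb copyeven
        movsb
    copyeven:
        je copyready
        rep movsw
    copyready:

Since REP always leaves CX at zero and MOVS doesn't touch the flags, this does the same without jumps:

        shr cx,1
        rep movsw
        adc cx,cx
        rep movsb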
Puzzle
Let's assume that EAX contains 0, except for the lowest 8 bits (AL). How many instructions are needed to fill the upper 3 bytes of EAX with AL (without any precalculated table)? E.g. if EAX=000000e3, it should be transformed to e3e3e3e3. If you wish to work it out yourself, think about it a little before you read further.
A nice tweaked mode operation
Let's draw one line of a raw picture:

        mov al,2
        mov dx,3c4h
        out dx,al
        inc dx
        mov al,11h
        mov cx,linelength
    onerow:
        out dx,al
        movsb
        rol al,1
        adc di,0ffffh
        loop onerow
XCHG instead of MOV and PUSH/POP
Note that

        xchg reg,ax

is a one-byte instruction in 16-bit code, and

        xchg ax,ax

is simply the encoding of NOP (90h).
Short compare
If a routine gives back the result in a register and its value is 0 on error and 1 on success, the ordinary

        or ax,ax
        je error

can be replaced with the one-byte-shorter

        dec ax
        jne error
Checking if a register contains 8000h
(or 80000000h):
        neg reg
        jo it_was_8000
Tiny Little Instructions
CBW/CWD/CDQ/CWDE are the short ways to clear/set AH/DX/EDX/upper word of EAX when the MSB of AL/AX/EAX/AX is known in advance.
Writing a two-digit number
Let's suppose we want to write out a two-digit number in any numerical system with a base in [2..10]. The value comes in AL, the base of the numerical system is in AH.

    WRITE2DIGIT macro
        local zeros,convert
        mov byte ptr aam_operand,ah
        call convert
    zeros db '00$'
    convert:
    aam_operand equ $+1
        aam
        xchg ah,al
        add word ptr zeros,ax
        mov ah,9
        pop dx
        int 21h
    endm
HLT
HLT can be useful! It's very easy to make timings with it:

        mov cx,18
        hlt
        loop $-1
Macro fun
Let me ask You one of my favourite questions! Which programming language is this:

        Writeln('Hello, world!');

    WRITELN macro _string
        local _afterstring
        call _afterstring
        db _string,13,10,'$'
    _afterstring:
        mov ah,9
        pop dx
        int 21h
    endm
Optimizing for coding time
Look at this:

        b equ byte ptr
        w equ word ptr
        o equ offset
        ...

        movzx cx,b ds:[80h]
Puzzle Solution
A possible solution for the riddle:

        imul eax,01010101h
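Why a single multiplication is enough can be checked with a small C model. This is only a sketch: the function name `replicate_al` is made up for the illustration, and it assumes (as the puzzle does) that everything in EAX above AL is zero.

```c
#include <assert.h>
#include <stdint.h>

/* Models IMUL EAX,01010101h on a 32-bit register.
   Hypothetical helper name; assumes only AL is non-zero. */
uint32_t replicate_al(uint32_t eax)
{
    assert((eax & 0xFFFFFF00u) == 0);   /* precondition from the puzzle */
    /* eax * 01010101h = AL + (AL << 8) + (AL << 16) + (AL << 24);
       each partial product stays inside its own byte, so no carries. */
    return eax * 0x01010101u;
}
```

With EAX=000000e3 the model returns e3e3e3e3, the value the puzzle asks for.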
LEA strikes back
Most of us love the LEA instruction because of its versatility. One possible use is multiplication:
LEA EBX,[EBX+EBX*4]
LEA BX,[EBX+EBX*4] (67,8d,1c,9b) ?
LEA EAX,[SI]
MOVZX EAX,SI !!!
LEA EAX,[1234h]
MOV EAX,1234h.
Aligned fill (Kicking out conditional jumps, part I)
Back to the stone age: filled vectors. The dirtiest solution for filling a horizontal line (instead of a rep stosb) is probably this:

        test cl,1
        je _one
        stosb
    _one:
        test cl,2
        je _two
        stosw
    _two:
        shr cx,2
        rep stosd

        mov bx,cx       ; Save CX
        xor cx,cx       ; Put 1 to CX if it
        test bx,bx      ; wasn't 0, else leave
        setne cl        ; it zero
        and cx,di       ; Leave CX 1 if DI is
        sub bx,cx       ; odd, else clear it and adjust BX
        rep stosb       ; Fill one byte if DI was odd
        cmp bx,2        ; Put 2 to CX if we
        setnb cl        ; need to fill two or
        add cx,cx       ; more bytes, else 0
        and cx,di       ; Clear CX if DI is on
        sub bx,cx       ; dword boundary, else leave it & adjust BX
        shr cx,1        ; Fill one word (if CX isn't 0)
        rep stosw
        mov cx,bx       ; Put the number of
        shr cx,2        ; remaining bytes to
        rep stosd       ; CX and fill dwords
        and bx,3        ; Fill the rest
        mov cx,bx
        shr cx,1
        rep stosw
        adc cx,cx
        rep stosb
Use ES
Are You bored of writing
MOVS WORD PTR ES:[DI],ES:[SI]
SEGES MOVSW
Introducing a new video resolution
In 256-color modes using two screen pages was always a big pain. We had to deal with chain-4 mode or VESA or SVGA registers. Basically, on a 'standard' VGA card in 'chunky' mode there's no way to reach the card's memory above 64k. So what can we do with the usable 64k? If we want to use it for two screen pages, one page eats 32k. Which resolution is that enough for? Well, a resolution of 256*128 is a possible choice. With this we can handle two pages... How to initialize this mode:

        mov ax,13h
        int 10h
        mov cx,11
        mov dx,03d4h
        mov si,offset tweaker
        rep outsw

        radix 16
        tweaker dw 0e11,6300,3f01,4002
                dw 8603,5004,0d07,5810
                dw 0ff12,2013,0915
The next topic was written by LEM - jeanmarc.leang@ping.be
Fast way to clear the screen
(by using less colors in 256c mode)
Let's say You use 32 colors, the first 32 and put black to the registers left. In the next frame put black to the first 32 registers and restore the 32c palette in the next 32 registers. and so on... prob... after 8 frames you'll reach the end of the palette and go back to the first 32 colors... (if You use 1 col/frame you'll reach the end of the palette after 256 frames ofcoz) so you'll see the old crap you left 8 frames earlier...
Hehehe,
DID YOU REALLY THINK THAT WAS ALL ???
You can actually clear the screen with that technique!
Let's say You want to draw points in 1 color (works for 1 -> xx colors !!!)
You take 200 colors for the trick which makes 56 colors left; use them
for a logo for example, a size of 64*200 (any size, but just to show you
it works) and put it somewhere on the screen (let's say on the left).
Now You've got 200 colors left, so you can display points during 200 frames
before having garbage on the screen, RIGHT?
WRONG!!! You can display points forever without garbage!
    frame 0   : col 0 on, clear line 0
    frame 1   : col 1 on, col 0 black, clear line 1
    ...
    frame 199 : col 199 on, col 198 black, clear line 199
Uncle Ben's wisdoms of the month
Remember, a NEG + DEC pair equals a NOT. And the DEC/INC doesn't change the carry flag. You may need it some day...
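The NEG + DEC = NOT identity can be checked over every 16-bit value with a short C model. It's just a sketch: the function names are invented for the illustration, and the flags (including the untouched carry of DEC/INC) are not modelled at all.

```c
#include <assert.h>
#include <stdint.h>

/* Models NEG AX followed by DEC AX with 16-bit wrap-around.
   Hypothetical helper name; flags are not modelled. */
uint16_t neg_then_dec(uint16_t ax)
{
    uint16_t negated = (uint16_t)(0u - ax);   /* NEG AX */
    return (uint16_t)(negated - 1u);          /* DEC AX */
}

/* Compares NEG+DEC with NOT for all 65536 register values. */
void check_neg_dec_equals_not(void)
{
    for (uint32_t v = 0; v <= 0xFFFFu; ++v)
        assert(neg_then_dec((uint16_t)v) == (uint16_t)~v);
}
```

The identity is simply two's complement at work: -x equals NOT x plus 1, so subtracting 1 again leaves NOT x.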
Intel processor bug???
I've faced (again...) an interesting problem during the development of a 32-bit interface. Let's say we don't need interrupts 0..7, so there are no interrupt gates in the memory where the IDT starts. The first valid int gate is No. 8, which is at the IDT's base+40h. The problem is that when an interrupt occurs in 32-bit protected mode and the interrupt handler's code is at the beginning of the IDT (in those unused 40h bytes), the processor shuts down, but when the int handler's code is somewhere else, everything is okay. Do you have any ideas...?
Calculating the absolute value of AX
This wonderful 'gem' was developed by Laertis / Nemesis.
        cwd
        xor ax,dx
        sub ax,dx
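How the three instructions cooperate is easy to follow in a C model. This is a sketch under assumptions: the names `abs16` and `check_abs16` are made up here, and the result is returned as an unsigned word so that the 8000h case stays visible.

```c
#include <assert.h>
#include <stdint.h>

/* Models CWD / XOR AX,DX / SUB AX,DX. Hypothetical helper name. */
uint16_t abs16(int16_t value)
{
    uint16_t ax = (uint16_t)value;
    uint16_t dx = (value < 0) ? 0xFFFFu : 0x0000u;  /* CWD: DX = sign of AX */
    ax = (uint16_t)(ax ^ dx);   /* XOR AX,DX: NOT AX if negative      */
    ax = (uint16_t)(ax - dx);   /* SUB AX,DX: subtracting -1 adds one */
    return ax;
}

/* Compares the model with a plain conditional negation. */
void check_abs16(void)
{
    for (int v = -32767; v <= 32767; ++v)
        assert(abs16((int16_t)v) == (uint16_t)(v < 0 ? -v : v));
}
```

For a negative AX this is NOT followed by +1, i.e. a negation; for a non-negative AX both steps do nothing. The single odd case is AX=8000h, which comes back unchanged.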
Pixel drawing in protected mode
Here comes a 'routine' which sets a pixel to the given value in 256-color mode: (parameters: EAX=X coordinate, EBX=Y coordinate, CL=color)
        add eax,table[ebx*4]
        mov [eax],cl

        mov edx,table[ebx*4]
        mov [eax+edx],cl
Simple recursive calls
Sometimes we have to call one subroutine many times like this:

        mov cx,4
        call waitraster
        loop $-3

        call waitraster4
        ...
    waitraster4:
        call waitraster2
    waitraster2:
        call waitraster
    waitraster:
        mov dx,03dah
        ...
        ret

    ;Load AdLib instrument. Inputs:
    ;ds:si: register values (5 words;
    ;       lower byte: data for operator
    ;       1, higher byte: data for
    ;       operator 2)
    ;al: adlib port (0,1,2,8,9,a,10h,11h,12h)
    loadinstr:
        mov dx,388h
        add al,0e0h
        call double_load
        sub al,0c6h
        call double_double
        add al,1ah
    double_double:
        call double_load
        add al,1ah
    double_load:
        call final_load
    final_load:
        mov ah,[si]
        inc si
        out dx,al
        call adlib_address_delay
        xchg al,ah
        out dx,al
        call adlib_data_delay
        mov al,ah
        add al,3
        ret
Hardware scroll with one page
First, a few words about vertical hardware scrolling. The 'standard' scroll requires at least two pages. In the beginning the first page is visible, and it's black. Then the screen goes up one row, and the first row of the second page appears at the bottom. Now this row is copied to the 1st row of the 1st page (which is now invisible). This process continues until the 2nd page is entirely visible. At this point the two pages are identical. Now the 1st page is displayed again and the whole process starts from the beginning.

The problem with this is the memory requirement, which is too big. With this method it's impossible to make a 640*480 scroll, since one page occupies more than 128k of video memory. But why do we need two pages? Because the video memory is not 'circular'. I mean, if we scrolled the screen up by one pixel, the 1st row of the video memory, which was at the top of the screen, would now be at the bottom. With this kind of video memory we could do a smooth vertical scroll with a single page: in the beginning, the screen is black. Now wait for a vertical retrace, then change the 1st row, and shift the screen up by one row so that the previously modified row appears at the bottom. Perfect, eh? The question is how we can make 'circular' memory...

It's a well-known fact that there's a certain problem with the hardware scroll on TSENG cards: every second page contains some 'noise' instead of the scroll we expect to see... The cause of this is the 'memory display start' register (3d4/0c,0d), which works a bit differently than on other cards. On other cards only the first 256k of the video memory is ever displayed on the screen, even if the memory display start register is set close to the end of the 256k. These cards handle this 256k of memory as a circular buffer, but the TSENG boards do not:
| | Normal VGA | TSENG VGA |
|---|---|---|
| Memory Display Start Register = XXXX | Video memory: XXXX-3ffff | Video memory: XXXX-3ffff |
| | 00000- Display memory wraps to zero! | 40000- Displaying the video memory continues! |

| | Normal VGA | TSENG VGA |
|---|---|---|
| Memory Display Start Register = XXXX | Video memory: XXXX-3ffff | Video memory: XXXX-3ffff |
| Line Compare Register | 00000- Display memory wraps to zero! | 00000- Display memory wraps to zero! |
Gouraud shading - 2 instructions/pixel
The main goal of this example is not really to show G-shading with two instructions ;-) It's rather an example of how to bring down the upper words of 32-bit registers without shifting. There's often a need for calculating with fixed-point numbers: a doubleword's upper word is the whole part, the lower is the fractional part. The problem is that the upper words of the 32-bit registers are hard to reach. For example, after ADD EAX,EBX, how do we get EAX's upper word? No (quick) way :-( The idea behind the trick is swapping the upper & lower words, and using ADC instead of ADD:

        ; EAX & EBX are fixed-point numbers
        ror eax,16
        ror ebx,16
    cycle:
        ...
        adc eax,ebx
        stosw
        ...
        loop cycle

    ;In: eax: end color
    ;    ebx: start color
    ;    ecx: line length
    ;    es:edi: destination
    ;!!! 32-bit PM version !!!
    gou_line:
        sub eax,ebx
        ;Fill edx with the carry flag
        sbb edx,edx
        idiv ecx
        ;Pull down the upper parts of dwords
        rol eax,8
        rol ebx,8
        xchg ebx,eax
        ;Calculate the address of the entry
        ;point in the linearized code
        neg ecx
        lea ecx,[ecx*2+ecx+320*3+offset gou_linearized]
        jmp ecx
    gou_linearized:
        rept 320
        stosb
        adc eax,ebx
        endm
        ret

        neg cx
        shl cx,2
        add cx,320*4+offset g.lin.
        jmp cx
Demand paging on SVGA boards in PM
(This is going to be deep protected mode system coding, be prepared...) It would be very nice to 'map' the video memory into the linear address space so we could reach it as a one-megabyte-long array. Some cards support this, the rest do not: on those cards only the 'bank switching' routines give access to the entire video memory. Our goal is to reduce the number of bank switches as far as possible. Several techniques have been developed, but many of them have a big problem: the routine which determines whether a bank switch is necessary must run a great many times. The next method solves this problem. It maps a 1MB-long memory area to the video memory on any SVGA card, and a bank switch occurs only if necessary. It works in protected and flat virtual mode only, NOT in (flat) real mode. Essentially it's a kind of 'virtual memory' technique based on PAGING. Let's set up the 4k-page table, reserving one megabyte above the highest physical memory address (let's say from 800000h to 8fffffh), and map it to a0000h in 64k steps:
Logical address | Physical address |
---|---|
000000-000fff | 000000-000fff |
001000-001fff | 001000-001fff |
002000-002fff | 002000-002fff |
... | ... |
7ff000-7fffff | 7ff000-7fffff |

Logical address | Physical address |
---|---|
800000-800fff | a0000-a0fff |
801000-801fff | a1000-a1fff |
... | ... |
80f000-80ffff | af000-affff |

Logical address | Physical address |
---|---|
810000-810fff | a0000-a0fff |
811000-811fff | a1000-a1fff |
... | ... |
81f000-81ffff | af000-affff |
... | ... |

Logical address | Physical address |
---|---|
8f0000-8f0fff | a0000-a0fff |
8f1000-8f1fff | a1000-a1fff |
... | ... |
8ff000-8fffff | af000-affff |
Great. From 8 to 9 megabytes we can address the a0000-affff segment sixteen times. Now comes the TRICK. Mark all 4k pages between 810000-8fffff as 'NOT PRESENT' and the pages in 800000-80ffff as 'PRESENT', and hook interrupt 0e (the 'page fault' exception). If a page fault occurs, it means that a bank switch is needed: mark the appropriate pages as 'PRESENT' and the old ones as 'NOT PRESENT', do the bank switch, and return from the exception. The fault handler looks like this:
        push eax edx
        mov eax,cr2         ; Get page fault address (CR2 holds it):
        sub eax,800000h     ; Subtract starting address
        shr eax,16          ; Put bank's number to AL
        ; SVGA bankswitch
        mov dx,svga_switch_port
        out dx,al
        ; Mark pages present/absent (not too difficult to do :-)
        ; Bye
        pop edx eax
        ; Return from the fault
        ...
        mov eax,cr2
        cmp ax,0fffch
        ja possible_bank_override
    normal_fault:
        ...
        ; Return from the fault
        ...
        ; Check the instruction which caused
        ; the page fault (STOSD's code is AB)
    possible_bank_override:
        mov edx,[esp]
        cmp byte ptr[edx],0abh
        jne normal_fault
        ; Now emulate a STOSD
        ...
        ; Return from the fault
        ...
Filling a register with the Carry flag
Sometimes we need to fill a register with the carry flag: put 0 to the desired reg if Carry=0 or put ffffffff if Carry=1. Probably the fastest way to do it (for example, with EDX):
SBB EDX,EDX
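In arithmetic terms SBB EDX,EDX computes EDX - EDX - CF, which is just -CF. A tiny C model shows the two possible results; the function name `fill_with_carry` is made up for this sketch, and the carry flag is passed in as an ordinary 0/1 argument.

```c
#include <assert.h>
#include <stdint.h>

/* Models SBB EDX,EDX: EDX - EDX - CF = -CF.
   Hypothetical helper name; carry is the CF bit (0 or 1). */
uint32_t fill_with_carry(unsigned carry)
{
    assert(carry <= 1u);            /* the carry flag is a single bit */
    return 0u - (uint32_t)carry;    /* 0 -> 00000000, 1 -> ffffffff   */
}
```

The old value of EDX never matters, since it cancels out of the subtraction.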
BSWAP with 16-bit register
Yeah, BSWAP with a word register operand is a valid instruction. At least on the genuine Intel processors. Intel's manuals document the result as undefined, and some specifications recommend against using it for better compatibility with Intel clones. Anyway, if it works, it brings down the upper word of the doubleword register without affecting its upper 16 bits. For example, if EAX=ab120000 then
BSWAP AX
Operand sizes
In instructions where the operand is small (like 'ADD EBX,12'), the machine code is shorter, since the operand is stored in 8 bits and sign-extended to 32 bits when the instruction actually runs. This means that all operand values in the [-128..127] range (in hexadecimal [ffffff80..0000007f]) save 3 bytes per instruction. There's only one trick concerning it. When using 'ADD EBX,80h', the operand takes 4 bytes, since 00000080h falls outside the range. But

        SUB EBX,0FFFFFF80h

gives the same result in EBX, and its operand (-128) fits in a single byte.
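That adding 80h and subtracting 0FFFFFF80h leave the same value in the register can be checked with a C model of 32-bit wrap-around arithmetic. This is only a sketch: all names are invented for the illustration, and only the register value is modelled, not the flags.

```c
#include <assert.h>
#include <stdint.h>

/* Non-zero when an immediate can be stored as a sign-extended byte.
   Hypothetical helper name. */
int fits_in_imm8(int64_t immediate)
{
    return immediate >= -128 && immediate <= 127;
}

/* ADD EBX,80h: the operand needs the 4-byte form. */
uint32_t add_80h(uint32_t ebx) { return ebx + 0x00000080u; }

/* SUB EBX,0FFFFFF80h: subtracting -128, which fits in one byte. */
uint32_t sub_minus_80h(uint32_t ebx) { return ebx - 0xFFFFFF80u; }

/* Both forms must leave the same value in the register. */
void check_equivalence(uint32_t ebx)
{
    assert(add_80h(ebx) == sub_minus_80h(ebx));
}
```

One thing the model leaves out: the carry flag after the SUB is the inverse of what the ADD would produce, so the swap is only safe when the following code doesn't depend on it.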
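Rounding
Let's say we want to divide a number by a power of two (2, 4, 8, 16, etc.) and round the result to the nearest integer, with halves going upwards. In this case

        SAR EAX,4
        ADC EAX,0

does the job: SAR leaves the last bit shifted out in the carry flag, and ADC adds it back.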
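The effect of the SAR/ADC pair on positive and negative values can be followed in a C model. It's a sketch with invented names (`sar32`, `sar_adc_round`); the arithmetic shift is written out by hand because C leaves `>>` on negative values implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic shift right (SAR) that doesn't rely on how the compiler
   treats >> on negative values. Hypothetical helper name. */
int32_t sar32(int32_t x, unsigned n)
{
    assert(n >= 1 && n <= 31);
    return (x >= 0) ? (x >> n) : ~((~x) >> n);
}

/* Models SAR EAX,n / ADC EAX,0. After SAR the carry flag holds the
   last bit shifted out, i.e. bit n-1 of the original value. */
int32_t sar_adc_round(int32_t x, unsigned n)
{
    int32_t  quotient = sar32(x, n);
    uint32_t carry    = ((uint32_t)x >> (n - 1)) & 1u;
    return quotient + (int32_t)carry;
}
```

With n=4, the value 24 (1.5 after the division) becomes 2 and 23 becomes 1; on the negative side -24 (-1.5) becomes -1 and -25 becomes -2.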
Penalties on the 486
In most cases, the 486 is free from flow-dependence penalties, which means that an instruction which uses the result of the previous instruction will not cause a slowdown:

        ADD EAX,EBX
        ADD ECX,EAX

        ADD EAX,EBX
        ADD ECX,EBX

        ADD ECX,EBP
        ADC BL,DL
        MOV AL,[EBX]

        ADD ECX,EBP
        ADC BL,DL
        SUB ESI,EBP
        MOV AL,[EBX]
Aligning inner loops
Aligning the first instruction of an 'inner' loop to a 16-byte boundary may increase performance: the 486's and the Pentium Pro's instruction prefetcher (or whatever it is) just loves aligned addresses. But the straightforward solution:

        JMP _INNERLOOP
        ALIGN 16
    _INNERLOOP:

    CODEALIGN MACRO
        LOCAL _NOW,_THEN
    _NOW:
        ALIGN 16
    _THEN:
        IF (_THEN-_NOW)                 ; Already aligned?
          IF (_THEN-_NOW) LE 3          ; 0,1,2,3 remained?
            DB (_THEN-_NOW) DUP (90h)   ; put NOPs
          ELSE
            ORG _NOW                    ; Set position back
            JMP _THEN                   ; Jump to the boundary
            ALIGN 16                    ; Apply aligning
          ENDIF
        ENDIF
    ENDM

        (...Inner loop preparation code...)
        codealign
    _INNERLOOP:
        (...Inner loop...)
Aligning a memory pointer
After allocating a few doublewords, there is no guarantee that the array's starting address is on a dword boundary, so we have to adjust it:

        ADD EBX,3
        AND EBX,0fffffffch
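The same add-and-mask step, written as a C model on a 32-bit address value. A minimal sketch: the function name `align_up_dword` is an assumption made for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Models ADD EBX,3 / AND EBX,0FFFFFFFCh: round an address up to the
   next dword boundary. Hypothetical helper name. */
uint32_t align_up_dword(uint32_t address)
{
    uint32_t aligned = (address + 3u) & 0xFFFFFFFCu;
    assert((aligned & 3u) == 0);    /* the two low bits are always clear */
    return aligned;
}
```

An address that is already aligned stays where it is (1000h gives 1000h), while 1001h, 1002h and 1003h all move up to 1004h. This is why at least 3 spare bytes have to be allocated along with the array.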
Kicking out conditional jumps, part ][
When optimizing for the Pentium (or for the Pentium Pro with respect for compatibility), it might save some cycles to change conditional jumps to non-branching code. For example, the next code sequence implements this IF construction:

        IF (eax = ebx) THEN ecx=x; ELSE ecx=y;

In asm:

        SUB EAX,EBX
        CMP EAX,1
        SBB ECX,ECX
        AND ECX,x-y
        ADD ECX,y

        MOV ECX,x
        CMP EAX,EBX
        CMOVNE ECX,y
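The SUB/CMP/SBB/AND/ADD sequence can be traced step by step in a C model. This is a sketch, not a speed demonstration: the function name `select_if_equal` is invented here, and the carry flag is modelled with an ordinary comparison, which a C compiler is free to turn into whatever code it likes.

```c
#include <assert.h>
#include <stdint.h>

/* Models the branch-free IF (eax = ebx) THEN ecx=x ELSE ecx=y.
   Hypothetical helper name; x and y are constants in the asm version. */
uint32_t select_if_equal(uint32_t eax, uint32_t ebx, uint32_t x, uint32_t y)
{
    uint32_t diff  = eax - ebx;               /* SUB EAX,EBX: 0 only if equal */
    uint32_t carry = (diff < 1u) ? 1u : 0u;   /* CMP EAX,1: borrow iff diff=0 */
    uint32_t mask  = 0u - carry;              /* SBB ECX,ECX: 0 or ffffffff   */
    uint32_t ecx   = mask & (x - y);          /* AND ECX,x-y                  */
    ecx += y;                                 /* ADD ECX,y                    */
    assert(ecx == ((eax == ebx) ? x : y));    /* matches the plain IF         */
    return ecx;
}
```

When the registers are equal the mask is all ones, so ECX ends up as (x-y)+y = x; otherwise the mask clears ECX and only y is added. The wrap-around of x-y for x smaller than y is harmless, since the later addition undoes it.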
If I had knuwn dat kodin' o' de Intel family is sooo diffiqlt, I'd have stayed at mine ZX Spektrum for ever!