As my French speaking readers may know, I recently awoke my Apple IIe from its long hibernation and proceeded to some minor repairs to render it usable again.

There are many many interesting games running on Apple II, but I’m a coder so I want to run some of my own code on it! 😉

CC65, a compiler suite targeting the 6502

Traditionally, these 8bit personal computers were programmed in BASIC. Apple IIs were provided with AppleSoft, a quite powerful (for that time) BASIC from Microsoft. But BASICs are interpreted languages, thus are quite slow. So if you wanted to do something serious, you had no choice but programming in assembly.

The Apple II is powered by a very simple MOS 6502 CPU. But although I had my time programming in ASM (mainly for Motorola’s 68000 and 56000), I wanted a way to avoid plunging too deeply into the arcane of the 6502 architecture. So I was quite pleased to find that there exists an open-source cross compiler suite targeting the 6502! It even provides a limited support of the standard library on the Apple II and an easy way to read inputs and draw ASCII characters on the screen!

Armed with this powerful suite, I decided to quickly implement the Game of Life. This “Game” consists in placing so called “cells” on a board then watching them evolve or die, as they follow some basic rules.<

Animated Game of Life Some cells evolving following the Game of Life’s rules.

Compilation, was flawless. Putting the binary on a disk was not much of a hurdle either. But when I watch the cells evolving… It is SLOOOOOOOOOOOW ! 😮

Investigating the code

I know that the 6502 in my Apple is a 8 bit CPU running at a paltry 1 MHz. But the most demanding part of my code is the following function, called 798 times per screen (or 38 times per line).

void count_neighbours( uint8_t* cell, uint8_t* count ) {
    *count = *cell++; *count += *cell++; *count += *cell;
    cell += JUMP_BEGINNING_NEXT_LINE;
    *count += *cell; cell+=2 ; *count += *cell;
    cell += JUMP_BEGINNING_NEXT_LINE;
    *count += *cell++; *count += *cell++; *count += *cell;
}

So we are talking about a grand total of around 10000 a 8 bit additions and 20000 8 bit memory access. That’s not negligible, but that should not take so long. Drawing a cell on the screen is only a matter of ms. So I know that my function is the main culprit.

I decided to have a look at the generated assembly file. It’s quite easy as C65 compiles the C into assembly, before assembling it to the binary object.

Do not hesitate to quickly scroll down: the listing is quite long 😑

---------------------------------------------------------------
; void __near__ count_neighbours (__near__ unsigned char *, __near__ unsigned char *)
; ---------------------------------------------------------------

.segment	"CODE"

.proc	_count_neighbours: near

.segment	"CODE"

;
; {
;
	jsr     pushax
;
; *count = *cell++; *count += *cell++; *count += *cell;
;
	jsr     pushw0sp
	ldy     #$05
	lda     (sp),y
	tax
	dey
	lda     (sp),y
	sta     regsave
	stx     regsave+1
	clc
	adc     #$01
	bcc     L0158
	inx
L0158:	jsr     staxysp
	ldy     #$00
	lda     (regsave),y
	jsr     staspidx
	ldy     #$01
	lda     (sp),y
	tax
	dey
	lda     (sp),y
	jsr     pushax
	sta     ptr1
	stx     ptr1+1
	ldx     #$00
	lda     (ptr1,x)
	jsr     pusha0
	ldy     #$07
	lda     (sp),y
	tax
	dey
	lda     (sp),y
	sta     regsave
	stx     regsave+1
	clc
	adc     #$01
	bcc     L015B
	inx
L015B:	jsr     staxysp
	ldx     #$00
	lda     (regsave,x)
	jsr     tosadda0
	ldy     #$00
	jsr     staspidx
	ldy     #$01
	lda     (sp),y
	tax
	dey
	lda     (sp),y
	jsr     pushax
	sta     ptr1
	stx     ptr1+1
	ldx     #$00
	lda     (ptr1,x)
	jsr     pusha0
	ldy     #$07
	lda     (sp),y
	sta     ptr1+1
	dey
	lda     (sp),y
	sta     ptr1
	lda     (ptr1,x)
	jsr     tosadda0
	ldy     #$00
	jsr     staspidx
;
; cell += JUMP_BEGINNING_NEXT_LINE;
;
	lda     _JUMP_BEGINNING_NEXT_LINE
	ldx     #$00
	ldy     #$02
	jsr     addeqysp
;
; *count += *cell; cell+=2 ; *count += *cell;
;
	ldy     #$01
	lda     (sp),y
	tax
	dey
	lda     (sp),y
	jsr     pushax
	sta     ptr1
	stx     ptr1+1
	ldx     #$00
	lda     (ptr1,x)
	jsr     pusha0
	ldy     #$07
	lda     (sp),y
	sta     ptr1+1
	dey
	lda     (sp),y
	sta     ptr1
	lda     (ptr1,x)
	jsr     tosadda0
	ldy     #$00
	jsr     staspidx
	ldy     #$02
	ldx     #$00
	tya
	jsr     addeqysp
	ldy     #$01
	lda     (sp),y
	tax
	dey
	lda     (sp),y
	jsr     pushax
	sta     ptr1
	stx     ptr1+1
	ldx     #$00
	lda     (ptr1,x)
	jsr     pusha0
	ldy     #$07
	lda     (sp),y
	sta     ptr1+1
	dey
	lda     (sp),y
	sta     ptr1
	lda     (ptr1,x)
	jsr     tosadda0
	ldy     #$00
	jsr     staspidx
;
; cell += JUMP_BEGINNING_NEXT_LINE;
;
	lda     _JUMP_BEGINNING_NEXT_LINE
	ldx     #$00
	ldy     #$02
	jsr     addeqysp
;
; *count += *cell++; *count += *cell++; *count += *cell;
;
	ldy     #$01
	lda     (sp),y
	tax
	dey
	lda     (sp),y
	jsr     pushax
	sta     ptr1
	stx     ptr1+1
	ldx     #$00
	lda     (ptr1,x)
	jsr     pusha0
	ldy     #$07
	lda     (sp),y
	tax
	dey
	lda     (sp),y
	sta     regsave
	stx     regsave+1
	clc
	adc     #$01
	bcc     L016A
	inx
L016A:	jsr     staxysp
	ldx     #$00
	lda     (regsave,x)
	jsr     tosadda0
	ldy     #$00
	jsr     staspidx
	ldy     #$01
	lda     (sp),y
	tax
	dey
	lda     (sp),y
	jsr     pushax
	sta     ptr1
	stx     ptr1+1
	ldx     #$00
	lda     (ptr1,x)
	jsr     pusha0
	ldy     #$07
	lda     (sp),y
	tax
	dey
	lda     (sp),y
	sta     regsave
	stx     regsave+1
	clc
	adc     #$01
	bcc     L016D
	inx
L016D:	jsr     staxysp
	ldx     #$00
	lda     (regsave,x)
	jsr     tosadda0
	ldy     #$00
	jsr     staspidx
	ldy     #$01
	lda     (sp),y
	tax
	dey
	lda     (sp),y
	jsr     pushax
	sta     ptr1
	stx     ptr1+1
	ldx     #$00
	lda     (ptr1,x)
	jsr     pusha0
	ldy     #$07
	lda     (sp),y
	sta     ptr1+1
	dey
	lda     (sp),y
	sta     ptr1
	lda     (ptr1,x)
	jsr     tosadda0
	ldy     #$00
	jsr     staspidx
;
; }
;
	jmp     incsp4

.endproc

WTF ??!!??

As I said I’m not an expert in the 6502’s ISA, but more than 220 instructions, including many jumps to subroutines ??? Basic operations such as additions, and stack operations are performed by subroutines (addqysp and pusha0)????
Clearly there is something wrong. No wonder that my Game of Life runs so slowly !

I read the coding hints of the CC65 documentation, but it appears that I did nothing too wrong. Plus, the CC65 compilation suite is praised, and is considered more effective than some C compiler of the 80s.

There must be something else.

And indeed, the culprit is the 6502 itself: its architecture is unsuited for efficient high level programming !

The MOS 6502

The 6502 is indeed a very unusual beast. If we compare it to the Z80, a very common 8 bit CPU of that time,  the 6502 is even more a true 8 bit architecture: it does not provide any 16 bit register nor any support for 16 bit operations. And as a matter of fact, it only comes with a single “true” register! The Z80 provided eight 8 bit registers that could be combined to form four 16 bit registers! Ouch!

If we look at the 6502 block diagram below, we can see that its 8 bit register is called the Accumulator and can be source or destination to ALU and LOAD/STORE operations. There are also two more “simili-registers”, X and Y, which are called Index Registers. Indeed, operating with them is much more limited. They are mainly used to store and produce indexes that will serve in some indirect addressing modes.

6502 block diagram The 6502 block diagram with the areas of interest highlighted

Some more limitations:

  • The 6502 accesses data stored in 256 bytes pages. If you want to access any address higher than the “zero page” (0x0 to 0xFF) you’ll get a penalty as it requires to compute a 16 bit address !
  • The stack’s address is fixed to the “first page” (address 0x100 to 0x1FF) and thus cannot contain more than 256 bytes.
  • There is no multiply nor divide operation. As a matter of fact, no “complex” operation is supported as there is no microcode!
  • Only the Accumulator can be pushed or pulled (poped) from the stack.

I could continue this list.
I’m sure you begin to understand that, this is far from being C friendly.

In order to address these shortcomings, the compiler programmers had no choice other than to rely on inefficient solutions. For instance, as the 256 byte stack is too small, CC65 stores its own “unofficial” stack in the highest addresses and grows downwards. In order to use it, it has to rely on custom “push” and “pop” subroutines. It is because of the accumulation of such tricks that the generated code is inefficient.

But anyway kudos to the C65 developers! It was not a small task to allow easy C programming to such a target !

Any strong point?

If the 6502 was so primitive, how comes that it had such a tremendous success in its time? You can find it in the Apple II, but also in the NES, the C64, the BBC micro and many others!

Well, I only stated that is was unsuited for high level compiled languages but, during the 80s, home computers were not programmed like that!

When properly programmed in assembly, the 6502 can truly sing!

First, its small instruction set (only 56!) can be seen as a strength. Decoding them is fast and cheap. Furthermore, the execution time is small compared to the other architectures of the time : almost always one cycle (excluding memory access). Some kind of primitive pipelining is also possible when combining  some addressing mode with some operations: the next instruction can be fetched before the completion of the current one!

And the chip itself was cheap. Cheap to produce, thus cheap to buy. It is composed of only 3510 transistors!!! As a matter of fact, the 6502 is considered by many as the precursor of the RISC architectures. And it is well known that it has inspired the designers of one of the most well known RISC CPU: ARM!

Finally, the 6502 accesses its memory faster than its contemporaries: in one cycle! Therefore, a programmer can use the zero page as a pool of 256 8 bit registers! The 6502’s designers were not crazy: their CPU lacks registers because it does not need any!

With these strengths the 6502 could compete with the other 8 bit CPUs, often clocked 2 to 4 times faster. It is a clean and elegant design, that is unfortunately ill-equipped for modern programming.

So, after all I’ll have to plunge into what I wanted to avoid: I’ll have to write my core routines in assembly language! Stay tuned! 😉

Addendum, 4 years later

Quite recently, a few readers stated that the code given in example was not properly written to suit the 6502 limitations and that I could achieve better performance. One of them pointed me to a recent article: “CC65-Advanced-Optimizations“. For the most part, they are right. If you are motivated to code in C for the 6502, I urge you to read it. It will help you to avoid pitfalls.

But I will keep on my stand that the 6502 is ill-suited for C.

First a few thoughts about this article. In this case, a very simple program that is comparable in complexity to “Game of Life”. The optimized program executes 18 times faster that the “naïve” one. That is quite an achievement.
But some of the most helpful tricks have nothing to do with C on the 6502:

  • Compiler Options (34% speedup from the previous state)
    Yeah, OK. Who does expect optimized code without using the optimization switches?? I compiled in Ois.
  • Replace calculations, switches, and screen access by using Lookup Tables (342% speedup)
    This is a classical optimization trick, that has nothing to do with the language or the targeted CPU.
  • “Identify code critical places and rewrite them in assembly”
    That is common sense, and that is exactly what I did.

Other tricks are indeed here to help the compiler to produce tighter code, more suited to the 6502 architecture. Such as

  • Use 8-bit indexed to access array element instead of pointer
    Indeed pointers are 16-bit values. And the 6502, being 8-bit, sucks at manipulating 16-bit values.
    I used pointers in my code, so I would gain here.
  • Avoid using the stack
    That is fair. As I said, the hardware stack of the 6502 is really limited so CC65 has to emulate one by software and that is slow. Nevertheless, the stack is used to save the context before each function call, and it also means that you have to avoid passing parameters to functions.
  • Decompose each calculus
    Indeed in C intermediary steps must be computed internally on at least 16-bits.

If I were to follow carefully each of these latter advices, my “full C” Game of Life would maybe execute 5 times faster. That a lot. But it would still be very slow. If you read my follow-up article, where I optimized “Game of Life” and rewrote some function in assembly, I gained a 20 factor. And believe me this poor assembly: I was just beginning to learn.

And anyway, you can easily understand that if you follow these tricks, it is precisely because the 6502 is unsuited for efficient high level programming.

And what is not said is that the runtime will consume a lot of memory and most of the utterly precious zero page, leaving you only with 8 usable zero page locations!

Others share the same point of view:

Again, I have nothing against CC65 and its maintainers. On the contrary, they did a wonderful job. If it democratizes homebrew development for 6502-based machines, with good enough performance, we all win at the end! 🙂