Solving a Simple Logical Problem in AVR Assembler, and scaling up to 16 bits (PART I)

If you go on over to Coderbyte.com there is a free programming challenge called Pentagonal Number that asks you to calculate the number of dots in a pentagon based on some pattern of growth.

The idea is to write a program that takes some number as input and returns the amount of dots that would appear in that number’s iteration of whatever pattern is going on here. It’s not hard to figure out that the pattern is the iteration number, multiplied by 5, added to the number of dots in the previous iteration.

int dots = 1;
for(int  iteration=0; iteration<input; iteration++ ){
  dots = dots + iteration*5;
}

I think this pretty much sums up the solution in a C-like language. Very easy. But it is important to check that it works. The input should always be greater than 0, since I’m not sure what a zero’th pentagon would be. Thus, an input of 1 would return a “pentagon” of 1 dot. An input of 2 would return a pentagon of size 6. And input of 3 would return a pentagon of size 16; etc.

However, there is one thing that is annoying about this loop. If the input is 1, the solution should also be 1. Thus, if dots initializes at 1, then the loop doesn’t need to run at all. But it does, it runs at least once, and thus does a useless calculation. It multiplies 5 by 0 and then adds it to 1, resulting in 1. This is fixed by initializing iteration to 1.

int dots = 1;
for(int iteration=1; iteration<input; iteration++ ){
  dots = dots + iteration*5;
}

But how could we write this in Assembly? First, I’d want to think about this in terms of how many operations are going on, how many variables I am using, and the nature of the loop that I want to use. First, let me try to count out all of the fundamental instructions that are going on here.

1. INIT dots
2. INIT iteration
3. INIT or LOAD input
4. MULTIPLY iteration by 5
5. ADD the results to dots
6. INCREMENT iteration
7. COMPARE iteration with input
8. BRANCH if lower

It looks like I will need up to 3 instructions to initialize or load my variables (w/ each one likely requiring its own register)… plus 3 arithmetic instructions for the calculation… plus 2 conditional instructions to make the loop. Generally in AVR Assembly, condition structures will be made out of two instructions. First, a Compare instruction that sets the Status Register based on the results of subtracting two values. And then a Conditional Branch which either jumps or doesn’t to a location based on information inferred about the results of the Compare operation by reading the Status Register. I guess just to jump right into it, it would look something like this in basic AVR Assembly…

.def five      = r16 .def dots  = r17
.def iteration = r18 .def limit = r19

//imagine input is some address in memory or i/o where we might //receive an arbitrary input the the pentagon problem

LDI five,5                 //load 5 immediately into reg five
LDI dots,1		   //load 1 immediately into reg dots
LDI iteration,1	           //load 1 immediately into reg iteration
LDS limit,input            //load input from dspace into reg limit
RJMP condition             //jump to condition
for_loop:
    MUL iteration,five
    ADD dots,product
    inc iteration
condition:
    CP iteration,limit     //compare iteration w/ limit and branch to
    BRLO for_loop          //for_loop if iteration lower than limit

You’ll be happy to learn that most AVR devices have a MULtiply instruction. Unfortunately, it doesn’t take Immediate Values, it only multiplies registers. Thus, I need to have an extra initialization step where I immediately load a 5 into a register that I named “five”. Because I can infact name registers in AVR Assembly IDE’s, which is amazing because it actually makes Assembly tolerable. This is done with the .def and .undef directives, which can be placed anywhere.

AVR Assembler is incredibly flexible when it comes to formatting. Most of the time you can place multiple directives and instructions on the same line, and it knows exactly how to interpret it. I like to define two registers per line because it takes up less lines and is prettier. Part of keeping your sanity with Assembler is finding aesthetic.

.def five      = r16 .def dots  = r17
.def iteration = r18 .def limit = r19

LDS is a neat little instruction that makes life easy. It stands for LoaD from Space; or more specifically LoaD Direct from Data Space. It will accept a 16bit address and load a byte from that address into a register. This completely bypasses the need for using a Pointer to load from Data Space. Naturally, the cost is a double sized instruction and a clock or two more to execute than the more common Indirectly Loading( LD r16, X ). Using a pointer to load from Data Space is usually called Indirect Loading. It is important to not confuse that with Immediately Loading, because of the common use of “I”. I did NOT load “five”, “dots”, and “iteration” Indirectly, I loaded them IMMEDIATELY.

Please check my templating post if you want more information, but… essentially, because I am only simulating this problem and not running it on an actual device, I am not taking real world input, say from an I/O pin or a peripheral module, so I have to make up my own input and program it in. This could be done easily by Immediately Loading(LDI) whatever number I want to test into a register. However, that would be hardcoding the problem to calculate based ONLY on the number I programmed; making it a one shot program, I would have to re-assemble each time I want to calculate a new number. By writing the program to operate based on a value we read from memory, we can imagine that on a running device that memory value could change, thus the program is an actual function, not just a math problem.

So anyway… in this case the “input” is actually a label for a 16bit Data Space(RAM) address, where I might expect to load the input for this Pentagon Problem. Because if you can still remember, that’s what we are trying to do. That’s enough about initialization for now.

RJMP condition
for_loop:
    MUL iteration,five
    ADD dots,product
    inc iteration
condition:
    CP iteration,limit
    BRLO for_loop

This Loop Structure is very important. It is the Assembly equivalent of a While Loop. Technically, the condition could be placed at the top, like in other languages, but I have come to much prefer this style, it usually requires one less label. No matter how it is structured, you can’t avoid having to have a jump instruction; it is required to make sure the body and condition execute in the correct order. Please see my post on Conditional Loops for more information. RJMP is usually one clock cheaper than JMP, so it is usually preferred; read the instruction manual for why. CP compares two registers by virtually subtracting “limit” from “iteration” to measure which one is larger or smaller than the other. BRLO stands for BRanch if Lower, which turns out to be distinct from BRanch if Less Than(BRLT), because BRLO is intended for unsigned comparisons, and BRLT is intended for signed comparisons. I am not using signed numbers.

You will notice, that in addition to “input”, there is one other label/definition here that was not initialized, “product”. The MUL instruction multiplies 2 unsigned numbers, which must be in registers(which was the reason I need an extra “five” register), and is hardcoded to place the result in Register0 and Register1. The MUL instruction anticipates that the result could be as large as 16bits; however, right now I only care about making my program work for 8bits, and then I will scale up. Long story short, I defined “product” to refer to Register0

.def product = r0

MUL iteration,five
ADD dots,product

I then ADD the “product” of the MULtiplication to the register “dots”, which I am using to accumulate the solution on each iteration, because the solution of each iteration depends on the solution of the last one.

That’s it! Granted you know how to store a test value in memory using .DB and .BYTE, you should be able to run this, or at least recreate it. If you need to know how to use .DB and .BYTE please check my templating post. Eventually, however given the 8 bit nature, this program will break for input above a given input because one of my registers will inevitably overflow; probably “product” or “dots”. In PART II, I will try to maximize the amount of input this program can tolerate by identifying which registers are overflowing and expand them to handle 16bit numbers….

Death of The For Loop: Conditional Loops in AVR Assembly (100% Pseudocode)

I think there are generally 3 fundamental Conditional Loop structures that you will encounter in practically all programming languages, from C, up. The While loop, the For loop, and the Do While loop. However, In AVR Assembly code at least, I feel that fundamentally, there are only 2 types of loops. I think there are simply: While Loops, and Do While Loops. For loops are simply While or Do While loops with a counter included. And once you include a counter, the structure can become arbitrarily complex based on where you place the counter and how you choose to increment it or compare it to a condition.

The While Loop will first evaluate the condition to determine whether or not code in the body should be executed, or simply skipped over. If the condition is false, the body is skipped. If the condition is true, the code in the body is executed, then the condition will be checked again, and if it is found true again, will continue to execute the code in the body ad infinitum until something causes the condition to not be true.

while(condition){
    //bodycode...
}

This boils down to a compare operation, a branching operation, and then some arbitrary body of code.

1. COMPARE
2. BRANCH to step 5
3. BODYCODE...
4. JUMP to step 1
5. ...

It turns out I also need a JUMP instruction to keep the body looping. I can even re-arrange the order of these instructions, which I often do when I am writing loops in assembly. I tend to prefer placing the jump at the beginning….

1. JUMP to step 3
2. BODYCODE...
3. COMPARE
4. BRANCH to step 2

…and it works out the same. You just need different addresses. Part of the reason I like this ordering is because if I simply remove that pesky extra JUMP instruction, then the While Loop structure reveals a Do While structure!

1. BODYCODE...
2. COMPARE
3. BRANCH

Which usually looks like this…

do{
  //bodycode...
}
while(condition)

It is efficient to use Do While structures when you have complete certainty that you want to perform the operation of the body at least once; which turns out to be a surprisingly common scenario to someone like me who previously avoided Do While loops when coding in any language.

But what happens if I remove the JUMP instruction from that first ordering? the one where I placed the COMPARE instruction on top and the JUMP on the bottom.

1. COMPARE
2. BRANCH
3. BODYCODE...

This actually reveals the If Conditional structure! Which usually looks like this…

if(condition){
  //bodycode...
}

Thus, a While Loop structure could effectively be thought of as an If Condition structure with a JUMP at the end. But also, as a Do While Loop structure with a JUMP at the beginning. The only structure we haven’t discovered yet is the For Loop structure…

for(int counter=0; counter<limit; counter++){
  //bodycode...
}

In Assembly, it seems like you can set up the for loop in a couple different ways. It is essentially just a While Loop with a counter.

1. CLEAR counter
2. COMPARE counter w/ limit
3. BRANCH to step 7 if conditional is false
4. BODYCODE....
5. INCREMENT counter
6. JUMP to step 2
7. ...

That being said, there is no reason you can’t write a Do For Loop structure.

1. CLEAR counter
2. BODYCODE....
3. INCREMENT counter
4. COMPARE counter w/ limit
5. BRANCH to step 2 if conditional is true

Or with the other ordering of the While Structure(the one with the JUMP at the beginning) with an added counter…

1. CLEAR counter
2. JUMP to step 5
3. BODYCODE...
4. INCREMENT counter
5. COMPARE counter w/ limit
6. BRANCH to step 3 if conditional is true

I find this version interesting because by simply changing step 2 to JUMP to step 4, instead of 5, we can include the counter as part of the conditional block instead of the body. At this point, using labels will be helpful…

init:
  1. CLEAR counter
  2. JUMP condition
body:
  3. BODYCODE...
condition:
  4. INCREMENT counter
  5. COMPARE counter w/ limit
  6. BRANCH body

This completely changes the way the counter works, because now it is counting the times the condition was checked, not the times that the body was executed. This might be useful for some particular problems. You can also have double conditionals by simply tacking more COMPARE’s and BRANCH’es to the end of this. Of course you can add as many counters as you have free registers. And of course you can modify them however you like; you need not be restricted to merely INCREMENT’ing.

In fact, we can reduce this even further by getting rid of the limit. Even though the limit will usually be a constant value, and thus won’t take up a register, it can still be redundant. If we set the counter to the limit, and choose to DECREMENT the counter instead of INCREMENT it, then we can use zero as our limit and simply BRANCH according to the counter’s position with respect to zero.

init:
  LOAD counter,limit
  JMP condition
body:
  BODYCODE...
  DECREMENT counter
condition:
  BRANCH to body

In fact, if we do this, we don’t even need the compare instruction in AVR. I think most Assembly languages will usually have a branching instruction that checks if the result of the last arithmetic operation was zero, which DECREMENT’ing the counter will eventually do.

Quick note: in Assembly, strings almost always have a zero appended to them, so that they can be “null terminated”. Because strings generally have a 0 appended to the end of them, when reading or copying them, null termination allows us to eliminate the counter and write a Do While loop that checks whether or not a byte read from memory is zero…

init:
  LOAD pointer
body:
  LOAD register,[pointer]
  INCREMENT pointer
condition:
  COMPARE register w/ 0
  BRANCH to body if not equal to 0

Reading or writing to Memory typically requires a pointer. A pointer is simply a register that contains an address to a location in memory. In the case of loading bytes from a string, I want the address of the first byte in the string. As I INCREMENT the pointer, I can load each consecutive byte of the string into a register, erasing the old one each time a new one is loaded. This is useful because before loading the new value, I can check the condition of the old one. But this isn’t very useful all by it self. More commonly, I am copying a string from one location in memory to another; and sometimes even back to the same location. This will require 2 pointers, though…

init:
  LOAD pointera
  LOAD pointerb
body:
  LOAD      register,[pointera]
  STORE    [pointerb],register
  INCREMENT pointera
  INCREMENT pointerb
condition:
  COMPARE   register,0
  BRANCH to body IF NOT EQUAL to 0

This loop is structured as a Do While, thus I am expecting the string to be AT LEAST one byte long! Because strings generally always have a zero appended to them for null termination. An empty string, often called a “null string”, is a string that is one byte long and only contains the value zero. Thus, so long as I am certain that I have fed this program the address of a valid string, it is okay to load the first byte right away before knowing what it is. The Do While Loop structure is perfect!

How about if I bring back that counter. Only, now that I don’t need it for counting up or down to a fixed value, I can use it to count the length of a string.

init:
  CLEAR counter
  LOAD pointer
  JUMP condition
body:
  INCREMENT counter
  INCREMENT pointer
condition:
  LOAD register,[pointer]
  COMPARE register w/ 0
  BRANCH to body IF NOT EQUAL body,0

Okay, so first notice I inserted a JUMP before the body, making it a While Loop structure(I’ve long abandoned thinking about the term For Loop). This is important, because when counting the Length of the string, I usually DON’T want to count the null terminator. Thus, it is important to check the condition of the first byte BEFORE deciding whether or not to increment the counter. Another interesting thing I have done is move the Load instruction into the condition segment, whereas in my prior examples, I would load the byte during the body segment. Theoretically, I could move it back to the body segment, but then I would have to add an extra instruction to the init segment pre-loading the first byte; otherwise the first conditional check will have nothing to compare. It would also make sense to increment the counter and pointer BEFORE loading the byte, otherwise we we redundantly load the same first byte twice.

init:
  CLEAR counter
  LOAD  pointer
  LOAD  register,[pointer]
  JUMP condition
body:
  INCREMENT counter
  INCREMENT pointer
  LOAD register,[pointer]
condition:
  COMPARE register,0
  BRANCH to body IF NOT EQUAL to 0

This is fine, because if we have reached the body at this point, we should be assured that the first byte of the string was not 0, thus a valid character and worthy of counting. In the very least, the next byte could be the null terminator, which case the conditional segment will catch it before getting a chance to branch to the body and increment the counter again.

There you have it! A demonstration of the immense flexibility of Assembly code when it comes to Conditional Loop structures! I guess I should name this blog post: Death of The For Loop.

“General Purpose Working Registers” in AVR Assembler

AVR gives us 32-8bit “General Purpose Working Registers” for all our calculating needs. There are a couple nuances and peculiarities to these registers that are worthy of noting.

Right off the bat, we can see that Registers 26 through 31 are special. These registers will be used for holding 16 bit addresses, and function as pointers. m328pdef.inc defines them as such…

; ***** CPU REGISTER DEFINITIONS *****************************************
.def	XL	= r26
.def	XH	= r27
.def	YL	= r28
.def	YH	= r29
.def	ZL	= r30
.def	ZH	= r31

However, you can simply use X, Y, or Z to refer to their respective register pair…

//there are only 3 word based instructions
movw X,Y       ;move word value in XH:XL to YH:YL
adiw X,1       ;add immediate to word XH:XL
sbiw Y,1       ;sub immediate from word YH:YL

Most of the datasheets seem to use the ordering XH:XL to describe register pairs. Such as when describing the destination of the MUL instruction, “R1:R0”, in the instruction set manual…

…though, there is no reason it needs to be this way. The following instructions will all do the same thing, and will all be valid….

movw XH:XL,YH:YL             ;pairing order
movw XL:XH,YL:YH             ;doesn't matter
movw XL:XH,YH:YL
movw XL,YL                   ;you don't even need it
movw XL,YL:YH                ;it's totally optional
movw XL,YH:YL
movw XL:XH,YL

…got the point? No? The following instructions will NOT be valid…

movw XH:XL,YH
movw XL:XH,YH
movw XH,YH
movw XH,YL:YH
movw XH,YH:YL

…got it? Register pair notation is optional. If you simply supply the lower register as an argument to the instruction, it will assume the location of the upper register. This should be a subtle hint that we cannot customize our register pairs; and definitely CANNOT swap register pairs by doing this. And finally, we CANNOT supply the upper registers only as arguments. In fact, you can’t even use the movw instruction with a word that starts on an odd numbered register at all! The following instructions are not valid…

movw r1,r3         ;can't use odd word boundaries
movw r1,r4         ;can't use odd word boundary for destination
movw r4,r1         ;can't read from odd word boundary either

….thus, we seem to have a barely visible word boundary in our Register Space. That’s kind of a bummer, but the upside is that the movw instruction is the only one of the 3 word based instructions that will operate on ANY of the General Purpose Working Registers(so long as it starts on an even numbered register of course).

movw r0,r2        ;valid
movw r16,r0       ;valid
movw r24,r26      ;valid

The other 2 word based instructions are Add Immediate to Word(ADIW) and Subtract Immediate from Word(SBIW), and they work only with the X, Y, and Z register pairs. But, wait… there is another! This isn’t well documented, but Registers 24 and 25 can also be used as a register pair with the ADIW and SBIW instructions. This kind of makes it like the X, Y, and Z register pairs, so sometimes I refer to it as the W register. This name is fitting, because even though it cannot be used as a pointer like X, Y, and Z, it can be thought of as a Word Register. This little easter egg seems to only be noted twice, once in the ADIW description and once in the SBIW description, making it a very elusive feature.

That’s enough about Register Pairs. On to Registers 0 through 23…

You’ll notice that bold line dividing R0 through R15 from the rest of the registers. R0 through R15 are very particular in that they cannot have values Loaded Immediately(LDI) into them. They will not Subtract Immediates(SUBI,SBCI), they will not And w/ Immediates(ANDI), they will not Or w/ Immediates(ORI), and they will not Compare w/ Immediates(CPI). Short of those not working with those 6 instructions, R0-R15 can be treated almost like any other General Purpose Working Register.

But that’s not all! Registers R0-R15 are actually used for some built in functionality as well. This should be important to know, because I often see people assume that R0 is always zero, so they end up using it as a “Zero Register”. I worry this could be bad practice, because if you had recently used any of the 6 Multiply Instructions, then their result might be left over in R1:R0. The 6 Multiply Instructions: MUL, MULS, MULSU, FMUL, FMULS, FMULSU all place a result into the register pair R1:R0. And as far as I know, it is not cleared automatically. The Load Program Memory(LPM) instruction also, by default, loads the byte pointed to by Z into R0, unless a specific destination register is otherwise specified…

LPM Z           ;loads byte into r0
LPM tmp,Z       ;loads byte into tmp

The only other built in functionality I have found that demands use of R0 through R15 is the Data Encryption Standard(DES) instruction; which is not standard in all AVR devices. This is a cryptographic instruction that essentially takes 8 bytes of data and encrypts it using 8 other bytes as a key. DES takes no arguments when written, but expects the 8 bytes of data to encrypt/decrypt to be in R0,R1,R2,R3,R4,R5,R6,R7, and it expects the 8 bytes used for the key to be in R8,R9,R10,R11,R12,R13,R14,R15. Some AVR devices feature AES, but it is implemented as a peripheral module, and doesn’t need the General Purpose Working Registers.

Registers R16 through R23 are the registers that are most available for quick calculating. They don’t have any special functionality and are thus about as General Purpose as a Working Register can get.

16bit AVR Assembler Template w/ Explaination

High level languages seem infinitely complex, and impossible to master; with endless libraries, written by tons of people. But I find some Assembly languages, particularly AVR, to be nice and concise with a digestibly sized instruction set that fits in a single 191 page pdf. I am actually struggling to learn C at the moment; because honestly, C trying to emulate all the functions of AVR Assembler just ends up being more confusing syntactically than the original Assembly. Anyway, here is a generic 16-bit template for AVR Assembler. I say 16-bit, because it is structured to be useful for 16-bit calculations. It takes advantage of some preprocessor tricks that help make 8-bit registers behave like 16-bit registers. It is mine, and much better for me than the AtmelStudio Assembler template, which I think just increments r16 indefinitely. I will give some basic introduction to the AVR Assembly concepts and tricks that make this useful for starting up a small 16-bit Assembly Program.

//DEPENDENCIES////////////////////////////////////
.include "macros.asm"

//REGISTER DEFINITIONS////////////////////////////
.def ac   = r16 .def acL   = r16 .def acH   = r17  ;accumulator
.def bc   = r18 .def bcL   = r18 .def bcH   = r19  ;b-cumulator
.def tmp  = r20 .def tmpL  = r20 .def tmpH  = r21  ;temp-regs
.def tmp2 = r22 .def tmp2L = r22 .def tmp2H = r23  ;temp-regs

//DATA ALLOCATION SEGMENT////////////////////////
.DSEG
     datacopy:  .byte strlen("All that is gold does not glitter")+1
     variableA: .byte 2
     variableB: .byte 1

//CODE SEGMENT///////////////////////////////////
.EQU RESET     = 0x0000 ;program always starts here
.EQU CONSTANTS = 0x0800 ;address to store constants
.EQU MAIN      = 0x0100 ;address to store program
.CSEG .org CONSTANTS
    constantdata: .db "All that is gold does not glitter",0

.org RESET
    rjmp MAIN

//MAIN PROGRAM/////////
.org MAIN
	COPYSTRINGPGM ramcopy,romdata

        //WRITE PROGRAM HERE....

	endloop:
		nop
		rjmp endloop
//PLACE SUBROUTINES HERE/////////////////////

.exit ///////////////////////////////////////
//NOTES//////////////////////////////////////

/////////////////////////////////////////////

.macro LDZPGM
    ldi ZL,low(@0)
    ldi ZH,high(@0)
    lsl ZL
    rol ZH
.endmacro

.macro LDY
    ldi yl, low(@0)
    ldi yh,high(@0)
.endmacro

.macro COPYSTRINGPGM
    LDZPGM   @1
    LDY	     @0
      LPM tmp,Z+
      ST  Y+, tmp
      CPI tmp,0
      BRNE PC-3
.endmacro

Okay, so let me explain. Starting at the top and working our way down slowly…

Because I like to keep all of my macros in a separate file from the get-go, I need to include it with the directive .INCLUDE. File formats can be .asm, .s, or .inc. I placed this directive at the top because Macros must be declared BEFORE any potential use; but really, the Include directive can be placed just about anywhere, and the assembler will essentially insert the entire document into the line where it was included. I could very well include documents at the bottom or even in the middle of the main document.

In the case of ATmega328p, the file “m328pdef.inc” will already be included by default as a dependency from when I tell AtmelStudio which device I wish to code for. I treat it like a third document in my program. It is where I can find all of the default device Constants(.equ) such as peripheral addresses and some register definitions(ex: .def XL = r26 .def XH = r27). Here is a small snippet from m328pdef.inc that declares constants for some I/O Peripheral Registers.

; ***** I/O REGISTER DEFINITIONS *****************************************
; NOTE:
; Definitions marked "MEMORY MAPPED"are extended I/O ports
; and cannot be used with IN/OUT instructions
//only use ST/LD type instructions
.equ	UDR0	= 0xc6	; MEMORY MAPPED
.equ	UBRR0L	= 0xc4	; MEMORY MAPPED
.equ	UBRR0H	= 0xc5	; MEMORY MAPPED
.equ	UCSR0C	= 0xc2	; MEMORY MAPPED

In My Example Template, I defined(.def) my registers at the top because that is just traditionally where programmers tend to stick global things. But actually, I can define registers anywhere. If at any point in my code, I want to change the name of a register, I can undefine(.undef) the name, and re-define the register with a new one. Though I haven’t used .UNDEF in My Example Template, it is useful to mention.

I can define multiple names for a single register. This will prompt a warning from AtmelStudio, but not an error. I have found this is useful for extending 8 bit registers into 16 bit registers. In My Example Template, by defining Register 20 as both tmp and tmpL, I can use either name to refer to the same register. This is useful if I want to switch back and forth between using tmp as an 8 bit temporary variable or using tmp as a 16 bit temporary variable while keeping the names aesthetic and not having to constantly undefine and re-define them. Thus, the following will Assemble with a warning, but no error.

.def ac   = r16 .def acL   = r16 .def acH   = r17
.def bc   = r18 .def bcL   = r18 .def bcH   = r19
.def tmp  = r20 .def tmpL  = r20 .def tmpH  = r21
.def tmp2 = r22 .def tmp2L = r22 .def tmp2H = r23

Actually, to be honest, the main reason I desire to have some consistency in naming my Low and High registers when dealing with 16 bit values, is because of a neat grammar trick I have found when writing macros. AVR Assembler uses Parameterized Macros that can take registers as arguments. If I wanted to write a Macro to add two 16bit numbers, normally it would require 4 arguments..

.def acL = r16 .def acH = r17
.def bcL = r18 .def bcH = r19

//when writing macros arguments look like this @0, @1, @2, @3, etc
.macro ADDWORDS
    add @0,@2        ;add the low bytes
    adc @1,@3        ;add the high bytes w/ carry
.endmacro

//now call the macro
ADDWORDS acL,acH,bcL,bcH       ;takes 4 arguments

However, because of the way Macros work, and the way I have named my registers, I can totally rewrite this macro with only 2 arguments, and the following will Assemble without Error; and is much more aesthetic.

.def acL = r16 .def acH = r17
.def bcL = r18 .def bcH = r19

.macro ADDWORDS
    add @0L,@1L        ;add the low bytes
    adc @0H,@1H        ;add the high bytes w/ carry
.endmacro

ADDWORDS ac,bc

This works because macros are actually taking the names of the registers as arguments during pr-processing, not the registers themselves. So it essentially just copies and pastes “ac” where it sees @0, and “bc” where it sees “@1”. Thus I have hacked Macros to require less arguments; allowing for prettier code. Again, though I haven’t actually done this in my example code, I have found it pretty invaluable, and useful to know about.

Let’s talk about the macros I have written. I think my second macro, which loads the Y pointer will be the easiest to explain. Observe the comments…

.DSEG
    ramcopy: .byte 69   ;ramcopy expresses a 16bit address
 
.CSEG                            
.macro LDY              ;Macro loads Y pointer w/16bit address
    ldi YL, low(@0)     ;Load the Lower 8bits into Y
    ldi YH,high(@0)     ;Load the upper 8bits into Y
.endmacro

LDY ramcopy   ;Y will now point to ramcopy

I use .DSEG to declare an address in RAM. Labels always express an address, which can be loaded into registers as an immediate value. However, since no Load Immediate Word instruction seems to exist in the AVR Instruction Set, I have to use the built-in “functions“(which I think are just built in weirdly formatted macros): low() and high() to chop the 16 bit address into bytes. Those bytes can now be treated as immediate values for loading or whatever.

Loading the Z register is a little different however, because I am going to use it to point to Program Space, where as I was using Y to point to RAM or Data Space.

.CSEG
    constantdata: .db 69  ;.db stores a constant in Program Space

.macro LDZPGM         
    ldi ZL,low(@0)  ;load the low
    ldi ZH,high(@0) ;load the high
    lsl ZL          ;use logical shift to avoid carries
    rol ZH          ;use rol to use carries
.endmacro

LDZPGM constantdata

Okay, so my comments explained nothing. I will explain this more later when I talk about the .ORG directive, but Data Space(RAM) and Program Space(ROM/FLASH) are organized differently, that is to say they use different addressing schemes. While Data Space maps a byte to every address, Program Space maps a Word to every address. For Z to point correctly it must be multiplied by 2. This is achieved in Assembly by Logically Shifting a register to the left by one place. Because I am dealing with a 16bit number and 8 bit registers I have to shift both the low and the high registers. This is where my comments will make sense. Both the Logical Shift Left(lsl) and the Roll Left(rol) instructions shift bits one place to the left, but they do it slightly differently. If a bit is lost while shifting a register, it actually is said to go into the Carry Bit. The bit that is gained is where these two instructions differ. Rolling uses the Carry Bit to fill the vacancy left by the shift, where as Logically Shifting will always just leave vacancies as zero. If you visualize it like this…

High Reg 0000 0000 Low Reg 0000 0000
Carry Bit 0
Shifting Left <———- <— <————–

…it becomes more obvious that I have to shift the Low Reg first, because it might generate a Carry. If it does, I want that bit to be loaded into the vacancy of the High Reg when I shift it. However, I DON’T want anything going into the vacancy left over by shifting the Low Reg.

Having to use two different instructions on High vs Low Registers when trying to do 16-bit AVR is common across most arithmetic operations; even simple adding. You usually must use the ADD instructions for the Low Reg and then the ADC(Add w/ Carry) instruction for the High Reg.

The rest of My Example Template falls within two very important directives: The Data Segment(.dseg) and the Code Segment(.cseg). The Code Segment is for writing instructions and constants to ROM(flash), and the Data Segment is for allocating Memory(RAM). There are Assembler Directives that only work in certain Segments. For example…

.BYTE cannot be placed in the Code Segment, it is for allocating Memory
.DB should not be placed in the Data Segment(though beware it can), it is for writing constants to Program Space
.DEF can be placed in either, and it doesn’t really matter much
.ORG can not only be placed, but used by both Code AND Data Segments for aligning code or data to specific addresses within Program Space or RAM.

An empty document will be implicitly considered Code Segment without the need for declaration. So that if I simply started writing instructions in an empty document such as…

ldi r16,1        ;load a 1 into register 16, just cuz

…then it would be considered Code Segment(.cseg) and would Assemble without error. Data Segment is not required. However, once a Data Segment(.dseg) is declared with .DSEG, then the rest of the document becomes Data Segment. Instructions cannot be written until the Code Segment is re-declared explicitly using .CSEG. The following program will assemble without error..

//DOCUMENT IMPLICITLY STARTS AS CODE SEGMENT
ldi r16,1             ;some code instruction

//NOW DATA SEGMENT IS DECLARED WITH .DSEG
.DSEG               
    //instructions cannot be written here
 
//CODE SEGMENT IS NOW RE-DECLARED, EXPLICITLY WITH .CSEG
.CSEG
    sts myvalue,r16  ;another code instruction

The Data Segment is used for reserving and organizing RAM. RAM always starts completely empty, I am not initializing any values here, I am merely reserving space in RAM for values that the program will create as it runs. The .BYTE directive must be followed by the number of bytes to reserve in RAM, and should be preceded by a Label which will be a reference to the address of the first byte reserved. Any reserved bytes will always begin at Data Space address 0x0100 by default; however, using .ORG, I can reserve bytes anywhere between here and the end of RAM(0x08FF). Unorganized bytes will be given addresses one after another. The Labels effectively make the Data Segment a table of 16-bit addresses in RAM; which I will frequently need to point to. Observe the comments in the following example…

.DSEG
                    .byte 1      ;technically, u don't need a label
   my16bits:        .byte 2      ;2bytes reserved 0x101-0x102
   my24bits:        .byte 3      ;3bytes reserved 0x103-0x105
   mybyte: my8bits: .byte 1      ;can use multiple labels

.org 0x0110                   
   myword:   .byte 2     ;2 bytes reserved 0x110-0x111

.org 0x0120   
   mystring: .byte 5+1   ;6 bytes reserved 0x120-0x125

.CSEG
  ldi tmp,255            ;load some value, just cuz
  sts mybyte,tmp         ;store value at 0x106
  sts myword,tmp         ;store value at 0x110
  sts myword+1,tmp       ;store value at 0x111

Kind of like how I could give multiple names to the same register, I can give multiple names to the same address by using multiple labels before .BYTE. However, if I try to organize overlapping data using .ORG I will get an error. I can use the .OVERLAP directive to allow overlapping of data or code for all lines below the directive until it is either explicitly turned off using .NOOVERLAP, or if a new Segment is declared using .DSEG or .CSEG.

.DSEG 
.overlap
.org 0x0100
   myword:   .byte 2     ;2 bytes reserved 0x100-0x101
.org 0x0100             
   lowbyte:  .byte 1     ;1 byte reserved 0x100
   highbyte: .byte 1     ;1 byte reserved 0x101
.nooverlap

Things I forgot to cover from the Data Segment of My Example Template… oh! STRLEN() is a built in macro that obviously returns the length of a string. It cannot be used during runtime, it is for the PreProcessor only; as far as I know. Also, I always try to add an extra byte to strings for Null Termination, explaining the STRLENG(“All that..”)+1.

FINALLY! The part I’ve been trying to get to… The Code Segment.

Because RAM is not capable of saving data when powered off, constant data such as strings must be programmed to ROM(Flash) using .DB or .DW to define constant bytes or words. The directives are quite flexible, they can take integers, strings, or multiple values delimited by commas. Successive constant definitions will be placed one after another in Program Space.

If I wish to read a string, I must either read it directly from ROM, or… copy it into the space that I allocated in RAM using .BYTE in the Data Segment, and then read it from RAM whenever I need it, like most programs do. This is exactly what my third Macro does!

.DSEG
    inputcopy: .byte strlen("All that is gold does not glitter")+1
.CSEG
    input:     .db "All that is gold does not glitter",0

.macro COPYSTRINGPGM
    LDZPGM   @1           ;use macro to load Z pointer
    LDY	     @0           ;use macro to load Y pointer
      LPM tmp,Z+          ;Load Byte to tmp register
      ST  Y+, tmp         ;Store byte in register to RAM
      CPI tmp,0           ;check for null terminator
      BRNE PC-3           ;branch if byte not equal to zero
.endmacro

COPYSTRINGPGM inputcopy, input

Macros within Macros are fine. I don’t have enough comment space to explain this one. The Macro COPYSTRINGPGM takes 2 arguments, the address of the location in RAM where I want to copy the string to, and the address of the location in Program Space(ROM) where I want to copy the string from. Using my LDZPGM macro I load the read address into the Z pointer, and using the LDY macro I load the destination address into the Y register. Only the Z pointer can be used to read from Program Space, because only Z is programmed to work with the Load Program Memory(LPM) instruction. The first non-macro’d instruction does exactly this. It loads one byte out of Program Space(at the address pointed to by Z), the byte is loaded into a temporary register, and then Z is incremented by 1, so that it automatically points to the address of the next byte that will be loaded from Program Space. I then Store(ST) the temporary register’s value in RAM(at the address pointed to by Y), and then Y is incremented by 1, so that it automatically points to the address in RAM where the next byte will be stored. So far, these two instructions have copied a single byte from ROM to RAM and are set up to do this again with the next byte in the string. The third instruction compares whatever value was placed into the temporary register(which has not changed yet) to zero. I have purposely placed a zero at the end of my string; as I would probably do with any string in assembly. This is called a null terminator. It makes it possible to write a algorithm that can parse a string without having to know the length in advance. I can simply say: parse, copy, or read until you find a zero. This works out because 0 will never correspond to a coherent string character(“0” actually equals 48 in ASCII). So this is what I did. The Compare Immediate(CPI) instruction virtually subtracts 0 from the value in tmp. This is just a weird way to check if a number is zero. The next instruction BRNE only cares about whether the result of the previous arithmetic operation either resulted in zero, or not. The sequence of instructions will BRanch if Not Equal to zero(in this case). Thus, until the temporary register contains a zero, the Branch if Not Equal(BRNE) instruction will keep repeating these 3 instructions.

Some final notes about this macro. The order in which things happen is often pretty important. A fun fact about the loop used in this macro is that it is actually the fundamental structure of a do-while loop. The body code is executed before the condition is checked. This means I am loading and storing the first byte, before even checking what it is. This should work fine for strings as small as the “null string”(a string containing no characters, thus only a terminator, 0), with no inherent upper limit on the string size. The other consequence is that I load and store the last byte before checking what it is. This also works out, because I want to copy the null terminator. The null terminator is useful when it is treated it like an extra secret character in the string. A confusing side effect of this is that strings are often 1 byte larger than you might think they are, so their size and length may not technically be the same.

The Code Segment can be organized using .ORG with one incredibly important difference. The 16-bit address must be logically shifted left by one place. This is because the Code Segment is all stored in Program Space, and the Program Space addressing is organized by WORDS, not Bytes. This is notoriously confusing in AtmelStudio’s Debugger/Simulator, since the Memory View only outputs Byte addresses…

…In My Example Template I told .ORG to place this constant string at address 0x0300, but it is at exactly double that value. Atmel Debugger/Simulator simply does not display word addressing, even when looking at Program Space; as lovely as that would be. Thus, the address 0x0300, when used in the Code Segment is referring to the 0x0300th WORD in Program Space, which would be the 0x0600th byte; because a Word is 2 Bytes. Taking an Address and Logically Shifting it Left is equivalent to multiplying it by 2.

I divided my Code Section into 3 partitions. You would think they should go in sequential order, but I did not do that….

.CSEG .org 0x0800
    constantdata: .db "All that is gold does not glitter",0

.org 0x0000
    rjmp MAIN

//MAIN PROGRAM/////////
.org 0x0100
	COPYSTRINGPGM ramcopy,romdata

        //WRITE PROGRAM HERE....

	endloop:
		nop
		rjmp endloop
//PLACE SUBROUTINES HERE/////////////////////

.exit ///////////////////////////////////////

I started off with whatever address I want to store my constants at. If I don’t specify an address of my own, it will simply store my constants at either address 0, or wherever I declare them mid-code(which I can do). It doesn’t matter the exact address, I just usually want my constants kept away from actual Program Instructions while writing. Constants are technically written to Flash(ROM) before any instructions are executed, so chronologically, my line ordering can still make some sense. After declaring constants, my program is ready to begin. Code execution always begins at 0x0000. A very common thing to do is place a jmp instruction here and actually begin your program at a later address. This is because AVR devices often place interrupt vectors here, meaning in a complex program, the first bunch of addresses will probably be a table of jumps to different interrupt routines. So I like to save this space. I started my program at 0x0100, just because that was a nice clean number, far above any interrupt vectors. Of course, this program only copies a string from ROM to RAM, and then loops nothing forever. I always place an end loop as a catch for the end of my program. The program counter or instruction pointer should not be allowed to run astray at the end of the program. AFTER the endloop, is typically where I place my sub-routines. The endloop will protect these subroutines from being executed arbitrarily. Finally, the .EXIT directive marks the end of all code. You may freely write comments or keep notes in the lines below this directive and it will be ignored by the assembler.