ARM Assembly Language (32-bit)

ARM is one of the most influential and widely deployed processor architectures in modern computing, powering billions of devices from smartphones and tablets to embedded systems and, increasingly, server infrastructure. As a Reduced Instruction Set Computer (RISC) architecture, it offers a compact, regular, and power-efficient instruction set that has shaped mobile computing and embedded systems development. Its emphasis on simplicity, energy efficiency, and scalability has made it the dominant platform for battery-powered devices while still delivering the performance that demanding applications require. Understanding ARM assembly language is valuable for embedded systems programmers, mobile application developers, security researchers working with ARM-based devices, and anyone seeking to optimize performance on ARM processors. This reference covers 32-bit ARM assembly programming, from the fundamental RISC principles and register architecture to advanced topics including conditional execution, NEON SIMD programming, and system-level development.

Architecture Overview and RISC Philosophy

Historical Context and Design Evolution

The ARM architecture traces its origins to Acorn Computers in the 1980s, where it was originally developed as the Acorn RISC Machine before evolving into the Advanced RISC Machine architecture that would eventually dominate mobile computing. The architecture's design philosophy centered on the RISC principles of simplicity, efficiency, and performance through reduced instruction complexity rather than instruction richness. This approach contrasted sharply with the Complex Instruction Set Computer (CISC) philosophy exemplified by x86 processors, instead emphasizing a smaller set of highly optimized instructions that could execute efficiently in simple hardware implementations.

Across multiple generations, the ARM architecture has largely maintained backward compatibility while introducing enhancements that address emerging computing requirements. From the original ARM1 processor through the widely deployed ARMv7-A architecture, each generation has refined the instruction set, improved performance characteristics, and added specialized features for specific application domains. The architecture's modular design enables implementations ranging from ultra-low-power microcontrollers to high-performance application processors, demonstrating the scalability inherent in the RISC approach.

RISC Design Principles and Implementation

ARM assembly language embodies the core RISC principles through its instruction set design, register architecture, and execution model. The architecture employs a load-store model where memory access occurs only through dedicated load and store instructions, while all computational operations work exclusively with register operands. This separation simplifies processor design, enables efficient pipelining, and provides predictable performance characteristics that facilitate compiler optimization and real-time system development.

arm

; Load-store architecture examples
ldr r0, [r1]        ; Load word from memory address in r1 to r0
str r2, [r3, #4]    ; Store r2 to memory at r3 + 4 offset
add r4, r5, r6      ; Add r5 and r6, store result in r4 (register-only operation)
sub r7, r8, #10     ; Subtract immediate value 10 from r8, store in r7

The core ARM instruction set is small, on the order of a hundred instructions depending on how variants are counted, which is significantly fewer than typical CISC architectures, yet it provides comprehensive computational capabilities through orthogonal instruction design. Each instruction can typically be combined with various addressing modes, condition codes, and operand types, creating a rich programming environment despite the relatively small instruction count. This orthogonality enables efficient code generation and simplifies assembly language programming by providing consistent patterns across different instruction types.

Processor Modes and Privilege Levels

ARM processors operate in multiple modes that provide different privilege levels and access rights, enabling secure system design and efficient exception handling. The modes are User mode for application code, the privileged exception modes Supervisor, IRQ, FIQ, Abort, and Undefined, and the privileged System mode, which shares the User mode registers. Understanding these modes is crucial for system-level programming and security implementation on ARM platforms.

arm

; Mode switching and privilege examples
mrs r0, cpsr        ; Read current program status register
bic r0, r0, #0x1F   ; Clear mode bits
orr r0, r0, #0x13   ; Set supervisor mode
msr cpsr_c, r0      ; Write back to CPSR (privileged operation)

; Exception handling
swi #0              ; Software interrupt (system call; named svc in newer syntax)
movs pc, lr         ; Return from the SWI handler (also restores CPSR from SPSR)

The privilege model enables operating systems to protect system resources while providing controlled access to hardware features. User mode applications execute with restricted privileges, while kernel code and device drivers operate in privileged modes with full hardware access. This separation forms the foundation for secure system design and enables robust multi-tasking operating systems on ARM platforms.
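The mode-switching sequence above clears and then sets the low five bits of the CPSR. As a rough illustration, the following Python sketch models that bit manipulation and decodes the mode field. The function names and the example CPSR value are hypothetical; the mode encodings are the standard ones for ARMv7-A without the Security or Virtualization Extensions.

```python
# Mode field encodings, CPSR bits [4:0].
MODES = {
    0x10: "User",
    0x11: "FIQ",
    0x12: "IRQ",
    0x13: "Supervisor",
    0x17: "Abort",
    0x1B: "Undefined",
    0x1F: "System",
}

def set_mode(cpsr, mode_bits):
    """Mirror of: bic r0, r0, #0x1F followed by orr r0, r0, #mode_bits."""
    return ((cpsr & ~0x1F) | mode_bits) & 0xFFFFFFFF

def mode_name(cpsr):
    """Decode the mode field of a CPSR value."""
    return MODES.get(cpsr & 0x1F, "reserved")

# Hypothetical CPSR value: Z and C flags set, IRQ and FIQ masked, User mode.
cpsr = 0x600000D0
supervisor_cpsr = set_mode(cpsr, 0x13)
print(mode_name(cpsr), "->", mode_name(supervisor_cpsr))
```

On real hardware the write only takes effect from a privileged mode; in User mode the control field of the CPSR cannot be changed this way.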

Register Architecture and Organization

General-Purpose Register Set

The ARM architecture provides sixteen 32-bit general-purpose registers (R0-R15) that serve as the primary storage for computation and address calculation. Unlike architectures with specialized register functions, ARM registers are largely orthogonal, meaning most registers can be used interchangeably for different purposes. However, certain registers have conventional uses and some have special hardware behaviors that influence programming practices and calling conventions.

arm

; General-purpose register usage examples
mov r0, #42         ; Load immediate value 42 into r0
mov r1, r0          ; Copy r0 contents to r1
add r2, r0, r1      ; Add r0 and r1, store result in r2
lsl r3, r2, #2      ; Logical shift left r2 by 2 bits, store in r3

; Register addressing and manipulation
mov r4, #0x1000     ; Load base address
ldr r5, [r4]        ; Load from base address
ldr r6, [r4, #4]    ; Load from base + offset
ldr r7, [r4, r5]    ; Load from base + index register

The register set's orthogonal design enables flexible programming approaches and efficient compiler code generation. Most arithmetic, logical, and data movement operations can use any register as source or destination, providing maximum flexibility for register allocation and optimization. This flexibility contrasts with architectures that have specialized registers for specific operations, enabling more efficient use of the available register resources.

Special-Purpose Registers

While most ARM registers are general-purpose, registers R13, R14, and R15 have special hardware behaviors and conventional uses that are crucial for proper ARM programming. R13 serves as the Stack Pointer (SP), R14 functions as the Link Register (LR) for function calls, and R15 operates as the Program Counter (PC) with special addressing behaviors.

arm

; Stack pointer operations
push {r0, r1, r2}   ; Push registers onto stack (decrements SP)
pop {r0, r1, r2}    ; Pop registers from stack (increments SP)
add sp, sp, #16     ; Manually adjust stack pointer

; Link register and function calls
bl function_name    ; Branch with link (saves return address in LR)
bx lr              ; Return to caller (branch and exchange to LR)
mov lr, pc         ; Manually save a return address (this instruction's address + 8)

; Program counter behavior
mov r0, pc         ; Read PC (address of this instruction + 8 in ARM state)
add pc, pc, #4     ; Jump to this instruction's address + 12 (PC reads as +8)
ldr pc, [r1]       ; Indirect jump through memory

The Link Register (LR) automatically receives the return address when branch-with-link instructions execute, enabling efficient function call implementation without stack manipulation for simple functions. The Program Counter's behavior reflects the original ARM pipeline: in ARM state, reading PC returns the address of the current instruction plus 8 bytes, so any PC-relative arithmetic must account for that offset.
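The PC offset is easy to get wrong when computing jump targets by hand. The following Python sketch models what a PC read returns and where an `add pc, pc, #imm` lands. The helper names are made up for illustration, and the Thumb offset of 4 is included only for comparison.

```python
MASK = 0xFFFFFFFF

def pc_read(instr_addr, thumb=False):
    """Value seen when an instruction at instr_addr reads PC."""
    return (instr_addr + (4 if thumb else 8)) & MASK

def add_pc_target(instr_addr, imm):
    """Destination of `add pc, pc, #imm` executed in ARM state."""
    return (pc_read(instr_addr) + imm) & MASK

# An `add pc, pc, #4` at 0x8000 continues at 0x800C, skipping two instructions.
print(hex(add_pc_target(0x8000, 4)))
```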

Current Program Status Register (CPSR)

The Current Program Status Register (CPSR) contains processor state information including condition code flags, processor mode bits, and control flags that affect instruction execution. The CPSR enables ARM's unique conditional execution capability and provides essential state information for system programming and exception handling.

arm

; CPSR flag manipulation
cmp r0, r1          ; Compare r0 with r1, set condition flags
moveq r2, #1        ; Move 1 to r2 if equal (conditional execution)
movne r2, #0        ; Move 0 to r2 if not equal

; Direct CPSR access (writing the control field requires a privileged mode)
mrs r0, cpsr        ; Read CPSR into r0
msr cpsr_f, r0      ; Write flags portion of CPSR
msr cpsr_c, r0      ; Write control portion of CPSR

; Condition code testing
tst r0, #0x80       ; Test bit 7 of r0
bne bit_set         ; Branch if bit was set (not zero)
teq r1, r2          ; Test equivalence (XOR without storing result)
bne not_equal       ; Branch if not equal

The condition code flags (Negative, Zero, Carry, Overflow) reflect the result of the most recent flag-setting instruction: a comparison such as CMP, TST, or TEQ, or a data processing instruction with the S suffix (for example SUBS). Because an S-suffixed instruction sets the flags as a side effect, many loops and tests need no separate comparison instruction. The processor mode bits determine the current privilege level and available register set, while control bits affect interrupt handling and processor behavior.
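To make the flag definitions concrete, here is a small Python model of how `cmp a, b` derives N, Z, C, and V from the 32-bit subtraction `a - b`. The function name is hypothetical and the model covers only the subtraction case.

```python
MASK = 0xFFFFFFFF

def cmp_flags(a, b):
    """Return (N, Z, C, V) as set by `cmp a, b`, which computes a - b."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    n = result >> 31                       # sign bit of the result
    z = int(result == 0)                   # result is zero
    c = int(a >= b)                        # carry means "no borrow" on ARM
    v = ((a ^ b) & (a ^ result)) >> 31     # signed overflow
    return n, z, c, v

print(cmp_flags(5, 5))           # equal operands
print(cmp_flags(3, 5))           # smaller minus larger
print(cmp_flags(0x80000000, 1))  # most negative value minus one overflows
```

Note that the carry flag after a subtraction is the inverse of a borrow, which is why the unsigned "higher or same" condition tests for C set.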

Instruction Set Architecture and Encoding

Instruction Format and Conditional Execution

ARM instructions use a fixed 32-bit encoding that provides consistent instruction length and simplified instruction fetch. Almost every ARM instruction includes a 4-bit condition field (bits 31:28) that makes its execution depend on the current state of the condition code flags in the CPSR. Few other widely used architectures offer conditional execution this pervasive, and it enables compact code for short conditional sequences.

arm

; Conditional execution examples
cmp r0, #10         ; Compare r0 with 10
addgt r1, r1, #1    ; Add 1 to r1 if r0 > 10 (greater than)
suble r2, r2, #1    ; Subtract 1 from r2 if r0 <= 10 (less than or equal)
moveq r3, #0        ; Move 0 to r3 if r0 == 10 (equal)

; Complex conditional sequences
cmp r0, r1          ; Compare two registers
movlt r2, r0        ; r2 = min(r0, r1) - part 1
movge r2, r1        ; r2 = min(r0, r1) - part 2
movlt r3, r1        ; r3 = max(r0, r1) - part 1  
movge r3, r0        ; r3 = max(r0, r1) - part 2

The conditional execution capability eliminates many branch instructions that would be required in other architectures, improving code density and pipeline efficiency. By avoiding branches for simple conditional operations, ARM code can maintain better instruction throughput and reduce the performance impact of pipeline stalls.
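As an illustration of how the condition field is interpreted, the Python sketch below maps the 4-bit field to its mnemonic and evaluates each condition against the N, Z, C, V flags. The function names are hypothetical, and the two instruction words in the usage lines are the encodings of `mov r0, r1` and `moveq r0, r1`.

```python
# Index in this list equals the 4-bit condition field value.
COND_NAMES = ["eq", "ne", "hs", "lo", "mi", "pl", "vs", "vc",
              "hi", "ls", "ge", "lt", "gt", "le", "al"]

def condition_of(instruction_word):
    """Extract the condition mnemonic from a 32-bit ARM instruction word."""
    field = (instruction_word >> 28) & 0xF
    return COND_NAMES[field] if field < 15 else "unconditional"

def condition_passed(cond, n, z, c, v):
    """Evaluate a condition mnemonic against flag values of 0 or 1."""
    checks = {
        "eq": z == 1,            "ne": z == 0,
        "hs": c == 1,            "lo": c == 0,
        "mi": n == 1,            "pl": n == 0,
        "vs": v == 1,            "vc": v == 0,
        "hi": c == 1 and z == 0, "ls": c == 0 or z == 1,
        "ge": n == v,            "lt": n != v,
        "gt": z == 0 and n == v, "le": z == 1 or n != v,
        "al": True,
    }
    return checks[cond]

print(condition_of(0xE1A00001), condition_of(0x01A00001))
# Flags after comparing 3 with 5 are N=1, Z=0, C=0, V=0.
print(condition_passed("lt", 1, 0, 0, 0), condition_passed("hi", 1, 0, 0, 0))
```

The signed conditions (ge, lt, gt, le) compare N with V, while the unsigned ones (hs, lo, hi, ls) use C and Z, which is why the two families must not be mixed when comparing addresses or unsigned counts.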

Addressing Modes and Memory Access

ARM provides sophisticated addressing modes that enable efficient access to various data structures and memory layouts. The addressing modes include immediate addressing, register addressing, and various forms of indexed addressing that support arrays, structures, and pointer-based data access with minimal instruction overhead.

arm

; Immediate addressing
mov r0, #255        ; Load immediate value (8-bit value with rotation)
mov r1, #0x1000     ; Load immediate address
add r2, r3, #4      ; Add immediate offset

; Register addressing  
mov r0, r1          ; Copy register contents
add r2, r3, r4      ; Add two registers

; Memory addressing modes
ldr r0, [r1]        ; Load from address in r1
ldr r0, [r1, #4]    ; Load from r1 + 4 (offset addressing)
ldr r0, [r1, r2]    ; Load from r1 + r2 (register offset)
ldr r0, [r1, r2, lsl #2] ; Load from r1 + (r2 << 2) (scaled register)

; Pre-indexed and post-indexed addressing
ldr r0, [r1, #4]!   ; Load from r1 + 4, then r1 = r1 + 4 (pre-indexed)
ldr r0, [r1], #4    ; Load from r1, then r1 = r1 + 4 (post-indexed)

The scaled register addressing mode enables efficient array access by shifting the index register left by an immediate amount, so the index can be multiplied by 2, 4, 8, or any other power of two to match the element size. Pre-indexed and post-indexed addressing modes support efficient pointer manipulation and array traversal without requiring separate address calculation instructions.
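The "8-bit value with rotation" rule for immediate operands determines which constants can appear directly in a data processing instruction. The following Python sketch checks whether a 32-bit constant fits the ARM-state encoding, an 8-bit value rotated right by an even amount. The function names are hypothetical, and the search order (smallest rotation first) is one reasonable choice rather than a requirement.

```python
MASK = 0xFFFFFFFF

def rotate_left(value, amount):
    """Rotate a 32-bit value left by amount bits (0 to 31)."""
    value &= MASK
    return ((value << amount) | (value >> (32 - amount))) & MASK

def encode_arm_immediate(value):
    """Return (rotation_field, imm8) if value is encodable, else None.

    The encoded constant is imm8 rotated right by 2 * rotation_field, so
    rotating the target left by the same amount must leave an 8-bit value.
    """
    for rotation_field in range(16):
        imm8 = rotate_left(value, 2 * rotation_field)
        if imm8 <= 0xFF:
            return rotation_field, imm8
    return None

for constant in (255, 0x1000, 0xFF000000, 0x101, 0x12345678):
    print(hex(constant), encode_arm_immediate(constant))
```

Constants that return `None` here, which includes most addresses, cannot be used with `mov rX, #value`; they are normally loaded from a literal pool (for example with the `ldr rX, =value` pseudo-instruction) or built from several instructions.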

Data Processing Instructions

ARM data processing instructions provide comprehensive arithmetic, logical, and data manipulation capabilities. These instructions can optionally update condition code flags and support various operand types including immediate values, registers, and shifted registers. The instruction set's orthogonal design enables consistent operation across different instruction types.

arm

; Arithmetic operations
add r0, r1, r2      ; Add r1 and r2, store in r0
adc r0, r1, r2      ; Add with carry
sub r0, r1, r2      ; Subtract r2 from r1
sbc r0, r1, r2      ; Subtract with carry (r1 - r2 - NOT carry)
rsb r0, r1, r2      ; Reverse subtract (r2 - r1)

; Logical operations
and r0, r1, r2      ; Bitwise AND
orr r0, r1, r2      ; Bitwise OR
eor r0, r1, r2      ; Bitwise XOR (exclusive OR)
bic r0, r1, r2      ; Bit clear (r1 AND NOT r2)
mvn r0, r1          ; Move NOT (bitwise complement)

; Shift operations
lsl r0, r1, #2      ; Logical shift left by 2 bits
lsr r0, r1, #4      ; Logical shift right by 4 bits
asr r0, r1, #3      ; Arithmetic shift right by 3 bits
ror r0, r1, #8      ; Rotate right by 8 bits
rrx r0, r1          ; Rotate right through carry

The shift operations can be combined with other data processing instructions as part of the operand specification, enabling complex operations in single instructions. This capability supports efficient implementation of mathematical operations, bit manipulation algorithms, and data structure access patterns.
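The Python sketch below models the four immediate shift types on 32-bit values and shows how a shifted operand turns a multiplication by a small constant into a single instruction. Function names are hypothetical, and shift amounts are assumed to be in the range 1 to 31.

```python
MASK = 0xFFFFFFFF

def lsl(value, amount):
    """Logical shift left: zeros enter from the right."""
    return (value << amount) & MASK

def lsr(value, amount):
    """Logical shift right: zeros enter from the left."""
    return (value & MASK) >> amount

def asr(value, amount):
    """Arithmetic shift right: the sign bit is replicated."""
    value &= MASK
    if value & 0x80000000:
        value -= 1 << 32          # reinterpret as a negative number
    return (value >> amount) & MASK

def ror(value, amount):
    """Rotate right: bits leaving the right re-enter on the left."""
    value &= MASK
    return ((value >> amount) | (value << (32 - amount))) & MASK

def times5(r1):
    """Model of: add r0, r1, r1, lsl #2   (r0 = r1 + r1 * 4)."""
    return (r1 + lsl(r1, 2)) & MASK

print(hex(asr(0x80000000, 3)), hex(ror(0xFF, 8)), times5(7))
```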

Control Flow and Program Structure

Branch Instructions and Program Flow

ARM provides various branch instructions that enable implementation of conditional logic, loops, and function calls. The branch instructions include conditional and unconditional variants, with some instructions providing automatic return address saving for function call implementation.

arm

; Unconditional branches
b label             ; Branch to label
bl function         ; Branch with link (save return address)
bx r0              ; Branch and exchange (can switch instruction sets)
blx r1             ; Branch with link and exchange

; Conditional branches
beq equal_label     ; Branch if equal (Z flag set)
bne not_equal       ; Branch if not equal (Z flag clear)
blt less_than       ; Branch if less than (signed)
bgt greater_than    ; Branch if greater than (signed)
blo below           ; Branch if below (unsigned)
bhi above           ; Branch if above (unsigned)

; Compare and branch patterns
loop_start:
cmp r0, #10         ; Compare r0 with 10
bge end_loop        ; Branch if greater than or equal
add r1, r1, r0      ; Loop body
add r0, r0, #1      ; Increment counter
b loop_start        ; Continue loop
end_loop:

The branch and exchange (BX) instruction enables switching between ARM and Thumb instruction sets, providing flexibility for mixed-mode programming and interoperability between different code sections. The automatic return address saving in branch-with-link instructions simplifies function call implementation and reduces stack manipulation overhead.

Loop Constructs and Iteration

ARM assembly supports efficient loop implementation through various instruction combinations and addressing modes. While ARM lacks dedicated loop instructions like some architectures, the combination of conditional execution, flexible addressing modes, and efficient branch instructions enables highly optimized loop constructs.

arm

; Simple counting loop
mov r0, #10         ; Initialize counter
loop_start:
    ; Loop body instructions
    subs r0, r0, #1 ; Decrement counter and set flags
    bne loop_start  ; Continue if not zero

; Array processing loop
ldr r0, =array_base ; Array base address (literal load; addresses rarely fit an immediate)
mov r1, #0          ; Index
ldr r2, =array_size ; Array size (element count)
process_loop:
    ldr r3, [r0, r1, lsl #2] ; Load array[index] (4-byte elements)
    ; Process element in r3
    add r1, r1, #1  ; Increment index
    cmp r1, r2      ; Compare with size
    blt process_loop ; Continue if index < size

; Post-indexed addressing loop
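ldr r0, =array_base ; Source pointer
ldr r1, =array_end  ; End address (one past the last element)
ldr r3, =dest_base  ; Destination pointer (dest_base is an assumed label)
copy_loop:
    ldr r2, [r0], #4 ; Load and increment pointer
    str r2, [r3], #4 ; Store and increment destination
    cmp r0, r1       ; Check for end
    blo copy_loop    ; Continue while below end (unsigned address compare)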

Post-indexed addressing modes enable efficient pointer-based loops where address calculation and memory access occur in single instructions. This capability reduces instruction count and improves performance for array processing and memory copy operations.

Function Calls and Stack Management

ARM function calls use the Link Register (LR) for return address storage and follow established calling conventions for parameter passing and register preservation. The Procedure Call Standard for the ARM Architecture (AAPCS) defines consistent interfaces that enable interoperability between assembly language functions and high-level language code.

arm

; Function call sequence
mov r0, #param1     ; First parameter in r0
mov r1, #param2     ; Second parameter in r1
mov r2, #param3     ; Third parameter in r2
mov r3, #param4     ; Fourth parameter in r3
; Additional parameters go on stack
bl function_name    ; Call function

; Function prologue
function_name:
    push {r4-r11, lr} ; Save callee-saved registers and return address
    sub sp, sp, #16   ; Allocate local variable space
    
    ; Function body
    add r0, r0, r1    ; Use parameters
    str r0, [sp, #0]  ; Store local variable
    
    ; Function epilogue
    add sp, sp, #16   ; Deallocate local variables
    pop {r4-r11, pc}  ; Restore registers and return

; Leaf function (no function calls)
leaf_function:
    add r0, r0, r1    ; Simple operation
    bx lr             ; Return directly

The calling convention specifies that registers R0-R3 pass the first four word-sized parameters, with additional parameters passed on the stack, and that the result is returned in R0. Registers R4-R11 are callee-saved and must be preserved across function calls, while R0-R3 and R12 are caller-saved and may be modified by called functions.
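As a simplified illustration of the parameter-passing rule, the Python sketch below assigns a list of arguments to registers and stack slots. It assumes every argument is a 32-bit integer or pointer; 64-bit values, floating-point arguments, and structures follow additional AAPCS rules that are not modelled here. The function name is hypothetical.

```python
def place_arguments(args):
    """Split word-sized arguments into register and stack locations.

    Returns (registers, stack), where stack[i] is found at [sp, #4*i]
    on entry to the called function.
    """
    registers = {f"r{i}": value for i, value in enumerate(args[:4])}
    stack = list(args[4:])
    return registers, stack

registers, stack = place_arguments([10, 20, 30, 40, 50, 60])
print(registers)
print({f"[sp, #{4 * i}]": value for i, value in enumerate(stack)})
```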

Memory Management and System Programming

Memory Architecture and Address Spaces

ARM processors implement sophisticated memory management capabilities including virtual memory, memory protection, and cache management. The Memory Management Unit (MMU) provides address translation, access control, and memory attribute management that enable secure multi-tasking operating systems and efficient memory utilization.

arm

; Memory management operations (privileged mode)
mcr p15, 0, r0, c2, c0, 0  ; Write Translation Table Base Register
mcr p15, 0, r1, c3, c0, 0  ; Write Domain Access Control Register
mcr p15, 0, r2, c1, c0, 0  ; Write Control Register (enable MMU)

; Cache management
mcr p15, 0, r0, c7, c5, 0  ; Invalidate entire instruction cache
mcr p15, 0, r1, c7, c6, 0  ; Invalidate entire data cache (ARMv6 and earlier encoding)
mcr p15, 0, r2, c7, c10, 4 ; Data Synchronization Barrier

; TLB management
mcr p15, 0, r0, c8, c7, 0  ; Invalidate entire TLB
mcr p15, 0, r1, c8, c6, 1  ; Invalidate data TLB entry by MVA

The coprocessor interface (CP15) provides access to system control registers that manage memory mapping, cache behavior, and processor configuration. Understanding these interfaces is essential for operating system development and low-level system programming on ARM platforms.
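The operands of `mcr` and `mrc` map directly onto fields of the instruction word, which helps when reading disassembly or register tables. The following Python sketch assembles the ARM-state encoding from those operands. The function name and parameter names are hypothetical; the field layout follows the standard coprocessor register transfer format.

```python
def encode_cp_transfer(read, coproc, opc1, rt, crn, crm, opc2, cond=0xE):
    """Encode `mrc` (read=True) or `mcr` (read=False) in ARM state.

    Assembly operand order: pCOPROC, opc1, Rt, CRn, CRm, opc2.
    """
    return (
        (cond << 28)          # condition field, 0xE means always
        | (0b1110 << 24)      # coprocessor register transfer class
        | (opc1 << 21)
        | (int(read) << 20)   # 1 = coprocessor to ARM register (mrc)
        | (crn << 16)
        | (rt << 12)
        | (coproc << 8)
        | (opc2 << 5)
        | (1 << 4)
        | crm
    )

# mcr p15, 0, r0, c2, c0, 0  (write Translation Table Base Register)
print(hex(encode_cp_transfer(False, 15, 0, 0, 2, 0, 0)))
# mrc p15, 0, r0, c0, c0, 0  (read Main ID Register)
print(hex(encode_cp_transfer(True, 15, 0, 0, 0, 0, 0)))
```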

Exception Handling and Interrupts

ARM processors provide comprehensive exception handling capabilities including interrupts, data aborts, prefetch aborts, and software interrupts. The exception handling mechanism automatically saves processor state and vectors to appropriate handler routines, enabling robust system software implementation.

arm

; Exception vector table (located at 0x00000000 or 0xFFFF0000)
reset_vector:       b reset_handler
undefined_vector:   b undefined_handler
swi_vector:         b swi_handler
prefetch_vector:    b prefetch_handler
data_abort_vector:  b data_abort_handler
reserved_vector:    nop
irq_vector:         b irq_handler
fiq_vector:         b fiq_handler

; Interrupt service routine structure
irq_handler:
    sub lr, lr, #4      ; Adjust return address
    push {r0-r3, r12, lr} ; Save registers
    
    ; Identify and handle interrupt source
    ldr r0, =interrupt_controller
    ldr r1, [r0, #status_offset]
    ; Process interrupt
    
    pop {r0-r3, r12, lr}  ; Restore registers
    movs pc, lr           ; Return from interrupt

Exception handling requires careful attention to processor mode changes, register banking, and return address adjustment. Each exception mode has its own banked SP and LR, and FIQ mode additionally banks R8-R12, so a handler gets a private stack and return address without first saving the interrupted code's copies. On entry the processor also copies the CPSR into the mode's SPSR, which the exception return instruction restores.
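The return address adjustment differs by exception type, which is why the IRQ handler above subtracts 4 from LR. The Python sketch below tabulates the usual ARM-state adjustments and computes the resume address. The names are hypothetical, and the table assumes the handler wants to retry the faulting instruction after an abort and to continue after the instruction for SWI and undefined-instruction exceptions.

```python
MASK = 0xFFFFFFFF

# Bytes to subtract from the banked LR when returning (ARM state).
RETURN_ADJUST = {
    "swi": 0,             # movs pc, lr
    "undefined": 0,       # movs pc, lr
    "irq": 4,             # subs pc, lr, #4
    "fiq": 4,             # subs pc, lr, #4
    "prefetch_abort": 4,  # subs pc, lr, #4 (re-execute the instruction)
    "data_abort": 8,      # subs pc, lr, #8 (re-execute the access)
}

def return_address(exception, banked_lr):
    """Address at which execution resumes after the handler returns."""
    return (banked_lr - RETURN_ADJUST[exception]) & MASK

# An IRQ handler entered with LR = 0x8010 resumes at 0x800C.
print(hex(return_address("irq", 0x8010)))
```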

Coprocessor Interface and System Control

ARM processors support coprocessor interfaces that enable extension of the instruction set and integration of specialized processing units. The most commonly used coprocessor is CP15, which provides access to system control and configuration registers.

arm

; Coprocessor register access
mrc p15, 0, r0, c0, c0, 0  ; Read Main ID Register
mrc p15, 0, r1, c1, c0, 0  ; Read Control Register
mcr p15, 0, r2, c1, c0, 0  ; Write Control Register

; Performance monitoring
mrc p15, 0, r0, c9, c12, 0 ; Read Performance Monitor Control Register
mcr p15, 0, r1, c9, c12, 1 ; Write Performance Counter Enable Set
mrc p15, 0, r2, c9, c13, 0 ; Read Cycle Count Register

; Debug and trace support
mrc p14, 0, r0, c0, c0, 0  ; Read Debug ID Register
mcr p14, 0, r1, c0, c2, 2  ; Write Debug Status and Control Register (DBGDSCR)

Coprocessor instructions enable access to specialized functionality including floating-point operations, SIMD processing, and system management features. The coprocessor interface provides a standardized mechanism for extending ARM capabilities while maintaining instruction set compatibility.

Advanced Programming Techniques

NEON SIMD Programming

ARM NEON technology provides advanced SIMD (Single Instruction, Multiple Data) capabilities that enable parallel processing of multiple data elements in single instructions. NEON supports various data types including 8-bit, 16-bit, 32-bit, and 64-bit integers, as well as single-precision floating-point values.

arm

; NEON register usage
vld1.32 {d0, d1}, [r0]!    ; Load 4 32-bit values (128 bits) into q0, post-increment
vadd.i32 q0, q0, q1        ; Add 4 32-bit integers in parallel
vmul.f32 q2, q0, q1        ; Multiply 4 single-precision floats
vst1.32 {d4, d5}, [r1]!    ; Store 4 32-bit values (q2), post-increment

; Vector operations
vmov.i32 q0, #0            ; Initialize vector to zero
vdup.32 q1, r0             ; Duplicate scalar to all vector elements
vmax.s32 q2, q0, q1        ; Element-wise maximum
vmin.u16 d0, d1, d2        ; Element-wise minimum (unsigned 16-bit)

; Advanced NEON operations
vtbl.8 d0, {d1, d2}, d3    ; Table lookup
vzip.16 q0, q1             ; Interleave elements
vuzp.32 q2, q3             ; De-interleave elements
vrev64.8 q0, q1            ; Reverse elements within 64-bit lanes

NEON programming requires understanding of vector data types, lane operations, and memory alignment requirements. Effective use of NEON instructions can provide significant performance improvements for multimedia processing, signal processing, and mathematical computations.
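To illustrate what "lane operations" means, the Python sketch below models a 128-bit Q register as a list of four 32-bit lanes and mirrors three of the instructions shown earlier. It is a behavioural model only: the function names are hypothetical and nothing here reflects timing or alignment.

```python
MASK = 0xFFFFFFFF

def to_signed(lane):
    """Interpret a 32-bit lane as a signed integer."""
    return lane - (1 << 32) if lane & 0x80000000 else lane

def vdup_32(scalar):
    """Model of vdup.32 qd, rn: copy one scalar into all four lanes."""
    return [scalar & MASK] * 4

def vadd_i32(a, b):
    """Model of vadd.i32 qd, qn, qm: lane-wise add with wraparound."""
    return [(x + y) & MASK for x, y in zip(a, b)]

def vmax_s32(a, b):
    """Model of vmax.s32 qd, qn, qm: lane-wise signed maximum."""
    return [x if to_signed(x) >= to_signed(y) else y for x, y in zip(a, b)]

q0 = [1, 2, 3, 0xFFFFFFFF]
q1 = vdup_32(1)
print(vadd_i32(q0, q1))   # the last lane wraps around to 0
print(vmax_s32(q0, q1))   # 0xFFFFFFFF is -1 when treated as signed
```

Each lane is computed independently, so no carry propagates from one lane into the next.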

Thumb Instruction Set

The Thumb instruction set provides 16-bit instructions that improve code density while maintaining most ARM functionality. Thumb code is often cited as roughly 30-40% smaller than equivalent ARM code, making it valuable for memory-constrained applications.

arm

; Thumb instruction examples (syntax similar to ARM)
.thumb                     ; Switch to Thumb mode
mov r0, #10               ; 16-bit instruction
add r1, r0, r2            ; 16-bit instruction
ldr r3, [r4, #8]          ; 16-bit instruction with limited offset
bl function_name          ; 32-bit Thumb instruction

; Mixed ARM/Thumb programming
.arm                      ; ARM mode
bx r0                     ; Branch and exchange to address in r0
                          ; (can switch to Thumb if bit 0 set)

.thumb
add r1, r1, #1            ; Thumb instruction
bx lr                     ; Return (may switch back to ARM)

Thumb-2 technology extends the Thumb instruction set with 32-bit instructions that provide ARM-equivalent functionality while maintaining code density benefits. The ability to mix ARM and Thumb code enables optimization for both performance and code size requirements.
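The instruction set selected by `bx` depends on bit 0 of the target address. The Python sketch below models that decision. The function name is hypothetical, and a target with bits [1:0] equal to 0b10 is rejected here because its behaviour is not defined for a branch into ARM state.

```python
def bx(target):
    """Model of `bx rm`: return (new_pc, instruction_set)."""
    target &= 0xFFFFFFFF
    if target & 1:
        # Bit 0 set selects Thumb state; the bit itself is not part of the address.
        return target & ~1 & 0xFFFFFFFF, "Thumb"
    if target & 2:
        raise ValueError("ARM-state target must be word aligned")
    return target, "ARM"

print(bx(0x8001))   # enters Thumb code at 0x8000
print(bx(0x9000))   # enters ARM code at 0x9000
```

This is why pointers to Thumb functions have their low bit set: the linker or assembler records the target's instruction set in the address itself.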

Optimization Techniques and Performance

ARM assembly optimization requires understanding of processor pipeline characteristics, memory hierarchy behavior, and instruction scheduling considerations. Many higher-performance ARM cores employ out-of-order execution, while smaller cores execute strictly in order, which makes instruction scheduling matter more. In both cases, careful instruction selection and data layout can still provide significant performance benefits.

arm

; Loop optimization techniques
; Unrolled loop for better throughput
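ldr r0, =array_base       ; Array address (literal load; rarely fits an immediate)
ldr r1, =count            ; Element count, assumed to be a multiple of 4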
unrolled_loop:
    ldr r2, [r0], #4      ; Load element 1
    ldr r3, [r0], #4      ; Load element 2  
    ldr r4, [r0], #4      ; Load element 3
    ldr r5, [r0], #4      ; Load element 4
    ; Process 4 elements
    subs r1, r1, #4       ; Decrement counter by 4
    bgt unrolled_loop     ; Continue if more elements

; Conditional execution for branch elimination
cmp r0, r1
movlt r2, r0              ; r2 = min(r0, r1)
movge r2, r1
movlt r3, r1              ; r3 = max(r0, r1)
movge r3, r0

; Efficient bit manipulation
and r0, r1, #0xFF         ; Extract low byte
orr r0, r0, r2, lsl #8    ; Insert byte at position
bic r0, r0, #0xF0         ; Clear specific bits

Performance optimization on ARM requires balancing instruction count, memory access patterns, and pipeline efficiency. The conditional execution capability can eliminate branches and improve instruction throughput, while careful use of addressing modes can reduce instruction count and improve cache utilization.

The ARM assembly language provides a powerful and efficient foundation for embedded systems programming, mobile application development, and system-level software implementation. Its RISC design philosophy, conditional execution capabilities, and comprehensive instruction set enable developers to create high-performance, energy-efficient applications across a wide range of computing platforms. Mastery of ARM assembly programming opens opportunities for embedded systems development, mobile platform optimization, security research, and system programming that require direct hardware control and optimal resource utilization. The architecture's continued evolution and widespread adoption ensure its relevance for future computing challenges while maintaining the simplicity and efficiency that have made ARM the dominant platform for mobile and embedded computing.
