AArch64 Assembly Language (ARM64)
AArch64 is the 64-bit execution state of the ARM architecture, introduced with ARMv8-A. Its instruction set, A64, is a redesign rather than an extension of 32-bit ARM: it keeps the RISC design philosophy and the power efficiency that made ARM dominant in mobile computing while extending the architecture to high-performance computing, server applications, and emerging workloads. 64-bit addressing lets applications use far larger memory spaces and datasets while retaining the energy efficiency that matters for battery-powered devices and data centers alike. Understanding AArch64 assembly language is essential for developers working on modern ARM-based systems, including Apple Silicon Macs, AWS Graviton servers, mobile devices, and embedded systems. This reference covers AArch64 assembly programming from the architectural changes relative to 32-bit ARM through to optimization techniques for modern 64-bit ARM processors.
Architectural Evolution and 64-bit Enhancements
Transition from 32-bit ARM to AArch64
The development of AArch64 represented a fundamental reimagining of the ARM architecture rather than a simple extension of the existing 32-bit design. While maintaining backward compatibility through the AArch32 execution state, AArch64 introduced a completely new instruction set (A64) that addressed limitations of the 32-bit architecture while incorporating lessons learned from decades of ARM processor development. The transition enabled ARM to compete effectively in high-performance computing markets while maintaining the power efficiency advantages that had made ARM successful in mobile and embedded applications.
The architectural changes in AArch64 extend far beyond simple register width expansion, encompassing instruction encoding improvements, enhanced SIMD capabilities, simplified exception handling, and modernized system programming interfaces. The design team took the opportunity to remove legacy features that complicated implementation while adding capabilities that enable efficient execution of modern software workloads. This clean-slate approach resulted in an architecture that is both more powerful and simpler to implement than its 32-bit predecessors.
Design Philosophy and Implementation Goals
AArch64 embodies a refined RISC philosophy that emphasizes simplicity, orthogonality, and performance scalability. The instruction set design prioritizes regular encoding patterns that simplify processor implementation while providing comprehensive computational capabilities. Unlike 32-bit ARM, where almost every instruction could be executed conditionally, A64 confines conditional behavior to branches and a small group of conditional select and compare instructions (such as CSEL and CCMP), which simplifies instruction encoding and enables more efficient processor implementations.
aarch64
// AArch64 instruction examples showing simplified encoding
mov x0, #42 // Load immediate value into 64-bit register
add x1, x0, x2 // Add two 64-bit registers
ldr x3, [x4, #8] // Load from memory with offset
str x5, [x6], #16 // Store with post-increment
// Conditional execution using branches instead of predication
cmp x0, x1 // Compare two registers
b.eq equal_label // Branch if equal
b.ne not_equal_label // Branch if not equal
b.lt less_than_label // Branch if less than
The architectural philosophy emphasizes providing powerful instructions that map efficiently to common programming patterns while maintaining implementation simplicity. The instruction set includes enhanced addressing modes, improved immediate value handling, and specialized instructions for common operations that enable compilers to generate highly efficient code.
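Alongside conditional branches, A64 keeps a few conditional data-processing instructions such as CSEL and the CSET alias, which cover many short if/else patterns without a branch. The following is a minimal sketch in GNU assembler syntax; the function names `max_s64` and `is_zero` are hypothetical and chosen only for illustration.

```aarch64
    .text

// max_s64 (hypothetical): return the larger of two signed 64-bit values
// in:  x0, x1    out: x0
    .global max_s64
max_s64:
    cmp   x0, x1            // set flags from x0 - x1
    csel  x0, x0, x1, gt    // x0 = (x0 > x1) ? x0 : x1
    ret

// is_zero (hypothetical): return 1 if x0 == 0, otherwise 0
    .global is_zero
is_zero:
    cmp   x0, #0            // set flags from x0 - 0
    cset  x0, eq            // x0 = 1 if the Z flag is set, else 0
    ret
```

Both routines take their arguments in x0 and x1 and return the result in x0, so they could be called from C given matching prototypes.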
Execution States and Compatibility
ARMv8-A processors can operate in two execution states that provide different capabilities and compatibility levels. The AArch64 execution state provides native 64-bit operation with the A64 instruction set, while the AArch32 execution state, where a processor implements it, maintains compatibility with existing 32-bit ARM code through support for both the A32 (ARM) and T32 (Thumb) instruction sets.
aarch64
// AArch64 native execution
.text
.global _start
_start:
mov x0, #1 // Exit status, passed in a 64-bit register
mov x8, #93 // Linux system call number (exit)
svc #0 // Supervisor call
// Exception level check (CurrentEL is readable only at EL1 or higher)
mrs x0, CurrentEL // Read current exception level
lsr x0, x0, #2 // Extract EL field (bits [3:2])
cmp x0, #1 // Compare with EL1
b.eq kernel_mode // Branch if in kernel mode
The exception level model in AArch64 provides four privilege levels (EL0-EL3) that enable secure system design and virtualization support. EL0 provides unprivileged user mode execution, EL1 supports operating system kernels, EL2 enables hypervisor implementation, and EL3 provides secure monitor functionality for TrustZone security extensions.
Register Architecture and Organization
General-Purpose Register Expansion
AArch64 provides thirty-one 64-bit general-purpose registers (X0-X30) plus a dedicated zero register and stack pointer, representing a significant expansion from the sixteen registers available in 32-bit ARM. This expanded register set addresses one of the primary limitations of 32-bit ARM programming, where register pressure often forced frequent memory access and limited optimization opportunities.
aarch64
// 64-bit register operations
ldr x0, =0x123456789ABCDEF0 // Load 64-bit constant via literal pool (mov accepts only a limited set of immediates)
add x1, x2, x3 // Add two 64-bit registers
mul x4, x5, x6 // Multiply two 64-bit registers
// 32-bit register views (W registers)
mov w0, #42 // Load into 32-bit view (clears upper 32 bits)
add w1, w2, w3 // 32-bit addition
ldr w4, [x5] // Load 32-bit value
// Register naming and relationships
// X0-X30: 64-bit general-purpose registers
// W0-W30: 32-bit views of X registers (lower 32 bits)
// XZR/WZR: Zero register (reads as 0, writes ignored)
// SP: Stack pointer (dedicated register)
The register naming convention provides a clear distinction between 64-bit (X) and 32-bit (W) operations, with 32-bit operations automatically clearing the upper 32 bits of the target register. This behavior means a W-register write never depends on the register's previous upper half, and it gives clean semantics for mixed-size operations.
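The effect of W-register writes on the full X register can be traced with a few instructions. This is a small illustrative sketch; the register choices are arbitrary and the values in the comments follow from the architectural rule that a 32-bit write zeroes bits [63:32].

```aarch64
// Sketch: 32-bit writes and explicit extension
    mov   x0, #-1            // x0 = 0xFFFFFFFFFFFFFFFF
    mov   w0, w0             // 32-bit write: x0 = 0x00000000FFFFFFFF
    sxtw  x1, w0             // sign-extend low 32 bits: x1 = 0xFFFFFFFFFFFFFFFF
    mov   x4, #0             // x4 = 0, used as a neutral base below
    add   x2, x4, w0, uxtw   // zero-extend w0, then add: x2 = 0x00000000FFFFFFFF
    add   x3, x4, w0, sxtw   // sign-extend w0, then add: x3 = 0xFFFFFFFFFFFFFFFF
```

The extended-register forms of ADD shown in the last two lines are useful when a 32-bit index has to be combined with a 64-bit base address.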
Special-Purpose Registers and System State
AArch64 has a dedicated stack pointer (SP), and the program counter (PC) is no longer a general-purpose register as R15 was in 32-bit ARM; it can be read only indirectly, for example through ADR or a branch with link. The link register is retained: general-purpose register X30 holds return addresses and is also referred to as LR. The architecture provides comprehensive system state access through system registers that control processor behavior, memory management, and security features.
aarch64
// Stack pointer operations
mov sp, x0 // Set stack pointer
add sp, sp, #16 // Adjust stack pointer
ldr x1, [sp, #8] // Load from stack with offset
// Return address handling
bl function_name // Branch with link (saves return address in X30)
ret // Return using X30
ret x5 // Return using specified register
// System register access
mrs x0, MIDR_EL1 // Read Main ID Register
mrs x1, MPIDR_EL1 // Read Multiprocessor Affinity Register
msr TTBR0_EL1, x2 // Write Translation Table Base Register
The system register interface provides access to processor identification, configuration, and control registers through a unified naming scheme that includes the target exception level. This organization simplifies system programming and enables precise control over processor behavior at different privilege levels.
Vector and SIMD Register Architecture
AArch64 provides thirty-two 128-bit vector registers (V0-V31) that support advanced SIMD operations and floating-point arithmetic. These registers can be accessed at different granularities (B, H, S, D, Q) to support various data types and vector operations, providing significantly enhanced parallel processing capabilities compared to 32-bit ARM NEON.
aarch64
// Vector register access modes
// V0-V31: 128-bit vector registers
// Q0-Q31: 128-bit quadword view
// D0-D31: 64-bit doubleword view
// S0-S31: 32-bit single word view
// H0-H31: 16-bit halfword view
// B0-B31: 8-bit byte view
// SIMD operations
ld1 {v0.4s}, [x0] // Load 4 single-precision floats
fadd v1.4s, v0.4s, v2.4s // Add 4 floats in parallel (add would perform integer addition)
fmul v3.2d, v1.2d, v2.2d // Multiply 2 double-precision floats
st1 {v3.2d}, [x1] // Store 2 doubles
// Scalar floating-point operations
fadd d0, d1, d2 // Add two double-precision values
fmul s3, s4, s5 // Multiply two single-precision values
fcvt d6, s7 // Convert single to double precision
The vector register architecture supports both scalar floating-point operations and advanced SIMD processing with comprehensive data type support. The unified register file simplifies programming and enables efficient data movement between scalar and vector operations.
Instruction Set Architecture and Encoding
Instruction Format and Encoding Improvements
AArch64 uses fixed 32-bit instruction encoding that provides regular patterns and simplified decoding compared to the variable-length encodings found in some other architectures. The instruction format eliminates the conditional execution field present in 32-bit ARM, instead providing dedicated conditional branch instructions that enable more efficient processor implementations.
aarch64
// Regular instruction encoding patterns
add x0, x1, x2 // Register-register addition
add x0, x1, #100 // Register-immediate addition
ldr x0, [x1, #8] // Load with immediate offset
ldr x0, [x1, x2, lsl #3] // Load with scaled register offset
// Immediate value handling
mov x0, #0xFFFF // 16-bit immediate with optional shift
movk x0, #0x1234, lsl #16 // Insert 16-bit value at specific position
movz x1, #42 // Zero remaining bits
movn x2, #0 // Move NOT immediate
The instruction encoding provides consistent patterns across different instruction types, enabling efficient instruction decode and simplifying processor implementation. The immediate value handling supports construction of arbitrary 64-bit constants through a sequence of move instructions with different shift amounts.
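As an illustration of the move sequence described above, the following sketch builds the constant 0x123456789ABCDEF0 in four instructions, one 16-bit chunk at a time. The constant itself is arbitrary; the comments show the register contents after each step.

```aarch64
// Sketch: build a 64-bit constant 16 bits at a time
    movz  x0, #0xDEF0              // x0 = 0x000000000000DEF0 (other bits zeroed)
    movk  x0, #0x9ABC, lsl #16     // x0 = 0x000000009ABCDEF0 (other bits kept)
    movk  x0, #0x5678, lsl #32     // x0 = 0x000056789ABCDEF0
    movk  x0, #0x1234, lsl #48     // x0 = 0x123456789ABCDEF0
```

Assemblers such as GNU as also accept `ldr x0, =0x123456789ABCDEF0`, which places the constant in a literal pool and loads it with a single PC-relative load; which form is faster depends on the processor and on cache behavior.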
Enhanced Addressing Modes
AArch64 provides sophisticated addressing modes that enable efficient access to various data structures while maintaining implementation simplicity. The addressing modes include immediate offsets, register offsets with optional scaling, and pre/post-indexed addressing that supports efficient pointer manipulation.
aarch64
// Basic addressing modes
ldr x0, [x1] // Base register addressing
ldr x0, [x1, #8] // Base plus immediate offset
ldr x0, [x1, x2] // Base plus register offset
ldr x0, [x1, x2, lsl #3] // Base plus scaled register offset
// Pre-indexed and post-indexed addressing
ldr x0, [x1, #8]! // Load with pre-increment
ldr x0, [x1], #8 // Load with post-increment
str x0, [x1, #-16]! // Store with pre-decrement
str x0, [x1], #16 // Store with post-increment
// PC-relative addressing
adr x0, label // Load address relative to PC
adrp x1, symbol // Load page address relative to PC
ldr x2, [x1, #:lo12:symbol] // Load from page offset
The PC-relative addressing modes enable position-independent code generation and efficient access to global data and function addresses. The ADRP instruction loads the page address of a symbol, while subsequent instructions can access specific offsets within that page.
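A common use of this pairing is reaching a global variable. The sketch below assumes GNU assembler syntax on an ELF target such as Linux; the symbol `counter` and the function `bump_counter` are hypothetical. Apple's toolchain spells the same relocations as `counter@PAGE` and `counter@PAGEOFF`.

```aarch64
    .data
counter:                           // hypothetical 64-bit global variable
    .quad 0

    .text
    .global bump_counter
// bump_counter (hypothetical): increment counter and return the new value in x0
bump_counter:
    adrp  x0, counter              // x0 = address of the 4 KB page holding counter
    add   x0, x0, :lo12:counter    // add the low 12 bits: x0 = &counter
    ldr   x1, [x0]                 // load current value
    add   x1, x1, #1               // increment
    str   x1, [x0]                 // store it back
    mov   x0, x1                   // return the new value
    ret
```

Because ADRP works relative to the PC, the sequence needs no load-time fix-up of absolute addresses and can reach symbols within ±4 GB of the instruction.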
Data Processing and Arithmetic Instructions
AArch64 provides comprehensive arithmetic and logical operations that support both 32-bit and 64-bit operands. The instruction set includes enhanced immediate value support, optional condition flag setting, and specialized instructions for common operations that enable efficient code generation.
aarch64
// Basic arithmetic operations
add x0, x1, x2 // Add two 64-bit registers
adds x0, x1, x2 // Add and set condition flags
adc x0, x1, x2 // Add with carry
sub x0, x1, x2 // Subtract
subs x0, x1, x2 // Subtract and set flags
mul x0, x1, x2 // Multiply (low 64 bits)
smulh x0, x1, x2 // Signed multiply high
umulh x0, x1, x2 // Unsigned multiply high
// Logical operations
and x0, x1, x2 // Bitwise AND
orr x0, x1, x2 // Bitwise OR
eor x0, x1, x2 // Bitwise XOR
bic x0, x1, x2 // Bit clear (AND NOT)
orn x0, x1, x2 // OR NOT
eon x0, x1, x2 // XOR NOT
// Shift and rotate operations
lsl x0, x1, #4 // Logical shift left
lsr x0, x1, #8 // Logical shift right
asr x0, x1, #12 // Arithmetic shift right
ror x0, x1, #16 // Rotate right
The arithmetic instructions provide both 32-bit and 64-bit variants with consistent naming conventions. The optional condition flag setting enables efficient implementation of conditional operations without requiring separate comparison instructions in many cases.
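One place where flag-setting arithmetic pays off is multi-word arithmetic. The following sketch adds and subtracts 128-bit values held in register pairs; the routine names `add128` and `sub128` and the register assignment (x1:x0 and x3:x2 as high:low halves) are assumptions made for this example.

```aarch64
    .text

// add128 (hypothetical): (x1:x0) = (x1:x0) + (x3:x2)
    .global add128
add128:
    adds  x0, x0, x2        // add low halves, carry-out goes to the C flag
    adc   x1, x1, x3        // add high halves plus the carry
    ret

// sub128 (hypothetical): (x1:x0) = (x1:x0) - (x3:x2)
    .global sub128
sub128:
    subs  x0, x0, x2        // subtract low halves, C = 1 means no borrow
    sbc   x1, x1, x3        // subtract high halves, minus 1 if a borrow occurred
    ret
```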
Control Flow and Program Structure
Branch Instructions and Conditional Execution
AArch64 replaces the conditional execution model of 32-bit ARM with dedicated conditional branch instructions that provide cleaner instruction encoding and more efficient processor implementation. The branch instructions support the full set of condition codes and differ in range: B.cond, CBZ, and CBNZ reach ±1 MB, TBZ and TBNZ reach ±32 KB, B and BL reach ±128 MB, and the register forms BR and BLR can reach any address.
aarch64
// Conditional branches
cmp x0, x1 // Compare two registers
b.eq equal_label // Branch if equal
b.ne not_equal_label // Branch if not equal
b.lt less_than_label // Branch if less than (signed)
b.gt greater_than_label // Branch if greater than (signed)
b.lo below_label // Branch if below (unsigned)
b.hi above_label // Branch if above (unsigned)
// Unconditional branches
b target_label // Branch to label
bl function_name // Branch with link
br x0 // Branch to register
blr x1 // Branch with link to register
ret // Return (branches to x30 and hints a subroutine return to the predictor)
// Compare and branch
cbz x0, zero_label // Compare and branch if zero
cbnz x1, nonzero_label // Compare and branch if not zero
tbz x2, #5, bit_clear // Test bit and branch if zero
tbnz x3, #10, bit_set // Test bit and branch if not zero
The compare-and-branch instructions enable efficient implementation of common conditional patterns without requiring separate comparison and branch instructions. The test-bit-and-branch instructions provide efficient bit testing capabilities for flag processing and bit manipulation algorithms.
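The two routines below sketch how these instructions shorten typical code: a string-length loop that needs no separate compare, and a sign test on a single bit. The names `my_strlen` and `is_negative` are hypothetical; numeric labels such as `1:` with references `1b` and `1f` are GNU assembler local labels.

```aarch64
    .text

// my_strlen (hypothetical): x0 = pointer to NUL-terminated string
// returns the length in bytes in x0
    .global my_strlen
my_strlen:
    mov   x1, x0            // x1 walks the string, x0 keeps the start
1:  ldrb  w2, [x1], #1      // load one byte, then advance the pointer
    cbnz  w2, 1b            // keep going until the NUL byte is read
    sub   x0, x1, x0        // bytes consumed, including the NUL
    sub   x0, x0, #1        // exclude the NUL from the length
    ret

// is_negative (hypothetical): return 1 if signed x0 < 0, otherwise 0
    .global is_negative
is_negative:
    tbnz  x0, #63, 1f       // bit 63 is the sign bit
    mov   x0, #0
    ret
1:  mov   x0, #1
    ret
```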
Loop Constructs and Iteration Patterns
AArch64 supports efficient loop implementation through various instruction combinations and addressing modes. The architecture's enhanced register set and addressing capabilities enable highly optimized loop constructs that minimize instruction count and maximize throughput.
aarch64
// Simple counting loop
mov x0, #100 // Initialize counter
loop_start:
// Loop body instructions
subs x0, x0, #1 // Decrement and set flags
b.ne loop_start // Continue if not zero
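// Array processing with post-increment
adrp x0, array_base // Page address of array start
add x0, x0, :lo12:array_base // Add offset within the page
adrp x1, array_end // Page address of array end
add x1, x1, :lo12:array_end // Add offset within the page
process_loop:
ldr x2, [x0], #8 // Load and increment pointer
// Process element in x2
cmp x0, x1 // Check for end
b.lo process_loop // Continue if not at end (unsigned address compare)
// Vectorized loop with SIMD
// Assumes x2 = output pointer and v1 = constant vector, set up beforehand
adrp x0, vector_array // Page address of vector array
add x0, x0, :lo12:vector_array // Add offset within the page
mov x1, #element_count // Number of 4-float vectors (assemble-time constant)
vector_loop:
ld1 {v0.4s}, [x0], #16 // Load 4 floats, increment pointer
fmul v0.4s, v0.4s, v1.4s // Multiply by constant vector
st1 {v0.4s}, [x2], #16 // Store result, increment pointer
subs x1, x1, #1 // Decrement counter
b.ne vector_loop // Continue if more elements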
The post-indexed addressing modes enable efficient pointer-based loops where address calculation and memory access occur in single instructions. SIMD instructions can process multiple data elements per iteration, providing significant performance improvements for suitable algorithms.
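Put together as a callable routine, a pointer-based loop might look like the following sketch. The function name `sum_u64` and its argument layout (x0 = array pointer, x1 = element count) are assumptions for illustration; the zero-length case is handled before the loop so that the count-down test stays simple.

```aarch64
    .text
    .global sum_u64
// sum_u64 (hypothetical): x0 = pointer to 64-bit elements, x1 = element count
// returns the sum (modulo 2^64) in x0
sum_u64:
    mov   x2, #0            // running sum
    cbz   x1, 2f            // empty array: skip the loop
1:  ldr   x3, [x0], #8      // load element, then advance pointer by 8
    add   x2, x2, x3        // accumulate
    subs  x1, x1, #1        // count down and set flags
    b.ne  1b                // loop while elements remain
2:  mov   x0, x2            // return value
    ret
```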
Function Calls and Procedure Linkage
AArch64 follows the Procedure Call Standard for the Arm 64-bit Architecture (AAPCS64), which defines consistent parameter passing, register usage, and stack management conventions. The calling convention takes advantage of the expanded register set to pass more parameters in registers, reducing stack traffic and improving function call performance.
aarch64
// Function call parameter passing
// X0-X7: Parameter and result registers
// X8: Indirect result location register
// X9-X15: Temporary registers
// X16-X17: Intra-procedure-call temporary registers
// X18: Platform register (reserved)
// X19-X28: Callee-saved registers
// X29: Frame pointer
// X30: Link register
// Function call sequence
mov x0, #param1 // First parameter
mov x1, #param2 // Second parameter
mov x2, #param3 // Third parameter
bl function_name // Call function
// Return value in X0
// Function prologue
function_name:
stp x29, x30, [sp, #-16]! // Save frame pointer and link register
mov x29, sp // Set up frame pointer
sub sp, sp, #32 // Allocate local variable space
// Save callee-saved registers if used
stp x19, x20, [sp, #16] // Save registers to stack
// Function body
add x0, x0, x1 // Use parameters
str x0, [sp, #8] // Store local variable
// Function epilogue
ldp x19, x20, [sp, #16] // Restore callee-saved registers
add sp, sp, #32 // Deallocate local variables
ldp x29, x30, [sp], #16 // Restore frame pointer and link register
ret // Return to caller
The calling convention specifies that the first eight integer or pointer arguments are passed in registers X0-X7 (floating-point and vector arguments use V0-V7), with additional arguments passed on the stack. The expanded register set enables more efficient function calls with reduced stack manipulation compared to 32-bit ARM.
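A non-leaf function has to preserve X30 and any callee-saved registers it uses across its own calls. The sketch below shows one way to do that; `sum_of_squares` and `square` are hypothetical functions, and the 32-byte frame is chosen so that SP stays 16-byte aligned.

```aarch64
    .text
    .global sum_of_squares
// sum_of_squares (hypothetical): x0 = a, x1 = b; returns a*a + b*b in x0
sum_of_squares:
    stp   x29, x30, [sp, #-32]!   // push frame record, SP stays 16-byte aligned
    mov   x29, sp                 // set up frame pointer
    stp   x19, x20, [sp, #16]     // save the callee-saved registers used below
    mov   x19, x1                 // keep b alive across the first call
    bl    square                  // x0 = a*a (x30 is overwritten here)
    mov   x20, x0                 // keep a*a alive across the second call
    mov   x0, x19                 // argument: b
    bl    square                  // x0 = b*b
    add   x0, x0, x20             // result = b*b + a*a
    ldp   x19, x20, [sp, #16]     // restore callee-saved registers
    ldp   x29, x30, [sp], #32     // restore frame pointer and return address
    ret

// square (hypothetical leaf function): returns x0 * x0 (low 64 bits)
square:
    mul   x0, x0, x0
    ret
```

The leaf function needs no frame at all, because it makes no calls and touches only its argument register.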
Memory Management and System Programming
Virtual Memory and Address Translation
AArch64 implements a sophisticated virtual memory system that supports multiple page sizes, multiple address spaces, and advanced memory management features. The base architecture provides up to 48-bit virtual addresses (later extensions add 52-bit addressing) and supports 4KB, 16KB, and 64KB translation granules, enabling flexible memory management strategies.
aarch64
// Translation table base register setup
// Assumes ttb_address is a 4KB-aligned table and x1 holds the kernel table base
adrp x0, ttb_address // Translation table base address (page-aligned)
msr TTBR0_EL1, x0 // Set user space translation table
msr TTBR1_EL1, x1 // Set kernel space translation table
// Memory attribute configuration
ldr x0, =mair_value // Memory attribute indirection register value (64-bit constant)
msr MAIR_EL1, x0 // Set memory attributes
// Translation control register
ldr x0, =tcr_value // Translation control register value (64-bit constant)
msr TCR_EL1, x0 // Configure address translation
// TLB maintenance
tlbi vmalle1 // Invalidate all TLB entries for EL1
tlbi vaae1, x0 // Invalidate TLB entry by address (x0 must hold VA >> 12)
dsb sy // Data synchronization barrier
isb // Instruction synchronization barrier
The memory management system provides separate translation tables for user and kernel address spaces, enabling efficient context switching and memory protection. The memory attribute system supports various caching and shareability policies that enable optimization for different memory types and usage patterns.
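To make the translation walk concrete, the sketch below splits a virtual address into its table indices. It assumes one specific configuration, a 4KB granule with 48-bit virtual addresses and four levels of lookup, in which each level is indexed by 9 bits; other granules and address sizes use different field boundaries.

```aarch64
// Sketch: decompose the virtual address in x0 (4KB granule, 48-bit VA assumed)
    ubfx  x1, x0, #39, #9     // level 0 table index = VA[47:39]
    ubfx  x2, x0, #30, #9     // level 1 table index = VA[38:30]
    ubfx  x3, x0, #21, #9     // level 2 table index = VA[29:21]
    ubfx  x4, x0, #12, #9     // level 3 table index = VA[20:12]
    and   x5, x0, #0xFFF      // byte offset within the 4KB page = VA[11:0]
```

Each index selects one of 512 eight-byte descriptors, which is why a table at every level occupies exactly one 4KB page in this configuration.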
Exception Handling and System Calls
AArch64 provides a streamlined exception handling model with four exception levels and comprehensive exception vector tables. The exception handling mechanism automatically saves minimal processor state and provides efficient transitions between privilege levels.
aarch64
// Exception vector table (simplified)
.align 11 // Vector table must be 2KB aligned
exception_vectors:
// Current EL with SP_EL0
b sync_current_el_sp0 // Synchronous exception
.align 7
b irq_current_el_sp0 // IRQ interrupt
.align 7
b fiq_current_el_sp0 // FIQ interrupt
.align 7
b serror_current_el_sp0 // System error
.align 7
// Current EL with SP_ELx
b sync_current_el_spx // Synchronous exception
.align 7
b irq_current_el_spx // IRQ interrupt
// ... additional vectors
// System call dispatch (saving and restoring x0-x30 omitted for brevity; blr overwrites x30)
svc_handler:
// System call number in X8
// Parameters in X0-X7
cmp x8, #__NR_syscalls // Check system call number
b.hs invalid_syscall // Branch if invalid
adr x9, sys_call_table // Load system call table address
ldr x9, [x9, x8, lsl #3] // Load function pointer
blr x9 // Call system call handler
eret // Exception return
The exception handling model provides automatic saving of minimal state (SPSR and ELR) while requiring explicit saving of general-purpose registers. This approach enables efficient exception handling while providing flexibility for different exception types.
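A synchronous exception handler usually starts by reading the syndrome register to find out what happened. The following sketch decodes the exception class from ESR_EL1; the labels `sync_dispatch`, `handle_svc`, and `unexpected_exception` are hypothetical, and the code assumes that the vector stub has already saved the general-purpose registers and called this routine with BL.

```aarch64
    .text
// sync_dispatch (hypothetical): classify a synchronous exception taken to EL1
sync_dispatch:
    mrs   x9, ESR_EL1             // read the Exception Syndrome Register
    ubfx  x10, x9, #26, #6        // EC (exception class) = ESR_EL1[31:26]
    cmp   x10, #0x15              // 0x15 = SVC executed in AArch64 state
    b.eq  handle_svc
    b     unexpected_exception

handle_svc:                       // placeholder: dispatch on x8 would go here
    ret

unexpected_exception:             // placeholder: park the core
    wfe
    b     unexpected_exception
```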
Cache and Memory Ordering
AArch64 provides comprehensive cache management and memory ordering capabilities that enable efficient implementation of multi-processor systems and device drivers. The architecture supports various cache maintenance operations and memory barrier instructions that ensure correct program behavior in complex memory hierarchies.
aarch64
// Cache maintenance operations
dc civac, x0 // Clean and invalidate data cache by address
dc cvac, x1 // Clean data cache by address
ic ivau, x2 // Invalidate instruction cache by address
dc zva, x3 // Zero a block of memory by address (block size given by DCZID_EL0)
// Memory barriers
dmb sy // Data memory barrier (system)
dmb ish // Data memory barrier (inner shareable)
dsb sy // Data synchronization barrier (system)
dsb ish // Data synchronization barrier (inner shareable)
isb // Instruction synchronization barrier
// Atomic operations
ldxr x0, [x1] // Load exclusive
stxr w2, x3, [x1] // Store exclusive (returns status)
clrex // Clear exclusive monitor
// Load-acquire and store-release
ldar x0, [x1] // Load acquire
stlr x2, [x3] // Store release
The memory ordering model provides acquire-release semantics that enable efficient implementation of synchronization primitives without requiring full memory barriers. The exclusive access instructions support atomic operations and lock-free programming techniques.
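The exclusive and acquire/release instructions are typically combined into a retry loop. The sketch below implements an atomic add this way; the routine name `atomic_add` and its argument layout are assumptions for illustration. Processors that implement the ARMv8.1 atomic instructions can perform the same operation with a single LDADDAL.

```aarch64
    .text
    .global atomic_add
// atomic_add (hypothetical): x0 = address of a 64-bit counter, x1 = addend
// returns the updated value in x0
atomic_add:
1:  ldaxr x2, [x0]          // load-acquire exclusive: read and mark the address
    add   x2, x2, x1        // compute the new value
    stlxr w3, x2, [x0]      // store-release exclusive: w3 = 0 on success
    cbnz  w3, 1b            // another observer intervened, so retry
    mov   x0, x2            // return the value that was stored
    ret
```

The status register of STLXR (w3 here) must be different from both the data register and the base register.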
Advanced Programming Techniques
Advanced SIMD and Vector Processing
AArch64 provides significantly enhanced SIMD capabilities compared to 32-bit ARM, with support for various data types, advanced vector operations, and efficient data movement between scalar and vector registers. The vector instruction set enables high-performance implementation of multimedia, signal processing, and mathematical algorithms.
aarch64
// Vector load and store operations
ld1 {v0.16b}, [x0] // Load 16 bytes
ld1 {v1.8h}, [x1] // Load 8 halfwords
ld1 {v2.4s}, [x2] // Load 4 words
ld1 {v3.2d}, [x3] // Load 2 doublewords
ld1 {v4.4s, v5.4s}, [x4] // Load 8 words into two registers
// Vector arithmetic operations
add v0.16b, v1.16b, v2.16b // Add 16 bytes
mul v3.8h, v4.8h, v5.8h // Multiply 8 halfwords
fmul v6.4s, v7.4s, v8.4s // Multiply 4 single-precision floats
fadd v9.2d, v10.2d, v11.2d // Add 2 double-precision floats
// Advanced vector operations
tbl v0.16b, {v1.16b}, v2.16b // Table lookup
zip1 v3.8h, v4.8h, v5.8h // Interleave lower elements
zip2 v6.8h, v7.8h, v8.8h // Interleave upper elements
rev64 v9.16b, v10.16b // Reverse bytes in 64-bit lanes
// Reduction operations
addv h0, v1.8h // Add across vector (horizontal add)
fmaxv s2, v3.4s // Maximum across vector
saddlv h4, v5.16b // Sum 16 signed bytes into a widened 16-bit result
The vector instruction set supports lane-wise operations, cross-lane operations, and data reorganization instructions that enable efficient implementation of complex algorithms. The ability to operate on multiple data types within the same instruction stream provides flexibility for mixed-precision computations.
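Lane-wise and cross-lane operations are often used together: a loop accumulates in parallel lanes and a final reduction combines them. The sketch below sums an array of single-precision floats in this way. The name `sum_f32` is hypothetical, and the routine assumes that the element count is a nonzero multiple of 4.

```aarch64
    .text
    .global sum_f32
// sum_f32 (hypothetical): x0 = pointer to floats, x1 = count (nonzero multiple of 4)
// returns the sum in s0
sum_f32:
    movi  v0.4s, #0               // four lane accumulators, all +0.0
1:  ld1   {v1.4s}, [x0], #16      // load 4 floats, advance pointer
    fadd  v0.4s, v0.4s, v1.4s     // lane-wise accumulate
    subs  x1, x1, #4              // 4 elements consumed
    b.gt  1b                      // loop while elements remain
    faddp v0.4s, v0.4s, v0.4s     // pairwise add: lanes become [a+b, c+d, a+b, c+d]
    faddp s0, v0.2s               // add the two low lanes: (a+b) + (c+d)
    ret
```

Because floating-point addition is not associative, the result can differ in the last bits from a strictly sequential sum.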
Cryptographic Extensions
AArch64 includes optional cryptographic extensions that provide hardware acceleration for common cryptographic algorithms including AES, SHA, and polynomial multiplication. These extensions enable high-performance implementation of security protocols and cryptographic applications.
aarch64
// AES encryption operations
aese v0.16b, v1.16b // AES single round encryption
aesmc v2.16b, v0.16b // AES mix columns
aesd v3.16b, v4.16b // AES single round decryption
aesimc v5.16b, v3.16b // AES inverse mix columns
// SHA hash operations
sha1h s0, s1 // SHA1 fixed rotate
sha1c q0, s2, v3.4s // SHA1 hash update (choose)
sha1p q4, s5, v6.4s // SHA1 hash update (parity)
sha1m q7, s8, v9.4s // SHA1 hash update (majority)
// SHA256 operations
sha256h q0, q1, v2.4s // SHA256 hash update (part 1)
sha256h2 q3, q4, v5.4s // SHA256 hash update (part 2)
sha256su0 v6.4s, v7.4s // SHA256 schedule update 0
sha256su1 v8.4s, v9.4s, v10.4s // SHA256 schedule update 1
The cryptographic extensions provide significant performance improvements for security-critical applications and enable efficient implementation of protocols such as TLS and IPsec, as well as disk encryption. The instructions operate on vector registers and can be combined with other SIMD operations for maximum efficiency.
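The AES instructions are designed to be paired: AESE performs AddRoundKey, SubBytes, and ShiftRows, and AESMC performs MixColumns. The sketch below encrypts one block with AES-128 under these assumptions: the processor implements the Cryptographic Extension, the 11 round keys have already been expanded and stored contiguously in memory, and the routine name `aes128_encrypt_block` is hypothetical.

```aarch64
    .arch armv8-a+crypto
    .text
    .global aes128_encrypt_block
// aes128_encrypt_block (hypothetical):
//   x0 = 16-byte block, encrypted in place
//   x1 = 11 expanded round keys (176 bytes), round key 0 first
aes128_encrypt_block:
    ld1   {v0.16b}, [x0]          // load the plaintext block
    mov   w2, #9                  // nine rounds that include MixColumns
1:  ld1   {v1.16b}, [x1], #16     // next round key (keys 0 through 8)
    aese  v0.16b, v1.16b          // AddRoundKey + SubBytes + ShiftRows
    aesmc v0.16b, v0.16b          // MixColumns
    subs  w2, w2, #1
    b.ne  1b
    ld1   {v1.16b}, [x1], #16     // round key 9
    aese  v0.16b, v1.16b          // final round has no MixColumns
    ld1   {v1.16b}, [x1]          // round key 10
    eor   v0.16b, v0.16b, v1.16b  // final AddRoundKey
    st1   {v0.16b}, [x0]          // store the ciphertext block
    ret
```

Production code would normally unroll the loop and keep the round keys in registers; this version favors brevity.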
Performance Optimization and Tuning
AArch64 optimization requires understanding of processor microarchitecture, memory hierarchy behavior, and instruction scheduling considerations. Modern AArch64 processors employ sophisticated out-of-order execution engines, but careful instruction selection and data layout can still provide significant performance benefits.
aarch64
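// Loop optimization with software pipelining
// Assumes x4 = output pointer and count >= 1
adrp x0, array_base // Array base address (page)
add x0, x0, :lo12:array_base // Add offset within the page
mov x1, #count // Element count (assemble-time constant)
ldr x2, [x0], #8 // Preload first element
subs x1, x1, #1 // One element is already loaded
b.eq pipeline_tail // Single element: skip the loop
optimized_loop:
add x3, x2, #1 // Process current element (example processing)
ldr x2, [x0], #8 // Load next element while processing current
str x3, [x4], #8 // Store result, increment output
subs x1, x1, #1 // Decrement counter
b.ne optimized_loop // Continue if more elements
pipeline_tail:
add x3, x2, #1 // Process the final element
str x3, [x4], #8 // Store the final result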
// Branch prediction optimization
// Arrange code so common case falls through
cmp x0, #threshold
b.ge uncommon_case // Uncommon case branches
// Common case code continues here
common_case:
// Frequently executed code
b continue_execution
uncommon_case:
// Rarely executed code
b continue_execution
continue_execution:
// Continuation point
Performance optimization on AArch64 benefits from understanding branch prediction behavior, cache line utilization, and instruction-level parallelism. The expanded register set enables more aggressive compiler optimizations and reduces memory traffic compared to register-constrained architectures.
AArch64 assembly language provides a modern foundation for high-performance computing, system software development, and embedded systems programming. Its clean 64-bit design, enhanced SIMD capabilities, and comprehensive system programming features let developers write efficient code that makes full use of modern ARM processors. Working knowledge of AArch64 assembly is valuable for performance-critical applications, system-level development, security research, and any domain that requires direct hardware control on ARM platforms. As the architecture continues to evolve and spread across mobile, embedded, and server markets, that knowledge remains relevant.