Reverse engineering has always been one of the most intellectually demanding disciplines in cybersecurity. Analysts spend hours staring at assembly instructions, mentally reconstructing data structures, and tracing execution paths through stripped binaries. The cognitive load is enormous, and the demand for skilled reverse engineers far outstrips supply.
That gap is exactly where large language models are making their mark. Over the past two years, AI-assisted reverse engineering has gone from a curiosity to a legitimate force multiplier. Ghidra plugins that rename thousands of functions in minutes, Binary Ninja integrations that explain shellcode in natural language, and automated vulnerability scanners that reason about memory corruption patterns are all shipping today. This is not speculative technology. It is production tooling that working analysts rely on.
This guide covers the current state of AI-augmented reverse engineering, the tools that matter, practical workflows for integrating them into your analysis pipeline, and the limitations you need to understand before trusting any of it.
The Intersection of AI and Reverse Engineering
Reverse engineering is fundamentally a pattern recognition and translation problem. You take machine code, lift it into increasingly abstract representations, and ultimately reconstruct the developer's original intent. LLMs excel at exactly this kind of structured translation.
The key insight driving current tooling is that decompiled C pseudocode is close enough to ordinary source code that general-purpose LLMs can reason about it effectively. The output of Ghidra's decompiler or IDA Pro's Hex-Rays is C-like, even when it would not compile as-is. LLMs trained on billions of lines of source code can infer function purposes, suggest meaningful variable names, identify common library patterns, and flag suspicious constructs.
Three categories of AI assistance have emerged:
- Semantic enrichment — Renaming functions, variables, and types based on behavioral analysis of decompiled code
- Explanation and summarization — Generating natural language descriptions of what code blocks do
- Vulnerability detection — Identifying patterns associated with buffer overflows, use-after-free, format string bugs, and other exploitable conditions
Each category has different reliability profiles. Semantic enrichment is surprisingly accurate for well-known library code. Explanation quality varies with model capability. Vulnerability detection remains the least reliable but most actively researched.
AI-Assisted Decompilation: Ghidra Plugins
Ghidra's open architecture and Python/Java scripting support have made it the primary platform for AI reverse engineering experimentation. Several plugins have matured into daily-driver tools.
GhidrAssist
GhidrAssist connects Ghidra directly to LLM APIs (OpenAI, Anthropic, or local models via Ollama) and provides contextual analysis of the currently selected function. It sends the decompiled output along with cross-references and string data to the model and returns structured analysis.
Installation is straightforward:
# Clone the repository
git clone https://github.com/unkmc/GhidrAssist.git
# Copy the extension to Ghidra's extensions directory
cp -r GhidrAssist "$GHIDRA_INSTALL_DIR/Extensions/Ghidra/"
# Restart Ghidra and enable via File > Install Extensions
Once configured with an API key, you can right-click any function and select "Explain Function" or "Suggest Names." The plugin sends the decompiled C along with context about calling conventions, imported symbols, and cross-references.
What makes GhidrAssist particularly effective is its context window management. Rather than dumping the entire binary's decompilation into a prompt, it constructs focused queries that include the target function, its immediate callers and callees, and relevant string references. This focused approach produces significantly better results than naive prompting.
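The focused-query idea can be sketched in a few lines. This is not GhidrAssist's actual implementation: `FunctionContext`, `build_focused_prompt`, and the character budget are hypothetical names and values chosen for illustration. The sketch always keeps the target function and then adds strings, callees, and callers only while they fit.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionContext:
    """Decompiler output for one function plus its neighbours (hypothetical shape)."""
    name: str
    code: str
    callers: dict[str, str] = field(default_factory=dict)  # name -> decompiled C
    callees: dict[str, str] = field(default_factory=dict)  # name -> decompiled C
    strings: list[str] = field(default_factory=list)


def build_focused_prompt(ctx: FunctionContext, budget_chars: int = 8000) -> str:
    """Always include the target function, then add context while it fits the budget."""
    parts = [f"Target function {ctx.name}:\n{ctx.code}"]
    used = len(parts[0])

    # Candidate context in priority order: strings, then callees, then callers.
    candidates = [f"Referenced string: {s!r}" for s in ctx.strings]
    candidates += [f"Callee {n}:\n{c}" for n, c in ctx.callees.items()]
    candidates += [f"Caller {n}:\n{c}" for n, c in ctx.callers.items()]

    for block in candidates:
        cost = len(block) + 2  # two characters for the blank-line separator
        if used + cost > budget_chars:
            continue  # skip what does not fit; a smaller later block may still fit
        parts.append(block)
        used += cost
    return "\n\n".join(parts)
```

A character budget is a crude stand-in for a token budget, but it keeps the sketch free of tokenizer dependencies.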
GEPETTO
GEPETTO (GPT Explanation of ProcEdures To Transform Operations) was one of the first production-quality Ghidra AI plugins. It focuses specifically on function explanation and renaming:
# GEPETTO configuration in ghidra_script.py
# Set your API provider and model
import ghidra_bridge
# namespace=globals() exposes Ghidra's script globals (currentProgram,
# currentAddress, the ghidra package) in this Python process
b = ghidra_bridge.GhidraBridge(namespace=globals())
# Get the current function's decompiled output
current_function = currentProgram.getFunctionManager().getFunctionContaining(
    currentAddress
)
decomp = ghidra.app.decompiler.DecompInterface()
decomp.openProgram(currentProgram)
results = decomp.decompileFunction(current_function, 60, None)
c_code = results.getDecompiledFunction().getC()
# Build the prompt to send to the LLM
prompt = f"""Analyze this decompiled C function. Suggest:
1. A descriptive function name
2. Parameter names and likely types
3. A one-paragraph explanation of its purpose
```c
{c_code}
```"""
GEPETTO excels at batch processing. You can run it across an entire binary's function list and get a first-pass renaming that transforms a sea of FUN_00401000 labels into meaningful names like decrypt_config_buffer or parse_c2_response. The accuracy rate on common patterns (file I/O, network operations, crypto routines) typically exceeds 80%.
VulChatGPT
VulChatGPT extends the GEPETTO concept with a specific focus on vulnerability identification. It analyzes decompiled functions for common vulnerability patterns and produces structured reports:
# VulChatGPT analysis prompt structure
vulnerability_prompt = """
Analyze this decompiled function for security vulnerabilities.
Focus on:
- Buffer overflow conditions (unchecked memcpy, strcpy, sprintf)
- Integer overflow/underflow in size calculations
- Use-after-free patterns
- Format string vulnerabilities
- Race conditions in shared resource access
- Unvalidated input used in sensitive operations
For each finding, provide:
- Vulnerability type
- Affected line(s)
- Severity estimate
- Exploitation difficulty
"""
The value here is not that the LLM catches vulnerabilities that a skilled analyst would miss. Rather, it accelerates triage. When analyzing a binary with 2,000 functions, having an automated first pass that flags the 50 most suspicious functions saves days of manual review.
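One way to turn per-function findings into that shortlist is a simple weighted ranking. The sketch below is not VulChatGPT's code. It assumes each function's findings arrive as a list of dicts with a `severity` field, and the weights in `SEVERITY_WEIGHT` are hypothetical values you would tune against binaries you have already audited.

```python
import heapq

# Hypothetical weights; tune them against binaries you have already audited.
SEVERITY_WEIGHT = {"critical": 8, "high": 4, "medium": 2, "low": 1}


def suspicion_score(findings: list[dict]) -> int:
    """Sum severity weights over one function's findings; unknown labels count as 0."""
    return sum(
        SEVERITY_WEIGHT.get(str(f.get("severity", "")).lower(), 0) for f in findings
    )


def top_suspicious(reports: dict[str, list[dict]], limit: int = 50) -> list[tuple[str, int]]:
    """Return up to `limit` (function_name, score) pairs, highest score first."""
    scored = ((name, suspicion_score(findings)) for name, findings in reports.items())
    flagged = [(name, score) for name, score in scored if score > 0]
    # Key: score descending, then name ascending so the output is reproducible.
    return heapq.nsmallest(limit, flagged, key=lambda item: (-item[1], item[0]))
```

Functions with no recognized findings drop out entirely, so the shortlist can be shorter than `limit`.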
Binary Ninja's Sidekick and AI Analysis
Binary Ninja has taken a more integrated approach to AI assistance. Rather than relying on third-party plugins, Vector 35 has built AI features directly into the platform.
Sidekick provides an interactive chat interface within Binary Ninja that is context-aware. It knows which function you are viewing, what your current selection is, and what analysis has already been performed. This tight integration means you can ask questions like "What does this function's third parameter control?" and get answers grounded in the actual binary analysis.
The Sidekick API also enables scripting:
# Binary Ninja Sidekick scripting example
import binaryninja
bv = binaryninja.load("/path/to/binary")
# Iterate functions and get AI-generated summaries
for func in bv.functions:
    if func.name.startswith("sub_"):
        # Request AI analysis for unnamed functions.
        # query_sidekick and the fields on its result are illustrative;
        # check the Sidekick documentation for the actual scripting interface.
        analysis = bv.query_sidekick(
            f"Analyze function at {hex(func.start)} and suggest a name"
        )
        if analysis.confidence > 0.8:
            func.name = analysis.suggested_name
Binary Ninja's type propagation engine feeds directly into the AI context, so the model sees not just raw decompiled code but also recovered types, structures, and enum values. This produces notably better results than tools that only send raw pseudocode.
IDA Pro AI Integrations
IDA Pro's dominance in commercial reverse engineering means it has attracted significant AI integration effort.
BinaryAI
BinaryAI uses embedding-based similarity search to match functions against a database of known open-source code. Rather than asking an LLM to guess what a function does, BinaryAI computes a vector embedding of the function's control flow graph and data flow patterns, then searches for similar functions in its indexed corpus.
# BinaryAI IDA plugin usage
import binaryai as bai
import idautils
import idc
# Initialize client
client = bai.Client(token="your_api_token")
# Analyze current function (the cursor may sit anywhere inside it)
ea = idc.get_screen_ea()
func_start = idc.get_func_attr(ea, idc.FUNCATTR_START)
func_end = idc.get_func_attr(ea, idc.FUNCATTR_END)
func_bytes = idc.get_bytes(func_start, func_end - func_start)
# Query the BinaryAI database
results = client.search_function(func_bytes)
for match in results[:5]:
    print(f"Match: {match.name} from {match.source} (confidence: {match.score:.2f})")
This approach is complementary to LLM-based analysis. BinaryAI excels at identifying statically linked library functions (zlib, OpenSSL, SQLite), while LLMs are better at understanding custom application logic.
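The retrieval step behind embedding-based matching is plain nearest-neighbour search. The sketch below assumes the embeddings already exist as lists of floats. It does not reproduce BinaryAI's model or API, and `nearest_functions` and the toy vectors in the usage example are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_functions(
    query: list[float], corpus: dict[str, list[float]], top_k: int = 5
) -> list[tuple[str, float]]:
    """Rank known functions by similarity to the query embedding, best first."""
    ranked = sorted(
        ((cosine_similarity(query, vec), name) for name, vec in corpus.items()),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [(name, score) for score, name in ranked[:top_k]]


# Usage with toy 2-D vectors; real embeddings have hundreds of dimensions.
corpus = {"inflate": [1.0, 0.0], "SHA256_Update": [0.0, 1.0], "deflate": [1.0, 1.0]}
best_name, best_score = nearest_functions([1.0, 0.1], corpus, top_k=1)[0]
```

A linear scan is fine for a few thousand functions. A corpus the size of an open-source index would need an approximate nearest-neighbour structure instead.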
Gepetto for IDA
The IDA version of Gepetto operates similarly to its Ghidra counterpart but leverages IDA's superior decompiler output (Hex-Rays) for better results:
# Gepetto for IDA: batch rename workflow
import ida_hexrays
import ida_name
import idautils
import openai
def analyze_all_functions():
    renamed = 0
    for func_ea in idautils.Functions():
        try:
            cfunc = ida_hexrays.decompile(func_ea)
        except ida_hexrays.DecompilationFailure:
            continue  # skip functions Hex-Rays cannot decompile
        if cfunc:
            pseudocode = str(cfunc)
            # Send to LLM for naming
            response = openai.chat.completions.create(
                model="gpt-4o",
                messages=[{
                    "role": "user",
                    "content": f"Suggest a function name for:\n{pseudocode}"
                }]
            )
            new_name = response.choices[0].message.content.strip()
            if new_name and not new_name.startswith("sub_"):
                # set_name returns False when IDA rejects the name
                if ida_name.set_name(func_ea, new_name, ida_name.SN_CHECK):
                    renamed += 1
    print(f"Renamed {renamed} functions")
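A model asked for a name often replies with a sentence, Markdown backticks, or a name that already exists in the database. A small filter in front of the rename call handles those cases. This is a hedged sketch rather than part of Gepetto: `clean_suggested_name` and its rules are assumptions you would adapt to your own prompt format.

```python
import re
from typing import Optional

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_AUTO_PREFIXES = ("sub_", "FUN_", "loc_")  # placeholder prefixes from IDA and Ghidra


def clean_suggested_name(reply: str, taken: set[str], max_len: int = 64) -> Optional[str]:
    """Extract one usable identifier from an LLM reply, or None if there is none."""
    lines = reply.strip().splitlines()
    if not lines:
        return None
    # Keep the first line only; drop backticks, quotes, and a trailing "()".
    token = lines[0].strip().strip("`'\"").removesuffix("()")
    if not _IDENTIFIER.fullmatch(token) or token.startswith(_AUTO_PREFIXES):
        return None  # prose, empty output, or just another placeholder name
    token = token[:max_len]
    # Resolve collisions with a numeric suffix instead of clobbering a symbol.
    unique, n = token, 1
    while unique in taken:
        n += 1
        unique = f"{token}_{n}"
    return unique
```

The function does not modify `taken`, so the caller should add each applied name to the set before processing the next function.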
Automated Function Naming and Type Recovery
Function naming is where AI provides the most immediate, tangible value in reverse engineering. The workflow is simple but powerful:
- Decompile the function to pseudocode
- Send pseudocode plus context (strings, imports, cross-references) to an LLM
- Parse the response for a suggested name and parameter types
- Apply the suggestions and propagate types through the call graph
The propagation step is critical. When you correctly identify that FUN_00405a20 is actually aes_cbc_encrypt, the type information for its parameters (key buffer, IV, plaintext, length) can propagate to every caller, dramatically improving readability across the entire binary.
# Type propagation after AI-assisted naming
# Ghidra script for cascading type application
from ghidra.program.model.data import StructureDataType
from ghidra.program.model.symbol import SourceType
def apply_crypto_types(func, analysis_result):
    """Apply AI-suggested names and a recovered structure to a function."""
    dtm = currentProgram.getDataTypeManager()
    # Create structure based on AI suggestion
    if analysis_result.get("struct_name"):
        struct = StructureDataType(analysis_result["struct_name"], 0)
        for field in analysis_result["fields"]:
            struct.add(
                dtm.getDataType(field["type"]),  # data type path such as "/int"
                field["size"],
                field["name"],
                field["comment"]
            )
        dtm.addDataType(struct, None)
    # Rename function
    func.setName(analysis_result["name"], SourceType.USER_DEFINED)
    # Rename parameters (types are left for the decompiler to re-infer)
    params = func.getParameters()
    for i, param_info in enumerate(analysis_result.get("params", [])):
        if i < len(params):
            params[i].setName(param_info["name"], SourceType.USER_DEFINED)
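Deciding which callers to revisit after a rename is a walk up the call graph. The sketch below is independent of Ghidra. It assumes you have exported a caller map as a plain dict, and `callers_to_refresh` and `max_depth` are hypothetical names. It returns callers nearest-first so the most directly affected functions are re-decompiled first.

```python
from collections import deque


def callers_to_refresh(
    callers_of: dict[str, set[str]], renamed: str, max_depth: int = 2
) -> list[str]:
    """Breadth-first walk up the call graph from `renamed`, nearest callers first."""
    seen = {renamed}
    order: list[str] = []
    queue = deque([(renamed, 0)])
    while queue:
        func, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not climb past the depth limit
        for caller in sorted(callers_of.get(func, ())):  # sorted for determinism
            if caller not in seen:  # the seen set also guards against recursion cycles
                seen.add(caller)
                order.append(caller)
                queue.append((caller, depth + 1))
    return order
```

Capping the depth keeps a single rename from triggering re-analysis of the whole binary, since type information usually stops being informative a few levels up.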
LLM-Powered Vulnerability Detection in Binaries
Vulnerability detection has traditionally relied on pattern matching (the approach source-level scanners such as Flawfinder and RATS take) or heavyweight formal methods (symbolic execution, abstract interpretation). LLMs offer a middle ground for compiled code: they can reason about the semantics of decompiled output at a level between syntactic pattern matching and formal verification.
Current approaches work best for well-known vulnerability classes:
# Structured vulnerability analysis pipeline
import json
def analyze_function_for_vulns(decompiled_code, context):
    prompt = f"""You are a binary security analyst. Analyze this decompiled function
for exploitable vulnerabilities.
Function context:
- Binary: {context['binary_name']}
- Architecture: {context['arch']}
- Calling convention: {context['calling_convention']}
- Known imports used: {', '.join(context['imports'])}
Decompiled code:
```c
{decompiled_code}
```
Respond in JSON format:
{{
  "vulnerabilities": [
    {{
      "type": "buffer_overflow|use_after_free|format_string|integer_overflow|race_condition",
      "severity": "critical|high|medium|low",
      "line_reference": "approximate line in decompiled output",
      "description": "what the vulnerability is",
      "exploitation_notes": "how it might be exploited",
      "confidence": 0.0-1.0
    }}
  ],
  "overall_risk": "critical|high|medium|low|none"
}}"""
    # call_llm is a placeholder for whatever client wrapper you use
    response = call_llm(prompt)
    return json.loads(response)
The key limitation is false positive rate. LLMs tend to over-flag potential issues, especially around pointer arithmetic that is actually safe due to bounds checks elsewhere in the code. Always treat LLM vulnerability reports as leads for manual investigation, not as confirmed findings.
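Because model output is untrusted, it helps to validate the JSON before any finding reaches an analyst's queue. The sketch below checks replies against the type and severity labels from the prompt schema above and drops anything malformed or below a confidence floor. `parse_vuln_report` and the default threshold are assumptions for illustration, not part of any tool.

```python
import json

ALLOWED_TYPES = {
    "buffer_overflow", "use_after_free", "format_string",
    "integer_overflow", "race_condition",
}
ALLOWED_SEVERITIES = {"critical", "high", "medium", "low"}


def parse_vuln_report(raw: str, min_confidence: float = 0.5) -> list[dict]:
    """Return well-formed findings at or above min_confidence; never raise on bad output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict) or not isinstance(data.get("vulnerabilities"), list):
        return []
    leads = []
    for item in data["vulnerabilities"]:
        if not isinstance(item, dict):
            continue
        vtype, severity = item.get("type"), item.get("severity")
        confidence = item.get("confidence")
        if not (isinstance(vtype, str) and vtype in ALLOWED_TYPES):
            continue  # a type outside the schema is treated as a hallucination
        if not (isinstance(severity, str) and severity in ALLOWED_SEVERITIES):
            continue
        if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
            continue  # reject strings such as "high" in the confidence field
        if not min_confidence <= confidence <= 1.0:
            continue
        leads.append(item)
    return leads
```

Everything this returns is still a lead for manual investigation. The filter only removes output that could not be a valid finding in the first place.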
## The Capstone, Unicorn, and Keystone Toolkit
Before AI entered the picture, the Capstone/Unicorn/Keystone suite established the foundation for programmatic binary analysis. These tools remain essential building blocks, and they pair well with LLM-assisted workflows.
**Capstone** is a disassembly framework supporting multiple architectures:
```python
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

CODE = b"\x55\x48\x89\xe5\x48\x83\xec\x10\x89\x7d\xfc\x8b\x45\xfc\x83\xc0\x01"
md = Cs(CS_ARCH_X86, CS_MODE_64)
md.detail = True  # required for insn.groups
for insn in md.disasm(CODE, 0x1000):
    print(f"0x{insn.address:x}:\t{insn.mnemonic}\t{insn.op_str}")
    # Groups tell us instruction categories (jump, call, ret, etc.);
    # include them in the text you feed to the LLM for semantic analysis
    groups = [insn.group_name(g) for g in insn.groups]
```
Unicorn provides CPU emulation for dynamic analysis:
from unicorn import Uc, UC_ARCH_X86, UC_MODE_64, UC_HOOK_CODE
from unicorn.x86_const import UC_X86_REG_RAX, UC_X86_REG_RSP
# Same bytes as the Capstone example (push rbp; mov rbp, rsp; ...)
CODE = b"\x55\x48\x89\xe5\x48\x83\xec\x10\x89\x7d\xfc\x8b\x45\xfc\x83\xc0\x01"
# Initialize emulator
mu = Uc(UC_ARCH_X86, UC_MODE_64)
# Map memory and write code
ADDRESS = 0x1000000
mu.mem_map(ADDRESS, 2 * 1024 * 1024)
mu.mem_write(ADDRESS, CODE)
# Set up stack and point RSP into it (push rbp needs a mapped stack)
STACK_ADDR = 0x2000000
STACK_SIZE = 1024 * 1024
mu.mem_map(STACK_ADDR, STACK_SIZE)
mu.reg_write(UC_X86_REG_RSP, STACK_ADDR + STACK_SIZE // 2)
# Emulate and collect execution trace
trace = []
def hook_code(uc, address, size, user_data):
    trace.append(address)
mu.hook_add(UC_HOOK_CODE, hook_code)
mu.emu_start(ADDRESS, ADDRESS + len(CODE))
print(f"RAX = {mu.reg_read(UC_X86_REG_RAX)}")
Keystone handles assembly, completing the cycle:
from keystone import Ks, KS_ARCH_X86, KS_MODE_64
ks = Ks(KS_ARCH_X86, KS_MODE_64)
encoding, count = ks.asm("mov rax, 0x1337; ret")
print(f"Assembled {count} instructions: {bytes(encoding).hex()}")
The power of combining these tools with LLMs is in building automated analysis pipelines. You can use Capstone to disassemble, send the output to an LLM for semantic analysis, use the LLM's suggestions to guide Unicorn emulation paths, and use Keystone to generate test payloads.
PyGhidra: Python-First Reverse Engineering Pipelines
PyGhidra (formerly pyhidra) lets you run Ghidra's analysis engine as a Python library without launching the GUI. This is transformative for building automated pipelines:
import pyghidra
# Analyze a binary headlessly
with pyghidra.open_program("/path/to/malware.bin") as flat_api:
    # Ghidra's Java packages are importable only once the JVM is running
    from ghidra.app.decompiler import DecompInterface
    program = flat_api.getCurrentProgram()
    func_manager = program.getFunctionManager()
    # One decompiler instance serves every function
    decomp = DecompInterface()
    decomp.openProgram(program)
    # Iterate all functions
    results = {}
    for func in func_manager.getFunctions(True):
        # Get decompiled output
        decomp_result = decomp.decompileFunction(func, 120, flat_api.getMonitor())
        if decomp_result.decompileCompleted():
            c_code = decomp_result.getDecompiledFunction().getC()
            # Send to LLM for analysis (analyze_with_llm is your own helper)
            analysis = analyze_with_llm(c_code)
            results[func.getName()] = analysis
    decomp.dispose()
# Generate report (generate_analysis_report is your own helper)
generate_analysis_report(results, "/path/to/report.json")
PyGhidra enables CI/CD-style binary analysis. You can set up a pipeline that automatically processes new malware samples, generates AI-enriched analysis reports, and flags high-priority items for human review:
#!/bin/bash
# Automated malware analysis pipeline
SAMPLE_DIR="/incoming/samples"
REPORT_DIR="/reports"
for sample in "$SAMPLE_DIR"/*.bin; do
    filename=$(basename "$sample" .bin)
    echo "Analyzing: $filename"
    python3 analyze_binary.py \
        --input "$sample" \
        --output "$REPORT_DIR/${filename}.json" \
        --model claude-sonnet \
        --max-functions 500 \
        --confidence-threshold 0.7
    # Flag high-severity findings
    python3 triage_report.py "$REPORT_DIR/${filename}.json"
done
Building Custom Analysis Scripts with LLM Assistance
The most powerful approach is not using off-the-shelf plugins but building custom analysis scripts tailored to your specific targets. LLMs can help you write these scripts:
# Custom malware config extractor using LLM-guided analysis
import json
import re
class ConfigExtractor:
    def __init__(self, binary_path, llm_client):
        self.binary_path = binary_path
        self.llm = llm_client
        self.config = {}
    def find_config_function(self, decompiled_functions):
        """Use LLM to identify the configuration initialization function."""
        for name, code in decompiled_functions.items():
            response = self.llm.query(
                f"Does this function initialize a malware configuration "
                f"structure? Look for patterns like: setting C2 URLs, "
                f"encryption keys, sleep intervals, persistence mechanisms.\n\n"
                f"```c\n{code}\n```\n\n"
                f"Respond with YES or NO and a confidence score."
            )
            if "YES" in response and self.parse_confidence(response) > 0.75:
                return name, code
        return None, None
    def extract_config_values(self, config_func_code):
        """Use LLM to identify and extract configuration values."""
        response = self.llm.query(
            f"Extract all configuration values from this function. "
            f"Identify C2 servers, encryption keys, file paths, "
            f"registry keys, and timing values.\n\n"
            f"```c\n{config_func_code}\n```\n\n"
            f"Respond in JSON format with field names and values."
        )
        return self.parse_json_response(response)
    def parse_confidence(self, response):
        # Take the first number that follows the word "confidence"
        match = re.search(r'(\d+\.?\d*)', response.split("confidence")[-1])
        return float(match.group(1)) if match else 0.0
    def parse_json_response(self, response):
        try:
            # Grab the outermost {...} span in case the model added prose
            json_match = re.search(r'\{.*\}', response, re.DOTALL)
            if json_match:
                return json.loads(json_match.group())
        except json.JSONDecodeError:
            pass
        return {}
Ethical Considerations and Limitations
AI-assisted reverse engineering raises important questions that practitioners must address.
Accuracy and overconfidence. LLMs produce plausible-sounding analysis that may be completely wrong. A function that looks like AES-128 encryption to the model might actually be a custom XOR cipher with a coincidentally similar structure. Never trust AI analysis without verification. Treat it as a hypothesis generator, not an oracle.
Intellectual property. Using AI to reverse engineer proprietary software raises legal questions that vary by jurisdiction. The DMCA in the United States has exemptions for security research and interoperability, but the boundaries are not always clear. Document your purpose and consult legal counsel when working on commercial targets.
Model data leakage. When you send decompiled code to a cloud LLM API, that code may be logged, used for training, or subject to subpoena. For sensitive targets (nation-state malware, classified binaries, client engagements under NDA), use local models exclusively. Ollama with a capable model like Llama 3 or Mixtral provides reasonable analysis quality without any data leaving your machine.
Adversarial robustness. Malware authors are already exploring techniques to confuse AI analysis. Inserting misleading string references, adding dead code that resembles benign library functions, and using obfuscation patterns that specifically target LLM comprehension are all emerging techniques. Expect an arms race.
Skill atrophy. Over-reliance on AI analysis can erode fundamental reverse engineering skills. Junior analysts who learn to click "Explain Function" before reading the code themselves may never develop the deep understanding needed for novel challenges. Use AI as an accelerator, not a replacement for learning.
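For the data-leakage concern above, keeping analysis local can be as simple as pointing requests at Ollama's HTTP endpoint on localhost. The sketch below uses only the standard library and assumes an Ollama server on its default port with the named model already pulled. `build_ollama_request`, `explain_locally`, and the prompt wording are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_ollama_request(code: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming generate request for a decompiled function."""
    payload = {
        "model": model,
        "prompt": f"Explain what this decompiled function does:\n\n{code}",
        "stream": False,  # ask for a single JSON object instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def explain_locally(code: str, model: str = "llama3", timeout: float = 120.0) -> str:
    """Send the request to the local server and return the generated text."""
    with urllib.request.urlopen(build_ollama_request(code, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Since the URL is hard-coded to localhost, decompiled code only leaves the machine if someone deliberately changes that constant, which makes the property easy to audit.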
The Future: Autonomous Binary Analysis Agents
The next frontier is agentic systems that can perform multi-step binary analysis autonomously. Rather than answering single questions about individual functions, these systems will navigate binaries, form and test hypotheses, and produce comprehensive analysis reports.
Early prototypes already exist. Research systems can take a binary, identify its entry point, trace execution paths, identify interesting functions, analyze them for vulnerabilities, and produce structured reports, all without human intervention. The architecture typically looks like this:
# Autonomous binary analysis agent architecture
agent:
  planning_model: claude-sonnet
  analysis_model: claude-sonnet
  tools:
    - ghidra_decompile
    - ghidra_xrefs
    - ghidra_strings
    - unicorn_emulate
    - capstone_disasm
    - yara_scan
  workflow:
    - step: initial_triage
      actions: [identify_packer, check_imports, extract_strings]
    - step: unpack_if_needed
      condition: is_packed
      actions: [dump_unpacked, reanalyze]
    - step: function_analysis
      actions: [identify_interesting_functions, analyze_top_50]
    - step: vulnerability_scan
      actions: [check_memory_safety, check_crypto_usage]
    - step: report_generation
      actions: [compile_findings, assign_severity, generate_report]
The limiting factor is not model capability but tool integration reliability. Getting an LLM to correctly invoke Ghidra APIs, interpret the results, and decide on next steps requires careful engineering of the agent loop. Hallucinated API calls, misinterpreted addresses, and infinite analysis loops are common failure modes.
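Those failure modes can be contained with a few guards in the loop itself. The sketch below is a minimal illustration, not any shipping agent framework: `run_agent`, the action dict shape, and the `finish` sentinel are assumptions. It rejects tools that are not registered, skips exact repeat calls, and stops after a fixed number of steps.

```python
import json
from typing import Any, Callable


def run_agent(
    plan_next_step: Callable[[list[dict]], dict],
    tools: dict[str, Callable[..., Any]],
    max_steps: int = 20,
) -> list[dict]:
    """Run planner-chosen tool calls with a whitelist, de-duplication, and a step cap."""
    history: list[dict] = []
    seen_calls: set[tuple[str, str]] = set()
    for _ in range(max_steps):  # hard bound against infinite analysis loops
        action = plan_next_step(history)
        name = action.get("tool")
        if name == "finish":
            break
        args = action.get("args", {})
        if name not in tools:
            # Hallucinated tool: feed the error back rather than guessing.
            history.append({"tool": name, "error": f"unknown tool; available: {sorted(tools)}"})
            continue
        key = (name, json.dumps(args, sort_keys=True, default=str))
        if key in seen_calls:
            history.append({"tool": name, "error": "duplicate call skipped"})
            continue
        seen_calls.add(key)
        try:
            result = tools[name](**args)
        except Exception as exc:  # bad arguments or tool failure become feedback
            history.append({"tool": name, "args": args, "error": str(exc)})
            continue
        history.append({"tool": name, "args": args, "result": result})
    return history
```

Writing errors into the history instead of raising gives the planning model a chance to correct itself on the next step, while the step cap guarantees termination either way.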
Within the next year, expect commercial platforms to ship agent-based analysis as a core feature. The analysts who understand both the AI capabilities and the underlying reverse engineering fundamentals will be the ones who use these tools most effectively. The technology amplifies expertise. It does not replace it.
Getting Started Today
If you want to integrate AI into your reverse engineering workflow right now, start here:
Install GhidrAssist and configure it with your preferred LLM provider. Use it on a binary you have already analyzed manually to calibrate your trust in its output.
Set up a local model with Ollama for sensitive work. Llama 3 70B with Q4_K_M quantization provides good analysis quality on a machine with 48GB RAM.
Build a PyGhidra script that batch-processes functions and generates a JSON report. Start with function naming and expand to vulnerability analysis.
Establish verification habits. For every AI suggestion you accept, spend thirty seconds confirming it against the actual code. This builds intuition for when the AI is right and when it is confabulating.
Track your metrics. Measure how many AI suggestions you accept versus reject, and how often accepted suggestions turn out to be wrong on deeper analysis. This data will help you calibrate your workflow.
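The metrics in the last step need nothing more than three counters. A minimal sketch, with `SuggestionLog` and its field names chosen for illustration:

```python
from dataclasses import dataclass


@dataclass
class SuggestionLog:
    """Running counts for AI suggestions reviewed by an analyst."""
    accepted: int = 0
    rejected: int = 0
    accepted_then_wrong: int = 0

    def record(self, accepted: bool) -> None:
        """Count one reviewed suggestion."""
        if accepted:
            self.accepted += 1
        else:
            self.rejected += 1

    def mark_wrong(self) -> None:
        """Note that a previously accepted suggestion failed deeper analysis."""
        if self.accepted_then_wrong >= self.accepted:
            raise ValueError("no accepted suggestion left to mark as wrong")
        self.accepted_then_wrong += 1

    @property
    def acceptance_rate(self) -> float:
        total = self.accepted + self.rejected
        return self.accepted / total if total else 0.0

    @property
    def error_rate(self) -> float:
        """Share of accepted suggestions that later proved wrong."""
        return self.accepted_then_wrong / self.accepted if self.accepted else 0.0
```

A high acceptance rate combined with a rising error rate is the signal to slow down and verify more before accepting.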
The tools are here. The models are capable enough to be useful. The analysts who integrate them thoughtfully into disciplined workflows will have a significant advantage over those who either ignore AI entirely or trust it blindly.