pyinstxtractor

Overview

pyinstxtractor is a Python script that extracts the contents of PyInstaller-generated executables. It recovers the Python bytecode (.pyc files), resource files, and other embedded content so that security researchers and developers can analyze packaged Python applications. Typical uses are malware analysis, code review, and compliance verification.

Key Features

  • Bytecode Extraction: Recover compiled Python code from executables
  • Resource Recovery: Extract embedded data files and assets
  • Cross-Platform: Works with Windows .exe, macOS .app bundles, and Linux ELF binaries
  • Archive Analysis: Inspects the PyInstaller archive structure
  • Batch Processing: Handles one executable per invocation, so multiple files are processed with a shell loop
  • Header Detection: Automatically identifies Python version and archive format
  • Decompilation Support: Prepares bytecode for tools like uncompyle6
  • Error Handling: Graceful handling of corrupted or unusual archives

Installation

From Repository

# Clone repository
git clone https://github.com/extremecoders-re/pyinstxtractor
cd pyinstxtractor

# Make executable (optional when invoking through python)
chmod +x pyinstxtractor.py

# Verify installation (with no arguments the script prints its usage line)
python pyinstxtractor.py

System-Wide Installation

# Copy to PATH
sudo cp pyinstxtractor.py /usr/local/bin/pyinstxtractor
sudo chmod +x /usr/local/bin/pyinstxtractor

# Verify (prints the usage line; invoke through python if the script has no shebang line)
python /usr/local/bin/pyinstxtractor

Virtual Environment

# Create and activate environment
python -m venv venv
source venv/bin/activate  # Linux/macOS
# or: venv\Scripts\activate  # Windows

# Install any dependencies
pip install uncompyle6  # optional, for decompilation

Basic Usage

Extract Executable

# Extract to directory
python pyinstxtractor.py application.exe

# Output structure created in the current working directory
# └── application.exe_extracted/
#     ├── base_library.zip
#     ├── PYZ-00.pyz
#     ├── PYZ-00.pyz_extracted/
#     ├── [entry-point and bootstrap .pyc files]
#     └── [shared libraries and resource files]
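
To get a quick overview of what an extraction produced, the directory can be summarized by file type. The sketch below is a minimal helper, not part of pyinstxtractor; the function name `summarize_extraction` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def summarize_extraction(extracted_dir):
    """Count files under an extraction directory, grouped by suffix."""
    counts = Counter()
    for path in Path(extracted_dir).rglob("*"):
        if path.is_file():
            # Files without a suffix are grouped under "(none)"
            counts[path.suffix.lower() or "(none)"] += 1
    return dict(counts)
```

Calling `summarize_extraction("application.exe_extracted")` returns a dictionary that maps each suffix to its file count, which makes it easy to see how much bytecode and how many native libraries the bundle contains.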

Extraction Process

# Simple extraction
python pyinstxtractor.py myapp.exe

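# Extract into a specific directory (output always lands in the current working directory)
mkdir -p extracted && cd extracted
python ../pyinstxtractor.py ../myapp.exe
cd ..

# Extract a file from another directory (output still lands in the current directory)
python pyinstxtractor.py ./dist/application.exe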

Command-Line Options

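| Argument | Description | Example |
|----------|-------------|---------|
| <filename> | Target executable (the only argument the upstream script reads) | python pyinstxtractor.py app.exe |
| (none) | Prints the usage line | python pyinstxtractor.py |

The upstream script has no -d, -h, or --verbose flags; it always writes to <filename>_extracted in the current working directory.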

Understanding PyInstaller Archives

Archive Structure

PyInstaller Executable (one-file mode)
├── Bootloader (C executable)
└── CArchive (PKG, appended to the bootloader)
    ├── Entry-point scripts (bytecode)
    ├── PYZ archive (zlib-compressed pure-Python modules)
    │   ├── Compiled modules (.pyc)
    │   └── Module index
    ├── Python runtime and shared libraries
    ├── Data files
    │   ├── Configuration
    │   └── Assets
    ├── Table of Contents (TOC)
    └── Cookie (magic, archive length, TOC offset and length, Python version)

Magic Numbers and Headers

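| Header | Meaning | Bytes |
|--------|---------|-------|
| MEI\x0c\x0b\x0a\x0b\x0e | CArchive cookie magic | 8 |
| PYZ\x00 | PYZ archive magic | 4 |
| Cookie | Archive metadata near the end of the file | 24 (PyInstaller 2.0) or 88 (2.1+) |
| .pyc magic | Python version signature | 4 |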
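
Leading bytes are often enough to tell which kind of file you are looking at before running any extractor. The sketch below is a heuristic only: the function name `identify_magic` is hypothetical, and the Mach-O fat-binary magic is shared with Java class files, so treat the result as a hint.

```python
def identify_magic(data):
    """Guess a container type from the leading bytes of a file."""
    if data[:2] == b"MZ":
        return "PE executable (Windows)"
    if data[:4] == b"\x7fELF":
        return "ELF executable (Linux)"
    if data[:4] in (b"\xcf\xfa\xed\xfe", b"\xce\xfa\xed\xfe", b"\xca\xfe\xba\xbe"):
        return "Mach-O executable (macOS)"
    if data[:4] == b"PK\x03\x04":
        return "ZIP archive"
    if data[:4] == b"PYZ\x00":
        return "PYZ archive"
    # A .pyc magic is a 2-byte version number followed by \r\n
    if len(data) >= 4 and data[2:4] == b"\r\n":
        return "Python bytecode (.pyc)"
    return "unknown"
```

Pass it the first few bytes of a file, for example `identify_magic(open(path, "rb").read(8))`.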

Extraction Examples

Example 1: Basic Windows Application

# Extract Windows executable
python pyinstxtractor.py windows_app.exe

# Check contents
ls -la windows_app.exe_extracted/

# Find main script
find windows_app.exe_extracted -name "*.pyc" | head -5

Example 2: macOS Application Bundle

# Extract from macOS bundle
python pyinstxtractor.py MyApp.app/Contents/MacOS/MyApp

# View extracted structure (created in the current working directory)
tree -L 2 MyApp_extracted

# Examine binary
file MyApp.app/Contents/MacOS/MyApp

Example 3: Linux ELF Executable

# Extract Linux binary
python pyinstxtractor.py ./linux_application

# Check file type
file linux_application

# List extracted contents
ls linux_application_extracted/

# Find Python modules
find linux_application_extracted -type f -name "*.pyc"

Example 4: Batch Extraction

#!/bin/bash

# Extract multiple executables
for exe in *.exe; do
    echo "Extracting $exe..."
    python pyinstxtractor.py "$exe"
done

# Verify all extractions
ls -d *_extracted/
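
The same loop can be written in Python when you want the exit status of each run. This is a sketch under assumptions: `extract_all` is a hypothetical helper, it expects pyinstxtractor.py to be reachable at the given `script` path, and the `run` parameter exists only so the subprocess call can be swapped out.

```python
import subprocess
import sys
from pathlib import Path


def extract_all(directory, pattern="*.exe", script="pyinstxtractor.py", run=subprocess.run):
    """Run the extractor once per matching file and collect return codes."""
    results = {}
    for exe in sorted(Path(directory).glob(pattern)):
        # One invocation per file; output lands in the current working directory
        proc = run([sys.executable, script, str(exe)], capture_output=True, text=True)
        results[exe.name] = proc.returncode
    return results
```

A non-zero return code in the result flags a file that needs a closer look.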

Working with Extracted Bytecode

Decompiling Python Bytecode

# Install decompiler (uncompyle6 targets bytecode up to Python 3.8)
pip install uncompyle6

# Decompile single .pyc file
uncompyle6 module.pyc > module.py

# Batch decompile
for pyc in *.pyc; do
    uncompyle6 "$pyc" > "${pyc%.pyc}.py"
done
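
When a decompiler cannot handle a file, a disassembly is still readable. The sketch below uses the standard `marshal` and `dis` modules; it assumes the .pyc was built by the same Python version that runs the script and carries the 16-byte header of Python 3.7+. The function name `disassemble_pyc` is hypothetical.

```python
import dis
import io
import marshal


def disassemble_pyc(path, header_size=16):
    """Return a textual disassembly of a .pyc file."""
    with open(path, "rb") as f:
        f.seek(header_size)              # skip magic, flags, mtime/hash, size
        code = marshal.loads(f.read())   # top-level code object
    out = io.StringIO()
    dis.dis(code, file=out)              # nested functions are disassembled too
    return out.getvalue()
```

Only load bytecode this way from files you are prepared to parse; `marshal.loads` does not execute the code, but it is not hardened against malformed input.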

Examining PYZ Archives

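# A PYZ archive is not a ZIP file: it stores a marshalled table of contents
# followed by zlib-compressed modules, so unzip cannot open it.
# pyinstxtractor unpacks it automatically into PYZ-00.pyz_extracted/

# List contents
ls application.exe_extracted/PYZ-00.pyz_extracted/ | head -20

# Locate a specific module
find application.exe_extracted/PYZ-00.pyz_extracted -name "module.pyc"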

Working with base_library.zip

# base_library.zip contains Python standard library
unzip -l base_library.zip | head -20

# Extract entire library
unzip base_library.zip -d stdlib

# View specific module
unzip -p base_library.zip os.pyc | xxd | head -20
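
The same listing can be produced from Python with the standard `zipfile` module, which is convenient when the result feeds into further analysis. A minimal sketch; `list_bundled_modules` is a hypothetical name.

```python
import zipfile


def list_bundled_modules(zip_path):
    """Return dotted module names for the .pyc members of a ZIP archive."""
    modules = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".pyc"):
                # "encodings/utf_8.pyc" becomes "encodings.utf_8"
                modules.append(name[:-4].replace("/", "."))
    return sorted(modules)
```

For example, `list_bundled_modules("application.exe_extracted/base_library.zip")` shows which standard library modules were bundled.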

Analysis Workflow

Step 1: Identify Executable Type

# Check file type
file suspicious_app.exe
# Output: PE32 executable (console) Intel 80386, for MS Windows

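# Look for PyInstaller strings (the bootloader references _MEIPASS and pyi-* options)
strings suspicious_app.exe | grep -iE "pyinstaller|_MEIPASS|pyi-"

# On Linux ELF builds, check for the pydata section that holds the archive
objdump -h ./linux_application | grep -i pydata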

Step 2: Extract Archive

# Extract with output
python pyinstxtractor.py suspicious_app.exe

# Verify extraction success
ls suspicious_app.exe_extracted/ | wc -l

# Check for entry-point candidates (pyinstxtractor also prints "Possible entry point" during extraction)
find suspicious_app.exe_extracted -maxdepth 1 -name "*.pyc"

Step 3: Examine Structure

# List extracted files
tree suspicious_app.exe_extracted -L 2

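# Find entry-point candidates (top-level .pyc files that are not PyInstaller bootstrap helpers)
find suspicious_app.exe_extracted -maxdepth 1 -name "*.pyc" ! -name "pyi*"

# Identify native dependencies
find suspicious_app.exe_extracted \( -name "*.so" -o -name "*.dll" -o -name "*.pyd" \)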

Step 4: Decompile Code

# Pick the entry point (main.pyc is a placeholder; use the name reported by pyinstxtractor)
pyc_file="suspicious_app.exe_extracted/main.pyc"

# Decompile
uncompyle6 "$pyc_file" > main.py

# Review code
head -50 main.py

Malware Analysis Workflow

Suspicious Binary Detection

# Extract and analyze
python pyinstxtractor.py suspect.exe

# Check for network connections
strings suspect.exe_extracted/*.pyc | grep -E "(http|socket|request|urllib)"

# Look for encoded strings
find suspect.exe_extracted -name "*.pyc" -exec strings {} \; | grep -E "[A-Za-z0-9+/]{50,}={0,2}"

# Search for common malware patterns (-l lists matching files, since .pyc files are binary)
grep -rlE "subprocess|os\.system|eval|exec" suspect.exe_extracted/
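
Long Base64-looking runs are only interesting if they decode to something meaningful. The sketch below scans raw bytes for candidates and keeps those that decode to mostly printable text. It is a heuristic with hypothetical names (`decode_base64_candidates`, `min_printable`); the 40-character minimum and the 80% threshold are arbitrary assumptions.

```python
import base64
import binascii
import re

# Runs of 40+ Base64 alphabet characters with optional padding
B64_RE = re.compile(rb"[A-Za-z0-9+/]{40,}={0,2}")


def decode_base64_candidates(data, min_printable=0.8):
    """Return (encoded, decoded) pairs for Base64-looking runs in raw bytes."""
    results = []
    for match in B64_RE.finditer(data):
        blob = match.group()
        try:
            decoded = base64.b64decode(blob, validate=True)
        except binascii.Error:
            continue  # wrong length or padding, so not valid Base64
        printable = sum(32 <= b < 127 for b in decoded)
        if decoded and printable / len(decoded) >= min_printable:
            results.append((blob.decode("ascii"), decoded.decode("ascii", "replace")))
    return results
```

Feed it the contents of a .pyc file read in binary mode; binary payloads are filtered out by the printable-ratio check, so lower `min_printable` if you expect encoded shellcode or keys.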

Resource and Data Extraction

# Find resource files
find suspect.exe_extracted -type f ! -name "*.pyc" ! -name "*.zip"

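# List embedded ZIP archives (base_library.zip is a regular ZIP; .pyz files are not)
for file in suspect.exe_extracted/*.zip; do
    unzip -l "$file" 2>/dev/null
done

# Save extracted resources
mkdir -p resources
find suspect.exe_extracted -type f ! -name "*.pyc" ! -name "*.zip" ! -name "*.pyz" -exec cp {} resources/ \;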

Code Behavior Analysis

# Search for suspicious patterns
grep -r "crypto\|cipher\|encrypt\|decrypt" *.pyc

# Find file operations
grep -r "open\|write\|read" main.pyc

# Identify C2 infrastructure
strings *.pyc | grep -E "^(https?|ftp)://"

# Check for registry/system calls
grep -r "winreg\|ctypes\|windll" *.pyc
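
String searches over raw .pyc files miss structure. If the bytecode can be loaded as a code object (for example with `marshal.loads` on a file built for the same Python version), the names it references can be collected recursively. A minimal sketch; `collect_names`, `flag_suspicious`, and the watchlist contents are hypothetical and should be tuned to the investigation.

```python
import types

# Illustrative watchlist, not an authoritative indicator set
WATCHLIST = {"socket", "subprocess", "ctypes", "winreg", "eval", "exec", "urllib"}


def collect_names(code):
    """Recursively gather names referenced by a code object and its children."""
    names = set(code.co_names)
    for const in code.co_consts:
        # Nested functions and classes are stored as code objects in co_consts
        if isinstance(const, types.CodeType):
            names |= collect_names(const)
    return names


def flag_suspicious(code, watchlist=WATCHLIST):
    """Return the watchlisted names that a code object references."""
    return sorted(collect_names(code) & set(watchlist))
```

A hit only shows that a name is referenced somewhere in the module; it says nothing about how or whether the code path runs.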

Technical Details

Python Bytecode Format

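Python .pyc File Structure (Python 3.7+, 16-byte header):
├── Magic Number (4 bytes)     # Python version signature, ends in \r\n
├── Bit field (4 bytes)        # Flags; selects timestamp- or hash-based validation
├── Timestamp (4 bytes)        # Source modification time (or first half of source hash)
├── Source size (4 bytes)      # Source file size (or second half of source hash)
└── Marshalled code object
    ├── Constants
    ├── Names
    ├── Varnames
    ├── Instructions (bytecode)
    └── Nested code objects (stored among the constants)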
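
The header fields can be decoded with `struct`. The sketch below assumes the 16-byte layout used by Python 3.7 and later (magic, bit field, then either mtime and source size or an 8-byte source hash); older versions use shorter headers and are not handled. The function name `parse_pyc_header` is hypothetical.

```python
import struct
from datetime import datetime, timezone


def parse_pyc_header(data):
    """Parse the 16-byte header used by Python 3.7+ .pyc files."""
    if len(data) < 16:
        raise ValueError("need at least 16 bytes")
    if data[2:4] != b"\r\n":
        raise ValueError("magic does not end in \\r\\n")
    # 2-byte version number, 2 bytes of \r\n (skipped), 4-byte bit field
    magic, flags = struct.unpack("<H2xI", data[:8])
    header = {"magic_number": magic, "flags": flags, "hash_based": bool(flags & 0x1)}
    if header["hash_based"]:
        header["source_hash"] = data[8:16].hex()
    else:
        mtime, size = struct.unpack("<II", data[8:16])
        header["mtime"] = datetime.fromtimestamp(mtime, tz=timezone.utc)
        header["source_size"] = size
    return header
```

Read the first 16 bytes of a file and pass them in, for example `parse_pyc_header(open("module.pyc", "rb").read(16))`.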

Reading Magic Numbers

# Extract and display magic numbers
xxd -l 16 extracted_module.pyc

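# The first four bytes are the magic number,
# for example 55 0d 0d 0a for Python 3.8

# Identify Python version
python << 'EOF'
import importlib.util

with open('module.pyc', 'rb') as f:
    magic = f.read(4)

print(f"Magic: {magic.hex()}")
# The first two bytes are a little-endian version number; the last two are \r\n
print(f"Magic number: {int.from_bytes(magic[:2], 'little')}")
# Compare against the interpreter running this script
print(f"Matches this interpreter: {magic == importlib.util.MAGIC_NUMBER}")
EOF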

Archive Metadata Extraction

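# Read PyInstaller cookie (metadata)
python << 'EOF'
import struct

MAGIC = b'MEI\x0c\x0b\x0a\x0b\x0e'  # CArchive cookie magic

with open('application.exe', 'rb') as f:
    data = f.read()

# The cookie sits near the end of the file and starts with the magic
pos = data.rfind(MAGIC)
if pos == -1:
    print("No PyInstaller cookie found")
else:
    # Big-endian fields: magic, package length, TOC offset, TOC length, Python version
    _, pkg_len, toc_offset, toc_len, pyver = struct.unpack('!8sIIii', data[pos:pos + 24])
    print(f"Cookie offset: {pos}")
    print(f"Package length: {pkg_len}")
    print(f"TOC offset (relative to archive start): {toc_offset}")
    print(f"TOC length: {toc_len}")
    print(f"Python version: {pyver}")
EOF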
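
The archive's table of contents can be walked once its bytes are in hand. The sketch below assumes the entry layout that pyinstxtractor reads: a big-endian fixed part of 18 bytes (entry size, data offset, compressed size, uncompressed size, compression flag, type code) followed by a NUL-padded name. The function name `parse_toc` and the dictionary keys are hypothetical.

```python
import struct

TOC_FIXED = struct.calcsize("!iIIIBc")  # 18 bytes before the entry name


def parse_toc(toc_bytes):
    """Parse CArchive table-of-contents entries from raw bytes."""
    entries = []
    pos = 0
    while pos + TOC_FIXED <= len(toc_bytes):
        entry_size, offset, comp_size, uncomp_size, comp_flag, type_code = \
            struct.unpack_from("!iIIIBc", toc_bytes, pos)
        if entry_size < TOC_FIXED:
            break  # malformed entry; stop rather than loop forever
        raw_name = toc_bytes[pos + TOC_FIXED:pos + entry_size]
        entries.append({
            "name": raw_name.rstrip(b"\x00").decode("utf-8", "replace"),
            "offset": offset,
            "compressed_size": comp_size,
            "uncompressed_size": uncomp_size,
            "compressed": bool(comp_flag),
            "type": type_code.decode("latin-1"),  # e.g. "s" script, "z" PYZ
        })
        pos += entry_size
    return entries
```

The input is the slice of the file that starts at the archive start plus the TOC offset and spans the TOC length reported by the cookie.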

Troubleshooting

"Error: Failed to parse archive"

# Check file integrity
file suspicious_app.exe

# Verify it's actually a PyInstaller executable
strings suspicious_app.exe | grep -i "pyinstaller"

# Try locating the archive manually
python << 'EOF'
# Manual analysis if automated extraction fails
with open('suspicious_app.exe', 'rb') as f:
    data = f.read()

# Search for the CArchive cookie magic, which marks the end of the archive
idx = data.rfind(b'MEI\x0c\x0b\x0a\x0b\x0e')
if idx != -1:
    print(f"Found PyInstaller cookie at offset: {idx}")
else:
    print("No PyInstaller cookie found (the magic may be modified or the file truncated)")
EOF

Missing or Corrupted Bytecode

# Check extraction directory
ls -la application.exe_extracted/ | wc -l

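# Verify the bundled standard library archive
unzip -t application.exe_extracted/base_library.zip

# Test the archive from Python
python << 'EOF'
import zipfile

zf = 'application.exe_extracted/base_library.zip'
try:
    with zipfile.ZipFile(zf, 'r') as z:
        bad = z.testzip()  # name of the first corrupt member, or None
    print("ZIP file is valid" if bad is None else f"Corrupt member: {bad}")
except (zipfile.BadZipFile, OSError) as e:
    print(f"Corruption detected: {e}")
EOF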

Decompilation Failures

# Check Python version compatibility
python << 'EOF'
import struct

with open('module.pyc', 'rb') as f:
    # The version number is the first two bytes, little-endian
    magic = struct.unpack('<H', f.read(2))[0]
    print(f"Magic number: {magic}")

# Common magic numbers:
# 3495 (3.11), 3439 (3.10), 3425 (3.9), 3413 (3.8), 3394 (3.7)
EOF

# uncompyle6 targets bytecode up to Python 3.8; for newer versions try pycdc
pycdc module.pyc > module.py
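
Mapping the magic to a release tells you which decompiler to reach for. The table below is a partial sketch covering final releases only; alpha and beta builds use other numbers, and the function name `python_version_from_magic` is hypothetical.

```python
# Magic numbers of final CPython releases (partial list)
MAGIC_TO_VERSION = {
    62211: "2.7",
    3379: "3.6",
    3394: "3.7",
    3413: "3.8",
    3425: "3.9",
    3439: "3.10",
    3495: "3.11",
    3531: "3.12",
}


def python_version_from_magic(header):
    """Map the first four bytes of a .pyc file to a Python version, or None."""
    if len(header) < 4 or header[2:4] != b"\r\n":
        return None  # not a .pyc magic
    number = int.from_bytes(header[:2], "little")
    return MAGIC_TO_VERSION.get(number)
```

A `None` result means either the file is not bytecode or the version is outside this table.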

Advanced Techniques

Custom Extraction Script

#!/usr/bin/env python3
"""
Advanced PyInstaller extraction with analysis
"""
import subprocess
import sys
import zipfile
from pathlib import Path

class PyInstallerAnalyzer:
    def __init__(self, executable):
        self.exe = executable
        # pyinstxtractor writes <name>_extracted into the current working directory
        self.extracted_dir = Path.cwd() / f"{Path(executable).name}_extracted"

    def extract(self):
        """Extract using pyinstxtractor"""
        # An argument list avoids shell quoting problems with unusual file names
        subprocess.run([sys.executable, "pyinstxtractor.py", self.exe], check=True)

    def analyze_archive(self):
        """Analyze extracted archive structure"""
        base_lib = self.extracted_dir / "base_library.zip"

        if base_lib.exists():
            with zipfile.ZipFile(base_lib, 'r') as z:
                print(f"Base library files: {len(z.namelist())}")
                print("First 10 modules:")
                for name in z.namelist()[:10]:
                    info = z.getinfo(name)
                    print(f"  {name} ({info.file_size} bytes)")

    def find_main_module(self):
        """Locate entry-point candidates"""
        # Entry-point scripts land at the top level; pyi* files are bootstrap helpers
        return sorted(
            p for p in self.extracted_dir.glob("*.pyc")
            if not p.name.startswith("pyi")
        )

# Usage
if __name__ == '__main__':
    analyzer = PyInstallerAnalyzer('app.exe')
    analyzer.extract()
    analyzer.analyze_archive()
    for candidate in analyzer.find_main_module():
        print(f"Entry-point candidate: {candidate}")

Batch Analysis with Reporting

#!/bin/bash

REPORT="extraction_report.txt"
> "$REPORT"

for exe in *.exe; do
    echo "=== Analyzing $exe ===" | tee -a "$REPORT"
    
    # Extract
    python pyinstxtractor.py "$exe" 2>&1 | tee -a "$REPORT"
    
    # Find entry-point candidates (top-level .pyc files other than bootstrap helpers)
    main_pyc=$(find "${exe}_extracted" -maxdepth 1 -name "*.pyc" ! -name "pyi*")
    echo "Entry-point candidates: $main_pyc" | tee -a "$REPORT"
    
    # Count dependencies
    dep_count=$(find "${exe}_extracted" -name "*.pyc" | wc -l)
    echo "Modules found: $dep_count" | tee -a "$REPORT"
    
    # Count files containing suspicious patterns
    suspicious=$(grep -rlE "socket|subprocess|eval|exec" "${exe}_extracted" 2>/dev/null | wc -l)
    echo "Files with suspicious patterns: $suspicious" | tee -a "$REPORT"
    
    echo "" | tee -a "$REPORT"
done

echo "Report saved to: $REPORT"

Security Implications

Legitimate Uses

  • Auditing your own compiled applications
  • Malware analysis and threat research
  • Code review and compliance verification
  • Educational purposes and learning

Defensive Measures

  • Code Obfuscation: Use PyArmor or Cython for compiled code
  • Encryption: Add bytecode encryption layers
  • Version Hiding: Remove Python version strings from binary
  • Custom Bootloader: Modify PyInstaller’s startup sequence
  • Code Signing: Verify executable authenticity

Comparison with Alternatives

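| Tool | Purpose | Speed | Accuracy |
|------|---------|-------|----------|
| pyinstxtractor | Archive extraction | Fast | Excellent |
| uncompyle6 | Decompilation | Slow | Good |
| pycdc | Decompilation | Fast | Excellent |
| Ghidra | Binary analysis | Slow | Good |
| IDA Pro | Binary analysis | Slow | Excellent |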

Ethical Considerations

  • Authorization: Only analyze executables you own or have permission to analyze
  • Intellectual Property: Respect copyright and trade secrets
  • Responsible Disclosure: Report vulnerabilities properly
  • Legal Compliance: Follow applicable laws regarding reverse engineering
  • Attribution: Credit original authors when sharing analysis

Resources

Workflow Summary

| Step | Tool | Command |
|------|------|---------|
| Extract | pyinstxtractor | python pyinstxtractor.py app.exe |
| Analyze | strings/grep | grep -r "pattern" app.exe_extracted/ |
| Decompress | unzip | unzip base_library.zip |
| Decompile | uncompyle6 | uncompyle6 module.pyc |
| Review | Text editor | cat main.py |