FOCA Cheat Sheet

"Clase de la hoja" id="copy-btn" class="copy-btn" onclick="copyAllCommands()" Copiar todos los comandos id="pdf-btn" class="pdf-btn" onclick="generatePDF()" Generar PDF seleccionado/button ■/div titulada

Overview

FOCA (Fingerprinting Organizations with Collected Archives) is a powerful metadata analysis and document intelligence tool used to extract hidden information from documents and files. It specializes in uncovering metadata, network information, users, folders, software versions, and other sensitive data that organizations inadvertently expose through publicly available documents.

Note: FOCA is a Windows-based tool that requires the .NET Framework. Always ensure you have proper authorization before analyzing target documents.

Installation and Setup

System Requirements and Installation

# System Requirements:
# - Windows 7/8/10/11 (32-bit or 64-bit)
# - .NET Framework 4.5 or later
# - Microsoft Office (for advanced document analysis)
# - Internet connection for online searches

# Download FOCA:
# 1. Visit https://github.com/ElevenPaths/FOCA
# 2. Download latest release
# 3. Extract to desired directory
# 4. Run FOCA.exe as administrator

# Alternative: Install via Chocolatey
choco install foca

# Verify installation
# Launch FOCA.exe and check version in Help > About
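
For scripted setups, the .NET prerequisite can be verified before running the installer. A minimal Python sketch, assuming it runs on the target Windows host; the registry path and the 378389 release threshold are the standard markers for .NET Framework 4.5 and are not FOCA-specific:

# prereq_check.py: rough sketch of a .NET Framework 4.5+ check on Windows
import platform
import winreg

def dotnet_45_or_later():
    """Return True if a .NET Framework 4.5+ release value is registered."""
    key_path = r"SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full"
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
            release, _ = winreg.QueryValueEx(key, "Release")
            return release >= 378389  # 378389 marks .NET Framework 4.5
    except OSError:
        return False

print(f"OS: {platform.system()} {platform.release()}")
print(".NET Framework 4.5+:", "present" if dotnet_45_or_later() else "missing")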

Initial Configuration

# Configuration steps:
# 1. Launch FOCA.exe
# 2. Go to Options > Configuration
# 3. Configure search engines and APIs
# 4. Set download directories
# 5. Configure proxy settings if needed
# 6. Set analysis preferences

# Key configuration options:
# - Search engines: Google, Bing, DuckDuckGo
# - Download folder: C:\FOCA\Downloads
# - Temporary folder: C:\FOCA\Temp
# - Maximum file size: 50MB
# - Timeout settings: 30 seconds
# - Proxy configuration: Manual/Automatic
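
The download and temporary folders listed above can be created ahead of time so FOCA's configuration dialog can point at them directly; a minimal sketch using the default paths from the list:

# create_foca_dirs.py: pre-create the working folders referenced above
from pathlib import Path

for folder in (r"C:\FOCA\Downloads", r"C:\FOCA\Temp"):
    Path(folder).mkdir(parents=True, exist_ok=True)
    print(f"Ready: {folder}")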

Search Engine API Configuration

<!-- FOCA Configuration File: FOCA.exe.config -->
<configuration>
  <appSettings>
    <!-- Google Custom Search API -->
    <add key="GoogleAPIKey" value="your_google_api_key" />
    <add key="GoogleSearchEngineID" value="your_search_engine_id" />

    <!-- Bing Search API -->
    <add key="BingAPIKey" value="your_bing_api_key" />

    <!-- Shodan API -->
    <add key="ShodanAPIKey" value="your_shodan_api_key" />

    <!-- VirusTotal API -->
    <add key="VirusTotalAPIKey" value="your_virustotal_api_key" />

    <!-- Download settings -->
    <add key="MaxFileSize" value="52428800" /> <!-- 50MB -->
    <add key="DownloadTimeout" value="30000" /> <!-- 30 seconds -->
    <add key="MaxConcurrentDownloads" value="5" />

    <!-- Proxy settings -->
    <add key="UseProxy" value="false" />
    <add key="ProxyAddress" value="proxy.company.com" />
    <add key="ProxyPort" value="8080" />
    <add key="ProxyUsername" value="username" />
    <add key="ProxyPassword" value="password" />
  </appSettings>
</configuration>
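
When API keys need to be rotated without opening the GUI, the appSettings entries can be rewritten programmatically. A hedged sketch using Python's standard library; the config path and key names simply mirror the illustrative block above and are not a documented FOCA schema:

# update_foca_config.py: rewrite selected appSettings keys in FOCA.exe.config
import xml.etree.ElementTree as ET

CONFIG_PATH = r"C:\FOCA\FOCA.exe.config"   # adjust to the actual install folder
NEW_VALUES = {
    "GoogleAPIKey": "new_google_api_key",
    "ShodanAPIKey": "new_shodan_api_key",
}

tree = ET.parse(CONFIG_PATH)
for add in tree.getroot().findall("./appSettings/add"):
    if add.get("key") in NEW_VALUES:
        add.set("value", NEW_VALUES[add.get("key")])
tree.write(CONFIG_PATH, encoding="utf-8", xml_declaration=True)
print("Updated keys:", ", ".join(NEW_VALUES))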

Document Discovery and Collection

Search Engine Integration

# Google Search Configuration:
# 1. Create Google Custom Search Engine
# 2. Get API key from Google Cloud Console
# 3. Configure in FOCA Options

# Search operators for document discovery:
# site:target.com filetype:pdf
# site:target.com filetype:doc
# site:target.com filetype:docx
# site:target.com filetype:xls
# site:target.com filetype:xlsx
# site:target.com filetype:ppt
# site:target.com filetype:pptx

# Advanced search operators:
# site:target.com (filetype:pdf OR filetype:doc OR filetype:xls)
# site:target.com "confidential" filetype:pdf
# site:target.com "internal" filetype:doc
# site:target.com inurl:admin filetype:pdf
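
The operators above can be generated in bulk and pasted into a search engine or into FOCA's search dialog; a small helper sketch (target.com is a placeholder domain):

# generate_dorks.py: print document-discovery queries for a target domain
FILETYPES = ["pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx"]
KEYWORDS = ["confidential", "internal"]

def build_dorks(domain):
    dorks = [f"site:{domain} filetype:{ft}" for ft in FILETYPES]
    combined = " OR ".join(f"filetype:{ft}" for ft in ("pdf", "doc", "xls"))
    dorks.append(f"site:{domain} ({combined})")
    dorks += [f'site:{domain} "{kw}" filetype:pdf' for kw in KEYWORDS]
    return dorks

for dork in build_dorks("target.com"):
    print(dork)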

Manual Document Collection

# Manual URL addition in FOCA:
# 1. Go to Project > URLs
# 2. Add URLs manually or import from file
# 3. Use bulk import for large lists

# URL format examples:
https://target.com/documents/report.pdf
https://target.com/files/presentation.pptx
https://target.com/downloads/manual.doc
https://subdomain.target.com/docs/guide.pdf

# Bulk import file format (urls.txt):
https://target.com/doc1.pdf
https://target.com/doc2.docx
https://target.com/doc3.xlsx
https://target.com/doc4.pptx

# Import commands:
# File > Import > URLs from file
# Select urls.txt file
# Choose import options
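
Before a bulk import, the raw URL list can be deduplicated and limited to the document extensions FOCA handles. A minimal sketch that writes the urls.txt format shown above; raw_urls.txt is a hypothetical input file:

# clean_urls.py: deduplicate a raw URL list into FOCA's bulk-import format
from urllib.parse import urlparse

DOC_EXTENSIONS = (".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx")

def clean_url_list(input_file="raw_urls.txt", output_file="urls.txt"):
    seen = set()
    with open(input_file) as src, open(output_file, "w") as dst:
        for line in src:
            url = line.strip()
            if not url or url in seen:
                continue
            if urlparse(url).path.lower().endswith(DOC_EXTENSIONS):
                seen.add(url)
                dst.write(url + "\n")
    return len(seen)

print("URLs written:", clean_url_list())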

Automated Document Discovery

# PowerShell script for automated document discovery
param(
    [Parameter(Mandatory=$true)]
    [string]$Domain,

    [string[]]$FileTypes = @("pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx"),
    [string]$OutputFile = "discovered_documents.txt",
    [int]$MaxResults = 100
)

# Function to search Google for documents
function Search-GoogleDocuments {
    param($domain, $filetype, $maxResults)

    $searchQuery = "site:$domain filetype:$filetype"
    $apiKey = "your_google_api_key"
    $searchEngineId = "your_search_engine_id"

    $results = @()
    $startIndex = 1

    while ($results.Count -lt $maxResults -and $startIndex -le 100) {
        $url = "https://www.googleapis.com/customsearch/v1?key=$apiKey&cx;=$searchEngineId&q;=$searchQuery&start;=$startIndex"

        try {
            $response = Invoke-RestMethod -Uri $url -Method Get

            if ($response.items) {
                foreach ($item in $response.items) {
                    $results += $item.link
                }
                $startIndex += 10
            } else {
                break
            }
        } catch {
            Write-Warning "Error searching for $filetype files: $($_.Exception.Message)"
            break
        }

        Start-Sleep -Seconds 1  # Rate limiting
    }

    return $results
}

# Main execution
$allDocuments = @()

foreach ($fileType in $FileTypes) {
    Write-Host "Searching for $fileType files on $Domain..."
    $documents = Search-GoogleDocuments -domain $Domain -filetype $fileType -maxResults $MaxResults
    $allDocuments += $documents
    Write-Host "Found $($documents.Count) $fileType files"
}

# Remove duplicates and save results
$uniqueDocuments = $allDocuments | Sort-Object | Get-Unique
$uniqueDocuments | Out-File -FilePath $OutputFile -Encoding UTF8

Write-Host "Total unique documents found: $($uniqueDocuments.Count)"
Write-Host "Results saved to: $OutputFile"

# Usage example:
# .\Discover-Documents.ps1 -Domain "example.com" -OutputFile "example_docs.txt"

Metadata Analysis and Extraction

Metadata Analysis Process

# FOCA Metadata Analysis Process:
# 1. Load project or create new one
# 2. Add documents via search or manual import
# 3. Download documents automatically
# 4. Analyze metadata from downloaded files
# 5. Review extracted information

# Metadata types extracted by FOCA:
# - Author information
# - Creation and modification dates
# - Software versions used
# - Computer names and usernames
# - Network paths and shared folders
# - Printer information
# - Email addresses
# - Company information
# - Document templates
# - Revision history
# - Comments and tracked changes
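
For a quick spot check of these fields outside FOCA, exiftool can dump them as JSON. A hedged sketch, assuming exiftool is installed and on the PATH; the tag names shown are common ones for PDF and Office files and may vary by format:

# quick_metadata.py: dump document metadata via exiftool (must be on PATH)
import json
import subprocess
import sys

FIELDS = ("Author", "Creator", "Producer", "Company", "LastModifiedBy",
          "CreateDate", "ModifyDate")

def dump_metadata(path):
    # -j makes exiftool emit a JSON array with one object per input file
    result = subprocess.run(["exiftool", "-j", path],
                            capture_output=True, text=True, check=True)
    info = json.loads(result.stdout)[0]
    for field in FIELDS:
        if field in info:
            print(f"{field}: {info[field]}")

if __name__ == "__main__":
    dump_metadata(sys.argv[1])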

Advanced Metadata Extraction

// C# code for custom metadata extraction
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class MetadataExtractor
{
    public class DocumentMetadata
    {
        public string FileName { get; set; }
        public string Author { get; set; }
        public string Creator { get; set; }
        public DateTime? CreationDate { get; set; }
        public DateTime? ModificationDate { get; set; }
        public string Application { get; set; }
        public string Company { get; set; }
        public string Subject { get; set; }
        public string Title { get; set; }
        public string Keywords { get; set; }
        public string Comments { get; set; }
        public string LastModifiedBy { get; set; }
        public int? RevisionNumber { get; set; }
        public TimeSpan? TotalEditTime { get; set; }
        public string Template { get; set; }
        public List<string> UserNames { get; set; } = new List<string>();
        public List<string> ComputerNames { get; set; } = new List<string>();
        public List<string> NetworkPaths { get; set; } = new List<string>();
        public List<string> EmailAddresses { get; set; } = new List<string>();
        public List<string> PrinterNames { get; set; } = new List<string>();
    }

    public DocumentMetadata ExtractPdfMetadata(string filePath)
    {
        var metadata = new DocumentMetadata { FileName = Path.GetFileName(filePath) };

        try
        {
            using (var reader = new PdfReader(filePath))
            {
                var info = reader.Info;

                metadata.Author = info.ContainsKey("Author") ? info["Author"] : null;
                metadata.Creator = info.ContainsKey("Creator") ? info["Creator"] : null;
                metadata.Subject = info.ContainsKey("Subject") ? info["Subject"] : null;
                metadata.Title = info.ContainsKey("Title") ? info["Title"] : null;
                metadata.Keywords = info.ContainsKey("Keywords") ? info["Keywords"] : null;

                if (info.ContainsKey("CreationDate"))
                {
                    metadata.CreationDate = ParsePdfDate(info["CreationDate"]);
                }

                if (info.ContainsKey("ModDate"))
                {
                    metadata.ModificationDate = ParsePdfDate(info["ModDate"]);
                }

                // Extract text content for additional analysis
                var text = ExtractTextFromPdf(reader);
                AnalyzeTextContent(text, metadata);
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error extracting PDF metadata: {ex.Message}");
        }

        return metadata;
    }

    public DocumentMetadata ExtractWordMetadata(string filePath)
    {
        var metadata = new DocumentMetadata { FileName = Path.GetFileName(filePath) };

        try
        {
            using (var doc = WordprocessingDocument.Open(filePath, false))
            {
                var coreProps = doc.PackageProperties;
                var extendedProps = doc.ExtendedFilePropertiesPart?.Properties;
                var customProps = doc.CustomFilePropertiesPart?.Properties;

                // Core properties
                metadata.Author = coreProps.Creator;
                metadata.LastModifiedBy = coreProps.LastModifiedBy;
                metadata.CreationDate = coreProps.Created;
                metadata.ModificationDate = coreProps.Modified;
                metadata.Subject = coreProps.Subject;
                metadata.Title = coreProps.Title;
                metadata.Keywords = coreProps.Keywords;
                metadata.Comments = coreProps.Description;

                // Extended properties
                if (extendedProps != null)
                {
                    metadata.Application = extendedProps.Application?.Text;
                    metadata.Company = extendedProps.Company?.Text;
                    metadata.Template = extendedProps.Template?.Text;

                    if (extendedProps.TotalTime != null)
                    {
                        metadata.TotalEditTime = TimeSpan.FromMinutes(
                            double.Parse(extendedProps.TotalTime.Text)
                        );
                    }
                }

                // Extract revision information
                ExtractRevisionInfo(doc, metadata);

                // Extract text content for additional analysis
                var text = ExtractTextFromWord(doc);
                AnalyzeTextContent(text, metadata);
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error extracting Word metadata: {ex.Message}");
        }

        return metadata;
    }

    private void AnalyzeTextContent(string text, DocumentMetadata metadata)
    {
        if (string.IsNullOrEmpty(text)) return;

        // Extract email addresses
        var emailRegex = new Regex(@"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b");
        var emailMatches = emailRegex.Matches(text);
        foreach (Match match in emailMatches)
        {
            if (!metadata.EmailAddresses.Contains(match.Value))
            {
                metadata.EmailAddresses.Add(match.Value);
            }
        }

        // Extract computer names (Windows format)
        var computerRegex = new Regex(@"\\\\([A-Za-z0-9\-_]+)\\");
        var computerMatches = computerRegex.Matches(text);
        foreach (Match match in computerMatches)
        {
            var computerName = match.Groups[1].Value;
            if (!metadata.ComputerNames.Contains(computerName))
            {
                metadata.ComputerNames.Add(computerName);
            }
        }

        // Extract network paths
        var pathRegex = new Regex(@"\\\\[A-Za-z0-9\-_]+\\[A-Za-z0-9\-_\\]+");
        var pathMatches = pathRegex.Matches(text);
        foreach (Match match in pathMatches)
        {
            if (!metadata.NetworkPaths.Contains(match.Value))
            {
                metadata.NetworkPaths.Add(match.Value);
            }
        }

        // Extract usernames from paths
        var userRegex = new Regex(@"C:\\Users\\([A-Za-z0-9\-_\.]+)\\");
        var userMatches = userRegex.Matches(text);
        foreach (Match match in userMatches)
        {
            var username = match.Groups[1].Value;
            if (!metadata.UserNames.Contains(username) && username != "Public")
            {
                metadata.UserNames.Add(username);
            }
        }
    }

    private void ExtractRevisionInfo(WordprocessingDocument doc, DocumentMetadata metadata)
    {
        try
        {
            var mainPart = doc.MainDocumentPart;
            if (mainPart?.Document?.Body != null)
            {
                // Look for revision tracking information
                var insertions = mainPart.Document.Body.Descendants<Inserted>();
                var deletions = mainPart.Document.Body.Descendants<Deleted>();

                foreach (var insertion in insertions)
                {
                    var author = insertion.Author?.Value;
                    if (!string.IsNullOrEmpty(author) && !metadata.UserNames.Contains(author))
                    {
                        metadata.UserNames.Add(author);
                    }
                }

                foreach (var deletion in deletions)
                {
                    var author = deletion.Author?.Value;
                    if (!string.IsNullOrEmpty(author) && !metadata.UserNames.Contains(author))
                    {
                        metadata.UserNames.Add(author);
                    }
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error extracting revision info: {ex.Message}");
        }
    }

    // Helper methods referenced above (minimal implementations)
    private string ExtractTextFromPdf(PdfReader reader)
    {
        var text = new StringBuilder();
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
        }
        return text.ToString();
    }

    private string ExtractTextFromWord(WordprocessingDocument doc)
    {
        return doc.MainDocumentPart?.Document?.Body?.InnerText ?? string.Empty;
    }

    private DateTime? ParsePdfDate(string pdfDate)
    {
        // PDF dates look like "D:20240115093000+01'00'"; parse the leading portion
        if (string.IsNullOrEmpty(pdfDate)) return null;
        var cleaned = pdfDate.StartsWith("D:") ? pdfDate.Substring(2) : pdfDate;
        if (cleaned.Length >= 14 && DateTime.TryParseExact(cleaned.Substring(0, 14),
                "yyyyMMddHHmmss", CultureInfo.InvariantCulture,
                DateTimeStyles.None, out var parsed))
        {
            return parsed;
        }
        return null;
    }

    public void GenerateMetadataReport(List<DocumentMetadata> documents, string outputPath)
    {
        var report = new StringBuilder();
        report.AppendLine("FOCA Metadata Analysis Report");
        report.AppendLine("Generated: " + DateTime.Now.ToString("yyyy-MM-dd HH:mm:ss"));
        report.AppendLine(new string('=', 50));
        report.AppendLine();

        // Summary statistics
        report.AppendLine("SUMMARY STATISTICS");
        report.AppendLine($"Total documents analyzed: {documents.Count}");

        var uniqueAuthors = documents.Where(d => !string.IsNullOrEmpty(d.Author))
                                   .Select(d => d.Author).Distinct().ToList();
        report.AppendLine($"Unique authors found: {uniqueAuthors.Count}");

        var uniqueUsers = documents.SelectMany(d => d.UserNames).Distinct().ToList();
        report.AppendLine($"Unique usernames found: {uniqueUsers.Count}");

        var uniqueComputers = documents.SelectMany(d => d.ComputerNames).Distinct().ToList();
        report.AppendLine($"Unique computer names found: {uniqueComputers.Count}");

        var uniqueEmails = documents.SelectMany(d => d.EmailAddresses).Distinct().ToList();
        report.AppendLine($"Unique email addresses found: {uniqueEmails.Count}");

        report.AppendLine();

        // Detailed findings
        report.AppendLine("DETAILED FINDINGS");
        report.AppendLine();

        if (uniqueAuthors.Any())
        {
            report.AppendLine("AUTHORS:");
            foreach (var author in uniqueAuthors.OrderBy(a => a))
            {
                var docCount = documents.Count(d => d.Author == author);
                report.AppendLine($"  {author} ({docCount} documents)");
            }
            report.AppendLine();
        }

        if (uniqueUsers.Any())
        {
            report.AppendLine("USERNAMES:");
            foreach (var user in uniqueUsers.OrderBy(u => u))
            {
                report.AppendLine($"  {user}");
            }
            report.AppendLine();
        }

        if (uniqueComputers.Any())
        {
            report.AppendLine("COMPUTER NAMES:");
            foreach (var computer in uniqueComputers.OrderBy(c => c))
            {
                report.AppendLine($"  {computer}");
            }
            report.AppendLine();
        }

        if (uniqueEmails.Any())
        {
            report.AppendLine("EMAIL ADDRESSES:");
            foreach (var email in uniqueEmails.OrderBy(e => e))
            {
                report.AppendLine($"  {email}");
            }
            report.AppendLine();
        }

        // Software analysis
        var applications = documents.Where(d => !string.IsNullOrEmpty(d.Application))
                                  .GroupBy(d => d.Application)
                                  .OrderByDescending(g => g.Count())
                                  .ToList();

        if (applications.Any())
        {
            report.AppendLine("SOFTWARE APPLICATIONS:");
            foreach (var app in applications)
            {
                report.AppendLine($"  {app.Key} ({app.Count()} documents)");
            }
            report.AppendLine();
        }

        File.WriteAllText(outputPath, report.ToString());
    }
}

// Usage example
var extractor = new MetadataExtractor();
var documents = new List<DocumentMetadata>();

// Process all documents in a directory
var documentFiles = Directory.GetFiles(@"C:\FOCA\Downloads", "*.*", SearchOption.AllDirectories)
    .Where(f => f.EndsWith(".pdf") || f.EndsWith(".docx") || f.EndsWith(".doc"));

foreach (var file in documentFiles)
{
    DocumentMetadata metadata = null;

    if (file.EndsWith(".pdf"))
    {
        metadata = extractor.ExtractPdfMetadata(file);
    }
    else if (file.EndsWith(".docx") || file.EndsWith(".doc"))
    {
        metadata = extractor.ExtractWordMetadata(file);
    }

    if (metadata != null)
    {
        documents.Add(metadata);
    }
}

// Generate report
extractor.GenerateMetadataReport(documents, @"C:\FOCA\metadata_report.txt");

Network Information Discovery

DNS and Network Analysis

# FOCA Network Analysis Features:
# 1. DNS resolution of discovered domains
# 2. Network range identification
# 3. Technology fingerprinting
# 4. Server information extraction
# 5. Network infrastructure mapping

# DNS Analysis in FOCA:
# - Automatic DNS resolution of found domains
# - Reverse DNS lookups
# - DNS record enumeration (A, AAAA, MX, NS, TXT)
# - Subdomain discovery from documents
# - Network range calculation

# Technology Fingerprinting:
# - Web server identification
# - Operating system detection
# - Application framework identification
# - Database technology discovery
# - Content management system detection
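
One item not covered by the analyzer script in the next subsection is reverse DNS. A short sketch of PTR lookups for IP addresses harvested from documents; the sample addresses are placeholders:

# reverse_dns.py: PTR lookups for IPs found during document analysis
import socket

def reverse_lookup(ips):
    results = {}
    for ip in ips:
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
            results[ip] = hostname
        except (socket.herror, socket.gaierror):
            results[ip] = None  # no PTR record published
    return results

for ip, host in reverse_lookup(["203.0.113.10", "198.51.100.25"]).items():
    print(f"{ip} -> {host or 'no PTR record'}")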

Network Infrastructure Mapping

# Python script for enhanced network analysis
import dns.resolver
import socket
import requests
import json
import ipaddress
from urllib.parse import urlparse
import whois
import ssl
import subprocess

class NetworkAnalyzer:
    def __init__(self):
        self.discovered_domains = set()
        self.discovered_ips = set()
        self.network_ranges = set()
        self.technologies = {}

    def analyze_document_urls(self, document_urls):
        """Analyze URLs found in documents for network information"""

        for url in document_urls:
            try:
                parsed = urlparse(url)
                domain = parsed.netloc

                if domain:
                    self.discovered_domains.add(domain)

                    # Resolve domain to IP
                    try:
                        ip = socket.gethostbyname(domain)
                        self.discovered_ips.add(ip)

                        # Determine network range
                        network = self.get_network_range(ip)
                        if network:
                            self.network_ranges.add(str(network))

                    except socket.gaierror:
                        print(f"Could not resolve {domain}")

            except Exception as e:
                print(f"Error analyzing URL {url}: {e}")

    def get_network_range(self, ip):
        """Determine network range for IP address"""
        try:
            # Use whois to get network information
            result = subprocess.run(['whois', ip], capture_output=True, text=True)
            whois_output = result.stdout

            # Parse CIDR from whois output
            for line in whois_output.split('\n'):
                if 'CIDR:' in line or 'route:' in line:
                    cidr = line.split(':')[1].strip()
                    if '/' in cidr:
                        return ipaddress.ip_network(cidr, strict=False)

            # Fallback to /24 network
            return ipaddress.ip_network(f"{ip}/24", strict=False)

        except Exception as e:
            print(f"Error getting network range for {ip}: {e}")
            return None

    def perform_dns_enumeration(self, domain):
        """Perform comprehensive DNS enumeration"""
        dns_records = {}

        record_types = ['A', 'AAAA', 'MX', 'NS', 'TXT', 'CNAME', 'SOA']

        for record_type in record_types:
            try:
                answers = dns.resolver.resolve(domain, record_type)
                dns_records[record_type] = [str(answer) for answer in answers]
            except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
                dns_records[record_type] = []
            except Exception as e:
                print(f"Error resolving {record_type} for {domain}: {e}")
                dns_records[record_type] = []

        return dns_records

    def fingerprint_web_technology(self, url):
        """Fingerprint web technologies"""
        try:
            response = requests.get(url, timeout=10, verify=False)

            technology_info = {
                'server': response.headers.get('Server', ''),
                'x_powered_by': response.headers.get('X-Powered-By', ''),
                'content_type': response.headers.get('Content-Type', ''),
                'status_code': response.status_code,
                'technologies': []
            }

            # Analyze response headers
            headers = response.headers
            content = response.text.lower()

            # Common technology indicators
            tech_indicators = {
                'Apache': ['apache'],
                'Nginx': ['nginx'],
                'IIS': ['microsoft-iis'],
                'PHP': ['php', 'x-powered-by: php'],
                'ASP.NET': ['asp.net', 'x-aspnet-version'],
                'WordPress': ['wp-content', 'wordpress'],
                'Drupal': ['drupal'],
                'Joomla': ['joomla'],
                'jQuery': ['jquery'],
                'Bootstrap': ['bootstrap'],
                'Angular': ['angular'],
                'React': ['react'],
                'Vue.js': ['vue.js', 'vuejs']
            }

            for tech, indicators in tech_indicators.items():
                for indicator in indicators:
                    if (indicator in str(headers).lower() or 
                        indicator in content):
                        technology_info['technologies'].append(tech)
                        break

            # SSL/TLS information
            if url.startswith('https://'):
                ssl_info = self.get_ssl_info(urlparse(url).netloc)
                technology_info['ssl'] = ssl_info

            return technology_info

        except Exception as e:
            print(f"Error fingerprinting {url}: {e}")
            return None

    def get_ssl_info(self, hostname):
        """Get SSL certificate information"""
        try:
            context = ssl.create_default_context()
            with socket.create_connection((hostname, 443), timeout=10) as sock:
                with context.wrap_socket(sock, server_hostname=hostname) as ssock:
                    cert = ssock.getpeercert()

                    return {
                        'subject': dict(x[0] for x in cert['subject']),
                        'issuer': dict(x[0] for x in cert['issuer']),
                        'version': cert['version'],
                        'serial_number': cert['serialNumber'],
                        'not_before': cert['notBefore'],
                        'not_after': cert['notAfter'],
                        'san': cert.get('subjectAltName', [])
                    }
        except Exception as e:
            print(f"Error getting SSL info for {hostname}: {e}")
            return None

    def generate_network_report(self, output_file):
        """Generate comprehensive network analysis report"""

        report = {
            'summary': {
                'domains_discovered': len(self.discovered_domains),
                'ip_addresses_discovered': len(self.discovered_ips),
                'network_ranges': len(self.network_ranges)
            },
            'domains': list(self.discovered_domains),
            'ip_addresses': list(self.discovered_ips),
            'network_ranges': list(self.network_ranges),
            'technologies': self.technologies,
            'dns_records': {},
            'ssl_certificates': {}
        }

        # Perform DNS enumeration for each domain
        for domain in self.discovered_domains:
            print(f"Enumerating DNS for {domain}...")
            report['dns_records'][domain] = self.perform_dns_enumeration(domain)

        # Fingerprint technologies for each domain
        for domain in self.discovered_domains:
            print(f"Fingerprinting {domain}...")
            for protocol in ['http', 'https']:
                url = f"{protocol}://{domain}"
                tech_info = self.fingerprint_web_technology(url)
                if tech_info:
                    report['technologies'][url] = tech_info

        # Save report
        with open(output_file, 'w') as f:
            json.dump(report, f, indent=2, default=str)

        print(f"Network analysis report saved to {output_file}")
        return report

# Usage example
analyzer = NetworkAnalyzer()

# Example document URLs (would come from FOCA analysis)
document_urls = [
    "https://example.com/documents/report.pdf",
    "https://subdomain.example.com/files/presentation.pptx",
    "https://internal.example.com/docs/manual.doc"
]

# Analyze network information
analyzer.analyze_document_urls(document_urls)

# Generate comprehensive report
report = analyzer.generate_network_report("network_analysis_report.json")

print(f"Discovered {len(analyzer.discovered_domains)} domains")
print(f"Discovered {len(analyzer.discovered_ips)} IP addresses")
print(f"Identified {len(analyzer.network_ranges)} network ranges")

User and Organization Intelligence

Profiling and Analysis

# FOCA User Intelligence Features:
# 1. Author extraction from document metadata
# 2. Username discovery from file paths
# 3. Email address identification
# 4. Organizational structure mapping
# 5. User behavior analysis

# User Information Sources in FOCA:
# - Document author fields
# - Last modified by fields
# - File path usernames (C:\Users\username\)
# - Email addresses in content
# - Digital signatures
# - Revision tracking information
# - Comments and annotations
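
These sources can be mined with a couple of regular expressions as a quick first pass before the fuller analysis below; a minimal sketch over raw document text (the sample string is illustrative):

# quick_user_extract.py: first-pass extraction of emails and path usernames
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
USERPATH_RE = re.compile(r"C:\\Users\\([A-Za-z0-9._-]+)\\", re.IGNORECASE)

def extract_identities(text):
    emails = sorted(set(EMAIL_RE.findall(text)))
    users = sorted({u for u in USERPATH_RE.findall(text)
                    if u.lower() not in ("public", "default", "administrator")})
    return emails, users

sample = r"Last saved to C:\Users\jdoe\Documents\draft.docx by jdoe@example.com"
emails, users = extract_identities(sample)
print("Emails:", emails)
print("Usernames:", users)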

Advanced User Analysis

# Python script for advanced user intelligence analysis
import re
import json
from collections import defaultdict, Counter
from datetime import datetime
import networkx as nx
import matplotlib.pyplot as plt

class UserIntelligenceAnalyzer:
    def __init__(self):
        self.users = {}
        self.email_domains = defaultdict(list)
        self.organizational_structure = defaultdict(list)
        self.user_relationships = defaultdict(set)
        self.document_timeline = []

    def analyze_user_metadata(self, documents_metadata):
        """Analyze user information from document metadata"""

        for doc in documents_metadata:
            # Extract user information
            users_in_doc = set()

            # Primary author
            if doc.get('author'):
                self.add_user(doc['author'], 'author', doc)
                users_in_doc.add(doc['author'])

            # Last modified by
            if doc.get('last_modified_by'):
                self.add_user(doc['last_modified_by'], 'modifier', doc)
                users_in_doc.add(doc['last_modified_by'])

            # Users from file paths
            for path in doc.get('file_paths', []):
                username = self.extract_username_from_path(path)
                if username:
                    self.add_user(username, 'file_path', doc)
                    users_in_doc.add(username)

            # Users from revision tracking
            for user in doc.get('revision_users', []):
                self.add_user(user, 'revision', doc)
                users_in_doc.add(user)

            # Email addresses
            for email in doc.get('email_addresses', []):
                username = email.split('@')[0]
                domain = email.split('@')[1]
                self.add_user(username, 'email', doc, email)
                self.email_domains[domain].append(username)
                users_in_doc.add(username)

            # Build user relationships (users who worked on same documents)
            for user1 in users_in_doc:
                for user2 in users_in_doc:
                    if user1 != user2:
                        self.user_relationships[user1].add(user2)

            # Document timeline
            if doc.get('creation_date'):
                self.document_timeline.append({
                    'date': doc['creation_date'],
                    'document': doc['filename'],
                    'users': list(users_in_doc)
                })

    def add_user(self, username, source_type, document, email=None):
        """Add user information to the database"""

        if username not in self.users:
            self.users[username] = {
                'username': username,
                'email': email,
                'documents': [],
                'roles': set(),
                'first_seen': None,
                'last_seen': None,
                'activity_pattern': defaultdict(int)
            }

        user = self.users[username]
        user['documents'].append(document['filename'])
        user['roles'].add(source_type)

        if email and not user['email']:
            user['email'] = email

        # Update activity timeline
        if document.get('creation_date'):
            date = document['creation_date']
            if not user['first_seen'] or date < user['first_seen']:
                user['first_seen'] = date
            if not user['last_seen'] or date > user['last_seen']:
                user['last_seen'] = date

            # Activity pattern by day of week
            day_of_week = date.strftime('%A')
            user['activity_pattern'][day_of_week] += 1

    def extract_username_from_path(self, path):
        """Extract username from file path"""
        patterns = [
            r'C:\\Users\\([^\\]+)\\',
            r'/home/([^/]+)/',
            r'/Users/([^/]+)/',
            r'\\\\[^\\]+\\([^\\]+)\\',
        ]

        for pattern in patterns:
            match = re.search(pattern, path, re.IGNORECASE)
            if match:
                username = match.group(1)
                # Filter out common system accounts
                if username.lower() not in ['public', 'default', 'administrator', 'guest']:
                    return username

        return None

    def identify_organizational_structure(self):
        """Identify organizational structure from user data"""

        # Analyze email domains to identify departments/organizations
        for domain, users in self.email_domains.items():
            if len(users) > 1:
                self.organizational_structure[domain] = users

        # Analyze user collaboration patterns
        collaboration_groups = self.find_collaboration_groups()

        return {
            'email_domains': dict(self.email_domains),
            'collaboration_groups': collaboration_groups,
            'organizational_chart': self.build_organizational_chart()
        }

    def find_collaboration_groups(self):
        """Find groups of users who frequently collaborate"""

        # Build collaboration network
        G = nx.Graph()

        for user, collaborators in self.user_relationships.items():
            for collaborator in collaborators:
                if G.has_edge(user, collaborator):
                    G[user][collaborator]['weight'] += 1
                else:
                    G.add_edge(user, collaborator, weight=1)

        # Find communities/groups
        try:
            communities = nx.community.greedy_modularity_communities(G)
            return [list(community) for community in communities]
        except Exception:
            # Fallback to simple clustering
            return self.simple_clustering()

    def simple_clustering(self):
        """Simple clustering based on shared documents"""
        clusters = []
        processed_users = set()

        for user, collaborators in self.user_relationships.items():
            if user not in processed_users:
                cluster = {user}
                cluster.update(collaborators)

                # Add users who collaborate with any member of the cluster
                expanded = True
                while expanded:
                    expanded = False
                    for cluster_user in list(cluster):
                        new_collaborators = self.user_relationships[cluster_user] - cluster
                        if new_collaborators:
                            cluster.update(new_collaborators)
                            expanded = True

                clusters.append(list(cluster))
                processed_users.update(cluster)

        return clusters

    def build_organizational_chart(self):
        """Build organizational chart based on user analysis"""

        org_chart = {
            'departments': {},
            'roles': defaultdict(list),
            'hierarchy': {}
        }

        # Group by email domains (departments)
        for domain, users in self.email_domains.items():
            org_chart['departments'][domain] = {
                'users': users,
                'document_count': sum(len(self.users[user]['documents']) for user in users if user in self.users),
                'active_period': self.get_department_active_period(users)
            }

        # Identify roles based on activity patterns
        for username, user_data in self.users.items():
            role_indicators = self.analyze_user_role(user_data)
            org_chart['roles'][role_indicators['primary_role']].append(username)

        return org_chart

    def analyze_user_role(self, user_data):
        """Analyze user role based on activity patterns"""

        doc_count = len(user_data['documents'])
        roles = user_data['roles']

        # Determine primary role
        if 'author' in roles and doc_count > 5:
            primary_role = 'content_creator'
        elif 'modifier' in roles and doc_count > 10:
            primary_role = 'editor'
        elif 'revision' in roles:
            primary_role = 'reviewer'
        elif doc_count > 20:
            primary_role = 'power_user'
        else:
            primary_role = 'regular_user'

        return {
            'primary_role': primary_role,
            'document_count': doc_count,
            'activity_level': 'high' if doc_count > 10 else 'medium' if doc_count > 3 else 'low'
        }

    def get_department_active_period(self, users):
        """Get active period for a department"""
        all_dates = []

        for user in users:
            if user in self.users:
                user_data = self.users[user]
                if user_data['first_seen']:
                    all_dates.append(user_data['first_seen'])
                if user_data['last_seen']:
                    all_dates.append(user_data['last_seen'])

        if all_dates:
            return {
                'start': min(all_dates),
                'end': max(all_dates)
            }

        return None

    def generate_user_intelligence_report(self, output_file):
        """Generate comprehensive user intelligence report"""

        org_structure = self.identify_organizational_structure()

        report = {
            'summary': {
                'total_users': len(self.users),
                'email_domains': len(self.email_domains),
                'collaboration_groups': len(org_structure['collaboration_groups']),
                'total_documents': len(self.document_timeline)
            },
            'users': self.users,
            'organizational_structure': org_structure,
            'user_relationships': {k: list(v) for k, v in self.user_relationships.items()},
            'timeline': sorted(self.document_timeline, key=lambda x: x['date']),
            'insights': self.generate_insights()
        }

        # Convert datetime and set objects for JSON serialization
        def json_serial(obj):
            if isinstance(obj, datetime):
                return obj.isoformat()
            if isinstance(obj, set):
                return sorted(obj)
            raise TypeError(f"Type {type(obj)} not serializable")

        with open(output_file, 'w') as f:
            json.dump(report, f, indent=2, default=json_serial)

        print(f"User intelligence report saved to {output_file}")
        return report

    def generate_insights(self):
        """Generate actionable insights from user analysis"""

        insights = []

        # Most active users
        most_active = sorted(self.users.items(), 
                           key=lambda x: len(x[1]['documents']), 
                           reverse=True)[:5]

        insights.append({
            'type': 'most_active_users',
            'description': 'Users with highest document activity',
            'data': [(user, len(data['documents'])) for user, data in most_active]
        })

        # Largest email domains
        largest_domains = sorted(self.email_domains.items(), 
                               key=lambda x: len(x[1]), 
                               reverse=True)[:5]

        insights.append({
            'type': 'largest_departments',
            'description': 'Email domains with most users',
            'data': [(domain, len(users)) for domain, users in largest_domains]
        })

        # Users with potential security risks
        risky_users = []
        for username, user_data in self.users.items():
            if (len(user_data['documents']) > 15 and 
                'file_path' in user_data['roles'] and
                user_data['email']):
                risky_users.append(username)

        insights.append({
            'type': 'high_exposure_users',
            'description': 'Users with high document exposure and identifiable information',
            'data': risky_users
        })

        return insights

    def visualize_user_network(self, output_file='user_network.png'):
        """Create visualization of user collaboration network"""

        G = nx.Graph()

        # Add nodes and edges
        for user, collaborators in self.user_relationships.items():
            G.add_node(user)
            for collaborator in collaborators:
                G.add_edge(user, collaborator)

        # Create visualization
        plt.figure(figsize=(12, 8))
        pos = nx.spring_layout(G, k=1, iterations=50)

        # Draw network
        nx.draw_networkx_nodes(G, pos, node_color='lightblue', 
                              node_size=300, alpha=0.7)
        nx.draw_networkx_edges(G, pos, alpha=0.5)
        nx.draw_networkx_labels(G, pos, font_size=8)

        plt.title("User Collaboration Network")
        plt.axis('off')
        plt.tight_layout()
        plt.savefig(output_file, dpi=300, bbox_inches='tight')
        plt.close()

        print(f"User network visualization saved to {output_file}")

# Usage example
analyzer = UserIntelligenceAnalyzer()

# Example metadata (would come from FOCA analysis)
documents_metadata = [
    {
        'filename': 'report.pdf',
        'author': 'john.smith',
        'last_modified_by': 'jane.doe',
        'creation_date': datetime(2024, 1, 15),
        'email_addresses': ['john.smith@example.com', 'jane.doe@example.com'],
        'file_paths': ['C:\\Users\\john.smith\\Documents\\report.pdf'],
        'revision_users': ['john.smith', 'jane.doe', 'bob.wilson']
    },
    # More documents...
]

# Analyze user intelligence
analyzer.analyze_user_metadata(documents_metadata)

# Generate comprehensive report
report = analyzer.generate_user_intelligence_report("user_intelligence_report.json")

# Create network visualization
analyzer.visualize_user_network("user_collaboration_network.png")

print(f"Analyzed {len(analyzer.users)} users")
print(f"Found {len(analyzer.email_domains)} email domains")
print(f"Identified {len(analyzer.user_relationships)} user relationships")

Advanced Analysis Techniques

Temporal Analysis and Patterns

# Advanced temporal analysis for FOCA data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import numpy as np

class TemporalAnalyzer:
    def __init__(self, documents_data):
        self.df = pd.DataFrame(documents_data)
        self.df['creation_date'] = pd.to_datetime(self.df['creation_date'])
        self.df['modification_date'] = pd.to_datetime(self.df['modification_date'])

    def analyze_document_creation_patterns(self):
        """Analyze document creation patterns over time"""

        # Group by month
        monthly_creation = self.df.groupby(self.df['creation_date'].dt.to_period('M')).size()

        # Group by day of week
        dow_creation = self.df.groupby(self.df['creation_date'].dt.day_name()).size()

        # Group by hour of day
        hourly_creation = self.df.groupby(self.df['creation_date'].dt.hour).size()

        return {
            'monthly_pattern': monthly_creation.to_dict(),
            'day_of_week_pattern': dow_creation.to_dict(),
            'hourly_pattern': hourly_creation.to_dict()
        }

    def identify_work_patterns(self):
        """Identify organizational work patterns"""

        # Business hours analysis (9 AM - 5 PM)
        business_hours = self.df[
            (self.df['creation_date'].dt.hour >= 9) & 
            (self.df['creation_date'].dt.hour <= 17)
        ]

        # Weekend work
        weekend_work = self.df[
            self.df['creation_date'].dt.dayofweek.isin([5, 6])
        ]

        # After hours work
        after_hours = self.df[
            (self.df['creation_date'].dt.hour < 9) | 
            (self.df['creation_date'].dt.hour > 17)
        ]

        return {
            'business_hours_percentage': len(business_hours) / len(self.df) * 100,
            'weekend_work_percentage': len(weekend_work) / len(self.df) * 100,
            'after_hours_percentage': len(after_hours) / len(self.df) * 100,
            'peak_hours': self.df['creation_date'].dt.hour.mode().tolist()
        }

    def detect_anomalies(self):
        """Detect temporal anomalies in document creation"""

        # Daily document counts
        daily_counts = self.df.groupby(self.df['creation_date'].dt.date).size()

        # Statistical anomaly detection
        mean_count = daily_counts.mean()
        std_count = daily_counts.std()
        threshold = mean_count + 2 * std_count

        anomalous_days = daily_counts[daily_counts > threshold]

        return {
            'anomalous_days': anomalous_days.to_dict(),
            'normal_range': (mean_count - std_count, mean_count + std_count),
            'peak_activity_days': daily_counts.nlargest(5).to_dict()
        }

    def analyze_user_activity_timeline(self):
        """Analyze individual user activity timelines"""

        user_timelines = {}

        for user in self.df['author'].dropna().unique():
            user_docs = self.df[self.df['author'] == user]

            if len(user_docs) > 0:
                user_timelines[user] = {
                    'first_document': user_docs['creation_date'].min(),
                    'last_document': user_docs['creation_date'].max(),
                    'total_documents': len(user_docs),
                    'activity_span_days': (user_docs['creation_date'].max() - 
                                         user_docs['creation_date'].min()).days,
                    'average_documents_per_month': len(user_docs) / max(1, 
                        (user_docs['creation_date'].max() - 
                         user_docs['creation_date'].min()).days / 30)
                }

        return user_timelines

    def generate_temporal_visualizations(self, output_dir='temporal_analysis'):
        """Generate temporal analysis visualizations"""

        import os
        os.makedirs(output_dir, exist_ok=True)

        # 1. Monthly document creation trend
        plt.figure(figsize=(12, 6))
        monthly_data = self.df.groupby(self.df['creation_date'].dt.to_period('M')).size()
        monthly_data.plot(kind='line', marker='o')
        plt.title('Document Creation Trend Over Time')
        plt.xlabel('Month')
        plt.ylabel('Number of Documents')
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.savefig(f'{output_dir}/monthly_trend.png', dpi=300)
        plt.close()

        # 2. Day of week heatmap
        plt.figure(figsize=(10, 6))
        dow_hour = self.df.groupby([
            self.df['creation_date'].dt.day_name(),
            self.df['creation_date'].dt.hour
        ]).size().unstack(fill_value=0)

        # Reorder days
        day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
        dow_hour = dow_hour.reindex(day_order)

        sns.heatmap(dow_hour, cmap='YlOrRd', annot=False, fmt='d')
        plt.title('Document Creation Heatmap (Day of Week vs Hour)')
        plt.xlabel('Hour of Day')
        plt.ylabel('Day of Week')
        plt.tight_layout()
        plt.savefig(f'{output_dir}/activity_heatmap.png', dpi=300)
        plt.close()

        # 3. User activity timeline
        plt.figure(figsize=(14, 8))

        # Get top 10 most active users
        top_users = self.df['author'].value_counts().head(10).index

        for i, user in enumerate(top_users):
            user_docs = self.df[self.df['author'] == user]
            plt.scatter(user_docs['creation_date'], [i] * len(user_docs), 
                       alpha=0.6, s=50, label=user)

        plt.yticks(range(len(top_users)), top_users)
        plt.xlabel('Date')
        plt.ylabel('User')
        plt.title('User Activity Timeline (Top 10 Users)')
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.savefig(f'{output_dir}/user_timeline.png', dpi=300)
        plt.close()

        print(f"Temporal visualizations saved to {output_dir}/")

# Usage example (documents_data is a list of dicts with creation_date,
# modification_date, author, etc., produced by the metadata extraction step)
temporal_analyzer = TemporalAnalyzer(documents_data)

# Analyze patterns
creation_patterns = temporal_analyzer.analyze_document_creation_patterns()
work_patterns = temporal_analyzer.identify_work_patterns()
anomalies = temporal_analyzer.detect_anomalies()
user_timelines = temporal_analyzer.analyze_user_activity_timeline()

# Generate visualizations
temporal_analyzer.generate_temporal_visualizations()

print("Temporal Analysis Results:")
print(f"Peak creation hours: {work_patterns['peak_hours']}")
print(f"Business hours work: {work_patterns['business_hours_percentage']:.1f}%")
print(f"Weekend work: {work_patterns['weekend_work_percentage']:.1f}%")
print(f"Anomalous activity days: {len(anomalies['anomalous_days'])}")

Security Risk Assessment

# Security risk assessment based on FOCA findings
import re

class SecurityRiskAssessment:
    def __init__(self, foca_data):
        self.documents = foca_data['documents']
        self.users = foca_data['users']
        self.network_info = foca_data['network_info']
        self.risk_score = 0
        self.risk_factors = []

    def assess_information_disclosure_risk(self):
        """Assess risk from information disclosure in documents"""

        risk_indicators = {
            'high_risk': {
                'patterns': [
                    r'password\s*[:=]\s*\w+',
                    r'api[_-]?key\s*[:=]\s*[a-zA-Z0-9]+',
                    r'secret\s*[:=]\s*\w+',
                    r'confidential',
                    r'internal\s+use\s+only',
                    r'proprietary',
                    r'ssn\s*[:=]?\s*\d{3}-?\d{2}-?\d{4}',
                    r'credit\s+card\s*[:=]?\s*\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}'
                ],
                'score': 10
            },
            'medium_risk': {
                'patterns': [
                    r'internal',
                    r'private',
                    r'restricted',
                    r'employee\s+id\s*[:=]?\s*\d+',
                    r'phone\s*[:=]?\s*\+?\d{10,15}',
                    r'address\s*[:=]?.*\d{5}'
                ],
                'score': 5
            },
            'low_risk': {
                'patterns': [
                    r'draft',
                    r'preliminary',
                    r'work\s+in\s+progress',
                    r'todo',
                    r'fixme'
                ],
                'score': 2
            }
        }

        total_risk = 0
        findings = []

        for doc in self.documents:
            content = doc.get('content', '').lower()

            for risk_level, config in risk_indicators.items():
                for pattern in config['patterns']:
                    matches = re.findall(pattern, content, re.IGNORECASE)
                    if matches:
                        finding = {
                            'document': doc['filename'],
                            'risk_level': risk_level,
                            'pattern': pattern,
                            'matches': len(matches),
                            'score': config['score'] * len(matches)
                        }
                        findings.append(finding)
                        total_risk += finding['score']

        return {
            'total_risk_score': total_risk,
            'findings': findings,
            'risk_level': self.categorize_risk(total_risk)
        }

    def assess_metadata_exposure_risk(self):
        """Assess risk from metadata exposure"""

        risk_factors = []
        total_score = 0

        # User information exposure
        unique_users = set()
        for doc in self.documents:
            if doc.get('author'):
                unique_users.add(doc['author'])
            if doc.get('last_modified_by'):
                unique_users.add(doc['last_modified_by'])

        if len(unique_users) > 10:
            risk_factors.append({
                'factor': 'High user exposure',
                'description': f'{len(unique_users)} unique users identified',
                'score': 15
            })
            total_score += 15
        elif len(unique_users) > 5:
            risk_factors.append({
                'factor': 'Medium user exposure',
                'description': f'{len(unique_users)} unique users identified',
                'score': 8
            })
            total_score += 8

        # Email domain exposure
        email_domains = set()
        for doc in self.documents:
            for email in doc.get('email_addresses', []):
                domain = email.split('@')[1] if '@' in email else None
                if domain:
                    email_domains.add(domain)

        if len(email_domains) > 3:
            risk_factors.append({
                'factor': 'Multiple email domains exposed',
                'description': f'{len(email_domains)} email domains found',
                'score': 10
            })
            total_score += 10

        # Internal path exposure
        internal_paths = []
        for doc in self.documents:
            for path in doc.get('file_paths', []):
                if any(indicator in path.lower() for indicator in ['c:\\users\\', 'internal', 'private']):
                    internal_paths.append(path)

        if len(internal_paths) > 5:
            risk_factors.append({
                'factor': 'Internal file path exposure',
                'description': f'{len(internal_paths)} internal paths exposed',
                'score': 12
            })
            total_score += 12

        # Software version exposure
        software_versions = []
        for doc in self.documents:
            if doc.get('application'):
                software_versions.append(doc['application'])

        if len(set(software_versions)) > 5:
            risk_factors.append({
                'factor': 'Software fingerprinting risk',
                'description': f'{len(set(software_versions))} different applications identified',
                'score': 6
            })
            total_score += 6

        return {
            'total_risk_score': total_score,
            'risk_factors': risk_factors,
            'risk_level': self.categorize_risk(total_score)
        }

    def assess_network_exposure_risk(self):
        """Assess network infrastructure exposure risk"""

        risk_factors = []
        total_score = 0

        # Subdomain exposure
        subdomains = self.network_info.get('subdomains', [])
        if len(subdomains) > 20:
            risk_factors.append({
                'factor': 'High subdomain exposure',
                'description': f'{len(subdomains)} subdomains discovered',
                'score': 15
            })
            total_score += 15
        elif len(subdomains) > 10:
            risk_factors.append({
                'factor': 'Medium subdomain exposure',
                'description': f'{len(subdomains)} subdomains discovered',
                'score': 8
            })
            total_score += 8

        # Internal subdomain exposure
        internal_subdomains = [s for s in subdomains if any(
            keyword in s.lower() for keyword in ['internal', 'intranet', 'private', 'dev', 'test', 'staging']
        )]

        if len(internal_subdomains) > 0:
            risk_factors.append({
                'factor': 'Internal subdomain exposure',
                'description': f'{len(internal_subdomains)} internal subdomains found',
                'score': 20
            })
            total_score += 20

        # Technology stack exposure
        technologies = self.network_info.get('technologies', {})
        if len(technologies) > 10:
            risk_factors.append({
                'factor': 'Technology stack fingerprinting',
                'description': f'{len(technologies)} technologies identified',
                'score': 8
            })
            total_score += 8

        return {
            'total_risk_score': total_score,
            'risk_factors': risk_factors,
            'risk_level': self.categorize_risk(total_score)
        }

    def categorize_risk(self, score):
        """Categorize risk level based on score"""
        if score >= 50:
            return 'CRITICAL'
        elif score >= 30:
            return 'HIGH'
        elif score >= 15:
            return 'MEDIUM'
        elif score >= 5:
            return 'LOW'
        else:
            return 'MINIMAL'

    def generate_comprehensive_risk_assessment(self):
        """Generate comprehensive risk assessment report"""

        info_disclosure = self.assess_information_disclosure_risk()
        metadata_exposure = self.assess_metadata_exposure_risk()
        network_exposure = self.assess_network_exposure_risk()

        total_risk_score = (
            info_disclosure['total_risk_score'] +
            metadata_exposure['total_risk_score'] +
            network_exposure['total_risk_score']
        )

        assessment = {
            'overall_risk_score': total_risk_score,
            'overall_risk_level': self.categorize_risk(total_risk_score),
            'risk_categories': {
                'information_disclosure': info_disclosure,
                'metadata_exposure': metadata_exposure,
                'network_exposure': network_exposure
            },
            'recommendations': self.generate_recommendations(total_risk_score),
            'executive_summary': self.generate_executive_summary(total_risk_score)
        }

        return assessment

    def generate_recommendations(self, risk_score):
        """Generate security recommendations based on risk assessment"""

        recommendations = []

        if risk_score >= 50:
            recommendations.extend([
                "IMMEDIATE ACTION REQUIRED: Critical security risks identified",
                "Conduct emergency security review of all public documents",
                "Implement document classification and handling procedures",
                "Review and restrict access to internal systems and documents",
                "Consider taking affected systems offline until remediation"
            ])

        if risk_score >= 30:
            recommendations.extend([
                "Implement document metadata sanitization procedures",
                "Review and update information security policies",
                "Conduct security awareness training for all staff",
                "Implement data loss prevention (DLP) solutions",
                "Regular security audits of public-facing documents"
            ])

        if risk_score >= 15:
            recommendations.extend([
                "Establish document review process before publication",
                "Implement metadata removal tools and procedures",
                "Review subdomain and network exposure",
                "Update security awareness training materials",
                "Consider implementing document watermarking"
            ])

        recommendations.extend([
            "Regular OSINT assessments of organizational exposure",
            "Monitor for new document publications and leaks",
            "Implement automated metadata scanning tools",
            "Establish incident response procedures for information disclosure",
            "Regular review of public-facing digital assets"
        ])

        return recommendations

    def generate_executive_summary(self, risk_score):
        """Generate executive summary of risk assessment"""

        risk_level = self.categorize_risk(risk_score)

        if risk_score >= 50:
            recommendation = "Immediate remediation required"
        elif risk_score >= 30:
            recommendation = "Priority security review needed"
        elif risk_score >= 15:
            recommendation = "Include in next security review cycle"
        else:
            recommendation = "Monitor and maintain current security measures"

        summary = f"""
        EXECUTIVE SUMMARY - FOCA Security Risk Assessment

        Overall Risk Level: {risk_level}
        Risk Score: {risk_score}/100

        This assessment analyzed {len(self.documents)} documents and associated metadata
        to identify potential security risks from information disclosure.

        Key Findings:
        - {len(set(doc.get('author', '') for doc in self.documents if doc.get('author')))} unique users identified
        - {len(self.network_info.get('subdomains', []))} subdomains discovered
        - {len(self.network_info.get('technologies', {}))} technologies fingerprinted

        Risk Level Interpretation:
        - CRITICAL (50+): Immediate action required, significant security exposure
        - HIGH (30-49): High priority remediation needed within 30 days
        - MEDIUM (15-29): Moderate risk, address within 90 days
        - LOW (5-14): Low risk, include in regular security review cycle
        - MINIMAL (0-4): Minimal risk, maintain current security posture

        Recommendation: {recommendation}
        """

        return summary.strip()

# Usage example
foca_data = {
    'documents': documents_metadata,  # From previous examples
    'users': user_data,              # From user intelligence analysis
    'network_info': network_analysis  # From network analysis
}

risk_assessor = SecurityRiskAssessment(foca_data)
comprehensive_assessment = risk_assessor.generate_comprehensive_risk_assessment()

print("Security Risk Assessment Results:")
print(f"Overall Risk Level: {comprehensive_assessment['overall_risk_level']}")
print(f"Risk Score: {comprehensive_assessment['overall_risk_score']}")
print("\nTop Recommendations:")
for i, rec in enumerate(comprehensive_assessment['recommendations'][:5], 1):
    print(f"{i}. {rec}")
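
To keep a record of the assessment alongside the console output, the full results can also be written to disk and the executive summary printed. A short, hedged addition (the output filename is illustrative):

# Persist the assessment and show the executive summary (illustrative path)
import json

with open("risk_assessment.json", "w") as f:
    json.dump(comprehensive_assessment, f, indent=2)

print("\nExecutive Summary:\n")
print(comprehensive_assessment['executive_summary'])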

Integration and Automation

FOCA API and Automation

# FOCA automation and integration framework
import os
import subprocess
import json
import time
from pathlib import Path
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class FOCAAutomation:
    def __init__(self, foca_path, workspace_dir):
        self.foca_path = foca_path
        self.workspace_dir = Path(workspace_dir)
        self.workspace_dir.mkdir(exist_ok=True)

    def create_automated_project(self, target_domain, project_name):
        """Create and configure FOCA project automatically"""

        project_config = {
            'name': project_name,
            'target': target_domain,
            'search_engines': ['google', 'bing'],
            'file_types': ['pdf', 'doc', 'docx', 'xls', 'xlsx', 'ppt', 'pptx'],
            'max_results': 100,
            'download_files': True,
            'analyze_metadata': True
        }

        # Save project configuration
        config_file = self.workspace_dir / f"{project_name}_config.json"
        with open(config_file, 'w') as f:
            json.dump(project_config, f, indent=2)

        return config_file

    def automated_document_discovery(self, target_domain, output_file):
        """Automated document discovery using multiple methods"""

        discovered_urls = set()

        # Method 1: Google dorking
        google_urls = self.google_dork_search(target_domain)
        discovered_urls.update(google_urls)

        # Method 2: Bing search (bing_search is sketched after google_dork_search below)
        bing_urls = self.bing_search(target_domain)
        discovered_urls.update(bing_urls)

        # Method 3: Site crawling
        crawled_urls = self.crawl_site_for_documents(target_domain)
        discovered_urls.update(crawled_urls)

        # Method 4: Certificate transparency logs
        ct_urls = self.search_certificate_transparency(target_domain)
        discovered_urls.update(ct_urls)

        # Save results
        with open(output_file, 'w') as f:
            for url in sorted(discovered_urls):
                f.write(f"{url}\n")

        return list(discovered_urls)

    def google_dork_search(self, domain):
        """Perform Google dorking for document discovery"""

        file_types = ['pdf', 'doc', 'docx', 'xls', 'xlsx', 'ppt', 'pptx']
        discovered_urls = set()

        for file_type in file_types:
            query = f"site:{domain} filetype:{file_type}"

            # Use requests with proper headers to avoid blocking
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            }

            try:
                # Note: In practice, use the Google Custom Search API
                # (see the hedged Custom Search sketch after the usage example below);
                # this scraping approach is a simplified placeholder
                search_url = f"https://www.google.com/search?q={query}"
                response = requests.get(search_url, headers=headers)

                # Parse results (simplified - use proper HTML parsing)
                # Extract URLs from search results
                # Add to discovered_urls set

                time.sleep(2)  # Rate limiting

            except Exception as e:
                print(f"Error searching for {file_type} files: {e}")

        return discovered_urls
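
    def bing_search(self, domain):
        """Bing-based document discovery (hedged sketch).

        NOTE: bing_search() is called by automated_document_discovery() but was
        not defined in the original example. This minimal sketch assumes access
        to the Bing Web Search API v7; the endpoint, the Ocp-Apim-Subscription-Key
        header and the BING_API_KEY environment variable are assumptions, not
        part of FOCA itself.
        """

        discovered_urls = set()
        api_key = os.environ.get('BING_API_KEY', '')  # assumed environment variable
        if not api_key:
            return discovered_urls  # skip quietly if no key is configured

        headers = {'Ocp-Apim-Subscription-Key': api_key}

        for file_type in ['pdf', 'doc', 'docx', 'xls', 'xlsx', 'ppt', 'pptx']:
            params = {'q': f'site:{domain} filetype:{file_type}', 'count': 50}
            try:
                response = requests.get(
                    'https://api.bing.microsoft.com/v7.0/search',
                    headers=headers, params=params, timeout=30
                )
                if response.status_code == 200:
                    for result in response.json().get('webPages', {}).get('value', []):
                        if result.get('url'):
                            discovered_urls.add(result['url'])
                time.sleep(1)  # rate limiting
            except Exception as e:
                print(f"Bing search error for {file_type}: {e}")

        return discovered_urls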

    def crawl_site_for_documents(self, domain):
        """Crawl website for document links"""

        discovered_urls = set()

        try:
            # Use Selenium for JavaScript-heavy sites
            options = webdriver.ChromeOptions()
            options.add_argument('--headless')
            options.add_argument('--no-sandbox')
            options.add_argument('--disable-dev-shm-usage')

            driver = webdriver.Chrome(options=options)

            # Start from main domain
            driver.get(f"https://{domain}")

            # Find all links
            links = driver.find_elements(By.TAG_NAME, "a")

            for link in links:
                href = link.get_attribute('href')
                if href and any(ext in href.lower() for ext in ['.pdf', '.doc', '.xls', '.ppt']):
                    discovered_urls.add(href)

            # Also check common document directories
            common_paths = ['/documents/', '/files/', '/downloads/', '/resources/', '/docs/']

            for path in common_paths:
                try:
                    driver.get(f"https://{domain}{path}")
                    links = driver.find_elements(By.TAG_NAME, "a")

                    for link in links:
                        href = link.get_attribute('href')
                        if href and any(ext in href.lower() for ext in ['.pdf', '.doc', '.xls', '.ppt']):
                            discovered_urls.add(href)

                except Exception:
                    continue

            driver.quit()

        except Exception as e:
            print(f"Error crawling {domain}: {e}")

        return discovered_urls

    def search_certificate_transparency(self, domain):
        """Search certificate transparency logs for subdomains"""

        discovered_urls = set()

        try:
            # Query crt.sh for certificate transparency data
            ct_url = f"https://crt.sh/?q=%.{domain}&output=json"
            response = requests.get(ct_url, timeout=30)

            if response.status_code == 200:
                certificates = response.json()

                subdomains = set()
                for cert in certificates:
                    name_value = cert.get('name_value', '')
                    for name in name_value.split('\n'):
                        if domain in name and not name.startswith('*'):
                            subdomains.add(name.strip())

                # Check each subdomain for documents
                for subdomain in subdomains:
                    try:
                        # Quick check for common document paths
                        for path in ['/documents/', '/files/', '/downloads/']:
                            test_url = f"https://{subdomain}{path}"
                            response = requests.head(test_url, timeout=5)
                            if response.status_code == 200:
                                discovered_urls.add(test_url)
                    except requests.RequestException:
                        continue

        except Exception as e:
            print(f"Error searching certificate transparency: {e}")

        return discovered_urls

    def bulk_download_documents(self, url_list, download_dir):
        """Bulk download documents from URL list"""

        download_path = Path(download_dir)
        download_path.mkdir(exist_ok=True)

        downloaded_files = []

        for url in url_list:
            try:
                response = requests.get(url, timeout=30, stream=True)

                if response.status_code == 200:
                    # Extract filename from URL or Content-Disposition header
                    filename = self.extract_filename(url, response.headers)
                    file_path = download_path / filename

                    with open(file_path, 'wb') as f:
                        for chunk in response.iter_content(chunk_size=8192):
                            f.write(chunk)

                    downloaded_files.append(str(file_path))
                    print(f"Downloaded: {filename}")

                time.sleep(1)  # Rate limiting

            except Exception as e:
                print(f"Error downloading {url}: {e}")

        return downloaded_files

    def extract_filename(self, url, headers):
        """Extract filename from URL or headers"""

        # Try Content-Disposition header first
        content_disposition = headers.get('Content-Disposition', '')
        if 'filename=' in content_disposition:
            filename = content_disposition.split('filename=')[1].strip('"')
            return filename

        # Extract from URL
        filename = url.split('/')[-1]
        if '?' in filename:
            filename = filename.split('?')[0]

        # Ensure valid filename
        if not filename or '.' not in filename:
            filename = f"document_{hash(url)}.pdf"

        return filename

    def automated_metadata_analysis(self, file_list):
        """Perform automated metadata analysis on downloaded files"""

        analysis_results = []

        for file_path in file_list:
            try:
                # Use appropriate metadata extractor based on file type
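                # NOTE: extract_pdf_metadata(), extract_office_metadata() and
                # extract_excel_metadata() are assumed to be the extraction
                # helpers shown in the earlier metadata-analysis examples;
                # they are not defined in this class.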
                if file_path.lower().endswith('.pdf'):
                    metadata = self.extract_pdf_metadata(file_path)
                elif file_path.lower().endswith(('.doc', '.docx')):
                    metadata = self.extract_office_metadata(file_path)
                elif file_path.lower().endswith(('.xls', '.xlsx')):
                    metadata = self.extract_excel_metadata(file_path)
                else:
                    continue

                analysis_results.append({
                    'file_path': file_path,
                    'metadata': metadata
                })

            except Exception as e:
                print(f"Error analyzing {file_path}: {e}")

        return analysis_results

    def generate_automated_report(self, analysis_results, output_file):
        """Generate automated FOCA analysis report"""

        report = {
            'timestamp': time.strftime('%Y-%m-%d %H:%M:%S'),
            'total_files_analyzed': len(analysis_results),
            'summary': self.generate_analysis_summary(analysis_results),
            'detailed_results': analysis_results,
            'risk_assessment': self.perform_automated_risk_assessment(analysis_results),
            'recommendations': self.generate_automated_recommendations(analysis_results)
        }
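
        # NOTE: generate_analysis_summary(), perform_automated_risk_assessment()
        # and generate_automated_recommendations() are assumed reporting helpers
        # (e.g. thin wrappers around the SecurityRiskAssessment class above);
        # they are not defined in this snippet.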

        with open(output_file, 'w') as f:
            json.dump(report, f, indent=2, default=str)

        return report

    def run_full_automated_analysis(self, target_domain, project_name):
        """Run complete automated FOCA analysis"""

        print(f"Starting automated FOCA analysis for {target_domain}")

        # Step 1: Create project
        config_file = self.create_automated_project(target_domain, project_name)
        print(f"Created project configuration: {config_file}")

        # Step 2: Document discovery
        urls_file = self.workspace_dir / f"{project_name}_urls.txt"
        discovered_urls = self.automated_document_discovery(target_domain, urls_file)
        print(f"Discovered {len(discovered_urls)} document URLs")

        # Step 3: Download documents
        download_dir = self.workspace_dir / f"{project_name}_downloads"
        downloaded_files = self.bulk_download_documents(discovered_urls, download_dir)
        print(f"Downloaded {len(downloaded_files)} documents")

        # Step 4: Metadata analysis
        analysis_results = self.automated_metadata_analysis(downloaded_files)
        print(f"Analyzed {len(analysis_results)} documents")

        # Step 5: Generate report
        report_file = self.workspace_dir / f"{project_name}_report.json"
        report = self.generate_automated_report(analysis_results, report_file)
        print(f"Generated analysis report: {report_file}")

        return report

# Usage example
foca_automation = FOCAAutomation(
    foca_path="C:\\FOCA\\FOCA.exe",
    workspace_dir="C:\\FOCA_Automation"
)

# Run full automated analysis
report = foca_automation.run_full_automated_analysis("example.com", "example_analysis")

print("Automated FOCA Analysis Complete!")
print(f"Total files analyzed: {report['total_files_analyzed']}")
print(f"Risk level: {report['risk_assessment']['overall_risk_level']}")
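
The google_dork_search() method above intentionally leaves result parsing out and returns an empty set as written. A minimal, hedged sketch using the Google Custom Search JSON API is shown below; the GOOGLE_API_KEY and GOOGLE_CSE_ID environment variables are illustrative assumptions (the key and search engine ID are created as described in the configuration section), and free-tier query quotas still apply.

# Hedged sketch: document discovery via the Google Custom Search JSON API
import os
import time
import requests

def cse_document_search(domain, api_key=None, cse_id=None,
                        file_types=('pdf', 'doc', 'docx', 'xls', 'xlsx')):
    """Return a set of document URLs for a domain using Google Custom Search."""
    api_key = api_key or os.environ.get('GOOGLE_API_KEY')  # assumed env var
    cse_id = cse_id or os.environ.get('GOOGLE_CSE_ID')     # assumed env var
    urls = set()

    for file_type in file_types:
        params = {
            'key': api_key,
            'cx': cse_id,
            'q': f'site:{domain} filetype:{file_type}',
            'num': 10,  # API maximum per request
        }
        try:
            resp = requests.get('https://www.googleapis.com/customsearch/v1',
                                params=params, timeout=30)
            if resp.status_code == 200:
                for item in resp.json().get('items', []):
                    urls.add(item['link'])
            time.sleep(1)  # stay within rate limits
        except Exception as e:
            print(f"CSE error for {file_type}: {e}")

    return urls

# Example: feed the results into the bulk downloader above
# urls = cse_document_search("example.com")
# foca_automation.bulk_download_documents(urls, "C:\\FOCA_Automation\\cse_downloads")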

Best Practices and Optimization

Performance Optimization

# FOCA Performance Optimization Tips:

# 1. Configure appropriate timeouts
# Options > Configuration > Network
# - Connection timeout: 30 seconds
# - Download timeout: 60 seconds
# - Maximum file size: 50MB

# 2. Optimize search settings
# - Limit search results per engine: 100
# - Use specific file type filters
# - Exclude common non-target file types

# 3. Parallel processing (see the ThreadPoolExecutor sketch below)
# - Enable multiple concurrent downloads: 5-10
# - Use multiple search engines simultaneously
# - Process different file types in parallel

# 4. Storage optimization
# - Use SSD storage for temporary files
# - Regular cleanup of downloaded files
# - Compress analysis results

# 5. Memory management
# - Close unnecessary applications
# - Increase virtual memory if needed
# - Monitor memory usage during large analyses
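
These recommendations also apply to scripted collection outside the FOCA GUI. The following is a minimal, hedged sketch of tip 1 (timeouts and the 50 MB cap) and tip 3 (parallel downloads) using Python's ThreadPoolExecutor; the worker count, limits and paths are illustrative values mirroring the settings above, not FOCA internals.

# Hedged sketch: concurrent, size-capped downloads mirroring the settings above
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
import requests

MAX_FILE_SIZE = 50 * 1024 * 1024   # 50 MB, as recommended above
MAX_WORKERS = 5                    # 5-10 concurrent downloads

def download_one(url, dest_dir):
    """Download a single document, enforcing timeout and size limits."""
    name = url.split('/')[-1].split('?')[0] or 'document.bin'
    dest = Path(dest_dir) / name
    with requests.get(url, timeout=(30, 60), stream=True) as resp:  # connect/read timeouts
        resp.raise_for_status()
        size = 0
        with open(dest, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=8192):
                size += len(chunk)
                if size > MAX_FILE_SIZE:
                    raise ValueError(f"{url} exceeds the 50 MB limit")
                f.write(chunk)
    return str(dest)

def download_all(urls, dest_dir):
    """Download documents in parallel and report failures without stopping."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    results = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(download_one, u, dest_dir): u for u in urls}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as e:
                print(f"Failed: {futures[future]} ({e})")
    return results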

Legal and Ethical Considerations

# Legal and Ethical Guidelines for FOCA Usage:

# 1. Authorization Requirements:
# - Only analyze publicly available documents
# - Obtain written permission for internal assessments
# - Respect robots.txt and terms of service
# - Follow applicable laws and regulations

# 2. Data Handling (see the encryption sketch below):
# - Secure storage of downloaded documents
# - Proper disposal of sensitive information
# - Encryption of analysis results
# - Limited retention periods

# 3. Responsible Disclosure:
# - Report findings to appropriate parties
# - Allow reasonable time for remediation
# - Follow coordinated disclosure practices
# - Document all activities and findings

# 4. Privacy Considerations:
# - Minimize collection of personal information
# - Anonymize data when possible
# - Respect individual privacy rights
# - Comply with data protection regulations
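
For the data-handling points above (secure storage and encryption of analysis results), the following is a minimal, hedged sketch using the Fernet recipe from the third-party cryptography package; the key-file location and report path are illustrative, and in practice the key should be stored separately from the encrypted reports.

# Hedged sketch: encrypt an analysis report at rest with Fernet (cryptography package)
from pathlib import Path
from cryptography.fernet import Fernet

def encrypt_report(report_path, key_path="foca_report.key"):
    """Encrypt a report file in place and keep the key in a separate file."""
    key_file = Path(key_path)
    if key_file.exists():
        key = key_file.read_bytes()
    else:
        key = Fernet.generate_key()
        key_file.write_bytes(key)  # store the key outside the report directory

    report = Path(report_path)
    token = Fernet(key).encrypt(report.read_bytes())
    encrypted = report.with_name(report.name + ".enc")
    encrypted.write_bytes(token)
    report.unlink()  # remove the plaintext copy
    return str(encrypted)

def decrypt_report(encrypted_path, key_path="foca_report.key"):
    """Return the decrypted report contents as bytes."""
    key = Path(key_path).read_bytes()
    return Fernet(key).decrypt(Path(encrypted_path).read_bytes())

# Example:
# encrypt_report("C:\\FOCA_Automation\\example_analysis_report.json")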

Common Issues

# Common FOCA Issues and Solutions:

# Issue 1: Search engines not returning results
# Solution:
# - Verify API keys are configured correctly
# - Check internet connectivity
# - Verify search engine quotas
# - Try alternative search engines

# Issue 2: Documents not downloading (see the diagnostic sketch below)
# Solution:
# - Check file size limits
# - Verify download directory permissions
# - Test individual URLs manually
# - Check for anti-bot protection

# Issue 3: Metadata extraction failures
# Solution:
# - Verify file integrity
# - Check file format compatibility
# - Update Microsoft Office components
# - Try alternative extraction tools

# Issue 4: Performance issues
# Solution:
# - Reduce concurrent operations
# - Increase system memory
# - Use faster storage (SSD)
# - Close unnecessary applications

# Issue 5: False positive results
# Solution:
# - Verify document authenticity
# - Cross-reference with multiple sources
# - Manual verification of findings
# - Update analysis rules and filters
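
For Issue 2 (documents not downloading), a quick scripted check helps separate permission errors, size limits and anti-bot protection before adjusting FOCA settings. The sketch below is illustrative; the browser User-Agent string and the 50 MB threshold mirror the configuration used elsewhere in this cheat sheet.

# Hedged sketch: quick diagnostic for URLs that fail to download
import requests

BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")
MAX_FILE_SIZE = 50 * 1024 * 1024  # mirror the configured download limit

def diagnose_url(url):
    """Print likely reasons why a document URL could not be downloaded."""
    try:
        resp = requests.head(url, headers={'User-Agent': BROWSER_UA},
                             timeout=15, allow_redirects=True)
    except requests.RequestException as e:
        print(f"[!] Network error: {e}")
        return

    print(f"Status: {resp.status_code}  Final URL: {resp.url}")

    if resp.status_code in (401, 403):
        print("[!] Access denied - authentication or anti-bot protection likely")
    elif resp.status_code == 429:
        print("[!] Rate limited - slow down requests")

    length = resp.headers.get('Content-Length')
    if length and int(length) > MAX_FILE_SIZE:
        print(f"[!] File is {int(length) / 1024 / 1024:.1f} MB - above the configured size limit")

    if 'text/html' in resp.headers.get('Content-Type', ''):
        print("[!] Server returned HTML - URL may redirect to a login or error page")

# Example:
# diagnose_url("https://example.com/files/report.pdf")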

Resources

Documentation and Training

Related Tools and Resources

Legal and Compliance Resources

Training and Certification