Regular Expressions (RegEx) - Pattern Matching

*Comprehensive guide to regular expressions for pattern matching and text processing*

Regular expressions (regex) are powerful pattern-matching tools used in programming languages, text editors, and command-line tools. This guide covers regex syntax, common patterns, and practical examples for effective text processing.

Basic Syntax

Literal Characters

# Exact character matching
hello           # Matches "hello" exactly
123             # Matches "123" exactly
Hello World     # Matches "Hello World" exactly

# Case sensitivity (depends on flags)
Hello           # Matches "Hello" but not "hello" (case sensitive)
(?i)Hello       # Matches "Hello", "hello", "HELLO" (case insensitive)

Metacharacters

# Special characters with meaning
.               # Matches any single character except newline
^               # Matches start of string/line
$               # Matches end of string/line
*               # Matches 0 or more of preceding element
+               # Matches 1 or more of preceding element
?               # Matches 0 or 1 of preceding element
|               # OR operator (alternation)
()              # Grouping
[]              # Character class
{}              # Quantifiers
\               # Escape character

Character Classes

# Predefined character classes
\d              # Any digit (0-9)
\D              # Any non-digit
\w              # Any word character (a-z, A-Z, 0-9, _)
\W              # Any non-word character
\s              # Any whitespace character (space, tab, newline)
\S              # Any non-whitespace character
\n              # Newline character
\t              # Tab character
\r              # Carriage return

# Custom character classes
[abc]           # Matches 'a', 'b', or 'c'
[a-z]           # Matches any lowercase letter
[A-Z]           # Matches any uppercase letter
[0-9]           # Matches any digit
[a-zA-Z]        # Matches any letter
[a-zA-Z0-9]     # Matches any alphanumeric character
[^abc]          # Matches any character except 'a', 'b', or 'c'
[^0-9]          # Matches any non-digit
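
As a rough illustration of the classes listed above, the following Python sketch runs a few of them through `re.findall`. The sample string and variable names are made up for the example.

```python
import re

# Hypothetical sample text, chosen only to exercise the classes above
sample = "Order #A-42 shipped to Bob_7 on 2024-01-15"

digits = re.findall(r"\d+", sample)                  # runs of digits
words = re.findall(r"\w+", sample)                   # runs of word characters (note: '_' counts)
uppercase = re.findall(r"[A-Z]", sample)             # single uppercase letters
symbols = re.findall(r"[^a-zA-Z0-9\s]", sample)      # negated class: not alphanumeric, not whitespace

print(digits)     # ['42', '7', '2024', '01', '15']
print(words)      # ['Order', 'A', '42', 'shipped', 'to', 'Bob_7', 'on', '2024', '01', '15']
print(uppercase)  # ['O', 'A', 'B']
print(symbols)    # ['#', '-', '_', '-', '-']
```

Note that `\w` treats the underscore as a word character, which is why `Bob_7` stays in one piece while `[^a-zA-Z0-9\s]` still reports `_` as a symbol.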

Quantifiers

Basic Quantifiers

# Exact repetition
a{3}            # Matches exactly 3 'a's: "aaa"
a{2,5}          # Matches 2 to 5 'a's: "aa", "aaa", "aaaa", "aaaaa"
a{3,}           # Matches 3 or more 'a's: "aaa", "aaaa", etc.
a{,3}           # Matches 0 to 3 'a's in Python; many other engines need a{0,3}

# Common quantifiers
a*              # Matches 0 or more 'a's (equivalent to a{0,})
a+              # Matches 1 or more 'a's (equivalent to a{1,})
a?              # Matches 0 or 1 'a' (equivalent to a{0,1})
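
A minimal Python sketch of a bounded quantifier, assuming whole-string matching via `re.fullmatch`; the candidate strings are illustrative only.

```python
import re

# {2,5} accepts between two and five repetitions of the preceding element
bounded = re.compile(r"a{2,5}")

candidates = ["a", "aa", "aaaaa", "aaaaaa"]
results = {s: bounded.fullmatch(s) is not None for s in candidates}

print(results)
# {'a': False, 'aa': True, 'aaaaa': True, 'aaaaaa': False}
```

`fullmatch` is used here so that the length limits apply to the entire string; with `re.search`, `"aaaaaa"` would still contain a five-character match.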

Greedy vs Lazy Quantifiers

# Greedy (default) - matches as much as possible
.*              # Matches as many characters as possible
.+              # Matches as many characters as possible (at least 1)
.{2,5}          # Matches as many characters as possible (2-5)

# Lazy (non-greedy) - matches as little as possible
.*?             # Matches as few characters as possible
.+?             # Matches as few characters as possible (at least 1)
.{2,5}?         # Matches as few characters as possible (2-5)

# Example difference
String: "Hello World"
H.*o            # Greedy: matches "Hello Wo" (to last 'o')
H.*?o           # Lazy: matches "Hello" (to first 'o')
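
The greedy/lazy difference can be checked directly in Python. The second half of this sketch uses a made-up HTML-like string, where the difference tends to matter most in practice.

```python
import re

text = "Hello World"
greedy = re.search(r"H.*o", text).group()    # expands to the last 'o'
lazy = re.search(r"H.*?o", text).group()     # stops at the first 'o'

print(greedy)  # Hello Wo
print(lazy)    # Hello

# Hypothetical markup sample: greedy swallows everything between the outermost tags
html = "<b>one</b> and <b>two</b>"
greedy_tags = re.findall(r"<b>.*</b>", html)
lazy_tags = re.findall(r"<b>.*?</b>", html)

print(greedy_tags)  # ['<b>one</b> and <b>two</b>']
print(lazy_tags)    # ['<b>one</b>', '<b>two</b>']
```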

Anchors and Boundaries

Position Anchors

# String/line boundaries
^               # Start of string or line
$               # End of string or line
\A              # Start of string (not line)
\Z              # End of string, or before a final newline (in Python: very end only)
\z              # Very end of string (not available in every engine)

# Examples
^Hello          # Matches "Hello" at start of line
World$          # Matches "World" at end of line
^Hello World$   # Matches entire line containing only "Hello World"

Word Boundaries

# Word boundaries
\b              # Word boundary
\B              # Non-word boundary

# Examples
\bcat\b         # Matches "cat" as whole word, not in "category"
\Bcat\B         # Matches "cat" only when not at word boundaries
\bcat           # Matches "cat" at start of word: "cat", "category"
cat\b           # Matches "cat" at end of word: "cat", "tomcat"
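
To see how the boundary variants behave, this Python sketch applies them to one made-up sentence. The `\w*` padding is an addition for the example so that the whole surrounding word is returned rather than just `cat`.

```python
import re

# Hypothetical sample containing "cat" in different positions within words
text = "cat category tomcat concatenate"

whole = re.findall(r"\bcat\b", text)          # "cat" as a standalone word
starts = re.findall(r"\bcat\w*", text)        # words beginning with "cat"
ends = re.findall(r"\w*cat\b", text)          # words ending with "cat"
inner = re.findall(r"\w*\Bcat\B\w*", text)    # words with "cat" strictly inside

print(whole)   # ['cat']
print(starts)  # ['cat', 'category']
print(ends)    # ['cat', 'tomcat']
print(inner)   # ['concatenate']
```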

Groups and Capturing

Basic Grouping

# Grouping with parentheses
(abc)           # Groups "abc" together
(abc)+          # Matches one or more "abc" sequences
(abc|def)       # Matches either "abc" or "def"
(abc){2,4}      # Matches 2 to 4 "abc" sequences

# Non-capturing groups
(?:abc)         # Groups without capturing
(?:abc|def)+    # Matches sequences of "abc" or "def"

Capturing Groups

# Numbered captures
(abc)(def)      # Capture group 1: "abc", group 2: "def"
(\d{4})-(\d{2})-(\d{2})  # Captures date parts: year, month, day

# Named captures
(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})  # Named groups
(?P<name>\w+)   # Python-style named group

# Backreferences
(\w+)\s+\1      # Matches repeated words: "hello hello"
(["'])(.*?)\1   # Matches quoted strings with same quote type
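
A short Python sketch of numbered groups, named groups, and backreferences. It uses Python's `(?P<name>...)` spelling for named groups; the input strings are invented for illustration.

```python
import re

# Named groups: Python requires the (?P<name>...) form
date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
match = date_re.search("Released on 2024-01-15.")

print(match.groups())      # ('2024', '01', '15')
print(match.groupdict())   # {'year': '2024', 'month': '01', 'day': '15'}
print(match.group("year")) # 2024

# Backreference \1 must repeat exactly what group 1 captured
repeated = re.findall(r"\b(\w+)\s+\1\b", "this is is a test test")
print(repeated)            # ['is', 'test']

# The closing quote must be the same character as the opening quote
quote_re = re.compile(r"""(["'])(.*?)\1""")
quoted = [q.group(2) for q in quote_re.finditer("""say "hi" or 'bye' now""")]
print(quoted)              # ['hi', 'bye']
```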

Lookahead and Lookbehind

Lookahead Assertions

# Positive lookahead
\d+(?=\s*dollars)    # Matches digits followed by "dollars"
\w+(?=@)             # Matches username before @ in email

# Negative lookahead
\d+\b(?!\s*cents)    # Matches numbers NOT followed by "cents" (\b stops backtracking to a shorter number)
\b\w+\b(?!@)         # Matches whole words NOT followed by @

Lookbehind Assertions

# Positive lookbehind
(?<=\$)\d+           # Matches digits preceded by $
(?<=@)\w+            # Matches the first domain label after @ in email

# Negative lookbehind
(?<![$\d])\d+        # Matches numbers NOT preceded by $ (\d keeps the match from starting mid-number)
(?<![@\w])\w+        # Matches whole words NOT preceded by @
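
Lookarounds check context without consuming it. The Python sketch below, using invented price and email strings, also shows a common surprise: a bare negative lookahead after `\d+` can backtrack to a shorter number, and adding `\b` after the digits is one way to prevent that.

```python
import re

# Hypothetical sample text
prices = "Total: $100 plus 25 cents, or 30 dollars"

after_dollar = re.findall(r"(?<=\$)\d+", prices)           # digits right after '$'
before_dollars = re.findall(r"\d+(?=\s*dollars)", prices)  # digits right before "dollars"
print(after_dollar)    # ['100']
print(before_dollars)  # ['30']

# Pitfall: \d+ gives back a digit so the lookahead no longer sees "cents"
naive = re.findall(r"\d+(?!\s*cents)", "25 cents")
print(naive)           # ['2']

# With \b the number cannot be shortened, so "25" is rejected as intended
not_cents = re.findall(r"\d+\b(?!\s*cents)", prices)
print(not_cents)       # ['100', '30']

email = "alice@example.com"
user = re.search(r"\w+(?=@)", email).group()
domain = re.search(r"(?<=@)[\w.]+", email).group()
print(user, domain)    # alice example.com
```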

Common Patterns

Email Validation

# Basic email pattern
\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b

# More comprehensive email
^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$

# Simple email validation
^\S+@\S+\.\S+$
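
One possible way to wrap the basic email pattern in Python is sketched below. The helper name `is_plausible_email` is hypothetical, and the check is only a format test: it says nothing about whether the address actually exists.

```python
import re

# Basic email pattern; fullmatch() makes explicit ^...$ anchors unnecessary
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def is_plausible_email(value: str) -> bool:
    """Return True if value has the general shape local@domain.tld."""
    return EMAIL_RE.fullmatch(value) is not None

print(is_plausible_email("user@example.com"))                # True
print(is_plausible_email("first.last+tag@sub.example.org"))  # True
print(is_plausible_email("no-at-sign.example.com"))          # False
print(is_plausible_email("user@localhost"))                  # False (no dot-separated TLD)
```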

Phone Numbers

# US phone numbers
\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}    # (123) 456-7890 or 123-456-7890
^\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$  # With optional country code

# International format
^\+?[1-9]\d{1,14}$                     # E.164 format
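
Capture groups make it straightforward to normalize a matched number. This Python sketch adds groups to the US pattern; the function name `normalize_us_phone` and the dash-separated output format are assumptions made for the example.

```python
import re
from typing import Optional

# US pattern with capture groups around the three digit blocks
US_PHONE_RE = re.compile(r"\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})")

def normalize_us_phone(raw: str) -> Optional[str]:
    """Return the number as 123-456-7890, or None if the shape does not match."""
    match = US_PHONE_RE.fullmatch(raw.strip())
    if match is None:
        return None
    return "-".join(match.groups())

print(normalize_us_phone("(123) 456-7890"))  # 123-456-7890
print(normalize_us_phone("123.456.7890"))    # 123-456-7890
print(normalize_us_phone("1234567890"))      # 123-456-7890
print(normalize_us_phone("12-3456-7890"))    # None
```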

URLs

# Basic URL pattern
https?://[^\s]+

# More comprehensive URL
^https?://(?:[-\w.])+(?:\:[0-9]+)?(?:/(?:[\w/_.])*(?:\?(?:[\w&=%.])*)?(?:\#(?:[\w.])*)?)?$

# Domain validation
^(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}$

Dates

# MM/DD/YYYY format
^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}$

# YYYY-MM-DD format (ISO 8601)
^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$

# Flexible date formats
\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b
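
The date patterns check shape and ranges but cannot tell that February has no 30th day. A possible approach, sketched here in Python, is to let the regex filter the format and then hand the captured parts to `datetime.date` for calendar validation. The helper name `parse_iso_date` is hypothetical.

```python
import re
from datetime import date
from typing import Optional

# YYYY-MM-DD pattern with capture groups for each part
ISO_DATE_RE = re.compile(r"(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")

def parse_iso_date(value: str) -> Optional[date]:
    """Return a date for a well-formed, real calendar day; otherwise None."""
    match = ISO_DATE_RE.fullmatch(value)
    if match is None:
        return None                      # wrong shape or out-of-range month/day
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)    # rejects days the regex cannot rule out
    except ValueError:
        return None

print(parse_iso_date("2024-02-29"))  # 2024-02-29 (leap year)
print(parse_iso_date("2023-02-30"))  # None (passes the regex, fails the calendar)
print(parse_iso_date("2024-13-01"))  # None (fails the regex)
```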

Credit Card Numbers

# Visa (starts with 4, 13 or 16 digits)
^4\d{12}(?:\d{3})?$

# MasterCard (starts with 51-55, 16 digits)
^5[1-5]\d{14}$

# American Express (starts with 34 or 37, 15 digits)
^3[47]\d{13}$

# General credit card (16 digits, with optional spaces/dashes)
^\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}$
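
As a sketch of how the brand patterns might be combined, the Python snippet below strips separators and tries each pattern in turn. The function name `card_type` and the dictionary layout are assumptions, the sample numbers are arbitrary digit strings with the right prefixes and lengths, and the check covers prefix and length only, not whether the number is valid.

```python
import re
from typing import Optional

# Brand patterns from above; fullmatch() replaces the ^...$ anchors
CARD_PATTERNS = {
    "visa": re.compile(r"4\d{12}(?:\d{3})?"),
    "mastercard": re.compile(r"5[1-5]\d{14}"),
    "amex": re.compile(r"3[47]\d{13}"),
}

def card_type(number: str) -> Optional[str]:
    """Return the first brand whose pattern matches, ignoring spaces and dashes."""
    digits = re.sub(r"[-\s]", "", number)
    for brand, pattern in CARD_PATTERNS.items():
        if pattern.fullmatch(digits):
            return brand
    return None

print(card_type("4111 1111 1111 1111"))  # visa
print(card_type("5500-0000-0000-0004"))  # mastercard
print(card_type("378282246310005"))      # amex
print(card_type("1234"))                 # None
```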

IP Addresses

# IPv4 address
^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$

# IPv6 address (simplified)
^(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}$

# IPv4 or IPv6
^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$|^(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}$
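
The IPv4 pattern repeats the same octet expression four times, so in code it can be assembled from one piece. This Python sketch does that and tests a few illustrative inputs; the constant names are made up for the example.

```python
import re

# One octet: 250-255, 200-249, or 0-199 (leading zeros are accepted)
OCTET = r"(?:25[0-5]|2[0-4]\d|[01]?\d\d?)"

# {{3}} is an escaped {3} inside the f-string
IPV4_RE = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")

# Simplified IPv6: eight full groups, no "::" compression
IPV6_RE = re.compile(r"(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}")

def is_ipv4(value: str) -> bool:
    return IPV4_RE.fullmatch(value) is not None

def is_ipv6_full(value: str) -> bool:
    return IPV6_RE.fullmatch(value) is not None

print(is_ipv4("192.168.1.1"))   # True
print(is_ipv4("256.1.1.1"))     # False (256 is out of range)
print(is_ipv4("1.2.3"))         # False (only three octets)
print(is_ipv6_full("2001:0db8:85a3:0000:0000:8a2e:0370:7334"))  # True
print(is_ipv6_full("::1"))      # False (compressed form is not covered)
```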

Language-Specific Examples

JavaScript

// Basic regex usage
const pattern = /hello/i;              // Case insensitive
const text = "Hello World";
console.log(pattern.test(text));      // true

// String methods with regex
text.match(/\w+/g);                    // ["Hello", "World"]
text.replace(/hello/i, "Hi");          // "Hi World"
text.split(/\s+/);                     // ["Hello", "World"]

// Constructor syntax
const regex = new RegExp("hello", "i");

Python

import re

# Basic usage
pattern = r'hello'
text = "Hello World"
match = re.search(pattern, text, re.IGNORECASE)

# Common methods
re.findall(r'\w+', text)              # ['Hello', 'World']
re.sub(r'hello', 'Hi', text, flags=re.IGNORECASE)  # "Hi World"
re.split(r'\s+', text)                # ['Hello', 'World']

# Compiled patterns (more efficient for repeated use)
pattern = re.compile(r'\d+')
pattern.findall("123 and 456")        # ['123', '456']

Java

import java.util.regex.*;

// Basic usage
String pattern = "hello";
String text = "Hello World";
boolean matches = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(text).find();  // true; Pattern.matches() would require the whole string to match

// Pattern and Matcher objects
Pattern p = Pattern.compile("\\w+");
Matcher m = p.matcher(text);
while (m.find()) {
    System.out.println(m.group());    // Prints each word
}

// String methods
text.replaceAll("(?i)hello", "Hi");   // "Hi World"
text.split("\\s+");                   // ["Hello", "World"]

PHP

// Basic usage
$pattern = '/hello/i';
$text = "Hello World";
$matches = preg_match($pattern, $text);  // 1 if found

// Common functions
preg_match_all('/\w+/', $text, $matches);     // Find all words
preg_replace('/hello/i', 'Hi', $text);        // "Hi World"
preg_split('/\s+/', $text);                   // ["Hello", "World"]

// With capture groups
preg_match('/(\w+)\s+(\w+)/', $text, $matches);
// $matches[1] = "Hello", $matches[2] = "World"

Flags and Modifiers

Common Flags

# Case insensitive
/pattern/i      # JavaScript
(?i)pattern     # Inline flag
re.IGNORECASE   # Python

# Global (find all matches)
/pattern/g      # JavaScript
re.findall()    # Python (no global flag; findall/finditer return every match)

# Multiline (^ and $ also match at line breaks)
/pattern/m      # JavaScript
re.MULTILINE    # Python

# Dot matches newline
/pattern/s      # JavaScript
re.DOTALL       # Python

# Extended (ignore whitespace, allow comments)
/pattern/x      # Some languages
re.VERBOSE      # Python
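
The effect of each flag is easiest to see side by side. A small Python sketch, using a made-up two-line string:

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, ^ only matches at the very start of the string
no_flag = re.findall(r"^\w+", text)
multiline = re.findall(r"^\w+", text, re.MULTILINE)
print(no_flag)    # ['first']
print(multiline)  # ['first', 'second']

# By default '.' does not cross a newline; DOTALL lets it
dot_default = re.search(r"first.*second", text)
dot_all = re.search(r"first.*second", text, re.DOTALL)
print(dot_default is None)  # True
print(dot_all is not None)  # True

# IGNORECASE matches regardless of letter case
any_case = re.findall(r"line", "Line LINE line", re.IGNORECASE)
print(any_case)   # ['Line', 'LINE', 'line']

# VERBOSE ignores pattern whitespace and allows inline comments
year_month = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
""", re.VERBOSE)
print(year_month.search("2024-01").groups())  # ('2024', '01')
```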

Performance Tips

Optimization Strategies

# Use specific character classes instead of .
\d+             # Better than .+ for digits
[a-zA-Z]+       # Better than .+ for letters

# Anchor patterns when possible
^pattern        # Faster when pattern should be at start
pattern$        # Faster when pattern should be at end

# Use non-capturing groups when you don't need the capture
(?:abc)+        # Better than (abc)+ if you don't need the group

# Avoid catastrophic backtracking
(a+)+b          # Dangerous pattern
a+b             # Better alternative

# Use atomic groups or possessive quantifiers
(?>a+)b         # Atomic group (some languages)
a++b            # Possessive quantifier (some languages)
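
Two of the tips above can be illustrated in Python without timing anything. The sketch shows that a non-capturing group also changes what `re.findall` returns, and that the nested `(a+)+b` and the flat `a+b` agree on a handful of short, made-up inputs. The exponential slowdown of the nested form on long non-matching inputs is not measured here.

```python
import re

text = "abcabc xyz"

# With a capturing group, findall returns the group (its last repetition)
capturing = re.findall(r"(abc)+", text)
# With a non-capturing group, findall returns the whole match
non_capturing = re.findall(r"(?:abc)+", text)
print(capturing)      # ['abc']
print(non_capturing)  # ['abcabc']

# The nested and flat patterns accept the same strings; only the cost differs
nested = re.compile(r"(a+)+b")
flat = re.compile(r"a+b")
samples = ["ab", "aaab", "b", "aaa", "xaab"]  # kept short on purpose
same_verdict = all(
    (nested.search(s) is None) == (flat.search(s) is None) for s in samples
)
print(same_verdict)   # True
```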

Common Pitfalls

# Greedy quantifiers can be slow
.*expensive     # Can be slow on long strings
.*?expensive    # Often faster (lazy)

# Alternation order matters
cat|catch       # "cat" will match first part of "catch"
catch|cat       # Better: longer alternative first

# Escape special characters
\.              # Literal dot
\$              # Literal dollar sign
\(              # Literal parenthesis
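
The alternation and escaping pitfalls are quick to reproduce in Python. The sketch uses `re.escape` for the escaping case, which is one way to avoid escaping by hand when the literal text comes from elsewhere; the sample strings are invented.

```python
import re

# Alternation is tried left to right; the first alternative that fits wins
short_first = re.search(r"cat|catch", "catch").group()
long_first = re.search(r"catch|cat", "catch").group()
print(short_first)  # cat
print(long_first)   # catch

# An unescaped '.' matches any character, so "3.50" also matches "3x50"
unescaped_hit = re.search(r"3.50", "3x50") is not None
escaped_miss = re.search(re.escape("3.50"), "3x50") is None
print(unescaped_hit)  # True
print(escaped_miss)   # True

# re.escape() turns arbitrary text into a pattern that matches it literally
price = "3.50$ (approx)"
literal_match = re.fullmatch(re.escape(price), price) is not None
print(literal_match)  # True
```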

Testing and Debugging

Online Tools

  • regex101.com - Interactive regex tester with explanations
  • regexr.com - Visual regex builder and tester
  • regexpal.com - Simple regex testing tool
  • regexper.com - Visual regex diagrams

Testing Strategies

# Start simple and build complexity
\d              # Start with basic digit matching
\d+             # Add quantifier
\d{2,4}         # Add specific range
^\d{2,4}$       # Add anchors
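
The incremental build-up can be checked with a small table-driven sketch in Python: each stage of the pattern is run against the same sample inputs so the effect of every added piece is visible. The sample list and the helper `found` are assumptions for illustration.

```python
import re

# The four stages of the pattern, from simplest to fully anchored
STAGES = [r"\d", r"\d+", r"\d{2,4}", r"^\d{2,4}$"]

# Hypothetical inputs, including an empty string and an over-long number
SAMPLES = ["", "7", "42", "12345", "id 99"]

def found(pattern: str, text: str) -> bool:
    return re.search(pattern, text) is not None

table = {p: [found(p, s) for s in SAMPLES] for p in STAGES}
for pattern, row in table.items():
    print(f"{pattern:<10} {row}")
# \d         [False, True, True, True, True]
# \d+        [False, True, True, True, True]
# \d{2,4}    [False, False, True, True, True]
# ^\d{2,4}$  [False, False, True, False, False]

# Edge case: in Python, $ also matches just before a trailing newline
print(found(r"^\d{2,4}$", "42\n"))                    # True
print(re.fullmatch(r"\d{2,4}", "42\n") is not None)   # False
```

Only the anchored stage rejects `"12345"` and `"id 99"`, which is usually the behavior wanted for validation.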

# Test edge cases
""              # Empty string
"a"             # Single character
"aaa...aaa"     # Very long strings
"special!@#"    # Special characters

This guide covers the essential regex patterns and techniques needed for effective text processing across programming languages and tools. Practice with real-world examples to master these pattern-matching capabilities.