Regular Expressions (RegEx) - Pattern Matching
*A comprehensive guide to regular expressions for pattern matching and text processing*
Regular expressions (regex) are powerful pattern-matching tools used in programming languages, text editors, and command-line tools. This guide covers regex syntax, common patterns, and practical examples for effective text processing.
Basic Syntax
Literal Characters
# Exact character matching
hello # Matches "hello" exactly
123 # Matches "123" exactly
Hello World # Matches "Hello World" exactly
# Case sensitivity (depends on flags)
Hello # Matches "Hello" but not "hello" (case sensitive)
(?i)Hello # Matches "Hello", "hello", "HELLO" (case insensitive)
Metacharacters
# Special characters with meaning
. # Matches any single character except newline
^ # Matches start of string/line
$ # Matches end of string/line
* # Matches 0 or more of preceding element
+ # Matches 1 or more of preceding element
? # Matches 0 or 1 of preceding element
| # OR operator (alternation)
() # Grouping
[] # Character class
{} # Quantifiers
\ # Escape character
Character Classes
# Predefined character classes
\d # Any digit (0-9)
\D # Any non-digit
\w # Any word character (a-z, A-Z, 0-9, _)
\W # Any non-word character
\s # Any whitespace character (space, tab, newline)
\S # Any non-whitespace character
\n # Newline character
\t # Tab character
\r # Carriage return
# Custom character classes
[abc] # Matches 'a', 'b', or 'c'
[a-z] # Matches any lowercase letter
[A-Z] # Matches any uppercase letter
[0-9] # Matches any digit
[a-zA-Z] # Matches any letter
[a-zA-Z0-9] # Matches any alphanumeric character
[^abc] # Matches any character except 'a', 'b', or 'c'
[^0-9] # Matches any non-digit
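The classes above can be tried out with Python's `re` module. This is a minimal sketch; the sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen only to exercise the classes
text = "Order 66: 3 apples, 12 pears_and_plums\t(ready)"

digits = re.findall(r"\d+", text)          # runs of digits
words = re.findall(r"\w+", text)           # runs of letters, digits, underscore
vowels = re.findall(r"[aeiou]", "regex")   # custom class: any listed character
digits_only = re.sub(r"[^0-9]", "", text)  # negated class: drop all non-digits

print(digits)       # ['66', '3', '12']
print(words)        # ['Order', '66', '3', 'apples', '12', 'pears_and_plums', 'ready']
print(vowels)       # ['e', 'e']
print(digits_only)  # 66312
```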
Quantifiers
Basic Quantifiers
# Exact repetition
a{3} # Matches exactly 3 'a's: "aaa"
a{2,5} # Matches 2 to 5 'a's: "aa", "aaa", "aaaa", "aaaaa"
a{3,} # Matches 3 or more 'a's: "aaa", "aaaa", etc.
a{,3} # Matches 0 to 3 'a's: "", "a", "aa", "aaa" (Python; some flavors, such as JavaScript, read this literally, so a{0,3} is more portable)
# Common quantifiers
a* # Matches 0 or more 'a's (equivalent to a{0,})
a+ # Matches 1 or more 'a's (equivalent to a{1,})
a? # Matches 0 or 1 'a' (equivalent to a{0,1})
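A quick way to check quantifier bounds is to test whole strings with `re.fullmatch`. The sketch below assumes a small helper, `whole_match`, which is a name introduced here for illustration.

```python
import re

def whole_match(pattern: str, s: str) -> bool:
    """Return True when the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# a{2,5} accepts 2 to 5 repetitions and nothing else
range_results = {s: whole_match(r"a{2,5}", s) for s in ["a", "aa", "aaaaa", "aaaaaa"]}
print(range_results)  # {'a': False, 'aa': True, 'aaaaa': True, 'aaaaaa': False}

print(whole_match(r"a*", ""))    # True: zero repetitions are allowed
print(whole_match(r"a+", ""))    # False: at least one 'a' is required
print(whole_match(r"a?", "aa"))  # False: at most one 'a' is allowed
```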
Greedy vs Lazy Quantifiers
# Greedy (default) - matches as much as possible
.* # Matches as many characters as possible
.+ # Matches as many characters as possible (at least 1)
.{2,5} # Matches as many characters as possible (2-5)
# Lazy (non-greedy) - matches as little as possible
.*? # Matches as few characters as possible
.+? # Matches as few characters as possible (at least 1)
.{2,5}? # Matches as few characters as possible (2-5)
# Example difference
String: "Hello World"
H.*o # Greedy: matches "Hello Wo" (to last 'o')
H.*?o # Lazy: matches "Hello" (to first 'o')
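The greedy/lazy difference can be reproduced in Python as follows. The HTML-like string is an assumed example, added because tag matching is where the difference usually shows up.

```python
import re

text = "Hello World"
greedy = re.search(r"H.*o", text).group()   # runs to the last 'o'
lazy = re.search(r"H.*?o", text).group()    # stops at the first 'o'
print(greedy)  # Hello Wo
print(lazy)    # Hello

# Hypothetical markup sample
html = "<b>bold</b> and <i>italic</i>"
greedy_tags = re.findall(r"<.+>", html)   # one match spanning the whole string
lazy_tags = re.findall(r"<.+?>", html)    # one match per tag
print(greedy_tags)  # ['<b>bold</b> and <i>italic</i>']
print(lazy_tags)    # ['<b>', '</b>', '<i>', '</i>']
```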
Anchors and Boundaries
Position Anchors
# String/line boundaries
^ # Start of string or line
$ # End of string or line
\A # Start of string (not line)
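\Z # End of string (not line); in PCRE, Java, and .NET also just before a final newline
\z # Very end of string (PCRE, Java, .NET; Python's re uses \Z for this)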
# Examples
^Hello # Matches "Hello" at start of line
World$ # Matches "World" at end of line
^Hello World$ # Matches entire line containing only "Hello World"
Word Boundaries
# Word boundaries
\b # Word boundary
\B # Non-word boundary
# Examples
\bcat\b # Matches "cat" as whole word, not in "category"
\Bcat\B # Matches "cat" only when not at word boundaries
\bcat # Matches "cat" at start of word: "cat", "category"
cat\b # Matches "cat" at end of word: "cat", "tomcat"
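The boundary examples above can be checked against a single sample string. This is a small sketch; the sample words are assumptions chosen to cover each case.

```python
import re

# Hypothetical sample: "cat" alone, as a prefix, as a suffix, and in the middle
text = "cat category tomcat concatenate"

whole_word = re.findall(r"\bcat\b", text)      # "cat" as a whole word only
word_start = re.findall(r"\bcat\w*", text)     # words that start with "cat"
word_end = re.findall(r"\w*cat\b", text)       # words that end with "cat"
inside_only = re.findall(r"\Bcat\B", text)     # "cat" strictly inside a word

print(whole_word)   # ['cat']
print(word_start)   # ['cat', 'category']
print(word_end)     # ['cat', 'tomcat']
print(inside_only)  # ['cat'] (the one inside "concatenate")
```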
Groups and Capturing
Basic Grouping
# Grouping with parentheses
(abc) # Groups "abc" together
(abc)+ # Matches one or more "abc" sequences
(abc|def) # Matches either "abc" or "def"
(abc){2,4} # Matches 2 to 4 "abc" sequences
# Non-capturing groups
(?:abc) # Groups without capturing
(?:abc|def)+ # Matches sequences of "abc" or "def"
Capturing Groups
# Numbered captures
(abc)(def) # Capture group 1: "abc", group 2: "def"
(\d{4})-(\d{2})-(\d{2}) # Captures date parts: year, month, day
# Named captures
(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}) # Named groups
(?P<name>\w+) # Python-style named group
# Backreferences
(\w+)\s+\1 # Matches repeated words: "hello hello"
(["'])(.*?)\1 # Matches quoted strings with same quote type
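In Python, numbered groups, named groups, and backreferences might be used as in the sketch below. Python spells named groups `(?P<name>...)`; the sample strings are illustrative assumptions.

```python
import re

date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
m = date_re.search("Released on 2024-03-15.")
print(m.group(1))        # 2024 (numbered access)
print(m.group("month"))  # 03 (named access)
print(m.groupdict())     # {'year': '2024', 'month': '03', 'day': '15'}

# Named groups can be referenced in the replacement string
reordered = date_re.sub(r"\g<day>/\g<month>/\g<year>", "2024-03-15")
print(reordered)         # 15/03/2024

# \1 must repeat exactly what group 1 captured; \b keeps the match on whole words
repeated = re.search(r"\b(\w+)\s+\1\b", "it was was fine")
print(repeated.group(1))  # was

# The closing quote must be the same character as the opening quote
quoted = re.findall(r'''(["'])(.*?)\1''', """say "it's fine" or 'ok'""")
print(quoted)  # [('"', "it's fine"), ("'", 'ok')]
```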
Lookahead and Lookbehind
Lookahead Assertions
# Positive lookahead
\d+(?=\s*dollars) # Matches digits followed by "dollars"
\w+(?=@) # Matches username before @ in email
# Negative lookahead
\b\d+\b(?!\s*cents) # Matches whole numbers NOT followed by "cents" (without \b, backtracking would still match the "5" in "50 cents")
\b\w+\b(?!@) # Matches whole words NOT followed by @
Lookbehind Assertions
# Positive lookbehind
(?<=\$)\d+ # Matches digits preceded by $
(?<=@)\w+ # Matches domain after @ in email
# Negative lookbehind
(?<!\$)\b\d+ # Matches whole numbers NOT preceded by $ (without \b, the "00" in "$100" would still match)
(?<!@)\b\w+ # Matches whole words NOT preceded by @
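Lookarounds check context without consuming it, which the following Python sketch tries to make visible. It also shows a common surprise with negative lookahead: an unanchored `\d+` can backtrack to a shorter match that satisfies the assertion. The sample strings are assumptions.

```python
import re

# Hypothetical sample text
price_text = "Costs 100 dollars or 50 cents, was $75"

before_dollars = re.findall(r"\d+(?=\s*dollars)", price_text)  # followed by "dollars"
after_sign = re.findall(r"(?<=\$)\d+", price_text)             # preceded by "$"
print(before_dollars)  # ['100']
print(after_sign)      # ['75']

# Without word boundaries, "50" is rejected but the shorter "5" is accepted
naive = re.findall(r"\d+(?!\s*cents)", price_text)
bounded = re.findall(r"\b\d+\b(?!\s*cents)", price_text)
print(naive)    # ['100', '5', '75']
print(bounded)  # ['100', '75']

email = "user@example.com"
username = re.search(r"\w+(?=@)", email).group()       # text before @
domain_label = re.search(r"(?<=@)\w+", email).group()  # first label after @
print(username, domain_label)  # user example
```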
Common Patterns
Email Validation
# Basic email pattern
\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b
# More comprehensive email
^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$
# Simple email validation
^\S+@\S+\.\S+$
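As a rough illustration, the basic email pattern could be wrapped in a small checker like the one below. The function name `is_plausible_email` is hypothetical, and a pattern match only indicates a plausible shape, not a deliverable address.

```python
import re

# Basic pattern from this section, applied to the whole string via fullmatch
BASIC_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def is_plausible_email(candidate: str) -> bool:
    """Return True when candidate has the overall shape of an email address."""
    return BASIC_EMAIL.fullmatch(candidate) is not None

print(is_plausible_email("alice@example.com"))               # True
print(is_plausible_email("bob.smith+tag@mail.example.org"))  # True
print(is_plausible_email("no-at-sign.example.com"))          # False
print(is_plausible_email("a@b.c"))                           # False: top-level part needs 2+ letters
```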
Phone Numbers
# US phone numbers
\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4} # (123) 456-7890 or 123-456-7890
^\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$ # With optional country code
# International format
^\+?[1-9]\d{1,14}$ # E.164 format
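One possible use of the US pattern is normalizing differently formatted numbers to a single form. The sketch below adds capture groups to the pattern for that purpose; `normalize_us_phone` is a name invented for this example.

```python
import re
from typing import Optional

# US pattern from this section, with groups around the three digit blocks
US_PHONE = re.compile(r"\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})")
E164 = re.compile(r"\+?[1-9]\d{1,14}")

def normalize_us_phone(raw: str) -> Optional[str]:
    """Return the first US-style number in raw as NNN-NNN-NNNN, or None."""
    m = US_PHONE.search(raw)
    return "-".join(m.groups()) if m else None

print(normalize_us_phone("(123) 456-7890"))  # 123-456-7890
print(normalize_us_phone("123.456.7890"))    # 123-456-7890
print(normalize_us_phone("12-34"))           # None

print(E164.fullmatch("+14155552671") is not None)  # True
print(E164.fullmatch("+0123") is not None)         # False: leading digit must be 1-9
```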
URLs
# Basic URL pattern
https?://[^\s]+
# More comprehensive URL
^https?://(?:[-\w.])+(?:\:[0-9]+)?(?:/(?:[\w/_.])*(?:\?(?:[\w&=%.])*)?(?:\#(?:[\w.])*)?)?$
# Domain validation
^(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}$
Dates
# MM/DD/YYYY format
^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}$
# YYYY-MM-DD format (ISO 8601)
^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$
# Flexible date formats
\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b
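The date patterns check the shape of the text, not the calendar: a string such as "2023-02-30" passes the ISO pattern. One way to close that gap, sketched here under the assumption that Python's `datetime` module is acceptable, is to let the regex extract the parts and let `date` validate them. `parse_iso_date` is a hypothetical helper.

```python
import re
from datetime import date
from typing import Optional

ISO_DATE = re.compile(r"(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")

def parse_iso_date(text: str) -> Optional[date]:
    """Return a date for a well-formed, real YYYY-MM-DD string, else None."""
    m = ISO_DATE.fullmatch(text)
    if m is None:
        return None  # wrong shape
    year, month, day = map(int, m.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None  # right shape, but not a real calendar date

print(parse_iso_date("2024-02-29"))  # 2024-02-29
print(parse_iso_date("2023-02-30"))  # None (February has no 30th)
print(parse_iso_date("2024-13-01"))  # None (month out of range)
```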
Credit Card Numbers
# Visa (starts with 4, 13 or 16 digits)
^4\d{12}(?:\d{3})?$
# MasterCard (starts with 51-55, 16 digits; the newer 2221-2720 range is not covered)
^5[1-5]\d{14}$
# American Express (starts with 34 or 37, 15 digits)
^3[47]\d{13}$
# General credit card (with optional spaces/dashes)
^\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}$
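The per-brand patterns above could be combined into a simple classifier, as in this sketch. It strips spaces and dashes first, then tries each pattern. The function name `card_type` and the sample numbers are illustrative assumptions, and the check covers the digit format only, not whether a card number is genuine.

```python
import re
from typing import Optional

# Brand patterns from this section (anchors omitted because fullmatch is used)
CARD_PATTERNS = {
    "Visa": r"4\d{12}(?:\d{3})?",
    "MasterCard": r"5[1-5]\d{14}",
    "American Express": r"3[47]\d{13}",
}

def card_type(number: str) -> Optional[str]:
    """Return the brand whose pattern matches number, or None."""
    digits = re.sub(r"[-\s]", "", number)  # drop spaces and dashes
    for brand, pattern in CARD_PATTERNS.items():
        if re.fullmatch(pattern, digits):
            return brand
    return None

print(card_type("4111 1111 1111 1111"))  # Visa
print(card_type("5500-0000-0000-0004"))  # MasterCard
print(card_type("378282246310005"))      # American Express
print(card_type("1234"))                 # None
```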
IP Addresses
# IPv4 address
^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$
# IPv6 address (simplified)
^(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}$
# IPv4 or IPv6
^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$|^(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}$
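The IPv4 pattern is easier to read when the octet sub-pattern is named once and reused. The sketch below builds it that way in Python and also shows a limitation of the simplified IPv6 pattern, which does not accept compressed forms such as `::1`. The constant names are assumptions.

```python
import re

# One octet: 0-255 (leading zeros are accepted by this pattern)
OCTET = r"(?:25[0-5]|2[0-4]\d|[01]?\d\d?)"
# Doubled braces produce a literal {3} inside the f-string
IPV4 = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")
IPV6_FULL = re.compile(r"(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}")

def is_ipv4(text: str) -> bool:
    return IPV4.fullmatch(text) is not None

print(is_ipv4("192.168.1.1"))  # True
print(is_ipv4("256.1.1.1"))    # False: 256 is out of range
print(is_ipv4("1.2.3"))        # False: only three octets

full = "2001:0db8:85a3:0000:0000:8a2e:0370:7334"
print(IPV6_FULL.fullmatch(full) is not None)   # True
print(IPV6_FULL.fullmatch("::1") is not None)  # False: compressed form not covered
```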
Language-Specific Examples
JavaScript
// Basic regex usage
const pattern = /hello/i; // Case insensitive
const text = "Hello World";
console.log(pattern.test(text)); // true
// String methods with regex
text.match(/\w+/g); // ["Hello", "World"]
text.replace(/hello/i, "Hi"); // "Hi World"
text.split(/\s+/); // ["Hello", "World"]
// Constructor syntax
const regex = new RegExp("hello", "i");
Python
import re
# Basic usage
pattern = r'hello'
text = "Hello World"
match = re.search(pattern, text, re.IGNORECASE)
# Common methods
re.findall(r'\w+', text) # ['Hello', 'World']
re.sub(r'hello', 'Hi', text, flags=re.IGNORECASE) # "Hi World"
re.split(r'\s+', text) # ['Hello', 'World']
# Compiled patterns (more efficient for repeated use)
pattern = re.compile(r'\d+')
pattern.findall("123 and 456") # ['123', '456']
Java
import java.util.regex.*;
// Basic usage
String pattern = "hello";
String text = "Hello World";
boolean found = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(text).find(); // true (Pattern.matches() would require the whole string to match)
// Pattern and Matcher objects
Pattern p = Pattern.compile("\\w+");
Matcher m = p.matcher(text);
while (m.find()) {
    System.out.println(m.group()); // Prints each word
}
// String methods
text.replaceAll("(?i)hello", "Hi"); // "Hi World"
text.split("\\s+"); // ["Hello", "World"]
PHP
// Basic usage
$pattern = '/hello/i';
$text = "Hello World";
$matches = preg_match($pattern, $text); // 1 if found
// Common functions
preg_match_all('/\w+/', $text, $matches); // Find all words
preg_replace('/hello/i', 'Hi', $text); // "Hi World"
preg_split('/\s+/', $text); // ["Hello", "World"]
// With capture groups
preg_match('/(\w+)\s+(\w+)/', $text, $matches);
// $matches[1] = "Hello", $matches[2] = "World"
Flags and Modifiers
Common Flags
# Case insensitive
/pattern/i # JavaScript
(?i)pattern # Inline flag
re.IGNORECASE # Python
# Global (find all matches)
/pattern/g # JavaScript
re.findall() # Python (default behavior)
# Multiline (^ and $ match at the start and end of each line)
/pattern/m # JavaScript
re.MULTILINE # Python
# Dot matches newline
/pattern/s # JavaScript
re.DOTALL # Python
# Extended (ignore whitespace, allow comments)
/pattern/x # Some languages
re.VERBOSE # Python
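The effect of each flag is easiest to see side by side. The following Python sketch uses a made-up three-line string and compares results with and without each flag.

```python
import re

# Hypothetical multi-line sample
text = "first line\nSecond line\nthird"

# Without MULTILINE, ^ matches only at the start of the string
start_default = re.findall(r"^\w+", text)
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)
print(start_default)    # ['first']
print(start_multiline)  # ['first', 'Second', 'third']

# Without DOTALL, '.' does not match the newline between the lines
no_dotall = re.search(r"line.second", text, re.IGNORECASE)
with_dotall = re.search(r"line.second", text, re.IGNORECASE | re.DOTALL)
print(no_dotall)                   # None
print(repr(with_dotall.group()))   # 'line\nSecond'

# VERBOSE ignores pattern whitespace and allows inline comments
year_month = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
""", re.VERBOSE)
print(year_month.search("2024-03").groups())  # ('2024', '03')
```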
Performance Tips
Optimization Strategies
# Use specific character classes instead of .
\d+ # Better than .+ for digits
[a-zA-Z]+ # Better than .+ for letters
# Anchor patterns when possible
^pattern # Faster when pattern should be at start
pattern$ # Faster when pattern should be at end
# Use non-capturing groups when you don't need the capture
(?:abc)+ # Better than (abc)+ if you don't need the group
# Avoid catastrophic backtracking
(a+)+b # Dangerous pattern
a+b # Better alternative
# Use atomic groups or possessive quantifiers
(?>a+)b # Atomic group (some languages)
a++b # Possessive quantifier (some languages)
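To get a feel for catastrophic backtracking, the nested pattern can be timed against the flat one on an input that cannot match. This is a rough sketch: the input length of 18 is an arbitrary assumption kept small so the script finishes quickly, and absolute timings will vary by machine and Python version.

```python
import re
import time

def timed_search(pattern: str, subject: str):
    """Run re.search and return (match_or_None, elapsed_seconds)."""
    start = time.perf_counter()
    result = re.search(pattern, subject)
    return result, time.perf_counter() - start

# No 'b' anywhere, so both patterns must fail after trying every option
subject = "a" * 18 + "c"

nested_result, nested_time = timed_search(r"(a+)+b", subject)  # nested quantifier
flat_result, flat_time = timed_search(r"a+b", subject)         # flat equivalent

print(nested_result, flat_result)  # None None
print(f"nested: {nested_time:.4f}s, flat: {flat_time:.6f}s")
```

With the backtracking `re` engine, each extra `a` in the subject roughly doubles the work for the nested pattern, while the flat pattern grows much more slowly.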
Common Pitfalls
# Greedy quantifiers can be slow
.*expensive # Can be slow on long strings: .* runs to the end, then backtracks
.*?expensive # Often faster when the match is near the start (lazy)
# Alternation order matters
cat|catch # "cat" will match first part of "catch"
catch|cat # Better: longer alternative first
# Escape special characters
\. # Literal dot
\$ # Literal dollar sign
\( # Literal parenthesis
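Both pitfalls can be reproduced in a few lines of Python. The sketch uses `re.escape` from the standard library to escape literal text automatically; the sample strings are assumptions.

```python
import re

# Alternation tries branches left to right and stops at the first that works
first_wins = re.search(r"cat|catch", "catch").group()
longer_first = re.search(r"catch|cat", "catch").group()
print(first_wins)    # cat
print(longer_first)  # catch

# An unescaped '.' is a wildcard, so "3.50" also matches "3x50"
unescaped = re.search("3.50", "3x50")
escaped = re.search(re.escape("3.50"), "3x50")
literal_hit = re.search(re.escape("3.50"), "total: 3.50").group()
print(unescaped is not None)  # True
print(escaped)                # None
print(literal_hit)            # 3.50
```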
Testing and Debugging
Online Tools
- regex101.com - Interactive regex tester with explanations
- regexr.com - Visual regex builder and tester
- regexpal.com - Simple regex testing tool
- regexper.com - Visual regex diagrams
Testing Strategies
# Start simple and build complexity
\d # Start with basic digit matching
\d+ # Add quantifier
\d{2,4} # Add specific range
^\d{2,4}$ # Add anchors
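The incremental steps above can be run against a fixed set of samples to see what each refinement excludes. This is a minimal sketch; the sample inputs are assumptions chosen to include an empty string, short and long digit runs, and digits embedded in text.

```python
import re

# Each step adds one refinement to the previous pattern
steps = [r"\d", r"\d+", r"\d{2,4}", r"^\d{2,4}$"]
samples = ["", "7", "42", "2024", "12345", "year 2024"]

for pattern in steps:
    matched = [s for s in samples if re.search(pattern, s)]
    print(f"{pattern!r:14} -> {matched}")

# Only the anchored pattern rejects partial matches such as "12345"
unanchored_matches = [s for s in samples if re.search(steps[2], s)]
final_matches = [s for s in samples if re.search(steps[-1], s)]
print(unanchored_matches)  # ['42', '2024', '12345', 'year 2024']
print(final_matches)       # ['42', '2024']
```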
# Test edge cases
"" # Empty string
"a" # Single character
"aaa...aaa" # Very long strings
"special!@#" # Special characters
This regex guide covers the essential patterns and techniques needed for effective text processing across programming languages and tools. Practice with real examples to master these pattern-matching capabilities.