Regular Expressions (RegEx) - Pattern Matching

Complete guide to regular expressions for pattern matching and text processing

Regular expressions (regex) are pattern-matching tools supported by most programming languages, text editors, and command-line tools. This guide covers regex syntax, common patterns, and practical examples for text processing. Syntax details vary between engines (PCRE, Python, JavaScript, Java, and others), so test each pattern in the engine you actually use.

Basic Syntax

Literal Characters

# Exact character matching
hello           # Matches "hello" exactly
123             # Matches "123" exactly
Hello World     # Matches "Hello World" exactly

# Case sensitivity (depends on flags)
Hello           # Matches "Hello" but not "hello" (case sensitive)
(?i)Hello       # Matches "Hello", "hello", "HELLO" (case insensitive)
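
As a small illustration of the case-sensitivity rules above, here is a sketch using Python's `re` module. The sample text is arbitrary; the inline `(?i)` flag and `re.IGNORECASE` should behave the same way here.

```python
import re

text = "Hello World"

# Literal patterns are case sensitive by default.
case_sensitive = re.search(r"hello", text)

# The inline flag (?i) switches on case-insensitive matching.
inline_flag = re.search(r"(?i)hello", text)

# re.IGNORECASE is the equivalent flag argument.
flag_arg = re.search(r"hello", text, re.IGNORECASE)

print(case_sensitive)        # None
print(inline_flag.group())   # Hello
print(flag_arg.group())      # Hello
```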

Metacharacters

# Special characters with meaning
.               # Matches any single character except newline
^               # Matches start of string/line
$               # Matches end of string/line
*               # Matches 0 or more of preceding element
+               # Matches 1 or more of preceding element
?               # Matches 0 or 1 of preceding element
|               # OR operator (alternation)
()              # Grouping
[]              # Character class
{}              # Quantifiers
\               # Escape character

Character Classes

# Predefined character classes
\d              # Any digit (0-9)
\D              # Any non-digit
\w              # Any word character (a-z, A-Z, 0-9, _)
\W              # Any non-word character
\s              # Any whitespace character (space, tab, newline)
\S              # Any non-whitespace character
\n              # Newline character
\t              # Tab character
\r              # Carriage return

# Custom character classes
[abc]           # Matches 'a', 'b', or 'c'
[a-z]           # Matches any lowercase letter
[A-Z]           # Matches any uppercase letter
[0-9]           # Matches any digit
[a-zA-Z]        # Matches any letter
[a-zA-Z0-9]     # Matches any alphanumeric character
[^abc]          # Matches any character except 'a', 'b', or 'c'
[^0-9]          # Matches any non-digit
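
The character classes above can be tried out with a short Python sketch. The sample strings are made up for illustration.

```python
import re

sample = "Order 66: 3 apples, 12 pears"

# \d+ collects runs of digits.
digits = re.findall(r"\d+", sample)

# A custom class restricted to ASCII letters.
words = re.findall(r"[A-Za-z]+", sample)

# A negated class: delete everything that is not a vowel.
vowels_only = re.sub(r"[^aeiou]", "", "regex")

# \W+ splits on runs of non-word characters.
fields = re.split(r"\W+", "a, b;c")

print(digits)       # ['66', '3', '12']
print(words)        # ['Order', 'apples', 'pears']
print(vowels_only)  # ee
print(fields)       # ['a', 'b', 'c']
```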

Quantifiers

Basic Quantifiers

# Exact repetition
a{3}            # Matches exactly 3 'a's: "aaa"
a{2,5}          # Matches 2 to 5 'a's: "aa", "aaa", "aaaa", "aaaaa"
a{3,}           # Matches 3 or more 'a's: "aaa", "aaaa", etc.
a{,3}           # Matches 0 to 3 'a's in Python; JavaScript treats {,3} as literal text, so prefer a{0,3}

# Common quantifiers
a*              # Matches 0 or more 'a's (equivalent to a{0,})
a+              # Matches 1 or more 'a's (equivalent to a{1,})
a?              # Matches 0 or 1 'a' (equivalent to a{0,1})
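
To check the quantifier ranges above, the following Python sketch uses `re.fullmatch`, which only succeeds when the whole string matches. The helper name `matches` is hypothetical.

```python
import re

def matches(pattern: str, s: str) -> bool:
    """Return True if the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

print(matches(r"a{3}", "aaa"))       # True
print(matches(r"a{3}", "aaaa"))      # False: exactly three required
print(matches(r"a{2,5}", "a"))       # False: below the minimum
print(matches(r"a{3,}", "aaaaaa"))   # True: no upper bound
print(matches(r"a{,3}", ""))         # True: Python reads {,3} as {0,3}
print(matches(r"a*", ""))            # True: zero repetitions allowed
print(matches(r"a+", ""))            # False: at least one required
```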

Greedy vs Lazy Quantifiers

# Greedy (default) - matches as much as possible
.*              # Matches as many characters as possible
.+              # Matches as many characters as possible (at least 1)
.{2,5}          # Matches as many characters as possible (2-5)

# Lazy (non-greedy) - matches as little as possible
.*?             # Matches as few characters as possible
.+?             # Matches as few characters as possible (at least 1)
.{2,5}?         # Matches as few characters as possible (2-5)

# Example difference
String: "Hello World"
H.*o            # Greedy: matches "Hello Wo" (to last 'o')
H.*?o           # Lazy: matches "Hello" (to first 'o')
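
The greedy/lazy difference above can be reproduced in Python. The second half of this sketch uses a made-up tag string, where the difference usually matters more in practice.

```python
import re

text = "Hello World"

greedy = re.search(r"H.*o", text).group()   # runs to the last "o"
lazy = re.search(r"H.*?o", text).group()    # stops at the first "o"

print(greedy)  # Hello Wo
print(lazy)    # Hello

tags = "<b>one</b> <b>two</b>"

# Greedy: one match spanning both elements.
greedy_tags = re.findall(r"<b>.*</b>", tags)
# Lazy: one match per element.
lazy_tags = re.findall(r"<b>.*?</b>", tags)

print(greedy_tags)  # ['<b>one</b> <b>two</b>']
print(lazy_tags)    # ['<b>one</b>', '<b>two</b>']
```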

Anchors and Boundaries

Position Anchors

# String/line boundaries
^               # Start of string or line
$               # End of string or line
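\A              # Start of string (never start of line)
\Z              # End of string; in PCRE, Java, and .NET also before a final newline (in Python: very end only)
\z              # Very end of string (PCRE, Java, .NET, Ruby; not available in JavaScript)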

# Examples
^Hello          # Matches "Hello" at start of line
World$          # Matches "World" at end of line
^Hello World$   # Matches entire line containing only "Hello World"
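
Whether `^` and `$` refer to the string or to each line depends on the multiline flag. A short Python sketch with an invented three-line string:

```python
import re

text = "Hello\nHello World\nWorld"

# Default mode: ^ matches only at the start of the whole string.
start_only = re.findall(r"^Hello", text)

# re.MULTILINE: ^ and $ also match at the start and end of each line.
each_line = re.findall(r"^Hello", text, re.MULTILINE)
whole_line = re.findall(r"^Hello World$", text, re.MULTILINE)

# \A ignores the multiline flag and stays tied to the start of the string.
string_start = re.findall(r"\AHello", text, re.MULTILINE)

print(start_only)    # ['Hello']
print(each_line)     # ['Hello', 'Hello']
print(whole_line)    # ['Hello World']
print(string_start)  # ['Hello']
```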

Word Boundaries

# Word boundaries
\b              # Word boundary
\B              # Non-word boundary

# Examples
\bcat\b         # Matches "cat" as whole word, not in "category"
\Bcat\B         # Matches "cat" only when not at word boundaries
\bcat           # Matches "cat" at start of word: "cat", "category"
cat\b           # Matches "cat" at end of word: "cat", "tomcat"
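
The four boundary examples above can be checked against one sample string in Python. The `\w*` parts are added here only so that `findall` returns whole words instead of just "cat".

```python
import re

text = "cat category tomcat concatenate"

whole_word = re.findall(r"\bcat\b", text)          # "cat" on its own
word_start = re.findall(r"\bcat\w*", text)         # words beginning with "cat"
word_end = re.findall(r"\w*cat\b", text)           # words ending with "cat"
word_middle = re.findall(r"\w*\Bcat\B\w*", text)   # "cat" strictly inside a word

print(whole_word)   # ['cat']
print(word_start)   # ['cat', 'category']
print(word_end)     # ['cat', 'tomcat']
print(word_middle)  # ['concatenate']
```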

Groups and Capturing

Basic Grouping

# Grouping with parentheses
(abc)           # Groups "abc" together
(abc)+          # Matches one or more "abc" sequences
(abc|def)       # Matches either "abc" or "def"
(abc){2,4}      # Matches 2 to 4 "abc" sequences

# Non-capturing groups
(?:abc)         # Groups without capturing
(?:abc|def)+    # Matches sequences of "abc" or "def"

Capturing Groups

# Numbered captures
(abc)(def)      # Capture group 1: "abc", group 2: "def"
(\d{4})-(\d{2})-(\d{2})  # Captures date parts: year, month, day

# Named captures
(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})  # Named groups
(?P<name>\w+)   # Python-style named group

# Backreferences
(\w+)\s+\1      # Matches repeated words: "hello hello"
(["'])(.*?)\1   # Matches quoted strings with same quote type
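
A Python sketch of how the captures and backreferences above are read back out. Note that Python uses the `(?P<name>...)` spelling for named groups; the sample strings are invented.

```python
import re

m = re.fullmatch(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2024-03-15")

print(m.group(1))         # 2024  (by number)
print(m.group("month"))   # 03    (by name)
print(m.groupdict())      # {'year': '2024', 'month': '03', 'day': '15'}

# Numbered groups can be reused in a replacement string.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2024-03-15")
print(reordered)          # 15/03/2024

# Backreference \1 must repeat whatever group 1 captured.
repeated = re.search(r"(\w+)\s+\1", "hello hello world").group()
print(repeated)           # hello hello

# The closing quote must be the same character as the opening one.
quoted = re.search(r"""(["'])(.*?)\1""", 'say "it\'s" now').group(2)
print(quoted)             # it's
```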

Lookahead and Lookbehind

Lookahead Assertions

# Positive lookahead
\d+(?=\s*dollars)    # Matches digits followed by "dollars"
\w+(?=@)             # Matches username before @ in email

# Negative lookahead
\b\d+\b(?!\s*cents)  # Matches whole numbers NOT followed by "cents"
\b\w+\b(?!@)         # Matches whole words NOT followed by @
                     # Without \b, backtracking lets \d+(?!\s*cents) match the "5" in "50 cents"

Lookbehind Assertions

# Positive lookbehind
(?<=\$)\d+           # Matches digits preceded by $
(?<=@)\w+            # Matches domain after @ in email

# Negative lookbehind
(?<!\$)\b\d+         # Matches whole numbers NOT preceded by $
(?<!@)\b\w+          # Matches whole words NOT preceded by @
                     # Without \b, (?<!\$)\d+ still matches the "00" in "$100"
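
Lookarounds check context without consuming it, and negative lookarounds interact with backtracking in a way that is easy to miss. The Python sketch below uses an invented price string and shows both the anchored and unanchored negative lookahead.

```python
import re

text = "Price: $100, weight 25 kg, 50 cents"

after_dollar = re.findall(r"(?<=\$)\d+", text)      # digits preceded by $
before_kg = re.findall(r"\d+(?=\s*kg)", text)       # digits followed by "kg"

# With \b on both sides, only whole numbers are considered.
not_cents = re.findall(r"\b\d+\b(?!\s*cents)", text)

# Without \b the engine backtracks from "50" to "5", which is followed
# by "0" rather than "cents", so the lookahead passes for the shorter match.
not_cents_naive = re.findall(r"\d+(?!\s*cents)", text)

print(after_dollar)     # ['100']
print(before_kg)        # ['25']
print(not_cents)        # ['100', '25']
print(not_cents_naive)  # ['100', '25', '5']

email = "user@example.com"
print(re.findall(r"\w+(?=@)", email))   # ['user']
print(re.findall(r"(?<=@)\w+", email))  # ['example']
```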

Common Patterns

Email Validation

# Basic email pattern
\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b

# More comprehensive email
^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$

# Simple email validation
^\S+@\S+\.\S+$
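
A minimal sketch of wrapping the basic email pattern in a Python helper. The function name `looks_like_email` is hypothetical, and the check is only a plausibility test of the string's shape; it does not show that the address exists or can receive mail.

```python
import re

# Basic email pattern from above; fullmatch replaces the \b anchors
# so the whole input must match.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def looks_like_email(candidate: str) -> bool:
    """Return True if candidate has the rough shape of an email address."""
    return EMAIL_RE.fullmatch(candidate) is not None

print(looks_like_email("user.name+tag@example.co.uk"))  # True
print(looks_like_email("user@localhost"))               # False: no dot in domain
print(looks_like_email("not an email"))                 # False
```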

Phone Numbers

# US phone numbers
\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}    # (123) 456-7890 or 123-456-7890
^\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$  # With optional country code

# International format
^\+?[1-9]\d{1,14}$                     # E.164 format
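
One common use of the US pattern is normalizing differently formatted numbers. The sketch below adds capture groups to the country-code variant; the helper name `normalize_us_phone` and the output format are assumptions made for illustration.

```python
import re
from typing import Optional

# Same shape as the pattern above, with the three digit blocks captured.
US_PHONE_RE = re.compile(
    r"^\+?1?[-.\s]?\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$"
)

def normalize_us_phone(raw: str) -> Optional[str]:
    """Return the number as NNN-NNN-NNNN, or None if it does not match."""
    m = US_PHONE_RE.match(raw.strip())
    if m is None:
        return None
    return "-".join(m.groups())

print(normalize_us_phone("(123) 456-7890"))   # 123-456-7890
print(normalize_us_phone("+1 123.456.7890"))  # 123-456-7890
print(normalize_us_phone("123-456-7890"))     # 123-456-7890
print(normalize_us_phone("12345"))            # None
```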

URLs

# Basic URL pattern
https?://[^\s]+

# More comprehensive URL
^https?://(?:[-\w.])+(?:\:[0-9]+)?(?:/(?:[\w/_.])*(?:\?(?:[\w&=%.])*)?(?:\#(?:[\w.])*)?)?$

# Domain validation
^(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}$

Dates

# MM/DD/YYYY format
^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}$

# YYYY-MM-DD format (ISO 8601)
^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$

# Flexible date formats
\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b
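
The date patterns above check the format only; they accept impossible dates such as February 30. A hedged sketch of pairing the ISO pattern with Python's `datetime.date` for calendar validation follows (the helper name `parse_iso_date` is hypothetical).

```python
import re
from datetime import date
from typing import Optional

# ISO pattern from above, with the year captured as well.
ISO_DATE_RE = re.compile(r"(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")

def parse_iso_date(s: str) -> Optional[date]:
    """Return a date for a well-formed, real YYYY-MM-DD string, else None."""
    m = ISO_DATE_RE.fullmatch(s)
    if m is None:
        return None                    # wrong format
    year, month, day = map(int, m.groups())
    try:
        return date(year, month, day)  # rejects dates that do not exist
    except ValueError:
        return None

print(parse_iso_date("2024-02-29"))  # 2024-02-29 (leap year)
print(parse_iso_date("2023-02-30"))  # None: passes the regex, fails the calendar
print(parse_iso_date("2023-13-01"))  # None: rejected by the regex
```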

Credit Card Numbers

# Visa (starts with 4; 13 or 16 digits)
^4\d{12}(?:\d{3})?$

# MasterCard (starts with 51-55, 16 digits; does not cover the newer 2221-2720 range)
^5[1-5]\d{14}$

# American Express (starts with 34 or 37, 15 digits)
^3[47]\d{13}$

# General 16-digit card number (with optional spaces/dashes)
^\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}$
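
The brand patterns above expect bare digits, so a typical approach is to strip separators first. The Python sketch below does that; the names `CARD_PATTERNS` and `card_type` are hypothetical, the numbers are widely published test numbers, and a real validator would also run a checksum, which is not shown here.

```python
import re
from typing import Optional

# Brand patterns from above, without anchors because fullmatch is used.
CARD_PATTERNS = {
    "visa": r"4\d{12}(?:\d{3})?",
    "mastercard": r"5[1-5]\d{14}",
    "amex": r"3[47]\d{13}",
}

def card_type(number: str) -> Optional[str]:
    """Return the brand whose pattern matches, or None."""
    digits = re.sub(r"[-\s]", "", number)  # drop spaces and dashes
    for brand, pattern in CARD_PATTERNS.items():
        if re.fullmatch(pattern, digits):
            return brand
    return None

print(card_type("4111 1111 1111 1111"))  # visa
print(card_type("5500-0000-0000-0004"))  # mastercard
print(card_type("378282246310005"))      # amex
print(card_type("1234"))                 # None
```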

IP Addresses

# IPv4 address
^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$

# IPv6 address (simplified)
^(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}$

# IPv4 or IPv6
^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$|^(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}$
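
A short Python sketch of the IPv4 pattern in use, showing that each octet is range-checked and not just digit-counted. The helper name `is_ipv4` is hypothetical.

```python
import re

# One octet: 250-255, 200-249, or 0-199 (with optional leading zeros).
OCTET = r"(?:25[0-5]|2[0-4]\d|[01]?\d\d?)"
IPV4_RE = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")

def is_ipv4(s: str) -> bool:
    """Return True if s is a dotted-quad IPv4 address."""
    return IPV4_RE.fullmatch(s) is not None

print(is_ipv4("192.168.1.1"))      # True
print(is_ipv4("255.255.255.255"))  # True
print(is_ipv4("256.1.1.1"))        # False: octet out of range
print(is_ipv4("1.2.3"))            # False: only three octets
```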

Language-Specific Examples

JavaScript

// Basic regex usage
const pattern = /hello/i;              // Case insensitive
const text = "Hello World";
console.log(pattern.test(text));      // true

// String methods with regex
text.match(/\w+/g);                    // ["Hello", "World"]
text.replace(/hello/i, "Hi");          // "Hi World"
text.split(/\s+/);                     // ["Hello", "World"]

// Constructor syntax
const regex = new RegExp("hello", "i");

Python

import re

# Basic usage
pattern = r'hello'
text = "Hello World"
match = re.search(pattern, text, re.IGNORECASE)

# Common methods
re.findall(r'\w+', text)              # ['Hello', 'World']
re.sub(r'hello', 'Hi', text, flags=re.IGNORECASE)  # "Hi World"
re.split(r'\s+', text)                # ['Hello', 'World']

# Compiled patterns (more efficient for repeated use)
pattern = re.compile(r'\d+')
pattern.findall("123 and 456")        # ['123', '456']

Java

import java.util.regex.*;

// Basic usage
String pattern = "hello";
String text = "Hello World";
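// find() searches anywhere in the text; Pattern.matches() would require the whole string to match
boolean matches = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(text).find();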

// Pattern and Matcher objects
Pattern p = Pattern.compile("\\w+");
Matcher m = p.matcher(text);
while (m.find()) {
    System.out.println(m.group());    // Prints each word
}

// String methods
text.replaceAll("(?i)hello", "Hi");   // "Hi World"
text.split("\\s+");                   // ["Hello", "World"]

PHP

// Basic usage
$pattern = '/hello/i';
$text = "Hello World";
$matches = preg_match($pattern, $text);  // 1 if found

// Common functions
preg_match_all('/\w+/', $text, $matches);     // Find all words
preg_replace('/hello/i', 'Hi', $text);        // "Hi World"
preg_split('/\s+/', $text);                   // ["Hello", "World"]

// With capture groups
preg_match('/(\w+)\s+(\w+)/', $text, $matches);
// $matches[1] = "Hello", $matches[2] = "World"

Flags and Modifiers

Common Flags

# Case insensitive
/pattern/i      # JavaScript
(?i)pattern     # Inline flag
re.IGNORECASE   # Python

# Global (find all matches)
/pattern/g      # JavaScript
re.findall()    # Python (findall/finditer return all matches; there is no global flag)

# Multiline (^ and $ match line breaks)
/pattern/m      # JavaScript
re.MULTILINE    # Python

# Dot matches newline
/pattern/s      # JavaScript
re.DOTALL       # Python

# Extended (ignore whitespace, allow comments)
/pattern/x      # Some languages
re.VERBOSE      # Python
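
For the Python flags listed above, a brief sketch of how they are passed and combined. The sample strings are invented.

```python
import re

# re.VERBOSE: whitespace and "#" comments inside the pattern are ignored.
DATE_RE = re.compile(
    r"""
    (?P<year>\d{4})  -    # four-digit year
    (?P<month>\d{2}) -    # two-digit month
    (?P<day>\d{2})        # two-digit day
    """,
    re.VERBOSE,
)
year = DATE_RE.search("released 2024-03-15").group("year")
print(year)  # 2024

# re.DOTALL: "." also matches a newline.
no_dotall = re.search(r"a.b", "a\nb")
with_dotall = re.search(r"a.b", "a\nb", re.DOTALL)
print(no_dotall)                 # None
print(with_dotall is not None)   # True

# Flags are combined with "|".
items = re.findall(
    r"^item: (\w+)$", "Item: one\nITEM: two", re.IGNORECASE | re.MULTILINE
)
print(items)  # ['one', 'two']
```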

Performance Tips

Optimization Strategies

# Use specific character classes instead of .
\d+             # Better than .+ for digits
[a-zA-Z]+       # Better than .+ for letters

# Anchor patterns when possible
^pattern        # Faster when pattern should be at start
pattern$        # Faster when pattern should be at end

# Use non-capturing groups when you don't need the capture
(?:abc)+        # Better than (abc)+ if you don't need the group

# Avoid catastrophic backtracking
(a+)+b          # Dangerous pattern
a+b             # Better alternative

# Use atomic groups or possessive quantifiers
(?>a+)b         # Atomic group (some languages)
a++b            # Possessive quantifier (some languages)
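
To make the backtracking warning concrete, the sketch below times the nested and flat patterns on an input that cannot match. It assumes a backtracking engine such as CPython's `re`; absolute timings vary by machine, so none are quoted. Keep the input short, because the nested pattern's work roughly doubles with each extra character.

```python
import re
import time

def time_search(compiled, text):
    """Return (match, seconds) for a single search call."""
    start = time.perf_counter()
    match = compiled.search(text)
    return match, time.perf_counter() - start

# No trailing "b": the engine must try every way of splitting the a's.
text = "a" * 20

nested = re.compile(r"(a+)+b")   # nested quantifier: exponential backtracking
flat = re.compile(r"a+b")        # accepts the same strings, far less work

slow_match, slow_seconds = time_search(nested, text)
fast_match, fast_seconds = time_search(flat, text)

print(slow_match, fast_match)    # None None
print(f"nested: {slow_seconds:.4f}s  flat: {fast_seconds:.6f}s")
```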

Common Pitfalls

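# A leading .* can be slow and changes which match you get
.*expensive     # Scans to the end of the line, then backtracks to the LAST "expensive"
.*?expensive    # Stops at the FIRST "expensive"; faster when it occurs early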

# Alternation order matters
cat|catch       # "cat" will match first part of "catch"
catch|cat       # Better: longer alternative first

# Escape special characters
\.              # Literal dot
\$              # Literal dollar sign
\(              # Literal parenthesis
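
Two of the pitfalls above, shown in Python. `re.escape` is the standard-library helper for escaping literal text before it goes into a pattern; the input strings are invented.

```python
import re

user_input = "3.50 (USD)"

# Unescaped, "." matches any character and "(USD)" becomes a group,
# so unrelated text matches.
unescaped = re.search(user_input, "3x50 USD")
print(unescaped is not None)   # True (unintended)

# re.escape makes every special character literal.
safe = re.escape(user_input)
print(re.search(safe, "3x50 USD"))                  # None
print(re.search(safe, "total: 3.50 (USD)").group()) # 3.50 (USD)

# Alternation takes the first alternative that succeeds, left to right.
short_first = re.search(r"cat|catch", "catch").group()
long_first = re.search(r"catch|cat", "catch").group()
print(short_first)  # cat
print(long_first)   # catch
```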

Testing and Debugging

Online Tools

regex101.com - Interactive regex tester with explanations
regexr.com - Visual regex builder and tester
regexpal.com - Simple regex testing tool
regexper.com - Visual regex diagrams

Testing Strategies

# Start simple and build complexity
\d              # Start with basic digit matching
\d+             # Add quantifier
\d{2,4}         # Add specific range
^\d{2,4}$       # Add anchors
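
The build-up above can be checked step by step in Python by running each intermediate pattern against the same inputs. The helper name `first_match` and the sample strings are illustrative.

```python
import re

STEPS = [r"\d", r"\d+", r"\d{2,4}", r"^\d{2,4}$"]

def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group() if m else None

embedded = [first_match(p, "id 12345") for p in STEPS]
exact = [first_match(p, "2024") for p in STEPS]
empty = [first_match(p, "") for p in STEPS]

print(embedded)  # ['1', '12345', '1234', None]
print(exact)     # ['2', '2024', '2024', '2024']
print(empty)     # [None, None, None, None]
```

Only the final, anchored pattern rejects input that merely contains digits, which is why the anchors are added last.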

# Test edge cases
""              # Empty string
"a"             # Single character
"aaa...aaa"     # Very long strings
"special!@#"    # Special characters

This guide covers the core regex syntax and the common patterns used for text processing across programming languages and tools. Regex flavors differ in the details, so test each pattern against real input in the engine you plan to use.