Regular Expressions (RegEx) - Pattern Matching
Complete guide to regular expressions for pattern matching and text processing
Regular expressions (regex) are powerful pattern-matching tools used across programming languages, text editors, and command-line tools. This comprehensive guide covers regex syntax, common patterns, and practical examples for effective text processing.
Basic Syntax
Literal Characters
regex
# Exact character matching
hello # Matches "hello" exactly
123 # Matches "123" exactly
Hello World # Matches "Hello World" exactly
# Case sensitivity (depends on flags)
Hello # Matches "Hello" but not "hello" (case sensitive)
(?i)Hello # Matches "Hello", "hello", "HELLO" (case insensitive)
Metacharacters
regex
# Special characters with meaning
. # Matches any single character except newline
^ # Matches start of string/line
$ # Matches end of string/line
* # Matches 0 or more of preceding element
+ # Matches 1 or more of preceding element
? # Matches 0 or 1 of preceding element
| # OR operator (alternation)
() # Grouping
[] # Character class
{} # Quantifiers
\ # Escape character
Character Classes
regex
# Predefined character classes
\d # Any digit (0-9)
\D # Any non-digit
\w # Any word character (a-z, A-Z, 0-9, _)
\W # Any non-word character
\s # Any whitespace character (space, tab, newline)
\S # Any non-whitespace character
\n # Newline character
\t # Tab character
\r # Carriage return
# Custom character classes
[abc] # Matches 'a', 'b', or 'c'
[a-z] # Matches any lowercase letter
[A-Z] # Matches any uppercase letter
[0-9] # Matches any digit
[a-zA-Z] # Matches any letter
[a-zA-Z0-9] # Matches any alphanumeric character
[^abc] # Matches any character except 'a', 'b', or 'c'
[^0-9] # Matches any non-digit
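The classes above can be tried out with a minimal Python sketch. The sample text is invented for illustration. One detail to keep in mind: in Python 3, `\w` and `\d` are Unicode-aware on `str` patterns unless `re.ASCII` is passed, so `\w` matches more than the `a-z, A-Z, 0-9, _` set listed above.

```python
import re

# Sample text is illustrative only; \u00e9 is "e" with an acute accent.
text = "Order 66: caf\u00e9 costs 42!"

digits = re.findall(r"\d+", text)                  # runs of digits
digits_only = re.sub(r"[^0-9]", "", text)          # negated class: drop every non-digit
words_unicode = re.findall(r"\w+", text)           # \w is Unicode-aware by default
words_ascii = re.findall(r"\w+", text, re.ASCII)   # restrict \w to [a-zA-Z0-9_]

print(digits)         # ['66', '42']
print(digits_only)    # 6642
print(words_unicode)  # the accented word stays in one piece
print(words_ascii)    # the accented letter is dropped from the word
```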
Quantifiers
Basic Quantifiers
regex
# Exact repetition
a{3} # Matches exactly 3 'a's: "aaa"
a{2,5} # Matches 2 to 5 'a's: "aa", "aaa", "aaaa", "aaaaa"
a{3,} # Matches 3 or more 'a's: "aaa", "aaaa", etc.
a{,3} # Matches 0 to 3 'a's in Python and Ruby; other engines may treat it as literal text or reject it, so a{0,3} is more portable
# Common quantifiers
a* # Matches 0 or more 'a's (equivalent to a{0,})
a+ # Matches 1 or more 'a's (equivalent to a{1,})
a? # Matches 0 or 1 'a' (equivalent to a{0,1})
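A quick way to check quantifier bounds is to test whole strings with `re.fullmatch`. The helper name `full` below is hypothetical and exists only for this sketch.

```python
import re

def full(pattern: str, s: str) -> bool:
    """Return True when the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

print(full(r"a{3}", "aaa"))       # True
print(full(r"a{2,5}", "a"))       # False: too few
print(full(r"a{2,5}", "aaaaa"))   # True
print(full(r"a{2,5}", "aaaaaa"))  # False: too many
print(full(r"a{3,}", "aaaaaaa"))  # True
print(full(r"a{0,3}", ""))        # True: zero repetitions allowed
print(full(r"a*", ""))            # True
print(full(r"a+", ""))            # False: needs at least one
```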
Greedy vs Lazy Quantifiers
regex
# Greedy (default) - matches as much as possible
.* # Matches as many characters as possible
.+ # Matches as many characters as possible (at least 1)
.{2,5} # Matches as many characters as possible (2-5)
# Lazy (non-greedy) - matches as little as possible
.*? # Matches as few characters as possible
.+? # Matches as few characters as possible (at least 1)
.{2,5}? # Matches as few characters as possible (2-5)
# Example difference
String: "Hello World"
H.*o # Greedy: matches "Hello Wo" (to last 'o')
H.*?o # Lazy: matches "Hello" (to first 'o')
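The greedy and lazy example above can be reproduced directly in Python:

```python
import re

text = "Hello World"

greedy = re.search(r"H.*o", text).group()   # .* runs to the last 'o'
lazy = re.search(r"H.*?o", text).group()    # .*? stops at the first 'o'

print(greedy)  # Hello Wo
print(lazy)    # Hello
```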
Anchors and Boundaries
Position Anchors
regex
# String/line boundaries
^ # Start of string or line
$ # End of string or line
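\A # Start of string (never start of line, even in multiline mode)
\Z # End of string, or just before a final newline (PCRE, Java, .NET); in Python, the very end only
\z # Very end of string (PCRE, Java, .NET, Ruby)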
# Examples
^Hello # Matches "Hello" at start of line
World$ # Matches "World" at end of line
^Hello World$ # Matches entire line containing only "Hello World"
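How `^` and `$` behave depends on the multiline flag. The sketch below uses Python's `re` module; other engines differ in the details, especially for `\Z`.

```python
import re

text = "Hello\nWorld\n"

# Without re.MULTILINE, ^ matches only at the start of the string.
default_starts = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after each newline.
multiline_starts = re.findall(r"^\w+", text, re.MULTILINE)

# $ can match just before a trailing newline; Python's \Z matches only at the very end.
dollar_hit = re.search(r"World$", text) is not None
abs_end_hit = re.search(r"World\Z", text) is not None

print(default_starts)    # ['Hello']
print(multiline_starts)  # ['Hello', 'World']
print(dollar_hit)        # True
print(abs_end_hit)       # False
```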
Word Boundaries
regex
# Word boundaries
\b # Word boundary
\B # Non-word boundary
# Examples
\bcat\b # Matches "cat" as whole word, not in "category"
\Bcat\B # Matches "cat" only inside a word: "concatenate", not "cat" or "category"
\bcat # Matches "cat" at start of word: "cat", "category"
cat\b # Matches "cat" at end of word: "cat", "tomcat"
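The four boundary patterns can be compared on one sample string. In this sketch, `\w*` is added around each pattern only so that the output shows the whole word in which "cat" was found.

```python
import re

text = "cat category tomcat concatenate"

whole_word = re.findall(r"\bcat\b", text)        # "cat" as a standalone word
word_start = re.findall(r"\bcat\w*", text)       # words that begin with "cat"
word_end = re.findall(r"\w*cat\b", text)         # words that end with "cat"
word_inside = re.findall(r"\w*\Bcat\B\w*", text) # "cat" strictly inside a word

print(whole_word)   # ['cat']
print(word_start)   # ['cat', 'category']
print(word_end)     # ['cat', 'tomcat']
print(word_inside)  # ['concatenate']
```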
Groups and Capturing
Basic Grouping
regex
# Grouping with parentheses
(abc) # Groups "abc" together
(abc)+ # Matches one or more "abc" sequences
(abc|def) # Matches either "abc" or "def"
(abc){2,4} # Matches 2 to 4 "abc" sequences
# Non-capturing groups
(?:abc) # Groups without capturing
(?:abc|def)+ # Matches sequences of "abc" or "def"
Capturing Groups
regex
# Numbered captures
(abc)(def) # Capture group 1: "abc", group 2: "def"
(\d{4})-(\d{2})-(\d{2}) # Captures date parts: year, month, day
# Named captures
(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}) # Named groups (JavaScript, Java, .NET, PCRE)
(?P<name>\w+) # Python-style named group (also accepted by PCRE)
# Backreferences
(\w+)\s+\1 # Matches repeated words: "hello hello"
(["'])(.*?)\1 # Matches quoted strings with same quote type
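A short Python sketch of numbered groups, named groups, and backreferences, using the patterns above. The sample strings are invented for illustration, and the named groups use Python's `(?P<name>...)` spelling.

```python
import re

sample = "Released on 2024-03-15."

# Numbered groups: retrieved by position.
numbered = re.search(r"(\d{4})-(\d{2})-(\d{2})", sample)
print(numbered.groups())       # ('2024', '03', '15')

# Named groups: retrieved by name.
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", sample)
print(named.group("month"))    # 03

# Backreference \1 must repeat exactly what group 1 captured.
repeated = re.search(r"(\w+)\s+\1", "hello hello world")
print(repeated.group(0))       # hello hello

# The closing quote must be the same character as the opening one.
quoted = re.search(r"""(["'])(.*?)\1""", 'say "it\'s fine" now')
print(quoted.group(2))         # it's fine
```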
Lookahead and Lookbehind
Lookahead Assertions
regex
# Positive lookahead
\d+(?=\s*dollars) # Matches digits followed by "dollars"
\w+(?=@) # Matches username before @ in email
# Negative lookahead
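\d+(?!\s*cents) # Matches digits NOT followed by "cents" (can backtrack to a shorter run; use \d+\b to prevent that)
\w+(?!@) # Matches word characters NOT followed by @ (in "user@" it still matches "use")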
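Negative lookahead interacts with backtracking in a way that is easy to miss: when the assertion fails, the engine retries with a shorter match. The sketch below shows the effect and one possible fix, a word boundary after the digits. The sample text is invented.

```python
import re

text = "100 dollars and 50 cents"

positive = re.findall(r"\d+(?=\s*dollars)", text)

# Naive negative lookahead: "50" is rejected, but the engine backtracks
# to "5", which is followed by "0 cents" and so passes the assertion.
naive = re.findall(r"\d+(?!\s*cents)", text)

# \b after the digits forces the whole number to be consumed first.
fixed = re.findall(r"\b\d+\b(?!\s*cents)", text)

print(positive)  # ['100']
print(naive)     # ['100', '5']
print(fixed)     # ['100']
```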
Lookbehind Assertions
regex
# Positive lookbehind
(?<=\$)\d+ # Matches digits preceded by $
(?<=@)\w+ # Matches the first domain label after @ ("example" in "user@example.com")
# Negative lookbehind
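(?<!\$)\d+ # Matches digits NOT preceded by $ (in "$100" it still matches "00"; use (?<![$\d]) for whole numbers)
(?<!@)\w+ # Matches word characters NOT preceded by @ (in "@home" it still matches "ome")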
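Negative lookbehind has a similar trap: the match can simply start one character later. A minimal sketch, with an invented sample string, showing the problem and one way to require a whole number:

```python
import re

text = "$100 or 250"

priced = re.findall(r"(?<=\$)\d+", text)

# Naive version: the "1" is rejected, but "00" is preceded by "1", not "$".
naive = re.findall(r"(?<!\$)\d+", text)

# Also forbid a preceding digit so a match cannot start mid-number.
fixed = re.findall(r"(?<![\$\d])\d+", text)

print(priced)  # ['100']
print(naive)   # ['00', '250']
print(fixed)   # ['250']
```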
Common Patterns
Email Validation
regex
# Basic email pattern
\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b
# More comprehensive email
^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$
# Simple email validation
^\S+@\S+\.\S+$
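A minimal sketch of applying the basic email pattern in Python. The function name `looks_like_email` is hypothetical, and `fullmatch` stands in for the `\b` anchors so that the whole input has to match. Note that `|` inside a character class is a literal pipe, not alternation, which is why the top-level-domain class here is written `[A-Za-z]`. A pattern like this checks only the shape of an address, not whether it exists.

```python
import re

# Basic email shape; hypothetical helper for illustration.
BASIC_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def looks_like_email(s: str) -> bool:
    """Return True when the whole string has a plausible email shape."""
    return BASIC_EMAIL.fullmatch(s) is not None

print(looks_like_email("user@example.com"))                 # True
print(looks_like_email("first.last+tag@mail.example.org"))  # True
print(looks_like_email("user@example"))                     # False: no top-level domain
print(looks_like_email("user name@example.com"))            # False: space in local part

# Inside [...], "|" is just another allowed character.
print(re.fullmatch(r"[A-Z|a-z]{2,}", "c|m") is not None)    # True
```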
Phone Numbers
regex
# US phone numbers
\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4} # (123) 456-7890 or 123-456-7890
^\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$ # With optional country code
# International format
^\+?[1-9]\d{1,14}$ # E.164 format
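Validation is often followed by normalization. The sketch below adds capture groups to the US pattern above (an addition for illustration, not part of the original pattern) and rewrites any accepted input as `NNN-NNN-NNNN`. The function name `normalize_us_phone` is hypothetical.

```python
import re
from typing import Optional

# Same shape as the US pattern above, with the three digit runs captured.
US_PHONE = re.compile(r"\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})")

def normalize_us_phone(s: str) -> Optional[str]:
    """Return the number as NNN-NNN-NNNN, or None if the shape is wrong."""
    m = US_PHONE.fullmatch(s.strip())
    return "-".join(m.groups()) if m else None

print(normalize_us_phone("(123) 456-7890"))  # 123-456-7890
print(normalize_us_phone("123.456.7890"))    # 123-456-7890
print(normalize_us_phone("1234567890"))      # 123-456-7890
print(normalize_us_phone("12-3456-7890"))    # None
```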
URLs
regex
# Basic URL pattern
https?://[^\s]+
# More comprehensive URL
^https?://(?:[-\w.])+(?:\:[0-9]+)?(?:/(?:[\w/_.])*(?:\?(?:[\w&=%.])*)?(?:\#(?:[\w.])*)?)?$
# Domain validation
^(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}$
Dates
regex
# MM/DD/YYYY format
^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}$
# YYYY-MM-DD format (ISO 8601)
^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$
# Flexible date formats
\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b
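These date patterns check the shape of a date, not the calendar: `2023-02-31` passes the ISO pattern above. One common approach, sketched here with a hypothetical helper `parse_iso_date`, is to let the regex extract the parts and let `datetime.date` decide whether the day exists.

```python
import re
from datetime import date
from typing import Optional

# ISO pattern from above, with a capture group added around the year.
ISO_DATE = re.compile(r"(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")

def parse_iso_date(s: str) -> Optional[date]:
    """Return a date for a valid YYYY-MM-DD string, else None."""
    m = ISO_DATE.fullmatch(s)
    if m is None:
        return None              # wrong shape
    year, month, day = map(int, m.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None              # right shape, impossible date

print(parse_iso_date("2024-02-29"))  # 2024-02-29
print(parse_iso_date("2023-02-31"))  # None
print(parse_iso_date("2023-13-01"))  # None
```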
Credit Card Numbers
regex
# Visa (starts with 4, 13 or 16 digits)
^4\d{12}(?:\d{3})?$
# MasterCard (starts with 51-55, 16 digits; the newer 2221-2720 range is not covered)
^5[1-5]\d{14}$
# American Express (starts with 34 or 37, 15 digits)
^3[47]\d{13}$
# General credit card (with optional spaces/dashes)
^\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}$
IP Addresses
regex
# IPv4 address
^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$
# IPv6 address (simplified)
^(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}$
# IPv4 or IPv6
^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$|^(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}$
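A short Python check of the two address patterns, using `fullmatch` in place of the `^...$` anchors. It also shows the limit of the simplified IPv6 pattern: compressed forms such as `::1` are rejected.

```python
import re

OCTET = r"(?:25[0-5]|2[0-4]\d|[01]?\d\d?)"
IPV4 = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")
IPV6_FULL = re.compile(r"(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}")

def is_ipv4(s: str) -> bool:
    return IPV4.fullmatch(s) is not None

def is_full_ipv6(s: str) -> bool:
    return IPV6_FULL.fullmatch(s) is not None

print(is_ipv4("192.168.1.1"))   # True
print(is_ipv4("256.1.1.1"))     # False: octet out of range
print(is_ipv4("1.2.3"))         # False: only three octets
print(is_full_ipv6("2001:0db8:85a3:0000:0000:8a2e:0370:7334"))  # True
print(is_full_ipv6("::1"))      # False: compressed form not covered
```

For production validation, Python's standard `ipaddress` module handles compressed and mixed notations that these patterns do not.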
Language-Specific Examples
JavaScript
javascript
// Basic regex usage
const pattern = /hello/i; // Case insensitive
const text = "Hello World";
console.log(pattern.test(text)); // true
// String methods with regex
text.match(/\w+/g); // ["Hello", "World"]
text.replace(/hello/i, "Hi"); // "Hi World"
text.split(/\s+/); // ["Hello", "World"]
// Constructor syntax
const regex = new RegExp("hello", "i");
Python
python
import re
# Basic usage
pattern = r'hello'
text = "Hello World"
match = re.search(pattern, text, re.IGNORECASE)
# Common methods
re.findall(r'\w+', text) # ['Hello', 'World']
re.sub(r'hello', 'Hi', text, flags=re.IGNORECASE) # "Hi World"
re.split(r'\s+', text) # ['Hello', 'World']
# Compiled patterns (more efficient for repeated use)
pattern = re.compile(r'\d+')
pattern.findall("123 and 456") # ['123', '456']
Java
java
import java.util.regex.*;
// Basic usage
String pattern = "hello";
String text = "Hello World";
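// Pattern.matches() requires the ENTIRE input to match, so use find() for a substring search
boolean matches = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(text).find(); // true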
// Pattern and Matcher objects
Pattern p = Pattern.compile("\\w+");
Matcher m = p.matcher(text);
while (m.find()) {
System.out.println(m.group()); // Prints each word
}
// String methods
text.replaceAll("(?i)hello", "Hi"); // "Hi World"
text.split("\\s+"); // ["Hello", "World"]
PHP
php
// Basic usage
$pattern = '/hello/i';
$text = "Hello World";
$matches = preg_match($pattern, $text); // 1 if found
// Common functions
preg_match_all('/\w+/', $text, $matches); // Find all words
preg_replace('/hello/i', 'Hi', $text); // "Hi World"
preg_split('/\s+/', $text); // ["Hello", "World"]
// With capture groups
preg_match('/(\w+)\s+(\w+)/', $text, $matches);
// $matches[1] = "Hello", $matches[2] = "World"
Flags and Modifiers
Common Flags
regex
# Case insensitive
/pattern/i # JavaScript
(?i)pattern # Inline flag
re.IGNORECASE # Python
# Global (find all matches)
/pattern/g # JavaScript
re.findall() # Python (returns all matches; no flag needed)
# Multiline (^ and $ also match at the start and end of each line)
/pattern/m # JavaScript
re.MULTILINE # Python
# Dot matches newline
/pattern/s # JavaScript (ES2018+)
re.DOTALL # Python
# Extended (ignore whitespace, allow comments)
/pattern/x # Perl, PCRE (PHP), Ruby; not JavaScript
re.VERBOSE # Python
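The flags above can be combined and tested in a few lines of Python. This sketch shows `re.VERBOSE` for a commented pattern, `re.DOTALL` for the dot, and the inline `(?i)` form. The sample strings are invented.

```python
import re

# re.VERBOSE: whitespace in the pattern is ignored and # starts a comment.
DATE = re.compile(
    r"""
    (?P<year>\d{4})  -   # four-digit year
    (?P<month>\d{2}) -   # two-digit month
    (?P<day>\d{2})       # two-digit day
    """,
    re.VERBOSE,
)
parts = DATE.search("Released 2024-03-15").groupdict()
print(parts)  # {'year': '2024', 'month': '03', 'day': '15'}

# re.DOTALL: by default the dot does not match a newline.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)
print(dot_default is None, dot_all is not None)  # True True

# Inline flag: same effect as passing re.IGNORECASE.
inline = re.search(r"(?i)hello", "HELLO")
print(inline.group())  # HELLO
```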
Performance Tips
Optimization Strategies
regex
# Use specific character classes instead of .
\d+ # Better than .+ for digits
[a-zA-Z]+ # Better than .+ for letters
# Anchor patterns when possible
^pattern # Faster when pattern should be at start
pattern$ # Faster when pattern should be at end
# Use non-capturing groups when you don't need the capture
(?:abc)+ # Better than (abc)+ if you don't need the group
# Avoid catastrophic backtracking
(a+)+b # Dangerous pattern
a+b # Better alternative
# Use atomic groups or possessive quantifiers
(?>a+)b # Atomic group (some languages)
a++b # Possessive quantifier (some languages)
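The cost of nested quantifiers shows up when a match fails. The sketch below times `(a+)+b` and `a+b` against strings of `a` with no `b`. Both patterns accept the same strings, but the nested form explores far more ways to split the input in a backtracking engine such as Python's `re`. The helper name `time_failed_match` is hypothetical, timings vary by machine, and the input sizes are kept small on purpose.

```python
import re
import time

NESTED = re.compile(r"(a+)+b")  # nested quantifiers
FLAT = re.compile(r"a+b")       # accepts the same strings

def time_failed_match(pattern, n):
    """Time one match attempt against n 'a's with no trailing 'b'."""
    subject = "a" * n
    start = time.perf_counter()
    result = pattern.match(subject)
    return result, time.perf_counter() - start

for n in (10, 14, 18):
    _, nested_s = time_failed_match(NESTED, n)
    _, flat_s = time_failed_match(FLAT, n)
    print(f"n={n}: nested {nested_s:.5f}s, flat {flat_s:.5f}s")
```

Each extra character roughly doubles the work for the nested pattern, so inputs only a little longer than these can take seconds or more.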
Common Pitfalls
regex
# Unbounded wildcards can be slow
.*expensive # Scans to the end of the line, then backtracks: slow on long strings
.*?expensive # Lazy version: faster when the match is near the start, not in general
# Alternation order matters
cat|catch # "cat" will match first part of "catch"
catch|cat # Better: longer alternative first
# Escape special characters
\. # Literal dot
\$ # Literal dollar sign
\( # Literal parenthesis
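Two of the pitfalls above, alternation order and unescaped metacharacters, can be reproduced in Python. `re.escape` is the standard-library way to escape a literal string before embedding it in a pattern.

```python
import re

# Alternation is tried left to right; the first alternative that works wins.
short_first = re.search(r"cat|catch", "catch").group()
long_first = re.search(r"catch|cat", "catch").group()
print(short_first)  # cat
print(long_first)   # catch

# An unescaped dot matches any character, not just a literal ".".
unescaped_hit = re.search(r"3.50", "3x50") is not None
escaped_hit = re.search(re.escape("3.50"), "3x50") is not None
print(unescaped_hit)  # True
print(escaped_hit)    # False

# Escaping user-supplied text lets it be searched for literally.
price = "$3.50 (approx)"
found = re.search(re.escape(price), "Total: $3.50 (approx) today")
print(found.group())  # $3.50 (approx)
```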
Testing and Debugging
Online Tools
- regex101.com - Interactive regex tester with explanations
- regexr.com - Visual regex builder and tester
- regexpal.com - Simple regex testing tool
- regexper.com - Visual regex diagrams
Testing Strategies
regex
# Start simple and build complexity
\d # Start with basic digit matching
\d+ # Add quantifier
\d{2,4} # Add specific range
^\d{2,4}$ # Add anchors
# Test edge cases
"" # Empty string
"a" # Single character
"aaa...aaa" # Very long strings
"special!@#" # Special characters
This comprehensive regex guide covers the essential patterns and techniques needed for effective text processing across different programming languages and tools. Practice with real examples to master these powerful pattern-matching capabilities.