How to Extract Email Addresses from Text — Methods, Patterns & Pitfalls

Learn how email extraction works, which regex patterns catch real addresses, what edge cases break naive approaches, and when to use a dedicated tool instead.

The Quick Answer

An email address follows the format local-part@domain. A practical way to extract emails from a block of text is to use a regex pattern that matches this structure:

[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}

This pattern catches the vast majority of real-world email addresses. It is not a perfect validator — no single regex is — but it works well for extraction from unstructured text.

For quick extraction without writing code, paste your text into an email address extractor and get deduplicated results instantly.

What Counts as a Valid Email Address?

The format of email addresses is defined by RFC 5321 and RFC 5322. The rules are more permissive than most people expect.

An email address has two parts separated by @:

  • Local part (before the @): The mailbox name
  • Domain part (after the @): The mail server

Local Part Rules

The local part can contain:

  • Letters (a–z, A–Z)
  • Digits (0–9)
  • Dots (.), but not at the start or end, and not two in a row
  • Special characters: ! # $ % & ' * + - / = ? ^ _ ` { | } ~
  • Quoted strings (e.g., "john doe"@example.com) — technically valid but rare in practice

Domain Part Rules

The domain must:

  • Contain at least one dot
  • Use only letters, digits, and hyphens in each label
  • Not start or end a label with a hyphen
  • End with a top-level domain (TLD) of at least two characters

Examples of valid addresses:

Address             Why It's Valid
[email protected]   Standard format
[email protected]   Dots in local part
[email protected]   Plus addressing (common in Gmail)
[email protected]   Subdomain with country-code TLD
[email protected]   IP address domain (rare but valid)
[email protected]   Single-character local part

Examples that look wrong but are technically valid:

Address                         Note
"spaces allowed"@example.com    Quoted local part
user@[192.168.1.1]              IP literal domain
[email protected]               Many dots, still valid

In practice, most extraction tasks target standard addresses. Exotic formats like quoted local parts or IP literal domains are rare enough to ignore for most use cases.

The Standard Extraction Regex

Here is the most commonly used pattern for extracting email addresses from text:

[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}

What each part matches:

Segment              Matches
[A-Za-z0-9._%+-]+    One or more local-part characters
@                    The literal @ symbol
[A-Za-z0-9.-]+       One or more domain characters
\.[A-Za-z]{2,}       A dot followed by a TLD of two or more letters

This pattern handles the large majority of real-world email addresses you will encounter in documents, logs, email threads, and web pages.
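As a rough illustration, the same pattern can be written in Python with re.VERBOSE so that each segment carries its own comment. The sample text and addresses below are made up for the example.

```python
import re

# The extraction pattern from above, split into annotated segments.
EMAIL_RE = re.compile(
    r"""
    [A-Za-z0-9._%+-]+   # local part: one or more allowed characters
    @                   # the literal separator
    [A-Za-z0-9.-]+      # domain labels, including any subdomains
    \.[A-Za-z]{2,}      # final dot plus a TLD of two or more letters
    """,
    re.VERBOSE,
)

# Illustrative input only; these addresses are invented.
sample = "Write to alice.smith@example.com or bob+news@mail.example.co.uk today."

found = EMAIL_RE.findall(sample)
print(found)
```

Compiling the pattern once and reusing it is a small convenience when the same expression is applied to many strings.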

What This Pattern Misses

  • Quoted local parts: "john doe"@example.com
  • IP literal domains: user@[10.0.0.1]
  • Internationalized email addresses (EAI): 用户@例え.jp

For extraction purposes, these omissions are acceptable. If you need strict RFC 5322 compliance, a single regex is impractical for most tasks: the widely circulated pattern for the older RFC 822 grammar already runs to more than 6,000 characters.

Common Edge Cases That Break Naive Patterns

When extracting emails from real text, several patterns can produce false positives or miss valid addresses.

1. Trailing Punctuation

Text often places email addresses next to punctuation:

Contact us at [email protected].

A greedy pattern might capture the trailing period as part of the domain. The standard pattern above handles this correctly because \.[A-Za-z]{2,} requires letters after the final dot, but simpler patterns can fail here.
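A small comparison, using a made-up address, shows the difference. The loose \S+@\S+ pattern is included only as a contrast and is not recommended.

```python
import re

STANDARD = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
NAIVE = re.compile(r"\S+@\S+")  # deliberately loose, for comparison only

# Illustrative sentence ending in a period right after the address.
sentence = "Contact us at help@example.com."

standard_hits = STANDARD.findall(sentence)
naive_hits = NAIVE.findall(sentence)

print(standard_hits)  # ['help@example.com']
print(naive_hits)     # ['help@example.com.']  (trailing period included)
```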

2. Emails Inside Angle Brackets

Email headers and mailto links often wrap addresses:

From: John Doe <[email protected]>

The extraction regex still works — it matches the address inside the brackets. But if you need to handle this explicitly, strip angle brackets in a pre-processing step.
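When the input is known to be a mail header value rather than free text, Python's standard library can split display names from addresses without any regex. The sketch below assumes a header value with invented names and addresses.

```python
from email.utils import getaddresses

# Illustrative header value; names and addresses are made up.
header_values = ["John Doe <john.doe@example.com>, support@example.org"]

# getaddresses() returns (display_name, address) pairs.
pairs = getaddresses(header_values)

# Keep only the address part and drop any empty results.
addresses = [addr for _name, addr in pairs if addr]
print(addresses)
```

This approach suits structured headers. For unstructured text, the regex remains the simpler choice.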

3. Obfuscated Addresses

Some text deliberately disguises emails:

  • user [at] example [dot] com
  • user(at)example(dot)com
  • user @ example . com

Standard regex will not catch these. If you need to handle obfuscation, you will need a separate normalization step before extraction.
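One possible normalization step is sketched below. It only handles the bracketed [at]/(at) and [dot]/(dot) forms. The spaced-out form (user @ example . com) is left alone, because collapsing spaces around dots risks merging ordinary sentences. The function name deobfuscate and the sample text are illustrative.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def deobfuscate(text: str) -> str:
    """Rewrite bracketed 'at' and 'dot' tokens as @ and . (hypothetical helper)."""
    # "[at]" or "(at)", with optional surrounding whitespace, becomes "@".
    text = re.sub(r"\s*[\[(]\s*at\s*[\])]\s*", "@", text, flags=re.IGNORECASE)
    # "[dot]" or "(dot)", with optional surrounding whitespace, becomes ".".
    text = re.sub(r"\s*[\[(]\s*dot\s*[\])]\s*", ".", text, flags=re.IGNORECASE)
    return text

# Illustrative input with two obfuscation styles.
raw = "Reach me at user [at] example [dot] com or admin(at)example(dot)org."

emails = EMAIL_RE.findall(deobfuscate(raw))
print(emails)
```

Because the substitution also rewrites unrelated text such as "(at)" in a normal sentence, it is safer to run it on a copy used only for extraction.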

4. Plus Addressing

Gmail and other providers support + tags:

[email protected]
[email protected]

These are distinct delivery addresses that share the same mailbox. The standard pattern correctly captures them. Whether you want to deduplicate by stripping the +tag portion depends on your use case.

5. Long or Unusual TLDs

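Long TLDs such as .photography, .international, and .museum are in active use. The {2,} quantifier in the pattern accommodates these. Older patterns that hardcoded {2,4} either miss them (when the pattern ends with a word boundary) or silently truncate them, for example capturing only .phot from .photography.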
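The effect of the quantifier can be checked directly. In this sketch, which uses an invented address, the {2,4} variant without a trailing anchor returns a truncated match rather than nothing, which is arguably worse than a miss because it looks plausible.

```python
import re

OLD = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")
NEW = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Illustrative address on a long TLD.
text = "Bookings: studio@lens.photography"

old_hits = OLD.findall(text)
new_hits = NEW.findall(text)

print(old_hits)  # ['studio@lens.phot']  (TLD cut off after four letters)
print(new_hits)  # ['studio@lens.photography']
```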

Practical Extraction Methods

Method 1: Online Tool

The fastest approach for one-off tasks. Paste text into an email address extractor, and get deduplicated results with domain breakdown, copy-to-clipboard, and CSV export. No code required, and the text never leaves your browser.

Method 2: Command Line (grep)

For files on your computer:

grep -oEi '[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}' input.txt | sort -u

  • -o prints only the matched text
  • -E enables extended regex
  • -i makes it case-insensitive
  • sort -u sorts the output and removes exact duplicates

Method 3: Python Script

For more control over processing:

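import re

# Read the whole file; an explicit encoding avoids platform-dependent defaults.
with open("input.txt", encoding="utf-8") as f:
    text = f.read()

pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'

# set() removes exact duplicates; sorted() returns them in alphabetical order.
emails = sorted(set(re.findall(pattern, text)))

for email in emails:
    print(email)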

Method 4: JavaScript (In Browser)

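// Visible text of the current page (markup and attributes are excluded).
const text = document.body.innerText;
// match() returns null when nothing is found, so fall back to an empty array.
const emails = text.match(
  /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g
) || [];
// A Set keeps one copy of each exact string.
const unique = [...new Set(emails)];
console.log(unique);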

Browser-based extraction tools typically work along the same lines.

Deduplication Strategies

Extracted lists often contain duplicates. How you deduplicate depends on what you need:

Exact match deduplication: Remove entries that are identical character-for-character. This is the default and safest approach.

Case-insensitive deduplication: Email addresses are case-insensitive in the domain part (per RFC) and effectively case-insensitive in the local part for most providers. Converting to lowercase before deduplication is usually safe.

Plus-tag normalization: If [email protected] and [email protected] should count as one address, strip everything from the + up to (but not including) the @ before comparing.
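The three strategies can be combined in one small helper. This is a minimal sketch: the function names normalize and dedupe, the strip_plus_tag flag, and the sample addresses are all illustrative. It keeps the first spelling seen for each address and compares on a normalized key.

```python
def normalize(address: str, strip_plus_tag: bool = False) -> str:
    """Build a comparison key: lowercase, optionally without the +tag."""
    local, _, domain = address.rpartition("@")
    local = local.lower()
    if strip_plus_tag:
        # Drop the "+" and everything after it in the local part.
        local = local.split("+", 1)[0]
    return f"{local}@{domain.lower()}"

def dedupe(addresses, strip_plus_tag: bool = False):
    """Return addresses in original order, skipping normalized duplicates."""
    seen = set()
    result = []
    for address in addresses:
        key = normalize(address, strip_plus_tag)
        if key not in seen:
            seen.add(key)
            result.append(address)
    return result

# Illustrative input.
raw = ["Jane@Example.com", "jane@example.com", "jane+news@example.com"]

case_insensitive = dedupe(raw)
plus_stripped = dedupe(raw, strip_plus_tag=True)
print(case_insensitive)  # ['Jane@Example.com', 'jane+news@example.com']
print(plus_stripped)     # ['Jane@Example.com']
```

Exact-match deduplication needs no helper at all: a plain set() over the extracted strings is enough.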

Sorting and Organizing Results

After extraction, you may want to organize the results:

  • Alphabetical sort: Useful for scanning large lists
  • Group by domain: Shows which organizations appear most often
  • Count by domain: Identifies the most common email providers in the dataset

Domain grouping is particularly helpful when processing contact lists, conference attendee data, or lead databases.
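Grouping and counting by domain takes only a few lines with the standard library. The addresses below are invented for the example.

```python
from collections import Counter, defaultdict

# Illustrative, already-extracted addresses.
emails = ["bob@example.com", "ann@example.com", "cy@example.org"]

# Group addresses under their lowercased domain, in alphabetical order.
by_domain = defaultdict(list)
for email in sorted(emails):
    domain = email.rsplit("@", 1)[1].lower()
    by_domain[domain].append(email)

# Count how many addresses each domain contributes.
counts = Counter({domain: len(group) for domain, group in by_domain.items()})

for domain, total in counts.most_common():
    print(domain, total, by_domain[domain])
```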

What Email Extraction Cannot Do

Extraction finds text that looks like an email address. It does not:

  • Verify deliverability: The address might not exist
  • Check for typos: [email protected] passes extraction
  • Detect role vs. personal: It cannot tell info@ from jane@
  • Guarantee consent: Finding an address does not mean you have permission to email it

For validation beyond pattern matching, you would need DNS MX record checks, SMTP verification, or a dedicated validation service.

Common Mistakes

Using Too Strict a Pattern

Patterns that require exactly 3-letter TLDs ({3}) miss .io, .co, .uk, .photography, and hundreds of others. Use {2,} instead.

Using Too Loose a Pattern

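A pattern like \S+@\S+ matches any run of non-space characters around an @, so it accepts strings such as a@b or user@localhost and swallows adjacent punctuation like angle brackets and trailing commas. The domain part needs at least one dot and a valid TLD.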

Forgetting About Context

If you extract from HTML source code, you may pick up emails from mailto: links, metadata, or JavaScript strings. Decide whether you want all of these or just visible text.
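One way to keep these sources apart is to parse the HTML instead of running the regex over raw source. The sketch below uses the standard library's html.parser. The class name EmailCollector and the sample markup are illustrative, and real pages may need a more forgiving parser.

```python
import re
from html.parser import HTMLParser

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

class EmailCollector(HTMLParser):
    """Collect addresses from visible text and mailto: links separately."""

    def __init__(self):
        super().__init__()
        self.visible = set()
        self.mailto = set()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().startswith("mailto:"):
                self.mailto.update(EMAIL_RE.findall(href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text nodes outside script/style count as visible text here.
        if not self._skip_depth:
            self.visible.update(EMAIL_RE.findall(data))

# Illustrative markup with one address in each kind of location.
html = (
    "<p>Write to sales@example.com</p>"
    '<a href="mailto:hidden@example.com">Email us</a>'
    '<script>var x = "tracker@example.net";</script>'
)

collector = EmailCollector()
collector.feed(html)
collector.close()
print(collector.visible, collector.mailto)
```

Addresses inside the script block are ignored in this sketch. Collecting them as a third category would be a small extension.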

Not Handling Character Encoding

Text copied from PDFs or Word documents sometimes contains invisible characters (such as zero-width spaces or soft hyphens) or non-standard quotation marks around email addresses. Decoding the text as UTF-8 and stripping those invisible characters before extraction helps.
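A cleanup pass might look like the following sketch. The set of characters to strip is an assumption that covers a few common offenders, not an exhaustive list, and clean_text is a hypothetical helper name.

```python
import re
import unicodedata

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Zero-width space/joiners, word joiner, BOM, and soft hyphen mapped to None,
# which tells str.translate() to delete them.
INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff\u00ad"))

def clean_text(raw: str) -> str:
    """Normalize compatibility forms, then delete invisible characters."""
    text = unicodedata.normalize("NFKC", raw)
    return text.translate(INVISIBLE)

# Illustrative input: curly quotes around an address that hides a
# zero-width space just before the @ sign.
raw = "Mail \u201cinfo\u200b@example.com\u201d now"

before = EMAIL_RE.findall(raw)
after = EMAIL_RE.findall(clean_text(raw))
print(before)  # []
print(after)   # ['info@example.com']
```

The curly quotes need no special handling here, since they fall outside the pattern's character classes and simply end the match.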

Frequently Asked Questions

How do I extract email addresses from a PDF?

Copy the text from the PDF first (select all → copy), then paste it into an extraction tool or script. If the PDF is image-based (scanned), you need OCR software to convert it to text before extraction.

Can I extract emails from a web page?

Yes. Copy the visible text from the page, or use browser developer tools to access the full HTML source. The JavaScript method above works directly in the browser console on any page.

What is the maximum length of an email address?

The total length must not exceed 254 characters. The local part can be up to 64 characters, and the domain up to 253 characters.
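These limits make a cheap post-extraction filter. The sketch below applies the three numbers quoted above and assumes ASCII addresses, where one character is one octet. The function name within_length_limits is illustrative.

```python
MAX_TOTAL = 254   # whole address
MAX_LOCAL = 64    # part before the @
MAX_DOMAIN = 253  # part after the @

def within_length_limits(address: str) -> bool:
    """Check an extracted address against the length limits above."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        return False  # no @ at all
    return (
        len(address) <= MAX_TOTAL
        and 1 <= len(local) <= MAX_LOCAL
        and 1 <= len(domain) <= MAX_DOMAIN
    )

# Illustrative checks.
ok = within_length_limits("a@example.com")
too_long_local = within_length_limits("x" * 65 + "@example.com")
print(ok, too_long_local)  # True False
```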

Are email addresses case-sensitive?

The domain part is always case-insensitive. The local part is technically case-sensitive per RFC 5321, but virtually all providers treat it as case-insensitive. Lowercasing during deduplication is standard practice.

Why does my regex miss some valid emails?

Common reasons: the TLD length is restricted too tightly, plus signs or hyphens are not included in the character class, or the pattern does not handle subdomains (multiple dots in the domain).

Is it legal to extract email addresses from public text?

Extraction itself is a text-processing operation. How you use the extracted addresses determines legality. Sending unsolicited commercial email may violate CAN-SPAM (US), GDPR (EU), CASL (Canada), or other regulations depending on jurisdiction and context.

How do I handle emails with international characters?

Internationalized Email Addresses (EAI) can contain Unicode characters in both the local and domain parts. Standard ASCII regex will not match these. If you need to support them, use a Unicode-aware pattern or a dedicated library.
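As a rough sketch of a Unicode-aware pattern, Python's \w already matches letters and digits from any script in str patterns. The expression below is an assumption for illustration, not a validator for EAI. In running text without spaces (common in Chinese or Japanese), the local part will absorb the preceding words.

```python
import re

ASCII_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# \w covers Unicode letters, digits, and underscore;
# [^\W\d_] narrows that to letters only for the TLD.
UNICODE_RE = re.compile(r"[\w.%+-]+@[\w.-]+\.[^\W\d_]{2,}")

# Sample text with one internationalized and one ASCII address.
text = "Contact 用户@例え.jp or anna@example.com"

ascii_hits = ASCII_RE.findall(text)
unicode_hits = UNICODE_RE.findall(text)
print(ascii_hits)    # ['anna@example.com']
print(unicode_hits)  # ['用户@例え.jp', 'anna@example.com']
```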

What is plus addressing and should I deduplicate it?

Plus addressing (e.g., [email protected]) routes to the same mailbox as [email protected]. Whether to normalize depends on your goal: for contact deduplication, strip the tag; for mailing list management, keep it as-is since the user chose that specific address.

How many emails can I extract at once?

Browser-based tools handle thousands of emails without issues since the processing runs locally. Command-line tools and scripts can process files of any size. The practical limit is your available memory, not the extraction logic.
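For files too large to read in one go, a script can scan line by line, so that only the set of addresses seen so far is held in memory. This is a minimal sketch: iter_unique_emails is a hypothetical helper, and it assumes addresses never span a line break.

```python
import re
from typing import Iterable, Iterator

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def iter_unique_emails(lines: Iterable[str]) -> Iterator[str]:
    """Yield each address once, comparing case-insensitively."""
    seen = set()
    for line in lines:
        for match in EMAIL_RE.findall(line):
            key = match.lower()
            if key not in seen:
                seen.add(key)
                yield match

# Illustrative input; any iterable of strings works, including an open file.
lines = ["a@example.com b@example.com\n", "A@example.com\n"]

unique = list(iter_unique_emails(lines))
print(unique)
```

With a real file, the same function can be fed an open file object, for example one opened with encoding="utf-8" and errors="replace", since iterating over a file yields one line at a time.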

Can extraction tools find obfuscated emails?

Standard regex cannot match addresses written as user [at] domain [dot] com. You would need a pre-processing step to normalize these patterns before running extraction.
