How to Remove Duplicate Lines from Text: 5 Methods Compared

Learn how to remove duplicate lines using online tools, the command line, text editors, spreadsheets, and scripting — with examples for each method.

The Quick Answer

To remove duplicate lines from text, paste your lines into a duplicate line remover and copy the unique output. For files, use awk '!seen[$0]++' file.txt on Linux/macOS to deduplicate while preserving order, or sort file.txt | uniq if sorting is acceptable.

Why Duplicates Appear

Duplicate lines show up in text data for predictable reasons:

  • Copy-paste errors — pasting the same block twice
  • Log files — repeated events generating identical entries
  • Data exports — overlapping date ranges producing duplicate rows
  • List merging — combining lists from different sources without deduplication
  • Web scraping — pagination or retry logic capturing the same content twice

The right removal method depends on your data size, whether order matters, and whether you need a one-time fix or a repeatable process.

Method 1: Online Duplicate Line Remover

Best for: Quick, one-time deduplication of small-to-medium lists (up to tens of thousands of lines).

Steps:

  1. Open a duplicate line remover tool.
  2. Paste your text (one item per line).
  3. Choose options: case sensitivity, whitespace trimming, sorting.
  4. Copy the unique output.

Advantages: No installation required. Instant results. Options for case sensitivity and whitespace handling.

Limitations: Not practical for very large files (hundreds of MB) or automated pipelines. Data stays in your browser — nothing is uploaded — but you have to paste and copy manually.

Example

Input:

alice@example.com
bob@example.com
alice@example.com
carol@example.com
bob@example.com

Output (3 unique lines, 2 duplicates removed):

alice@example.com
bob@example.com
carol@example.com
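
The tool's options map to simple transformations. A minimal Python sketch of the same pipeline (the option names here are illustrative, not taken from any specific tool):

```python
def dedupe(text, case_sensitive=True, trim=True, sort_output=False):
    """Remove duplicate lines the way a typical online tool does."""
    seen = set()
    result = []
    for line in text.splitlines():
        if trim:
            line = line.strip()  # drop invisible leading/trailing whitespace
        key = line if case_sensitive else line.lower()
        if key not in seen:
            seen.add(key)
            result.append(line)
    return sorted(result) if sort_output else result

print(dedupe("alice@example.com\nbob@example.com\nalice@example.com"))
# → ['alice@example.com', 'bob@example.com']
```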

Method 2: Command Line (Linux / macOS)

Best for: Large files, scripting, and automated workflows.

Preserve original order with awk

awk '!seen[$0]++' input.txt > output.txt

How it works: awk reads each line and uses the associative array seen to track which lines have appeared. The first time a line is encountered, seen[$0] is 0, so !seen[$0] is true and the line is printed; the post-increment then sets the count to 1. On every subsequent encounter the value is nonzero, so the line is skipped.
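
For comparison, the same first-occurrence-wins logic fits in one line of Python, since dict keys are unique and preserve insertion order (Python 3.7+):

```python
lines = ["b", "a", "b", "c", "a"]

# dict.fromkeys drops repeated keys and keeps insertion order,
# so the first occurrence of each line wins, as in the awk one-liner
unique = list(dict.fromkeys(lines))
print(unique)  # → ['b', 'a', 'c']
```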

Sort first, then deduplicate with sort | uniq

sort input.txt | uniq > output.txt

uniq only removes adjacent duplicates, so sort is required first. This changes the line order.
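
The adjacency requirement is easy to demonstrate: Python's itertools.groupby behaves like uniq, collapsing only consecutive equal items (a sketch for illustration):

```python
from itertools import groupby

lines = ["b", "a", "b", "b", "a"]

# groupby, like uniq, only merges *adjacent* equal lines
print([k for k, _ in groupby(lines)])          # → ['b', 'a', 'b', 'a']

# sorting first makes all duplicates adjacent, like sort | uniq
print([k for k, _ in groupby(sorted(lines))])  # → ['a', 'b']
```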

To count how many times each line appears:

sort input.txt | uniq -c | sort -rn

This produces a frequency count, sorted from most to least common — useful for finding the most repeated entries.
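
In a script, collections.Counter produces the same frequency table without an external sort (illustrative sketch):

```python
from collections import Counter

lines = ["error", "ok", "error", "error", "ok", "warn"]

# most_common() orders from most to least frequent, like sort -rn
for line, count in Counter(lines).most_common():
    print(count, line)
# prints:
# 3 error
# 2 ok
# 1 warn
```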

Case-insensitive deduplication

awk '!seen[tolower($0)]++' input.txt

Or with sort:

sort -f input.txt | uniq -i

Windows (PowerShell)

Get-Content input.txt | Sort-Object -Unique | Set-Content output.txt

To preserve order:

$seen = @{}
Get-Content input.txt | ForEach-Object { if (-not $seen.ContainsKey($_)) { $seen[$_] = $true; $_ } } | Set-Content output.txt

(Note that PowerShell hashtable keys are case-insensitive by default, so this also deduplicates case-insensitively.)

Method 3: Text Editors

Best for: Interactive editing of files you already have open.

VS Code

  1. Select all text (Ctrl+A).
  2. Open the Command Palette (Ctrl+Shift+P).
  3. Run "Delete Duplicate Lines" to remove duplicates while preserving order. ("Sort Lines Ascending" sorts but does not deduplicate on its own.)

Sublime Text

  1. Select all lines.
  2. Edit → Permute Lines → Unique (removes duplicates, preserves order within the selection).

Notepad++ (Windows)

  1. Edit → Line Operations → Remove Duplicate Lines.
  2. Choose between removing consecutive duplicates or all duplicates across the file.

Vim

Sort and deduplicate:

:%sort u

Remove duplicates without sorting (preserves relative order, but note this idiom keeps the last occurrence of each line, not the first):

:g/^\(.*\)$\n\ze\%(.*\n\)*\1$/d

Method 4: Spreadsheets

Best for: Tabular data where duplicates are in a specific column.

Excel

  1. Select your data range.
  2. Data → Remove Duplicates.
  3. Choose which columns to check for duplicates.
  4. Excel removes entire rows where the selected columns match.

Google Sheets

  1. Select the data range.
  2. Data → Data cleanup → Remove duplicates.
  3. Optionally use =UNIQUE(A1:A100) in a new column to extract unique values without modifying the original.

Key difference from line-based deduplication: Spreadsheet deduplication can check specific columns, so two rows are only considered duplicates if the chosen columns match — even if other columns differ.
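
The same column-keyed rule can be scripted with the standard csv module; in this sketch, rows count as duplicates only when the (illustrative) email column matches, even though the name column differs:

```python
import csv
import io

data = """email,name
alice@example.com,Alice
bob@example.com,Bobby
alice@example.com,Alice B.
"""

seen = set()
unique_rows = []
for row in csv.DictReader(io.StringIO(data)):
    if row["email"] not in seen:  # compare only the chosen column
        seen.add(row["email"])
        unique_rows.append(row)

print([r["name"] for r in unique_rows])  # → ['Alice', 'Bobby']
```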

Method 5: Scripting (Python)

Best for: Custom logic, large files, or integration into data pipelines.

Preserve order

seen = set()
with open('input.txt') as f, open('output.txt', 'w') as out:
    for line in f:
        stripped = line.rstrip('\n')
        if stripped not in seen:
            seen.add(stripped)
            out.write(line)

Case-insensitive

seen = set()
with open('input.txt') as f, open('output.txt', 'w') as out:
    for line in f:
        key = line.rstrip('\n').lower()
        if key not in seen:
            seen.add(key)
            out.write(line)

With pandas (for structured data)

import pandas as pd

df = pd.read_csv('data.csv')
df_unique = df.drop_duplicates(subset=['email'])  # deduplicate by column
df_unique.to_csv('data_clean.csv', index=False)

Edge Cases to Watch For

Not all deduplication is straightforward. Watch for these:

  • Trailing whitespace: Two lines that look identical may differ by invisible spaces or tabs. Use a trim/whitespace option before comparing.
  • Line endings: Files from different operating systems use different line endings (\n vs \r\n). Normalize line endings first if you're merging files from mixed sources.
  • Encoding: Unicode normalization can cause "identical-looking" characters to differ at the byte level (e.g., é as a single character vs. e + combining accent). This is rare but can cause persistent "phantom duplicates."
  • Empty lines: Decide upfront whether blank lines should be stripped or preserved. Most tools treat all empty lines as duplicates of each other.
  • Leading zeros and formatting: "007" and "7" are different strings. If you're deduplicating numeric IDs, consider normalizing the format first.
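
The edge cases above can be handled with a normalization pass before comparing. A defensive Python sketch (which normalizations you actually need depends on your data):

```python
import unicodedata

def normalize(line):
    line = line.rstrip("\r\n")                 # tolerate \n and \r\n endings
    line = line.strip()                        # drop stray spaces and tabs
    return unicodedata.normalize("NFC", line)  # unify é vs e + combining accent

def dedupe_normalized(lines):
    seen = set()
    out = []
    for line in lines:
        key = normalize(line)
        if key not in seen:
            seen.add(key)
            out.append(line)
    return out

# "café\n", "café\r\n" (combining accent), and "  café  \n" all normalize
# to the same key, so only the first survives
lines = ["caf\u00e9\n", "cafe\u0301\r\n", "  caf\u00e9  \n"]
print(dedupe_normalized(lines))
```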

Which Method Should You Use?

  • Quick one-time cleanup: online duplicate line remover
  • Large file (100K+ lines): awk or Python script
  • File already open in editor: text editor built-in command
  • Tabular data / CSV: spreadsheet or pandas
  • Automated pipeline: shell script or Python
  • Need a frequency count: sort | uniq -c

Frequently Asked Questions

Does removing duplicates change the order of my lines?

It depends on the method. Online tools and awk '!seen[$0]++' preserve original order. sort | uniq reorders lines alphabetically. Spreadsheet "Remove Duplicates" preserves the row order of the first occurrence.

How do I find duplicates without removing them?

Use sort input.txt | uniq -d to show only lines that appear more than once. In a spreadsheet, use COUNTIF to flag rows where a value appears more than once. The duplicate line remover shows identified duplicates in a separate output panel.
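
The scripted equivalent of uniq -d, using collections.Counter to flag lines that appear more than once (a sketch):

```python
from collections import Counter

lines = ["a", "b", "a", "c", "b", "a"]

# keep only values whose count exceeds 1, like uniq -d
dupes = [line for line, count in Counter(lines).items() if count > 1]
print(dupes)  # → ['a', 'b']
```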

Can I remove duplicates based on part of a line?

Standard line-based deduplication compares entire lines. For partial matching (e.g., deduplicate by the first word or a specific column), use awk with a field selector: awk -F',' '!seen[$1]++' file.csv deduplicates by the first comma-separated field.
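
The same first-field rule in Python, splitting each line on a comma and keying on the first field (the sample data is illustrative):

```python
lines = [
    "alice@example.com,admin",
    "bob@example.com,user",
    "alice@example.com,user",  # same first field, different second field
]

seen = set()
unique = []
for line in lines:
    key = line.split(",")[0]   # the equivalent of $1 in the awk version
    if key not in seen:
        seen.add(key)
        unique.append(line)

print(unique)  # → ['alice@example.com,admin', 'bob@example.com,user']
```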

What is the fastest method for very large files?

For files over 100 MB, awk '!seen[$0]++' is typically the fastest single-threaded option. sort | uniq uses disk-backed sorting and handles arbitrarily large files but changes the order. Python with a set is also efficient for files that fit in memory.

How do I remove duplicate lines in a Google Doc?

Google Docs doesn't have a built-in duplicate remover. Copy the text, paste it into a duplicate line remover tool, then paste the cleaned result back. Alternatively, paste into Google Sheets (one line per cell) and use Data → Remove duplicates.

How is deduplication different from finding unique values?

They produce the same result if you only care about the output list. The difference is what you keep track of: deduplication focuses on removing repeated entries from existing data, while "find unique" emphasizes extracting the distinct set. In practice, both operations give you a list of unique lines.

Related Tools