How to Analyze Word Frequency in Text — Methods, Formulas & Practical Examples

Learn how word frequency analysis works, what it reveals about any text, and how to use it for SEO, writing improvement, and content analysis.

Word frequency analysis counts how many times each word appears in a piece of text. It's one of the simplest forms of text analysis — and one of the most useful. Writers use it to spot repetition. SEO practitioners use it to check keyword density. Researchers use it to compare writing styles across documents.

This guide covers the core concepts, the math behind them, and practical ways to apply frequency data.

What Word Frequency Analysis Tells You

Counting words sounds trivial, but the results reveal patterns that are hard to see by reading alone:

  • Repetition problems — using the same word 47 times in a 500-word article
  • Keyword distribution — whether your target phrase appears enough (or too much) for search engines
  • Writing style — formal texts tend to have higher vocabulary diversity than casual ones
  • Content focus — the most frequent words usually reflect the true topic, regardless of what the title claims

How Word Frequency Is Calculated

The basic formula is straightforward:

Word frequency = Number of occurrences of a word ÷ Total number of words

This gives you a proportion, often expressed as a percentage.

Worked Example

Take this sentence: "The cat sat on the mat and the cat slept."

Word    Count   Frequency
the     3       30%
cat     2       20%
sat     1       10%
on      1       10%
mat     1       10%
and     1       10%
slept   1       10%

Total words: 10. "The" appears 3 times, so its frequency is 3 ÷ 10 = 0.30, or 30%.
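The same calculation can be sketched in a few lines of Python (the sentence and percentages are taken from the example above):

```python
from collections import Counter

sentence = "The cat sat on the mat and the cat slept."
# Lowercase and strip the period so "The" and "the" count as one word
words = sentence.lower().replace(".", "").split()

counts = Counter(words)
total = len(words)

for word, count in counts.most_common():
    print(f"{word}: {count} ({count / total:.0%})")
```

The `most_common()` call returns words sorted by count, so the output matches the table above, "the" first at 30%.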

Filtering Stop Words

In practice, the most frequent words in any English text are almost always function words — "the", "is", "and", "of", "to". These are called stop words. They carry grammatical meaning but rarely tell you anything about the content.

Most word frequency tools let you filter stop words so the results show content-carrying words instead. With stop words removed from the example above, "cat" becomes the top word at 40% (2 out of 5 remaining words).
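A minimal sketch of that filtering step, assuming a small illustrative stop-word list (real tools ship lists with a few hundred entries):

```python
from collections import Counter

# Illustrative subset only; production stop-word lists are much longer
STOP_WORDS = {"the", "a", "an", "and", "on", "in", "of", "to", "is"}

words = "the cat sat on the mat and the cat slept".split()
content_words = [w for w in words if w not in STOP_WORDS]

counts = Counter(content_words)
total = len(content_words)
for word, count in counts.most_common():
    print(f"{word}: {count / total:.0%}")
```

With the function words stripped, "cat" tops the list at 40% (2 of 5 remaining words), matching the hand count above.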

Keyword Density for SEO

Keyword density is word frequency applied to a specific target phrase. The formula:

Keyword density = (Number of times keyword appears ÷ Total word count) × 100

For a 1,000-word article where "word frequency" appears 12 times:

12 ÷ 1,000 × 100 = 1.2% keyword density
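That arithmetic fits in one small helper (the function name and test text below are our own; the phrase is matched against the word stream so multi-word keywords work too):

```python
def keyword_density(text: str, keyword: str) -> float:
    """Occurrences of the keyword phrase per total words, as a percentage."""
    words = text.lower().split()
    kw = keyword.lower().split()
    # Slide a window of len(kw) across the words and count exact phrase matches
    hits = sum(words[i:i + len(kw)] == kw for i in range(len(words) - len(kw) + 1))
    return hits / len(words) * 100 if words else 0.0

# 12 occurrences of a 2-word phrase inside a 1,000-word text
text = "word frequency " * 12 + "filler " * 976
print(f"{keyword_density(text, 'word frequency'):.1f}%")
```

Counting density against the total word count (not the phrase count) is what the formula above specifies, so a 2-word keyword appearing 12 times in 1,000 words still yields 1.2%.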

What's a Good Keyword Density?

There is no magic number. Search engines in 2026 use semantic understanding, not simple keyword counting. That said, some practical guidelines:

  • Below 0.5% — the keyword may not register as a topic signal
  • 0.5% to 2% — typical range for well-written content
  • Above 3% — reads unnaturally and risks being flagged as keyword stuffing

The real test: read the text aloud. If a word feels forced or repetitive, it probably is — regardless of the percentage.

N-gram Frequency (Phrases, Not Just Words)

Single-word frequency misses multi-word phrases. "Machine learning" as a concept only shows up when you count 2-word combinations (bigrams).

Common n-gram types:

Type      Example                     Use case
Unigram   "frequency"                 Basic word count
Bigram    "word frequency"            Phrase detection, keyword density
Trigram   "word frequency analysis"   Long-tail keyword research

Most word frequency counters focus on unigrams, but some support n-gram analysis for deeper content evaluation.
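Bigram counting is only a small step beyond single-word counting: pair each word with its successor before counting (the toy sentence below is our own):

```python
from collections import Counter

words = "machine learning makes machine learning useful".split()

# Pair each word with the next one to form bigrams
bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
counts = Counter(bigrams)

print(counts.most_common(3))
```

Here "machine learning" surfaces with a count of 2 — something separate unigram counts of "machine" and "learning" would never reveal as a phrase.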

Lexical Diversity: Measuring Vocabulary Richness

Lexical diversity (also called vocabulary richness) measures how varied the word choices are in a text. The simplest formula:

Type-Token Ratio (TTR) = Unique words ÷ Total words × 100

A 500-word essay with 250 unique words has a TTR of 50%.
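The ratio is a one-liner in practice (the helper name is our own, and punctuation handling is deliberately naive):

```python
def type_token_ratio(text: str) -> float:
    """Unique words divided by total words, expressed as a percentage."""
    words = text.lower().split()
    return len(set(words)) / len(words) * 100 if words else 0.0

print(type_token_ratio("the cat sat on the mat and the cat slept"))
```

Ten words, seven of them unique, gives a TTR of 70%.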

What Affects Lexical Diversity

  • Text length — longer texts naturally repeat more words, lowering TTR
  • Genre — technical writing repeats specialized terms; fiction uses more varied vocabulary
  • Audience — content for beginners tends to use simpler, more repetitive language

Interpreting TTR Scores

TTR Range    Typical interpretation
Below 30%    Very repetitive or highly technical
30%–50%      Normal for articles and essays
50%–70%      Rich vocabulary, varied writing
Above 70%    Very diverse (short texts or poetry)

TTR is most meaningful when comparing texts of similar length. A 100-word paragraph will almost always score higher than a 10,000-word article.

Practical Applications

1. Improving Your Writing

Run your draft through a word frequency counter. Look for:

  • Overused words — if one content word appears significantly more than others, find synonyms or restructure sentences
  • Filler words — high counts of "very", "really", "just", "basically" suggest the writing can be tightened
  • Missing variety — a low TTR might mean you're leaning on the same handful of words

2. SEO Content Auditing

Before publishing, check that:

  • Your primary keyword appears in the top 10 most frequent content words
  • Keyword density stays in the 0.5%–2% range
  • Related terms (LSI keywords) also appear — search engines expect topically complete content
  • No single word dominates to the point of feeling unnatural

3. Academic and Research Analysis

Word frequency analysis is a foundation of computational linguistics. Common academic uses:

  • Authorship attribution — comparing frequency profiles across documents to identify likely authors
  • Sentiment patterns — tracking frequency of positive vs. negative words over time
  • Language learning — identifying which words a student uses most vs. least
  • Corpus analysis — understanding word distribution across large text collections

4. Content Comparison

Compare the frequency profiles of two texts to see how their focus differs. If your competitor's article about "text analysis" uses "tokenization" 15 times and yours uses it zero times, you may be missing a relevant subtopic.

Common Mistakes

Ignoring Case Sensitivity

"Python" and "python" mean different things. Most tools default to case-insensitive counting, which is usually correct for general analysis. But when proper nouns matter (brand names, programming languages), case-sensitive mode gives more accurate results.
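The difference is easy to see on a toy word list (the values are illustrative):

```python
from collections import Counter

words = ["Python", "python", "Python", "snake"]

case_sensitive = Counter(words)                       # "Python" and "python" stay separate
case_insensitive = Counter(w.lower() for w in words)  # both forms merge into "python"

print(case_sensitive["Python"], case_insensitive["python"])
```

Case-sensitive counting reports 2 for the capitalized form; folding to lowercase first merges all three occurrences.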

Not Filtering Stop Words

If your frequency table is dominated by "the", "and", "is", you're looking at grammar, not content. Always filter stop words when analyzing topic and keyword density.

Comparing Texts of Different Lengths

A 200-word email and a 5,000-word report will have very different frequency distributions even if they cover the same topic. Normalize by using percentages rather than raw counts when comparing.

Over-Optimizing for a Single Keyword

If word frequency analysis shows your keyword at 4%, the fix is not to add more text — it's to write more naturally. Search engines reward comprehensive coverage over keyword repetition.

How to Count Word Frequency

Method 1: Online Tool (Fastest)

Paste your text into a word frequency counter. Results appear instantly with counts, percentages, and optional CSV export.

Method 2: Command Line

On Linux or macOS, this one-liner counts word frequency in a file:

tr '[:upper:]' '[:lower:]' < file.txt | tr -cs '[:alpha:]' '\n' | sort | uniq -c | sort -rn | head -20

This converts to lowercase, splits into words, counts, and sorts by frequency. The head -20 shows the top 20 words.

Method 3: Python Script

from collections import Counter
import re

# Read the file and normalize to lowercase so "Word" and "word" count as one
with open('file.txt') as f:
    text = f.read().lower()

# Keep alphabetic words only; numbers and punctuation are dropped
words = re.findall(r'\b[a-z]+\b', text)
frequency = Counter(words)

# Print the 20 most common words with their counts
for word, count in frequency.most_common(20):
    print(f"{word}: {count}")

Method 4: Spreadsheet

In Google Sheets or Excel:

  1. Put each word in column A (use a text-to-columns split)
  2. Use =COUNTIF(A:A, A1) in column B to count occurrences
  3. Remove duplicates and sort by count

FAQ

What is word frequency analysis?

Word frequency analysis counts how many times each word appears in a text and ranks them by occurrence. It reveals which words dominate a document and how varied the vocabulary is.

How do you calculate keyword density?

Divide the number of times the keyword appears by the total word count, then multiply by 100. For example, if "analysis" appears 8 times in 800 words: (8 ÷ 800) × 100 = 1.0% keyword density.

What are stop words?

Stop words are common function words like "the", "and", "is", "of", "to" that appear frequently in all English text but carry little topical meaning. Most frequency analysis filters them out to focus on content words.

What is a good keyword density for SEO?

There is no fixed rule. Most well-written content naturally falls between 0.5% and 2%. Above 3% often reads as unnatural. Modern search engines prioritize relevance and comprehensiveness over exact keyword frequency.

What is lexical diversity?

Lexical diversity (or vocabulary richness) measures how many different words a text uses relative to its total length. The Type-Token Ratio (TTR) formula is: unique words ÷ total words × 100. Higher scores mean more varied vocabulary.

Does word frequency matter for SEO in 2026?

Not as a ranking formula, but as a diagnostic tool. If your target keyword doesn't appear in your top content words, search engines may not associate your page with that topic. Frequency analysis helps verify your content matches your intent.

How many unique words does a typical English article use?

A 1,000-word article typically uses 400–600 unique words, depending on the topic and writing style. Technical content tends to reuse specialized terms more often, resulting in fewer unique words.

What is the difference between word frequency and TF-IDF?

Word frequency counts occurrences within a single document. TF-IDF (Term Frequency–Inverse Document Frequency) weights those counts against how common the word is across many documents. A word that appears frequently in your text but rarely elsewhere gets a higher TF-IDF score, making it a better topic signal.
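A toy sketch of that weighting, using the classic tf × log(N/df) definition (real libraries such as scikit-learn apply smoothing, so exact numbers differ):

```python
import math
from collections import Counter

docs = [
    "word frequency counts words in one document".split(),
    "tokenization splits text into words".split(),
    "search engines rank documents by relevance".split(),
]

def tf_idf(term: str, doc: list, corpus: list) -> float:
    tf = Counter(doc)[term] / len(doc)      # how frequent in this document
    df = sum(term in d for d in corpus)     # how many documents contain it
    return tf * math.log(len(corpus) / df)  # rare across the corpus → higher weight

# "words" appears in two of three docs; "tokenization" in only one,
# so "tokenization" earns the higher score
print(tf_idf("words", docs[0], docs))
print(tf_idf("tokenization", docs[1], docs))
```

The common word is discounted by the log term while the rare one keeps most of its weight, which is exactly what makes TF-IDF a better topic signal than raw frequency.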

Can word frequency analysis detect plagiarism?

Not directly, but unusual frequency patterns can flag suspicious content. If two documents share nearly identical frequency profiles for uncommon words, it may indicate copied content. Dedicated plagiarism tools use more sophisticated methods.

What is Zipf's Law?

Zipf's Law states that in any large text, the most frequent word will appear roughly twice as often as the second most frequent, three times as often as the third, and so on. Most natural language follows this distribution pattern.
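The prediction is easy to spell out: if the rank-1 word appears f times, the rank-r word should appear roughly f ÷ r times. A quick sketch of the predicted counts (the top count of 600 is an arbitrary illustration):

```python
def zipf_predicted_counts(top_count: int, n_ranks: int) -> list:
    """Expected count at each rank if frequency is proportional to 1/rank."""
    return [top_count // rank for rank in range(1, n_ranks + 1)]

print(zipf_predicted_counts(600, 5))  # → [600, 300, 200, 150, 120]
```

Comparing these predictions against an actual frequency table from a long text shows how closely natural language tracks the 1/rank curve.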
