Information Gain — Why Original Content Wins

What It Is

Information Gain is a 0-100 score that measures how much unique value your content provides compared to competing pages ranking for the same keywords. It answers the question: "Does this page tell the reader something they can't find elsewhere?" Google's Information Gain patent application (US20200349181A1) describes a closely related idea: documents that add new information beyond what a reader has already seen can be scored higher.

Why It Matters for Your SEO

In competitive SERPs, pages that merely rewrite what every other result says offer zero information gain. Google actively looks for pages that contribute unique entities, original concepts, proprietary data, or novel perspectives. A high Information Gain score means your content is genuinely additive — not just another commodity article. Pages with scores above 60 consistently outperform similar-quality pages that lack unique content.

How korvex Measures It

The score combines three components:

Component	Points	What It Measures
Unique Entities	0-40	Named things (people, products, organisations) that appear in your content but not in competitors' content
Unique Concepts	0-40	Ideas, phrases, and topic clusters unique to your page
Content Depth	0-20	Structural quality — word count, heading structure, paragraph depth

Score Ranges

Range	Rating	What It Means
70-100	High Gain	Substantial original content — strong differentiation
50-69	Moderate Gain	Some unique angles, but significant overlap with competitors
30-49	Low Gain	Mostly commodity content with limited original contribution
0-29	Minimal Gain	Nearly all content duplicates what competitors already cover
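
To make the bands concrete, here is a minimal Python sketch that sums the three component scores and maps the total onto the ratings in the table. The names `RATING_BANDS`, `total_score`, and `rating_for` are illustrative and not part of korvex. The sketch also assumes that fractional totals fall into the band of the nearest lower bound (so 69.5 would be Moderate Gain), which the table does not state.

```python
# Rating bands copied from the table above: (lower bound, label).
RATING_BANDS = [
    (70, "High Gain"),
    (50, "Moderate Gain"),
    (30, "Low Gain"),
    (0, "Minimal Gain"),
]


def total_score(entity_pts: float, concept_pts: float, depth_pts: float) -> float:
    """Sum the three components (0-40, 0-40, 0-20) into a 0-100 score."""
    return entity_pts + concept_pts + depth_pts


def rating_for(score: float) -> str:
    """Return the rating label for a 0-100 Information Gain score."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for lower_bound, label in RATING_BANDS:
        # Assumption: a score belongs to the first band whose lower bound it reaches.
        if score >= lower_bound:
            return label
    return "Minimal Gain"  # not reachable for valid input; kept as a safe default


if __name__ == "__main__":
    score = total_score(entity_pts=28, concept_pts=25, depth_pts=14)
    print(score, rating_for(score))
```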

How to Improve Your Score

Add proprietary data — original research, case studies with real numbers, surveys, or benchmarks
Cover entities competitors miss — identify entity gaps using the Entity Intelligence page and fill them
Go deeper on subtopics — expand thin sections with original analysis rather than surface-level summaries
Include expert perspectives — quotes, interviews, or commentary from practitioners
Structure for depth — use 10+ headings and 15+ paragraphs to demonstrate comprehensive coverage

<details> <summary>Technical Deep Dive</summary>

Scoring Components

Unique Entities (0-40 points):

base_score = uniqueness_ratio × 30 (ratio of entities NOT found in competitors)
bonus = min(unique_count / 20, 1.0) × 10 (absolute count bonus)
Semantic deduplication: entities with cosine similarity > 0.85 to a competitor entity count as duplicates
Uses spaCy entity extraction + SentenceTransformer embeddings (all-MiniLM-L6-v2, 384-dim)
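
The entity formula can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the korvex implementation: the function names are hypothetical, `uniqueness_ratio` is assumed to be unique entities divided by all entities on your page, and embeddings are passed in as plain lists of floats. In practice they would presumably come from the all-MiniLM-L6-v2 model named above.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def unique_entity_score(page_embeddings, competitor_embeddings,
                        threshold=0.85, count_cap=20):
    """Score 0-40 from {entity_name: embedding} dicts for page and competitors."""
    if not page_embeddings:
        return 0.0
    competitor_vectors = list(competitor_embeddings.values())
    # An entity is a duplicate if it is more similar than the threshold
    # to any competitor entity; otherwise it counts as unique.
    unique_count = sum(
        1
        for vector in page_embeddings.values()
        if all(cosine(vector, other) <= threshold for other in competitor_vectors)
    )
    # Assumption: the ratio is taken over all entities found on the page.
    uniqueness_ratio = unique_count / len(page_embeddings)
    base_score = uniqueness_ratio * 30
    bonus = min(unique_count / count_cap, 1.0) * 10
    return base_score + bonus


if __name__ == "__main__":
    # Toy 2-dimensional vectors stand in for real 384-dimensional embeddings.
    page = {"Acme Corp": [1.0, 0.0], "Jane Doe": [0.0, 1.0]}
    competitors = {"ACME Corporation": [1.0, 0.1]}
    print(unique_entity_score(page, competitors))
```

In the toy data, "Acme Corp" is nearly parallel to the competitor vector and is treated as a duplicate, while "Jane Doe" is counted as unique.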

Unique Concepts (0-40 points):

Same base formula as entities (base_score = uniqueness_ratio × 30), applied to concept phrases: unigrams of 4+ characters, bigrams, and trigrams
Top 200 concepts per side sampled for performance
Stop-word filtered, BERT-normalised (NFD + strip combining marks)
bonus = min(unique_count / 50, 1.0) × 10
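
A rough Python sketch of the concept pipeline follows. Several details are assumptions rather than documented behaviour: the tiny `STOP_WORDS` set is a placeholder, stop words are removed before n-grams are built, the "top 200" are chosen by frequency, and uniqueness is decided by exact string match. The source does not say whether the semantic deduplication used for entities also applies to concepts.

```python
import re
import unicodedata
from collections import Counter

# Placeholder stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "on", "or", "with"}


def normalise(text):
    """Lowercase, NFD-decompose, and strip combining marks (e.g. 'Café' -> 'cafe')."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def concept_phrases(text, limit=200):
    """Return up to `limit` unigrams (4+ chars), bigrams, and trigrams by frequency."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", normalise(text))
              if t not in STOP_WORDS]
    counts = Counter()
    for n in (1, 2, 3):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if n == 1 and len(gram[0]) < 4:
                continue  # the length filter applies to unigrams only
            counts[" ".join(gram)] += 1
    return [phrase for phrase, _ in counts.most_common(limit)]


def unique_concept_score(page_concepts, competitor_concepts, count_cap=50):
    """Score 0-40: ratio of unique concepts times 30, plus a count bonus up to 10."""
    page = set(page_concepts)
    if not page:
        return 0.0
    unique = page - set(competitor_concepts)  # assumption: exact-match comparison
    return len(unique) / len(page) * 30 + min(len(unique) / count_cap, 1.0) * 10


if __name__ == "__main__":
    mine = concept_phrases("Café loyalty data for SEO benchmarks")
    theirs = concept_phrases("SEO benchmarks for beginners")
    print(unique_concept_score(mine, theirs))
```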

Content Depth (0-20 points):

Word count (0-8 pts): 2000+ words = 8, 1500+ = 6, 1000+ = 4, 500+ = 2
Heading structure (0-6 pts): min(heading_count / 10, 1.0) × 6
Paragraph structure (0-4 pts): min(paragraph_count / 15, 1.0) × 4
Technical detail (0-2 pts): average words per paragraph ≥ 100 = 2, ≥ 50 = 1.5
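
The four depth rules add up to 20 points and translate directly into code. The sketch below is illustrative only: `content_depth_score` is a hypothetical name, average words per paragraph is assumed to be `word_count / paragraph_count`, and anything below the lowest listed threshold is assumed to earn 0 points.

```python
def content_depth_score(word_count, heading_count, paragraph_count):
    """Score 0-20 from word count, heading count, and paragraph count."""
    # Word count: stepped thresholds, 0-8 points.
    if word_count >= 2000:
        word_pts = 8
    elif word_count >= 1500:
        word_pts = 6
    elif word_count >= 1000:
        word_pts = 4
    elif word_count >= 500:
        word_pts = 2
    else:
        word_pts = 0  # assumption: under 500 words earns nothing

    # Headings and paragraphs: linear up to a cap, 0-6 and 0-4 points.
    heading_pts = min(heading_count / 10, 1.0) * 6
    paragraph_pts = min(paragraph_count / 15, 1.0) * 4

    # Technical detail: based on average paragraph length, 0-2 points.
    avg_words = word_count / paragraph_count if paragraph_count else 0
    if avg_words >= 100:
        detail_pts = 2
    elif avg_words >= 50:
        detail_pts = 1.5
    else:
        detail_pts = 0  # assumption: under 50 words per paragraph earns nothing

    return word_pts + heading_pts + paragraph_pts + detail_pts


if __name__ == "__main__":
    print(content_depth_score(word_count=2000, heading_count=10, paragraph_count=15))
```

Note that the heading and paragraph caps match the "10+ headings and 15+ paragraphs" guidance: going beyond those counts earns no extra depth points.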

Data Sources

Competitor content: Fetched during , stored in page_scores with is_competitor = true
Entity extraction: services/analyzers/entity_extractor.py using spaCy en_core_web_sm
Embedding model: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions, cosine similarity)
Update frequency: Recalculated when page is re-scored in Phase 5

Related Topics

Entity Salience — how individual entities are weighted
The Koray Score — information gain contributes to overall content quality
Content Opportunities — finding topics with high information gain potential

</details>