Markdown Data Validation and Schema Checking: Complete Guide for Content Quality Assurance and Automated Verification
Markdown data validation and schema checking let a documentation team automatically verify document structure, check metadata consistency, and enforce editorial standards across large repositories. With a validation framework and automated schema checks in place, technical teams can keep documentation quality consistent as content production scales and authoring is spread across many contributors.
Why Master Markdown Data Validation?
Automated validation supports content quality assurance in several ways:
- Content Consistency: Enforce standardized document structures and metadata formats across all content
- Quality Assurance: Automatically detect formatting errors, missing required fields, and structural inconsistencies
- Editorial Standards: Maintain consistent writing quality through automated grammar, style, and terminology checking
- Integration Compliance: Ensure content meets API requirements and system integration specifications
- Team Collaboration: Provide immediate feedback to authors and editors through automated validation workflows
Foundation Validation Concepts
Understanding Markdown Schema Validation
A schema file declares, for each document type, which frontmatter fields are required and which rules the content must satisfy:
# markdown-schema.yml - Comprehensive schema definition
schema_version: "1.0"
document_types:
  blog_post:
    required_frontmatter:
      - title
      - description
      - keywords
      - layout
      - date
      - author
      - category
    optional_frontmatter:
      - image
      - tags
      - excerpt
      - featured
    frontmatter_rules:
      title:
        type: string
        min_length: 10
        max_length: 120
        pattern: "^[A-Za-z0-9\\s\\-:]+$"
      description:
        type: string
        min_length: 50
        max_length: 300
        must_end_with_period: true
      keywords:
        type: string
        min_keywords: 3
        max_keywords: 12
        separator: ", "
      date:
        type: date
        format: "YYYY-MM-DD"
        not_future: true
      category:
        type: string
        allowed_values: ["Tutorial", "Guide", "Reference", "News"]
    content_rules:
      min_word_count: 800
      max_word_count: 5000
      required_sections:
        - introduction_paragraph
        - main_headings
        - conclusion_section
      heading_structure:
        max_heading_level: 4
        require_h1: true
        sequential_levels: true
      code_blocks:
        require_language_tags: true
        max_line_length: 100
        escape_liquid_syntax: true
      links:
        validate_internal: true
        check_external: false
        require_descriptions: true
      images:
        require_alt_text: true
        validate_paths: true
        max_file_size: "2MB"
  documentation_page:
    required_frontmatter:
      - title
      - description
      - layout
      - nav_order
    content_rules:
      min_word_count: 200
      require_table_of_contents: true
  api_reference:
    required_frontmatter:
      - title
      - api_version
      - endpoint
      - method
      - layout
    content_rules:
      required_sections:
        - parameters_section
        - response_section
        - example_section
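Because every rule in the schema is plain data, a typo in the schema itself (for example a misspelled `required_frontmatter`) silently disables a check. The following is a minimal sketch of a schema lint, assuming the key names used in the schema above; `check_schema`, `KNOWN_TYPE_KEYS`, and `KNOWN_FIELD_TYPES` are hypothetical names introduced for illustration and are not part of any library.

```python
from typing import Any, Dict, List

# Keys used by the schema in this guide; anything else is likely a typo.
KNOWN_TYPE_KEYS = {
    "required_frontmatter", "optional_frontmatter",
    "frontmatter_rules", "content_rules",
}
KNOWN_FIELD_TYPES = {"string", "date"}


def check_schema(schema: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems found in a schema dict."""
    problems: List[str] = []
    doc_types = schema.get("document_types")
    if not isinstance(doc_types, dict) or not doc_types:
        return ["schema has no document_types mapping"]

    for name, spec in doc_types.items():
        if not isinstance(spec, dict):
            problems.append(f"{name}: definition must be a mapping")
            continue
        for key in spec:
            if key not in KNOWN_TYPE_KEYS:
                problems.append(f"{name}: unknown key '{key}'")
        required = spec.get("required_frontmatter", [])
        optional = spec.get("optional_frontmatter", [])
        allowed = set(required) | set(optional)
        for field, rules in spec.get("frontmatter_rules", {}).items():
            if field not in allowed:
                problems.append(f"{name}: rule for undeclared field '{field}'")
            field_type = rules.get("type")
            if field_type not in KNOWN_FIELD_TYPES:
                problems.append(f"{name}.{field}: unsupported type '{field_type}'")
            lo, hi = rules.get("min_length"), rules.get("max_length")
            if lo is not None and hi is not None and lo > hi:
                problems.append(f"{name}.{field}: min_length exceeds max_length")
    return problems


# A deliberately broken schema to exercise the checks.
example = {
    "document_types": {
        "blog_post": {
            "required_frontmatter": ["title"],
            "frontmatter_rules": {
                "title": {"type": "string", "min_length": 120, "max_length": 10},
                "summary": {"type": "text"},
            },
            "requried_frontmatter": ["date"],  # deliberate typo
        }
    }
}
for problem in check_schema(example):
    print(problem)
```

Running a lint like this before any documents are validated keeps schema mistakes from being mistaken for clean content.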
Comprehensive Validation Framework
The validator below loads the schema and runs frontmatter, structure, link, image, code block, and style checks against each Markdown file:
# markdown_validator.py - Advanced validation framework
import os
import re
from dataclasses import dataclass
from datetime import date, datetime
from enum import Enum
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple

import yaml  # PyYAML; the only third-party dependency used below


class ValidationLevel(Enum):
    ERROR = "error"
    WARNING = "warning"
    INFO = "info"


class ValidationCategory(Enum):
    FRONTMATTER = "frontmatter"
    CONTENT = "content"
    STRUCTURE = "structure"
    LINKS = "links"
    IMAGES = "images"
    STYLE = "style"


@dataclass
class ValidationResult:
    level: ValidationLevel
    category: ValidationCategory
    message: str
    line_number: Optional[int] = None
    column_number: Optional[int] = None
    suggestion: Optional[str] = None


class MarkdownValidator:
    def __init__(self, schema_path: str = "markdown-schema.yml"):
        """Initialize validator with schema configuration"""
        self.schema = self.load_schema(schema_path)
        self.results: List[ValidationResult] = []
        # Compiled regex patterns for performance
        self.patterns = {
            'frontmatter': re.compile(r'^---\n(.*?)\n---', re.DOTALL),
            'heading': re.compile(r'^(#{1,6})\s+(.+)$', re.MULTILINE),
            'code_block': re.compile(r'```(\w*)\n(.*?)\n```', re.DOTALL),
            'inline_code': re.compile(r'`([^`]+)`'),
            # (?<!!) keeps image syntax out of the link matches; [^\]]* lets
            # empty link text through so validate_links can report it
            'link': re.compile(r'(?<!!)\[([^\]]*)\]\(([^)]+)\)'),
            'image': re.compile(r'!\[([^\]]*)\]\(([^)]+)\)'),
            'liquid_syntax': re.compile(r'\{\{.*?\}\}|\{%.*?%\}'),
            'html_tags': re.compile(r'<[^>]+>'),
            'word_boundary': re.compile(r'\b\w+\b')
        }

    def load_schema(self, schema_path: str) -> Dict[str, Any]:
        """Load validation schema from YAML file"""
        try:
            with open(schema_path, 'r', encoding='utf-8') as f:
                return yaml.safe_load(f)
        except FileNotFoundError:
            return self.get_default_schema()
        except yaml.YAMLError as e:
            raise ValueError(f"Invalid schema YAML: {e}")

    def get_default_schema(self) -> Dict[str, Any]:
        """Return default schema when no schema file is found"""
        return {
            'schema_version': '1.0',
            'document_types': {
                'default': {
                    'required_frontmatter': ['title', 'description', 'date'],
                    'content_rules': {
                        'min_word_count': 100,
                        'max_word_count': 10000
                    }
                }
            }
        }

    def validate_file(self, file_path: str, document_type: str = 'default') -> List[ValidationResult]:
        """Validate a single Markdown file"""
        self.results = []
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
        except Exception as e:
            self.results.append(ValidationResult(
                level=ValidationLevel.ERROR,
                category=ValidationCategory.STRUCTURE,
                message=f"Cannot read file: {e}"
            ))
            return self.results
        # Get document type schema. A schema file is not required to define a
        # 'default' type, so fall back to the built-in defaults instead of
        # raising KeyError.
        document_types = (self.schema or {}).get('document_types', {})
        schema = document_types.get(document_type)
        if schema is None:
            schema = document_types.get(
                'default',
                self.get_default_schema()['document_types']['default']
            )
        # Parse frontmatter and content
        frontmatter, content_body = self.parse_document(content)
        # Run validation checks
        self.validate_frontmatter(frontmatter, schema)
        self.validate_content_structure(content_body, schema)
        self.validate_content_rules(content_body, schema)
        self.validate_links(content_body, file_path)
        self.validate_images(content_body, file_path)
        self.validate_code_blocks(content_body, schema)
        self.validate_style_guidelines(content_body)
        return self.results

    def parse_document(self, content: str) -> Tuple[Dict[str, Any], str]:
        """Parse frontmatter and content body from markdown"""
        frontmatter_match = self.patterns['frontmatter'].match(content)
        if not frontmatter_match:
            return {}, content
        try:
            frontmatter = yaml.safe_load(frontmatter_match.group(1))
            content_body = content[frontmatter_match.end():]
            # Frontmatter that parses to a scalar or list is treated as empty
            if not isinstance(frontmatter, dict):
                frontmatter = {}
            return frontmatter, content_body
        except yaml.YAMLError as e:
            self.results.append(ValidationResult(
                level=ValidationLevel.ERROR,
                category=ValidationCategory.FRONTMATTER,
                message=f"Invalid frontmatter YAML: {e}",
                line_number=1
            ))
            return {}, content

    def validate_frontmatter(self, frontmatter: Dict[str, Any], schema: Dict[str, Any]):
        """Validate frontmatter against schema requirements"""
        required_fields = schema.get('required_frontmatter', [])
        optional_fields = schema.get('optional_frontmatter', [])
        frontmatter_rules = schema.get('frontmatter_rules', {})
        # Check required fields
        for field in required_fields:
            if field not in frontmatter:
                self.results.append(ValidationResult(
                    level=ValidationLevel.ERROR,
                    category=ValidationCategory.FRONTMATTER,
                    message=f"Missing required frontmatter field: {field}",
                    suggestion=f"Add '{field}: [value]' to frontmatter"
                ))
            else:
                # Validate field according to rules
                self.validate_frontmatter_field(field, frontmatter[field], frontmatter_rules.get(field, {}))
        # Check for unknown fields
        all_allowed = set(required_fields + optional_fields)
        for field in frontmatter:
            if field not in all_allowed:
                self.results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.FRONTMATTER,
                    message=f"Unknown frontmatter field: {field}",
                    suggestion="Remove unknown field or add to schema"
                ))

    def validate_frontmatter_field(self, field_name: str, value: Any, rules: Dict[str, Any]):
        """Validate individual frontmatter field against rules"""
        if not rules:
            return
        # Type validation
        expected_type = rules.get('type')
        if expected_type == 'string' and not isinstance(value, str):
            self.results.append(ValidationResult(
                level=ValidationLevel.ERROR,
                category=ValidationCategory.FRONTMATTER,
                message=f"Field '{field_name}' must be a string",
                suggestion=f"Change {field_name} value to string format"
            ))
            return
        elif expected_type == 'date':
            self.validate_date_field(field_name, value, rules)
            return
        if not isinstance(value, str):
            return
        # Length validation
        if 'min_length' in rules and len(value) < rules['min_length']:
            self.results.append(ValidationResult(
                level=ValidationLevel.WARNING,
                category=ValidationCategory.FRONTMATTER,
                message=f"Field '{field_name}' is too short (min {rules['min_length']} chars)",
                suggestion=f"Expand {field_name} to meet minimum length requirement"
            ))
        if 'max_length' in rules and len(value) > rules['max_length']:
            self.results.append(ValidationResult(
                level=ValidationLevel.WARNING,
                category=ValidationCategory.FRONTMATTER,
                message=f"Field '{field_name}' is too long (max {rules['max_length']} chars)",
                suggestion=f"Shorten {field_name} to meet maximum length requirement"
            ))
        # Pattern validation
        if 'pattern' in rules:
            if not re.match(rules['pattern'], value):
                self.results.append(ValidationResult(
                    level=ValidationLevel.ERROR,
                    category=ValidationCategory.FRONTMATTER,
                    message=f"Field '{field_name}' doesn't match required pattern",
                    suggestion=f"Update {field_name} format to match: {rules['pattern']}"
                ))
        # Allowed values validation
        if 'allowed_values' in rules:
            if value not in rules['allowed_values']:
                self.results.append(ValidationResult(
                    level=ValidationLevel.ERROR,
                    category=ValidationCategory.FRONTMATTER,
                    message=f"Field '{field_name}' has invalid value: {value}",
                    suggestion=f"Use one of: {', '.join(rules['allowed_values'])}"
                ))
        # Custom validations
        if field_name == 'keywords' and ('min_keywords' in rules or 'max_keywords' in rules):
            # Split on the bare separator so "a,b" and "a, b" count the same
            separator = rules.get('separator', ',').strip() or ','
            keywords = [k.strip() for k in value.split(separator) if k.strip()]
            if 'min_keywords' in rules and len(keywords) < rules['min_keywords']:
                self.results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.FRONTMATTER,
                    message=f"Too few keywords ({len(keywords)} < {rules['min_keywords']})",
                    suggestion="Add more relevant keywords"
                ))
            if 'max_keywords' in rules and len(keywords) > rules['max_keywords']:
                self.results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.FRONTMATTER,
                    message=f"Too many keywords ({len(keywords)} > {rules['max_keywords']})",
                    suggestion="Keep only the most relevant keywords"
                ))
        if 'must_end_with_period' in rules and rules['must_end_with_period']:
            if not value.rstrip().endswith('.'):
                self.results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.FRONTMATTER,
                    message=f"Field '{field_name}' should end with a period",
                    suggestion=f"Add period at the end of {field_name}"
                ))

    def validate_date_field(self, field_name: str, value: Any, rules: Dict[str, Any]):
        """Validate date field with specific date rules"""
        # The schema writes formats as "YYYY-MM-DD", but strptime expects
        # "%Y-%m-%d", so translate the tokens before parsing.
        display_format = rules.get('format', 'YYYY-MM-DD')
        strptime_format = (display_format.replace('YYYY', '%Y')
                           .replace('MM', '%m')
                           .replace('DD', '%d'))
        try:
            if isinstance(value, str):
                parsed_date = datetime.strptime(value, strptime_format).date()
            elif isinstance(value, datetime):
                # datetime subclasses date; drop the time part so the
                # comparison with date.today() below is valid
                parsed_date = value.date()
            elif isinstance(value, date):
                parsed_date = value
            else:
                self.results.append(ValidationResult(
                    level=ValidationLevel.ERROR,
                    category=ValidationCategory.FRONTMATTER,
                    message=f"Field '{field_name}' must be a valid date",
                    suggestion=f"Use format: {display_format}"
                ))
                return
            if rules.get('not_future') and parsed_date > date.today():
                self.results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.FRONTMATTER,
                    message=f"Field '{field_name}' is set to future date",
                    suggestion="Use current or past date"
                ))
        except ValueError:
            self.results.append(ValidationResult(
                level=ValidationLevel.ERROR,
                category=ValidationCategory.FRONTMATTER,
                message=f"Field '{field_name}' has invalid date format",
                suggestion=f"Use format: {display_format}"
            ))

    def validate_content_structure(self, content: str, schema: Dict[str, Any]):
        """Validate overall content structure"""
        content_rules = schema.get('content_rules', {})
        # Word count validation
        words = self.patterns['word_boundary'].findall(content)
        word_count = len(words)
        min_words = content_rules.get('min_word_count', 0)
        max_words = content_rules.get('max_word_count', float('inf'))
        if word_count < min_words:
            self.results.append(ValidationResult(
                level=ValidationLevel.WARNING,
                category=ValidationCategory.CONTENT,
                message=f"Content too short: {word_count} words (min {min_words})",
                suggestion=f"Add {min_words - word_count} more words"
            ))
        if word_count > max_words:
            self.results.append(ValidationResult(
                level=ValidationLevel.WARNING,
                category=ValidationCategory.CONTENT,
                message=f"Content too long: {word_count} words (max {max_words})",
                suggestion=f"Remove {word_count - max_words} words or split content"
            ))
        # Heading structure validation
        self.validate_heading_structure(content, content_rules)
        # Required sections validation
        self.validate_required_sections(content, content_rules)

    def validate_heading_structure(self, content: str, rules: Dict[str, Any]):
        """Validate heading hierarchy and structure"""
        heading_rules = rules.get('heading_structure', {})
        # Ignore fenced code so lines such as "# comment" are not read as headings
        prose = self.patterns['code_block'].sub('', content)
        headings = self.patterns['heading'].findall(prose)
        if not headings:
            if heading_rules.get('require_h1'):
                self.results.append(ValidationResult(
                    level=ValidationLevel.ERROR,
                    category=ValidationCategory.STRUCTURE,
                    message="Missing required H1 heading",
                    suggestion="Add a main heading with # at the beginning"
                ))
            return
        # Check for H1 requirement
        has_h1 = any(len(h[0]) == 1 for h in headings)
        if heading_rules.get('require_h1') and not has_h1:
            self.results.append(ValidationResult(
                level=ValidationLevel.ERROR,
                category=ValidationCategory.STRUCTURE,
                message="Missing required H1 heading",
                suggestion="Add a main heading with #"
            ))
        # Check maximum heading level
        max_level = heading_rules.get('max_heading_level', 6)
        for heading_hashes, heading_text in headings:
            level = len(heading_hashes)
            if level > max_level:
                self.results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.STRUCTURE,
                    message=f"Heading level {level} exceeds maximum {max_level}: {heading_text}",
                    suggestion=f"Use heading level {max_level} or lower"
                ))
        # Check sequential levels
        if heading_rules.get('sequential_levels'):
            levels = [len(h[0]) for h in headings]
            for i in range(1, len(levels)):
                if levels[i] > levels[i-1] + 1:
                    self.results.append(ValidationResult(
                        level=ValidationLevel.WARNING,
                        category=ValidationCategory.STRUCTURE,
                        message=f"Heading level jumps from {levels[i-1]} to {levels[i]}",
                        suggestion="Use sequential heading levels"
                    ))

    def validate_required_sections(self, content: str, rules: Dict[str, Any]):
        """Validate presence of required content sections"""
        required_sections = rules.get('required_sections', [])
        for section in required_sections:
            if section == 'introduction_paragraph':
                # Check for a substantial first paragraph, skipping leading
                # headings so the title line is not mistaken for the introduction
                paragraphs = [p for p in content.strip().split('\n\n') if p.strip()]
                body_paragraphs = [p for p in paragraphs if not p.lstrip().startswith('#')]
                if not body_paragraphs or len(body_paragraphs[0].split()) < 20:
                    self.results.append(ValidationResult(
                        level=ValidationLevel.WARNING,
                        category=ValidationCategory.CONTENT,
                        message="Missing substantial introduction paragraph",
                        suggestion="Add a comprehensive introduction (20+ words)"
                    ))
            elif section == 'conclusion_section':
                # Check for conclusion heading or paragraph at end
                if not re.search(r'## Conclusion|# Conclusion|In conclusion|To conclude', content, re.IGNORECASE):
                    self.results.append(ValidationResult(
                        level=ValidationLevel.WARNING,
                        category=ValidationCategory.CONTENT,
                        message="Missing conclusion section",
                        suggestion="Add a conclusion section or paragraph"
                    ))
            elif section == 'main_headings':
                # Strip fenced code first so code comments do not count as headings
                prose = self.patterns['code_block'].sub('', content)
                headings = self.patterns['heading'].findall(prose)
                main_headings = [h for h in headings if len(h[0]) <= 3]  # H1, H2, H3
                if len(main_headings) < 2:
                    self.results.append(ValidationResult(
                        level=ValidationLevel.WARNING,
                        category=ValidationCategory.STRUCTURE,
                        message="Insufficient main headings for content structure",
                        suggestion="Add more section headings to organize content"
                    ))

    def validate_content_rules(self, content: str, schema: Dict[str, Any]):
        """Validate specific content formatting rules"""
        content_rules = schema.get('content_rules', {})
        # Table of contents requirement
        if content_rules.get('require_table_of_contents'):
            # \b keeps "toc" from matching inside words such as "stock"
            if not re.search(r'table of contents|\btoc\b', content, re.IGNORECASE):
                self.results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.CONTENT,
                    message="Missing table of contents",
                    suggestion="Add a table of contents section"
                ))

    def validate_links(self, content: str, file_path: str):
        """Validate links in content"""
        links = self.patterns['link'].findall(content)
        for link_text, link_url in links:
            # Check for empty link text
            if not link_text.strip():
                self.results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.LINKS,
                    message=f"Link has empty description: {link_url}",
                    suggestion="Add descriptive link text"
                ))
            # Same-page anchors such as "#setup" have no file to check
            if link_url.startswith('#'):
                continue
            # Validate internal links (anything without an external scheme)
            if not link_url.startswith(('http://', 'https://', 'mailto:')):
                self.validate_internal_link(link_url, file_path)

    def validate_internal_link(self, link_url: str, file_path: str):
        """Validate internal link existence"""
        # Drop any #fragment or ?query so only the file path is checked
        link_path = link_url.split('#', 1)[0].split('?', 1)[0]
        if not link_path:
            return
        # Convert relative link to absolute path
        if link_path.startswith('/'):
            # Absolute path from site root
            base_dir = Path(file_path).parent
            while base_dir.name != '_posts' and base_dir.parent != base_dir:
                base_dir = base_dir.parent
            site_root = base_dir.parent if base_dir.name == '_posts' else base_dir
            target_path = site_root / link_path.lstrip('/')
        else:
            # Relative path
            target_path = Path(file_path).parent / link_path
        if not target_path.exists():
            self.results.append(ValidationResult(
                level=ValidationLevel.ERROR,
                category=ValidationCategory.LINKS,
                message=f"Broken internal link: {link_url}",
                suggestion="Fix link path or create target file"
            ))

    def validate_images(self, content: str, file_path: str):
        """Validate images in content"""
        images = self.patterns['image'].findall(content)
        for alt_text, image_url in images:
            # Check for alt text
            if not alt_text.strip():
                self.results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.IMAGES,
                    message=f"Image missing alt text: {image_url}",
                    suggestion="Add descriptive alt text for accessibility"
                ))
            # Validate image file existence for local images
            if not image_url.startswith(('http://', 'https://')):
                self.validate_image_file(image_url, file_path)

    def validate_image_file(self, image_url: str, file_path: str):
        """Validate local image file existence and properties"""
        if image_url.startswith('/'):
            # Absolute path from site root
            base_dir = Path(file_path).parent
            while base_dir.name != '_posts' and base_dir.parent != base_dir:
                base_dir = base_dir.parent
            site_root = base_dir.parent if base_dir.name == '_posts' else base_dir
            image_path = site_root / image_url.lstrip('/')
        else:
            # Relative path
            image_path = Path(file_path).parent / image_url
        if not image_path.exists():
            self.results.append(ValidationResult(
                level=ValidationLevel.ERROR,
                category=ValidationCategory.IMAGES,
                message=f"Missing image file: {image_url}",
                suggestion="Add image file or fix image path"
            ))
        elif image_path.is_file():
            # Check file size if it exists
            file_size = image_path.stat().st_size
            if file_size > 2 * 1024 * 1024:  # 2MB limit
                self.results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.IMAGES,
                    message=f"Large image file ({file_size // 1024}KB): {image_url}",
                    suggestion="Optimize image size for web performance"
                ))

    def validate_code_blocks(self, content: str, schema: Dict[str, Any]):
        """Validate code blocks formatting and content"""
        code_rules = schema.get('content_rules', {}).get('code_blocks', {})
        code_blocks = self.patterns['code_block'].findall(content)
        for language, code_content in code_blocks:
            # Check for language specification
            if code_rules.get('require_language_tags') and not language.strip():
                self.results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.CONTENT,
                    message="Code block missing language specification",
                    suggestion="Add language identifier after opening backticks"
                ))
            # Check line length
            max_line_length = code_rules.get('max_line_length')
            if max_line_length:
                lines = code_content.split('\n')
                for i, line in enumerate(lines, 1):
                    if len(line) > max_line_length:
                        self.results.append(ValidationResult(
                            level=ValidationLevel.WARNING,
                            category=ValidationCategory.CONTENT,
                            message=f"Code block line {i} exceeds {max_line_length} characters",
                            suggestion="Break long lines or use horizontal scrolling"
                        ))
            # Check for unescaped Liquid syntax
            if code_rules.get('escape_liquid_syntax'):
                if self.patterns['liquid_syntax'].search(code_content):
                    # Check if properly wrapped with raw tags
                    raw_pattern = re.compile(r'\{\%\s*raw\s*\%\}.*?\{\%\s*endraw\s*\%\}', re.DOTALL)
                    if not raw_pattern.search(content):
                        self.results.append(ValidationResult(
                            level=ValidationLevel.ERROR,
                            category=ValidationCategory.CONTENT,
                            message="Code block contains unescaped Liquid syntax",
                            suggestion="Wrap code block with {%" + " raw %} and {%" + " endraw %} tags"
                        ))

    def validate_style_guidelines(self, content: str):
        """Validate writing style and formatting guidelines"""
        from collections import Counter  # local import keeps this method self-contained

        lines = content.split('\n')
        for i, line in enumerate(lines, 1):
            # Check for trailing whitespace
            if line.rstrip() != line:
                self.results.append(ValidationResult(
                    level=ValidationLevel.INFO,
                    category=ValidationCategory.STYLE,
                    message=f"Line {i} has trailing whitespace",
                    suggestion="Remove trailing whitespace"
                ))
            # Check for very long lines (prose)
            if len(line) > 120 and not line.strip().startswith(('```', '|', '#')):
                self.results.append(ValidationResult(
                    level=ValidationLevel.INFO,
                    category=ValidationCategory.STYLE,
                    message=f"Line {i} is very long ({len(line)} chars)",
                    suggestion="Consider breaking into shorter sentences"
                ))
        # Check for repeated words. Counter counts every pair in one pass;
        # calling list.count() per pair would be quadratic on long documents.
        words = self.patterns['word_boundary'].findall(content.lower())
        pair_counts = Counter(zip(words, words[1:]))
        repeated_pairs = [pair for pair, count in pair_counts.most_common()
                          if count > 3 and len(pair[0]) > 3]
        for pair in repeated_pairs[:3]:  # Limit to avoid spam
            self.results.append(ValidationResult(
                level=ValidationLevel.INFO,
                category=ValidationCategory.STYLE,
                message=f"Frequently repeated word pair: '{pair[0]} {pair[1]}'",
                suggestion="Consider varying word choice for better readability"
            ))

    def generate_report(self, results: List[ValidationResult]) -> str:
        """Generate human-readable validation report"""
        if not results:
            return "✅ All validation checks passed!"
        report = ["# Markdown Validation Report", ""]
        # Summary
        error_count = len([r for r in results if r.level == ValidationLevel.ERROR])
        warning_count = len([r for r in results if r.level == ValidationLevel.WARNING])
        info_count = len([r for r in results if r.level == ValidationLevel.INFO])
        report.append("## Summary")
        report.append(f"- ❌ Errors: {error_count}")
        report.append(f"- ⚠️ Warnings: {warning_count}")
        report.append(f"- ℹ️ Info: {info_count}")
        report.append("")
        # Group by category
        categories = {}
        for result in results:
            if result.category not in categories:
                categories[result.category] = []
            categories[result.category].append(result)
        for category, category_results in categories.items():
            report.append(f"## {category.value.title()} Issues")
            report.append("")
            for result in category_results:
                level_icon = {"error": "❌", "warning": "⚠️", "info": "ℹ️"}[result.level.value]
                location = f" (Line {result.line_number})" if result.line_number else ""
                report.append(f"### {level_icon} {result.message}{location}")
                if result.suggestion:
                    report.append(f"**Suggestion:** {result.suggestion}")
                report.append("")
        return "\n".join(report)


# Usage examples and CLI interface
def validate_directory(directory_path: str, schema_path: Optional[str] = None,
                       document_type: str = 'default') -> Dict[str, List[ValidationResult]]:
    """Validate all Markdown files in a directory"""
    validator = MarkdownValidator(schema_path) if schema_path else MarkdownValidator()
    results = {}
    # sorted() gives a stable report order across runs
    for file_path in sorted(Path(directory_path).rglob("*.md")):
        file_results = validator.validate_file(str(file_path), document_type)
        if file_results:
            results[str(file_path)] = file_results
    return results


def main():
    """CLI interface for the validator"""
    import argparse
    parser = argparse.ArgumentParser(description="Validate Markdown files")
    parser.add_argument("path", help="File or directory path to validate")
    parser.add_argument("--schema", help="Path to validation schema file")
    parser.add_argument("--type", default="default", help="Document type to validate against")
    parser.add_argument("--format", choices=["text", "json"], default="text", help="Output format")
    parser.add_argument("--errors-only", action="store_true", help="Show only errors")
    args = parser.parse_args()
    validator = MarkdownValidator(args.schema) if args.schema else MarkdownValidator()
    if os.path.isfile(args.path):
        results = validator.validate_file(args.path, args.type)
        if args.errors_only:
            results = [r for r in results if r.level == ValidationLevel.ERROR]
        if args.format == "json":
            import json
            json_results = [
                {
                    "level": r.level.value,
                    "category": r.category.value,
                    "message": r.message,
                    "line_number": r.line_number,
                    "suggestion": r.suggestion
                }
                for r in results
            ]
            print(json.dumps(json_results, indent=2))
        else:
            print(validator.generate_report(results))
    else:
        # Pass --type through so directory runs use the same document type
        all_results = validate_directory(args.path, args.schema, args.type)
        for file_path, results in all_results.items():
            if args.errors_only:
                results = [r for r in results if r.level == ValidationLevel.ERROR]
            if not results:
                continue
            print(f"\n{'='*60}")
            print(f"File: {file_path}")
            print('='*60)
            print(validator.generate_report(results))


if __name__ == "__main__":
    main()
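One subtlety in heading checks is that a line such as `# install dependencies` inside a fenced code block looks exactly like an H1 to a line-based regex. The sketch below shows one way to skip fenced code and to report level jumps with line numbers. It is a simplified illustration: `find_headings` and `find_heading_jumps` are hypothetical helper names, and the fence handling assumes fences are not nested or mixed.

```python
import re
from typing import List, Tuple

FENCE = re.compile(r"^\s*(```|~~~)")
HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


def find_headings(markdown: str) -> List[Tuple[int, int, str]]:
    """Return (line_number, level, text) for headings outside fenced code."""
    headings = []
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if FENCE.match(line):
            in_fence = not in_fence  # toggle on every opening or closing fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            headings.append((number, len(match.group(1)), match.group(2).strip()))
    return headings


def find_heading_jumps(markdown: str) -> List[str]:
    """Report headings that skip a level, e.g. H1 followed directly by H3."""
    issues = []
    headings = find_headings(markdown)
    for (_, prev_level, _), (number, level, text) in zip(headings, headings[1:]):
        if level > prev_level + 1:
            issues.append(f"line {number}: H{prev_level} -> H{level} ({text})")
    return issues


sample = "\n".join([
    "# Title",
    "",
    "```bash",
    "# not a heading, just a shell comment",
    "```",
    "",
    "### Details",
])
print(find_heading_jumps(sample))
```

Carrying the line number along with each heading also makes it possible to fill in the `line_number` field of a result, which most checks in the validator leave empty.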
Advanced Validation Techniques
Custom Rule Development
Custom validators add checks for specific content requirements, such as technical writing style, accessibility, and SEO:
# custom_validators.py - Specialized validation extensions
import re
from abc import ABC, abstractmethod
from typing import Any, Dict, List

# Result types are defined in markdown_validator.py above
from markdown_validator import ValidationCategory, ValidationLevel, ValidationResult


class CustomValidator(ABC):
    """Abstract base class for custom validators"""

    @abstractmethod
    def validate(self, content: str, frontmatter: Dict[str, Any]) -> List[ValidationResult]:
        pass

    @abstractmethod
    def get_name(self) -> str:
        pass


class TechnicalWritingValidator(CustomValidator):
    """Validator for technical writing standards"""

    def __init__(self):
        self.prohibited_phrases = [
            "click here",
            "read more",
            "as you can see",
            "obviously",
            "simply",
            "just",
            "easy"
        ]
        self.required_patterns = {
            'api_documentation': [
                r'## Parameters?',
                r'## Response',
                r'## Example'
            ],
            'tutorial': [
                r'## Prerequisites?',
                r'## Step \d+',
                r'## Conclusion'
            ]
        }

    def validate(self, content: str, frontmatter: Dict[str, Any]) -> List[ValidationResult]:
        results = []
        content_lower = content.lower()
        # Check for prohibited phrases. Match whole words only, so "just"
        # does not fire on "adjust" and "easy" does not fire on "uneasy".
        for phrase in self.prohibited_phrases:
            if re.search(rf'\b{re.escape(phrase)}\b', content_lower):
                results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.STYLE,
                    message=f"Avoid using '{phrase}' in technical writing",
                    suggestion=f"Replace '{phrase}' with more specific language"
                ))
        # Check passive voice usage
        passive_patterns = [
            r'\b\w+ed\s+by\b',
            r'\bis\s+\w+ed\b',
            r'\bare\s+\w+ed\b',
            r'\bwas\s+\w+ed\b',
            r'\bwere\s+\w+ed\b'
        ]
        passive_count = 0
        for pattern in passive_patterns:
            passive_count += len(re.findall(pattern, content_lower))
        total_sentences = len(re.findall(r'[.!?]+', content))
        if total_sentences > 0 and (passive_count / total_sentences) > 0.3:
            results.append(ValidationResult(
                level=ValidationLevel.INFO,
                category=ValidationCategory.STYLE,
                message=f"High passive voice usage ({passive_count}/{total_sentences} sentences)",
                suggestion="Consider using more active voice constructions"
            ))
        # Check document type specific patterns
        # (str(... or '') tolerates a missing or non-string category)
        doc_category = str(frontmatter.get('category') or '').lower()
        if doc_category in self.required_patterns:
            for pattern in self.required_patterns[doc_category]:
                if not re.search(pattern, content, re.IGNORECASE):
                    results.append(ValidationResult(
                        level=ValidationLevel.WARNING,
                        category=ValidationCategory.STRUCTURE,
                        message=f"Missing expected section for {doc_category}: {pattern}",
                        suggestion=f"Add section matching pattern: {pattern}"
                    ))
        return results

    def get_name(self) -> str:
        return "Technical Writing Standards"


class AccessibilityValidator(CustomValidator):
    """Validator for accessibility compliance"""

    def validate(self, content: str, frontmatter: Dict[str, Any]) -> List[ValidationResult]:
        results = []
        # Check image alt text quality
        image_pattern = re.compile(r'!\[([^\]]*)\]\(([^)]+)\)')
        images = image_pattern.findall(content)
        for alt_text, image_url in images:
            if not alt_text.strip():
                results.append(ValidationResult(
                    level=ValidationLevel.ERROR,
                    category=ValidationCategory.IMAGES,
                    message=f"Image missing alt text: {image_url}",
                    suggestion="Add descriptive alt text for screen readers"
                ))
            elif len(alt_text.split()) < 3:
                results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.IMAGES,
                    message=f"Alt text may be too brief: '{alt_text}'",
                    suggestion="Provide more descriptive alt text (3+ words)"
                ))
            elif alt_text.lower().startswith(('image of', 'picture of', 'photo of')):
                results.append(ValidationResult(
                    level=ValidationLevel.INFO,
                    category=ValidationCategory.IMAGES,
                    message=f"Alt text contains redundant prefix: '{alt_text}'",
                    suggestion="Remove 'image of', 'picture of', etc. from alt text"
                ))
        # Check heading hierarchy
        heading_pattern = re.compile(r'^(#{1,6})\s+(.+)$', re.MULTILINE)
        headings = heading_pattern.findall(content)
        if headings:
            levels = [len(h[0]) for h in headings]
            # Check if starts with H1
            if levels[0] != 1:
                results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.STRUCTURE,
                    message="Document should start with H1 heading",
                    suggestion="Use # for the main document title"
                ))
            # Check for skipped heading levels
            for i in range(1, len(levels)):
                if levels[i] > levels[i-1] + 1:
                    results.append(ValidationResult(
                        level=ValidationLevel.WARNING,
                        category=ValidationCategory.STRUCTURE,
                        message=f"Heading level jumps from H{levels[i-1]} to H{levels[i]}",
                        suggestion="Use sequential heading levels for proper document structure"
                    ))
        # Check for meaningful link text
        link_pattern = re.compile(r'\[([^\]]+)\]\(([^)]+)\)')
        links = link_pattern.findall(content)
        problematic_link_texts = ['click here', 'here', 'read more', 'more', 'link', 'this']
        for link_text, link_url in links:
            if link_text.lower().strip() in problematic_link_texts:
                results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.LINKS,
                    message=f"Non-descriptive link text: '{link_text}'",
                    suggestion="Use descriptive text that explains the link destination"
                ))
        return results

    def get_name(self) -> str:
        return "Accessibility Compliance"
class SEOValidator(CustomValidator):
    """Validator for SEO best practices"""
    def validate(self, content: str, frontmatter: Dict[str, Any]) -> List[ValidationResult]:
        results = []
        # Check title length
        title = frontmatter.get('title', '')
        if title:
            title_length = len(title)
            if title_length < 30:
                results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.FRONTMATTER,
                    message=f"Title may be too short for SEO ({title_length} chars)",
                    suggestion="Consider expanding title to 30-60 characters"
                ))
            elif title_length > 60:
                results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.FRONTMATTER,
                    message=f"Title may be too long for SEO ({title_length} chars)",
                    suggestion="Consider shortening title to under 60 characters"
                ))
        # Check description
        description = frontmatter.get('description', '')
        if description:
            desc_length = len(description)
            if desc_length < 120:
                results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.FRONTMATTER,
                    message=f"Description may be too short for SEO ({desc_length} chars)",
                    suggestion="Consider expanding description to 120-160 characters"
                ))
            elif desc_length > 160:
                results.append(ValidationResult(
                    level=ValidationLevel.WARNING,
                    category=ValidationCategory.FRONTMATTER,
                    message=f"Description may be too long for SEO ({desc_length} chars)",
                    suggestion="Consider shortening description to under 160 characters"
                ))
        # Check keyword usage in content
        keywords = frontmatter.get('keywords', '')
        if keywords:
            keyword_list = [k.strip().lower() for k in keywords.split(',')]
            content_lower = content.lower()
            for keyword in keyword_list[:3]:  # Check first 3 keywords
                if keyword not in content_lower:
                    results.append(ValidationResult(
                        level=ValidationLevel.INFO,
                        category=ValidationCategory.CONTENT,
                        message=f"Keyword '{keyword}' not found in content",
                        suggestion="Consider incorporating keywords naturally in content"
                    ))
        # Check heading structure for SEO
        heading_pattern = re.compile(r'^(#{1,6})\s+(.+)$', re.MULTILINE)
        headings = heading_pattern.findall(content)
        h1_count = len([h for h in headings if len(h[0]) == 1])
        if h1_count == 0:
            results.append(ValidationResult(
                level=ValidationLevel.WARNING,
                category=ValidationCategory.STRUCTURE,
                message="Missing H1 heading for SEO",
                suggestion="Add a main H1 heading"
            ))
        elif h1_count > 1:
            results.append(ValidationResult(
                level=ValidationLevel.WARNING,
                category=ValidationCategory.STRUCTURE,
                message=f"Multiple H1 headings found ({h1_count})",
                suggestion="Use only one H1 heading per document"
            ))
        return results
    def get_name(self) -> str:
        return "SEO Best Practices"
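One caveat with the heading check above: the heading regex is applied to the raw content, so a line such as `# install deps` inside a fenced code block is counted as an H1. A minimal sketch of one way to avoid that is shown here. The `count_h1` helper and `FENCE_PATTERN` are hypothetical names introduced for illustration, and the fence handling is deliberately simplified (it does not cover indented or unclosed fences).

```python
import re

# Hypothetical helper pattern: matches a fenced block opened and closed by the
# same run of three or more backticks or tildes at the start of a line.
FENCE_PATTERN = re.compile(r'^(`{3,}|~{3,}).*?^\1[ \t]*$', re.MULTILINE | re.DOTALL)

# Same heading pattern the validator uses.
HEADING_PATTERN = re.compile(r'^(#{1,6})\s+(.+)$', re.MULTILINE)


def count_h1(content: str) -> int:
    """Count H1 headings, ignoring lines inside fenced code blocks."""
    prose_only = FENCE_PATTERN.sub('', content)
    return sum(1 for hashes, _ in HEADING_PATTERN.findall(prose_only)
               if len(hashes) == 1)


# Build the sample fence programmatically to keep this example easy to embed.
fence = "`" * 3
sample = (
    "# Title\n\n"
    f"{fence}bash\n# install deps\npip install pyyaml\n{fence}\n\n"
    "Body text.\n"
)

naive_h1 = len([h for h in HEADING_PATTERN.findall(sample) if len(h[0]) == 1])
print(naive_h1, count_h1(sample))
```

In a validator like the one above, the stripped text could be used only for heading analysis, while keyword checks continue to run on the full content.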
Integration with Development Workflows
Validation is most useful when it runs inside automated testing and CI/CD pipelines: authors get feedback on every change, and content that fails a check does not reach production.
Combined with version control and Git integration, the same validators can run as pre-commit hooks and as automated checks during review, which keeps editorial standards consistent across distributed teams and large content repositories.
Schema validation also complements Progressive Web App documentation systems by ensuring that content structure meets API requirements and that metadata stays consistent enough to support offline functionality and interactive documentation features.
Automated Quality Assurance Workflows
CI/CD Integration
Implementing comprehensive validation in continuous integration pipelines:
# .github/workflows/content-validation.yml
name: Content Quality Validation
on:
  push:
    branches: [ main, develop ]
    paths: ['**/*.md']
  pull_request:
    branches: [ main ]
    paths: ['**/*.md']
jobs:
  validate-content:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
          cache: 'pip'
          # Point the pip cache at the requirements file this workflow installs
          cache-dependency-path: requirements-validation.txt
      - name: Install validation dependencies
        run: |
          # pathlib ships with Python 3, so it is not installed from PyPI
          pip install PyYAML requests
          pip install -r requirements-validation.txt
      - name: Run Markdown validation
        run: |
          python scripts/markdown_validator.py content/ \
            --schema config/content-schema.yml \
            --format json > validation-results.json
      - name: Generate validation report
        run: |
          python scripts/generate_validation_report.py \
            --input validation-results.json \
            --output validation-report.md
      - name: Check validation results
        run: |
          error_count=$(jq '[.[] | select(.level == "error")] | length' validation-results.json)
          if [ "$error_count" -gt 0 ]; then
            echo "❌ Content validation failed with $error_count errors"
            exit 1
          else
            echo "✅ Content validation passed"
          fi
      - name: Comment PR with validation results
        # always() keeps this step running when the check above fails,
        # which is when authors most need the report
        if: always() && github.event_name == 'pull_request'
        uses: actions/github-script@v6
        with:
          script: |
            const fs = require('fs');
            try {
              const report = fs.readFileSync('validation-report.md', 'utf8');
              // Await the request so a failed API call is caught below
              await github.rest.issues.createComment({
                issue_number: context.issue.number,
                owner: context.repo.owner,
                repo: context.repo.repo,
                body: report
              });
            } catch (error) {
              console.error('Failed to post validation report:', error);
            }
      - name: Upload validation artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: validation-results
          path: |
            validation-results.json
            validation-report.md
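The workflow calls `scripts/generate_validation_report.py`, which is not shown in this guide. As a rough sketch of what such a script might do, the following turns a list of results into a short Markdown summary. Only the `level` field is confirmed by the `jq` expression in the workflow; the `file` and `message` fields, and the function names, are assumptions made for illustration.

```python
import json
from collections import Counter
from typing import Any, Dict, List


def count_by_level(results: List[Dict[str, Any]]) -> Counter:
    """Tally results by their 'level' field (error, warning, info)."""
    return Counter(str(r.get("level", "unknown")).lower() for r in results)


def render_report(results: List[Dict[str, Any]]) -> str:
    """Render a Markdown summary that lists every error individually."""
    counts = count_by_level(results)
    lines = [
        "## Content validation report",
        "",
        f"- Errors: {counts['error']}",
        f"- Warnings: {counts['warning']}",
        f"- Info: {counts['info']}",
        "",
    ]
    for result in results:
        if str(result.get("level", "")).lower() == "error":
            # 'file' and 'message' are assumed field names
            location = result.get("file", "unknown file")
            lines.append(f"- `{location}`: {result.get('message', '')}")
    return "\n".join(lines)


# Sample data standing in for validation-results.json
sample = json.loads(
    '[{"level": "error", "file": "a.md", "message": "Missing title"},'
    ' {"level": "warning", "file": "b.md", "message": "Title may be too short"}]'
)
report = render_report(sample)
print(report)
```

Keeping the tally in Python as well as in `jq` means the report and the pass/fail gate can be checked against the same data if the result format changes.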
Pre-commit Hooks
Setting up local validation with Git hooks:
#!/bin/bash
# .git/hooks/pre-commit - Content validation before commits
echo "🔍 Running content validation..."
# Find all staged Markdown files
staged_files=$(git diff --cached --name-only --diff-filter=ACM | grep '\.md$')
if [ -z "$staged_files" ]; then
    echo "No Markdown files to validate"
    exit 0
fi
# Create temporary directory for staged content
temp_dir=$(mktemp -d)
validation_failed=false
# Read one path per line so file names containing spaces stay intact
while IFS= read -r file; do
    # Mirror the repository path so same-named files do not overwrite each other
    staged_copy="$temp_dir/$file"
    mkdir -p "$(dirname "$staged_copy")"
    # Get staged version of file
    git show ":$file" > "$staged_copy"
    # Validate the staged content
    if ! python scripts/markdown_validator.py "$staged_copy" --errors-only; then
        echo "❌ Validation failed for: $file"
        validation_failed=true
    fi
done <<< "$staged_files"
# Cleanup
rm -rf "$temp_dir"
if [ "$validation_failed" = true ]; then
    echo ""
    echo "❌ Content validation failed. Please fix issues before committing."
    echo "   Run 'python scripts/markdown_validator.py <file>' for detailed results"
    exit 1
fi
echo "✅ All content validation checks passed"
exit 0
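If the hook is ever ported to Python, the file-selection step can be made robust to unusual file names by asking Git for NUL-delimited output (`git diff --cached --name-only -z --diff-filter=ACM`) and filtering it in code. The sketch below covers only the filtering; `select_markdown_paths` is a hypothetical helper, and the sample bytes stand in for real Git output.

```python
from typing import List


def select_markdown_paths(raw_output: bytes) -> List[str]:
    """Split NUL-delimited path output and keep only paths ending in .md."""
    paths = [part.decode("utf-8") for part in raw_output.split(b"\0") if part]
    # Case-sensitive match, mirroring the grep '\.md$' filter in the hook
    return [path for path in paths if path.endswith(".md")]


# Sample bytes standing in for the output of the git command named above
raw = b"docs/getting started.md\0scripts/build.py\0README.md\0"
markdown_paths = select_markdown_paths(raw)
print(markdown_paths)
```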
Advanced Schema Patterns
Dynamic Schema Generation
Creating adaptive validation schemas based on content analysis:
# schema_generator.py - Dynamic schema creation
import re
from collections import Counter
from pathlib import Path
from typing import Any, Dict
import yaml
class SchemaGenerator:
    def __init__(self):
        self.content_analysis = {
            'frontmatter_fields': Counter(),
            'heading_patterns': [],
            'content_lengths': [],
            'common_sections': Counter(),
            'link_patterns': [],
            'image_patterns': []
        }
    def analyze_content_directory(self, directory_path: str):
        """Analyze existing content to generate schema patterns"""
        for file_path in Path(directory_path).rglob("*.md"):
            self.analyze_file(str(file_path))
    def analyze_file(self, file_path: str):
        """Analyze individual file for schema patterns"""
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
        except (OSError, UnicodeDecodeError):
            # Skip files that cannot be read or decoded
            return
        # Parse frontmatter
        frontmatter_match = re.match(r'^---\n(.*?)\n---', content, re.DOTALL)
        if frontmatter_match:
            try:
                frontmatter = yaml.safe_load(frontmatter_match.group(1))
                if isinstance(frontmatter, dict):
                    for field in frontmatter.keys():
                        self.content_analysis['frontmatter_fields'][field] += 1
            except yaml.YAMLError:
                # Malformed frontmatter contributes no field counts
                pass
        # Analyze content structure
        content_body = content[frontmatter_match.end():] if frontmatter_match else content
        # Word count
        word_count = len(re.findall(r'\b\w+\b', content_body))
        self.content_analysis['content_lengths'].append(word_count)
        # Heading analysis
        headings = re.findall(r'^(#{1,6})\s+(.+)$', content_body, re.MULTILINE)
        for level_hashes, heading_text in headings:
            level = len(level_hashes)
            self.content_analysis['heading_patterns'].append((level, heading_text.lower()))
            # Common sections (checked for each heading)
            section_keywords = ['introduction', 'conclusion', 'overview', 'getting started',
                                'installation', 'usage', 'examples', 'troubleshooting']
            for keyword in section_keywords:
                if keyword in heading_text.lower():
                    self.content_analysis['common_sections'][keyword] += 1
    def generate_schema(self, document_type: str = "auto_generated") -> Dict[str, Any]:
        """Generate validation schema based on analyzed content"""
        # Determine common frontmatter fields
        total_files = len(self.content_analysis['content_lengths'])
        required_threshold = 0.8  # Field must appear in 80% of files
        optional_threshold = 0.3  # Field must appear in 30% of files
        required_fields = []
        optional_fields = []
        for field, count in self.content_analysis['frontmatter_fields'].items():
            frequency = count / total_files
            if frequency >= required_threshold:
                required_fields.append(field)
            elif frequency >= optional_threshold:
                optional_fields.append(field)
        # Content length analysis
        if self.content_analysis['content_lengths']:
            min_length = int(min(self.content_analysis['content_lengths']) * 0.5)
            max_length = int(max(self.content_analysis['content_lengths']) * 1.5)
        else:
            min_length, max_length = 100, 5000
        # Generate schema
        schema = {
            'schema_version': '1.0',
            'generated_from_analysis': True,
            'analysis_stats': {
                'files_analyzed': total_files,
                'avg_word_count': sum(self.content_analysis['content_lengths']) // total_files if total_files > 0 else 0,
                'common_sections': dict(self.content_analysis['common_sections'].most_common(10))
            },
            'document_types': {
                document_type: {
                    'required_frontmatter': required_fields,
                    'optional_frontmatter': optional_fields,
                    'frontmatter_rules': self.generate_frontmatter_rules(),
                    'content_rules': {
                        'min_word_count': min_length,
                        'max_word_count': max_length,
                        'heading_structure': {
                            'require_h1': True,
                            'max_heading_level': 4,
                            'sequential_levels': True
                        }
                    }
                }
            }
        }
        return schema
    def generate_frontmatter_rules(self) -> Dict[str, Dict[str, Any]]:
        """Generate specific rules for frontmatter fields"""
        rules = {}
        # Standard rules for common fields
        if 'title' in self.content_analysis['frontmatter_fields']:
            rules['title'] = {
                'type': 'string',
                'min_length': 10,
                'max_length': 100,
                'pattern': r'^[A-Za-z0-9\s\-:,]+$'
            }
        if 'description' in self.content_analysis['frontmatter_fields']:
            rules['description'] = {
                'type': 'string',
                'min_length': 50,
                'max_length': 200,
                'must_end_with_period': True
            }
        if 'date' in self.content_analysis['frontmatter_fields']:
            rules['date'] = {
                'type': 'date',
                'format': 'YYYY-MM-DD',
                'not_future': True
            }
        if 'category' in self.content_analysis['frontmatter_fields']:
            # Analyze existing categories
            # This would require additional parsing, simplified for example
            rules['category'] = {
                'type': 'string',
                'allowed_values': ['Tutorial', 'Guide', 'Reference', 'News']
            }
        return rules
    def save_schema(self, schema: Dict[str, Any], output_path: str):
        """Save generated schema to file"""
        with open(output_path, 'w', encoding='utf-8') as f:
            yaml.dump(schema, f, default_flow_style=False, indent=2)
# Usage example
def generate_schema_from_content():
    generator = SchemaGenerator()
    generator.analyze_content_directory('content/')
    schema = generator.generate_schema('blog_post')
    generator.save_schema(schema, 'generated-schema.yml')
    print("Schema generated successfully!")
if __name__ == "__main__":
    generate_schema_from_content()
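The core of the generator is the frequency rule: a field seen in at least 80% of files becomes required, one seen in at least 30% becomes optional, and anything rarer is left out of the schema. The standalone sketch below isolates that rule so it can be tried on small inputs. `classify_fields` is a hypothetical helper, not a method of the class above, and the sample counts are made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def classify_fields(field_counts: Counter, total_files: int,
                    required_threshold: float = 0.8,
                    optional_threshold: float = 0.3) -> Dict[str, List[str]]:
    """Sort frontmatter fields into required/optional/ignored by frequency."""
    classified: Dict[str, List[str]] = {"required": [], "optional": [], "ignored": []}
    if total_files <= 0:
        # Nothing analyzed, so nothing can be classified
        return classified
    for field, count in sorted(field_counts.items()):
        frequency = count / total_files
        if frequency >= required_threshold:
            classified["required"].append(field)
        elif frequency >= optional_threshold:
            classified["optional"].append(field)
        else:
            classified["ignored"].append(field)
    return classified


# Made-up counts for a repository of 10 analyzed files
counts = Counter({"title": 10, "date": 8, "tags": 4, "draft": 1})
result = classify_fields(counts, total_files=10)
print(result)
```

Exposing the thresholds as parameters makes it straightforward to tighten or relax the generated schema for repositories where frontmatter usage is less uniform.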
Conclusion
Advanced Markdown data validation and schema checking provides the foundation for maintaining high-quality documentation at scale while enabling sophisticated automation workflows that support distributed teams and complex content requirements. By implementing comprehensive validation frameworks, automated quality assurance systems, and intelligent schema management, technical teams can ensure consistent content quality while reducing manual oversight and accelerating content production processes.
The key to successful validation implementation lies in balancing automation with flexibility, ensuring that validation systems support content creators rather than hindering their workflow. Whether you're building internal documentation systems, managing open-source project documentation, or creating comprehensive knowledge bases, the validation techniques covered in this guide provide the tools necessary for maintaining professional content standards while scaling your documentation efforts effectively.
Iterate on your validation schemas based on real-world usage patterns, roll out new validation rules gradually so that existing workflows are not disrupted, and document validation requirements clearly so that content creators understand how to meet them. With properly implemented data validation and schema checking, your Markdown documentation can achieve enterprise-level quality assurance while keeping the simplicity and accessibility that make Markdown a good fit for technical documentation.