Markdown Table Data Validation and Quality Assurance: Complete Guide for Content Integrity and Data Quality Management
Markdown table validation checks that tabular content across a documentation repository is structurally sound and internally consistent. By combining validation rules, automated quality checks, and systematic error detection, technical teams can catch table problems before publication and keep data quality steady as the repository grows.
Why Master Markdown Table Data Validation?
Validating table data brings several concrete benefits to a content management workflow:
- Data Integrity: Ensure accuracy and consistency of tabular content across all documentation
- Quality Assurance: Maintain professional standards through automated validation and verification
- Error Prevention: Detect and prevent data quality issues before content publication
- Compliance Standards: Meet data quality requirements for regulated industries and documentation standards
- User Confidence: Provide reliable, accurate information that users can depend on for decision-making
Foundation Validation Concepts
Basic Table Structure Validation
Understanding fundamental validation principles for Markdown table integrity:
# Basic Table Validation Examples
## Well-Formed Table Structure
| Feature | Status | Version | Notes |
|---------|--------|---------|-------|
| Authentication | ✅ Active | 2.1.0 | OAuth 2.0 support |
| Rate Limiting | ✅ Active | 2.0.5 | 1000 requests/hour |
| Caching | 🚧 Beta | 2.2.0 | Redis implementation |
| Monitoring | ❌ Planned | 3.0.0 | Grafana dashboard |
## Validation Requirements
✅ **Valid Structure:**
- Consistent column count across all rows
- Proper header row with separator
- Clean cell content without formatting conflicts
- Appropriate data types for each column
❌ **Common Issues to Detect:**
- Inconsistent column counts
- Missing separator rows
- Malformed cell content
- Data type mismatches
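The first two issues in the list above, missing separator rows and inconsistent column counts, can be detected with very little code. The following is a minimal sketch, not a full validator: `checkTableStructure` is a hypothetical helper name, and it assumes the input holds exactly one table whose cells contain no escaped pipes.

```javascript
// Hypothetical helper (not part of any library): reports structural problems in one table.
function checkTableStructure(tableText) {
  const issues = [];
  const lines = tableText
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 0);

  // Naive cell split: assumes no escaped pipes inside cells
  const toCells = line => line.replace(/^\||\|$/g, '').split('|').map(cell => cell.trim());
  // A separator row holds only pipes, colons, dashes, and whitespace, with at least one dash
  const isSeparator = line => /^[\s|:-]+$/.test(line) && line.includes('-');

  // The separator must directly follow the header row
  if (lines.length < 2 || !isSeparator(lines[1])) {
    issues.push('Missing separator row after header');
  }

  // Every data row must have as many cells as the header
  const expected = toCells(lines[0] || '').length;
  lines.forEach((line, index) => {
    if (index === 0 || isSeparator(line)) return;
    const count = toCells(line).length;
    if (count !== expected) {
      issues.push(`Line ${index + 1}: ${count} columns, expected ${expected}`);
    }
  });

  return issues;
}

const broken = [
  '| Feature | Status | Version |',
  '|---------|--------|---------|',
  '| Caching | Beta |'
].join('\n');

// Reports one issue: 'Line 3: 2 columns, expected 3'
console.log(checkTableStructure(broken));
```

Line numbers in the messages are relative to the table text passed in, so a caller working on a whole file would need to add the table's starting offset.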
Advanced Data Type Validation
Implementing sophisticated validation rules for different data types:
// table-data-validator.js - Advanced table data validation system
class MarkdownTableValidator {
constructor(options = {}) {
this.options = {
// Spread caller options first so the normalized defaults below are not overwritten by undefined values
...options,
strictMode: options.strictMode || false,
allowEmptyCells: options.allowEmptyCells !== false,
maxCellLength: options.maxCellLength || 1000,
customValidators: options.customValidators || {}
};
this.builtInValidators = {
'string': this.validateString.bind(this),
'number': this.validateNumber.bind(this),
'integer': this.validateInteger.bind(this),
'float': this.validateFloat.bind(this),
'boolean': this.validateBoolean.bind(this),
'date': this.validateDate.bind(this),
'url': this.validateUrl.bind(this),
'email': this.validateEmail.bind(this),
'version': this.validateVersion.bind(this),
'status': this.validateStatus.bind(this),
'enum': this.validateEnum.bind(this),
'regex': this.validateRegex.bind(this),
'json': this.validateJson.bind(this),
'markdown': this.validateMarkdown.bind(this)
};
this.validationResults = {
tables: [],
summary: {
totalTables: 0,
validTables: 0,
errorsFound: 0,
warningsFound: 0
}
};
}
async validateDocument(markdownContent, filePath = '') {
console.log(`Validating tables in ${filePath || 'document'}...`);
const tables = this.extractTables(markdownContent);
// Accumulate (like the other counters) so the summary stays correct across several documents
this.validationResults.summary.totalTables += tables.length;
for (let i = 0; i < tables.length; i++) {
const table = tables[i];
const tableResult = await this.validateTable(table, i, filePath);
this.validationResults.tables.push(tableResult);
if (tableResult.isValid) {
this.validationResults.summary.validTables++;
}
this.validationResults.summary.errorsFound += tableResult.errors.length;
this.validationResults.summary.warningsFound += tableResult.warnings.length;
}
return this.validationResults;
}
extractTables(markdownContent) {
const tables = [];
const lines = markdownContent.split('\n');
let currentTable = null;
let lineNumber = 0;
for (let i = 0; i < lines.length; i++) {
lineNumber = i + 1;
const line = lines[i].trim();
// Check if this looks like a table row
if (line.includes('|') && line.length > 0) {
if (!currentTable) {
// Start of new table
currentTable = {
startLine: lineNumber,
endLine: lineNumber,
headers: [],
separatorLine: null,
rows: [],
rawLines: []
};
}
currentTable.endLine = lineNumber;
currentTable.rawLines.push({
content: line,
lineNumber: lineNumber
});
// Parse the line
const cells = this.parseCells(line);
if (currentTable.headers.length === 0 && !this.isSeparatorLine(line)) {
// First non-separator line is headers
currentTable.headers = cells;
} else if (this.isSeparatorLine(line)) {
// Separator line
currentTable.separatorLine = {
content: line,
lineNumber: lineNumber,
alignments: this.parseAlignments(line)
};
} else if (currentTable.separatorLine) {
// Data row (only count after separator)
currentTable.rows.push({
cells: cells,
lineNumber: lineNumber,
rawContent: line
});
}
} else if (currentTable) {
// End of current table
tables.push(currentTable);
currentTable = null;
}
}
// Add final table if exists
if (currentTable) {
tables.push(currentTable);
}
return tables;
}
parseCells(line) {
// Remove leading and trailing pipes, split on pipes
const cleanLine = line.replace(/^\||\|$/g, '');
return cleanLine.split('|').map(cell => cell.trim());
}
isSeparatorLine(line) {
// Check if line contains only |, -, :, and whitespace
return /^[\s\|:\-]+$/.test(line) && line.includes('-');
}
parseAlignments(separatorLine) {
const cells = this.parseCells(separatorLine);
return cells.map(cell => {
if (cell.startsWith(':') && cell.endsWith(':')) {
return 'center';
} else if (cell.endsWith(':')) {
return 'right';
} else {
return 'left';
}
});
}
async validateTable(table, tableIndex, filePath) {
const result = {
tableIndex,
filePath,
startLine: table.startLine,
endLine: table.endLine,
isValid: true,
errors: [],
warnings: [],
suggestions: [],
structure: this.analyzeTableStructure(table),
dataQuality: await this.analyzeDataQuality(table)
};
// Validate table structure
this.validateTableStructure(table, result);
// Validate data consistency
await this.validateDataConsistency(table, result);
// Validate cell content
await this.validateCellContent(table, result);
// Generate suggestions
this.generateSuggestions(table, result);
// Determine overall validity
result.isValid = result.errors.length === 0;
return result;
}
analyzeTableStructure(table) {
return {
headerCount: table.headers.length,
rowCount: table.rows.length,
hasSeparator: !!table.separatorLine,
columnAlignments: table.separatorLine ? table.separatorLine.alignments : [],
avgRowLength: table.rows.length > 0 ?
table.rows.reduce((sum, row) => sum + row.cells.length, 0) / table.rows.length : 0,
maxCellLength: this.getMaxCellLength(table),
emptyRows: table.rows.filter(row => row.cells.every(cell => !cell)).length
};
}
getMaxCellLength(table) {
let maxLength = 0;
// Check headers
table.headers.forEach(header => {
maxLength = Math.max(maxLength, header.length);
});
// Check data rows
table.rows.forEach(row => {
row.cells.forEach(cell => {
maxLength = Math.max(maxLength, cell.length);
});
});
return maxLength;
}
async analyzeDataQuality(table) {
const quality = {
completeness: 0,
consistency: 0,
uniqueness: {},
patterns: {},
dataTypes: {}
};
if (table.rows.length === 0) {
return quality;
}
const totalCells = table.rows.length * table.headers.length;
let filledCells = 0;
// Analyze each column
for (let colIndex = 0; colIndex < table.headers.length; colIndex++) {
const columnData = table.rows.map(row => row.cells[colIndex] || '');
const columnName = table.headers[colIndex];
// Calculate completeness
const nonEmptyCells = columnData.filter(cell => cell.trim().length > 0);
filledCells += nonEmptyCells.length;
// Analyze uniqueness (empty cells are excluded from both the unique and duplicate counts)
const uniqueValues = new Set(nonEmptyCells.map(cell => cell.trim()));
quality.uniqueness[columnName] = {
totalValues: columnData.length,
uniqueValues: uniqueValues.size,
duplicates: nonEmptyCells.length - uniqueValues.size
};
// Analyze patterns
quality.patterns[columnName] = this.analyzeColumnPatterns(columnData);
// Detect data types
quality.dataTypes[columnName] = this.detectDataType(columnData);
}
quality.completeness = (filledCells / totalCells) * 100;
return quality;
}
analyzeColumnPatterns(columnData) {
const patterns = {
empty: 0,
numeric: 0,
alphabetic: 0,
alphanumeric: 0,
url: 0,
email: 0,
date: 0,
common: new Map()
};
columnData.forEach(cell => {
const trimmed = cell.trim();
if (!trimmed) {
patterns.empty++;
return;
}
// Count pattern occurrences
if (/^\d+$/.test(trimmed)) patterns.numeric++;
if (/^[a-zA-Z\s]+$/.test(trimmed)) patterns.alphabetic++;
if (/^[a-zA-Z0-9\s]+$/.test(trimmed)) patterns.alphanumeric++;
if (/^https?:\/\//.test(trimmed)) patterns.url++;
if (/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(trimmed)) patterns.email++;
if (this.isDateLike(trimmed)) patterns.date++;
// Track common values
patterns.common.set(trimmed, (patterns.common.get(trimmed) || 0) + 1);
});
return patterns;
}
detectDataType(columnData) {
const nonEmpty = columnData.filter(cell => cell.trim());
if (nonEmpty.length === 0) return 'empty';
const totalCount = nonEmpty.length;
let numericCount = 0;
let integerCount = 0;
let booleanCount = 0;
let dateCount = 0;
let urlCount = 0;
let emailCount = 0;
nonEmpty.forEach(cell => {
const trimmed = cell.trim();
if (!isNaN(trimmed) && !isNaN(parseFloat(trimmed))) {
numericCount++;
if (Number.isInteger(parseFloat(trimmed))) {
integerCount++;
}
}
if (/^(true|false|yes|no|on|off|enabled|disabled|active|inactive)$/i.test(trimmed)) {
booleanCount++;
}
if (this.isDateLike(trimmed)) {
dateCount++;
}
if (/^https?:\/\//.test(trimmed)) {
urlCount++;
}
if (/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(trimmed)) {
emailCount++;
}
});
// Determine primary data type based on majority
const threshold = totalCount * 0.8; // 80% threshold
if (integerCount >= threshold) return 'integer';
if (numericCount >= threshold) return 'number';
if (booleanCount >= threshold) return 'boolean';
if (dateCount >= threshold) return 'date';
if (urlCount >= threshold) return 'url';
if (emailCount >= threshold) return 'email';
return 'string';
}
isDateLike(value) {
// Simple date detection - could be enhanced
const datePatterns = [
/^\d{4}-\d{2}-\d{2}$/, // YYYY-MM-DD
/^\d{2}\/\d{2}\/\d{4}$/, // MM/DD/YYYY
/^\d{2}-\d{2}-\d{4}$/, // MM-DD-YYYY
/^\d{1,2}\/\d{1,2}\/\d{2,4}$/, // M/D/YY or MM/DD/YYYY
/^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2},?\s+\d{4}$/i
];
return datePatterns.some(pattern => pattern.test(value.trim()));
}
validateTableStructure(table, result) {
// Check for separator line
if (!table.separatorLine) {
result.errors.push({
type: 'structure',
severity: 'error',
message: 'Missing table separator line',
line: table.startLine + 1,
suggestion: 'Add a separator line with dashes (e.g., |---|---|)'
});
}
// Check header count consistency
if (table.rows.length > 0) {
const expectedColumns = table.headers.length;
table.rows.forEach(row => {
if (row.cells.length !== expectedColumns) {
result.errors.push({
type: 'structure',
severity: 'error',
message: `Row has ${row.cells.length} columns, expected ${expectedColumns}`,
line: row.lineNumber,
suggestion: 'Ensure all rows have the same number of columns'
});
}
});
}
// Check for empty headers
table.headers.forEach((header, index) => {
if (!header.trim()) {
result.warnings.push({
type: 'structure',
severity: 'warning',
message: `Empty header in column ${index + 1}`,
line: table.startLine,
suggestion: 'Provide descriptive column headers'
});
}
});
// Check separator alignment consistency
if (table.separatorLine && table.separatorLine.alignments.length !== table.headers.length) {
result.errors.push({
type: 'structure',
severity: 'error',
message: 'Separator columns do not match header columns',
line: table.separatorLine.lineNumber,
suggestion: 'Ensure separator has same number of columns as headers'
});
}
}
async validateDataConsistency(table, result) {
// Check data type consistency within columns
for (let colIndex = 0; colIndex < table.headers.length; colIndex++) {
const columnName = table.headers[colIndex];
const columnData = table.rows.map(row => row.cells[colIndex] || '');
const detectedType = result.dataQuality.dataTypes[columnName];
// Validate each cell against detected type
columnData.forEach((cellValue, rowIndex) => {
if (!cellValue.trim()) {
if (!this.options.allowEmptyCells) {
result.warnings.push({
type: 'data',
severity: 'warning',
message: `Empty cell in column "${columnName}"`,
line: table.rows[rowIndex].lineNumber,
column: colIndex + 1,
suggestion: 'Consider providing a value or using "N/A" placeholder'
});
}
return;
}
const validation = this.validateCellDataType(cellValue, detectedType);
if (!validation.isValid) {
result.errors.push({
type: 'data',
severity: 'error',
message: `Invalid ${detectedType} value: "${cellValue}" in column "${columnName}"`,
line: table.rows[rowIndex].lineNumber,
column: colIndex + 1,
suggestion: validation.suggestion
});
}
});
}
// Check for duplicate rows
const rowHashes = new Map();
table.rows.forEach((row, index) => {
const rowHash = row.cells.join('|').toLowerCase();
if (rowHashes.has(rowHash)) {
result.warnings.push({
type: 'data',
severity: 'warning',
message: 'Duplicate row detected',
line: row.lineNumber,
duplicateOf: rowHashes.get(rowHash),
suggestion: 'Review and remove duplicate entries'
});
} else {
rowHashes.set(rowHash, row.lineNumber);
}
});
}
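validateCellDataType(value, expectedType) {
// Validators supplied through options.customValidators take precedence over built-in ones
const customValidators = this.options.customValidators || {};
const validator = customValidators[expectedType] || this.builtInValidators[expectedType];
if (!validator) {
return { isValid: true, suggestion: '' };
}
return validator(value);
}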
// Built-in validator implementations
validateString(value) {
if (value.length > this.options.maxCellLength) {
return {
isValid: false,
suggestion: `String too long (${value.length} chars, max ${this.options.maxCellLength})`
};
}
return { isValid: true };
}
validateNumber(value) {
if (isNaN(value) || isNaN(parseFloat(value))) {
return {
isValid: false,
suggestion: 'Value should be a valid number'
};
}
return { isValid: true };
}
validateInteger(value) {
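// Number() rejects trailing garbage such as "12abc", which parseFloat() would read as 12
const text = String(value).trim();
const num = Number(text);
if (text === '' || !Number.isInteger(num)) {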
return {
isValid: false,
suggestion: 'Value should be a valid integer'
};
}
return { isValid: true };
}
validateFloat(value) {
if (isNaN(value) || isNaN(parseFloat(value))) {
return {
isValid: false,
suggestion: 'Value should be a valid decimal number'
};
}
return { isValid: true };
}
validateBoolean(value) {
const booleanValues = ['true', 'false', 'yes', 'no', 'on', 'off', '1', '0', 'enabled', 'disabled', 'active', 'inactive'];
if (!booleanValues.includes(value.toLowerCase())) {
return {
isValid: false,
suggestion: `Value should be one of: ${booleanValues.join(', ')}`
};
}
return { isValid: true };
}
validateDate(value) {
const date = new Date(value);
if (isNaN(date.getTime()) && !this.isDateLike(value)) {
return {
isValid: false,
suggestion: 'Value should be a valid date (YYYY-MM-DD, MM/DD/YYYY, etc.)'
};
}
return { isValid: true };
}
validateUrl(value) {
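const suggestion = 'Value should be a valid URL starting with http:// or https://';
try {
// new URL() accepts any scheme (mailto:, ftp:, ...), so check the protocol explicitly
const { protocol } = new URL(value);
if (protocol !== 'http:' && protocol !== 'https:') {
return { isValid: false, suggestion };
}
return { isValid: true };
} catch {
return { isValid: false, suggestion };
}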
}
validateEmail(value) {
const emailRegex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
if (!emailRegex.test(value)) {
return {
isValid: false,
suggestion: 'Value should be a valid email address'
};
}
return { isValid: true };
}
validateVersion(value) {
const versionRegex = /^\d+\.\d+(\.\d+)?(-[a-zA-Z0-9]+)?$/;
if (!versionRegex.test(value)) {
return {
isValid: false,
suggestion: 'Value should be a valid version number (e.g., 1.2.3, 2.0.0-beta)'
};
}
return { isValid: true };
}
validateStatus(value) {
const statusValues = ['active', 'inactive', 'pending', 'completed', 'cancelled', 'draft', 'published'];
if (!statusValues.includes(value.toLowerCase())) {
return {
isValid: false,
suggestion: `Value should be one of: ${statusValues.join(', ')}`
};
}
return { isValid: true };
}
validateEnum(value, options = []) {
if (!options.includes(value)) {
return {
isValid: false,
suggestion: `Value should be one of: ${options.join(', ')}`
};
}
return { isValid: true };
}
validateRegex(value, pattern) {
if (!pattern.test(value)) {
return {
isValid: false,
suggestion: `Value should match pattern: ${pattern.toString()}`
};
}
return { isValid: true };
}
validateJson(value) {
try {
JSON.parse(value);
return { isValid: true };
} catch {
return {
isValid: false,
suggestion: 'Value should be valid JSON'
};
}
}
validateMarkdown(value) {
// Basic markdown validation - check for common issues
const issues = [];
if (value.includes('[') && !value.includes(']')) {
issues.push('Unclosed markdown link');
}
if (value.includes('](') && !value.match(/\[[^\]]*\]\([^)]*\)/)) {
issues.push('Malformed markdown link');
}
if (issues.length > 0) {
return {
isValid: false,
suggestion: issues.join(', ')
};
}
return { isValid: true };
}
async validateCellContent(table, result) {
// Check for potentially problematic content
table.rows.forEach((row, rowIndex) => {
row.cells.forEach((cell, colIndex) => {
const issues = this.detectContentIssues(cell);
issues.forEach(issue => {
result.warnings.push({
type: 'content',
severity: 'warning',
message: issue.message,
line: row.lineNumber,
column: colIndex + 1,
suggestion: issue.suggestion
});
});
});
});
}
detectContentIssues(cellContent) {
const issues = [];
// Check for potentially problematic characters
if (cellContent.includes('|')) {
issues.push({
message: 'Cell contains pipe character which may break table formatting',
suggestion: 'Escape pipe character or use different delimiter'
});
}
// Check for excessive whitespace
if (cellContent !== cellContent.trim()) {
issues.push({
message: 'Cell has leading or trailing whitespace',
suggestion: 'Remove unnecessary whitespace'
});
}
// Check for very long content
if (cellContent.length > this.options.maxCellLength * 0.8) {
issues.push({
message: 'Cell content is very long',
suggestion: 'Consider breaking into multiple rows or using abbreviations'
});
}
// Check for HTML content
if (/<[^>]+>/.test(cellContent)) {
issues.push({
message: 'Cell contains HTML tags',
suggestion: 'Use Markdown formatting instead of HTML'
});
}
return issues;
}
generateSuggestions(table, result) {
// Generate improvement suggestions based on analysis
const suggestions = [];
// Suggest column type annotations if mixed types detected
Object.entries(result.dataQuality.dataTypes).forEach(([columnName, dataType]) => {
const columnData = table.rows.map(row =>
row.cells[table.headers.indexOf(columnName)] || ''
);
const nonEmpty = columnData.filter(cell => cell.trim());
const consistency = this.calculateTypeConsistency(nonEmpty, dataType);
if (consistency < 0.8) {
suggestions.push({
type: 'improvement',
priority: 'medium',
message: `Consider standardizing data format in column "${columnName}"`,
suggestion: `Column appears to have mixed data types. Consider using consistent ${dataType} format.`,
column: columnName
});
}
});
// Suggest sorting if data appears sortable
if (table.rows.length > 3) {
const firstColumn = table.rows.map(row => row.cells[0] || '');
if (this.isColumnSortable(firstColumn)) {
suggestions.push({
type: 'improvement',
priority: 'low',
message: 'Consider sorting table rows',
suggestion: 'First column appears sortable - consider ordering rows alphabetically or numerically'
});
}
}
// Suggest adding missing data
const completeness = result.dataQuality.completeness;
if (completeness < 80) {
suggestions.push({
type: 'improvement',
priority: 'high',
message: `Table is only ${completeness.toFixed(1)}% complete`,
suggestion: 'Consider filling in missing data or using placeholder values like "N/A" or "TBD"'
});
}
result.suggestions.push(...suggestions);
}
calculateTypeConsistency(values, expectedType) {
if (values.length === 0) return 1;
let consistentCount = 0;
values.forEach(value => {
const validation = this.validateCellDataType(value, expectedType);
if (validation.isValid) {
consistentCount++;
}
});
return consistentCount / values.length;
}
isColumnSortable(columnData) {
const nonEmpty = columnData.filter(cell => cell.trim());
if (nonEmpty.length < 2) return false;
// Check if all values are numbers
if (nonEmpty.every(val => !isNaN(val) && !isNaN(parseFloat(val)))) {
return true;
}
// Check if all values are dates
if (nonEmpty.every(val => this.isDateLike(val))) {
return true;
}
// Check if values are already sorted or nearly sorted
const sorted = [...nonEmpty].sort();
const sortedDesc = [...nonEmpty].sort().reverse();
const matchesAsc = nonEmpty.join('') === sorted.join('');
const matchesDesc = nonEmpty.join('') === sortedDesc.join('');
return !matchesAsc && !matchesDesc; // Suggest sorting if not already sorted
}
generateReport() {
const report = {
summary: this.validationResults.summary,
overallHealth: this.calculateOverallHealth(),
tables: this.validationResults.tables.map(table => ({
index: table.tableIndex,
location: `${table.filePath}:${table.startLine}-${table.endLine}`,
isValid: table.isValid,
errorCount: table.errors.length,
warningCount: table.warnings.length,
structure: table.structure,
dataQuality: {
completeness: table.dataQuality.completeness,
primaryDataTypes: Object.entries(table.dataQuality.dataTypes)
.map(([col, type]) => ({ column: col, type }))
},
topIssues: [...table.errors, ...table.warnings]
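// Errors first, then warnings; the comparator returns 0 for equal severities
.sort((a, b) => (a.severity === 'error' ? 0 : 1) - (b.severity === 'error' ? 0 : 1))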
.slice(0, 5)
})),
recommendations: this.generateRecommendations()
};
return report;
}
calculateOverallHealth() {
const { totalTables, validTables, errorsFound, warningsFound } = this.validationResults.summary;
if (totalTables === 0) return 100;
const validityScore = (validTables / totalTables) * 100;
const errorPenalty = Math.min(errorsFound * 5, 50); // Max 50% penalty for errors
const warningPenalty = Math.min(warningsFound * 2, 25); // Max 25% penalty for warnings
return Math.max(0, validityScore - errorPenalty - warningPenalty);
}
generateRecommendations() {
const recommendations = [];
const { errorsFound, warningsFound, totalTables, validTables } = this.validationResults.summary;
if (errorsFound > 0) {
recommendations.push({
priority: 'high',
title: 'Fix Critical Table Errors',
description: `${errorsFound} errors found across tables that prevent proper rendering`,
action: 'Review and fix structural issues, column mismatches, and data type errors'
});
}
if (warningsFound > 0) {
recommendations.push({
priority: 'medium',
title: 'Address Data Quality Warnings',
description: `${warningsFound} warnings found that could affect data quality`,
action: 'Review empty cells, inconsistent formatting, and content issues'
});
}
if (validTables / totalTables < 0.8) {
recommendations.push({
priority: 'high',
title: 'Improve Table Validity Rate',
description: `Only ${Math.round(validTables / totalTables * 100)}% of tables are valid`,
action: 'Focus on fixing structural issues and maintaining consistent formatting'
});
}
// Analyze common issues across tables
const commonIssues = this.findCommonIssues();
if (commonIssues.length > 0) {
recommendations.push({
priority: 'medium',
title: 'Address Common Patterns',
description: 'Several tables have similar issues that could be addressed systematically',
action: `Focus on: ${commonIssues.slice(0, 3).join(', ')}`
});
}
return recommendations;
}
findCommonIssues() {
const issueCount = new Map();
this.validationResults.tables.forEach(table => {
[...table.errors, ...table.warnings].forEach(issue => {
const key = `${issue.type}:${issue.message.split(' ').slice(0, 5).join(' ')}`;
issueCount.set(key, (issueCount.get(key) || 0) + 1);
});
});
return Array.from(issueCount.entries())
.filter(([, count]) => count > 1)
.sort(([, a], [, b]) => b - a)
// Strip only the "type:" prefix; the message itself may contain colons
.map(([issue]) => issue.slice(issue.indexOf(':') + 1))
.slice(0, 5);
}
}
module.exports = MarkdownTableValidator;
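One limitation of the `parseCells` method above is that it splits on every pipe character, so a cell containing an escaped pipe (`\|`, as GitHub Flavored Markdown allows) is cut in two and reported as a column-count error. The following is a minimal sketch of an escape-aware splitter. `splitTableRow` is a hypothetical name, not part of the class, and the sketch only handles backslash-escaped pipes.

```javascript
// Hypothetical drop-in alternative to parseCells(): splits a table row on
// unescaped pipes and turns each escaped pipe (\|) into a literal pipe.
function splitTableRow(line) {
  const text = line.trim();
  const cells = [];
  let current = '';

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === '\\' && text[i + 1] === '|') {
      current += '|'; // escaped pipe stays inside the current cell
      i++;            // skip the pipe that was just consumed
    } else if (ch === '|') {
      cells.push(current.trim()); // unescaped pipe closes the current cell
      current = '';
    } else {
      current += ch;
    }
  }
  cells.push(current.trim());

  // A leading or trailing delimiter pipe leaves an empty first or last entry; drop it
  if (text.startsWith('|')) cells.shift();
  if (text.endsWith('|') && !text.endsWith('\\|')) cells.pop();

  return cells;
}

console.log(splitTableRow('| a \\| b | c |')); // two cells: 'a | b' and 'c'
```

If a splitter like this replaced `parseCells`, the pipe check in `detectContentIssues` would also need adjusting, since it would otherwise flag every correctly escaped pipe.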
Automated Quality Assurance Workflows
CI/CD Integration for Table Validation
Implementing continuous validation in development workflows:
# .github/workflows/table-validation.yml - Automated table quality assurance
name: Table Data Validation
on:
  push:
    branches: [ main, develop ]
    paths:
      - '**/*.md'
  pull_request:
    branches: [ main, develop ]
    paths:
      - '**/*.md'
jobs:
  validate-tables:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '18'
          cache: 'npm'
      - name: Install dependencies
        run: |
          npm install
          npm install -g markdown-table-validator
      - name: Run table validation
        run: |
          node scripts/validate-all-tables.js --output-format=json > validation-results.json
      - name: Generate validation report
        run: |
          node scripts/generate-validation-report.js validation-results.json
      - name: Check validation results
        id: validation-check
        run: |
          ERRORS=$(jq '.summary.errorsFound' validation-results.json)
          WARNINGS=$(jq '.summary.warningsFound' validation-results.json)
          echo "errors=$ERRORS" >> "$GITHUB_OUTPUT"
          echo "warnings=$WARNINGS" >> "$GITHUB_OUTPUT"
          if [ "$ERRORS" -gt 0 ]; then
            echo "❌ Table validation failed with $ERRORS errors"
            exit 1
          elif [ "$WARNINGS" -gt 5 ]; then
            echo "⚠️ Table validation completed with $WARNINGS warnings"
            exit 0
          else
            echo "✅ Table validation passed"
            exit 0
          fi
      - name: Upload validation report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: table-validation-report
          path: |
            validation-results.json
            validation-report.html
      - name: Comment on PR
        if: github.event_name == 'pull_request' && always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            try {
              const results = JSON.parse(fs.readFileSync('validation-results.json', 'utf8'));
              const { summary, tables } = results;
              const errorTables = tables.filter(t => t.errorCount > 0);
              const warningTables = tables.filter(t => t.warningCount > 0);
              let comment = `## 📊 Table Validation Report\n\n`;
              comment += `**Summary:**\n`;
              comment += `- Total tables: ${summary.totalTables}\n`;
              comment += `- Valid tables: ${summary.validTables}\n`;
              comment += `- Errors: ${summary.errorsFound}\n`;
              comment += `- Warnings: ${summary.warningsFound}\n\n`;
              if (errorTables.length > 0) {
                comment += `### ❌ Tables with Errors (${errorTables.length})\n\n`;
                errorTables.slice(0, 5).forEach(table => {
                  comment += `- **${table.location}**: ${table.errorCount} errors\n`;
                });
                if (errorTables.length > 5) {
                  comment += `- ... and ${errorTables.length - 5} more\n`;
                }
                comment += `\n`;
              }
              if (warningTables.length > 0) {
                comment += `### ⚠️ Tables with Warnings (${warningTables.length})\n\n`;
                warningTables.slice(0, 3).forEach(table => {
                  comment += `- **${table.location}**: ${table.warningCount} warnings\n`;
                });
                if (warningTables.length > 3) {
                  comment += `- ... and ${warningTables.length - 3} more\n`;
                }
                comment += `\n`;
              }
              if (results.recommendations.length > 0) {
                comment += `### 💡 Recommendations\n\n`;
                results.recommendations.slice(0, 3).forEach(rec => {
                  comment += `- **${rec.title}**: ${rec.description}\n`;
                });
              }
              comment += `\n[View detailed report](${process.env.GITHUB_SERVER_URL}/${process.env.GITHUB_REPOSITORY}/actions/runs/${process.env.GITHUB_RUN_ID})`;
              // Await the request so a failed API call is caught by the surrounding try/catch
              await github.rest.issues.createComment({
                issue_number: context.issue.number,
                owner: context.repo.owner,
                repo: context.repo.repo,
                body: comment
              });
            } catch (error) {
              console.error('Failed to post validation comment:', error);
            }
  validate-data-integrity:
    runs-on: ubuntu-latest
    needs: validate-tables
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'
          cache: 'pip'
      - name: Install Python dependencies
        run: |
          pip install pandas numpy jsonschema
      - name: Run data integrity checks
        run: |
          python scripts/check-data-integrity.py --format markdown
      - name: Validate cross-table references
        run: |
          python scripts/validate-cross-references.py
      - name: Generate data quality metrics
        run: |
          python scripts/generate-quality-metrics.py > quality-metrics.json
      - name: Upload metrics
        uses: actions/upload-artifact@v4
        with:
          name: data-quality-metrics
          path: quality-metrics.json
  performance-test:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '18'
      - name: Install dependencies
        run: npm install
      - name: Run table rendering performance test
        run: |
          node scripts/performance-test.js > performance-results.json
      - name: Check performance regression
        run: |
          node scripts/check-performance-regression.js
      - name: Upload performance results
        uses: actions/upload-artifact@v4
        with:
          name: performance-results
          path: performance-results.json
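The workflow above depends on `validation-results.json` having a particular shape: the shell step reads `.summary.errorsFound` and `.summary.warningsFound`, and the PR comment reads `tables` and `recommendations`. The scripts that produce this file are not shown, so the following is only a sketch of how per-file reports could be merged into that shape and how the pass/fail gate could be expressed in JavaScript. `mergeReports` and `gate` are hypothetical names, and each input is assumed to look like the output of `generateReport()`.

```javascript
// Hypothetical helper: merges per-file reports (assumed to be shaped like
// generateReport() output) into the single object the workflow reads.
function mergeReports(reports) {
  const merged = {
    summary: { totalTables: 0, validTables: 0, errorsFound: 0, warningsFound: 0 },
    tables: [],
    recommendations: []
  };
  for (const report of reports) {
    // Sum each counter; a missing counter is treated as 0
    for (const key of Object.keys(merged.summary)) {
      merged.summary[key] += (report.summary && report.summary[key]) || 0;
    }
    merged.tables.push(...(report.tables || []));
    merged.recommendations.push(...(report.recommendations || []));
  }
  return merged;
}

// Mirrors the shell logic in the "Check validation results" step:
// any error fails the build; more than maxWarnings warnings only warns.
function gate(summary, maxWarnings = 5) {
  if (summary.errorsFound > 0) return { status: 'failed', exitCode: 1 };
  if (summary.warningsFound > maxWarnings) return { status: 'warned', exitCode: 0 };
  return { status: 'passed', exitCode: 0 };
}
```

Keeping the gate in one small function makes the warning threshold easy to change in one place, whether the decision runs in the shell step or inside the validation script.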
Advanced Data Quality Monitoring
Real-time monitoring and alerting for table data quality:
# data-quality-monitor.py - Advanced data quality monitoring system
import json
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
import requests
import schedule
import time
from pathlib import Path
import re
class DataQualityMonitor:
    def __init__(self, config_path='config/data-quality.json'):
        with open(config_path) as f:
            self.config = json.load(f)
        self.quality_metrics = {
            'completeness': {},
            'consistency': {},
            'accuracy': {},
            'timeliness': {},
            'validity': {}
        }
        self.alerts = []
        self.history = []
    def scan_markdown_tables(self, content_dir):
        """Scan all markdown files for tables and extract data"""
        tables_data = []
        for md_file in Path(content_dir).rglob('*.md'):
            try:
                with open(md_file, 'r', encoding='utf-8') as f:
                    content = f.read()
                tables = self.extract_tables_from_markdown(content)
                for i, table in enumerate(tables):
                    table_data = {
                        'file': str(md_file),
                        'table_index': i,
                        'headers': table['headers'],
                        'rows': table['rows'],
                        'metadata': {
                            'last_modified': datetime.fromtimestamp(md_file.stat().st_mtime),
                            'size': len(table['rows']),
                            'columns': len(table['headers'])
                        }
                    }
                    tables_data.append(table_data)
            except Exception as e:
                print(f"Error processing {md_file}: {e}")
        return tables_data
    def extract_tables_from_markdown(self, content):
        """Extract table data from markdown content"""
        tables = []
        lines = content.split('\n')
        current_table = None
        for line in lines:
            line = line.strip()
            if '|' in line and line:
                if current_table is None:
                    current_table = {'headers': [], 'rows': [], 'separator_found': False}
                # Strip at most one leading and one trailing pipe, so rows written
                # without outer pipes keep their first and last cells
                cells = [cell.strip() for cell in re.sub(r'^\||\|$', '', line).split('|')]
                # A separator row holds only pipes, colons, dashes, and whitespace,
                # and must contain at least one dash
                is_separator = bool(re.match(r'^[\s\|:\-]+$', line)) and '-' in line
                if not current_table['separator_found'] and is_separator:
                    current_table['separator_found'] = True
                elif not current_table['headers'] and not is_separator:
                    current_table['headers'] = cells
                elif current_table['separator_found']:
                    current_table['rows'].append(cells)
            elif current_table is not None:
                # End of table
                if current_table['headers'] and current_table['separator_found']:
                    tables.append(current_table)
                current_table = None
        # Handle table at end of file
        if current_table and current_table['headers'] and current_table['separator_found']:
            tables.append(current_table)
        return tables
    def calculate_completeness_metrics(self, tables_data):
        """Calculate data completeness metrics"""
        metrics = {}
        for table in tables_data:
            table_id = f"{table['file']}:{table['table_index']}"
            total_cells = len(table['rows']) * len(table['headers'])
            filled_cells = 0
            for row in table['rows']:
                # Count only cells that fall under a header, so rows with extra
                # cells cannot push completeness above 100%
                for cell in row[:len(table['headers'])]:
                    if cell.strip():
                        filled_cells += 1
            completeness = (filled_cells / total_cells * 100) if total_cells > 0 else 0
            metrics[table_id] = {
                'completeness_percentage': completeness,
                'total_cells': total_cells,
                'filled_cells': filled_cells,
                'empty_cells': total_cells - filled_cells,
                'column_completeness': self.calculate_column_completeness(table)
            }
        return metrics
    def calculate_column_completeness(self, table):
        """Calculate completeness for each column"""
        column_metrics = {}
        for col_index, header in enumerate(table['headers']):
            filled_count = 0
            total_count = len(table['rows'])
            for row in table['rows']:
                if col_index < len(row) and row[col_index].strip():
                    filled_count += 1
            completeness = (filled_count / total_count * 100) if total_count > 0 else 0
            column_metrics[header] = {
                'completeness': completeness,
                'filled': filled_count,
                'total': total_count
            }
        return column_metrics
    def calculate_consistency_metrics(self, tables_data):
        """Calculate data consistency metrics"""
        metrics = {}
        for table in tables_data:
            table_id = f"{table['file']}:{table['table_index']}"
            consistency_scores = {}
            for col_index, header in enumerate(table['headers']):
                column_data = []
                for row in table['rows']:
                    if col_index < len(row):
                        column_data.append(row[col_index].strip())
                # Data type consistency
                type_consistency = self.calculate_type_consistency(column_data)
                # Format consistency
                format_consistency = self.calculate_format_consistency(column_data)
                # Case consistency
                case_consistency = self.calculate_case_consistency(column_data)
                consistency_scores[header] = {
                    'type_consistency': type_consistency,
                    'format_consistency': format_consistency,
                    'case_consistency': case_consistency,
                    'overall_consistency': (type_consistency + format_consistency + case_consistency) / 3
                }
            metrics[table_id] = consistency_scores
        return metrics
    def calculate_type_consistency(self, column_data):
        """Calculate data type consistency within a column"""
        if not column_data:
            return 100
        non_empty = [val for val in column_data if val]
        if not non_empty:
            return 100
        # Detect primary data type
        type_counts = {
            'number': 0,
            'date': 0,
            'boolean': 0,
            'url': 0,
            'email': 0,
            'string': 0
        }
        for value in non_empty:
            if self.is_number(value):
                type_counts['number'] += 1
            elif self.is_date(value):
                type_counts['date'] += 1
            elif self.is_boolean(value):
                type_counts['boolean'] += 1
            elif self.is_url(value):
                type_counts['url'] += 1
            elif self.is_email(value):
                type_counts['email'] += 1
            else:
                type_counts['string'] += 1
        # Calculate consistency as percentage of most common type
        max_count = max(type_counts.values())
        return (max_count / len(non_empty)) * 100
    def calculate_format_consistency(self, column_data):
        """Calculate format consistency within a column"""
        if not column_data:
            return 100
        non_empty = [val for val in column_data if val]
        if len(non_empty) <= 1:
            return 100
        # Group by format patterns
        format_patterns = {}
        for value in non_empty:
            pattern = self.extract_format_pattern(value)
            format_patterns[pattern] = format_patterns.get(pattern, 0) + 1
        # Calculate consistency as percentage of most common format
        max_count = max(format_patterns.values())
        return (max_count / len(non_empty)) * 100
    def calculate_case_consistency(self, column_data):
        """Calculate case consistency within a column"""
        if not column_data:
            return 100
        text_values = [val for val in column_data if val and not self.is_number(val)]
        if len(text_values) <= 1:
            return 100
        # Count case patterns
        case_counts = {
            'lowercase': sum(1 for val in text_values if val.islower()),
            'uppercase': sum(1 for val in text_values if val.isupper()),
            'title': sum(1 for val in text_values if val.istitle()),
            'mixed': sum(1 for val in text_values if not val.islower() and not val.isupper() and not val.istitle())
        }
        # Calculate consistency as percentage of most common case
        max_count = max(case_counts.values())
        return (max_count / len(text_values)) * 100
    def calculate_accuracy_metrics(self, tables_data):
        """Calculate data accuracy metrics using validation rules"""
        metrics = {}
        for table in tables_data:
            table_id = f"{table['file']}:{table['table_index']}"
            accuracy_scores = {}
            for col_index, header in enumerate(table['headers']):
                column_data = []
                for row in table['rows']:
                    if col_index < len(row):
                        column_data.append(row[col_index].strip())
                # Apply validation rules based on column name and data
                validation_results = self.apply_validation_rules(header, column_data)
                # Score only non-empty values: empty cells are reported as valid by the
                # rule engine and would otherwise push accuracy above 100%
                checked = [r for r in validation_results if r['value']]
                total_values = len(checked)
                valid_values = sum(1 for r in checked if r['valid'])
                accuracy = (valid_values / total_values * 100) if total_values > 0 else 100
                accuracy_scores[header] = {
                    'accuracy_percentage': accuracy,
                    'valid_count': valid_values,
                    'total_count': total_values,
                    'invalid_values': [r for r in checked if not r['valid']]
                }
            metrics[table_id] = accuracy_scores
        return metrics
    def apply_validation_rules(self, column_name, column_data):
        """Apply validation rules based on column name patterns"""
        results = []
        # Get validation rules for this column
        rules = self.get_column_validation_rules(column_name)
        for value in column_data:
            if not value:
                # Empty cells are a completeness concern, not an accuracy one
                results.append({'value': value, 'valid': True, 'errors': []})
                continue
            is_valid = True
            errors = []
            for rule in rules:
                try:
                    if rule['type'] == 'regex':
                        if not re.match(rule['pattern'], value):
                            is_valid = False
                            errors.append(f"Does not match pattern: {rule['pattern']}")
                    elif rule['type'] == 'range':
                        # Non-numeric values are not checked by range rules
                        if self.is_number(value):
                            num_val = float(value)
                            if num_val < rule['min'] or num_val > rule['max']:
                                is_valid = False
                                errors.append(f"Value {num_val} outside range [{rule['min']}, {rule['max']}]")
                    elif rule['type'] == 'enum':
                        if value.lower() not in [opt.lower() for opt in rule['options']]:
                            is_valid = False
                            errors.append(f"Value not in allowed options: {rule['options']}")
                    elif rule['type'] == 'length':
                        if len(value) < rule['min'] or len(value) > rule['max']:
                            is_valid = False
                            errors.append(f"Length {len(value)} outside range [{rule['min']}, {rule['max']}]")
                except Exception as e:
                    # A malformed rule is recorded but does not mark the value invalid
                    errors.append(f"Validation error: {e}")
            results.append({
                'value': value,
                'valid': is_valid,
                'errors': errors
            })
        return results
    def get_column_validation_rules(self, column_name):
        """Get validation rules based on column name patterns"""
        rules = []
        name_lower = column_name.lower()
        # Email validation
        if 'email' in name_lower:
            rules.append({
                'type': 'regex',
                'pattern': r'^[^\s@]+@[^\s@]+\.[^\s@]+$'
            })
        # URL validation
        elif 'url' in name_lower or 'link' in name_lower:
            rules.append({
                'type': 'regex',
                'pattern': r'^https?://.+'
            })
        # Version validation (prefix match: text after the version number is accepted)
        elif 'version' in name_lower:
            rules.append({
                'type': 'regex',
                'pattern': r'^\d+\.\d+(\.\d+)?'
            })
        # Status validation
        elif 'status' in name_lower:
            rules.append({
                'type': 'enum',
                'options': ['active', 'inactive', 'pending', 'completed', 'draft', 'published', 'archived']
            })
        # Date validation: the alternatives are grouped and anchored at both ends,
        # so trailing text after a date is rejected
        elif 'date' in name_lower:
            rules.append({
                'type': 'regex',
                'pattern': r'^(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4}|[A-Za-z]{3,9}\s+\d{1,2},?\s+\d{4})$'
            })
        return rules
    def calculate_timeliness_metrics(self, tables_data):
        """Calculate data timeliness metrics"""
        metrics = {}
        current_time = datetime.now()
        for table in tables_data:
            table_id = f"{table['file']}:{table['table_index']}"
            last_modified = table['metadata']['last_modified']
            age_days = (current_time - last_modified).days
            # Define freshness thresholds
            freshness_score = 100
            if age_days > 30:
                # Decrease by 2 points per day after 30 days
                freshness_score = max(0, 100 - (age_days - 30) * 2)
            # Look for date columns to assess data currency
            date_columns = []
            for col_index, header in enumerate(table['headers']):
                if 'date' in header.lower():
                    column_data = []
                    for row in table['rows']:
                        if col_index < len(row):
                            column_data.append(row[col_index].strip())
                    date_values = []
                    for value in column_data:
                        if value and self.is_date(value):
                            try:
                                parsed_date = pd.to_datetime(value)
                                date_values.append(parsed_date)
                            except (ValueError, TypeError):
                                # Looks like a date but is not a real one, e.g. 2024-13-45
                                pass
                    if date_values:
                        latest_date = max(date_values)
                        data_age = (pd.Timestamp.now() - latest_date).days
                        date_columns.append({
                            'column': header,
                            'latest_date': latest_date,
                            'age_days': data_age,
                            'values_count': len(date_values)
                        })
            metrics[table_id] = {
                'file_freshness_score': freshness_score,
                'file_age_days': age_days,
                'last_modified': last_modified.isoformat(),
                'date_columns': date_columns
            }
        return metrics
    def run_quality_assessment(self, content_dir):
        """Run complete data quality assessment"""
        print("Starting data quality assessment...")
        # Scan all tables
        tables_data = self.scan_markdown_tables(content_dir)
        print(f"Found {len(tables_data)} tables to analyze")
        # Calculate metrics
        self.quality_metrics['completeness'] = self.calculate_completeness_metrics(tables_data)
        self.quality_metrics['consistency'] = self.calculate_consistency_metrics(tables_data)
        self.quality_metrics['accuracy'] = self.calculate_accuracy_metrics(tables_data)
        self.quality_metrics['timeliness'] = self.calculate_timeliness_metrics(tables_data)
        # Check for alerts before building the report: the report's 'alerts' and
        # 'top_issues' entries are read from self.alerts
        self.check_quality_alerts()
        # Generate overall quality scores
        quality_report = self.generate_quality_report()
        # Save results
        self.save_quality_results()
        return quality_report
    def generate_quality_report(self):
        """Generate comprehensive quality report"""
        report = {
            'timestamp': datetime.now().isoformat(),
            'summary': {
                'total_tables': len(self.quality_metrics['completeness']),
                'overall_quality_score': self.calculate_overall_quality_score(),
                'dimension_scores': {
                    'completeness': self.calculate_dimension_average('completeness'),
                    'consistency': self.calculate_dimension_average('consistency'),
                    'accuracy': self.calculate_dimension_average('accuracy'),
                    'timeliness': self.calculate_dimension_average('timeliness')
                }
            },
            'alerts': self.alerts,
            'top_issues': self.identify_top_issues(),
            'recommendations': self.generate_recommendations(),
            'detailed_metrics': self.quality_metrics
        }
        return report
    def calculate_overall_quality_score(self):
        """Calculate overall quality score across all dimensions"""
        dimension_weights = {
            'completeness': 0.3,
            'consistency': 0.25,
            'accuracy': 0.3,
            'timeliness': 0.15
        }
        weighted_score = 0
        for dimension, weight in dimension_weights.items():
            dimension_score = self.calculate_dimension_average(dimension)
            weighted_score += dimension_score * weight
        return round(weighted_score, 2)
    def calculate_dimension_average(self, dimension):
        """Calculate average score for a quality dimension"""
        metrics = self.quality_metrics[dimension]
        if not metrics:
            return 0
        if dimension == 'completeness':
            scores = [table_metrics['completeness_percentage']
                      for table_metrics in metrics.values()]
        elif dimension == 'consistency':
            scores = []
            for table_metrics in metrics.values():
                table_scores = [col_data['overall_consistency']
                                for col_data in table_metrics.values()]
                if table_scores:
                    scores.append(sum(table_scores) / len(table_scores))
        elif dimension == 'accuracy':
            scores = []
            for table_metrics in metrics.values():
                table_scores = [col_data['accuracy_percentage']
                                for col_data in table_metrics.values()]
                if table_scores:
                    scores.append(sum(table_scores) / len(table_scores))
        elif dimension == 'timeliness':
            scores = [table_metrics['file_freshness_score']
                      for table_metrics in metrics.values()]
        return round(sum(scores) / len(scores), 2) if scores else 0
    def check_quality_alerts(self):
        """Check for quality issues that require alerts"""
        self.alerts = []
        # Check completeness alerts
        for table_id, metrics in self.quality_metrics['completeness'].items():
            if metrics['completeness_percentage'] < 50:
                self.alerts.append({
                    'severity': 'high',
                    'type': 'completeness',
                    'table': table_id,
                    'message': f"Table completeness is only {metrics['completeness_percentage']:.1f}%",
                    'recommendation': 'Review and fill missing data or add placeholders'
                })
        # Check accuracy alerts
        for table_id, metrics in self.quality_metrics['accuracy'].items():
            for column, col_metrics in metrics.items():
                if col_metrics['accuracy_percentage'] < 80:
                    self.alerts.append({
                        'severity': 'medium',
                        'type': 'accuracy',
                        'table': table_id,
                        'column': column,
                        'message': f"Column '{column}' accuracy is only {col_metrics['accuracy_percentage']:.1f}%",
                        'recommendation': 'Review and fix invalid data values'
                    })
        # Check timeliness alerts
        for table_id, metrics in self.quality_metrics['timeliness'].items():
            if metrics['file_age_days'] > 90:
                self.alerts.append({
                    'severity': 'low',
                    'type': 'timeliness',
                    'table': table_id,
                    'message': f"Table data is {metrics['file_age_days']} days old",
                    'recommendation': 'Review and update data if necessary'
                })
    def identify_top_issues(self):
        """Identify top data quality issues across all tables"""
        issues = []
        # Collect all issues
        for alert in self.alerts:
            issues.append({
                'type': alert['type'],
                'severity': alert['severity'],
                'count': 1,
                'tables': [alert['table']]
            })
        # Group similar issues
        grouped_issues = {}
        for issue in issues:
            key = f"{issue['type']}:{issue['severity']}"
            if key in grouped_issues:
                grouped_issues[key]['count'] += issue['count']
                grouped_issues[key]['tables'].extend(issue['tables'])
            else:
                grouped_issues[key] = issue
        # Sort by severity and count
        severity_order = {'high': 3, 'medium': 2, 'low': 1}
        sorted_issues = sorted(
            grouped_issues.values(),
            key=lambda x: (severity_order.get(x['severity'], 0), x['count']),
            reverse=True
        )
        return sorted_issues[:10]  # Return top 10 issues
    def generate_recommendations(self):
        """Generate actionable recommendations based on quality assessment"""
        recommendations = []
        overall_score = self.calculate_overall_quality_score()
        if overall_score < 70:
            recommendations.append({
                'priority': 'high',
                'title': 'Improve Overall Data Quality',
                'description': f'Overall quality score is {overall_score}%, below acceptable threshold',
                'actions': [
                    'Focus on completeness and accuracy improvements',
                    'Implement data validation rules',
                    'Establish regular data review processes'
                ]
            })
        # Dimension-specific recommendations
        completeness_score = self.calculate_dimension_average('completeness')
        if completeness_score < 80:
            recommendations.append({
                'priority': 'high',
                'title': 'Address Data Completeness Issues',
                'description': f'Data completeness is {completeness_score}%',
                'actions': [
                    'Fill missing data where possible',
                    'Use consistent placeholder values (N/A, TBD)',
                    'Document data collection requirements'
                ]
            })
        consistency_score = self.calculate_dimension_average('consistency')
        if consistency_score < 75:
            recommendations.append({
                'priority': 'medium',
                'title': 'Improve Data Consistency',
                'description': f'Data consistency is {consistency_score}%',
                'actions': [
                    'Standardize data formats within columns',
                    'Implement data entry guidelines',
                    'Use validation rules and dropdowns'
                ]
            })
        return recommendations
    def save_quality_results(self):
        """Save quality assessment results"""
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        # Save detailed results
        results_file = f'quality_results_{timestamp}.json'
        with open(results_file, 'w') as f:
            json.dump(self.quality_metrics, f, indent=2, default=str)
        # Save report
        report = self.generate_quality_report()
        report_file = f'quality_report_{timestamp}.json'
        with open(report_file, 'w') as f:
            json.dump(report, f, indent=2, default=str)
        print(f"Quality results saved to {results_file}")
        print(f"Quality report saved to {report_file}")
    # Helper methods for data type detection
    def is_number(self, value):
        try:
            float(value)
            return True
        except ValueError:
            return False
    def is_date(self, value):
        date_patterns = [
            r'^\d{4}-\d{2}-\d{2}$',
            r'^\d{2}/\d{2}/\d{4}$',
            r'^\d{1,2}/\d{1,2}/\d{2,4}$',
            r'^[A-Za-z]{3,9}\s+\d{1,2},?\s+\d{4}$'
        ]
        return any(re.match(pattern, value.strip()) for pattern in date_patterns)
    def is_boolean(self, value):
        boolean_values = ['true', 'false', 'yes', 'no', 'on', 'off', '1', '0', 'enabled', 'disabled']
        return value.lower() in boolean_values
    def is_url(self, value):
        return value.startswith(('http://', 'https://'))
    def is_email(self, value):
        return re.match(r'^[^\s@]+@[^\s@]+\.[^\s@]+$', value) is not None
    def extract_format_pattern(self, value):
        """Extract a format pattern from a value for consistency checking"""
        pattern = re.sub(r'\d', 'N', value)  # Replace digits with N
        pattern = re.sub(r'[A-Za-z]', 'A', pattern)  # Replace letters with A
        return pattern
# CLI interface
if __name__ == "__main__":
    import sys
    monitor = DataQualityMonitor()
    if len(sys.argv) > 1:
        content_dir = sys.argv[1]
    else:
        content_dir = '.'
    report = monitor.run_quality_assessment(content_dir)
    print("\n📊 Data Quality Assessment Complete")
    print(f"Overall Quality Score: {report['summary']['overall_quality_score']}")
    print(f"Tables Analyzed: {report['summary']['total_tables']}")
    print(f"Alerts Generated: {len(report['alerts'])}")
    if report['alerts']:
        print("\n⚠️ Top Issues:")
        for issue in report['top_issues'][:3]:
            print(f" - {issue['type'].title()} issues ({issue['count']} occurrences)")
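To make the scoring arithmetic concrete, the sketch below recomputes the weighted overall score and the file freshness score outside the class. The weights, the 30-day grace period, and the 2-point daily decay come from the monitor above. The function names and the sample dimension scores are illustrative assumptions.

```python
# Standalone sketch of the scoring arithmetic used by the monitor above.
# Function names and sample inputs are illustrative, not part of the original class.

DIMENSION_WEIGHTS = {
    'completeness': 0.3,
    'consistency': 0.25,
    'accuracy': 0.3,
    'timeliness': 0.15,
}


def overall_quality_score(dimension_scores):
    """Weighted average of per-dimension scores, each on a 0-100 scale."""
    total = sum(dimension_scores[name] * weight
                for name, weight in DIMENSION_WEIGHTS.items())
    return round(total, 2)


def freshness_score(age_days):
    """100 for files up to 30 days old, then 2 points lost per extra day, floored at 0."""
    if age_days <= 30:
        return 100
    return max(0, 100 - (age_days - 30) * 2)


if __name__ == "__main__":
    sample = {'completeness': 90, 'consistency': 80, 'accuracy': 95, 'timeliness': 60}
    print(overall_quality_score(sample))  # 27 + 20 + 28.5 + 9 = 84.5
    print(freshness_score(45))            # 100 - 15 * 2 = 70
```

Because completeness and accuracy together carry 60% of the weight, a table set with stale but complete and valid data still scores well. Adjust the weights if freshness matters more in your documentation.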
Integration with Content Management Systems
Table validation fits into existing content workflows. Run as a step in an automation or CI/CD pipeline, it checks every content change before publication, so data quality problems are caught automatically rather than during manual review.
Validation also pairs well with link management and cross-referencing systems. Checking that the references and cross-links inside table cells still resolve extends data integrity beyond individual tables to the documentation set as a whole.
On documentation platforms built as Progressive Web Apps, the same checks can run client-side, so cached content can be validated even when it is read without an internet connection.
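As a minimal sketch of such a CI step, the script below turns a saved quality report into a pass/fail exit code. It assumes the report structure produced by `generate_quality_report` above (`summary.overall_quality_score` and an `alerts` list with `severity` fields). The `quality_gate` name, the script filename, and the default thresholds are hypothetical choices, with 70 mirroring the "acceptable threshold" used in the recommendations logic.

```python
import json
import sys


def quality_gate(report, min_score=70.0, max_high_alerts=0):
    """Return a process exit code: 0 if the report passes, 1 otherwise."""
    score = report['summary']['overall_quality_score']
    high_alerts = [a for a in report.get('alerts', []) if a.get('severity') == 'high']
    failures = []
    if score < min_score:
        failures.append(f"overall score {score} is below {min_score}")
    if len(high_alerts) > max_high_alerts:
        failures.append(f"{len(high_alerts)} high-severity alert(s)")
    for failure in failures:
        print(f"Quality gate failed: {failure}")
    return 1 if failures else 0


if __name__ == "__main__":
    # Hypothetical usage: python quality_gate.py quality_report_<timestamp>.json
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding='utf-8') as f:
            sys.exit(quality_gate(json.load(f)))
```

Most CI systems treat a non-zero exit code as a failed step, so the pipeline blocks publication until the score recovers or the alerts are resolved.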
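One way to connect the two concerns is to scan table cells for relative links and compare them with the set of files known to exist. The sketch below assumes rows in the list-of-lists form produced by the Python extractor earlier in this guide. `find_broken_links` and `known_targets` are hypothetical names, and only inline `[label](target)` links are handled.

```python
import re

# Matches inline Markdown links such as [label](target). Image links match too,
# since they share the bracket syntax; reference-style links are out of scope.
LINK_PATTERN = re.compile(r'\[([^\]]*)\]\(([^)\s]+)\)')


def find_broken_links(rows, known_targets):
    """Return (row_index, column_index, target) for each relative link not in known_targets."""
    broken = []
    for row_index, row in enumerate(rows):
        for col_index, cell in enumerate(row):
            for _label, target in LINK_PATTERN.findall(cell):
                if target.startswith(('http://', 'https://', 'mailto:')):
                    continue  # external links would need a network check, skipped here
                path = target.split('#', 1)[0]  # drop the in-page anchor
                if path and path not in known_targets:
                    broken.append((row_index, col_index, target))
    return broken


if __name__ == "__main__":
    rows = [
        ['Authentication', '[guide](docs/auth.md)'],
        ['Caching', '[spec](docs/cache.md#redis) and [site](https://example.com)'],
    ]
    print(find_broken_links(rows, {'docs/auth.md'}))
    # [(1, 1, 'docs/cache.md#redis')]
```

In practice `known_targets` could be built by walking the content directory, with paths normalized relative to the file that contains the table.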
Advanced Validation Scenarios
Cross-Table Data Consistency
// cross-table-validator.js - Validate data consistency across multiple tables
class CrossTableValidator {
constructor() {
this.tableRegistry = new Map();
this.relationships = new Map();
this.inconsistencies = [];
}
registerTable(tableId, tableData, schema = {}) {
this.tableRegistry.set(tableId, {
data: tableData,
schema: schema,
relationships: []
});
}
defineRelationship(sourceTable, sourceColumn, targetTable, targetColumn, type = 'reference') {
const relationshipId = `${sourceTable}.${sourceColumn} -> ${targetTable}.${targetColumn}`;
this.relationships.set(relationshipId, {
source: { table: sourceTable, column: sourceColumn },
target: { table: targetTable, column: targetColumn },
type: type, // 'reference', 'lookup', 'aggregation'
validated: false
});
}
async validateAllRelationships() {
const results = {
validRelationships: [],
brokenRelationships: [],
inconsistencies: []
};
for (const [relationshipId, relationship] of this.relationships) {
const validationResult = await this.validateRelationship(relationshipId, relationship);
if (validationResult.isValid) {
results.validRelationships.push({
relationship: relationshipId,
details: validationResult
});
} else {
results.brokenRelationships.push({
relationship: relationshipId,
issues: validationResult.issues
});
}
}
return results;
}
async validateRelationship(relationshipId, relationship) {
const sourceTable = this.tableRegistry.get(relationship.source.table);
const targetTable = this.tableRegistry.get(relationship.target.table);
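if (!sourceTable || !targetTable) {
return {
isValid: false,
issues: ['Source or target table not found']
};
}
// A missing column would otherwise produce an empty value list and pass silently
if (!sourceTable.data.headers.includes(relationship.source.column) ||
!targetTable.data.headers.includes(relationship.target.column)) {
return {
isValid: false,
issues: ['Source or target column not found']
};
}
const sourceValues = this.extractColumnValues(sourceTable.data, relationship.source.column);
const targetValues = this.extractColumnValues(targetTable.data, relationship.target.column);
// Set lookup avoids rescanning the target column for every source value
const targetSet = new Set(targetValues);
const issues = [];
// Check referential integrity
if (relationship.type === 'reference') {
const missingReferences = sourceValues.filter(val =>
val && !targetSet.has(val)
);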
if (missingReferences.length > 0) {
issues.push({
type: 'missing_references',
count: missingReferences.length,
examples: missingReferences.slice(0, 5)
});
}
}
// Check data type consistency
const sourceTypes = this.analyzeDataTypes(sourceValues);
const targetTypes = this.analyzeDataTypes(targetValues);
if (sourceTypes.primary !== targetTypes.primary) {
issues.push({
type: 'type_mismatch',
source_type: sourceTypes.primary,
target_type: targetTypes.primary
});
}
return {
isValid: issues.length === 0,
issues: issues,
statistics: {
source_unique_values: new Set(sourceValues).size,
target_unique_values: new Set(targetValues).size,
common_values: sourceValues.filter(val => targetValues.includes(val)).length
}
};
}
extractColumnValues(tableData, columnName) {
const columnIndex = tableData.headers.indexOf(columnName);
if (columnIndex === -1) return [];
return tableData.rows.map(row =>
row.cells[columnIndex] ? row.cells[columnIndex].trim() : ''
).filter(val => val);
}
analyzeDataTypes(values) {
const typeCounts = {
number: 0,
date: 0,
boolean: 0,
string: 0
};
values.forEach(value => {
if (!isNaN(value) && !isNaN(parseFloat(value))) {
typeCounts.number++;
} else if (this.isDateLike(value)) {
typeCounts.date++;
} else if (['true', 'false', 'yes', 'no'].includes(value.toLowerCase())) {
typeCounts.boolean++;
} else {
typeCounts.string++;
}
});
const primaryType = Object.entries(typeCounts)
.sort(([,a], [,b]) => b - a)[0][0];
return { primary: primaryType, counts: typeCounts };
}
isDateLike(value) {
const datePatterns = [
/^\d{4}-\d{2}-\d{2}$/,
/^\d{2}\/\d{2}\/\d{4}$/,
/^[A-Za-z]{3,9}\s+\d{1,2},?\s+\d{4}$/
];
return datePatterns.some(pattern => pattern.test(value));
}
}
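The same referential check can be run on the Python side against the table dictionaries produced by `extract_tables_from_markdown`, where `headers` is a list of strings and `rows` is a list of cell lists. The sketch below is a simplified counterpart to the `reference` relationship type. The helper names and sample tables are illustrative assumptions.

```python
def column_values(table, column_name):
    """Non-empty, stripped values of one column; raises KeyError if the column is missing."""
    if column_name not in table['headers']:
        raise KeyError(f"Column not found: {column_name}")
    index = table['headers'].index(column_name)
    return [row[index].strip() for row in table['rows']
            if index < len(row) and row[index].strip()]


def missing_references(source, source_column, target, target_column):
    """Values in the source column that never appear in the target column, in first-seen order."""
    target_values = set(column_values(target, target_column))
    missing = []
    for value in column_values(source, source_column):
        if value not in target_values and value not in missing:
            missing.append(value)
    return missing


if __name__ == "__main__":
    services = {'headers': ['Service', 'Owner'],
                'rows': [['auth', 'team-a'], ['cache', 'team-b']]}
    endpoints = {'headers': ['Endpoint', 'Service'],
                 'rows': [['/login', 'auth'], ['/metrics', 'monitoring'], ['/items', 'cache']]}
    print(missing_references(endpoints, 'Service', services, 'Service'))  # ['monitoring']
```

Raising on a missing column, rather than returning an empty list, keeps a renamed header from being mistaken for a clean result.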
Real-Time Validation Integration
<!-- real-time-table-validator.html - Browser-based real-time validation -->
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Real-Time Table Validator</title>
<style>
.validation-panel {
position: fixed;
right: 20px;
top: 20px;
width: 300px;
background: white;
border: 1px solid #ddd;
border-radius: 8px;
padding: 15px;
box-shadow: 0 4px 12px rgba(0,0,0,0.1);
max-height: 80vh;
overflow-y: auto;
}
.validation-summary {
margin-bottom: 15px;
padding: 10px;
border-radius: 4px;
}
.validation-summary.valid {
background: #d4edda;
border: 1px solid #c3e6cb;
color: #155724;
}
.validation-summary.invalid {
background: #f8d7da;
border: 1px solid #f5c6cb;
color: #721c24;
}
.validation-issue {
margin: 8px 0;
padding: 8px;
background: #fff3cd;
border: 1px solid #ffeaa7;
border-radius: 4px;
font-size: 12px;
}
.validation-issue.error {
background: #f8d7da;
border-color: #f5c6cb;
}
.table-highlight {
outline: 2px solid #007bff;
outline-offset: 2px;
}
.table-highlight.invalid {
outline-color: #dc3545;
}
.cell-error {
background-color: rgba(220, 53, 69, 0.1) !important;
border: 1px solid rgba(220, 53, 69, 0.3) !important;
}
.cell-warning {
background-color: rgba(255, 193, 7, 0.1) !important;
border: 1px solid rgba(255, 193, 7, 0.3) !important;
}
</style>
</head>
<body>
<div id="validation-panel" class="validation-panel">
<h3>Table Validation</h3>
<div id="validation-summary" class="validation-summary">
<strong>No tables detected</strong>
</div>
<div id="validation-issues"></div>
</div>
<script>
class RealTimeTableValidator {
constructor() {
this.validationPanel = document.getElementById('validation-panel');
this.summaryElement = document.getElementById('validation-summary');
this.issuesElement = document.getElementById('validation-issues');
this.observer = new MutationObserver(this.handleDOMChanges.bind(this));
this.validationResults = new Map();
this.init();
}
init() {
// Start observing DOM changes
this.observer.observe(document.body, {
childList: true,
subtree: true,
attributes: true,
characterData: true
});
// Initial validation
this.validateAllTables();
// Periodic revalidation
setInterval(() => this.validateAllTables(), 5000);
}
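handleDOMChanges(mutations) {
// Revalidate when a table is added or removed, or when content inside a table changes.
// Attribute changes are skipped because the validator's own highlighting produces them,
// and mutations inside the validation panel are ignored.
const shouldRevalidate = mutations.some(mutation => {
if (mutation.type === 'attributes') return false;
const target = mutation.target.nodeType === Node.ELEMENT_NODE
? mutation.target
: mutation.target.parentElement;
if (!target || this.validationPanel.contains(target)) return false;
if (target.closest('table')) return true;
return [...mutation.addedNodes, ...mutation.removedNodes].some(node =>
node.nodeType === Node.ELEMENT_NODE &&
(node.tagName === 'TABLE' || node.querySelector('table') !== null)
);
});
if (shouldRevalidate) {
// Debounce bursts of mutations into a single validation pass
clearTimeout(this.revalidateTimer);
this.revalidateTimer = setTimeout(() => this.validateAllTables(), 100);
}
}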
validateAllTables() {
const tables = document.querySelectorAll('table');
this.validationResults.clear();
if (tables.length === 0) {
this.updateSummary('No tables detected', 'valid');
this.issuesElement.innerHTML = '';
return;
}
let totalIssues = 0;
let totalTables = tables.length;
tables.forEach((table, index) => {
const results = this.validateTable(table, index);
this.validationResults.set(table, results);
totalIssues += results.errors.length + results.warnings.length;
this.highlightTable(table, results);
});
this.updateSummary(
`${totalTables} tables, ${totalIssues} issues`,
totalIssues === 0 ? 'valid' : 'invalid'
);
this.displayIssues();
}
validateTable(table, tableIndex) {
const results = {
tableIndex,
errors: [],
warnings: [],
isValid: true
};
// Extract table data
const headers = [];
const rows = [];
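// Get headers: prefer the first <thead> row, otherwise the first row of the table
const headerRow = table.tHead && table.tHead.rows.length > 0
? table.tHead.rows[0]
: table.rows[0];
if (headerRow) {
Array.from(headerRow.cells).forEach(cell => {
headers.push(cell.textContent.trim());
});
}
// Get data rows: every row outside <thead> that is not the header row.
// Selecting 'tbody tr' would also pick up the header row when there is no <thead>,
// because browsers wrap all rows in an implicit <tbody>.
const dataRows = Array.from(table.rows).filter(row =>
row !== headerRow && !(table.tHead && table.tHead.contains(row))
);
dataRows.forEach(row => {
const cells = [];
Array.from(row.cells).forEach(cell => {
cells.push({
content: cell.textContent.trim(),
element: cell
});
});
rows.push({ cells, element: row });
});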
// Validate structure
this.validateTableStructure(table, headers, rows, results);
// Validate data consistency
this.validateDataConsistency(headers, rows, results);
// Validate cell content
this.validateCellContent(rows, results);
results.isValid = results.errors.length === 0;
return results;
}
validateTableStructure(table, headers, rows, results) {
// Check if table has headers
if (headers.length === 0) {
results.errors.push({
type: 'structure',
message: 'Table has no headers',
severity: 'error'
});
}
// Check column consistency
const expectedColumns = headers.length;
rows.forEach((row, rowIndex) => {
if (row.cells.length !== expectedColumns) {
results.errors.push({
type: 'structure',
message: `Row ${rowIndex + 1} has ${row.cells.length} columns, expected ${expectedColumns}`,
severity: 'error',
element: row.element
});
}
});
// Check for empty headers
headers.forEach((header, index) => {
if (!header) {
results.warnings.push({
type: 'structure',
message: `Header ${index + 1} is empty`,
severity: 'warning'
});
}
});
}
validateDataConsistency(headers, rows, results) {
// Check data types within columns
headers.forEach((header, colIndex) => {
const columnData = rows.map(row =>
row.cells[colIndex] ? row.cells[colIndex].content : ''
).filter(content => content.trim());
if (columnData.length > 1) {
const dataTypes = this.analyzeColumnDataTypes(columnData);
const consistency = this.calculateTypeConsistency(dataTypes);
if (consistency < 0.8) {
results.warnings.push({
type: 'consistency',
message: `Column "${header}" has mixed data types (${Math.round(consistency * 100)}% consistent)`,
severity: 'warning',
column: colIndex
});
}
}
});
// Check for duplicate rows
const rowHashes = new Set();
rows.forEach((row, rowIndex) => {
const rowHash = row.cells.map(cell => cell.content).join('|').toLowerCase();
if (rowHashes.has(rowHash)) {
results.warnings.push({
type: 'consistency',
message: `Row ${rowIndex + 1} appears to be a duplicate`,
severity: 'warning',
element: row.element
});
}
rowHashes.add(rowHash);
});
}
validateCellContent(rows, results) {
rows.forEach((row, rowIndex) => {
row.cells.forEach((cell, cellIndex) => {
const issues = this.detectCellIssues(cell.content);
issues.forEach(issue => {
results.warnings.push({
type: 'content',
message: `Row ${rowIndex + 1}, Column ${cellIndex + 1}: ${issue.message}`,
severity: issue.severity,
element: cell.element
});
});
});
});
}
analyzeColumnDataTypes(columnData) {
const types = {
number: 0,
date: 0,
boolean: 0,
url: 0,
email: 0,
string: 0
};
columnData.forEach(value => {
if (!isNaN(value) && !isNaN(parseFloat(value))) {
types.number++;
} else if (this.isDateLike(value)) {
types.date++;
} else if (['true', 'false', 'yes', 'no', 'on', 'off'].includes(value.toLowerCase())) {
types.boolean++;
} else if (value.startsWith('http://') || value.startsWith('https://')) {
types.url++;
} else if (value.includes('@') && value.includes('.')) {
types.email++;
} else {
types.string++;
}
});
return types;
}
calculateTypeConsistency(dataTypes) {
const total = Object.values(dataTypes).reduce((sum, count) => sum + count, 0);
const maxCount = Math.max(...Object.values(dataTypes));
return total > 0 ? maxCount / total : 1;
}
detectCellIssues(content) {
const issues = [];
// Check for excessive whitespace
if (content !== content.trim()) {
issues.push({
message: 'Has leading or trailing whitespace',
severity: 'warning'
});
}
// Check for very long content
if (content.length > 100) {
issues.push({
message: 'Content is very long (consider abbreviating)',
severity: 'warning'
});
}
// Check for HTML content
if (/<[^>]+>/.test(content)) {
issues.push({
message: 'Contains HTML tags',
severity: 'warning'
});
}
return issues;
}
isDateLike(value) {
const datePatterns = [
/^\d{4}-\d{2}-\d{2}$/,
/^\d{2}\/\d{2}\/\d{4}$/,
/^[A-Za-z]{3,9}\s+\d{1,2},?\s+\d{4}$/
];
return datePatterns.some(pattern => pattern.test(value));
}
highlightTable(table, results) {
// Remove previous highlights
table.classList.remove('table-highlight', 'invalid');
table.querySelectorAll('.cell-error, .cell-warning').forEach(cell => {
cell.classList.remove('cell-error', 'cell-warning');
});
// Add table highlight
table.classList.add('table-highlight');
if (!results.isValid) {
table.classList.add('invalid');
}
// Highlight problem cells
[...results.errors, ...results.warnings].forEach(issue => {
if (issue.element) {
const cssClass = issue.severity === 'error' ? 'cell-error' : 'cell-warning';
issue.element.classList.add(cssClass);
}
});
}
updateSummary(text, status) {
this.summaryElement.innerHTML = `<strong>${text}</strong>`;
this.summaryElement.className = `validation-summary ${status}`;
}
displayIssues() {
this.issuesElement.innerHTML = '';
this.validationResults.forEach((results, table) => {
if (results.errors.length > 0 || results.warnings.length > 0) {
const tableHeader = document.createElement('h4');
tableHeader.textContent = `Table ${results.tableIndex + 1}`;
tableHeader.style.margin = '10px 0 5px 0';
this.issuesElement.appendChild(tableHeader);
[...results.errors, ...results.warnings].forEach(issue => {
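const issueElement = document.createElement('div');
issueElement.className = `validation-issue ${issue.severity}`;
// Build the label from text nodes: messages can include header text copied
// from the page, which must not be interpreted as HTML
const typeLabel = document.createElement('strong');
typeLabel.textContent = `${issue.type}:`;
issueElement.append(typeLabel, ` ${issue.message}`);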
if (issue.element) {
issueElement.style.cursor = 'pointer';
issueElement.addEventListener('click', () => {
issue.element.scrollIntoView({ behavior: 'smooth', block: 'center' });
issue.element.style.backgroundColor = 'yellow';
setTimeout(() => {
issue.element.style.backgroundColor = '';
}, 2000);
});
}
this.issuesElement.appendChild(issueElement);
});
}
});
}
}
// Initialize validator when DOM is ready
if (document.readyState === 'loading') {
document.addEventListener('DOMContentLoaded', () => new RealTimeTableValidator());
} else {
new RealTimeTableValidator();
}
</script>
</body>
</html>
Conclusion
Markdown table validation and quality assurance protect data integrity and give tabular content across large documentation repositories a consistent, automatically verified standard. The building blocks covered in this guide are validation rules, automated quality checks for completeness, consistency, accuracy, and timeliness, cross-table relationship checks, and real-time feedback in the browser.
Successful validation balances automated checks with human oversight, so that technical validation serves content quality and user needs. Whether you’re building technical documentation, data catalogs, or knowledge bases, these techniques provide a foundation for reliable, accurate, and maintainable tabular content that users can depend on for decision-making.
Implement validation early in the content creation process, set data quality standards that match your organization’s requirements, and keep tuning your validation rules based on real-world usage and user feedback. With these systems in place, your Markdown tables can approach the rigor and reliability that users expect from enterprise data management systems.