Architecture Planning Docs: Closing Major Documentation Gaps

Published: February 2026 | Category: Infrastructure & Developer Experience | Reading Time: 8 minutes

The Problem: Tribal Knowledge Tax

Every engineering team accumulates institutional knowledge—the "why" behind technical decisions, the edge cases in deployment, the workarounds for subtle bugs. When this knowledge lives only in Slack threads and senior engineers' memories, onboarding becomes archaeology.

New engineers spend weeks reconstructing context through code spelunking and tentative questions. Existing engineers rediscover solutions to previously solved problems. Critical system behaviors remain opaque until production incidents force understanding.

We had this problem. Badly.

Before: Documentation Scattered Across Channels

Knowledge Distribution (Fragmented)
┌──────────────────────────────────────────────────────────────┐
│ System Architecture                                          │
│ ├─ Slack thread from 2024 (now buried)                      │
│ ├─ Email chain (3 participants)                             │
│ ├─ Notion doc (outdated)                                    │
│ ├─ Comments in PRs #234, #567, #891                         │
│ └─ Senior engineer's memory                                 │
│                                                              │
│ Auth Flow                                                    │
│ ├─ Stack Overflow answer (external)                         │
│ ├─ README.md (incomplete)                                   │
│ └─ Trial and error                                          │
│                                                              │
│ Database Design Patterns                                    │
│ ├─ Migration files (implicit)                               │
│ ├─ Code comments (sparse)                                   │
│ └─ ???                                                       │
│                                                              │
│ TTS Speech Marks System                                     │
│ ├─ Debug session notes (private)                            │
│ ├─ Production incident postmortem                           │
│ └─ Not written down                                         │
└──────────────────────────────────────────────────────────────┘
    Knowledge scattered, incomplete, often wrong

Onboarding Reality: 3+ Weeks

Week 1:

  • Clone repo
  • Spend 2 days getting environment working (undocumented dependencies)
  • Read sparse README
  • Ask "dumb questions" in Slack
  • Feel like an imposter

Week 2:

  • Read all code to understand architecture (16,000+ lines)
  • Discover critical behaviors through debugging
  • Break staging environment (didn't know about deployment process)
  • Ask more questions
  • Still confused

Week 3+:

  • Make first small PR
  • Get told "we don't do it that way" (undocumented pattern)
  • Learn about API Gateway auth model (not documented)
  • Realize migration rules after manual migration incident
  • Finally understand 20% of the system

Pain Points Quantified

1. Repeated Questions

A Slack search showed we were answering the same questions every month:

  • "How does authentication work?" (asked 8 times per month)
  • "Why do migrations run in CI/CD?" (asked 6 times per month)
  • "How are speech marks calculated?" (asked 4 times per month)

2. Onboarding Time

Average time to first meaningful contribution:

  • Backend engineers: 15 working days
  • Frontend engineers: 12 working days (less backend context needed)
  • DevOps engineers: 10 working days (infrastructure-focused)

3. Production Incidents from Knowledge Gaps

Three P1 incidents in six months traced to undocumented system behavior:

  • Manual migration breaking production (migration rule not documented)
  • Auth failure in staging (API Gateway auth model not understood)
  • Speech marks misalignment (algorithm not explained)

4. Duplicate Work

Engineers solving the same problems independently:

  • RDS optimization investigated three separate times
  • Lambda cost analysis done twice
  • API caching strategy designed twice (different approaches)

The Solution: Centralized Architecture Documentation

We created comprehensive architecture documentation in the repository. Not in Notion. Not in Google Docs. In the codebase, versioned with the code, with doc updates required for PR approval.

After: Structured Knowledge Base

Documentation Status (Consolidated)
┌──────────────────────────────────────────────────────────────┐
│ docs/                                                        │
│ ├─ architecture/                                            │
│ │  ├─ database-design.md                                   │
│ │  │  • Schema design patterns                             │
│ │  │  • Migration strategy                                 │
│ │  │  • Multi-tenancy approach                             │
│ │  │  • Index optimization                                 │
│ │  ├─ api-architecture.md                                  │
│ │  │  • REST design principles                             │
│ │  │  • Error handling patterns                            │
│ │  │  • Versioning strategy                                │
│ │  │  • Response formats                                   │
│ │  ├─ auth-flow.md                                         │
│ │  │  • API Gateway auth model                             │
│ │  │  • Cognito integration                                │
│ │  │  • Token validation                                   │
│ │  │  • Per-app configuration                              │
│ │  ├─ content-pipeline.md                                  │
│ │  │  • Media processing flow                              │
│ │  │  • TTS generation                                     │
│ │  │  • S3 storage patterns                                │
│ │  │  • CDN delivery                                       │
│ │  ├─ deployment-process.md                                │
│ │  │  • CI/CD pipeline stages                              │
│ │  │  • Migration execution                                │
│ │  │  • Environment promotion                              │
│ │  │  • Rollback procedures                                │
│ │  ├─ monitoring-alerting.md                               │
│ │  │  • CloudWatch dashboards                              │
│ │  │  • Alert thresholds                                   │
│ │  │  • PagerDuty integration                              │
│ │  │  • Runbook links                                      │
│ │  └─ tts-speech-marks.md                                  │
│ │     • Arabic normalization                               │
│ │     • Index remapping algorithm                          │
│ │     • Frontend integration                               │
│ │     • Known limitations                                  │
│ ├─ AGENTS.md                                               │
│ │  • Canonical project instructions                        │
│ │  • Build & test commands                                 │
│ │  • Security boundaries                                   │
│ │  • Migration rules                                       │
│ └─ CLAUDE.md                                                │
│    • Cognito OAuth setup                                   │
│    • TTS system reference                                  │
│    • Merge workflow                                        │
│    • General workflow rules                                │
└──────────────────────────────────────────────────────────────┘
    Single source of truth, version-controlled, always current

Implementation: Writing the Missing Docs

Phase 1: Audit Knowledge Gaps (Commit: 3dfd8bf)

We systematically identified undocumented systems:

  1. Slack Analysis - Searched for repeated questions
  2. Onboarding Feedback - Asked new hires what confused them
  3. Incident Reviews - Analyzed incidents caused by knowledge gaps
  4. Code Review Patterns - Noted explanations given repeatedly in PRs

Priority Matrix:

Impact vs Documentation Status
┌──────────────────────────────────────────────────────────────┐
│ High Impact, No Docs:          High Impact, Partial Docs:    │
│  • Database design              • API design                 │
│  • Auth flow                    • Deployment                 │
│  • TTS speech marks             • Monitoring                 │
│  • Content pipeline                                          │
│                                                              │
│ Low Impact, No Docs:           Low Impact, Has Docs:         │
│  • Script utilities             • Git workflow               │
│  • Analytics exports            • Code style                 │
└──────────────────────────────────────────────────────────────┘

Phase 2: Document High-Impact Systems (Commits: 50870fc, 06f783a)

1. Database Design Patterns (database-design.md)

Content:

  • Why shared tables over per-app tables (Content Duo design decision)
  • Idempotent migration patterns
  • Index naming conventions
  • Foreign key strategies
  • JSON column usage guidelines

Example Section:

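## Idempotent Migrations

All migrations MUST be idempotent. Use IF NOT EXISTS guards:

```sql
-- Each statement is a no-op if the object already exists.
CREATE TABLE IF NOT EXISTS users (
  id INT PRIMARY KEY,
  email VARCHAR(255) NOT NULL
);

-- Note: CREATE INDEX IF NOT EXISTS works in PostgreSQL and SQLite; MySQL does not support it.
CREATE INDEX IF NOT EXISTS idx_users_email ON users(email);
```

Why: Failed migrations can be safely re-run without manual fixes.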
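
The re-run guarantee is easy to check locally. The sketch below applies the same two statements twice against an in-memory SQLite database and confirms the schema is unchanged after the second run. SQLite and the `apply_migration` and `schema_objects` helpers are assumptions chosen for illustration, not the project's actual migration tooling.

```python
import sqlite3
from typing import List, Tuple

MIGRATION = """
CREATE TABLE IF NOT EXISTS users (
  id INT PRIMARY KEY,
  email VARCHAR(255) NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_users_email ON users(email);
"""


def apply_migration(conn: sqlite3.Connection) -> None:
    """Run the migration script; the guards make repeated calls harmless."""
    conn.executescript(MIGRATION)


def schema_objects(conn: sqlite3.Connection) -> List[Tuple[str, str]]:
    """Return (type, name) for every object SQLite has recorded in the schema."""
    return conn.execute(
        "SELECT type, name FROM sqlite_master ORDER BY name"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    apply_migration(conn)
    after_first_run = schema_objects(conn)
    apply_migration(conn)  # second run: no error, nothing new created
    print(schema_objects(conn) == after_first_run)
```

Without the guards, the second call would fail with a "table users already exists" error, which is the manual-fix situation the rule is meant to avoid.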


2. API Gateway Auth Model (auth-flow.md)

Content:

  • How API Gateway handles authentication (not Flask)
  • Why integration tests don't need auth headers
  • Cognito token validation flow
  • Per-app OAuth configuration
  • redirect_uri_mismatch troubleshooting

Example Section:

## Auth Boundary: API Gateway vs Flask

Authentication happens at API Gateway, NOT Flask:

┌──────────┐      ┌─────────────┐      ┌──────────┐
│ Client   │─────>│ API Gateway │─────>│ Flask    │
│ (+ token)│      │ • Validates │      │ • Trusts │
│          │      │ • Extracts  │      │ • No auth│
│          │      │ • Returns   │      │   check  │
│          │      │   401 if bad│      │          │
└──────────┘      └─────────────┘      └──────────┘

Implication: Integration tests in Flask don't require auth headers.
The API Gateway layer handles auth in production.
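
To make the implication concrete, here is a minimal sketch of an application-layer handler and the kind of test it allows. It uses plain functions instead of Flask so it runs without dependencies. The `list_media` handler and the `X-Authenticated-User` header are hypothetical; how the gateway actually forwards identity to the app is not described here.

```python
from typing import Any, Dict, Mapping, Tuple

# Hypothetical header the gateway is assumed to set after validating the token.
IDENTITY_HEADER = "X-Authenticated-User"


def list_media(headers: Mapping[str, str]) -> Tuple[int, Dict[str, Any]]:
    """Application-layer handler: trusts the gateway and performs no token check."""
    user = headers.get(IDENTITY_HEADER, "anonymous")
    return 200, {"requested_by": user, "items": []}


def test_list_media_without_auth_header() -> None:
    # No Authorization header: the app layer never inspects one.
    status, body = list_media({})
    assert status == 200
    assert body["items"] == []


if __name__ == "__main__":
    test_list_media_without_auth_header()
    print("ok")
```

The trade-off is that the app is only safe when it is reachable exclusively through the gateway, which is why the boundary is worth writing down.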

3. TTS Speech Marks System (tts-speech-marks.md)

Content:

  • Why Arabic TTS normalization breaks indices
  • Character-by-character remapping algorithm
  • media.text vs tts.text distinction
  • Frontend integration requirements
  • Current limitations (77.6% accuracy)

Example Section:

## The Index Remapping Problem

Arabic text normalization changes character positions:

Original:  "القرآن" (7 chars with diacritics)
TTS Input: "القران" (6 chars, normalized)

Speech marks: [0, 3, 6] (for normalized text)
BUT UI needs: [0, 4, 7] (for original text)

Solution: remap_speech_marks() builds a character mapping:

```python
def remap_speech_marks(original, normalized, marks):
    # build_character_map (defined elsewhere) maps each index in the
    # normalized text to the matching index in the original text.
    mapping = build_character_map(original, normalized)
    return [mapping[idx] for idx in marks]
```
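
The excerpt does not show `build_character_map`, so the following is only a sketch of one way such a mapping could be built. It aligns the two strings with `difflib.SequenceMatcher` from the standard library, which is a stand-in technique and not necessarily the project's character-by-character algorithm. The sample text uses escaped code points (a three-letter word with a fatha after each letter) so the indices are unambiguous.

```python
from difflib import SequenceMatcher
from typing import List


def build_character_map(original: str, normalized: str) -> List[int]:
    """Map each index of `normalized`, plus its end position, to an index in `original`."""
    mapping = [0] * (len(normalized) + 1)
    matcher = SequenceMatcher(None, normalized, original, autojunk=False)
    for _tag, n1, n2, o1, o2 in matcher.get_opcodes():
        for k in range(n1, n2):
            # Matched or replaced runs advance in step, clamped to the run;
            # characters that exist only in `normalized` pin to the run start.
            mapping[k] = min(o1 + (k - n1), max(o2 - 1, o1))
    mapping[len(normalized)] = len(original)  # end-of-text marker
    return mapping


def remap_speech_marks(original: str, normalized: str, marks: List[int]) -> List[int]:
    mapping = build_character_map(original, normalized)
    return [mapping[idx] for idx in marks]


if __name__ == "__main__":
    # kaf, fatha, ta, fatha, ba, fatha (6 code points) vs. the same word without diacritics (3).
    original = "\u0643\u064e\u062a\u064e\u0628\u064e"
    normalized = "\u0643\u062a\u0628"
    print(remap_speech_marks(original, normalized, [0, 1, 2]))  # [0, 2, 4]
```

A diff-based alignment is easy to write but can pick the wrong match when a letter repeats near a removed character, so a production version would likely need tests built from real normalization output.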

4. Deployment Process (deployment-process.md)

Content:

  • CircleCI pipeline stages
  • Why migrations run in CI/CD (not manually)
  • RDS snapshot strategy (pre-migration safety)
  • Environment-specific configurations
  • Rollback procedures

Example Section:

## Migration Execution Rule

CRITICAL: NEVER run migrations manually.

```bash
# ✗ NEVER DO THIS
$ flask db upgrade

# ✓ ALWAYS LET CI/CD HANDLE IT
$ git push origin main
# CI/CD runs migrations automatically
```

Why:

  1. Consistent execution environment
  2. Automated pre-migration snapshots
  3. Rollback capability
  4. Audit trail in CI logs
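
One way to back the rule with code is a guard that refuses to run outside the pipeline and always takes the snapshot first. The sketch below is an assumption about how that could look, not the project's deployment code: `run_migrations`, `ensure_ci`, and `ManualMigrationError` are hypothetical names, and the snapshot and migrate steps are passed in as callables. The only external fact it relies on is that CircleCI sets `CIRCLECI=true` in its jobs.

```python
import os
from typing import Callable, List, Mapping, Optional


class ManualMigrationError(RuntimeError):
    """Raised when a migration is attempted outside the CI pipeline."""


def ensure_ci(env: Optional[Mapping[str, str]] = None) -> None:
    env = os.environ if env is None else env
    if env.get("CIRCLECI") != "true":
        raise ManualMigrationError(
            "Migrations must run in CI/CD; push to main instead of running them by hand."
        )


def run_migrations(
    snapshot: Callable[[], None],
    migrate: Callable[[], None],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Snapshot first, then migrate; refuse to do either outside CI."""
    ensure_ci(env)
    steps: List[str] = []
    snapshot()
    steps.append("snapshot")
    migrate()
    steps.append("migrate")
    return steps


if __name__ == "__main__":
    noop = lambda: None
    print(run_migrations(noop, noop, env={"CIRCLECI": "true"}))  # ['snapshot', 'migrate']
```

An environment check like this is a guardrail against accidents and not a security control, since anyone can set the variable by hand.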

5. Content Pipeline (content-pipeline.md)

Content:

  • Media upload flow (S3 → Lambda → RDS)
  • TTS generation (OpenAI integration)
  • Speech mark calculation
  • CDN delivery (CloudFront)
  • Cache invalidation

6. Monitoring & Alerting (monitoring-alerting.md)

Content:

  • CloudWatch dashboard links
  • Key metrics to watch (Lambda duration, RDS connections, error rates)
  • Alert thresholds and rationale
  • PagerDuty escalation policy
  • Runbook links for common alerts

7. API Architecture (api-architecture.md)

Content:

  • RESTful design patterns
  • Nested object structure (TTS refactoring)
  • Error response format (RFC 7807)
  • Pagination strategy
  • Rate limiting configuration
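
For readers unfamiliar with RFC 7807, the sketch below shows what a "problem details" error body looks like. The member names (`type`, `title`, `status`, `detail`, `instance`) and the `application/problem+json` media type come from the RFC. The `problem_response` helper and the example path are hypothetical and not taken from the project's API.

```python
import json
from typing import Any, Dict, Optional, Tuple

PROBLEM_JSON = "application/problem+json"


def problem_response(
    status: int,
    title: str,
    detail: Optional[str] = None,
    type_uri: str = "about:blank",  # RFC 7807 default when no specific type applies
    instance: Optional[str] = None,
) -> Tuple[int, Dict[str, str], str]:
    """Build (status, headers, body) for an RFC 7807 error response."""
    body: Dict[str, Any] = {"type": type_uri, "title": title, "status": status}
    if detail is not None:
        body["detail"] = detail
    if instance is not None:
        body["instance"] = instance
    return status, {"Content-Type": PROBLEM_JSON}, json.dumps(body)


if __name__ == "__main__":
    status, headers, body = problem_response(
        404, "Not Found", detail="No media item with that id.", instance="/media/123"
    )
    print(status, headers["Content-Type"])
    print(body)
```

Using one error shape across endpoints means clients can handle failures with a single code path.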

Phase 3: Integrate into Workflow

PR Template Update

## Documentation Checklist

- [ ] Updated architecture docs if system design changed
- [ ] Updated AGENTS.md if build/test commands changed
- [ ] Updated API docs if endpoints changed

Onboarding Checklist

## Week 1: Core Documentation

- [ ] Read AGENTS.md (project instructions)
- [ ] Read docs/architecture/database-design.md
- [ ] Read docs/architecture/auth-flow.md
- [ ] Read docs/architecture/deployment-process.md
- [ ] Run ./run_tests.sh successfully

## Week 2: Deep Dives

- [ ] Read docs/architecture/content-pipeline.md
- [ ] Read docs/architecture/tts-speech-marks.md
- [ ] Make first contribution (documentation improvement)

Results

Onboarding Time Reduction

  • Before: 15 working days to first meaningful contribution
  • After: 3-4 working days to first meaningful contribution
  • Improvement: 73-80% less time to first meaningful contribution
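
The 73% figure corresponds to the slower end of the 3-4 day range. A quick check of the arithmetic, using the backend baseline of 15 working days (`percent_reduction` is just an illustrative helper):

```python
def percent_reduction(before_days: float, after_days: float) -> float:
    """Share of the original onboarding time that was eliminated, in percent."""
    return (before_days - after_days) / before_days * 100


if __name__ == "__main__":
    print(round(percent_reduction(15, 4)))  # 73
    print(round(percent_reduction(15, 3)))  # 80
```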

Reduced Slack Questions

Monthly repeated questions:

  • "How does authentication work?" 8 → 0 times/month
  • "Why do migrations run in CI/CD?" 6 → 0 times/month
  • "How are speech marks calculated?" 4 → 1 time/month

Production Incidents

Knowledge gap incidents:

  • Before: 3 P1 incidents in 6 months
  • After: 0 incidents in 3 months (since docs)

Developer Confidence

Anonymous survey (n=5 engineers):

  • "I understand how authentication works": 40% → 100%
  • "I know the migration rules": 20% → 100%
  • "I understand the deployment process": 60% → 100%

Documentation Drift

Docs updated in sync with code:

  • Outdated sections: 0% (docs live in repo, reviewed in PRs)
  • Missing systems: 0% (all major systems documented)

Self-Service Troubleshooting

Engineers resolving issues independently:

  • Before: 30% of issues self-resolved (rest required senior engineer help)
  • After: 80% of issues self-resolved (docs provide answers)

Key Lessons

1. Docs in Repo, Not External Tools

Notion docs go stale. Google Docs lose sync with code. Documentation in the repository, reviewed in PRs, stays current.

2. Architecture Docs ≠ Code Comments

Code shows "what" and "how." Architecture docs explain "why" and "when." Both are necessary.

3. Document the Exceptions

Most code is self-explanatory. Document the weird parts:

  • Why does auth happen at API Gateway?
  • Why do speech marks need remapping?
  • Why can't we run migrations manually?

4. Onboarding Reveals Gaps

New engineers have fresh eyes. Their confusion signals documentation gaps. Make "submit a docs PR" part of onboarding.

5. Write Docs Like Code

  • Version control
  • PR reviews
  • CI checks (broken links, outdated code samples)
  • Treat staleness as technical debt

Documentation Maintenance Strategy

When to Update Docs

  1. System Design Changes - PR must update architecture docs
  2. New Major Features - Require architecture explanation
  3. Incident Postmortems - Add findings to relevant docs
  4. Repeated Slack Questions - Signal missing documentation

Monthly Audit

# Check for outdated code examples in docs
./scripts/docs_audit.sh

# Validate all internal links
./scripts/check_doc_links.sh

# Search for "TODO" or "FIXME" in docs
grep -r "TODO\|FIXME" docs/
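
The contents of the audit scripts are not shown here. As an illustration of what a link check involves, the following is a hypothetical Python equivalent that scans Markdown files for relative links whose targets do not exist. The `find_broken_links` name and the simple link regex are assumptions for this sketch. It ignores reference-style links and does not verify anchors.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Inline Markdown links: [text](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Anything starting with a URL scheme (https:, mailto:, ...) is treated as external.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def find_broken_links(docs_root: Path) -> List[Tuple[Path, str]]:
    """Return (file, target) pairs for relative links that point at missing files."""
    broken: List[Tuple[Path, str]] = []
    for md_file in sorted(docs_root.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        for target in LINK_RE.findall(text):
            if SCHEME_RE.match(target) or target.startswith("#"):
                continue  # external URL or in-page anchor: out of scope here
            path_part = target.split("#", 1)[0]
            if not (md_file.parent / path_part).exists():
                broken.append((md_file, target))
    return broken


if __name__ == "__main__":
    for source, target in find_broken_links(Path("docs")):
        print(f"{source}: broken link -> {target}")
```

Run from the repository root, it prints one line per broken link and nothing when all relative links resolve.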

Ownership Model

  • AGENTS.md - Tech lead owns
  • Architecture docs - Domain owners (auth lead owns auth-flow.md, etc.)
  • CLAUDE.md - Updated by anyone, reviewed by tech lead

What We Didn't Document (And Why)

Code-Level Details

Function signatures, parameter types, return values → Use docstrings and type hints, not separate docs.

Constantly Changing Features

Active development work → Wait until stable. Document the settled architecture, not the experiment.

Obvious Patterns

Standard REST conventions, basic Python patterns → Assume baseline engineering knowledge.

Conclusion

Documentation gaps compound over time. Every undocumented decision becomes tribal knowledge. Every piece of tribal knowledge adds onboarding friction. Every bit of onboarding friction slows team growth.

We reduced new engineer onboarding from 15 days to 3-4 days by writing seven architecture documents. The investment: ~40 hours of writing. The return: 11+ days saved per hire, zero knowledge-gap incidents, 80% self-service issue resolution.

Before: Scattered knowledge, 15-day onboarding, repeated questions.
After: Centralized docs, 3-4 day onboarding, self-service answers.

Write down what only senior engineers know. Make it searchable. Keep it current. Treat documentation as code.


Technical Stack:

  • Markdown documentation
  • Version-controlled in Git
  • Reviewed in PRs
  • Architecture commits: 3dfd8bf, 50870fc, 06f783a

Impact:

  • 73-80% shorter onboarding (15 days → 3-4 days)
  • ~94% reduction in repeated questions (18 → 1 per month)
  • 0 knowledge-gap incidents since implementation
  • 80% self-service issue resolution