Architecture Planning Docs: Closing Major Documentation Gaps
Published: February 2026
Category: Infrastructure & Developer Experience
Reading Time: 8 minutes
The Problem: Tribal Knowledge Tax
Every engineering team accumulates institutional knowledge—the "why" behind technical decisions, the edge cases in deployment, the workarounds for subtle bugs. When this knowledge lives only in Slack threads and senior engineers' memories, onboarding becomes archaeology.
New engineers spend weeks reconstructing context through code spelunking and tentative questions. Existing engineers rediscover solutions to previously solved problems. Critical system behaviors remain opaque until production incidents force understanding.
We had this problem. Badly.
Before: Documentation Scattered Across Channels
Knowledge Distribution (Fragmented)
┌──────────────────────────────────────────────────────────────┐
│ System Architecture │
│ ├─ Slack thread from 2024 (now buried) │
│ ├─ Email chain (3 participants) │
│ ├─ Notion doc (outdated) │
│ ├─ Comments in PRs #234, #567, #891 │
│ └─ Senior engineer's memory │
│ │
│ Auth Flow │
│ ├─ Stack Overflow answer (external) │
│ ├─ README.md (incomplete) │
│ └─ Trial and error │
│ │
│ Database Design Patterns │
│ ├─ Migration files (implicit) │
│ ├─ Code comments (sparse) │
│ └─ ??? │
│ │
│ TTS Speech Marks System │
│ ├─ Debug session notes (private) │
│ ├─ Production incident postmortem │
│ └─ Not written down │
└──────────────────────────────────────────────────────────────┘
Knowledge scattered, incomplete, often wrong
Onboarding Reality: 2+ Weeks
Week 1:
- Clone repo
- Spend 2 days getting environment working (undocumented dependencies)
- Read sparse README
- Ask "dumb questions" in Slack
- Feel like an imposter
Week 2:
- Read all code to understand architecture (16,000+ lines)
- Discover critical behaviors through debugging
- Break staging environment (didn't know about deployment process)
- Ask more questions
- Still confused
Week 3+:
- Make first small PR
- Get told "we don't do it that way" (undocumented pattern)
- Learn about API Gateway auth model (not documented)
- Realize migration rules after manual migration incident
- Finally understand 20% of the system
Pain Points Quantified
1. Repeated Questions
A Slack search showed we were answering the same questions every month:
- "How does authentication work?" (asked 8 times per month)
- "Why do migrations run in CI/CD?" (asked 6 times per month)
- "How are speech marks calculated?" (asked 4 times per month)
2. Onboarding Time
Average time to first meaningful contribution:
- Backend engineers: 15 working days
- Frontend engineers: 12 working days (less backend context needed)
- DevOps engineers: 10 working days (infrastructure-focused)
3. Production Incidents from Knowledge Gaps
Three P1 incidents in six months traced to undocumented system behavior:
- Manual migration breaking production (migration rule not documented)
- Auth failure in staging (API Gateway auth model not understood)
- Speech marks misalignment (algorithm not explained)
4. Duplicate Work
Engineers solving the same problems independently:
- RDS optimization investigated three separate times
- Lambda cost analysis done twice
- API caching strategy designed twice (different approaches)
The Solution: Centralized Architecture Documentation
We created comprehensive architecture documentation in the repository. Not in Notion. Not in Google Docs. In the codebase, versioned with the code, with doc updates required for PR approval.
After: Structured Knowledge Base
Documentation Status (Consolidated)
┌──────────────────────────────────────────────────────────────┐
│ docs/ │
│ ├─ architecture/ │
│ │ ├─ database-design.md │
│ │ │ • Schema design patterns │
│ │ │ • Migration strategy │
│ │ │ • Multi-tenancy approach │
│ │ │ • Index optimization │
│ │ ├─ api-architecture.md │
│ │ │ • REST design principles │
│ │ │ • Error handling patterns │
│ │ │ • Versioning strategy │
│ │ │ • Response formats │
│ │ ├─ auth-flow.md │
│ │ │ • API Gateway auth model │
│ │ │ • Cognito integration │
│ │ │ • Token validation │
│ │ │ • Per-app configuration │
│ │ ├─ content-pipeline.md │
│ │ │ • Media processing flow │
│ │ │ • TTS generation │
│ │ │ • S3 storage patterns │
│ │ │ • CDN delivery │
│ │ ├─ deployment-process.md │
│ │ │ • CI/CD pipeline stages │
│ │ │ • Migration execution │
│ │ │ • Environment promotion │
│ │ │ • Rollback procedures │
│ │ ├─ monitoring-alerting.md │
│ │ │ • CloudWatch dashboards │
│ │ │ • Alert thresholds │
│ │ │ • PagerDuty integration │
│ │ │ • Runbook links │
│ │ └─ tts-speech-marks.md │
│ │ • Arabic normalization │
│ │ • Index remapping algorithm │
│ │ • Frontend integration │
│ │ • Known limitations │
│ ├─ AGENTS.md │
│ │ • Canonical project instructions │
│ │ • Build & test commands │
│ │ • Security boundaries │
│ │ • Migration rules │
│ └─ CLAUDE.md │
│ • Cognito OAuth setup │
│ • TTS system reference │
│ • Merge workflow │
│ • General workflow rules │
└──────────────────────────────────────────────────────────────┘
Single source of truth, version-controlled, always current
Implementation: Writing the Missing Docs
Phase 1: Audit Knowledge Gaps (Commit: 3dfd8bf)
We systematically identified undocumented systems:
- Slack Analysis - Searched for repeated questions
- Onboarding Feedback - Asked new hires what confused them
- Incident Reviews - Analyzed incidents caused by knowledge gaps
- Code Review Patterns - Noted explanations given repeatedly in PRs
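The Slack analysis step can be approximated with a small script. The sketch below assumes the messages have already been exported to a list of strings; the sample messages and the `normalize` and `repeated_questions` helpers are hypothetical illustrations, not our actual tooling.

```python
import re
from collections import Counter


def normalize(question: str) -> str:
    """Lowercase and strip punctuation so near-identical questions group together."""
    return re.sub(r"[^a-z0-9 ]", "", question.lower()).strip()


def repeated_questions(messages: list[str], min_count: int = 2) -> list[tuple[str, int]]:
    """Return questions (messages ending in '?') asked at least min_count times."""
    counts = Counter(normalize(m) for m in messages if m.strip().endswith("?"))
    return [(q, n) for q, n in counts.most_common() if n >= min_count]


# Hypothetical sample data for illustration only
messages = [
    "How does authentication work?",
    "how does authentication work?",
    "Deploy finished.",
    "Why do migrations run in CI/CD?",
    "How does authentication work?!?",
]
print(repeated_questions(messages))  # [('how does authentication work', 3)]
```

Exact-match grouping like this misses rephrased questions, so treat the counts as a lower bound.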
Priority Matrix:
Impact vs Documentation Status
┌──────────────────────────────────────────────────────────────┐
│ High Impact, No Docs: High Impact, Partial Docs: │
│ • Database design ✗ • API design △ │
│ • Auth flow ✗ • Deployment △ │
│ • TTS speech marks ✗ • Monitoring △ │
│ • Content pipeline ✗ │
│ │
│ Low Impact, No Docs: Low Impact, Has Docs: │
│ • Script utilities ✗ • Git workflow ✓ │
│ • Analytics exports ✗ • Code style ✓ │
└──────────────────────────────────────────────────────────────┘
Phase 2: Document High-Impact Systems (Commits: 50870fc, 06f783a)
1. Database Design Patterns (database-design.md)
Content:
- Why shared tables over per-app tables (Content Duo design decision)
- Idempotent migration patterns
- Index naming conventions
- Foreign key strategies
- JSON column usage guidelines
Example Section:
## Idempotent Migrations
All migrations MUST be idempotent. Use IF NOT EXISTS guards:
```sql
CREATE TABLE IF NOT EXISTS users (
    id INT PRIMARY KEY,
    email VARCHAR(255) NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_users_email ON users(email);
```
Why: Failed migrations can be safely re-run without manual fixes.
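To see why the guards matter, here is a minimal sketch that applies the same migration twice. It assumes an in-memory SQLite database as a stand-in for the real one, and `apply_migration` is a hypothetical helper, not part of our migration tooling.

```python
import sqlite3

MIGRATION = """
CREATE TABLE IF NOT EXISTS users (
    id INT PRIMARY KEY,
    email VARCHAR(255) NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_users_email ON users(email);
"""


def apply_migration(conn: sqlite3.Connection) -> None:
    """Run the migration script; safe to call repeatedly because of the guards."""
    conn.executescript(MIGRATION)


conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the production database
apply_migration(conn)
apply_migration(conn)  # second run is a no-op instead of an "already exists" error

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)  # ['users']
```

Guard support varies by engine, so confirm that your database accepts `IF NOT EXISTS` on `CREATE INDEX` before relying on this pattern.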
2. API Gateway Auth Model (auth-flow.md)
Content:
- How API Gateway handles authentication (not Flask)
- Why integration tests don't need auth headers
- Cognito token validation flow
- Per-app OAuth configuration
- redirect_uri_mismatch troubleshooting
Example Section:
## Auth Boundary: API Gateway vs Flask
Authentication happens at API Gateway, NOT Flask:
┌──────────┐      ┌─────────────┐      ┌──────────┐
│ Client   │─────>│ API Gateway │─────>│ Flask    │
│ (+ token)│      │ • Validates │      │ • Trusts │
│          │      │ • Extracts  │      │ • No auth│
│          │      │ • Returns   │      │   check  │
│          │      │   401 if bad│      │          │
└──────────┘      └─────────────┘      └──────────┘
**Implication:** Integration tests in Flask don't require auth headers.
The API Gateway layer handles auth in production.
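The boundary can be illustrated with a toy model in plain Python. Everything here is a stand-in: `gateway`, `app`, and the `VALID_TOKENS` table are hypothetical and only mimic the division of responsibility; they are not API Gateway, Cognito, or Flask code.

```python
# Hypothetical token table standing in for Cognito validation
VALID_TOKENS = {"token-abc": "user-1"}


def app(request: dict) -> dict:
    """Stand-in for the Flask app: performs no auth check of its own."""
    return {"status": 200, "body": f"handled {request['path']}"}


def gateway(request: dict) -> dict:
    """Stand-in for API Gateway: reject bad tokens, otherwise forward to the app."""
    if request.get("token") not in VALID_TOKENS:
        return {"status": 401, "body": "Unauthorized"}
    return app({"path": request["path"]})


# Production path: requests go through the gateway
print(gateway({"path": "/media", "token": "bad"})["status"])        # 401
print(gateway({"path": "/media", "token": "token-abc"})["status"])  # 200
# Integration-test path: call the app directly, no auth header needed
print(app({"path": "/media"})["status"])                            # 200
```

The last call is the case the doc is warning about: the app answers without any token, which is fine in tests and is exactly why it must never be exposed without the gateway in front of it.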
3. TTS Speech Marks System (tts-speech-marks.md)
Content:
- Why Arabic TTS normalization breaks indices
- Character-by-character remapping algorithm
- media.text vs tts.text distinction
- Frontend integration requirements
- Current limitations (77.6% accuracy)
Example Section:
## The Index Remapping Problem
Arabic text normalization changes character positions:
Original: "القرآن" (7 chars with diacritics)
TTS Input: "القران" (6 chars, normalized)
Speech marks: [0, 3, 6] (for normalized text)
BUT UI needs: [0, 4, 7] (for original text)
**Solution:** `remap_speech_marks()` builds character mapping:
```python
def remap_speech_marks(original, normalized, marks):
    # Translate each mark from an index in the normalized text to an index in the original
    mapping = build_character_map(original, normalized)
    return [mapping[idx] for idx in marks]
```
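`build_character_map` is not shown in the excerpt. One way to build such a map, assuming normalization mostly deletes or substitutes characters, is to align the two strings with the standard library's `difflib.SequenceMatcher`. This is a sketch under that assumption, not the implementation in our codebase, and the sample string is illustrative.

```python
from difflib import SequenceMatcher


def build_character_map(original: str, normalized: str) -> dict[int, int]:
    """Map each index in `normalized` to an index in `original` (alignment-based sketch)."""
    mapping: dict[int, int] = {}
    matcher = SequenceMatcher(None, original, normalized, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            continue  # characters dropped by normalization have no normalized index
        last = max(i1, i2 - 1)  # keep replaced runs inside the original block
        for offset, j in enumerate(range(j1, j2)):
            mapping[j] = min(i1 + offset, last)
    mapping[len(normalized)] = len(original)  # end-of-text marks map to end-of-text
    return mapping


def remap_speech_marks(original: str, normalized: str, marks: list[int]) -> list[int]:
    mapping = build_character_map(original, normalized)
    return [mapping[idx] for idx in marks]


# Illustrative input: kaf, ta, ba, each followed by a fatha (U+064E) that normalization strips
original = "\u0643\u064E\u062A\u064E\u0628\u064E"
normalized = original.replace("\u064E", "")
print(remap_speech_marks(original, normalized, [0, 1, 2, 3]))  # [0, 2, 4, 6]
```

Alignment is a heuristic: when normalization rewrites long runs of text, the matcher can pick a different alignment than a human would, which is one plausible source of remapping errors.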
4. Deployment Process (deployment-process.md)
Content:
- CircleCI pipeline stages
- Why migrations run in CI/CD (not manually)
- RDS snapshot strategy (pre-migration safety)
- Environment-specific configurations
- Rollback procedures
Example Section:
## Migration Execution Rule
**CRITICAL:** NEVER run migrations manually.
```bash
# ✗ NEVER DO THIS
$ flask db upgrade
# ✓ ALWAYS LET CI/CD HANDLE IT
$ git push origin main
# CI/CD runs migrations automatically
```
Why:
- Consistent execution environment
- Automated pre-migration snapshots
- Rollback capability
- Audit trail in CI logs
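A written rule is easier to follow when the tooling backs it up. The sketch below shows one possible guard a migration entry point could call first. It assumes the pipeline runs on CircleCI, which sets the `CIRCLECI` environment variable to `true` in its jobs; `ensure_ci` is a hypothetical helper and not part of the documented process.

```python
import os


def ensure_ci(environ=None) -> None:
    """Raise unless running inside CircleCI (assumed to set CIRCLECI=true)."""
    env = os.environ if environ is None else environ
    if env.get("CIRCLECI") != "true":
        raise RuntimeError(
            "Refusing to run migrations outside CI/CD; push to main instead."
        )


# Simulate a laptop shell (no CI variables) and a CI job
try:
    ensure_ci({})
except RuntimeError as exc:
    print(exc)
ensure_ci({"CIRCLECI": "true"})  # passes silently
```

An environment variable is trivial to spoof, so this catches accidents rather than determined rule-breaking.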
5. Content Pipeline (content-pipeline.md)
Content:
- Media upload flow (S3 → Lambda → RDS)
- TTS generation (OpenAI integration)
- Speech mark calculation
- CDN delivery (CloudFront)
- Cache invalidation
6. Monitoring & Alerting (monitoring-alerting.md)
Content:
- CloudWatch dashboard links
- Key metrics to watch (Lambda duration, RDS connections, error rates)
- Alert thresholds and rationale
- PagerDuty escalation policy
- Runbook links for common alerts
7. API Architecture (api-architecture.md)
Content:
- RESTful design patterns
- Nested object structure (TTS refactoring)
- Error response format (RFC 7807)
- Pagination strategy
- Rate limiting configuration
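For readers unfamiliar with RFC 7807, the error format listed above is a JSON object with `type`, `title`, `status`, and optional `detail` and `instance` members, served as `application/problem+json`. A minimal sketch of a builder follows; `problem_response` and the sample values are hypothetical, not our actual helper.

```python
import json


def problem_response(status: int, title: str, detail: str = "",
                     type_uri: str = "about:blank",
                     instance: str = "") -> tuple[str, int, dict]:
    """Build an RFC 7807 problem details body, status code, and headers."""
    body = {"type": type_uri, "title": title, "status": status}
    if detail:
        body["detail"] = detail
    if instance:
        body["instance"] = instance
    headers = {"Content-Type": "application/problem+json"}
    return json.dumps(body), status, headers


# Hypothetical example values
body, status, headers = problem_response(
    404, "Not Found", detail="Media item 42 does not exist", instance="/media/42"
)
print(body)
```

The `(body, status, headers)` tuple matches a shape Flask views are allowed to return, so a helper like this can be used directly in a route or error handler.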
Phase 3: Integrate into Workflow
PR Template Update
```markdown
## Documentation Checklist
- [ ] Updated architecture docs if system design changed
- [ ] Updated AGENTS.md if build/test commands changed
- [ ] Updated API docs if endpoints changed
```
Onboarding Checklist
## Week 1: Core Documentation
- [ ] Read AGENTS.md (project instructions)
- [ ] Read docs/architecture/database-design.md
- [ ] Read docs/architecture/auth-flow.md
- [ ] Read docs/architecture/deployment-process.md
- [ ] Run ./run_tests.sh successfully
## Week 2: Deep Dives
- [ ] Read docs/architecture/content-pipeline.md
- [ ] Read docs/architecture/tts-speech-marks.md
- [ ] Make first contribution (documentation improvement)
Results
Onboarding Time Reduction
- Before: 15 working days to first meaningful contribution
- After: 3-4 working days to first meaningful contribution
- Improvement: 73-80% less time to first meaningful contribution (73% at the 4-day end of the range)
Reduced Slack Questions
Monthly repeated questions:
- "How does authentication work?" 8 → 0 times/month
- "Why do migrations run in CI/CD?" 6 → 0 times/month
- "How are speech marks calculated?" 4 → 1 time/month
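The percentages in this section follow directly from the before and after figures. A quick check, using only the numbers reported above:

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from before to after."""
    return (before - after) / before * 100


# Onboarding: 15 working days down to 3-4 working days
print(round(percent_reduction(15, 4), 1))  # 73.3
print(round(percent_reduction(15, 3), 1))  # 80.0

# Repeated questions per month: 8 + 6 + 4 = 18 before, 0 + 0 + 1 = 1 after
print(round(percent_reduction(18, 1), 1))  # 94.4
```

The headline 73% is the conservative end of the onboarding range, and the drop in repeated questions works out to about 94% for the three tracked questions.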
Production Incidents
Knowledge gap incidents:
- Before: 3 P1 incidents in 6 months
- After: 0 P1 incidents in the 3 months since the docs landed (a shorter window than the 6-month baseline)
Developer Confidence
Anonymous survey (n=5 engineers):
- "I understand how authentication works": 40% → 100%
- "I know the migration rules": 20% → 100%
- "I understand the deployment process": 60% → 100%
Documentation Drift
Docs updated in sync with code:
- Outdated sections: 0% (docs live in repo, reviewed in PRs)
- Missing systems: 0% (all major systems documented)
Self-Service Troubleshooting
Engineers resolving issues independently:
- Before: 30% of issues self-resolved (rest required senior engineer help)
- After: 80% of issues self-resolved (docs provide answers)
Key Lessons
1. Docs in Repo, Not External Tools
Notion docs go stale. Google Docs lose sync with code. Documentation in the repository, reviewed in PRs, stays current.
2. Architecture Docs ≠ Code Comments
Code shows "what" and "how." Architecture docs explain "why" and "when." Both are necessary.
3. Document the Exceptions
Most code is self-explanatory. Document the weird parts:
- Why does auth happen at API Gateway?
- Why do speech marks need remapping?
- Why can't we run migrations manually?
4. Onboarding Reveals Gaps
New engineers have fresh eyes. Their confusion signals documentation gaps. Make "submit a docs PR" part of onboarding.
5. Write Docs Like Code
- Version control
- PR reviews
- CI checks (broken links, outdated code samples)
- Treat staleness as technical debt
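A broken-link check is the easiest of these to automate. The sketch below scans Markdown files for relative links whose target file is missing. It is an illustration only: `broken_links` is a hypothetical helper, the regex handles simple inline links only, and the demo runs against a throwaway directory rather than our docs tree.

```python
import re
import tempfile
from pathlib import Path

LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_links(docs_root: Path) -> list[tuple[str, str]]:
    """Return (file, target) pairs for relative Markdown links whose target is missing."""
    broken = []
    for md_file in sorted(docs_root.rglob("*.md")):
        for target in LINK_RE.findall(md_file.read_text(encoding="utf-8")):
            if target.startswith(("http://", "https://", "mailto:", "#")):
                continue  # external links and in-page anchors are out of scope here
            path = target.split("#", 1)[0]
            if not (md_file.parent / path).exists():
                broken.append((md_file.name, target))
    return broken


# Demo against a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "auth-flow.md").write_text(
        "See [deploys](deployment-process.md).", encoding="utf-8")
    (root / "deployment-process.md").write_text(
        "See [db](database-design.md#migrations).", encoding="utf-8")
    result = broken_links(root)
    print(result)  # [('deployment-process.md', 'database-design.md#migrations')]
```

In CI, a non-empty result would fail the build, which turns a stale link into a review comment instead of a surprise for the next reader.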
Documentation Maintenance Strategy
When to Update Docs
- System Design Changes - PR must update architecture docs
- New Major Features - Require architecture explanation
- Incident Postmortems - Add findings to relevant docs
- Repeated Slack Questions - Signal missing documentation
Monthly Audit
# Check for outdated code examples in docs
./scripts/docs_audit.sh
# Validate all internal links
./scripts/check_doc_links.sh
# Search for "TODO" or "FIXME" in docs
grep -r "TODO\|FIXME" docs/
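The contents of `docs_audit.sh` are not shown here. As one idea of what a check for outdated code examples could do, the sketch below extracts Python blocks from a Markdown string and reports the ones that no longer compile. `invalid_python_blocks` and the sample document are hypothetical, and a syntax check only catches the crudest kind of rot.

```python
import re

FENCE = "`" * 3  # built indirectly so this example can itself live inside a fenced block
BLOCK_RE = re.compile(re.escape(FENCE) + r"python\n(.*?)" + re.escape(FENCE), re.DOTALL)


def invalid_python_blocks(markdown: str) -> list[int]:
    """Return the 0-based positions of Python blocks that fail to compile."""
    bad = []
    for index, code in enumerate(BLOCK_RE.findall(markdown)):
        try:
            compile(code, f"<block {index}>", "exec")
        except SyntaxError:
            bad.append(index)
    return bad


# Hypothetical document: the second block has a syntax error
sample = (
    f"{FENCE}python\nprint('ok')\n{FENCE}\n"
    f"{FENCE}python\ndef broken(:\n    pass\n{FENCE}\n"
)
print(invalid_python_blocks(sample))  # [1]
```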
Ownership Model
- AGENTS.md - Tech lead owns
- Architecture docs - Domain owners (auth lead owns auth-flow.md, etc.)
- CLAUDE.md - Updated by anyone, reviewed by tech lead
What We Didn't Document (And Why)
Code-Level Details
Function signatures, parameter types, return values → Use docstrings and type hints, not separate docs.
Constantly Changing Features
Active development work → Wait until stable. Document the settled architecture, not the experiment.
Obvious Patterns
Standard REST conventions, basic Python patterns → Assume baseline engineering knowledge.
Conclusion
Documentation gaps compound over time. Every undocumented decision becomes tribal knowledge. Every piece of tribal knowledge becomes onboarding friction. Every bit of onboarding friction slows team growth.
We reduced new engineer onboarding from 15 days to 3-4 days by writing seven architecture documents. The investment: ~40 hours of writing. The return: 11+ days saved per hire, zero knowledge-gap incidents, 80% self-service issue resolution.
Before: Scattered knowledge, 15-day onboarding, repeated questions.
After: Centralized docs, 3-4 day onboarding, self-service answers.
Write down what only senior engineers know. Make it searchable. Keep it current. Treat documentation as code.
Technical Stack:
- Markdown documentation
- Version-controlled in Git
- Reviewed in PRs
- Architecture commits: 3dfd8bf, 50870fc, 06f783a
Impact:
- 73-80% less onboarding time (15 days → 3-4 days)
- Over 90% reduction in repeated questions (18 → 1 per month across the three tracked questions)
- 0 knowledge-gap incidents since implementation
- 80% self-service issue resolution