Tech Strand – Engineering Architecture, Patterns, and Standards
Most products don’t die from bad ideas; they die because their technical DNA can’t survive growth. The Tech Strand defines the engineering backbone of your company:
- what you build on (stack),
- how it’s wired (architecture),
- how data lives and moves (storage + flows),
- and how far it can go before it breaks (capacity).
🧠 What the Tech Strand Owns
The Tech Strand is responsible for:
- Runtime & architecture – monolith vs services, languages, protocols.
- Client stack – web, desktop, mobile, and how they share logic.
- Data & storage – database engines, schemas, sharding model.
- Scalability & reliability – how the system behaves at 10×, 100× usage.
- Integrations & platform – APIs, events, and app surfaces.
- Standards & observability – how engineers ship and see what’s happening.
Together these decisions form the product’s technical DNA: every feature, library, and integration must obey it.
🔗 Inputs from Other Strands
The Tech Strand never works in isolation. It implements contracts defined by other strands:
- Product Strand →
  - Core jobs-to-be-done (e.g., real-time messaging, search, file sharing).
  - Stress features (e.g., enterprise orgs, cross-workspace channels).
  - Roadmap items that will heavily load infra (workflows, bots, automations).
- UX Strand →
  - Latency budgets (e.g., message send feedback < 200 ms).
  - Real-time expectations (presence, typing indicators, live updates).
  - Collaboration models (DMs, channels, threads, reactions).
- Brand Strand →
  - Reliability promises (“always on” vs “good enough”).
  - Security/compliance bar (e.g., enterprise-grade, data residency).
  - Platform narrative (“open app ecosystem”, “secure by design”).
Architecture Decision Axes
Instead of asking “what framework is trendy?”, the Tech Strand decides across six axes.
1. Runtime & Service Stack
Questions it answers:
- What runtimes best match our workload (real-time, web-heavy, compute-heavy)?
- Do we start monolith-first or services-first?
- How do we avoid painting ourselves into a scaling corner?
Typical choices:
- Runtime choices
  - PHP/Hack, Rails, Django, Node, Go, Java, or Elixir, depending on:
    - engineering talent,
    - latency constraints,
    - type-safety needs,
    - ecosystem maturity.
- Architecture patterns
  - Monolith-first with clear modules.
  - Modular monolith → microservices when necessary.
  - Cell-based architecture to limit blast radius.
  - BFF (Backend-for-Frontend) for each client surface.
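To make “cell-based architecture” concrete: each tenant is homed in exactly one cell, and all of its traffic is routed there, so a failing cell only affects the tenants it hosts. The sketch below is a minimal illustration under stated assumptions: the cell names, the `PINNED` override table, and the SHA-256-based placement are all hypothetical choices, not a description of any specific company’s router.

```python
import hashlib
from typing import Dict, Iterable, List

# Hypothetical cell names; a real deployment would load these from a directory service.
CELLS: List[str] = ["cell-a", "cell-b", "cell-c", "cell-d"]

# Explicit placements override hashing, e.g. isolating one very large tenant (hypothetical ID).
PINNED: Dict[str, str] = {"T-enterprise-1": "cell-d"}

def cell_for(workspace_id: str) -> str:
    """Home every workspace in exactly one cell; all of its traffic goes there."""
    if workspace_id in PINNED:
        return PINNED[workspace_id]
    digest = hashlib.sha256(workspace_id.encode()).digest()
    return CELLS[int.from_bytes(digest[:8], "big") % len(CELLS)]

def blast_radius(workspace_ids: Iterable[str], failed_cell: str) -> List[str]:
    """Workspaces affected if `failed_cell` goes down; everyone else keeps working."""
    return [w for w in workspace_ids if cell_for(w) == failed_cell]
```

Hash-modulo placement reshuffles tenants whenever the number of cells changes, which is why production systems usually persist the assignment (a directory like `PINNED`, but for every tenant) and move tenants between cells deliberately.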
Slack – Runtime & Services
- App layer: PHP → Hack (on HHVM) for the core web application.
- Real-time messaging: Java services for WebSocket handling, message routing, and fanout.
- Voice/video: Elixir services dedicated to calls and media.
2. Client Stack & Delivery
Questions it answers:
- Which clients do we support: web, desktop, mobile?
- How do we reuse logic and design tokens across platforms?
Typical choices:
- Web
  - React (or equivalent) front-end.
  - Shared design tokens + components (from UI Strand).
- Desktop
  - Electron or native shells wrapping the web app.
- Mobile
  - Native iOS (Swift) and Android (Kotlin) for performance-critical UX, or
  - React Native/Flutter with clear tradeoffs.
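One common way to share design tokens across web, desktop, and mobile is to keep a single source of truth and generate per-platform artifacts from it. The sketch below assumes a flat dictionary of hypothetical token names and values, and emits CSS custom properties (for the web and Electron clients) plus an Android color resource file; an iOS emitter would follow the same pattern.

```python
from typing import Dict

# Hypothetical token set; names and values are illustrative only.
TOKENS: Dict[str, str] = {
    "color.primary": "#3366CC",
    "color.text": "#1D1C1D",
    "space.md": "16px",
}

def to_css(tokens: Dict[str, str]) -> str:
    """Emit CSS custom properties for the web and desktop (Electron) clients."""
    body = "\n".join(f"  --{name.replace('.', '-')}: {value};" for name, value in tokens.items())
    return ":root {\n" + body + "\n}"

def to_android_colors(tokens: Dict[str, str]) -> str:
    """Emit an Android color resource file from the color tokens only."""
    items = "\n".join(
        f'  <color name="{name.replace(".", "_")}">{value}</color>'
        for name, value in tokens.items()
        if name.startswith("color.")
    )
    return "<resources>\n" + items + "\n</resources>"
```

Generating artifacts in a build step keeps platforms from drifting: a token change lands in one file and every client picks it up on the next build.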
Slack – Clients
- Web: React front-end with a Node-powered core engine.
- Desktop: Electron apps wrapping the React app.
- Mobile: Native iOS & Android clients consuming the same APIs.
3. Data Layer & Database Strategy
Questions it answers:
- What is the shape of our data? (messages, channels, orgs, files…)
- What are our consistency vs latency requirements?
- How do we scale beyond a single DB?
Core entities:
- User
- Workspace / Organization
- Channel
- Membership (User↔Channel, User↔Workspace)
- Message (with thread/reply chains)
- File / Attachment
- Reaction / Emoji
- App / Bot / Integration
Modeling principles:
- Normalize core relationships.
- Denormalize for read-heavy paths (unreads, channel lists, summaries).
- Use append-only logs for critical events (audit, recovery).
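The three modeling principles can be seen together in a small in-memory sketch: memberships and messages are the normalized source of truth, unread counts are a denormalized counter maintained at write time, and every state change is appended to an event log. The class and field names (`ChannelStore`, `parent_id`, the event tuples) are hypothetical, and a real system would keep these in database tables rather than Python dicts.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

@dataclass(frozen=True)
class Message:
    id: int
    channel_id: int
    user_id: int
    text: str
    parent_id: Optional[int] = None  # set for thread replies

@dataclass
class ChannelStore:
    members: Dict[int, Set[int]] = field(default_factory=dict)        # normalized: channel -> user ids
    messages: List[Message] = field(default_factory=list)             # source of truth
    unread: Dict[Tuple[int, int], int] = field(default_factory=dict)  # denormalized: (user, channel) -> count
    event_log: List[tuple] = field(default_factory=list)              # append-only audit trail

    def join(self, user_id: int, channel_id: int) -> None:
        self.members.setdefault(channel_id, set()).add(user_id)
        self.event_log.append(("join", user_id, channel_id))

    def post(self, msg: Message) -> None:
        self.messages.append(msg)
        self.event_log.append(("post", msg.user_id, msg.channel_id, msg.id))
        # Pay the cost at write time so channel lists never scan messages to count unreads.
        for member in self.members.get(msg.channel_id, set()):
            if member != msg.user_id:
                key = (member, msg.channel_id)
                self.unread[key] = self.unread.get(key, 0) + 1

    def mark_read(self, user_id: int, channel_id: int) -> None:
        self.unread[(user_id, channel_id)] = 0
        self.event_log.append(("read", user_id, channel_id))
```

The tradeoff is explicit: each post does work proportional to channel size, in exchange for constant-time unread lookups on the much more frequent read path.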
Capacity tiers (DB and data)
- Tier 0 – Prototype
  - Single MySQL/Postgres instance.
  - Read replica if needed.
  - Suitable up to ~10–50k DAU with good indexing.
- Tier 1 – Growth
  - Horizontal partitioning / early sharding.
  - Background jobs, heavier caching.
- Tier 2 – Slack-scale
  - Fully sharded DB layer with a routing and management system.
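At Tier 1 and Tier 2, every query needs a routing step that maps a sharding key (often the workspace) to a shard. The sketch below is loosely modeled on the key-range idea Vitess uses, where each shard owns a contiguous range of a 64-bit keyspace; the SHA-256 hash, the four-shard layout, and the shard names are assumptions for illustration, not Vitess’s actual hashing scheme or API.

```python
import bisect
import hashlib

# Each shard owns a contiguous range of a 64-bit keyspace; entries are the
# exclusive upper bounds of each range (hypothetical four-shard layout).
SHARD_UPPER_BOUNDS = [1 << 62, 2 << 62, 3 << 62, 1 << 64]
SHARD_NAMES = ["shard-0", "shard-1", "shard-2", "shard-3"]

def keyspace_id(workspace_id: str) -> int:
    """Map a tenant key to a stable 64-bit keyspace ID (SHA-256 is an assumption)."""
    return int.from_bytes(hashlib.sha256(workspace_id.encode()).digest()[:8], "big")

def shard_for(workspace_id: str) -> str:
    """Return the shard whose range contains the tenant's keyspace ID."""
    kid = keyspace_id(workspace_id)
    # First shard whose exclusive upper bound is strictly greater than the key.
    return SHARD_NAMES[bisect.bisect_right(SHARD_UPPER_BOUNDS, kid)]
```

Range-based routing makes resharding incremental: splitting one shard means inserting a new bound inside its range, and only the keys in that range move.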
Slack – Data & Storage
- Primary DB engine: MySQL.
- Sharding & management: Vitess, handling:
  - sharding,
  - query routing,
  - connection pooling,
  - online schema changes.
- Caching: Memcached for hot data, with mcrouter routing requests across the Memcached fleet.
- Async & streams:
  - Kafka for event streaming,
  - Redis for short-lived data and queues.
- Analytics: warehouse & batch stack (Presto/Spark/Airflow/Hadoop-style system).
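A Memcached tier in front of sharded MySQL is typically used in a cache-aside pattern: read from the cache, fall back to the database on a miss, and invalidate on writes. The sketch below assumes an in-memory dict standing in for Memcached and a caller-supplied `load` function standing in for the database query; neither is a real client API.

```python
import time
from typing import Callable, Dict, Optional, Tuple

class CacheAside:
    """Cache-aside reads with a TTL; a dict stands in for Memcached here."""

    def __init__(self, load: Callable[[str], Optional[dict]], ttl_s: float = 60.0):
        self._load = load  # hypothetical DB lookup supplied by the caller
        self._ttl = ttl_s
        self._store: Dict[str, Tuple[float, Optional[dict]]] = {}
        self.misses = 0

    def get(self, key: str) -> Optional[dict]:
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: the database is never touched
        self.misses += 1
        value = self._load(key)  # miss or expired: read through to the database
        self._store[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        # Delete on write instead of updating, so the next read refills from the DB.
        self._store.pop(key, None)
```

Deleting rather than overwriting on writes avoids a class of races where a slow writer puts stale data back into the cache; the TTL bounds how long any missed invalidation can serve stale values.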
4. Scalability, Reliability & Topology
Questions it answers:
- How do we design for 10×, 100× growth?
- How do we isolate failures so one bad shard doesn’t kill everything?
- What happens when a massive customer reconnects all at once?
Typical patterns:
- Horizontal sharding (often by tenant/workspace).
- Gateway layer for WebSockets and API traffic.
- Dedicated fanout services for broadcasting events.
- Backpressure & rate-limiting at all external edges.
- Multi-AZ deployments with automatic failover.
- Cellular architecture: split traffic into cells to reduce blast radius.
- Graceful degradation: search might be slow, messaging stays alive.
- Feature flags to decouple deploy from release.
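“Backpressure & rate-limiting at all external edges” is often implemented with a token bucket per client or per tenant: short bursts are allowed up to a capacity, while sustained traffic is capped at a refill rate. This is a minimal single-process sketch; the injectable `clock` parameter exists only to make it testable, and a real edge would keep buckets in a shared store.

```python
import time
from typing import Callable

class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never beyond the bucket's capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should reject (e.g. HTTP 429) or queue the request
```

Rejecting early at the edge is what keeps one noisy client from turning into saturation deeper in the stack.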
Slack – Topology & Scale
- Cloud: Amazon EC2-based infra for dev and app environments.
- Real-time topology:
  - Gateway servers for WebSocket connections.
  - Channel servers for routing and message fanout.
  - Presence servers for user online/offline state.
  - Admin/control-plane services for coordination.
- DB topology:
  - Vitess-managed MySQL shards, each MySQL instance paired with a co-located Vitess proxy (vttablet).
  - Millisecond-level query latencies across huge clusters.
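The question “what happens when a massive customer reconnects all at once?” has a client-side half: when a gateway server drops its WebSocket connections, every affected client must not retry at the same instant. A common remedy is capped exponential backoff with “full jitter”, sketched below; the base delay and cap are assumed values, not figures from any specific client.

```python
import random
from typing import List, Optional

def reconnect_delays(attempts: int, base_s: float = 0.5, cap_s: float = 30.0,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Capped exponential backoff with full jitter for WebSocket reconnects.

    Attempt n sleeps a random time in [0, min(cap, base * 2**n)], so clients
    that lost the same gateway spread their reconnects out instead of
    stampeding the replacement server together.
    """
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap_s, base_s * 2 ** n)) for n in range(attempts)]
```

Randomizing the whole interval (rather than adding a small jitter to a fixed delay) spreads load most evenly; the server side complements this with the rate limits and backpressure described above.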
5. Integration Surface & Platform
Questions it answers:
- What’s the official way external systems talk to us?
- How do we prevent “one-off hacks” for each integration?
Typical building blocks:
- REST/WebSocket APIs for primary usage.
- Events API (webhooks) for external consumers.
- Standardized app primitives:
  - slash commands,
  - bots,
  - interactive components,
  - workflow hooks.
- OAuth2 with scoped permissions.
- Rate limits & quotas.
- API versioning and deprecation windows.
- App review / validation flows.
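“OAuth2 with scoped permissions” means every API method declares the scopes it requires, and the platform checks the token’s granted scopes before dispatching. The sketch below uses a hypothetical method-to-scope table (`messages.send`, `channels.members` and their scopes are invented for illustration) and denies unknown methods by default.

```python
from typing import Dict, FrozenSet, Iterable

# Hypothetical method -> required-scope table; real platforms publish this per method.
REQUIRED_SCOPES: Dict[str, FrozenSet[str]] = {
    "messages.send": frozenset({"messages:write"}),
    "channels.members": frozenset({"channels:read", "users:read"}),
}

class ScopeError(PermissionError):
    """Raised when a token lacks the scopes a method requires."""

def authorize(method: str, granted: Iterable[str]) -> None:
    """Raise unless the token's granted scopes cover the method's requirements."""
    required = REQUIRED_SCOPES.get(method)
    if required is None:
        raise ScopeError(f"unknown method: {method}")  # deny by default
    missing = required - set(granted)
    if missing:
        raise ScopeError(f"{method} missing scopes: {sorted(missing)}")
```

Keeping the table declarative makes scopes reviewable in one place, which also helps app review flows check that an app requests only what it uses.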
Slack – Platform
- APIs & SDKs:
  - Slack Web API & Events API.
  - Bolt framework (Node, Python, Java) on top of the SDKs.
- Capabilities: slash commands, message actions, interactive UIs, workflows.
- Internal rule: internal systems should consume the same platform abstractions as external apps, with no “secret” internal DB shortcuts.
6. Engineering Standards, Tooling & Observability
Questions it answers:
- How do teams ship fast without breaking everything?
- How do we debug issues across thousands of services and shards?
Typical standards:
- Code quality
  - Static typing where practical (Hack, Java, TS).
  - Code review as the default, not the exception.
  - Service templates with logging/metrics/tracing built in.
- CI/CD
  - Automated tests and builds per change.
  - Canary & phased rollouts.
  - Fast rollbacks and feature flags.
- Observability
  - Centralized structured logging.
  - Metrics on latency, errors, saturation.
  - Distributed tracing across APIs and async jobs.
- Dev environments
  - Remote dev envs mirroring production topology.
  - Per-developer or per-team sandboxes.
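Canary rollouts, phased rollouts, and fast rollbacks all become configuration changes once features sit behind percentage-based flags. The sketch below is a minimal, assumption-laden flag evaluator: an allowlist acts as the canary (e.g. internal workspaces), and a deterministic hash places each workspace in one of 100 buckets so raising the percentage only ever adds tenants. The class and flag names are hypothetical.

```python
import hashlib
from typing import Dict, Set

class FeatureFlags:
    """Deterministic percentage rollouts: ship code dark, then raise the percentage."""

    def __init__(self) -> None:
        self.rollout_pct: Dict[str, int] = {}     # flag -> 0..100
        self.allowlist: Dict[str, Set[str]] = {}  # flag -> workspaces forced on (canary)

    def set_rollout(self, flag: str, pct: int) -> None:
        if not 0 <= pct <= 100:
            raise ValueError("pct must be in [0, 100]")
        self.rollout_pct[flag] = pct  # setting 0 is the rollback: no redeploy needed

    def enabled(self, flag: str, workspace_id: str) -> bool:
        if workspace_id in self.allowlist.get(flag, set()):
            return True
        # Same flag + workspace always lands in the same bucket, so cohorts are stable.
        h = hashlib.sha256(f"{flag}:{workspace_id}".encode()).digest()
        bucket = int.from_bytes(h[:4], "big") % 100
        return bucket < self.rollout_pct.get(flag, 0)
```

Hashing the flag name together with the workspace ID keeps cohorts independent across flags, so the same tenants are not always the first to receive every new feature.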
Slack – Standards & Dev Envs
- Remote development environments on EC2 running full Slack app replicas.
- Migration from plain PHP to Hack to enforce static types and long-term maintainability.
Capacity Planning Blueprint
Use capacity tiers to keep your Tech Strand honest.
Tier 0 – Prototype
- Architecture:
  - Monolith.
  - 1× DB (MySQL/Postgres) + optional read replica.
  - Simple cache (Redis/Memcached).
- Suitable for: up to ~10–50k DAU.
Tier 1 – Growth
- Architecture:
  - Modular monolith or early microservices.
  - Dedicated real-time services if needed.
  - Heavier caching, queues, scheduled jobs.
- Suitable for: ~250k DAU.
Tier 2 – Slack-Scale
- Architecture:
  - Cell-based microservices.
  - Fully sharded storage (Vitess-like).
  - Dedicated real-time grid (gateways, fanout, presence).
  - Rich platform layer (APIs, SDKs, events).
- Suitable for: 10M+ DAU, billions of messages/day.
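A quick back-of-the-envelope calculation shows why Tier 2 needs a dedicated fanout grid. One billion messages per day averages about 11.6k writes per second; the sketch below also applies an assumed peak-to-average ratio of 3 and an assumed 20 online recipients per message, both placeholders to replace with your own measurements.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400

@dataclass(frozen=True)
class Load:
    avg_writes_per_s: float
    peak_writes_per_s: float
    peak_deliveries_per_s: float

def estimate_load(messages_per_day: float, peak_to_avg: float = 3.0,
                  avg_online_recipients: float = 20.0) -> Load:
    """Rough write and fanout rates; both default ratios are assumptions."""
    avg = messages_per_day / SECONDS_PER_DAY
    peak = avg * peak_to_avg
    # Each stored message is pushed to every online member, so fanout multiplies load.
    return Load(avg, peak, peak * avg_online_recipients)
```

With these defaults, `estimate_load(1e9)` gives roughly 11.6k average and 34.7k peak writes per second, but about 694k deliveries per second: the delivery path, not the database write path, dominates at this tier.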
🧩 Third-Party & Integrations Catalog
The Tech Strand also maintains a catalog of external bets:
- Messaging & Queues
  - Kafka, Redis Streams, SQS, etc.
  - Chosen by throughput, ordering, and ops complexity.
- Search & Indexing
  - Solr/Elasticsearch/OpenSearch.
  - Chosen by multi-tenant index design and the latency vs freshness tradeoff.
- Analytics & Warehousing
  - Presto, Spark, Airflow, Hadoop, Snowflake, BigQuery.
  - Chosen by query model, retention, and cost.
- Monitoring & Observability
  - Prometheus+Grafana, Datadog, New Relic, OpenTelemetry.
  - Chosen by tracing capabilities, service correlation, and alerting quality.
For each entry, the catalog records:
- what it’s used for,
- why it was chosen,
- how hard it is to migrate away from it.
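Keeping catalog entries as structured records, rather than free-form notes, makes it easy to query the catalog, for example to list the dependencies with the highest lock-in before a planning review. The schema below is a hypothetical sketch, and the two entries (including their migration-cost ratings) are illustrative examples rather than recommendations.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

class MigrationCost(Enum):
    LOW = "low"        # drop-in alternatives exist
    MEDIUM = "medium"  # adapter layer or data backfill needed
    HIGH = "high"      # data model or protocol lock-in

@dataclass(frozen=True)
class CatalogEntry:
    name: str
    category: str
    used_for: str
    why_chosen: str
    migration_cost: MigrationCost

# Illustrative entries only; ratings are examples, not assessments.
CATALOG: List[CatalogEntry] = [
    CatalogEntry("Kafka", "Messaging & Queues", "event streaming between services",
                 "high throughput with ordered, replayable partitions", MigrationCost.HIGH),
    CatalogEntry("OpenTelemetry", "Monitoring & Observability", "traces and metrics instrumentation",
                 "vendor-neutral instrumentation", MigrationCost.LOW),
]

def high_lock_in(entries: List[CatalogEntry]) -> List[str]:
    """Names of dependencies that would be expensive to replace."""
    return [e.name for e in entries if e.migration_cost is MigrationCost.HIGH]
```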
🛠 How to Use This Strand in Practice
- Write the constraints first
  - From Product, UX, Brand: latency, scale, security, platforms.
- Pick a capacity tier
  - Prototype, Growth, or Slack-scale.
  - Document “what breaks next” as you grow.
- Fill out the six decision axes
  - Runtime & services
  - Client stack
  - Data & storage
  - Scalability & topology
  - Integrations & platform
  - Standards & observability
- Define a Slack-style reference
  - For each axis, add at least one real company profile (Slack here) to anchor reality.
- Revisit quarterly
  - The Tech Strand is living DNA.
  - Every major architectural evolution should be reflected here.
Quote to steal:
“Your product is what users see — but your Tech Strand decides whether it survives contact with reality.”

