Introduction
The company I worked at recently shut down. I worked there for over four and a half years, and I was involved in many of the decisions that influenced the engineering organization’s technology choices. Some of those decisions paid great dividends down the line, and some did not. I found myself thinking: If I were to join a new organization, would I make the same bets, or what would I change?
Disclaimer
This is purely my subjective opinion. Every organization’s context is different. Understanding that context and the long-term vision is crucial to making sound decisions.
Decision Log
Choice | Decision | Notes |
---|---|---|
AWS | Good | To name a few benefits: AWS allowed us to scale dynamically with business growth and demand. It was a cornerstone of our DevOps culture and kept our infrastructure teams lean. AWS was responsible for only 2-3 incidents that crippled the business, which is impressive given how long we used it. I would stay out of us-east-1, however. |
Fargate | Good | We ran most of the production systems on Fargate. Fargate enabled us to delegate most of the maintenance work to AWS (EC2 patching, upgrades, etc.) and let the SREs focus on other value-enabling streams. |
Kubernetes | Bad | There are legitimate use cases for k8s. However, I believe that most organizations that think they need k8s don’t need it. Supporting a k8s platform would have required massive bandwidth from the infrastructure teams for ongoing maintenance and KTLO (keeping the lights on). Public cloud providers have solutions that work for most organizations unless there is a business need that demands k8s. |
Postgres | Good | We ran one of the biggest transactional DBs I’ve ever worked with in my career (~10-14 TB). It was a solid choice, and I’d definitely use it again. |
Aurora | Bad | Expensive! We were able to scale high-throughput (read and write) systems well on vanilla Postgres. |
Redis | Good | Great in-memory database and cache. Never use it as your main DB, however (well… never say never). |
Elasticsearch | Bad | Expensive! Most of the use cases we had didn’t really need Elasticsearch. If I ever needed full-text search or real-time metrics analytics, then ES would be a great tool for that. |
SNS, SQS, Eventbus | Good | They are reliable and reasonably easy to set up. |
Kinesis | Bad | Too expensive if all you need is a regular message bus; try SNS, SQS, and Eventbus first. There are legitimate use cases for Kinesis and Firehose, and those are strong products in their respective areas. |
DynamoDB | Bad | It is OK for simple key-value storage. However, for the love of god, don’t use it to hammer every nail. It struggles to adapt to changing business requirements and can get expensive real quick if you aren’t careful. |
CloudFront | Good | While it doesn’t have as many features as Akamai and Cloudflare, it does integrate well with AWS and can do some powerful routing when combined with Lambdas. |
AWS WAF | Good | After launching WAF, we started blocking one third of production traffic outside of business hours. Overall, it is easy to use, and AWS Firewall Manager can ensure that WAF protection is enabled for every endpoint in your environment. The AWS-managed rules worked well. |
ACM | Good | Setting up SSL certificates has never been easier. |
VPC Lattice | Good | For connecting applications across multiple VPCs, I highly recommend VPC Lattice for most use cases. |
VPC Private Endpoint | Good | Perfect for connecting to datastores across multiple VPCs. |
VPC Peering | Bad | It has been superseded by a few product launches over the past few years (Transit Gateway, Lattice, Private Endpoints). |
VPC Transit Gateway | Bad | Try Lattice and Private Endpoints first. |
Terraform | Good | This was one of the best bets we made. We started using Terraform to manage AWS infrastructure, but eventually expanded it to GitHub, LaunchDarkly, Datadog, and configuration management for many other providers. |
CloudFormation | Bad | Unreliable, unreadable, and limited to AWS. |
Serverless Lambda Apps | Bad | Lambda is excellent for building custom integrations between AWS services and for small one-off scripts. However, as with DynamoDB, things get ugly when you hammer every nail with a Lambda function. Building complex business application workflows on Lambda will handicap you. It gets expensive, complicated, and challenging to maintain really quickly. |
CDK | Bad | Infrastructure and configuration management should be easy and simple. CDK stacks can sometimes get complicated and allow engineers to shoot themselves in the foot by building infrastructure configurations that are as complicated as the business logic itself. |
Terraform Cloud (TFC) | Bad | It’s a glorified and expensive Terraform CI/CD. Over time, it became too expensive. Their new pricing model, which charges you by the number of resources, is absolutely insane, and they know it. It also leads to having two CI/CD platforms (infra CI/CD and services CI/CD), which isn’t ideal and adds too much cognitive load. You could replace it with the Terraform CLI + an S3 backend + your regular CI/CD platform. |
Terraform Sentinel | Bad | I’m all for compliance as policy. However, Sentinel is coupled with TFC. Nowadays, you can replace it with Open Policy Agent. |
Python | Good | Given Python’s popularity, hiring Python engineers is easy. Many of the vendors on this list treat it as a first-class citizen with support at first launch. |
Ruby | Undecided | Many of the engineers we hired had to learn Ruby on day one. Given how opinionated the Rails ecosystem is, this meant that engineers would often run into some friction for the first few weeks. In addition, many of the vendors we worked with didn’t support Ruby on the same level as Python or Java. That being said, in the hands of capable engineers familiar with how Ruby (and especially Rails) operates, it is a performant solution that can accelerate development velocity. |
GitHub | Good | While GH is great, GitLab is a really good alternative. It offers a full DevOps package including CI/CD, SAST, test coverage, etc., and would be a better choice. However, GitHub Copilot is uncontested. |
Buildkite | Bad | At the time (2020), it was the best solution in the market for hybrid CI/CD platforms: you host the compute and Buildkite hosts the control plane. BK is a great vendor and I greatly respect what they built. As of 2024, GH Actions supports self-hosted agents, and its library of actions is uncontested. |
GitHub Actions | Good | Great community support and a massive library of ready-to-use actions. They have added self-hosted agents in recent years. The tight integration with the GitHub platform is a huge plus. |
Artifactory | Good | There are good alternatives depending on which languages you need to support (GitHub Packages, AWS CodeArtifact). Unfortunately, neither supports both Ruby and Python. |
Opsgenie | Bad | I can’t remember the last time Opsgenie released anything new that I was excited about. I’d go for PagerDuty or the DD Incident Management tool instead. I wouldn’t be surprised if Datadog expanded its catalogue to provide both incident management (currently supported) and on-call (not supported yet) functionality. |
Datadog | Good | It is expensive, but it is the best in the market. It is no longer just a monitoring tool but rather a platform that ingests most engineering operations telemetry, with best-in-market integrations. Datadog was one of the most innovative vendors we worked with. |
Honeybadger | Bad | Good error tracker, but in the spirit of consolidating vendors, DD can replace it. |
Splunk | Bad | I have nothing nice to say about them: bad support, bad product, bad company and rotten sales department. |
Trivy | Good | Great tool for enforcing Terraform and infrastructure best practices. |
Snyk | Bad | It does the job, but it lacks some core functionality like native GitHub integration (work in progress). DD can replace it, and there are free open-source tools that would work well for small companies. |
LogRocket | Bad | Not a bad product, but DD has replaced it. |
Swagger Hub | Bad | Fewer features than Postman |
Postman | Good | It can be expensive but it is a good product overall. |
OpenVPN | Undecided | It gets the job done and is reliable. The free version, however, is not “enterprisey” enough. An alternative that I never explored myself but was recommended to me is https://tailscale.com/pricing |
LaunchDarkly | Good | We introduced it too late, but I had a positive experience overall. Great Terraform support. |
Sophos EDR | Bad | Having seen what Defender offers, it is difficult to consider Sophos ever again. |
Slack | Good | I love the integrations and level of customization Slack provides. |
Jira | Good | It is bulky and slow, but it is customizable. We used it to track incident postmortem analyses, services, teams, and all sorts of metadata. |
Confluence | Good | I love what Confluence provides. However, knowledge management at scale is challenging for every product and organization. As years go by, the knowledge keeps growing, and it takes hard and deliberate effort to maintain it (remove irrelevant pages, organize content, simplify knowledge discovery, etc.). I’m not aware of a product in the market that has solved these problems yet but I am hopeful that someone will finally crack them. |
Zoom | Good | Overall, it is stable and reliable, and Zoom keeps innovating. The meeting summary feature has so much potential. |
EntraID | Good | I would not want to manage on-prem Active Directory. It is an enterprise-grade identity provider. It can be complicated to manage, but overall, the positives outweigh the negatives. |
Microsoft Defender | Good | One of the best products out there. Defender can ingest signals from many systems with minimal configuration and provide full end-to-end security monitoring. It is expensive, but I think it is worth every penny. |
Exchange Online | Good | Reasonably reliable (with the exception of when your email gets quarantined and it is not very clear why it got flagged). When combined with the entire Microsoft ecosystem, you get an enterprise-grade package (DLP, compliance, security, etc.). |
Auth0 | Good | Identity management is challenging, and Auth0 provides an initial boost to help you get started. However, as the company’s requirements evolved, we had to build additional internal abstractions on top of Auth0. |
Meraki | Good | Reliable hardware, cloud-based, and easy to set up and maintain. |
Aruba | Bad | Given a chance to redo it, sticking with one vendor such as Meraki would be ideal. |
Rapid7 | Bad | Their product is geared more towards on-prem enterprises. We had a few bad experiences with their CSOC. I’d like to give CrowdStrike a shot. |
New Relic | Bad | My experience is based on an older pricing model that they changed four years ago. That being said, Datadog’s platform offers more features beyond observability (Khajiit has wares if you got the coins). |