<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Thoughtful Engineering]]></title><description><![CDATA[Insights on software scale, platform engineering, and strong tech teams.]]></description><link>https://www.adityachowdhry.me</link><image><url>https://substackcdn.com/image/fetch/$s_!il6o!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7de5978-98c4-4304-92b9-5965066b7714_1024x1024.png</url><title>Thoughtful Engineering</title><link>https://www.adityachowdhry.me</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 10:48:04 GMT</lastBuildDate><atom:link href="https://www.adityachowdhry.me/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Aditya Chowdhry]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[adityachowdhry@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[adityachowdhry@substack.com]]></itunes:email><itunes:name><![CDATA[Aditya Chowdhry]]></itunes:name></itunes:owner><itunes:author><![CDATA[Aditya Chowdhry]]></itunes:author><googleplay:owner><![CDATA[adityachowdhry@substack.com]]></googleplay:owner><googleplay:email><![CDATA[adityachowdhry@substack.com]]></googleplay:email><googleplay:author><![CDATA[Aditya Chowdhry]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[When the CLI Felt Like Magic]]></title><description><![CDATA[From Heroku to AI-native tools, the CLI has evolved beyond command execution. 
As systems grow complex, interface design becomes core to engineering velocity.]]></description><link>https://www.adityachowdhry.me/p/when-the-cli-felt-like-magic</link><guid isPermaLink="false">https://www.adityachowdhry.me/p/when-the-cli-felt-like-magic</guid><dc:creator><![CDATA[Aditya Chowdhry]]></dc:creator><pubDate>Thu, 12 Feb 2026 04:30:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3FZZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694e683-0425-410c-9736-919b9f430cbb_2300x1474.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There are two broad ways to design a CLI.</p><p>One approach mirrors the underlying system. The CLI exposes the full object model and API surface area. The other approach structures interaction. It organizes complexity so developers can navigate it with less cognitive effort.</p><p>As systems grow in scale, that distinction becomes more important.</p><div><hr></div><h2>Heroku: Structuring the Common Path</h2><p>I recently heard that Heroku is effectively moving into maintenance mode. That news made me reflect on how much influence it had on developer experience, especially through its CLI.</p><p>For many developers, Heroku was the first time infrastructure felt simple.</p><p>You ran:</p><p><code>heroku create</code></p><p><code>git push heroku master</code></p><p>And your application was live.</p><p>The effectiveness of this flow was not accidental. Heroku built deployment directly on top of Git. Instead of introducing a new operational model, it aligned infrastructure with an existing developer habit.</p><p>You pushed code. The platform handled build, release, routing, and process management.</p><p>Heroku did not expose infrastructure primitives like instances, networking rules, or load balancers. Those concerns existed, but they were abstracted behind an opinionated workflow.</p><p>The CLI was intentionally limited. 
It did not expose every possible knob. It constrained the surface area in exchange for cohesion.</p><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55aee882-2170-409a-ba4c-0feb96ad0264_1400x980.png&quot;},{&quot;type&quot;:&quot;image/webp&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aaa5f9da-4835-4bbc-8ff4-6698a487c60b_982x680.webp&quot;}],&quot;caption&quot;:&quot;&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f8d723d-1c9b-4e0b-a40c-b828a8dbd1a0_1456x720.png&quot;}},&quot;isEditorNode&quot;:true}"></div><p>Heroku did not mirror infrastructure. It structured access to it around the most common developer workflow.</p><p>That design decision reduced both operational and decision complexity.</p><div><hr></div><h2>AWS CLI: Complete System Exposure</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DCxE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef649f75-1263-46d5-a272-0921936c931f_1206x1406.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DCxE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef649f75-1263-46d5-a272-0921936c931f_1206x1406.png 424w, https://substackcdn.com/image/fetch/$s_!DCxE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef649f75-1263-46d5-a272-0921936c931f_1206x1406.png 848w, 
https://substackcdn.com/image/fetch/$s_!DCxE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef649f75-1263-46d5-a272-0921936c931f_1206x1406.png 1272w, https://substackcdn.com/image/fetch/$s_!DCxE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef649f75-1263-46d5-a272-0921936c931f_1206x1406.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DCxE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef649f75-1263-46d5-a272-0921936c931f_1206x1406.png" width="1206" height="1406" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef649f75-1263-46d5-a272-0921936c931f_1206x1406.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1406,&quot;width&quot;:1206,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:327004,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.adityachowdhry.me/i/187671173?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef649f75-1263-46d5-a272-0921936c931f_1206x1406.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DCxE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef649f75-1263-46d5-a272-0921936c931f_1206x1406.png 424w, 
https://substackcdn.com/image/fetch/$s_!DCxE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef649f75-1263-46d5-a272-0921936c931f_1206x1406.png 848w, https://substackcdn.com/image/fetch/$s_!DCxE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef649f75-1263-46d5-a272-0921936c931f_1206x1406.png 1272w, https://substackcdn.com/image/fetch/$s_!DCxE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef649f75-1263-46d5-a272-0921936c931f_1206x1406.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The AWS CLI represents a different philosophy.</p><p>It exposes more than 300 services and thousands of operations. Nearly every API call is accessible. In some cases, the CLI even provides capabilities not available in the console.</p><p>It is comprehensive.</p><p>But it does not impose a unified workflow across services. Each service has its own namespace, naming conventions, and flags. Discoverability is limited. Exploration usually requires documentation rather than interaction.</p><p>You cannot realistically hold the entire surface area in your head. You must know what you are looking for.</p><p>The AWS CLI mirrors the system architecture faithfully. It reflects organizational scale and API breadth.</p><p>That fidelity is powerful. But it transfers the burden of structure to the developer.</p><div><hr></div><h2>K9s: Reducing Interaction Cost</h2><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6cbcfc0c-be3f-4e0e-97d0-ac43b9548b6a_2512x1316.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/19a7ec8b-f4ee-43d7-866b-077b517f7ca7_2520x1320.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/823dbdf5-9936-4e53-a49b-3573db9808c8_2514x1318.png&quot;}],&quot;caption&quot;:&quot;K9s dashboards&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42297325-37f8-4249-9b17-abb63c027df5_1456x474.png&quot;}},&quot;isEditorNode&quot;:true}"></div><p>Kubernetes is also a large and complex system.</p><p>The kubectl CLI 
exposes the Kubernetes object model directly: pods, deployments, replica sets, stateful sets, services, config maps, ingress objects, and more. Working with it often involves multiple terminals and carefully assembled commands.</p><p>At Probo, while running production workloads on Kubernetes, we used K9s extensively.</p><p>K9s does not hide Kubernetes complexity. It preserves the object model. What it changes is the interaction cost.</p><p>Instead of repeatedly assembling commands, you navigate a live, continuously updating view of cluster state. You can sort by CPU or memory with a keystroke. You can switch namespaces quickly. You can inspect, edit, and drill into objects without constructing new command chains each time.</p><p>It uses real-time watch APIs, so state updates are continuous rather than discrete.</p><p>For engineers unfamiliar with kubectl, onboarding is faster because the system becomes discoverable. You explore available objects instead of recalling exact syntax.</p><p>K9s mirrors Kubernetes. 
But it organizes access to it in a way that reduces cognitive effort.</p><div><hr></div><h2>Claude CLI: Structuring Interaction With Uncertainty</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3FZZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694e683-0425-410c-9736-919b9f430cbb_2300x1474.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3FZZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694e683-0425-410c-9736-919b9f430cbb_2300x1474.png 424w, https://substackcdn.com/image/fetch/$s_!3FZZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694e683-0425-410c-9736-919b9f430cbb_2300x1474.png 848w, https://substackcdn.com/image/fetch/$s_!3FZZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694e683-0425-410c-9736-919b9f430cbb_2300x1474.png 1272w, https://substackcdn.com/image/fetch/$s_!3FZZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694e683-0425-410c-9736-919b9f430cbb_2300x1474.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3FZZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694e683-0425-410c-9736-919b9f430cbb_2300x1474.png" width="1456" height="933" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e694e683-0425-410c-9736-919b9f430cbb_2300x1474.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:933,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:835493,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.adityachowdhry.me/i/187671173?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694e683-0425-410c-9736-919b9f430cbb_2300x1474.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3FZZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694e683-0425-410c-9736-919b9f430cbb_2300x1474.png 424w, https://substackcdn.com/image/fetch/$s_!3FZZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694e683-0425-410c-9736-919b9f430cbb_2300x1474.png 848w, https://substackcdn.com/image/fetch/$s_!3FZZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694e683-0425-410c-9736-919b9f430cbb_2300x1474.png 1272w, https://substackcdn.com/image/fetch/$s_!3FZZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694e683-0425-410c-9736-919b9f430cbb_2300x1474.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Claude CLI applies similar structuring principles, but in a different category of system.</p><p>Traditional CLIs interact with deterministic systems. You issue a command and receive a predictable result. Even if the system is complex, the behavior is stable.</p><p>AI systems are different. They are probabilistic and opaque. Outputs are not guaranteed. Internal reasoning is not directly visible. Without structure, interaction can feel uncertain.</p><p>Claude CLI addresses that by structuring the interaction loop.</p><p>It provides visible processing feedback instead of blocking silently. It adapts cleanly to terminal resizing. It surfaces contextual shortcuts when relevant rather than requiring memorization. It avoids rigid flows while still offering structured guidance. 
When proposing code changes, it allows review before acceptance.</p><p>These choices reduce uncertainty. They introduce checkpoints where trust matters. They help developers reason inside a probabilistic system rather than treating it like a deterministic one.</p><p>Claude CLI does not simplify the model. It structures how developers engage with it.</p><div><hr></div><h2>The Real Shift</h2><p>The evolution of the CLI is not from commands to interactivity.</p><p>It is from exposing complexity to structuring it.</p><p>As systems expand in capability &#8212; whether cloud infrastructure, container orchestration, or AI models &#8212; faithfully mirroring internal APIs becomes insufficient. The surface area grows beyond what can be navigated through recall alone.</p><p>The more complex the system, the more responsibility shifts to the interface.</p><p>Modern platform teams are no longer just exposing capability. They are shaping how engineers think inside complex systems.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.adityachowdhry.me/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Thoughtful Engineering! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Joy of Coding]]></title><description><![CDATA[(or what it looks like now)]]></description><link>https://www.adityachowdhry.me/p/the-joy-of-coding</link><guid isPermaLink="false">https://www.adityachowdhry.me/p/the-joy-of-coding</guid><dc:creator><![CDATA[Aditya Chowdhry]]></dc:creator><pubDate>Fri, 06 Feb 2026 08:46:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vigN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82d69e1b-84ad-436a-91fa-7a394ee72ca3_1024x608.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vigN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82d69e1b-84ad-436a-91fa-7a394ee72ca3_1024x608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vigN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82d69e1b-84ad-436a-91fa-7a394ee72ca3_1024x608.png 424w, https://substackcdn.com/image/fetch/$s_!vigN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82d69e1b-84ad-436a-91fa-7a394ee72ca3_1024x608.png 848w, 
https://substackcdn.com/image/fetch/$s_!vigN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82d69e1b-84ad-436a-91fa-7a394ee72ca3_1024x608.png 1272w, https://substackcdn.com/image/fetch/$s_!vigN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82d69e1b-84ad-436a-91fa-7a394ee72ca3_1024x608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vigN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82d69e1b-84ad-436a-91fa-7a394ee72ca3_1024x608.png" width="1024" height="608" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82d69e1b-84ad-436a-91fa-7a394ee72ca3_1024x608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:608,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vigN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82d69e1b-84ad-436a-91fa-7a394ee72ca3_1024x608.png 424w, https://substackcdn.com/image/fetch/$s_!vigN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82d69e1b-84ad-436a-91fa-7a394ee72ca3_1024x608.png 848w, 
https://substackcdn.com/image/fetch/$s_!vigN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82d69e1b-84ad-436a-91fa-7a394ee72ca3_1024x608.png 1272w, https://substackcdn.com/image/fetch/$s_!vigN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82d69e1b-84ad-436a-91fa-7a394ee72ca3_1024x608.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">future of coding</figcaption></figure></div><p>There is a lot happening in the tech scene right now.</p><p>New 
model releases, new coding agents, and more conversations around autonomous coding. Things are moving fast. I don&#8217;t know if it&#8217;s a bubble around me or if it&#8217;s actually everywhere, but it&#8217;s very real in my circle. The people I follow, the people I talk to in real life. Everyone seems to be discussing the same shift.</p><p>It feels important. Like this is not just another tool upgrade, but a change in direction.</p><p>At the same time, I sometimes feel a bit discouraged. Not depressed, but something closer to uncertainty. What we used to do, and how we used to do it, is changing. The way I learned to code, the way I spent years getting better at it, doesn&#8217;t map one-to-one anymore.</p><p>I&#8217;m not resisting this change. I&#8217;m actively using these tools. Claude, Codex, multiple agents. I experiment, I adapt, I try to understand what works. Work still gets done. In many cases, it gets done faster.</p><p>The joy of coding is still there for me. It just looks different now.</p><p>I&#8217;m not writing every single line of code anymore. Instead, I&#8217;m guiding agents. Shaping direction. Reviewing outcomes. 
Translating thoughts into something executable through these tools. That process still gives me joy.</p><p>Recently, this became very clear while I was building my <strong><a href="https://www.linkedin.com/posts/adityachowdhry09_how-does-your-claudemd-or-agentsmd-file-activity-7425095635311280129-A2pq?utm_source=social_share_send&amp;utm_medium=member_desktop_web&amp;rcm=ACoAABNEYEEBpPJSpvNg3Iao5JEtM7rQl6X7kpQ">claude.md</a></strong>.</p><p>I spent time writing down my engineering philosophy. Not as theory, but as a summary of what I&#8217;ve learned over the years. Things like YAGNI with seams, being pragmatic about DRY, preferring simple code over clever abstractions, designing for deletion, treating security as a first-class concern, and focusing on integration tests over excessive mocking.</p><p>Writing it forced me to reflect. On mistakes I&#8217;ve made. On systems that became hard to change. On abstractions that looked elegant early on but became liabilities later. On production incidents that shaped how I think about observability, testing, and security.</p><p>I felt genuine pride while writing it. Not because it was perfect, but because it felt honest. Like compressing years of experience into something explicit. Something that could guide not just me, but also the agents I work with.</p><p>It wasn&#8217;t a new realization. More like a confirmation.</p><p>The joy of coding was never about typing code.</p><p>It was always about the end product.</p><p>But the process mattered. It was challenging, sometimes frustrating, and that&#8217;s exactly what made it satisfying. The thinking. The tradeoffs. The discipline to decide what not to build. The willingness to delete things that no longer served the system.</p><p>Today, that same process still exists. It has just moved up a layer.</p><p>Instead of expressing judgment through keystrokes, I express it through constraints, principles, reviews, and guidance. Through documents like claude.md. 
Through how I steer agents rather than how fast I type.</p><p>I don&#8217;t know how the future will look. I don&#8217;t know how far autonomy will go, or what the role of an engineer will eventually become.</p><p>What I do know is that the part I&#8217;ve always enjoyed hasn&#8217;t disappeared. It has just shifted.</p><p>The joy was never in typing code. It was in the process of thinking deeply, making tradeoffs, and turning ideas into systems that work.</p><p>If that process still exists in some form, I&#8217;m curious to see where this goes.</p>]]></content:encoded></item><item><title><![CDATA[Software Engineering: Living With Uncertainty]]></title><description><![CDATA[On uncertainty, adaptation, and the craft that keeps people going]]></description><link>https://www.adityachowdhry.me/p/software-engineering-living-with</link><guid isPermaLink="false">https://www.adityachowdhry.me/p/software-engineering-living-with</guid><dc:creator><![CDATA[Aditya Chowdhry]]></dc:creator><pubDate>Mon, 29 Dec 2025 13:32:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f7fcd181-6e55-46b4-aeee-8b2b7673c10e_1536x1024.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>From the outside, software engineering appears to be a safe and predictable career. It offers good pay, strong demand, and what looks like a logical progression over time. Once you enter the field, it feels as though the future should take care of itself.</p><p>Anyone who has spent years building real systems knows that this sense of stability is mostly an illusion. Software engineering is not built on certainty. It is built on continuous change.</p><div><hr></div><h2><strong>The Promise of &#8220;Final Answers&#8221;</strong></h2><p>Every few years, the industry confidently presents a new answer that is supposed to last.</p><ul><li><p>Java was once considered the safest long-term choice</p></li><li><p>Python became the language of productivity</p></li><li><p>JavaScript expanded beyond the browser and powered everything</p></li><li><p>React was seen as the solution to UI complexity</p></li><li><p>Kubernetes promised to make scaling manageable</p></li><li><p>Big Data was expected to reveal insights at scale</p></li><li><p>Machine Learning became a differentiator</p></li><li><p>Now AI is framed as unavoidable</p></li></ul><p>Each of these waves arrived 
with confidence and implied permanence. None of them truly delivered it. The tools themselves are not the problem. The problem is the belief that any tool can remove uncertainty from the career.</p><div><hr></div><h2><strong>Two Tracks That Shape Every Career</strong></h2><p>A software engineering career usually runs on two tracks at the same time.</p><h3><strong>Knowledge That Compounds</strong></h3><p>Some knowledge lasts and deepens with experience.</p><ul><li><p>How systems fail under load</p></li><li><p>Why simple designs outlive clever ones</p></li><li><p>How performance, cost, and reliability trade off</p></li><li><p>How incentives influence architecture more than documentation</p></li><li><p>How users behave in unexpected but repeatable ways</p></li></ul><p>This knowledge compounds. It remains useful regardless of trends.</p><h3><strong>Adaptation That Never Stops</strong></h3><p>The second track is adaptation.</p><p>Languages change. Frameworks evolve. Architectures reset. Tooling gets replaced. These skills are learned, applied, and eventually left behind. They do not compound in the same way.</p><p>Careers tend to stall when engineers confuse these tracks. Some chase novelty without building depth. Others rely only on past experience and resist change. Longevity comes from balancing both.</p><div><hr></div><h2><strong>The Hidden Mental Load</strong></h2><p>Software engineering is endless problem-solving. Not the kind that ends with closure, but the kind where solving one problem quietly introduces several more. You rarely finish work. You move from one set of constraints to another.</p><p>Over time, this creates mental pressure. FOMO creeps in slowly. A new framework trends. A peer switches jobs. Salary benchmarks move. Even when things are going well, it can feel like you are falling behind.</p><p>Financial uncertainty adds to this. 
When savings, investments, or expenses are not fully under control, every slowdown feels risky and every market shift feels personal.</p><div><hr></div><h2><strong>Pressure Inside Organizations</strong></h2><p>Within companies, the pressure continues through reviews, ratings, promotions, and increments. These systems are designed to encourage growth, but they often feel like a race with no clear finish line.</p><p>The focus slowly shifts:</p><ul><li><p>from learning to proving</p></li><li><p>from building to performing</p></li><li><p>from long-term thinking to short-term validation</p></li></ul><p>The work stops being only about improving systems. It becomes about constantly optimizing yourself.</p><div><hr></div><h2><strong>When Things Break Beyond Your Control</strong></h2><p>There are also failures engineers cannot influence.</p><p>Poor management decisions. Overhiring during good times. Sudden strategy shifts. Layoffs driven by financial models rather than performance. You can do everything right and still lose your job.</p><p>Unlike many traditional professions, software engineering offers little institutional protection. There is no government-backed job security and no guarantee that years of contribution will translate into continuity. Stability depends heavily on market conditions and leadership choices.</p><p>This reality quietly shapes behavior. People save aggressively, spend cautiously, and stay alert even when things seem stable.</p><div><hr></div><h2><strong>What Experience Eventually Teaches</strong></h2><p>With time, most engineers learn a difficult but useful truth. This career does not reward certainty. It rewards the ability to adapt without losing direction or burning out.</p><p>You stop asking what to learn forever. You start asking what helps you stay relevant now. 
You separate identity from tools, focus on fundamentals, adapt when necessary, and let go when required.</p><div><hr></div><h2><strong>Closing Thought</strong></h2><p>Seeing the uncertainty in software engineering does not mean disliking the work. Both can coexist.</p><p>Many people stay in this field not because it is stable, but because they genuinely enjoy problem solving. There is a quiet satisfaction in turning vague ideas into working systems, in watching something abstract become real, and in seeing logic take shape on a screen. That sense of creation is difficult to replace.</p><p>Titles change. Tools evolve. Guarantees remain elusive. But the joy of solving problems and seeing things work keeps people going through the cycles.</p><p>Software engineering is not a ladder. It is a landscape that keeps changing shape. Uncertainty is not a temporary phase. It is part of the job.</p><p>Learning to live with it, while still enjoying the craft, is what allows people to last.</p>]]></content:encoded></item>
<item><title><![CDATA[Cricket Scale, Part 3 - The Not-So-Technical Grind]]></title><description><![CDATA[the unglamorous side of tech]]></description><link>https://www.adityachowdhry.me/p/cricket-scale-part-3-the-not-so-technical</link><guid isPermaLink="false">https://www.adityachowdhry.me/p/cricket-scale-part-3-the-not-so-technical</guid><dc:creator><![CDATA[Aditya Chowdhry]]></dc:creator><pubDate>Sat, 20 Sep 2025 04:30:40 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/64a6f3d2-2c42-4767-b114-cf43950ff51c_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p>Continuation of <a href="https://www.adityachowdhry.me/p/introducing-the-cricket-scale-series?r=nq4mu">Cricket Scale, Part 3</a></p></blockquote><p></p><p>We often get so excited about doing &#8220;technical work&#8221; that we forget first principles and end up deep in yak shaving &#8212; optimizing the wrong thing, adding complexity no one asked for, or celebrating fixes that shouldn&#8217;t have existed in the first place.</p><p>At Cricket Scale, some of the most impactful wins weren&#8217;t about clever infra or cutting-edge tools. They came from the not-so-technical grind: listing APIs, questioning every spike, trimming redundant calls, and removing waste. 
Unglamorous chores, but they saved us more than any scaling trick ever could.</p><p>So in this post, I&#8217;ll start discussing those chores &#8212; the work that rarely makes it into conference talks, but quietly makes systems faster, cheaper, and easier to run.</p><p></p><h2><strong>1. Questioning the Existence</strong></h2><p>During load testing, we noticed a few APIs with throughput that stood out &#8212; far higher than expected. Our tests mimicked real user behavior in the app, mapping common flows to the APIs behind them. 
That exercise quickly pointed us toward listing every API and looking at the numbers: throughput, p90/p95/p99 latencies, average response times, and error rates.</p><div><hr></div><h3><em>Why are these APIs being called so often?</em></h3><p>The first question we asked was simple: <em>why are these APIs being called so often?</em></p><ul><li><p><strong>Instant Refresh Assumptions</strong> &#8212; Frontend logic assumed users always needed the &#8220;freshest&#8221; state, so certain endpoints were being hit far more frequently than necessary.</p></li><li><p><strong>Reloads and Missed Caching</strong> &#8212; Some APIs reloaded every time a user landed on a key screen (like the home page), even when nothing had changed. A few endpoints also slipped past client-side caching. Each case seemed small in isolation, but multiplied at DAU scale, they created massive unnecessary load.</p><p></p></li></ul><blockquote><p>Take the home page as an example. Every time a user landed there, our <strong>feed API</strong> was called &#8212; which made sense, because the feed had to be fresh. We couldn&#8217;t afford to show expired events.</p><p>But alongside the feed, our <strong>hamburger API</strong> was also being triggered on every visit. 
This endpoint pulled details like user balance, skill score, and help ticket status &#8212; things that didn&#8217;t need to be refreshed with the same urgency as the feed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!65uo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed97e3e0-1c50-4d3f-a620-41a2e9eefb2d_1152x1446.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!65uo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed97e3e0-1c50-4d3f-a620-41a2e9eefb2d_1152x1446.png 424w, https://substackcdn.com/image/fetch/$s_!65uo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed97e3e0-1c50-4d3f-a620-41a2e9eefb2d_1152x1446.png 848w, https://substackcdn.com/image/fetch/$s_!65uo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed97e3e0-1c50-4d3f-a620-41a2e9eefb2d_1152x1446.png 1272w, https://substackcdn.com/image/fetch/$s_!65uo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed97e3e0-1c50-4d3f-a620-41a2e9eefb2d_1152x1446.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!65uo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed97e3e0-1c50-4d3f-a620-41a2e9eefb2d_1152x1446.png" width="316" height="396.6458333333333" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed97e3e0-1c50-4d3f-a620-41a2e9eefb2d_1152x1446.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1446,&quot;width&quot;:1152,&quot;resizeWidth&quot;:316,&quot;bytes&quot;:330875,&quot;alt&quot;:&quot;Probo&#8217;s Home Page&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.adityachowdhry.me/i/173641202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed97e3e0-1c50-4d3f-a620-41a2e9eefb2d_1152x1446.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Probo&#8217;s Home Page" title="Probo&#8217;s Home Page" srcset="https://substackcdn.com/image/fetch/$s_!65uo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed97e3e0-1c50-4d3f-a620-41a2e9eefb2d_1152x1446.png 424w, https://substackcdn.com/image/fetch/$s_!65uo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed97e3e0-1c50-4d3f-a620-41a2e9eefb2d_1152x1446.png 848w, https://substackcdn.com/image/fetch/$s_!65uo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed97e3e0-1c50-4d3f-a620-41a2e9eefb2d_1152x1446.png 1272w, https://substackcdn.com/image/fetch/$s_!65uo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed97e3e0-1c50-4d3f-a620-41a2e9eefb2d_1152x1446.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Probo&#8217;s Home Page</figcaption></figure></div><p>Because the home page has the highest DAU, the hamburger API ended up matching the feed API in throughput. And that wasn&#8217;t the end of it &#8212; each hamburger call fanned out into other services, hitting the wallet service and skill score service to fetch details.</p><p>So what looked like a harmless reload turned into a <strong>cascade of unnecessary load</strong>, inflating throughput across multiple systems. 
At scale, this kind of oversight meant we were spending infra and engineering effort just to serve data users didn&#8217;t actually need that often.</p></blockquote><div><hr></div><h3><em>Why does this API exist at all?</em></h3><p>Once we answered <em>why they were called so often</em>, the next question was even more fundamental: <em>why does this API exist at all?</em></p><ul><li><p><strong>Zombie APIs</strong> &#8212; Endpoints left behind from old experiments or features long since retired.</p></li><li><p><strong>Unfinished Features</strong> &#8212; Half-built ideas that never made it to production, but still left endpoints active.</p></li></ul><p>Cleaning these up wasn&#8217;t glamorous work &#8212; it meant auditing flows, talking to product managers about what was still relevant, and carefully deprecating endpoints so nothing broke. But each removal was one less thing to scale, one less source of noise in our metrics, and one more bit of clarity in how the system actually worked.</p><p></p><h3><em>Does this product feature even matter?</em></h3><p>We didn&#8217;t stop at APIs. We went a step further and questioned the <strong>product features themselves</strong>.</p><p>We made a list of APIs that were expensive to run and mapped them against their importance to the user experience. Wherever we saw features that weren&#8217;t contributing much, or had little to no usage, we sat down with the product team and evaluated whether they were still worth keeping.</p><p>In many cases, the honest answer was no. Those features were quietly adding cost, complexity, and operational overhead without giving users any real value. 
With the product team&#8217;s support, we removed them altogether.</p><p>At scale, sometimes the best optimization isn&#8217;t caching or tuning &#8212; it&#8217;s simply <strong>deleting what no longer matters.</strong></p><p></p><h3><em>What about expensive but necessary APIs?</em></h3><p>Then there were cases where the APIs were genuinely expensive but couldn&#8217;t just be deleted. A good example was <strong>expired or past events in which users had participated</strong> &#8212; essentially their long-term portfolio.</p><p>For an app that had been live for more than a year, this meant a massive amount of historical data. Fetching and rendering all of it on demand didn&#8217;t make sense &#8212; the cost was high, the latency was painful, and most users only cared about their recent activity.</p><p>Here, the solution wasn&#8217;t removal, but redesign. Instead of hitting expensive APIs every time, we thought carefully about what actually needed to be shown up front, and what could be deferred, summarized, or tucked behind pagination.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yB_g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43173ce9-6cb2-4f67-8b69-409eb81900a8_722x1446.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yB_g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43173ce9-6cb2-4f67-8b69-409eb81900a8_722x1446.png 424w, https://substackcdn.com/image/fetch/$s_!yB_g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43173ce9-6cb2-4f67-8b69-409eb81900a8_722x1446.png 848w, 
https://substackcdn.com/image/fetch/$s_!yB_g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43173ce9-6cb2-4f67-8b69-409eb81900a8_722x1446.png 1272w, https://substackcdn.com/image/fetch/$s_!yB_g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43173ce9-6cb2-4f67-8b69-409eb81900a8_722x1446.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yB_g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43173ce9-6cb2-4f67-8b69-409eb81900a8_722x1446.png" width="210" height="420.58171745152356" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43173ce9-6cb2-4f67-8b69-409eb81900a8_722x1446.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1446,&quot;width&quot;:722,&quot;resizeWidth&quot;:210,&quot;bytes&quot;:284003,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.adityachowdhry.me/i/173641202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43173ce9-6cb2-4f67-8b69-409eb81900a8_722x1446.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yB_g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43173ce9-6cb2-4f67-8b69-409eb81900a8_722x1446.png 424w, 
https://substackcdn.com/image/fetch/$s_!yB_g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43173ce9-6cb2-4f67-8b69-409eb81900a8_722x1446.png 848w, https://substackcdn.com/image/fetch/$s_!yB_g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43173ce9-6cb2-4f67-8b69-409eb81900a8_722x1446.png 1272w, https://substackcdn.com/image/fetch/$s_!yB_g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43173ce9-6cb2-4f67-8b69-409eb81900a8_722x1446.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">download events older than 60 days</figcaption></figure></div><p></p><p>This was another kind of grind &#8212; not heroic infra scaling, but thoughtful product and engineering decisions about what truly mattered to the user experience.</p><p></p><h2><strong>2. The Ops Grind</strong></h2><p>Not all of the grind was about APIs or product features. A big part of scale was operational &#8212; managing observability, tuning infra, and staying ahead of limits.</p><h3><strong>Observability and Checklists</strong></h3><p>We built a top-level system view, inspired by <a href="https://tech.dream11.in/blog/2021-11-30_Observability-at-Scale--How-we-built-a-cutting-edge-Dream11-monitoring-ecosystem---c3ac8cfeca1">Dream11&#8217;s observability work</a>, that could instantly tell us which service or dependency in the entire system was having an issue. And with the help of a master dashboard, we could drill down further &#8212; moving from the big picture to the exact service, API, or dependency causing trouble.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8wnW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aafd974-8430-4aa3-b00c-6ade23c6c6e0_2048x1258.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8wnW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aafd974-8430-4aa3-b00c-6ade23c6c6e0_2048x1258.png 424w, https://substackcdn.com/image/fetch/$s_!8wnW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aafd974-8430-4aa3-b00c-6ade23c6c6e0_2048x1258.png 
848w, https://substackcdn.com/image/fetch/$s_!8wnW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aafd974-8430-4aa3-b00c-6ade23c6c6e0_2048x1258.png 1272w, https://substackcdn.com/image/fetch/$s_!8wnW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aafd974-8430-4aa3-b00c-6ade23c6c6e0_2048x1258.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8wnW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aafd974-8430-4aa3-b00c-6ade23c6c6e0_2048x1258.png" width="366" height="224.72802197802199" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4aafd974-8430-4aa3-b00c-6ade23c6c6e0_2048x1258.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:894,&quot;width&quot;:1456,&quot;resizeWidth&quot;:366,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8wnW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aafd974-8430-4aa3-b00c-6ade23c6c6e0_2048x1258.png 424w, https://substackcdn.com/image/fetch/$s_!8wnW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aafd974-8430-4aa3-b00c-6ade23c6c6e0_2048x1258.png 848w, 
https://substackcdn.com/image/fetch/$s_!8wnW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aafd974-8430-4aa3-b00c-6ade23c6c6e0_2048x1258.png 1272w, https://substackcdn.com/image/fetch/$s_!8wnW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aafd974-8430-4aa3-b00c-6ade23c6c6e0_2048x1258.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">master dashboard</figcaption></figure></div><blockquote><p>&#128073; If you&#8217;d like me to go deeper into how we approached observability &#8212; from designing top-level views to drill-down dashboards and fine-tuning alerting &#8212; let me know in the comments. That might be the focus of a future post.</p></blockquote><p>Alongside this, we relied heavily on checklists, timely enhancements, and alert fine-tuning. This wasn&#8217;t glamorous, but it gave us clarity and helped us keep firefights from turning into outages.</p><h3><strong>Infra Limits and Quotas</strong></h3><p>From the infrastructure side, the grind looked different but was just as important. Scaling wasn&#8217;t only about adding nodes or tuning autoscalers &#8212; it also meant making sure we never hit the practical limits of the cloud.</p><p>Cloud providers usually enforce <strong><a href="https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html">soft limits</a></strong> on resources, and if these are overlooked, they can easily cause downtime during peak events. To stay ahead, we sat down with each team to understand their scaling requirements and helped put their scaling configs in place well in advance. 
Depending on their needs, we also made <strong>capacity reservations</strong> so resources were guaranteed when traffic spiked.</p><p>Beyond quotas, we focused on resilience:</p><ul><li><p><strong>Creating backups</strong> so we could quickly recover in case of failures.</p></li><li><p><strong>Preparing SOPs for incidents</strong> so everyone knew what to do if something went wrong.</p></li><li><p><strong>Warming up redundant dependencies</strong> so failover paths were ready and wouldn&#8217;t add cold-start latency during a match.</p></li></ul><p>None of this was glamorous, but without it, all the scaling design in the world wouldn&#8217;t have mattered &#8212; the system would have hit a ceiling we didn&#8217;t control.</p><p></p><h3><strong>The Playbook Grind</strong></h3><p>Scaling wasn&#8217;t just about systems &#8212; it was about people. For the <strong>first day of the largest cricketing event, the IPL</strong>, we developed detailed <strong>SOPs</strong> that laid out exactly how things would run. Nothing was left to chance.</p><ul><li><p>A <strong>game marshal</strong> was appointed to oversee coordination.</p></li><li><p>We documented the <strong>points of contact</strong> across customer experience, business, marketing, and other key teams.</p></li><li><p>A dedicated <strong>Slack channel</strong> was created for real-time coordination.</p></li><li><p>An <strong>on-call rotation</strong> was set up so engineers knew exactly who was responsible at any given moment, ensuring quick response without chaos.</p></li><li><p>We defined <strong>pre-match activities</strong> (scaling checks, quota reviews, dry runs), <strong>during-match activities</strong> (traffic monitoring, alerts, notifications), and <strong>post-match activities</strong> (user comms, debriefs, reporting).</p></li></ul><p>Each step was written down and reviewed. It wasn&#8217;t glamorous engineering work, but it created alignment. 
Everyone knew their role, the handoffs were smooth, and there was no confusion when the real traffic hit.</p><h3><strong>Closing Thoughts</strong></h3><p>The &#8220;not-so-technical grind&#8221; rarely makes it into architecture diagrams or scaling war stories, but it&#8217;s often where the real wins come from. Listing APIs, questioning why they exist, debating product features, preparing infra checklists, and writing SOPs &#8212; none of it is glamorous, but all of it compounds at scale.</p><p>Sometimes the most powerful scaling move isn&#8217;t adding more infra. It&#8217;s asking simple, first-principle questions and having the discipline to trim, prepare, or delete.</p><p></p>]]></content:encoded></item>
<item><title><![CDATA[Scaling with Karpenter: The Good and The Bad]]></title><description><![CDATA[A candid story of trading Cluster Autoscaler for flexibility in Kubernetes scaling.]]></description><link>https://www.adityachowdhry.me/p/scaling-with-karpenter-the-good-and</link><guid isPermaLink="false">https://www.adityachowdhry.me/p/scaling-with-karpenter-the-good-and</guid><dc:creator><![CDATA[Tushar Khanka]]></dc:creator><pubDate>Wed, 17 Sep 2025 09:59:53 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2191676f-e100-4c03-84c9-9c920a969004_1520x1022.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We first discovered <strong>Karpenter</strong> back in 2023. At the time, Kubernetes scaling for us meant living with the <strong>Cluster Autoscaler</strong> on EKS. It was stable, predictable, and well documented. But it was also rigid and slow: new nodes could take minutes to join, and the only way to tune behaviour was by pre-scaling some placeholder pods.</p><p>Karpenter promised something fundamentally different. Instead of relying on ASGs, it provisions EC2 instances directly, based on rules you define with just two CRDs: <strong>Provisioner</strong> and <strong>AWSNodeTemplate</strong>. 
That means that instead of being locked into a fixed pool of machines, Karpenter can pick the <strong>cheapest suitable instance that is available</strong> at runtime.</p><p>That idea was compelling enough for us to try it in our lower environments first, since Karpenter wasn&#8217;t production-ready at the time.</p><div><hr></div><h2><strong>Why We Looked at Karpenter</strong></h2><p>The motivation was simple: our workloads were scaling up and the cracks in Cluster Autoscaler were starting to appear.</p><ul><li><p><strong>Speed</strong>: With Cluster Autoscaler, a new node could take about 2 minutes to become ready. With Karpenter (plus a lightweight AMI), it was closer to 40 seconds.</p></li><li><p><strong>Flexibility</strong>: We wanted more freedom in choosing machines, mixing on-demand with spot instances, and selecting from a wider set of EC2 types.</p></li><li><p><strong>Cost</strong>: By leaning on spot capacity and flexible provisioning, we could save <strong>up to 40% on compute</strong> without changing how workloads were deployed.</p></li><li><p><strong>Drift Detection</strong>: Karpenter can detect when nodes become outdated or misaligned with current requirements and proactively replace them. 
(Note: this feature arrived in later versions and was not available in early 2023.)</p></li></ul><p>This wasn&#8217;t about chasing new tech for the sake of it; rather, it was about building flexibility into our scaling model so we could adapt under unpredictable demand.</p><div><hr></div><h2><strong>The Early Days</strong></h2><p>When we started using Karpenter in 2023, it was far from stable. The stability journey was gradual:</p><ul><li><p>Pre-v0.32: Breaking changes were common, sometimes with every minor release.</p></li><li><p>v0.32: Marked a turning point where APIs started stabilizing.</p></li><li><p>v1.0.0: Finally delivered the production-ready stability we needed.</p></li></ul><p>Some of the specific pain points:</p><ul><li><p><strong>Cryptic errors</strong>: Pods got stuck in Pending with vague &#8220;capacity not available&#8221; messages, and nodes failed silently because of IAM misconfigurations.</p></li><li><p><strong>IAM complexity</strong>: We had to manage multiple roles (controller, node bootstrap, EC2 instance profile). If trust policies didn&#8217;t line up, nodes would launch but fail to register. For example, if the node instance profile lacked the necessary EKS worker node policies, pods would stay pending even though EC2 showed healthy instances.</p></li><li><p><strong>Breaking changes</strong>:</p><ul><li><p><strong>v0.27</strong> changed default consolidation logic, breaking scaling policies.</p></li><li><p><strong>v0.32</strong> overhauled scheduling APIs, forcing CRD rewrites.</p></li></ul></li></ul><p>Despite all this, the potential upside kept us going.</p><h2><strong>How Karpenter Works (in brief)</strong></h2><p>Cluster Autoscaler depends on <strong>Auto Scaling Groups</strong>: you define which instances you want, and the autoscaler requests capacity by resizing those groups.</p><p>Karpenter removes ASGs from the equation:</p><ol><li><p>A pod is pending.</p></li><li><p>Karpenter evaluates your Provisioner and AWSNodeTemplate. 
(These were the CRD names in 2023.)</p></li><li><p>It chooses the cheapest, most readily available EC2 instance type that satisfies the requirements (on-demand or spot).</p></li><li><p>It launches the instance directly and joins it to the cluster.</p></li></ol><p>That direct model is what makes Karpenter faster and more flexible.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aF9a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9f69864-8622-4ee5-8931-9a8769eec227_1242x1182.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aF9a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9f69864-8622-4ee5-8931-9a8769eec227_1242x1182.heic 424w, https://substackcdn.com/image/fetch/$s_!aF9a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9f69864-8622-4ee5-8931-9a8769eec227_1242x1182.heic 848w, https://substackcdn.com/image/fetch/$s_!aF9a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9f69864-8622-4ee5-8931-9a8769eec227_1242x1182.heic 1272w, https://substackcdn.com/image/fetch/$s_!aF9a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9f69864-8622-4ee5-8931-9a8769eec227_1242x1182.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aF9a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9f69864-8622-4ee5-8931-9a8769eec227_1242x1182.heic" width="1242" height="1182" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b9f69864-8622-4ee5-8931-9a8769eec227_1242x1182.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1182,&quot;width&quot;:1242,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80975,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.adityachowdhry.me/i/173728741?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9f69864-8622-4ee5-8931-9a8769eec227_1242x1182.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!aF9a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9f69864-8622-4ee5-8931-9a8769eec227_1242x1182.heic 424w, https://substackcdn.com/image/fetch/$s_!aF9a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9f69864-8622-4ee5-8931-9a8769eec227_1242x1182.heic 848w, https://substackcdn.com/image/fetch/$s_!aF9a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9f69864-8622-4ee5-8931-9a8769eec227_1242x1182.heic 1272w, https://substackcdn.com/image/fetch/$s_!aF9a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9f69864-8622-4ee5-8931-9a8769eec227_1242x1182.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2><strong>What Worked, What Hurt, What Broke</strong></h2><p><em>After months of experimenting, tweaking provisioners, and firefighting, we started to see a clearer picture of what Karpenter actually meant for us. It wasn&#8217;t just theory or benchmarks anymore; it was lived experience. The results were mixed: genuine wins, persistent frustrations, and a few near-disasters that taught us hard lessons.</em></p><p><em>Here&#8217;s how it really played out.</em></p><h2><strong>The Good: When Demand Hit Us Out of Nowhere</strong></h2><p>The first time Karpenter proved its worth was a weekday afternoon, a fairly low-traffic period, when one of our events suddenly got picked up on social media. 
Within minutes, requests shot up nearly 3&#215;.</p><p>With Cluster Autoscaler, this would&#8217;ve meant a painful 2&#8211;3 minutes of waiting for new nodes. Instead, Karpenter spun up fresh capacity in close to 40 seconds. Pods scheduled almost immediately, and the customer impact graph barely wobbled.</p><p>The second win was cost. By default, Karpenter mixed spot and on-demand intelligently, <strong>shaving ~40% off our compute bill</strong> without needing us to micro-manage instance types. That kind of flexibility, with no more rigid ASGs, felt transformative.</p><p>The third win was subnet planning. With Cluster Autoscaler, it was always a headache. IP exhaustion could quietly block scaling, and fixing it meant manual subnet juggling across availability zones.</p><p>Karpenter flipped that model entirely. As long as we tagged subnets with <code>karpenter.sh/discovery=&lt;cluster-name&gt;</code>, it could place nodes across them automatically. We could add bigger subnets without re-architecting capacity, and scaling bottlenecks from IP exhaustion mostly disappeared.</p><p>And by the time <strong>v1.0.0</strong> shipped, Karpenter was finally stable enough to rely on. It went from &#8220;experimental toy&#8221; to a critical part of our stack.</p><h2><strong>The Bad: Living on the Edge of Release Notes</strong></h2><p>But before <strong>v1.0</strong>, running Karpenter felt like being part of a beta program we never signed up for. Some minor upgrades came with breaking changes. Provisioners that worked last week suddenly refused to validate this week.</p><p>The documentation made things worse. Much of it continued to assume EKS clusters used the old <code>aws-auth</code> ConfigMap, even after AWS introduced EKS access entries as its replacement in late 2023. We ended up digging through GitHub issues more than reading official docs.</p><p>And then there was setup overhead. Between IAM roles, trust relationships, and provisioner tuning, it was never a plug-and-play install. 
Every new environment meant hours of yak-shaving before workloads could actually run.</p><p>These weren&#8217;t &#8220;cute edge cases.&#8221; They were the kind of pain that leaves scars: you&#8217;re refreshing Grafana dashboards while combing through opaque controller logs, trying to understand why autoscaling, the thing you <em>promised would be smoother</em>, is actively betraying you.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L6lD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d649b8-df82-4ca8-a287-80c0fba6f84a_2958x1300.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L6lD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d649b8-df82-4ca8-a287-80c0fba6f84a_2958x1300.heic 424w, https://substackcdn.com/image/fetch/$s_!L6lD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d649b8-df82-4ca8-a287-80c0fba6f84a_2958x1300.heic 848w, https://substackcdn.com/image/fetch/$s_!L6lD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d649b8-df82-4ca8-a287-80c0fba6f84a_2958x1300.heic 1272w, https://substackcdn.com/image/fetch/$s_!L6lD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d649b8-df82-4ca8-a287-80c0fba6f84a_2958x1300.heic 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!L6lD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d649b8-df82-4ca8-a287-80c0fba6f84a_2958x1300.heic" width="727" height="319.56043956043953" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74d649b8-df82-4ca8-a287-80c0fba6f84a_2958x1300.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:640,&quot;width&quot;:1456,&quot;resizeWidth&quot;:727,&quot;bytes&quot;:145389,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.adityachowdhry.me/i/173728741?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d649b8-df82-4ca8-a287-80c0fba6f84a_2958x1300.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!L6lD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d649b8-df82-4ca8-a287-80c0fba6f84a_2958x1300.heic 424w, https://substackcdn.com/image/fetch/$s_!L6lD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d649b8-df82-4ca8-a287-80c0fba6f84a_2958x1300.heic 848w, https://substackcdn.com/image/fetch/$s_!L6lD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d649b8-df82-4ca8-a287-80c0fba6f84a_2958x1300.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!L6lD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d649b8-df82-4ca8-a287-80c0fba6f84a_2958x1300.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2><strong>When to Consider Karpenter</strong></h2><ul><li><p><strong>Choose Karpenter</strong> if:</p><ul><li><p>Node startup latency is hurting your workloads.</p></li><li><p>You&#8217;re under cost pressure and want to use spot effectively.</p></li><li><p>You want flexibility to adapt to unpredictable 
demand.</p></li></ul></li><li><p><strong>Stick with Cluster Autoscaler</strong> if:</p><ul><li><p>Stability matters more than flexibility in your environment.</p></li><li><p>Your clusters are small, predictable, or heavily regulated.</p></li><li><p>Your team doesn&#8217;t have bandwidth to manage IAM and configuration complexities.</p></li></ul></li><li><p><strong>Recommended starting point</strong>: Begin with the latest <strong>v1.x stable release</strong>. Avoid older versions entirely.</p></li></ul><div><hr></div><h2><strong>Final Reflection</strong></h2><p>Adopting Karpenter early wasn&#8217;t smooth. We endured cryptic errors, broken upgrades, and IAM headaches. But the payoff was real: <strong>faster scaling</strong>, <strong>lower costs</strong>, and <strong>flexibility</strong> that Cluster Autoscaler couldn&#8217;t match.</p><p>The real lesson isn&#8217;t about Karpenter itself; it&#8217;s about pragmatic engineering. Sometimes the safe choice (Cluster Autoscaler) is fine. But sometimes, the risky bet (Karpenter in 2023) pays off because it unlocks new ways to scale under pressure.</p><p>In our case, it was worth it. If you&#8217;re hitting the walls of traditional autoscaling, it might be time to consider the trade-offs.</p><div><hr></div><h3><strong>What&#8217;s Next</strong></h3><p>Considering Karpenter for your setup? The implementation has its gotchas: IAM configuration, Provisioner/NodePool tuning, and migration strategies all need careful planning. If there&#8217;s enough interest, I&#8217;ll cover the practical setup details and lessons learned in a follow-up post.</p>]]></content:encoded></item><item><title><![CDATA[Cricket Scale, Part 2 - Redis: A Silver Bullet With a Cost]]></title><description><![CDATA[Challenges with Redis in action]]></description><link>https://www.adityachowdhry.me/p/cricket-scale-part-2-redis-a-silver</link><guid isPermaLink="false">https://www.adityachowdhry.me/p/cricket-scale-part-2-redis-a-silver</guid><dc:creator><![CDATA[Aditya Chowdhry]]></dc:creator><pubDate>Thu, 11 Sep 2025 04:30:34 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b340d655-79ec-469a-a363-8074f5c0f048_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is part of an ongoing series on <a href="https://www.adityachowdhry.me/p/introducing-the-cricket-scale-series">Cricket Scale</a>, a kind of scale that&#8217;s different from normal: sudden, spiky, and event-driven. Here I share how we at Probo built systems to handle it.</em></p><p>When we first brought Redis into the mix, it felt like the silver bullet. Everything got faster: API response times dropped, counters updated instantly, and the user experience stayed snappy even during match peaks.</p><blockquote><p>At peak, ~99.7% of requests were served in under 100 ms, with an average response time of just ~15 ms.</p></blockquote><p>Redis gave us the low-latency backbone we needed to keep the platform responsive, even as millions of requests were flowing every minute. 
It quickly became our default answer for anything that needed to be fast:</p><ul><li><p>Hot caches for APIs</p></li><li><p>Real-time counters for trades, traders, and portfolios</p></li><li><p>Sorted sets for leaderboards</p></li><li><p>Sessions, rate limiting, and throttling</p></li></ul><p>And for a while, it delivered.</p><p>But Redis also came with its own set of challenges, and those challenges became more noticeable as we scaled.</p><h2><strong>The Rug for Our Tech Debt</strong></h2><p>Redis didn&#8217;t just make our systems faster &#8212; it also made our problems less visible. In hindsight, we were guilty of <strong>sweeping tech debt under the Redis rug.</strong></p><ul><li><p>Slow database queries? Put them behind Redis.</p></li><li><p>Unoptimized schema? Hide it under a cache.</p></li><li><p>Missing indexes? Redis will cover it up.</p></li><li><p>Poorly thought-out architectures? Add Redis in the middle and call it &#8220;scalable.&#8221;</p></li></ul><p>It worked &#8212; until the rug was pulled away. 
The moment Redis faltered, all the hidden dirt came spilling out: the database choked, the APIs stalled, and the architectural shortcuts stood exposed.</p><p>The lesson for us was clear: Redis is excellent for performance, but it&#8217;s not a substitute for good database practices or sound architecture.</p><p>We learned this the hard way. Sometimes it was a Redis outage that exposed all the hidden issues at once. Other times, it was load testing combined with a bit of chaos testing that showed just how fragile the underlying systems really were.</p><h2>Not All Redis Data Is Equal: Ephemeral vs Permanent</h2><p>Redis shines when it&#8217;s used for <strong>ephemeral data</strong>: sessions with TTLs, rate limiting, short-lived caches. If Redis goes down, the system can repopulate or tolerate the loss.</p><p>But things get tricky when Redis is used for <strong>permanent data</strong> &#8212; like leaderboards without TTLs. Suddenly, Redis isn&#8217;t a cache anymore; it&#8217;s acting as a database, but without durability guarantees. And when it&#8217;s not available, the question becomes less technical and more product-driven:</p><ul><li><p><strong>What should the user see when Redis is down?</strong></p></li><li><p><strong>Do you try to calculate the leaderboard on the fly?</strong> (expensive and slow)</p></li><li><p><strong>Or do you handle it gracefully?</strong> For example, showing a message that &#8220;leaderboard is being built&#8221; or temporarily hiding the feature.</p></li></ul><p>This is where having a <strong>product mindset</strong> matters. If Redis is critical to the experience, you need a fallback plan that puts the user first &#8212; whether that&#8217;s hiding the feature, showing stale data, or communicating clearly that it&#8217;s temporarily unavailable.</p><p>Redis taught us that system design and product design can&#8217;t be separated. 
At scale, resilience isn&#8217;t just about infra choices; it&#8217;s also about <strong>how you want users to experience failure.</strong></p><p></p><h2>Hot &amp; Heavy: The trouble with Keys at scale</h2><p>One of the toughest Redis lessons came during a high-stakes India cricket match. Traffic spiked, DAUs surged, and suddenly our app started crashing. The events feed &#8212; the most visible part of the experience &#8212; stopped working.</p><p>On debugging, we discovered Redis response times had shot up. The culprit: a <strong>hot key.</strong></p><p>For simplicity, think of it as one prebuilt event feed cached under a single Redis key. (In reality, we had multiple keys with combinations like user segmentation and language, but the point remains the same.) That feed was updated asynchronously through an algorithm, and all requests hit it.</p><p>During normal days, this design held up. But when India played &#8212; and tens of millions of users came in at once &#8212; that one key became a <strong>single point of failure</strong>, even though Redis was running in a cluster. All the load funneled into a single shard, overwhelming it, while other shards stayed underutilized.</p><p>The first bottleneck we hit wasn&#8217;t CPU or memory &#8212; it was <strong>network throughput</strong>. That shard simply couldn&#8217;t push responses fast enough. 
We also realized our <strong>payload size</strong> made the situation worse: the feed stored under that key was large, so every request pushed a heavy response over the network.</p><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c38eeb08-ce94-4a87-9298-617ea90269b0_1374x542.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f06d1df-530c-4943-8586-ad4d3519f6f9_1384x546.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/86e38377-111c-4426-81cb-9f9e8f793fd4_1365x538.png&quot;}],&quot;caption&quot;:&quot;Load testing findings - Network, CPU, Number of commands&quot;,&quot;alt&quot;:&quot;Load testing findings - Network, CPU, Number of commands&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01d3a01f-b42c-42ef-abc1-42e053938037_1456x474.png&quot;}},&quot;isEditorNode&quot;:true}"></div><p>In subsequent years, hot keys remained a recurring theme, this time showing up as <strong>high CPU utilization</strong> and <strong>latency spikes</strong> on particular shards.</p><p>There are several ways to address this:</p><ul><li><p><strong>Multiple read replicas:</strong> adding replicas and enabling reads from them to spread the load.</p></li><li><p><strong>Distributing load across multiple keys</strong> that hold the same value, to avoid a single point of pressure. This requires some code changes in the cache read path.</p></li><li><p><strong>Client-side caching</strong> at the Kubernetes pod level, so not every request had to hit Redis. 
This reduced load on Redis, but brought its own challenges, especially update propagation and cache invalidation across hundreds of pods.</p></li><li><p><strong>Compression for big keys</strong> where payload size couldn&#8217;t be reduced at the source. This helped ease network throughput issues, at the cost of extra CPU for compression and decompression.</p></li></ul><p>In our case, since this was a <strong>feed</strong>, the writes (i.e. feed update logic) weren&#8217;t very frequent. That meant replication lag wasn&#8217;t a concern, which made the <strong>Redis master/replica method</strong> a clean fit. We scaled Redis simply by enabling reads across multiple replicas. The Redis client we used (ioredis) even exposed this as a single client option (<code>scaleReads</code> in cluster mode), which made the shift straightforward; we paired it with instances that offered higher network throughput.</p><p></p><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59ff0be7-9450-4bf4-9f1a-c267c2bf7f8b_2430x1187.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72472507-12de-425e-81e0-d94335195a84_1948x1099.png&quot;}],&quot;caption&quot;:&quot;Hot shard example with its workaround&quot;,&quot;alt&quot;:&quot;Hot shard example with its workaround&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/75455ce4-6e03-46d4-8897-cbc188cc525e_1456x720.png&quot;}},&quot;isEditorNode&quot;:true}"></div><p></p><h2>Cost and Migrations - The Price of Speed</h2><p>Our Redis journey wasn&#8217;t only about performance; cost and operations played a big role too.</p><p>We started with <strong>self-hosted Redis</strong>, which worked fine in the early days. 
But as scale grew, the cracks began to show:</p><ul><li><p><strong>Downtime:</strong> our setup couldn&#8217;t keep up with rapid traffic spikes, and the feed (again) was the culprit.</p></li><li><p><strong>Recovery pain:</strong> bringing Redis back after failures was slow and messy, especially under load.</p></li><li><p><strong>Management overhead:</strong> handling failovers, patching, and scaling manually took too much engineering time.</p></li></ul><p>The first big shift came when we migrated to <strong>Amazon ElastiCache</strong>. That move gave us reliability and faster operations, but it also introduced new challenges:</p><ul><li><p>To meet <strong>higher memory requirements</strong>, we had to pick larger instances.</p></li><li><p>That meant running <strong>8-core or 16-core machines</strong> even though Redis executes commands on a <strong>single thread</strong>, so most of those extra cores just sat idle. We were paying for CPU we couldn&#8217;t use, simply to get more memory.</p></li><li><p>This led to <strong>inefficient resource utilization</strong>, and since managed offerings from AWS aren&#8217;t cheap at scale, the cost quickly became significant.</p></li></ul><p>Eventually, we moved again, this time to <strong>Redis Enterprise (from Redis, formerly Redis Labs)</strong>. Surprisingly, this turned out to be a <strong>cheaper alternative while maintaining the same scale.</strong> Our guess is that Redis Enterprise optimizes resource utilization differently, possibly by running multiple Redis processes on the same underlying machine, which reduced waste compared to ElastiCache. Whatever the exact reason, it gave us a better balance between <strong>cost efficiency</strong> and <strong>operational reliability.</strong></p><p></p><h2>The Obvious Good Practices (That Are Easy to Forget)</h2><p>A lot of Redis pain can be avoided by sticking to the basics. 
These aren&#8217;t fancy tricks &#8212; just good hygiene that makes a big difference at scale:</p><ul><li><p><strong>Monitor slowlogs:</strong> keep an eye on commands that take longer than expected. They&#8217;re often early warnings of a bigger issue.</p></li><li><p><strong>Choose data types thoughtfully:</strong> don&#8217;t just reach for strings. Sorted sets, hashes, and bitmaps all have trade-offs in memory and performance.</p></li><li><p><strong>Set TTLs by default:</strong> ephemeral data should expire on its own; otherwise, you risk memory bloat.</p></li><li><p><strong>Avoid big keys:</strong> break payloads into smaller chunks to reduce network and serialization overhead.</p></li><li><p><strong>Watch replication lag:</strong> especially if you&#8217;re reading from replicas under heavy load.</p></li><li><p><strong>Plan for failover:</strong> assume Redis will go down at some point and decide what users should see when it does.</p></li></ul><p>At cricket scale, these fundamentals turn from &#8220;nice to have&#8221; into &#8220;must have.&#8221;</p><h2>Final Thoughts</h2><p>Redis was one of the most valuable tools in our stack. It helped us achieve p99 latencies under 100 ms at scale and kept the platform responsive during peak cricket moments.</p><p>At the same time, it highlighted the importance of using the right tool for the right job. Redis worked very well for certain use cases, but it also showed its limits when we stretched it too far.</p><div class="pullquote"><p>The takeaway for us: Redis isn&#8217;t a silver bullet. 
It&#8217;s extremely effective when applied thoughtfully, but it can create challenges if relied on without careful design.</p></div>]]></content:encoded></item><item><title><![CDATA[Scaler: our in-house scaling solution]]></title><description><![CDATA[A &#8220;nuts and bolts&#8221; walkthrough of how we tied Google Calendar to scaling logic.]]></description><link>https://www.adityachowdhry.me/p/scaler-our-in-house-scaling-solution</link><guid isPermaLink="false">https://www.adityachowdhry.me/p/scaler-our-in-house-scaling-solution</guid><dc:creator><![CDATA[Tushar Khanka]]></dc:creator><pubDate>Thu, 04 Sep 2025 16:59:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ESV_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc09cb0f-0491-412e-99ec-ff6a39de0d80_1830x1072.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!ESV_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc09cb0f-0491-412e-99ec-ff6a39de0d80_1830x1072.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ESV_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc09cb0f-0491-412e-99ec-ff6a39de0d80_1830x1072.heic 424w, https://substackcdn.com/image/fetch/$s_!ESV_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc09cb0f-0491-412e-99ec-ff6a39de0d80_1830x1072.heic 848w, https://substackcdn.com/image/fetch/$s_!ESV_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc09cb0f-0491-412e-99ec-ff6a39de0d80_1830x1072.heic 1272w, https://substackcdn.com/image/fetch/$s_!ESV_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc09cb0f-0491-412e-99ec-ff6a39de0d80_1830x1072.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ESV_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc09cb0f-0491-412e-99ec-ff6a39de0d80_1830x1072.heic" width="1456" height="853" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc09cb0f-0491-412e-99ec-ff6a39de0d80_1830x1072.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:853,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:86998,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.adityachowdhry.me/i/172550528?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc09cb0f-0491-412e-99ec-ff6a39de0d80_1830x1072.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ESV_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc09cb0f-0491-412e-99ec-ff6a39de0d80_1830x1072.heic 424w, https://substackcdn.com/image/fetch/$s_!ESV_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc09cb0f-0491-412e-99ec-ff6a39de0d80_1830x1072.heic 848w, https://substackcdn.com/image/fetch/$s_!ESV_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc09cb0f-0491-412e-99ec-ff6a39de0d80_1830x1072.heic 1272w, https://substackcdn.com/image/fetch/$s_!ESV_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc09cb0f-0491-412e-99ec-ff6a39de0d80_1830x1072.heic 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h3><strong><br><br>Integrating Google Calendar as the &#8220;Control Plane&#8221;</strong></h3><p>The first building block of <strong>Scaler</strong> was surprisingly non-technical: <strong>a shared Google Calendar</strong>. We realized that our traffic spikes weren&#8217;t random &#8212; they were <em>event-driven</em>. And who in the company knew about events ahead of time? The marketing team. 
Instead of building an elaborate custom interface, we decided to piggyback on something everyone already used and trusted.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OPll!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab2588-d1f8-4e3e-a2b1-a47fd3540e36_790x676.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OPll!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab2588-d1f8-4e3e-a2b1-a47fd3540e36_790x676.heic 424w, https://substackcdn.com/image/fetch/$s_!OPll!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab2588-d1f8-4e3e-a2b1-a47fd3540e36_790x676.heic 848w, https://substackcdn.com/image/fetch/$s_!OPll!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab2588-d1f8-4e3e-a2b1-a47fd3540e36_790x676.heic 1272w, https://substackcdn.com/image/fetch/$s_!OPll!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab2588-d1f8-4e3e-a2b1-a47fd3540e36_790x676.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OPll!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab2588-d1f8-4e3e-a2b1-a47fd3540e36_790x676.heic" width="282" height="241.30632911392405" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72ab2588-d1f8-4e3e-a2b1-a47fd3540e36_790x676.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:676,&quot;width&quot;:790,&quot;resizeWidth&quot;:282,&quot;bytes&quot;:14955,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.adityachowdhry.me/i/172550528?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab2588-d1f8-4e3e-a2b1-a47fd3540e36_790x676.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OPll!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab2588-d1f8-4e3e-a2b1-a47fd3540e36_790x676.heic 424w, https://substackcdn.com/image/fetch/$s_!OPll!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab2588-d1f8-4e3e-a2b1-a47fd3540e36_790x676.heic 848w, https://substackcdn.com/image/fetch/$s_!OPll!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab2588-d1f8-4e3e-a2b1-a47fd3540e36_790x676.heic 1272w, https://substackcdn.com/image/fetch/$s_!OPll!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab2588-d1f8-4e3e-a2b1-a47fd3540e36_790x676.heic 1456w" sizes="100vw"></picture></div></a></figure></div><p><strong>Step 1: Set up a shared calendar</strong></p><p>We created a dedicated Google Calendar called Scaling Events. Access was shared with engineering and marketing. Each person could add or edit events based on their role. Engineers got the <strong>calendar ID</strong>, which let us fetch the data programmatically, while non-technical teammates just saw &#8220;another calendar&#8221; show up in their Google Calendar UI.<br></p><p><strong>Step 2: Encode scaling intent as events</strong></p><p>Each event in the calendar represented a <strong>scaling profile</strong>. The title was simple and human-readable&#8212;high, medium, night, low&#8212;and always <strong>tied to a cricket match, campaign, or product launch</strong>. These were created after consulting the marketing/business team, so engineers didn&#8217;t have to guess when to scale.</p><p></p><p><strong>Step 3: Describe what to scale</strong></p><p>The event <strong>description</strong> field became our schema. 
Inside it, we listed which resources needed scaling, comma-separated:</p><pre><code><code>HPA, ASG, RDS, REDIS</code></code></pre><p>That way, a marketing manager could say <em>&#8220;this IPL playoff match is high traffic, scale Redis and RDS accordingly&#8221;</em> without touching infrastructure code.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nLxN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ee957f-583a-4290-b766-338a6f0e836d_2326x1566.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nLxN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ee957f-583a-4290-b766-338a6f0e836d_2326x1566.heic 424w, https://substackcdn.com/image/fetch/$s_!nLxN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ee957f-583a-4290-b766-338a6f0e836d_2326x1566.heic 848w, https://substackcdn.com/image/fetch/$s_!nLxN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ee957f-583a-4290-b766-338a6f0e836d_2326x1566.heic 1272w, https://substackcdn.com/image/fetch/$s_!nLxN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ee957f-583a-4290-b766-338a6f0e836d_2326x1566.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nLxN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ee957f-583a-4290-b766-338a6f0e836d_2326x1566.heic" width="442" height="297.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82ee957f-583a-4290-b766-338a6f0e836d_2326x1566.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:980,&quot;width&quot;:1456,&quot;resizeWidth&quot;:442,&quot;bytes&quot;:61601,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.adityachowdhry.me/i/172550528?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ee957f-583a-4290-b766-338a6f0e836d_2326x1566.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nLxN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ee957f-583a-4290-b766-338a6f0e836d_2326x1566.heic 424w, https://substackcdn.com/image/fetch/$s_!nLxN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ee957f-583a-4290-b766-338a6f0e836d_2326x1566.heic 848w, https://substackcdn.com/image/fetch/$s_!nLxN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ee957f-583a-4290-b766-338a6f0e836d_2326x1566.heic 1272w, https://substackcdn.com/image/fetch/$s_!nLxN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ee957f-583a-4290-b766-338a6f0e836d_2326x1566.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><p><strong>Step 4: Process the inputs</strong></p><p>Once these events were in place, the next step was automation. We wrote an Airflow DAG to poll the Google Calendar API every <strong>10 minutes</strong>. Any scheduler would have worked, but at Probo we leaned heavily on <strong>Airflow</strong> for orchestrating time-based tasks, so it fit right in.<br></p><div><hr></div><h3><strong>Airflow DAG: Calendar &#8594; Scaling Actions</strong></h3><p>The DAG worked in two layers:</p><ol><li><p><strong>Template Definition</strong></p><p>A generic scaling_template described how each resource <em>could</em> scale across different traffic levels. It covered everything from RDS instance types to Redis dataset size, broken down by profiles (default, medium, high, night).</p><p>Example (simplified for RDS):</p></li></ol><pre><code><code>{
    "RDS": {
        "&lt;ENVIRONMENT&gt;": [
            {
                "name": "&lt;Name of the RDS Cluster&gt;",
                "scaling_enabled": "&lt;true/false&gt;",
                "instance_ids": ["&lt;RDS Instance Names&gt;"],
                "traffic_config": {
                    "default": {
                        "instance_type": "&lt;db.instance.type&gt;",
                        "min_capacity": "&lt;number&gt;",
                        "max_capacity": "&lt;number&gt;"
                    },
                    "medium": {
                        "instance_type": "&lt;db.instance.type&gt;",
                        "min_capacity": "&lt;number&gt;",
                        "max_capacity": "&lt;number&gt;"
                    },
                    "high": {
                        "instance_type": "&lt;db.instance.type&gt;",
                        "min_capacity": "&lt;number&gt;",
                        "max_capacity": "&lt;number&gt;"
                    },
                    "night": {
                        "instance_type": "&lt;db.instance.type&gt;",
                        "min_capacity": "&lt;number&gt;",
                        "max_capacity": "&lt;number&gt;"
                    }
                }
            }
        ]
    },
    "calendar_id": "&lt;YOUR_CALENDAR_ID&gt;@group.calendar.google.com"
}</code></code></pre><ol start="2"><li><p><strong>Configuration in Git</strong></p><p>To keep things auditable, environment-specific configs were stored in Git. Every change (say, adding a new RDS cluster to scale) was reviewed like any other code change.</p><p>Example (simplified):</p></li></ol><pre><code><code>{
    "RDS": {
        "PROD": [
            {
                "name": "main-db",
                "scaling_enabled": true,
                "instance_ids": ["db-instance-1", "db-instance-2"],
                "traffic_config": {
                    "default": {
                        "instance_type": "db.t4g.medium",
                        "min_capacity": 1,
                        "max_capacity": 1
                    },
                    "medium": {
                        "instance_type": "db.r6g.large",
                        "min_capacity": 3,
                        "max_capacity": 3
                    },
                    "high": {
                        "instance_type": "db.r6g.xlarge",
                        "min_capacity": 10,
                        "max_capacity": 50
                    }
                }
            }
        ]
    },
    "calendar_id": "&lt;YOUR_CALENDAR_ID&gt;@group.calendar.google.com"
}</code></code></pre><p>This two-layer structure&#8212;<strong>templates for patterns, Git for actual config</strong>&#8212;kept the system both flexible and reviewable. Marketing could decide <em>when</em> to scale, engineers defined <em>how</em> scaling worked, and Airflow acted as the glue in between.</p><div><hr></div><h3><strong>Closing thoughts</strong></h3><p>What we built with Google Calendar + Airflow may look deceptively simple, but it was the <strong>foundation</strong> of Scaler. Instead of chasing fancy abstractions, we gave the business team a language they already understood&#8212;calendar events&#8212;and wired it directly into infrastructure logic.</p><h4><em>&#8220;This meant engineers weren&#8217;t firefighting during every cricket match, and marketers could directly influence capacity without opening a single AWS console.&#8221;</em></h4><p><br>That said, this is just the first layer. There&#8217;s a lot more beneath the surface, such as:</p><ul><li><p><strong>The processing logic:</strong> how calendar events were parsed, mapped to Git configs, and converted into scaling actions on the underlying resources (ASGs, RDS, Redis, and Kubernetes HPAs).</p></li><li><p><strong>Safety nets &amp; guardrails:</strong> how we ensured that a bad calendar entry or misconfigured template didn&#8217;t accidentally scale down production during a high-stakes event.</p></li><li><p><strong>Real-world outcomes:</strong> cost savings, reliability gains, and some hard lessons learned about prediction vs. 
reaction.</p></li></ul><p>If you&#8217;d like me to go deeper into these areas, let me know&#8212;I can cover them in the next post.</p>]]></content:encoded></item><item><title><![CDATA[Introducing the Cricket Scale Series]]></title><description><![CDATA[Lessons From Building Probo]]></description><link>https://www.adityachowdhry.me/p/introducing-the-cricket-scale-series</link><guid isPermaLink="false">https://www.adityachowdhry.me/p/introducing-the-cricket-scale-series</guid><dc:creator><![CDATA[Aditya Chowdhry]]></dc:creator><pubDate>Tue, 26 Aug 2025 10:02:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/eb1dfc66-38e3-4a8c-91f4-ceb14746d718_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In India, cricket isn&#8217;t just a sport &#8212; it&#8217;s a stress test for technology. Organisations connected to cricket, directly or indirectly, quickly learn that the excitement of a match translates into massive, unpredictable spikes in their systems.</p><p>During our journey at Probo, we experienced this first-hand. 
The start of an IPL match would bring sharp surges in traffic &#8212; far beyond what daily active user metrics or linear growth curves could prepare us for. A single ball could trigger millions of requests at once.</p><p>This is commonly referred to as <strong>cricket scale</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4vTN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc6a871-1557-47c4-803f-6f6b6d9807b5_1836x732.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4vTN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc6a871-1557-47c4-803f-6f6b6d9807b5_1836x732.png 424w, https://substackcdn.com/image/fetch/$s_!4vTN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc6a871-1557-47c4-803f-6f6b6d9807b5_1836x732.png 848w, 
https://substackcdn.com/image/fetch/$s_!4vTN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc6a871-1557-47c4-803f-6f6b6d9807b5_1836x732.png 1272w, https://substackcdn.com/image/fetch/$s_!4vTN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc6a871-1557-47c4-803f-6f6b6d9807b5_1836x732.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4vTN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc6a871-1557-47c4-803f-6f6b6d9807b5_1836x732.png" width="1456" height="580" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9dc6a871-1557-47c4-803f-6f6b6d9807b5_1836x732.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:580,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:222891,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.adityachowdhry.me/i/171924762?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc6a871-1557-47c4-803f-6f6b6d9807b5_1836x732.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4vTN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc6a871-1557-47c4-803f-6f6b6d9807b5_1836x732.png 424w, 
https://substackcdn.com/image/fetch/$s_!4vTN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc6a871-1557-47c4-803f-6f6b6d9807b5_1836x732.png 848w, https://substackcdn.com/image/fetch/$s_!4vTN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc6a871-1557-47c4-803f-6f6b6d9807b5_1836x732.png 1272w, https://substackcdn.com/image/fetch/$s_!4vTN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc6a871-1557-47c4-803f-6f6b6d9807b5_1836x732.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Traffic pattern for a cricket match</figcaption></figure></div><p></p><p>It was different from normal scale &#8212; spiky, sudden, and emotionally driven. Handling it required rethinking not only our infrastructure, but also our processes and how our teams operated.</p><p>This series is an attempt to document the lessons we learned along the way.</p><p></p><h3><strong>What to Expect</strong></h3><ul><li><p><strong>Part 1</strong> &#8594; <a href="https://www.adityachowdhry.me/p/cricket-scale-part-1-ec2-vs-kubernetes">EC2 vs Kubernetes: picking the right scaling foundation</a></p></li><li><p><strong>Part 2</strong> &#8594; <a href="https://www.adityachowdhry.me/p/cricket-scale-part-2-redis-a-silver">Redis: A Silver Bullet With a Cost</a></p></li><li><p><strong>Part 3</strong> &#8594; <a href="https://open.substack.com/pub/adityachowdhry/p/cricket-scale-part-3-the-not-so-technical?r=nq4mu&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">The Not-So-Technical Grind</a></p></li></ul><p></p><p>While real-money gaming (RMG) applications may no longer operate in India, the challenges of spiky traffic patterns remain. Concert ticketing, IPO subscriptions, elections, and live-streaming platforms all face their own versions of cricket scale.</p><p>Through this series, I&#8217;ll share what worked for us, what didn&#8217;t, and how we built systems (and teams) that could survive the pressure of cricket scale.</p>]]></content:encoded></item><item><title><![CDATA[Cricket Scale, Part 1 — EC2 vs Kubernetes]]></title><description><![CDATA[Picking the Right Scaling Foundation]]></description><link>https://www.adityachowdhry.me/p/cricket-scale-part-1-ec2-vs-kubernetes</link><guid isPermaLink="false">https://www.adityachowdhry.me/p/cricket-scale-part-1-ec2-vs-kubernetes</guid><dc:creator><![CDATA[Aditya Chowdhry]]></dc:creator><pubDate>Tue, 26 Aug 2025 09:58:35 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ee477ba6-4cf8-4a27-be00-18698d047127_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is an ongoing series on <a href="https://www.adityachowdhry.me/p/introducing-the-cricket-scale-series">Cricket Scale</a> &#8212; a kind of scale that&#8217;s different from normal. 
It&#8217;s sudden, spiky, and event-driven, and here I share how we at Probo built systems to handle it.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3XYF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8c64cd0-893b-4301-b758-9f9a8bf6cba0_1832x676.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3XYF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8c64cd0-893b-4301-b758-9f9a8bf6cba0_1832x676.png 424w, https://substackcdn.com/image/fetch/$s_!3XYF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8c64cd0-893b-4301-b758-9f9a8bf6cba0_1832x676.png 848w, https://substackcdn.com/image/fetch/$s_!3XYF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8c64cd0-893b-4301-b758-9f9a8bf6cba0_1832x676.png 1272w, https://substackcdn.com/image/fetch/$s_!3XYF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8c64cd0-893b-4301-b758-9f9a8bf6cba0_1832x676.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3XYF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8c64cd0-893b-4301-b758-9f9a8bf6cba0_1832x676.png" width="1456" height="537" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8c64cd0-893b-4301-b758-9f9a8bf6cba0_1832x676.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:537,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:231817,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.adityachowdhry.me/i/171966823?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8c64cd0-893b-4301-b758-9f9a8bf6cba0_1832x676.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3XYF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8c64cd0-893b-4301-b758-9f9a8bf6cba0_1832x676.png 424w, https://substackcdn.com/image/fetch/$s_!3XYF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8c64cd0-893b-4301-b758-9f9a8bf6cba0_1832x676.png 848w, https://substackcdn.com/image/fetch/$s_!3XYF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8c64cd0-893b-4301-b758-9f9a8bf6cba0_1832x676.png 1272w, https://substackcdn.com/image/fetch/$s_!3XYF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8c64cd0-893b-4301-b758-9f9a8bf6cba0_1832x676.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An IPL match between SRH vs RR</figcaption></figure></div><p>In the second year of Probo, even regular India cricket matches (not just IPL) pushed our systems to the edge. Traffic patterns during the game were sudden, steep, and unpredictable.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.adityachowdhry.me/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Thoughtful Engineering! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Our EC2-based setup, which relied on autoscaling groups, was reliable but slow to react. Bringing new instances online could take minutes, while traffic spikes arrived in seconds. We realized this approach might not hold up as Probo grew. We needed something quicker.</p><p>At the same time, we were tinkering with Kubernetes. It promised <strong>more configurable scaling</strong> &#8212; the ability to tune how workloads scaled up and down, to use horizontal and vertical pod autoscalers, and to react to more than just CPU or memory.</p><div><hr></div><h3><strong>EC2 &#8212; The Baseline</strong></h3><p>When we started, running Node.js applications on AWS EC2 instances was straightforward:</p><ul><li><p><strong>Simple model:</strong> launch an instance, deploy the app, attach it to a load balancer.</p></li><li><p><strong>Autoscaling groups:</strong> scale in and out based on CPU and memory.</p></li><li><p><strong>Predictable performance:</strong> Node.js processes on EC2 gave us stable latency under load.</p></li></ul><p>We used <strong>pm2</strong> as the process manager, running multiple processes per instance &#8212; usually <strong>n-1</strong>, where <em>n</em> is the number of cores &#8212; to maximize CPU utilization while leaving some headroom.</p><p>This setup worked fine for steady traffic and moderate growth. 
But at cricket scale, the coarse granularity of EC2 autoscaling meant we were always a step behind the spikes.</p><h3><strong>Kubernetes &#8212; The Next Step</strong></h3><p>Kubernetes looked like the natural answer:</p><ul><li><p><strong>Configurable scaling:</strong> horizontal and vertical pod autoscalers, custom metrics, and more control over how workloads scale up and down.</p></li><li><p><strong>Bin packing:</strong> schedule multiple pods with different resource needs onto the same EC2 node, making better use of CPU and memory.</p></li><li><p><strong>Faster churn:</strong> pods can come up quicker than booting whole VMs.</p></li></ul><p>On paper, it solved the problems we were facing. But our early results showed the opposite &#8212; applications that ran smoothly on EC2 performed worse when moved to Kubernetes.</p><p>The reasons became clear: networking overhead, noisy neighbors, CoreDNS bottlenecks, and the operational burden of tuning clusters under sudden bursts. I&#8217;ve written about some of these challenges in more detail here: <a href="https://engineering.probo.in/production-grade-pain-lessons-from-scaling-kubernetes-on-eks-03571838c7a3">Production-Grade Pain: Lessons From Scaling Kubernetes on EKS</a>.</p><p></p><h3><strong>The Trade-offs</strong></h3><p>In theory, Kubernetes gave us more knobs to tune. In practice, we had a small team and IPL was around the corner. 
There were already higher-priority tasks on the table, and investing significant time in tuning Kubernetes clusters wasn&#8217;t realistic.</p><p>So we took the pragmatic route:</p><ul><li><p>For <strong>high-scale events like IPL matches</strong>, we fell back to <strong>EC2 instances</strong>.</p></li><li><p>Instead of pouring resources into scaling Kubernetes, we invested time in building a <strong>switcher</strong> &#8212; a mechanism that shifted traffic to EC2 instances when we expected spikes.</p></li></ul><p>This simple switcher gave us the reliability of EC2 when we needed it most, while still allowing us to experiment with Kubernetes in parallel.</p><p>The trade-off was clear: <strong>predictability now vs. flexibility later.</strong> And at cricket scale, predictability was what kept us alive.</p><div><hr></div><h3><strong>Scaler &#8212; In-house Scaling Solution</strong></h3><p>That switcher eventually evolved into a <strong>full-fledged scaler solution</strong>.</p><p>It started simple: a <strong>predictive scaling strategy</strong>. 
The marketing team would share expected traffic numbers before big matches, we would look at daily active users (DAU), and from there we would plan the number of EC2 instances we&#8217;d need.</p><p>The scaler worked off <strong>calendar events</strong>, each tagged with a scale profile:</p><ul><li><p><strong>High-scale</strong> for IPL and international cricket.</p></li><li><p><strong>Medium-scale</strong> for smaller matches or anticipated spikes.</p></li><li><p><strong>Low-scale</strong> for regular days.</p></li><li><p>And to optimize cost, we also introduced a <strong>night-time-scale</strong>, dialing resources down when traffic naturally dropped.</p></li></ul><p>This hybrid model gave us predictability when it mattered most, and cost efficiency when it didn&#8217;t &#8212; while still leaving room to experiment with Kubernetes in the background.</p><p>&#128073; If you&#8217;d like me to go deeper into how this in-house scaler worked, let me know &#8212; I can cover it in a dedicated post.</p><h3><strong>Closing Thoughts</strong></h3><p>Cricket scale forced us to move faster than we planned. It showed us that scaling infrastructure isn&#8217;t about chasing the latest tool &#8212; it&#8217;s about making trade-offs that match your workload and stage of growth.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.adityachowdhry.me/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Thoughtful Engineering! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Thinking in Systems: A Builder’s View]]></title><description><![CDATA[Balancing speed and foresight]]></description><link>https://www.adityachowdhry.me/p/thinking-in-systems-a-builders-view</link><guid isPermaLink="false">https://www.adityachowdhry.me/p/thinking-in-systems-a-builders-view</guid><dc:creator><![CDATA[Aditya Chowdhry]]></dc:creator><pubDate>Sat, 16 Aug 2025 06:31:01 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5af2cd42-7b76-4bbe-b93e-7e9446ad8205_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my <a href="https://www.adityachowdhry.me/p/the-self-aware-team">previous post</a>, I wrote about building self-aware teams &#8212; teams that have context, vision, and the ability to see the bigger picture. They understand the trade-offs when taking shortcuts and make those calls deliberately, not by accident.</p><p>&#8220;Thinking in systems&#8221; is the same principle, applied to the systems we build.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.adityachowdhry.me/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Thoughtful Engineering! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><h3><strong>What System Thinking Means to Me</strong></h3><p>For me, <strong>system thinking</strong> means building with the whole picture in mind, not just the task in front of me.</p><p>I don&#8217;t just aim to &#8220;complete&#8221; something; I aim to create a <strong>system</strong> &#8212; one with a <strong>solid, reliable core</strong> that doesn&#8217;t break under pressure. Around that core, it should be <strong>extensible</strong> so it can grow, <strong>configurable</strong> so it can adapt, and resilient enough to handle <strong>edge cases</strong> without special hacks.</p><p>It&#8217;s about designing in a way where today&#8217;s solution doesn&#8217;t limit tomorrow&#8217;s possibilities. A well-thought-out system makes scaling, maintenance, and evolution natural &#8212; not painful.</p><p></p><h3><strong>An Example from My Work</strong></h3><p>In my early years at Probo, I was tasked with catching unfair usage patterns &#8212; for example, users running multiple cloned apps to place trades. Some of these behaviours qualified as fraud, and my job was to prevent them and block those users.</p><p>On paper, it was a simple task: detect the pattern &#8594; block the user. But Probo is building something unique, and the fraud patterns were still evolving. 
If I had solved only for the few cases we knew about, the system would have failed the moment new patterns appeared.</p><p>So my approach was to build a <strong>system</strong> &#8212; one that was <strong>extensible</strong>, <strong>configurable</strong>, and <strong>observable</strong>, without directly depending on me for every new case. I implemented a <strong>rule engine</strong> combined with our in-house tagging system.</p><p>The <strong>workflow</strong> was simple:</p><ol><li><p>A user journey event &#8212; like signup or login &#8212; was sent to the fraud rule engine.</p></li><li><p>The user, along with their metadata, was evaluated against a set of rules.</p></li><li><p>If a violation was found, the user was tagged under a specific fraud category.</p></li><li><p>Each fraud tag carried its own set of restrictions, which could be applied instantly.</p></li></ol><p>This meant that as fraud behaviours changed &#8212; and they did &#8212; we could respond quickly without rewriting the entire detection logic. The system became a foundation we could keep improving, instead of a one-time patch.</p><p></p><h3><strong>The Cost of Task-First Thinking</strong></h3><p>I&#8217;ve seen this countless times &#8212; a feature gets shipped, works fine for a while, and then has to be <strong>completely rewritten</strong> when new use cases emerge.</p><p>The reason is usually the same: we focused on <strong>getting the task done</strong>, not on how it fits into the bigger system. The design was fine for today, but brittle for tomorrow.</p><p>This doesn&#8217;t just cause rewrites &#8212; it also hurts <strong>scalability</strong>. A solution that seems lightweight at low usage can turn into a bottleneck as traffic grows.</p><p>In fact, when we first built the fraud rule engine at Probo, daily active users were relatively low. At that time, it would have been tempting to run it as a synchronous system in the user flow. 
But had we done that, today&#8217;s scale would have meant massive slowdowns in normal operations. By designing it as an <strong>event-driven, decoupled system</strong>, we avoided a scalability wall that would have crippled us later.</p><p>The task-first approach delivers speed in the moment &#8212; but it often leaves behind hidden costs that surface only when scale and complexity catch up.</p><h3><strong>System Thinking &#8800; Over-Engineering</strong></h3><p>Thinking in systems doesn&#8217;t mean building something that will last untouched for the next 10 years. In a fast-paced startup, <strong>velocity is critical</strong>. You can&#8217;t freeze progress in the name of perfect architecture.</p><p>What it does mean is <strong>judging wisely</strong> the level of configuration and flexibility a solution really needs. Sometimes spending <strong>2&#8211;3 extra days</strong> on a task today can save you <strong>weeks of painful rewrites</strong> a few months later.</p><p>That&#8217;s an underappreciated skill &#8212; knowing when to invest that extra effort, and when to deliberately choose the quick path with eyes wide open.</p><h4><strong>A Time I Over-Engineered</strong></h4><p>I learned this lesson the hard way.</p><p>Once, we were experimenting with ways to create delight for users when they won a trade. For example, if someone won more than a certain amount, we wanted to show a celebratory meme. The idea was simple: a small surprise that might encourage virality.</p><p>If I had built this in the most direct way, it could have been done in 2&#8211;3 days. Instead, I spent almost two weeks building a <strong>templatization engine</strong> that could handle multiple scenarios &#8212; based on winning amount, number of trades, leaderboard rank, and more.</p><p>The reality? The experiment didn&#8217;t pan out. We closed it without ever needing those additional templates. 
The two weeks of careful system-building had little practical value.</p><p>That experience reminded me: <strong>system thinking is valuable, but over-engineering is still a risk.</strong> The real skill is not just in designing flexible systems, but in knowing <em>when</em> to keep it simple and wait for validation.</p><p></p><h3><strong>Shortcuts Aren&#8217;t Evil &#8212; Blind Shortcuts Are</strong></h3><p>Just like self-aware teams can take a shortcut <em>consciously</em> when the trade-off is worth it, systems thinking accepts that sometimes you <strong>must</strong> optimise for the immediate need.</p><p>The difference is:</p><ul><li><p><strong>Without systems thinking:</strong> &#8220;This works for now &#8212; ship it.&#8221;</p></li><li><p><strong>With systems thinking:</strong> &#8220;This works for now, but here&#8217;s the cost, and here&#8217;s how we&#8217;ll address it later.&#8221;</p></li></ul><p>That awareness is what keeps future you from cursing past you.</p><p></p><h3><strong>Why It Matters</strong></h3><p>Software issues often don&#8217;t come from a single bad decision. More commonly, they build up over time as a series of small, local choices that don&#8217;t work well together.</p><p>Thinking in systems helps reduce this risk. It encourages designing with future growth and change in mind, so today&#8217;s solution doesn&#8217;t become tomorrow&#8217;s bottleneck.</p><p>By approaching problems this way, you&#8217;re not just addressing the immediate need &#8212; you&#8217;re making it easier and more efficient to handle what comes next.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.adityachowdhry.me/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Thoughtful Engineering! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Self-Aware Team]]></title><description><![CDATA[Building a Culture Where Everyone Knows What&#8217;s Next]]></description><link>https://www.adityachowdhry.me/p/the-self-aware-team</link><guid isPermaLink="false">https://www.adityachowdhry.me/p/the-self-aware-team</guid><dc:creator><![CDATA[Aditya Chowdhry]]></dc:creator><pubDate>Thu, 14 Aug 2025 10:55:03 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d59ae0cc-2ac2-414e-a330-1d4ba03b83ec_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p>You join a standup and everyone is clear on what&#8217;s next.</p><p>They know what&#8217;s blocked, what&#8217;s in progress, and why it matters.</p><p>Work moves steadily without needing constant direction.</p></blockquote><p></p><p>This is what a self-aware team looks like: a group that doesn&#8217;t just work on assigned tasks, but understands the context, priorities, and impact behind them. They can adapt, make decisions, and keep moving even when a manager isn&#8217;t present.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.adityachowdhry.me/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Thoughtful Engineering! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><h2><strong>Why Self-Awareness Matters in Teams</strong></h2><p>In most engineering teams, people are good at completing the work in front of them. But if the &#8220;why&#8221; is missing, two things often happen:</p><ul><li><p>Effort goes into optimising for the wrong outcomes.</p></li><li><p>Progress slows when dependencies shift or leaders aren&#8217;t available.</p></li></ul><p>Without the bigger picture, developers can make short-term decisions that create <strong>technical debt</strong> or move the system further away from the ideal state. These choices often seem harmless in isolation, but over time they compound &#8212; leading to a situation that&#8217;s expensive, risky, or slow to reverse.</p><p>Self-awareness means building a <strong>shared understanding</strong> of:</p><ul><li><p>The problem we&#8217;re solving.</p></li><li><p>The outcome we&#8217;re aiming for.</p></li><li><p>The trade-offs we&#8217;re willing to make along the way &#8212; and their impact on the long-term vision.</p></li></ul><p>When developers hold this understanding, they:</p><ul><li><p>Know what to do next to move closer to the ideal state.</p></li><li><p>Are self-sufficient in identifying and creating their own work.</p></li><li><p>Can set their own pace without waiting for constant approvals.</p></li><li><p>Feel a stronger sense of autonomy and ownership over the outcome.</p></li></ul><p>From a manager&#8217;s perspective, this translates to:</p><ul><li><p>Fewer delays and bottlenecks.</p></li><li><p>Less day-to-day handholding.</p></li><li><p>Smoother progress even when priorities 
shift.</p></li></ul><p>A self-aware team is not just more effective &#8212; it&#8217;s easier to lead.</p><p></p><h2><strong>How to Build Self-Awareness in Teams</strong></h2><p>Self-awareness doesn&#8217;t happen automatically &#8212; it&#8217;s cultivated. And one of the most important ingredients is <strong>ownership</strong>.</p><p>Ownership comes when people are <strong>aligned to the outcome</strong>, not just assigned tasks. When team members understand the &#8220;why,&#8221; the ideal state, and the trade-offs involved, they naturally start making decisions that move the work forward. They don&#8217;t wait to be told what to do next &#8212; they create their own next steps within that shared context.</p><p>The following practices help build that alignment and ownership from the start:</p><p></p><h3><strong>1. Start With Motivation, Not Tasks</strong></h3><p>Every project should begin with a short, clear statement of <em>why</em> it matters:</p><ul><li><p><strong>The problem:</strong> What issue or gap are we addressing?</p></li><li><p><strong>The impact:</strong> What improves if we succeed?</p></li><li><p><strong>The cost of inaction:</strong> Why does this work matter now? Answering this helps with prioritisation.</p></li></ul><p>This context helps the team see the work as part of a bigger picture.</p><p></p><h3><strong>2. Define the Ideal State</strong></h3><p>In real-world engineering, we often take shortcuts to ship faster. That&#8217;s fine &#8212; as long as the team knows what the end goal looks like.</p><p>The ideal state answers:</p><ul><li><p>&#8220;If there were no constraints, what would the best version of this be?&#8221;</p></li><li><p>&#8220;In six months, what result would we be proud to share?&#8221;</p></li></ul><p>This keeps short-term compromises from becoming long-term defaults.</p><p></p><h3><strong>3. 
Build Acceptance Criteria Together</strong></h3><p>Self-awareness grows when the team <strong>co-creates the definition of done</strong>.</p><p>Involve the team directly:</p><ul><li><p>&#8220;What must be true for us to call this complete?&#8221;</p></li><li><p>&#8220;What would make us confident this solves the problem?&#8221;</p></li></ul><p>These discussions often uncover assumptions or blind spots early. When the team helps define success, they are more likely to maintain quality and consistency without reminders.</p><p></p><h3><strong>4. Keep the Document Alive</strong></h3><p>The initial project document &#8212; motivation, ideal state, and acceptance criteria &#8212; works best when it stays relevant throughout the project:</p><ul><li><p>Update it when priorities or scope change.</p></li><li><p>Refer to it in standups and retros.</p></li><li><p>Mark shortcuts explicitly so they can be revisited later.</p></li></ul><p>In practice, this can be difficult to maintain alongside day-to-day work. Treat it as a <em>good to have</em> &#8212; even partial updates can help keep the team aligned and avoid losing sight of the bigger picture.</p><p></p><h3><strong>5. Connect Daily Work Back to the Big Picture</strong></h3><p>A self-aware team keeps the &#8220;why&#8221; in mind during execution:</p><ul><li><p>Every task links back to the project motivation.</p></li><li><p>Code reviews ask: &#8220;Does this move us closer to the ideal state?&#8221;</p></li><li><p>Shortcuts are taken consciously, documented, and followed up.</p></li></ul><h2><strong>Challenges and Navigating Resistance</strong></h2><p>Not everyone in a team starts with the same level of ownership or clarity. Some naturally take initiative, while others wait for direction. 
Building self-awareness across the board means recognising these differences and addressing the barriers that hold people back.</p><p>Common barriers include:</p><ul><li><p><strong>Ego</strong> &#8212; &#8220;I already know what&#8217;s best.&#8221;</p></li><li><p><strong>Fear</strong> &#8212; concern that ownership means more responsibility without adequate support.</p></li><li><p><strong>Low clarity</strong> &#8212; not knowing how to apply self-awareness in day-to-day work.</p></li></ul><p>A big part of overcoming these challenges is <strong>the manager&#8217;s role</strong>. Managers need to give confidence to team members by providing freedom and autonomy, allowing them to fail safely, and taking time to understand their perspectives. This builds both competence and trust over time.</p><p>Practical ways to reduce resistance:</p><ul><li><p><strong>1-on-1 conversations</strong> to understand motivation and hesitation.</p></li><li><p><strong>Appreciation and recognition</strong> when initiative is shown.</p></li><li><p><strong>A free hand</strong> where possible, instead of micromanaging.</p></li><li><p><strong>Probing questions</strong> to guide thinking, rather than direct answers.</p></li><li><p><strong>Modelling the behaviour</strong> &#8212; openly sharing context, acknowledging trade-offs, and showing how to adapt when things change.</p></li></ul><p>When managers create an environment where autonomy is safe and valued, resistance fades, confidence grows, and team members begin anticipating the next move instead of waiting for instructions.</p><h2><strong>The Payoff</strong></h2><p>When teams share the same context and are aligned to the outcome, they can decide their own next steps, maintain momentum, and adapt without friction.</p><p>For developers, that&#8217;s a sense of autonomy and ownership.</p><p>For managers, it&#8217;s confidence that progress will continue without constant intervention.</p><p>A self-aware team doesn&#8217;t just complete tasks 
&#8212; it consistently moves the work closer to the ideal state, even when the path changes.</p><p></p><blockquote><p>Do you think your team is self-aware?</p><p>I&#8217;d love to hear how you&#8217;ve built (or struggled to build) it &#8212; drop a comment or DM me your thoughts.</p></blockquote>]]></content:encoded></item><item><title><![CDATA[Platform Teams: Avoid Becoming a Catch-All]]></title><description><![CDATA[Real problems, hard lessons, and how to stay focused]]></description><link>https://www.adityachowdhry.me/p/platform-teams-avoid-becoming-a-catch</link><guid isPermaLink="false">https://www.adityachowdhry.me/p/platform-teams-avoid-becoming-a-catch</guid><dc:creator><![CDATA[Aditya Chowdhry]]></dc:creator><pubDate>Mon, 28 Jul 2025 13:29:22 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/665b788a-e935-4014-8e1f-fd0500ec6af1_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my <a href="https://www.adityachowdhry.me/p/platform-teams-and-the-developer">previous post on Developer Experience</a>, I wrote about the value platform teams bring to engineering orgs&#8212;and why DevX is core to making that 
work.</p><p>Lately, platform teams have gained a lot of attention, often compared to DevOps or SRE teams. But having worked in platform teams across multiple orgs, I&#8217;ve seen that while the <em>promise</em> of a platform team is great, the <em>reality</em> often gets messy.</p><p>Here are a few challenges I&#8217;ve faced&#8212;and lessons learned along the way.</p><div><hr></div><h2><strong>1. Platform Work Has a Long Feedback Loop</strong></h2><p>Platform projects are usually long-term. Add slow adoption (especially in large orgs), and it can take weeks or even months to know if a release made an impact.</p><p>This is very different from product teams, which get feedback quickly from users, metrics, or A/B tests. For platform work, success is often subtle and delayed&#8212;and harder to measure.</p><p>Over time, that delay can drain motivation.</p><h3><strong>What helped:</strong></h3><ul><li><p><strong>Celebrate small wins.</strong> Even a working POC, adoption by one team, or shaving minutes off CI time is a win.</p></li><li><p><strong>Define platform-specific metrics.</strong></p><ul><li><p>Improving observability? 
Track MTTD (mean time to detect) or MTTR (mean time to repair).</p></li><li><p>Building CI/CD tooling? Track deploy frequency, failure rate, or downtime hours.</p></li></ul></li><li><p><strong>Demo internally.</strong> Show your work. Build trust.</p></li></ul><div><hr></div><h2><strong>2. Don&#8217;t Become the Bottleneck</strong></h2><p>In fast-moving orgs, platform teams often become a blocker&#8212;usually unintentionally.</p><p>Product teams depend on the platform for infra access, deploy automation, observability setup, or network configs. But when automation lags or processes are unclear, product teams either wait&#8212;or find hacks.</p><p>This puts platform teams in a tough spot:</p><p>Ship fast and fix it later? Or slow things down to &#8220;do it right&#8221;?</p><h3><strong>What helped:</strong></h3><ul><li><p><strong>Convention over configuration.</strong> Standardize defaults, configs, naming patterns&#8212;anywhere a dev has to make a decision, help them make the right one faster.</p></li><li><p><strong>Invest in documentation-as-a-product.</strong> If someone&#8217;s stuck, docs should be the unblocker&#8212;not Slack.</p></li><li><p><strong>Pre-empt needs.</strong> Stay one step ahead by tracking repeat asks and turning them into products.</p></li></ul><div><hr></div><h2><strong>3. Avoid Becoming the Dumping Ground</strong></h2><p>This one&#8217;s common: anything that doesn&#8217;t fit cleanly into a product team&#8217;s scope ends up on a platform ticket. From one-off S3 buckets to VPN questions to &#8220;hey can you help with this test suite,&#8221; the list grows quickly.</p><p>Most of these aren&#8217;t complex. 
The issue isn&#8217;t difficulty&#8212;it&#8217;s <strong>unclear ownership</strong>.</p><p>Platform teams then spend increasing time on support work, often needing separate sprints or even dedicated sub-teams to handle tickets.</p><h3><strong>What helped:</strong></h3><ul><li><p><strong>Clarify ownership boundaries.</strong> If a task doesn&#8217;t need deep platform knowledge, it shouldn&#8217;t come to the platform team.</p></li><li><p><strong>Create playbooks and self-serve tools.</strong> Most support tickets can be avoided with good internal guides and small CLI tools.</p></li><li><p><strong>Enable, don&#8217;t gate.</strong> Help product teams help themselves.</p></li></ul><div><hr></div><h2><strong>Final Thoughts</strong></h2><p>Platform teams can be incredibly impactful&#8212;but only when they stay focused.</p><p>The goal isn&#8217;t to be <em>everything to everyone.</em> It&#8217;s to build leverage, not load.</p><p>That means saying no, simplifying where it matters, and making space to celebrate the progress you&#8217;re making behind the scenes.</p><div><hr></div><h2><strong>&#128640; Let&#8217;s Chat</strong></h2><p>What&#8217;s your experience been like working with or inside a platform team?</p><p>Have you run into any of these challenges&#8212;or solved them differently?</p><p>Drop a comment or reach out&#8212;I&#8217;d love to learn from your story.</p><p>You can also follow <a href="https://www.adityachowdhry.me/">Thoughtful Engineering</a> for more lessons from real systems and real teams.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.adityachowdhry.me/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Thoughtful Engineering! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Platform Teams and the Developer Experience Gap]]></title><description><![CDATA[Why Internal Platforms Succeed or Fail Based on DevX]]></description><link>https://www.adityachowdhry.me/p/platform-teams-and-the-developer</link><guid isPermaLink="false">https://www.adityachowdhry.me/p/platform-teams-and-the-developer</guid><dc:creator><![CDATA[Aditya Chowdhry]]></dc:creator><pubDate>Mon, 28 Jul 2025 13:19:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c7b0203f-87d1-45c4-8e16-05155fcab567_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><p>Before diving into developer experience (DevX), it&#8217;s important to understand the role of a platform team.</p><div><hr></div><h2><strong>What Is a Platform Team?</strong></h2><p>Gergely Orosz described it well in his <a href="https://blog.pragmaticengineer.com/platform-teams/">blog </a><em><a href="https://blog.pragmaticengineer.com/platform-teams/">The Pragmatic Engineer</a></em>: platform teams build the internal building blocks that product teams use to ship customer-facing features. 
A good platform helps teams move faster, more safely, and with less friction.</p><p>Platform teams can work at many layers&#8212;providing infrastructure, central tooling, service creation flows, CI/CD, config management, observability, experimentation, and more.</p><div><hr></div><h2><strong>Developer Experience &#8800; Just Developer Tools</strong></h2><p>When your users are developers, it&#8217;s easy to assume they&#8217;ll figure things out. So we ship command-line tools with verbose flags, dashboards that need digging, or runbooks buried in wikis&#8212;because &#8220;they&#8217;ll manage.&#8221;</p><p>But this mindset leads to:</p><ul><li><p>Over-complicated setup processes</p></li><li><p>Layers of configuration</p></li><li><p>Weak documentation and poor onboarding</p></li><li><p>Confusion around what&#8217;s owned, automated, or expected</p></li></ul><p>This is very different from how we treat <em>non-developer</em> end-users, where we aim for delight and simplicity.</p><p>Just because our users are developers doesn&#8217;t mean they deserve less.</p><div><hr></div><h2><strong>Why DevX Matters More Than Ever</strong></h2><p>In 2025, <strong>developer velocity is a product moat</strong>. The easier it is for engineers to create, test, and ship code, the faster a company moves.</p><p>Good DevX:</p><ul><li><p>Reduces cognitive load</p></li><li><p>Speeds up onboarding</p></li><li><p>Prevents avoidable mistakes</p></li><li><p>Encourages consistency without enforcing rigidity</p></li></ul><p>For platform teams, DevX isn&#8217;t a nice-to-have. 
It&#8217;s the <strong>core product</strong>.</p><div><hr></div><h2><strong>A Simple Example</strong></h2><p>Let&#8217;s walk through a simplified, ideal developer experience.</p><p><strong>Context</strong>: Nilesh from the Growth team needs to launch a rewards microservice.</p><h3><strong>What the flow looks like in a DevX-focused org:</strong></h3><ol><li><p>Nilesh logs into the internal developer dashboard.</p></li><li><p>He clicks &#8220;Create new app&#8221;, selects a language (say, Node.js), and chooses dependencies like Postgres and Redis.</p></li><li><p>He&#8217;s given a GitHub repo with:</p><ul><li><p>Boilerplate folder structure</p></li><li><p>Linting and testing setup</p></li><li><p>Docker Compose with service dependencies</p></li><li><p>CI/CD pipeline configured with build, test, and deploy stages</p></li></ul></li><li><p>He writes code. CI/CD deploys it to staging, then prod, without needing help from infra or DevOps.</p></li></ol><p><strong>What Nilesh skipped:</strong></p><ul><li><p>Filing infra requests</p></li><li><p>Setting up GitHub permissions</p></li><li><p>Picking or aligning framework versions</p></li><li><p>Writing config files for lint, test, and deploy from scratch</p></li><li><p>Coordinating with DevOps for deploy pipelines</p></li></ul><p>This isn&#8217;t about 100% automation&#8212;it&#8217;s about <strong>removing unnecessary steps</strong> so engineers can focus on solving business problems.</p><div><hr></div><h2><strong>What DevX Really Means</strong></h2><p>Developer experience isn&#8217;t about flashy portals or adding more tools. 
It&#8217;s about thoughtful defaults, reducing friction, and enabling autonomy without chaos.</p><p>At its core, good DevX means:</p><ul><li><p><strong>Better documentation</strong> and discoverability</p></li><li><p><strong>Simple integration and testing paths</strong></p></li><li><p><strong>Clear ownership models</strong></p></li><li><p><strong>Fast feedback loops</strong></p></li><li><p><strong>Decentralised decisions with guardrails</strong></p></li></ul><div><hr></div><h2><strong>Closing Thoughts</strong></h2><p>A high-growth engineering org is like a well-oiled machine. The platform team is the layer that keeps the rest running smoothly. When DevX is neglected, platform efforts become bottlenecks. But when DevX is intentional, it becomes a multiplier.</p><p>If you&#8217;re building or evolving a platform team, start with developer experience.</p><p>It&#8217;s not a feature&#8212;it&#8217;s the foundation.</p><div><hr></div><p>If this resonated with you, I&#8217;d love to hear how your team approaches DevX.</p><p>You can also subscribe to <a href="https://www.adityachowdhry.me/">Thoughtful Engineering</a> for more posts like this.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.adityachowdhry.me/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Thoughtful Engineering! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Democratizing Ownership: The Secret to Scalable Teams]]></title><description><![CDATA[Building a culture where ownership is shared, not assigned.]]></description><link>https://www.adityachowdhry.me/p/democratizing-ownership-the-secret</link><guid isPermaLink="false">https://www.adityachowdhry.me/p/democratizing-ownership-the-secret</guid><dc:creator><![CDATA[Aditya Chowdhry]]></dc:creator><pubDate>Mon, 28 Jul 2025 13:01:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/525ab221-5bd6-49ae-9238-6529c9cca76b_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the early days at our org, the same few people were always the ones looking into alerts or performance issues. They understood the system well, knew where to look, and cared a lot about keeping things reliable.</p><p>But after some time, we noticed a few problems:</p><ul><li><p>These people were getting tired and burned out. They were always on call and always needed.</p></li><li><p>The rest of the team was slower, not because they didn&#8217;t care, but because they waited for these few to take the lead.</p></li><li><p>Everyone said reliability was important, but only a few were actually working on it. It looked like a shared responsibility, but in reality, it wasn&#8217;t.</p></li></ul><p>No one did this on purpose&#8212;it just happened over time. People assumed, &#8220;They&#8217;ll handle it.&#8221;</p><p>But what if one of them was on leave? 
Or decided to leave the company?</p><p>We realized that for our team to grow in a healthy way, ownership couldn&#8217;t stay with just a few people. It needed to be shared, clear, and part of how we worked every day.</p><p>Looking back, the problem wasn&#8217;t just that a few people were handling all the alerts. The real question was:</p><p><strong>Why did everyone else stay away?</strong></p><p>It wasn&#8217;t because people didn&#8217;t care. It was more about things like:</p><ul><li><p>&#8220;I don&#8217;t know if this concerns me.&#8221;</p></li><li><p>&#8220;I don&#8217;t know enough to help.&#8221;</p></li><li><p>&#8220;It&#8217;s not my area&#8212;someone else owns it.&#8221;</p></li><li><p>&#8220;This is how it&#8217;s always worked here.&#8221;</p></li></ul><p>No one was told to stay out, but the way things were set up made them feel that way. Even if they wanted to help, they didn&#8217;t feel confident or responsible.</p><p></p><h3><strong>Problem 1: &#8220;I don&#8217;t know if this concerns me&#8221;</strong></h3><p>This was the most common reason people stayed away from alerts. When something broke or an alert fired, many didn&#8217;t step in&#8212;not because they didn&#8217;t care, but because they weren&#8217;t sure if it was their responsibility.</p><p>A big part of the issue was how our alerting was set up:</p><ul><li><p>Alerts were too generic and not tied to specific services or teams.</p></li><li><p>Many alerts were triggered on shared infrastructure, so it was unclear what part of the system was actually affected.</p></li><li><p>Most alerts landed in a shared Slack channel, without context or clear ownership.</p></li></ul><p>So even when someone saw an alert, their first thought was usually:</p><blockquote><p>&#8220;Is this mine? Should I look into it? Or is someone else already on it?&#8221;</p></blockquote><p>That hesitation added delay. 
And more often than not, the same people who always responded&#8230; responded again.</p><p>We learned that <strong>without clear signals, most people will assume it&#8217;s someone else&#8217;s job</strong>. Especially in high-stakes situations, people avoid stepping in if they&#8217;re unsure.</p><p>This was the first thing we had to fix&#8212;<strong>make ownership visible and obvious</strong>.</p><p></p><h3><strong>Problem 2: &#8220;I don&#8217;t know enough to help&#8221;</strong></h3><p>Even when people knew the alert was related to their area, many still didn&#8217;t take action. The reason? They didn&#8217;t feel confident enough.</p><p>We heard things like:</p><ul><li><p>&#8220;I don&#8217;t know where to start.&#8221;</p></li><li><p>&#8220;What if I try something and make it worse?&#8221;</p></li><li><p>&#8220;Someone else will probably fix it faster than me.&#8221;</p></li></ul><p>In many cases, the required knowledge lived only in a few people&#8217;s heads. There were no clear runbooks, no shared dashboards, and no easy way to understand what was going on unless you had been around long enough or had seen it before.</p><p>This made reliability feel like a &#8220;<strong>specialist&#8217;s job</strong>,&#8221; not something everyone could contribute to.</p><p>We realized that <strong>without easy access to context or tooling, even willing team members won&#8217;t step up</strong>. 
Not because they don&#8217;t want to&#8212;but because they feel stuck.</p><p>To fix this, we had to invest in documentation, observability, and simple debugging tools&#8212;so anyone on the team could at least start investigating, even if they weren&#8217;t the expert.</p><p></p><h3><strong>Problem 3: &#8220;It&#8217;s not my area&#8212;someone else owns it.&#8221;</strong></h3><p>Even when someone understood the alert and knew how to help, they often stayed away because they believed it wasn&#8217;t their job.</p><p>We&#8217;d hear things like:</p><ul><li><p>&#8220;This isn&#8217;t my service.&#8221;</p></li><li><p>&#8220;That team usually handles this.&#8221;</p></li><li><p>&#8220;I don&#8217;t want to step on anyone&#8217;s toes.&#8221;</p></li></ul><p>This mindset became more common as we started splitting our systems and teams. <strong>Silos started to form</strong>&#8212;each team had their own services, their own dashboards, their own processes. While separation helped with focus, it also created invisible walls. People became hesitant to step outside their boundaries, even when they had something useful to add.</p><p>The end result? A narrow view of ownership. Teams only focused on their piece, even if issues were affecting others too.</p><p>We realised that <strong>clear ownership is important, but collaboration across teams is just as critical.</strong> Especially in complex systems, problems rarely sit neatly within one team&#8217;s boundaries.</p><p>A simple step that helped break silos was starting a rotating on-call program, which gave more people system-wide context and shared responsibility.</p><p></p><h3><strong>Problem 4: &#8220;This is how it&#8217;s always worked&#8221;</strong></h3><p>Sometimes the biggest blocker is not technical&#8212;it&#8217;s habit.</p><p>Many people stayed away from alerts or reliability work simply because <strong>that&#8217;s how things had always been</strong>. 
A few people had always handled it, so the rest of the team assumed that was how it was supposed to be.</p><p>There was no discussion, no decision&#8212;it just became the norm over time.</p><p>This mindset is hard to spot, and even harder to change. It leads to passive behaviour, where people wait instead of acting. Even when ownership shifts or teams grow, these old patterns often stay in place.</p><p>We realized that breaking this pattern doesn&#8217;t always start with a policy. It starts with <strong>conversations</strong>.</p><ul><li><p>In 1-1s, we talked openly about reliability work and whether people felt confident owning it.</p></li><li><p>In career conversations, we highlighted how taking initiative on such problems is a sign of growth.</p></li><li><p>During RCAs and postmortems, we made space for people to ask questions and understand areas outside their team.</p></li></ul><p>Over time, these regular conversations helped shift mindsets. People started seeing ownership not as extra work, but as part of being a strong engineer.</p><h3><strong>Closing Thoughts</strong></h3><p>We didn&#8217;t fix ownership gaps overnight. It took a mix of changes&#8212;some technical, like routing alerts better, and some cultural, like encouraging open conversations during 1-1s and RCAs.</p><p>But the biggest shift came when we stopped relying on a few &#8220;go-to&#8221; people and made reliability a shared goal.</p><p>We&#8217;re still evolving, but one thing is clear: ownership doesn&#8217;t have to come from authority. It grows when people feel safe, supported, and informed enough to step up.</p><p>And that&#8217;s what we&#8217;ve been working toward&#8212;<strong>a culture where ownership is not assigned, it&#8217;s shared.</strong></p><div><hr></div><p>Seen similar patterns in your team? 
I&#8217;d love to hear how you&#8217;ve approached this.</p><p></p><blockquote><p>If this resonated with you, subscribe to get future posts&#8212;straight from the production floor.</p></blockquote><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.adityachowdhry.me/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Thoughtful Engineering! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[How we scaled our Monolith to execute 250Mn+ trades in a day?]]></title><description><![CDATA[This post was originally published on engineering.probo.in]]></description><link>https://www.adityachowdhry.me/p/how-we-scaled-our-monolith-to-execute</link><guid isPermaLink="false">https://www.adityachowdhry.me/p/how-we-scaled-our-monolith-to-execute</guid><dc:creator><![CDATA[Aditya Chowdhry]]></dc:creator><pubDate>Sun, 22 Jun 2025 13:06:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Jo7R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b62048-f266-4411-8b02-44120d7a0a37_1400x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Jo7R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b62048-f266-4411-8b02-44120d7a0a37_1400x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jo7R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b62048-f266-4411-8b02-44120d7a0a37_1400x800.png 424w, https://substackcdn.com/image/fetch/$s_!Jo7R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b62048-f266-4411-8b02-44120d7a0a37_1400x800.png 848w, https://substackcdn.com/image/fetch/$s_!Jo7R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b62048-f266-4411-8b02-44120d7a0a37_1400x800.png 1272w, https://substackcdn.com/image/fetch/$s_!Jo7R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b62048-f266-4411-8b02-44120d7a0a37_1400x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jo7R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b62048-f266-4411-8b02-44120d7a0a37_1400x800.png" width="1400" height="800" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93b62048-f266-4411-8b02-44120d7a0a37_1400x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Jo7R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b62048-f266-4411-8b02-44120d7a0a37_1400x800.png 424w, https://substackcdn.com/image/fetch/$s_!Jo7R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b62048-f266-4411-8b02-44120d7a0a37_1400x800.png 848w, https://substackcdn.com/image/fetch/$s_!Jo7R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b62048-f266-4411-8b02-44120d7a0a37_1400x800.png 1272w, https://substackcdn.com/image/fetch/$s_!Jo7R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b62048-f266-4411-8b02-44120d7a0a37_1400x800.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>You might have read dozens of articles on monoliths vs. microservices. Most of them suggest keeping your distance from microservices until you have solid reasons to adopt them.</p><p>But microservices are tempting, and those solid reasons may show up early in your development cycle or never at all. There are numerous articles on how to split a monolith into microservices, but very few talk about how to do your monolith right. This is an attempt at the latter.</p>
<h2><strong>Context</strong></h2><p>At Probo, we are a small team of about 30 engineers. One basic principle we try to follow is to keep things simple. As a result, we have run a monolithic backend since day 1, and we are well into our 4th year. In that time, we have gone from <strong>0 to 250Mn+</strong> trades in a day.</p><p>Here are the things that helped us on this journey.</p><h2><strong>Code structure</strong></h2><ul><li><p>Our code structure is inspired by Domain-Driven Design. From the start, we recognised three major bounded contexts and developed the code around them.</p></li><li><p>They were as simple as the <strong>User</strong> (profile, referral, etc.), <strong>Trading</strong> (order management, trade management, etc.), and <strong>Payments</strong> (user balance, transaction history, etc.) ecosystems.</p></li></ul><blockquote><p><em>We tried to keep interaction between these contexts loosely coupled. 
Most of the database joins were in their own bounded contexts.</em></p></blockquote><ul><li><p>We had a very basic layered code architecture &#8212; Presentation &#8594; Domain logic &#8594; Data management layer.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MLTf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb481fe8-002e-443a-82a1-667e258f98fc_1381x704.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MLTf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb481fe8-002e-443a-82a1-667e258f98fc_1381x704.png 424w, https://substackcdn.com/image/fetch/$s_!MLTf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb481fe8-002e-443a-82a1-667e258f98fc_1381x704.png 848w, https://substackcdn.com/image/fetch/$s_!MLTf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb481fe8-002e-443a-82a1-667e258f98fc_1381x704.png 1272w, https://substackcdn.com/image/fetch/$s_!MLTf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb481fe8-002e-443a-82a1-667e258f98fc_1381x704.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MLTf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb481fe8-002e-443a-82a1-667e258f98fc_1381x704.png" width="1381" height="704" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bb481fe8-002e-443a-82a1-667e258f98fc_1381x704.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:704,&quot;width&quot;:1381,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!MLTf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb481fe8-002e-443a-82a1-667e258f98fc_1381x704.png 424w, https://substackcdn.com/image/fetch/$s_!MLTf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb481fe8-002e-443a-82a1-667e258f98fc_1381x704.png 848w, https://substackcdn.com/image/fetch/$s_!MLTf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb481fe8-002e-443a-82a1-667e258f98fc_1381x704.png 1272w, https://substackcdn.com/image/fetch/$s_!MLTf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb481fe8-002e-443a-82a1-667e258f98fc_1381x704.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Bounded Contexts</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vDzg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c47e552-7aae-48fb-b9d6-84f78c56f1cc_950x547.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vDzg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c47e552-7aae-48fb-b9d6-84f78c56f1cc_950x547.png 424w, https://substackcdn.com/image/fetch/$s_!vDzg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c47e552-7aae-48fb-b9d6-84f78c56f1cc_950x547.png 848w, 
https://substackcdn.com/image/fetch/$s_!vDzg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c47e552-7aae-48fb-b9d6-84f78c56f1cc_950x547.png 1272w, https://substackcdn.com/image/fetch/$s_!vDzg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c47e552-7aae-48fb-b9d6-84f78c56f1cc_950x547.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vDzg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c47e552-7aae-48fb-b9d6-84f78c56f1cc_950x547.png" width="950" height="547" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2c47e552-7aae-48fb-b9d6-84f78c56f1cc_950x547.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:547,&quot;width&quot;:950,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!vDzg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c47e552-7aae-48fb-b9d6-84f78c56f1cc_950x547.png 424w, https://substackcdn.com/image/fetch/$s_!vDzg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c47e552-7aae-48fb-b9d6-84f78c56f1cc_950x547.png 848w, 
https://substackcdn.com/image/fetch/$s_!vDzg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c47e552-7aae-48fb-b9d6-84f78c56f1cc_950x547.png 1272w, https://substackcdn.com/image/fetch/$s_!vDzg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c47e552-7aae-48fb-b9d6-84f78c56f1cc_950x547.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Layered Architecture</figcaption></figure></div><ul><li><p>Concept of services in a monolith, with a focus on preventing 
domain leakage.</p></li><li><p>When required, each service communicated with other services through service&lt;&gt;service methods rather than through the data layer.</p></li></ul><h2><strong>Splitting into Server &amp; Worker modes</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5SIx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cf52c1-1b8b-4833-a04c-f1eea82cbd91_1261x721.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5SIx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cf52c1-1b8b-4833-a04c-f1eea82cbd91_1261x721.png 424w, https://substackcdn.com/image/fetch/$s_!5SIx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cf52c1-1b8b-4833-a04c-f1eea82cbd91_1261x721.png 848w, https://substackcdn.com/image/fetch/$s_!5SIx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cf52c1-1b8b-4833-a04c-f1eea82cbd91_1261x721.png 1272w, https://substackcdn.com/image/fetch/$s_!5SIx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cf52c1-1b8b-4833-a04c-f1eea82cbd91_1261x721.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5SIx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cf52c1-1b8b-4833-a04c-f1eea82cbd91_1261x721.png" width="1261" height="721" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f6cf52c1-1b8b-4833-a04c-f1eea82cbd91_1261x721.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:721,&quot;width&quot;:1261,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5SIx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cf52c1-1b8b-4833-a04c-f1eea82cbd91_1261x721.png 424w, https://substackcdn.com/image/fetch/$s_!5SIx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cf52c1-1b8b-4833-a04c-f1eea82cbd91_1261x721.png 848w, https://substackcdn.com/image/fetch/$s_!5SIx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cf52c1-1b8b-4833-a04c-f1eea82cbd91_1261x721.png 1272w, https://substackcdn.com/image/fetch/$s_!5SIx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cf52c1-1b8b-4833-a04c-f1eea82cbd91_1261x721.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Application Split</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9Dq6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dc7563c-b6cf-4628-944c-61038e33e42b_1329x524.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9Dq6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dc7563c-b6cf-4628-944c-61038e33e42b_1329x524.png 424w, https://substackcdn.com/image/fetch/$s_!9Dq6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dc7563c-b6cf-4628-944c-61038e33e42b_1329x524.png 848w, 
https://substackcdn.com/image/fetch/$s_!9Dq6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dc7563c-b6cf-4628-944c-61038e33e42b_1329x524.png 1272w, https://substackcdn.com/image/fetch/$s_!9Dq6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dc7563c-b6cf-4628-944c-61038e33e42b_1329x524.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9Dq6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dc7563c-b6cf-4628-944c-61038e33e42b_1329x524.png" width="1329" height="524" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8dc7563c-b6cf-4628-944c-61038e33e42b_1329x524.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:524,&quot;width&quot;:1329,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!9Dq6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dc7563c-b6cf-4628-944c-61038e33e42b_1329x524.png 424w, https://substackcdn.com/image/fetch/$s_!9Dq6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dc7563c-b6cf-4628-944c-61038e33e42b_1329x524.png 848w, 
https://substackcdn.com/image/fetch/$s_!9Dq6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dc7563c-b6cf-4628-944c-61038e33e42b_1329x524.png 1272w, https://substackcdn.com/image/fetch/$s_!9Dq6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dc7563c-b6cf-4628-944c-61038e33e42b_1329x524.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Horizontal Scaling for server and multiple workers</figcaption></figure></div><ul><li><p>We encountered performance 
issues very early in our journey, when everything was handled by a single process. To solve this, we first identified the background tasks and separated them out of the server process.</p></li><li><p>As a result, the server process and the various worker processes could scale independently without affecting one another.</p></li></ul><h2><strong><a href="https://zerodha.tech/blog/scaling-with-common-sense/">The bottleneck is always the database</a></strong></h2><blockquote><p><em>We use AWS&#8217;s managed database offering: MySQL on RDS</em></p></blockquote><ul><li><p>As Kailash Nadh puts it, <a href="https://zerodha.tech/blog/scaling-with-common-sense/">&#8220;97.42%* of all scaling bottlenecks stem from databases&#8221;</a>, and we can confirm that it holds true!</p></li><li><p>Focus on fundamentals. &#8220;No indexes&#8221; or &#8220;incorrect indexing&#8221; is the most common answer I get in interviews when I ask about the causes of downtime.</p></li><li><p>RDS Performance Insights helped us pinpoint multiple issues. We were able to fix long-running queries, reduce expensive queries, and understand our transactions better.</p></li><li><p>Investing time in learning about database parameters helped us scale our database better and avoid downtime. 
This is necessary even when you are using a managed database like AWS RDS.</p></li><li><p>One thing we did actively was post slow queries to Slack, which kept us on our toes!</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FjH9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413871ab-a137-438b-87fe-f4b2acebf404_1250x604.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FjH9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413871ab-a137-438b-87fe-f4b2acebf404_1250x604.png 424w, https://substackcdn.com/image/fetch/$s_!FjH9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413871ab-a137-438b-87fe-f4b2acebf404_1250x604.png 848w, https://substackcdn.com/image/fetch/$s_!FjH9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413871ab-a137-438b-87fe-f4b2acebf404_1250x604.png 1272w, https://substackcdn.com/image/fetch/$s_!FjH9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413871ab-a137-438b-87fe-f4b2acebf404_1250x604.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FjH9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413871ab-a137-438b-87fe-f4b2acebf404_1250x604.png" width="1250" height="604" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/413871ab-a137-438b-87fe-f4b2acebf404_1250x604.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:604,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!FjH9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413871ab-a137-438b-87fe-f4b2acebf404_1250x604.png 424w, https://substackcdn.com/image/fetch/$s_!FjH9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413871ab-a137-438b-87fe-f4b2acebf404_1250x604.png 848w, https://substackcdn.com/image/fetch/$s_!FjH9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413871ab-a137-438b-87fe-f4b2acebf404_1250x604.png 1272w, https://substackcdn.com/image/fetch/$s_!FjH9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413871ab-a137-438b-87fe-f4b2acebf404_1250x604.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Slow queries on Slack</figcaption></figure></div><ul><li><p>Using a database proxy helped us scale our application horizontally without worrying about connection limits.</p></li><li><p>At the team level, everyone had access to the database monitoring dashboards, which really helped us understand and learn about issues together. 
I have seen multiple organisations that limit database access to a single team, typically the DevOps or DBA team, which I feel limits developers&#8217; ability to understand and write optimised code.</p></li></ul><h2><strong>Caching Layer</strong></h2><ul><li><p>Initially, we invested in our caching layer to ease the pressure on our database and improve API response times.</p></li><li><p>The setup quickly expanded to cover other use cases such as <a href="https://engineering.probo.in/blazing-fast-leaderboards-using-redis-f48ba26ae4d6">Leaderboard</a>, Authentication, Realtime, Feed, and Graphs.</p></li></ul><h2><strong>Independent endpoint scaling</strong></h2><ul><li><p>Segregating our high-throughput endpoints helped us scale horizontally and isolate issues faster.</p></li><li><p>For example, when a notification is sent to our users, we update its read/view status on our end. This creates burst traffic that impacts the performance of other APIs. Identifying these APIs and routing them to their own set of dedicated instances helped us prevent slowness.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R_Ga!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0543b8d0-5fb3-45f2-b5ca-6f5ca8aec811_1400x335.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R_Ga!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0543b8d0-5fb3-45f2-b5ca-6f5ca8aec811_1400x335.png 424w, https://substackcdn.com/image/fetch/$s_!R_Ga!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0543b8d0-5fb3-45f2-b5ca-6f5ca8aec811_1400x335.png 848w, 
https://substackcdn.com/image/fetch/$s_!R_Ga!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0543b8d0-5fb3-45f2-b5ca-6f5ca8aec811_1400x335.png 1272w, https://substackcdn.com/image/fetch/$s_!R_Ga!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0543b8d0-5fb3-45f2-b5ca-6f5ca8aec811_1400x335.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R_Ga!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0543b8d0-5fb3-45f2-b5ca-6f5ca8aec811_1400x335.png" width="1400" height="335" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0543b8d0-5fb3-45f2-b5ca-6f5ca8aec811_1400x335.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:335,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!R_Ga!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0543b8d0-5fb3-45f2-b5ca-6f5ca8aec811_1400x335.png 424w, https://substackcdn.com/image/fetch/$s_!R_Ga!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0543b8d0-5fb3-45f2-b5ca-6f5ca8aec811_1400x335.png 848w, 
https://substackcdn.com/image/fetch/$s_!R_Ga!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0543b8d0-5fb3-45f2-b5ca-6f5ca8aec811_1400x335.png 1272w, https://substackcdn.com/image/fetch/$s_!R_Ga!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0543b8d0-5fb3-45f2-b5ca-6f5ca8aec811_1400x335.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The router sends groups of API calls to their own dedicated auto scaling groups.</figcaption></figure></div><h2><strong>Language Specific Tweaking</strong></h2><ul><li><p>Whatever language you pick, some basic configuration tweaking is always required to scale efficiently.</p></li><li><p>We mostly use Node.js, and since it runs JavaScript on a single thread, we run it in cluster mode using PM2 to make efficient use of our compute resources.</p></li><li><p>We also used other flags such as <code>--max-old-space-size=SIZE</code> (sets the max memory size of V8's old memory section) and <code>--prof</code> (for profiling).</p></li></ul><h2><strong>Observability &amp; Alerting</strong></h2><ul><li><p>We spent a considerable amount of time getting our observability setup right. I wouldn&#8217;t say it is perfect, but it gets the job done.</p></li><li><p>Our logging infrastructure is self-hosted: a simple Elasticsearch cluster with Kibana. Although it looks simple, scaling Elasticsearch has been a challenge.</p></li><li><p>Initially, the load on our logging infrastructure was self-inflicted, since a lot of the logs were not useful. 
To tackle this problem, we created a simple library: <a href="https://engineering.probo.in/from-chaos-to-clarity-the-evolution-of-logging-from-unstructured-to-structured-part-1-78b059bed362">https://engineering.probo.in/from-chaos-to-clarity-the-evolution-of-logging-from-unstructured-to-structured-part-1-78b059bed362</a></p></li><li><p>For metrics, we used <a href="https://last9.io/">Last9</a>. It let us spend our time creating effective dashboards instead of on installation, maintenance, and scaling.</p></li><li><p>Last9&#8217;s alert manager helped us democratise alerts across our teams. Despite running a monolith, each team owned its metrics and alerts.</p></li></ul><h1><strong>Not everything is good in Wonderland.</strong></h1><h2><strong>Deployment time</strong></h2><p>Deployment time has increased significantly due to:</p><ul><li><p>CI pipeline tests and checks</p></li><li><p>Rolling deployments, where the time taken to complete a deployment is directly proportional to the number of instances</p></li></ul><h2><strong>Fear of change</strong></h2><p>There is always a fear that a single change might take the complete service down. This has led to apprehension about deploying frequently or during high-traffic periods.</p><h2><strong>Multiple dependencies &amp; Resource Sizing</strong></h2><ul><li><p>Dependencies such as databases and caches keep growing with use cases, which increases your surface area for downtime.</p></li><li><p>Horizontal scaling also puts pressure on shared resources such as database connections, which leads to unoptimised sizing. Proxy &amp; pooling to the rescue!</p></li></ul><p>In conclusion, it&#8217;s quite possible to scale a monolithic architecture to handle a high volume of transactions, given the right practices and considerations.</p><p>We&#8217;ve focused on keeping our codebase simple and structured. There are challenges, but we&#8217;ve found solutions that work for us. 
The journey has been rewarding and educational, and we hope our experience can provide useful insights for others facing similar scaling challenges. At Probo, with increasing scale, we get regular opportunities to re-evaluate our architecture. Stay tuned for more updates!</p><p>If you enjoyed reading this, you&#8217;re one of us &#8212; someone who loves solving deep technical challenges. We&#8217;re building <strong>high-performance systems</strong> and tackling <strong>complex engineering problems</strong> every day.</p><p>Follow - https://engineering.probo.in/</p>]]></content:encoded></item><item><title><![CDATA[Coming soon]]></title><description><![CDATA[This is Thoughtful Engineering.]]></description><link>https://www.adityachowdhry.me/p/coming-soon</link><guid isPermaLink="false">https://www.adityachowdhry.me/p/coming-soon</guid><dc:creator><![CDATA[Aditya Chowdhry]]></dc:creator><pubDate>Sun, 22 Jun 2025 11:21:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!il6o!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7de5978-98c4-4304-92b9-5965066b7714_1024x1024.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>This is Thoughtful Engineering.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.adityachowdhry.me/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.adityachowdhry.me/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item></channel></rss>