Klenty
Chennai, Tamil Nadu, India
2022 Jul - Present
ChefAtHome FoodTech LLP
Tamil Nadu, India
2021 May - 2021 Sep
CodeSpeedy Technology Private Limited
West Bengal, India
2021 Jan - 2021 Mar
Einstein College of Engineering
Computer Science
2018 - 2022
Tilak Vidyalaya Higher Secondary School
Computer Science
2017 - 2018

Are you overpaying for AI because of your language? If you're building LLM applications in Spanish, Hindi, or Greek, you could be spending up to 6 times more than English users for the exact same functionality.

This post was inspired by the research paper "Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models".

The Hidden Tokenization Tax

When you send text to GPT-4, Claude, or Gemini, your input gets broken into tokens: chunks roughly 3-4 characters long in English. You pay per token for both input and output.

The shocking truth: the same sentence costs wildly different amounts depending on your language.

Real example: "Hello, my name is Sarah"
- English: 7 tokens → baseline → $16,425/year
- Spanish: 11 tokens → 1.5× higher → $24,638/year (+$8,213)
- Hindi: 35 tokens → 5× higher → $82,125/year (+$65,700)
- Greek: 42 tokens → 6× higher → $98,550/year (+$82,125)

That's an $82,000 annual difference for the exact same chatbot, purely because of language.
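You can check this kind of gap yourself. The sketch below is my own illustration (not from the paper): it uses OpenAI's tiktoken package to count tokens for the same greeting in a few languages. Exact counts vary by tokenizer and model.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-3.5/GPT-4-era models

greetings = {
    "English": "Hello, my name is Sarah",
    "Spanish": "Hola, mi nombre es Sarah",
    "Hindi": "नमस्ते, मेरा नाम सारा है",
    "Greek": "Γεια σας, με λένε Σάρα",
}

baseline = len(enc.encode(greetings["English"]))
for lang, text in greetings.items():
    n = len(enc.encode(text))
    print(f"{lang:8s}: {n:3d} tokens ({n / baseline:.1f}x English)")
```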
The Complete Language Cost Breakdown

[Chart: tokenization cost by language]

Research from ACL 2023 and recent LLM benchmarks reveals systematic bias in how models tokenize different languages. Here's what it costs to process 24 major languages, expressed as a multiple of the English cost.

Most efficient languages (1.0-1.5× English):
- English: 1.0× (baseline)
- French: 1.2×
- Italian: 1.2×
- Portuguese: 1.3×
- Spanish: 1.5×

Moderately expensive (1.6-2.5×):
- Korean: 1.6×
- Japanese: 1.8×
- Chinese (Simplified): 2.0×
- Arabic: 2.0×
- Russian: 2.5×

Highly expensive (3.0-6.0×):
- Ukrainian: 3.0×
- Bengali: 4.0×
- Thai: 4.0×
- Hindi: 5.0×
- Tamil: 5.0×
- Telugu: 5.0×
- Greek: 6.0× (most expensive)

Why Writing Systems Matter

[Chart: tokenization cost and efficiency by writing system]

The script your language uses creates dramatic efficiency gaps:
- Latin script: 1.4× average (73.5% efficient)
- Hangul (Korean): 1.6× (63% efficient)
- Han/Japanese: 1.8-2.0× (50-56% efficient)
- Cyrillic: 2.75× average (36.5% efficient)
- Indic scripts: 4-5× average (20% efficient)
- Greek: 6.0× (17% efficient, the worst)

Why This Inequality Exists

1. Training data bias
GPT-4, Claude, and Gemini are trained on English-dominant datasets. The Common Crawl corpus shows a stark imbalance:
- ~60% English
- ~10-15% combined for Spanish, French, and German
- <5% for most other languages
Tokenizers learn to compress what they see most. English gets ultra-efficient encoding; everything else is treated as "foreign."

2. Morphological complexity
Languages with rich morphology generate far more word variations:
- English: "run" → runs, running, ran (4 forms)
- Turkish: a single root → 50+ forms with suffixes
- Arabic: a root system → thousands of variations
- Hindi: complex verb conjugations with gender, number, and tense
Tokenizers can't learn compact patterns for high-variation, low-data languages.

3. Unicode encoding overhead
Different scripts need different byte counts:
- Latin: 1 byte per character
- Cyrillic: 2 bytes per character
- Devanagari/Tamil: 3+ bytes per character
More bytes means more tokens, which means higher cost even for the same semantic content.

Real-World Cost Impact

Here's what tokenization inequality means for actual business applications:

Customer support chatbot (10,000 messages/day):
- English: $16,425/year
- Spanish: $24,638/year (+50%, +$8,213)
- Hindi: $82,125/year (+400%, +$65,700)

Content generation platform (1M words/month):
- English: $14,400/year
- Spanish: $21,600/year
- Hindi: $72,000/year

Document translation service (100K words/day):
- English: $65,700/year
- Spanish: $98,550/year (+$32,850)
- Hindi: $328,500/year (+$262,800)

Code assistant (50K queries/day):
- English: $91,250/year
- Spanish: $136,875/year
- Hindi: $456,250/year (+$365,000)

Bottom line: a company serving Hindi users pays $262,800-$365,000 more annually than an identical English service.

The Socioeconomic Dimension

Research reveals a disturbing -0.5 correlation between a country's Human Development Index and its LLM tokenization cost. Translation: less developed countries often speak languages that cost more to process.
- Users in developing nations pay premium rates
- Communities with fewer resources face higher AI barriers
- This creates a "double unfairness" in AI democratization

Example: a startup in India building a Hindi customer service bot pays 5× more than a US competitor, despite likely having far less funding.

The Future of Fair AI

Language should never determine how much intelligence costs. Yet today, the world's most spoken tongues pay a silent premium just to access the same models. Fixing this isn't about optimization; it's about fairness. Until every language is tokenized equally, AI remains fluent in inequality.

Welcome to Part 2 of the 2025 Voice AI Guide: How to Build Your Own Real-Time Voice Agent.

In this section, we'll dive deep into Pipecat and create a simple "Hello World" program to understand how real-time voice AI works in practice. If you haven't read Part 1, you can read it here.

Pipecat

Pipecat is an open-source Python framework developed by Daily.co for building real-time voice and multimodal conversational AI agents. It provides a powerful yet intuitive way to orchestrate audio, video, AI services, and transport protocols to create sophisticated voice assistants, AI companions, and interactive conversational experiences.

What makes Pipecat special is its voice-first approach combined with a modular, composable architecture. Instead of building everything from scratch, you can focus on what makes your agent unique while Pipecat handles the complex orchestration of real-time audio processing, speech recognition, language models, and speech synthesis.

What You Can Build with Pipecat

Pipecat enables a wide range of applications:
- Voice Assistants: natural, streaming conversations with AI
- AI Companions: coaches, meeting assistants, and interactive characters
- Phone Agents: customer support, intake bots, and automated calling systems
- Multimodal Interfaces: applications combining voice, video, and images
- Business Agents: customer service bots and guided workflow systems
- Interactive Games: voice-controlled gaming experiences
- Creative Tools: interactive storytelling with generative media

Pipecat Architecture: How It Works

Understanding Pipecat's architecture is crucial for building effective voice agents. The framework is built around three core concepts.

1. Frames
Frames are data packages that move through your application. Think of them as containers that hold specific types of information:
- Audio frames: raw audio data from microphones
- Text frames: transcribed speech or generated responses
- Image frames: visual data for multimodal applications
- Control frames: system messages like start/stop signals

2. Frame Processors
Frame processors are specialized workers that handle specific tasks. Each processor:
- Receives specific frame types as input
- Performs a specialized transformation (transcription, language processing, etc.)
- Outputs new frames for the next processor
- Passes through frames it doesn't handle
Common processor types include:
- STT (speech-to-text) processors that convert audio frames to text frames
- LLM processors that take text frames and generate response frames
- TTS (text-to-speech) processors that convert text frames to audio frames
- Context aggregators that manage conversation history

3. Pipelines
Pipelines connect processors together, creating a structured path for data to flow through your application.
They handle orchestration automatically and enable parallel processing: while the LLM generates later parts of a response, earlier parts are already being converted to speech and played back to users.

Voice AI Processing Flow

Here's how a typical voice conversation flows through a Pipecat pipeline:
1. Audio input: the user speaks → the transport receives streaming audio → audio frames are created
2. Speech recognition: the STT processor receives audio → transcribes in real time → outputs text frames
3. Context management: the context processor aggregates text with the conversation history
4. Language processing: the LLM processor generates a streaming response → outputs text frames
5. Speech synthesis: the TTS processor converts text to speech → outputs audio frames
6. Audio output: the transport streams audio to the user's device → the user hears the response

The key insight is that everything happens in parallel, and this parallel processing enables the ultra-low latency that makes conversations feel natural.

Hello World Voice Agent: Complete Implementation

Now let's build a complete "Hello World" voice agent that demonstrates all the core concepts. This example creates a friendly AI assistant you can have real-time voice conversations with.

Prerequisites

Before we start, you'll need:
- Python 3.10 or later
- The uv package manager (or pip)
- API keys from three services:
  - Deepgram for speech-to-text
  - OpenAI for the language model
  - Cartesia for text-to-speech

Project Setup

First, let's set up the project:

```bash
# Install Pipecat with required services
uv add "pipecat-ai[deepgram,openai,cartesia,webrtc]"
```

Environment Configuration

Create a .env file with your API keys:

```
# .env
DEEPGRAM_API_KEY=your_deepgram_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
CARTESIA_API_KEY=your_cartesia_api_key_here
```
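Before running the bot, it can help to confirm that those keys actually load. A minimal sanity check, assuming the python-dotenv package (which Pipecat examples typically rely on):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # read the .env file from the current directory

for key in ("DEEPGRAM_API_KEY", "OPENAI_API_KEY", "CARTESIA_API_KEY"):
    # Only report presence; never print the secrets themselves.
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```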
The code is somewhat long, so I haven't shared all of it here; you can check out the whole code here.

```python
async def main():
    """Main entry point for the Hello World bot."""
    bot = HelloWorldVoiceBot()
    await bot.run_bot()
```

Understanding the Code Structure

Let's break down the key components of our Hello World implementation.

1. Service Initialization

```python
# Speech-to-Text service
self.stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))

# Language Model service
self.llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-3.5-turbo")

# Text-to-Speech service
self.tts = CartesiaTTSService(api_key=os.getenv("CARTESIA_API_KEY"), voice_id="...")
```

Each service is a frame processor that handles a specific part of the voice AI pipeline.

2. Pipeline Configuration

```python
pipeline = Pipeline([
    transport.input(),                    # Audio input from browser
    self.stt,                             # Speech -> Text
    self.context_aggregator.user(),       # Add to conversation history
    self.llm,                             # Generate response
    self.tts,                             # Text -> Speech
    transport.output(),                   # Audio output to browser
    self.context_aggregator.assistant(),  # Save response to history
])
```

The pipeline defines the data flow: each processor receives frames, transforms them, and passes them to the next processor.

3. Event-Driven Interactions

```python
@transport.event_handler("on_first_participant_joined")
async def on_participant_joined(transport, participant):
    # Trigger the bot to greet the user
    await task.queue_frame(LLMMessagesFrame(self.messages))
```

Event handlers manage the conversation lifecycle: when users join or leave, when they start or stop speaking, and so on.

The diagram below shows a typical voice assistant pipeline, where each step happens in real time.

Running Your Hello World Bot

1. Save the code as hello_world_bot.py
2. Run the bot: python hello_world_bot.py
3. Open your browser to http://localhost:7860
4. Click "Connect" and allow microphone access
5. Start talking! Say something like "Hello, how are you?"

The bot will:
- Listen to your speech (STT)
- Process it with OpenAI (LLM)
- Respond with natural speech (TTS)
- Remember the conversation context

For more examples and advanced features, check out the Pipecat documentation and example repository.

What Next?

Now that you're familiar with Pipecat and can build your own real-time voice agent, it's time to take the next step. In the upcoming part, we'll explore how to run all models locally, even on a CPU, and build a fully offline voice agent.

I've created a GitHub repository, VoiceAgentGuide, for this series, where we can store our notes and related resources. Don't forget to check it out and share your feedback. Feel free to contribute or add missing content by submitting a pull request (PR).

Stay tuned for the next part of the 2025 Voice AI Guide!

Imagine trying to teach a child who has never seen the world to recognize a face, feel that fire is hot, or sense when it might rain. How would you do it?

For centuries, we thought intelligence required something mystical: a soul, consciousness, a divine spark. But what if it's just pattern recognition at an extraordinary scale? What if learning is simply tuning millions of tiny parameters until inputs map correctly to outputs?

That's the bold idea behind deep learning: mathematical systems that can learn any pattern, approximate any function, and tackle problems once thought uniquely human.

In 1989, mathematicians proved the Universal Approximation Theorem, showing that even a neural network with a single hidden layer can approximate any continuous function. In theory, such a network can learn to translate, recognize, play, or predict anything.

But theory alone isn't enough. The theorem says such a network exists, not how to build or train it. That's where the real craft of deep learning begins: finding the right weights, training efficiently, and learning patterns that generalize.

Let's unpack the six core ideas that make this possible.

Note: This is a deep dive, not a skim. Grab a coffee, settle in, and take your time. By the end, you'll understand neural networks from the ground up, not just in words but in logic.

1. Neural Networks: Universal Function Approximators

What Are We Trying to Do?

Before we understand neural networks, let's start with something simpler: what is a function?

In mathematics, a function is a relationship that maps inputs to outputs. f(x) = 2x + 1 is a function. You give it x = 3, it returns 7. Simple, deterministic, predictable.

But real-world problems involve functions we can't write down. Consider:
- f(image) = "cat" or "dog"
- f(email_text) = "spam" or "not spam"
- f(patient_symptoms) = disease_probability
These are still functions, mapping inputs to outputs, but we don't know their mathematical form. Traditional programming can't help us here because we can't write explicit rules for every possible image or email.

Building Blocks: The Artificial Neuron

Let's build from the ground up. Start with a single neuron, the atomic unit of a neural network.

A neuron does three things:
1. Receives multiple inputs (x₁, x₂, x₃, ...)
2. Multiplies each input by a weight (w₁, w₂, w₃, ...)
3. Sums everything up and adds a bias: z = w₁x₁ + w₂x₂ + w₃x₃ + ... + b
Why this structure? Because it's the simplest way to combine multiple pieces of information into a single decision.

Geometry of a Neuron: Drawing a Line

Let's ground this in a real example.

Problem: You're a bank deciding whether to approve loans. You have two pieces of information:
- x₁ = annual income (in thousands)
- x₂ = credit score
Goal: separate "approve" from "reject" applications.

A single neuron creates a line (in 2D) or a hyperplane (in higher dimensions). The equation z = w₁x₁ + w₂x₂ + b is actually the equation of a line. Let's see how.

Example neuron with specific weights:
z = 0.5·income + 2·credit_score - 150
This neuron outputs positive values for "approve" and negative values for "reject". The decision boundary is where z = 0:
0 = 0.5·income + 2·credit_score - 150
credit_score = 75 - 0.25·income
This is a line!

What the weights mean geometrically:
- w₁ = 0.5: for every $1,000 increase in income, the decision shifts by 0.5 units toward approval
- w₂ = 2.0: for every 1-point increase in credit score, the decision shifts by 2 units toward approval (4× more important than income!)
- b = -150: the bias shifts the entire line. Without it, the line would pass through the origin (0, 0).

The learning process is finding the right line:
1. Start with a random line (random weights)
2. See which points it classifies wrong
3. Adjust the weights to rotate and shift the line
4. Repeat until the line best separates the two groups
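To make the geometry concrete, here is a tiny sketch of my own, using the weights from the loan example above, that scores a few made-up applicants and applies the z = 0 decision rule:

```python
def loan_score(income_k: float, credit_score: float) -> float:
    """Single neuron: z = w1*x1 + w2*x2 + b with the example weights."""
    w1, w2, b = 0.5, 2.0, -150.0
    return w1 * income_k + w2 * credit_score + b

applicants = [(40, 80), (100, 40), (60, 55)]  # (income in $k, credit score)
for income, score in applicants:
    z = loan_score(income, score)
    print(f"income={income}k, credit={score}: z={z:+.1f} -> {'approve' if z > 0 else 'reject'}")
```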
What One Neuron Can and Cannot Do

Cannot separate (non-linearly separable): XOR is the classic example. You need (0,1) and (1,0) to be class 1, but (0,0) and (1,1) to be class 0. No single line can achieve this separation. This is why we need multiple layers.

Multiple Neurons, Multiple Lines: Building Complex Boundaries

If one neuron creates one line, what happens with multiple neurons in one layer?
Example: 3 neurons in one layer
- Neuron 1: z₁ = w₁₁x₁ + w₁₂x₂ + b₁ [line 1]
- Neuron 2: z₂ = w₂₁x₁ + w₂₂x₂ + b₂ [line 2]
- Neuron 3: z₃ = w₃₁x₁ + w₃₂x₂ + b₃ [line 3]
Each neuron draws a different line. But without additional layers, we still can't solve XOR. Why? Because we're just drawing multiple lines without combining them in complex ways.
The key insight: we need to combine these lines non-linearly. This is where activation functions and depth come in.

The Layer Abstraction

Now stack multiple neurons side by side; that's a layer. Each neuron in the layer:
- Receives the same inputs
- Has its own unique weights and bias
- Produces its own output
A layer with 10 neurons transforms one input vector into 10 different outputs, each representing a different "feature" or "pattern" it has detected.

Solving XOR: A Complete Example

Let's solve XOR step by step to understand how layers work together.

The XOR problem:
Input (x₁, x₂) → Output
(0, 0) → 0
(0, 1) → 1
(1, 0) → 1
(1, 1) → 0

Two-layer solution:

Layer 1: create useful features (2 neurons with ReLU)
Neuron 1 detects "at least one input is 1":
z₁ = x₁ + x₂ - 0.5
a₁ = ReLU(z₁)
Testing: (0,0): z₁ = -0.5, a₁ = 0; (0,1): z₁ = 0.5, a₁ = 0.5; (1,0): z₁ = 0.5, a₁ = 0.5; (1,1): z₁ = 1.5, a₁ = 1.5
Neuron 2 detects "both inputs are 1":
z₂ = x₁ + x₂ - 1.5
a₂ = ReLU(z₂)
Testing: (0,0): z₂ = -1.5, a₂ = 0; (0,1): z₂ = -0.5, a₂ = 0; (1,0): z₂ = -0.5, a₂ = 0; (1,1): z₂ = 0.5, a₂ = 0.5

Layer 2: combine features (1 neuron with sigmoid). Note: the combining weight on a₂ must be at least -3 here, otherwise the (1,1) case would come out positive; with -2 the arithmetic does not work out.
z₃ = a₁ - 3·a₂ - 0.25
output = Sigmoid(z₃)
Testing:
(0,0): z₃ = 0 - 0 - 0.25 = -0.25 → ≈0 ✓
(0,1): z₃ = 0.5 - 0 - 0.25 = 0.25 → ≈1 ✓
(1,0): z₃ = 0.5 - 0 - 0.25 = 0.25 → ≈1 ✓
(1,1): z₃ = 1.5 - 1.5 - 0.25 = -0.25 → ≈0 ✓

What happened geometrically?
Layer 1 transformed the space: the first layer created new features in which the problem becomes linearly separable.
- a₁ captures "OR-ness" (at least one input is true)
- a₂ captures "AND-ness" (both inputs are true)
Layer 2 drew a simple line in this new space:
a₁ - 3·a₂ = 0.25 [decision boundary]
This line easily separates XOR in the transformed space.

The key insight:
- Layer 1 creates useful intermediate features by drawing multiple lines/planes
- Layer 2 combines these features with another line/plane
- Together, they can represent any decision boundary!
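Here is a quick check of my own, using the corrected output weight of -3, verifying that this two-layer network reproduces the XOR truth table:

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(x1, x2):
    a1 = relu(x1 + x2 - 0.5)    # "at least one input is 1"
    a2 = relu(x1 + x2 - 1.5)    # "both inputs are 1"
    z3 = a1 - 3.0 * a2 - 0.25   # combine the two features
    return sigmoid(z3)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    p = xor_net(x1, x2)
    print(f"XOR({x1}, {x2}) -> {p:.2f} -> class {int(p > 0.5)}")
```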
The Complete Architecture

A typical neural network:
Input layer (raw data) → Hidden layer 1 (low-level features) → Hidden layer 2 (mid-level features) → Hidden layer 3 (high-level features) → Output layer (predictions)
The power lies not in any single neuron, but in the billions of connections between them, each with its own weight, collectively forming a function approximator of extraordinary flexibility.

Universal Approximation Theorem

The Universal Approximation Theorem (1989) proves that a neural network with just one hidden layer can approximate any continuous function, given enough neurons. But "enough" might mean billions, which is impractical. Deep (multi-layer) networks achieve the same expressive power more efficiently through hierarchical composition, like compression for abstractions.
So, in theory, neural networks can learn any mapping; in practice, depth makes it tractable.

2. Activation Functions: Breaking Linearity

The Linear Trap: A Fundamental Problem

Imagine we build a neural network with three layers, but we don't use activation functions. Let's trace through what happens mathematically:
Layer 1: z₁ = W₁x + b₁
Layer 2: z₂ = W₂z₁ + b₂ = W₂(W₁x + b₁) + b₂ = W₂W₁x + W₂b₁ + b₂
Layer 3: z₃ = W₃z₂ + b₃ = W₃(W₂W₁x + W₂b₁ + b₂) + b₃
Simplifying: z₃ = (W₃W₂W₁)x + (W₃W₂b₁ + W₃b₂ + b₃)
Notice what happened? No matter how many layers we add, we always end up with Wx + b, a simple linear function. A product of matrices is still a matrix. We've built an expensive way to do simple linear regression.
This is catastrophic. Linear functions can only model linear relationships. The real world is non-linear. The path of a thrown ball, the spread of a virus, the relationship between study time and test scores: all non-linear.

The Solution: Non-Linear Activation Functions

After each neuron computes its weighted sum, we pass it through a non-linear activation function: a = σ(z)
This single addition breaks the linear trap. Now our layers actually do different things, building increasingly complex representations.

What Makes a Good Activation Function?

Let's think about what properties we need:
1. Non-linearity (obviously, or we're back where we started)
2. Differentiability (we'll need derivatives for learning)
3. Computational efficiency (we'll apply it billions of times)
4. Avoids saturation (outputs shouldn't always sit at the extremes)
5. Zero-centered or positive (depending on the problem)

Common Activation Functions

ReLU (Rectified Linear Unit): f(x) = max(0, x)
Why it works:
- Dead simple: if the input is positive, the output equals the input; if negative, the output is zero
- Non-linear despite looking linear (the "kink" at zero creates non-linearity)
- Computationally trivial: just one comparison and no multiplications
- Doesn't saturate for positive values (unlike sigmoid)
- Induces sparsity: many neurons output exactly zero, creating efficient representations
The problems:
- "Dying ReLU": if a neuron's weights push it permanently into negative territory, its gradient becomes zero and it stops learning forever
- Not zero-centered: all outputs are positive, which can slow convergence
Variants:
- Leaky ReLU: f(x) = max(0.01x, x), which allows small gradients when x < 0, preventing death
- ELU (Exponential Linear Unit): a smooth curve for negative values, with better learning dynamics

Sigmoid: f(x) = 1/(1 + e^(-x))
Why it exists:
- Squashes any input into the range (0, 1)
- Historically motivated by biological neurons (firing rates between 0 and 1)
- The output can be interpreted as a probability
What's happening?
- For large positive x: e^(-x) approaches 0, so the output approaches 1
- For large negative x: e^(-x) approaches infinity, so the output approaches 0
- At x = 0: the output is 0.5
Why it's problematic:
- Vanishing gradients: for large positive or negative inputs, the sigmoid is nearly flat and its derivative approaches zero. During backpropagation, gradients get multiplied across layers, so near-zero factors compound into ever smaller values and deep networks can't learn.
- Not zero-centered: outputs are always positive (between 0 and 1), causing zig-zagging during optimization
- Computationally expensive: requires an exponential
Where it's still used:
- The output layer for binary classification (we want a probability between 0 and 1)

Tanh: f(x) = (e^x - e^(-x))/(e^x + e^(-x))
Advantages over sigmoid:
- Zero-centered: outputs range from -1 to 1
- Stronger gradients: the derivative at zero is 1 (compared to 0.25 for sigmoid)
Still suffers from:
- Vanishing gradients for extreme values
- The computational cost of exponentials

Softmax: f(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)
Completely different purpose:
- Not used between hidden layers
- Used exclusively for multi-class classification output layers
In simple terms, it:
- Takes a vector of arbitrary values (logits)
- Converts them into probabilities that sum to 1
- Exponentiation ensures all values are positive
- Division by the sum ensures they sum to 1
- Higher inputs get exponentially higher probabilities
Example:
- Input: [2.0, 1.0, 0.1]
- After softmax: [0.659, 0.242, 0.099]
- Notice: still ordered the same way, but now they are probabilities

Why Different Layers Need Different Activations
- Hidden layers: the ReLU family (efficiency, avoiding vanishing gradients)
- Binary classification output: sigmoid (a probability for one class)
- Multi-class classification output: softmax (a probability distribution over all classes)
- Regression output: often no activation (or linear), since we want the raw value, not a bounded one
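As a reference, here is a small NumPy sketch of my own implementing the four activations discussed above; the softmax uses the standard max-subtraction trick for numerical stability:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def softmax(z):
    # Subtracting the max changes nothing mathematically but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(relu(np.array([-2.0, 0.0, 3.0])))    # [0. 0. 3.]
print(sigmoid(np.array([0.0])))            # [0.5]
print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659 0.242 0.099]
```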
3. Forward Propagation: The Prediction Process

What is Propagation?

"Propagation" is just a fancy word for "passing information through." Forward propagation is the process of taking input data and pushing it through every layer until we get a prediction.
Let's build this concept from absolute scratch.

The Single Neuron Case

You have: input x = 3, weight w = 2, bias b = 1.
Step 1: Linear combination: z = wx + b = 2(3) + 1 = 7
Step 2: Activation: a = ReLU(z) = max(0, 7) = 7
That's it. The neuron outputs 7. This output might be the final prediction (if it's the only neuron), or it might be the input to the next layer.

Multiple Inputs, Single Neuron

Now you have three inputs:
Inputs: x = [x₁=2, x₂=3, x₃=1]
Weights: w = [w₁=0.5, w₂=-1, w₃=2]
Bias: b = 1
Step 1: Weighted sum:
z = w₁x₁ + w₂x₂ + w₃x₃ + b
z = 0.5(2) + (-1)(3) + 2(1) + 1
z = 1 - 3 + 2 + 1 = 1
Step 2: Activation: a = ReLU(1) = 1

Single Layer: Multiple Neurons

Now suppose we have 3 neurons in one layer, all receiving the same 3 inputs.
Neuron 1: weights [w₁₁, w₁₂, w₁₃], bias b₁, output a₁ = ReLU(w₁₁x₁ + w₁₂x₂ + w₁₃x₃ + b₁)
Neuron 2: weights [w₂₁, w₂₂, w₂₃], bias b₂, output a₂ = ReLU(w₂₁x₁ + w₂₂x₂ + w₂₃x₃ + b₂)
Neuron 3: weights [w₃₁, w₃₂, w₃₃], bias b₃, output a₃ = ReLU(w₃₁x₁ + w₃₂x₂ + w₃₃x₃ + b₃)
The layer transforms the input vector [x₁, x₂, x₃] into the output vector [a₁, a₂, a₃].

Matrix Representation: Scaling to Thousands of Neurons

Writing out every neuron individually is tedious. We use matrix notation.
Weight matrix W (each row holds one neuron's weights):
W = [w₁₁ w₁₂ w₁₃]
    [w₂₁ w₂₂ w₂₃]
    [w₃₁ w₃₂ w₃₃]
Input vector x:
x = [x₁]
    [x₂]
    [x₃]
Forward propagation for the layer:
z = Wx + b
a = ReLU(z)
This single matrix multiplication computes all neurons simultaneously. With modern GPUs optimized for matrix operations, we can process thousands of neurons in parallel.

Deep Networks: Chaining Layers

Now stack multiple layers. The output of layer 1 becomes the input to layer 2:
Layer 1: z¹ = W¹x + b¹, a¹ = ReLU(z¹)
Layer 2: z² = W²a¹ + b², a² = ReLU(z²)
Layer 3 (output): z³ = W³a² + b³, ŷ = softmax(z³) [if classification]
The final output ŷ is our prediction.

Concrete Example: Digit Recognition

Input: a 28×28 pixel image of a handwritten digit (flattened to 784 values)
Architecture:
- Input layer: 784 neurons
- Hidden layer 1: 128 neurons (with ReLU)
- Hidden layer 2: 64 neurons (with ReLU)
- Output layer: 10 neurons (with softmax for digits 0-9)
Forward propagation:
z¹ = W¹x + b¹ [128 values]
a¹ = ReLU(z¹) [128 values]
z² = W²a¹ + b² [64 values]
a² = ReLU(z²) [64 values]
z³ = W³a² + b³ [10 values]
ŷ = softmax(z³) [10 probabilities summing to 1]
The output might be: [0.01, 0.02, 0.05, 0.7, 0.1, 0.05, 0.03, 0.02, 0.01, 0.01]
The network predicts "3" with 70% confidence (index 3 has the highest probability).

Why "Forward"?

Because information flows in one direction: from the input, through the hidden layers, to the output. No loops, no feedback (in standard feedforward networks). Each layer only looks forward, never backward.
Later, during learning, we'll propagate in the opposite direction (backward) to adjust the weights. But prediction is always forward.
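Here is a minimal, self-contained NumPy sketch of that 784-128-64-10 forward pass with randomly initialized weights (my own illustration, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialized parameters for a 784 -> 128 -> 64 -> 10 network
W1, b1 = rng.normal(0, 0.05, (128, 784)), np.zeros(128)
W2, b2 = rng.normal(0, 0.05, (64, 128)), np.zeros(64)
W3, b3 = rng.normal(0, 0.05, (10, 64)), np.zeros(10)

def forward(x):
    a1 = np.maximum(0.0, W1 @ x + b1)   # hidden layer 1 (ReLU)
    a2 = np.maximum(0.0, W2 @ a1 + b2)  # hidden layer 2 (ReLU)
    z3 = W3 @ a2 + b3
    e = np.exp(z3 - z3.max())           # softmax output layer
    return e / e.sum()

x = rng.random(784)                     # stand-in for a flattened 28x28 image
y_hat = forward(x)
print(y_hat.round(3), "-> predicted digit:", int(y_hat.argmax()))
```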
4. Loss Functions: Quantifying Error

Why Do We Need Loss?

Imagine you're teaching a child to draw circles. They draw something. How do you tell them how "wrong" it is? You need a measurement, some way to quantify the difference between what they drew and a perfect circle.
Neural networks face the same problem. After forward propagation, we have a prediction ŷ. We also have the true answer y. The loss function L(ŷ, y) measures how wrong the prediction is.
This single number is crucial because:
- It tells us how well the model is performing
- It guides the learning process (we'll adjust weights to minimize this number)
- Different problems need different ways of measuring "wrongness"

Property Requirements for Loss Functions
1. Non-negative: L ≥ 0 always (you can't be "negatively wrong")
2. Zero when perfect: L = 0 when ŷ = y exactly
3. Increases with error: worse predictions → higher loss
4. Differentiable: we need gradients for learning (a calculus requirement)
5. Appropriate for the task: regression and classification need different measures

Mean Squared Error (MSE): For Regression

The problem: predict a continuous value (house price, temperature, stock price).
The most intuitive approach is the absolute difference |ŷ - y|:
- If the true value is 100 and we predict 90, the error is 10
- Simple, interpretable
But there's a problem: the absolute value isn't differentiable at zero (the derivative has a discontinuity), which complicates learning algorithms.
Better approach: square the difference.
L = (ŷ - y)²
Why squaring?
- Always positive (negative errors don't cancel positive ones)
- Differentiable everywhere: dL/dŷ = 2(ŷ - y)
- Penalizes large errors more (an error of 10 contributes 100, but an error of 1 contributes only 1)
- Mathematically convenient (leads to elegant solutions)
For multiple predictions (a batch):
MSE = (1/n) Σᵢ (ŷᵢ - yᵢ)²
We average across all samples to get a single loss value.
Concrete example: predicting house prices
- True prices: [200k, 300k, 250k]
- Predicted: [210k, 280k, 255k]
- Errors: [10k, -20k, 5k]
- Squared errors: [100M, 400M, 25M]
- MSE = (100M + 400M + 25M) / 3 = 175M
The large middle error dominates the loss, signaling that's where improvement is needed most.
Variant: MAE (Mean Absolute Error)
MAE = (1/n) Σᵢ |ŷᵢ - yᵢ|
- More robust to outliers (doesn't square them)
- Less sensitive to large errors
- Harder to optimize (non-smooth at zero)

Cross-Entropy Loss: For Classification

The problem: predict discrete categories (cat vs. dog, spam vs. ham, digits 0-9).
MSE doesn't work well here. Why? Because classification outputs are probabilities, and we need to measure "how wrong" a probability distribution is.

Binary Cross-Entropy (Two Classes)
Setup:
- True label: y ∈ {0, 1} (e.g., 0 = not spam, 1 = spam)
- Predicted probability: ŷ ∈ [0, 1] (from a sigmoid activation)
If the true label is 1 (the positive class):
- If we predict ŷ = 1.0 (certain it's positive): perfect, the loss should be 0
- If we predict ŷ = 0.9 (very confident): small loss
- If we predict ŷ = 0.5 (uncertain): moderate loss
- If we predict ŷ = 0.1 (confident it's negative): large loss
- If we predict ŷ = 0.0 (certain it's negative): infinite loss (catastrophically wrong)
The formula that captures this:
L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]
Why this works:
Case 1: y = 1 (the true class is positive), so L = -log(ŷ)
- If ŷ = 1: L = -log(1) = 0 ✓
- If ŷ = 0.5: L = -log(0.5) ≈ 0.69
- If ŷ = 0.1: L = -log(0.1) ≈ 2.30
- If ŷ → 0: L → ∞ (a massive penalty for a confident wrong answer)
Case 2: y = 0 (the true class is negative), so L = -log(1-ŷ)
- If ŷ = 0: L = -log(1) = 0 ✓
- If ŷ = 0.5: L = -log(0.5) ≈ 0.69
- If ŷ = 0.9: L = -log(0.1) ≈ 2.30
- If ŷ → 1: L → ∞
The logarithm creates the right penalty structure: small errors have small losses, but confident mistakes are punished severely.
Why "cross-entropy"? It comes from information theory. Cross-entropy measures the average number of bits needed to encode data from one distribution using another distribution. Here, we're measuring the "distance" between the true distribution (y) and the predicted distribution (ŷ).

Categorical Cross-Entropy (Multiple Classes)
Setup:
- True label: a one-hot encoded vector (e.g., [0, 0, 1, 0, 0] for class 3)
- Predicted: a probability distribution from softmax (e.g., [0.1, 0.2, 0.5, 0.15, 0.05])
Formula:
L = -Σᵢ yᵢ·log(ŷᵢ)
Since y is one-hot (only one element is 1, the rest are 0), this simplifies to:
L = -log(ŷ_true_class)
Example: digit classification (0-9)
- True label: 7 → one-hot: [0,0,0,0,0,0,0,1,0,0]
- Predicted: [0.05, 0.05, 0.1, 0.05, 0.05, 0.05, 0.1, 0.4, 0.1, 0.05]
- Loss = -log(0.4) ≈ 0.916
- If the model had predicted 7 with probability 0.9: loss = -log(0.9) ≈ 0.105 (much better)
Intuition: we only care about the probability assigned to the correct class. The loss increases as this probability decreases.
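The following NumPy sketch (my own) reproduces the two worked examples above: the house-price MSE and the digit-7 cross-entropy.

```python
import numpy as np

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)

def categorical_cross_entropy(y_hat, true_class):
    # With a one-hot target, the loss only depends on the probability of the true class.
    return -np.log(y_hat[true_class])

# House-price example (values in thousands of dollars)
y_true = np.array([200.0, 300.0, 250.0])
y_pred = np.array([210.0, 280.0, 255.0])
print("MSE:", mse(y_pred, y_true))  # 175.0 (i.e., 175M in squared dollars)

# Digit-classification example where the true digit is 7
probs = np.array([0.05, 0.05, 0.1, 0.05, 0.05, 0.05, 0.1, 0.4, 0.1, 0.05])
print("Cross-entropy:", categorical_cross_entropy(probs, 7))  # ~0.916
```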
Choosing the Right Loss Function

Regression (predicting continuous values):
- MSE: the standard choice; penalizes large errors heavily
- MAE: more robust to outliers
- Huber loss: combines the benefits of both (MSE for small errors, MAE for large ones)
Binary classification:
- Binary cross-entropy: the standard choice when using a sigmoid output
Multi-class classification:
- Categorical cross-entropy: when labels are one-hot encoded
- Sparse categorical cross-entropy: when labels are integers (more memory efficient)
Custom loss functions: sometimes you need domain-specific losses. For example:
- Medical diagnosis: false negatives might be more costly than false positives
- Image generation: perceptual losses that compare high-level features, not pixels
- Reinforcement learning: reward-based losses
The loss function is the objective we're optimizing. Choose it carefully; your model will become excellent at minimizing it, for better or worse.

5. Backpropagation: The Learning Algorithm

This step is crucial: it's where the real learning happens.
Our neural network has millions of tiny adjustable numbers called weights. We make a prediction, compare it with the correct answer, and realize we're off. The big question is: how do we tweak those millions of weights to make the next prediction better?
It's not as simple as it sounds. Each weight affects many others, and changing even one can ripple through the entire network. Should we increase it or decrease it? And by how much?
That's where backpropagation comes in: a beautifully systematic way to figure out exactly how every single weight should change to reduce the overall error.
To really grasp what's happening here, you'll need a bit of comfort with calculus, especially with derivatives and how small changes in one variable affect another.

The Core Insight: The Chain Rule of Calculus

Everything in backpropagation stems from one calculus concept: the chain rule.
Simple example: if z = f(y) and y = g(x), then:
dz/dx = (dz/dy) · (dy/dx)
In words: the rate of change of z with respect to x equals the rate of change of z with respect to y, multiplied by the rate of change of y with respect to x.
This might seem abstract, so let's make it concrete.

Concrete Example: A Tiny Network

Architecture:
- One input: x = 2
- One weight: w = 3
- One bias: b = 1
- Activation: ReLU
- True output: y = 15
Forward pass:
z = wx + b = 3(2) + 1 = 7
a = ReLU(z) = 7
L = (a - y)² = (7 - 15)² = 64
The loss is 64. We want to reduce it. Should we increase or decrease w?
Backward pass (backpropagation):
We need dL/dw (how much does the loss change when we change w?).
Using the chain rule:
dL/dw = (dL/da) · (da/dz) · (dz/dw)
Let's calculate each piece.
Step 1: dL/da (how does the loss change with the activation?)
L = (a - y)²
dL/da = 2(a - y) = 2(7 - 15) = -16
Step 2: da/dz (how does the activation change with the pre-activation?)
a = ReLU(z) = max(0, z)
For z > 0: da/dz = 1; for z ≤ 0: da/dz = 0
Since z = 7 > 0: da/dz = 1
Step 3: dz/dw (how does the pre-activation change with the weight?)
z = wx + b
dz/dw = x = 2
Combine them:
dL/dw = (dL/da) · (da/dz) · (dz/dw) = (-16) · (1) · (2) = -32
Interpretation: the gradient is -32. This means:
- If we increase w by a tiny amount, the loss will decrease by approximately 32 times that amount
- The negative sign tells us to increase w (move opposite to the gradient)
- The magnitude (32) tells us how sensitive the loss is to changes in w
Update the weight:
w_new = w_old - learning_rate · (dL/dw)
w_new = 3 - 0.01 · (-32) = 3 + 0.32 = 3.32
We've just learned! The network adjusted its weight to reduce the loss.
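Here is a tiny sketch of my own for this single-neuron example: one forward pass, the chain-rule gradient, and a few update steps so you can watch the loss fall.

```python
def forward(w, x=2.0, b=1.0):
    z = w * x + b
    a = max(0.0, z)  # ReLU
    return z, a

w, x, b, y, lr = 3.0, 2.0, 1.0, 15.0, 0.01
for step in range(5):
    z, a = forward(w)
    loss = (a - y) ** 2
    # Chain rule: dL/dw = dL/da * da/dz * dz/dw
    dL_da = 2.0 * (a - y)
    da_dz = 1.0 if z > 0 else 0.0
    dz_dw = x
    dL_dw = dL_da * da_dz * dz_dw
    w -= lr * dL_dw  # gradient descent update
    print(f"step {step}: loss={loss:.2f}, grad={dL_dw:+.2f}, w={w:.4f}")
```

The first iteration reproduces the numbers above (loss 64, gradient -32, updated weight 3.32), and each subsequent step shrinks the loss further.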
Scaling to Deep Networks

In real networks with many layers, we calculate gradients layer by layer, moving backward from the output.
Example: a 3-layer network
Forward pass:
Layer 1: z¹ = W¹x + b¹, a¹ = ReLU(z¹)
Layer 2: z² = W²a¹ + b², a² = ReLU(z²)
Layer 3: z³ = W³a² + b³, ŷ = softmax(z³)
Loss: L = CrossEntropy(ŷ, y)
Backward pass:
Layer 3 (output layer):
dL/dz³ = ŷ - y [derivative of softmax + cross-entropy]
dL/dW³ = (dL/dz³) · a²ᵀ
dL/db³ = dL/dz³
dL/da² = W³ᵀ · (dL/dz³) [pass the gradient to the previous layer]
Layer 2:
dL/dz² = (dL/da²) ⊙ ReLU'(z²) [⊙ is element-wise multiplication]
dL/dW² = (dL/dz²) · a¹ᵀ
dL/db² = dL/dz²
dL/da¹ = W²ᵀ · (dL/dz²)
Layer 1:
dL/dz¹ = (dL/da¹) ⊙ ReLU'(z¹)
dL/dW¹ = (dL/dz¹) · xᵀ
dL/db¹ = dL/dz¹
Notice the pattern:
1. Calculate the gradient with respect to the pre-activation (z)
2. Calculate the gradient for the weights: dL/dW = (dL/dz) · inputᵀ
3. Calculate the gradient for the bias: dL/db = dL/dz
4. Pass the gradient backward: dL/d(previous_activation) = Wᵀ · (dL/dz)

Why "Backpropagation"?

Because we propagate gradients backward through the network, from output to input. Each layer receives the gradient from the layer ahead, computes its own gradients, and passes gradients to the layer behind.

The Vanishing Gradient Problem

A fundamental issue in deep networks: when we multiply many small numbers (gradients) together through many layers, the product can become vanishingly small, approaching zero.
Example: if each layer has a gradient of 0.1, after 10 layers:
0.1¹⁰ = 0.0000000001
The early layers receive essentially zero gradient and stop learning. The network is deep, but only the last few layers are actually training.
Solutions:
- ReLU activation: the gradient is 1 for positive inputs (it doesn't shrink)
- Residual connections: skip connections that allow gradients to bypass layers
- Batch normalization: keeps activations in a healthy range
- Careful initialization: start with weights that don't lead to extreme activations

The Exploding Gradient Problem

The opposite issue: gradients grow exponentially.
If each layer has a gradient of 2, after 10 layers:
2¹⁰ = 1024
Weights update by huge amounts, causing wild oscillations and instability. The model never converges.
Solutions:
- Gradient clipping: cap gradients at a maximum value
- Careful initialization: start with smaller weights
- Batch normalization: stabilizes the scale of activations and gradients
- Lower learning rates: smaller update steps

Computational Efficiency: Why Backpropagation is Brilliant

A naive approach to finding gradients: for each weight, we could
1. Make a tiny change: w → w + ε
2. Recalculate the entire loss
3. Compute (L_new - L_old) / ε
For a network with 1 million weights, this requires 1 million forward passes. Computationally prohibitive.
Backpropagation's insight: calculate all gradients in a single backward pass by reusing intermediate calculations. For N weights, we need:
- 1 forward pass
- 1 backward pass
That's it. Backpropagation computes all million gradients with just two passes through the network. This is why deep learning became practical.

The Mathematics: Derivatives of Common Components

ReLU:
f(x) = max(0, x)
f'(x) = 1 if x > 0, else 0
Sigmoid:
σ(x) = 1/(1 + e^(-x))
σ'(x) = σ(x)(1 - σ(x))
Tanh:
tanh(x) = (e^x - e^(-x))/(e^x + e^(-x))
tanh'(x) = 1 - tanh²(x)
Softmax + cross-entropy (combined):
dL/dz = ŷ - y
This remarkably simple gradient is why we use softmax with cross-entropy.
MSE:
L = (ŷ - y)²
dL/dŷ = 2(ŷ - y)

Memory Requirements

Backpropagation requires storing all activations from the forward pass to compute gradients in the backward pass.
For a network with:
- Batch size: 32
- 4 layers with 1,000 neurons each
we must store 32 × 4 × 1,000 = 128,000 activation values in memory.
This is why training large models requires substantial GPU memory, and why techniques like gradient checkpointing (recomputing some activations rather than storing them) become necessary.

6. Gradient Descent: The Optimization Algorithm

Imagine you're standing on a mountain in thick fog. You can't see the bottom of the valley, but you can feel the slope beneath your feet. Your goal: reach the lowest point.
Strategy: take a step in the direction of steepest descent.
This is gradient descent. The "mountain" is the loss landscape, a high-dimensional surface where each dimension represents one weight and the height represents the loss.

The Mathematical Foundation

After backpropagation, we have gradients: ∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ
Each gradient tells us:
- Direction: a positive gradient means the loss increases when the weight increases
- Magnitude: a large gradient means the weight strongly affects the loss
Gradient descent update rule:
w_new = w_old - α · (∂L/∂w)
where α (alpha) is the learning rate.
Why subtract? The gradient points in the direction of increasing loss. We want to decrease the loss, so we move in the opposite direction (the negative gradient).

The Learning Rate: The Most Critical Hyperparameter

The learning rate controls the step size. Choosing it is both an art and a science.
Too large (α = 1.0):
Iteration 1: Loss = 100
Iteration 2: Loss = 250 [overshot the minimum]
Iteration 3: Loss = 80
Iteration 4: Loss = 300 [wild oscillations]
...never converges
Too small (α = 0.000001):
Iteration 1: Loss = 100.00
Iteration 2: Loss = 99.99
Iteration 3: Loss = 99.98
...painfully slow, and it might get stuck in a local minimum
Just right (α = 0.01):
Iteration 1: Loss = 100
Iteration 2: Loss = 85
Iteration 3: Loss = 73
...steady progress toward the minimum
Typical ranges:
- Small networks: 0.001-0.01
- Large networks: 0.0001-0.001
- With the Adam optimizer: 0.001 (default)

Variants of Gradient Descent

1. Batch Gradient Descent
Approach: use the entire dataset to compute one gradient update.

```python
for epoch in range(num_epochs):
    # Compute gradient using ALL training samples
    gradient = compute_gradient(all_data)
    weights = weights - learning_rate * gradient
```

Pros:
- Smooth convergence
- Guaranteed to find the minimum (for convex functions)
Cons:
- Slow: one update per epoch
- Memory intensive: must load the entire dataset
- Gets stuck in local minima (for non-convex functions)

2. Stochastic Gradient Descent (SGD)
Approach: use one random sample at a time.

```python
for epoch in range(num_epochs):
    shuffle(data)
    for sample in data:
        # Compute gradient using ONE sample
        gradient = compute_gradient(sample)
        weights = weights - learning_rate * gradient
```

Pros:
- Fast updates: one update per sample
- Can escape local minima (due to noise)
- Memory efficient
Cons:
- Noisy updates: the path to the minimum is erratic
- Doesn't fully utilize parallel computing (GPUs)
- May oscillate around the minimum without settling
3. Mini-Batch Gradient Descent (Most Common)
Approach: use a small batch of samples (typically 32, 64, 128, or 256).

```python
for epoch in range(num_epochs):
    shuffle(data)
    for batch in create_batches(data, batch_size=32):
        # Compute gradient using a BATCH of samples
        gradient = compute_gradient(batch)
        weights = weights - learning_rate * gradient
```

Pros:
- Balanced: more stable than SGD, faster than batch GD
- Efficient: perfect for GPU parallelization
- Moderate memory usage
- Noise helps escape local minima, but not too much
Cons:
- Another hyperparameter to tune (batch size)
This is the standard in modern deep learning.

Advanced Optimizers: Beyond Basic Gradient Descent

Basic gradient descent treats all parameters equally and uses a fixed learning rate. Modern optimizers are more sophisticated.

Momentum
Problem with basic GD: imagine a narrow valley with steep sides and a gentle slope toward the minimum. Basic GD oscillates between the sides while slowly progressing forward.
Solution: momentum.

```python
velocity = 0
for iteration in range(num_iterations):
    gradient = compute_gradient()
    velocity = beta * velocity - learning_rate * gradient
    weights = weights + velocity
```

Intuition: remember previous gradients. If we keep going in the same direction, accelerate. If we oscillate, dampen the movement.
Effect:
- Faster convergence in consistent directions
- Reduced oscillations
- Can roll through small local minima
Typical β: 0.9 (use 90% of the previous velocity)

RMSprop (Root Mean Square Propagation)
Problem: some parameters need large updates, others need small ones. A single learning rate is suboptimal.
Solution: adapt the learning rate for each parameter based on recent gradient magnitudes.

```python
squared_gradient_avg = 0
for iteration in range(num_iterations):
    gradient = compute_gradient()
    squared_gradient_avg = beta * squared_gradient_avg + (1 - beta) * gradient**2
    adjusted_gradient = gradient / (sqrt(squared_gradient_avg) + epsilon)
    weights = weights - learning_rate * adjusted_gradient
```

Intuition:
- Parameters with consistently large gradients get smaller effective learning rates (divided by a large number)
- Parameters with small gradients get larger effective learning rates (divided by a small number)
Effect: each parameter gets its own adaptive learning rate.

Adam (Adaptive Moment Estimation)
The gold standard: combines momentum and RMSprop.

```python
m = 0  # first moment (momentum)
v = 0  # second moment (RMSprop)
for t in range(1, num_iterations + 1):
    gradient = compute_gradient()
    # Update moments
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient**2
    # Bias correction (important in early iterations)
    m_corrected = m / (1 - beta1**t)
    v_corrected = v / (1 - beta2**t)
    # Update weights
    weights = weights - learning_rate * m_corrected / (sqrt(v_corrected) + epsilon)
```

Why Adam dominates:
- Combines the best of both worlds: momentum plus adaptive learning rates
- Robust to hyperparameter choices (the default values work well)
- Efficient and converges quickly
- Works across diverse problem types
Default hyperparameters:
- learning_rate = 0.001
- β₁ = 0.9 (momentum)
- β₂ = 0.999 (RMSprop)
- ε = 1e-8 (numerical stability)
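To see these pieces run end to end, here is a self-contained sketch of my own that applies the Adam update above to minimize a simple quadratic, f(w) = (w - 3)²:

```python
import math

def grad(w):
    # Gradient of f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, lr = 0.0, 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8
m, v = 0.0, 0.0

for t in range(1, 101):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1**t)  # bias-corrected first moment
    v_hat = v / (1 - beta2**t)  # bias-corrected second moment
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    if t % 25 == 0:
        print(f"step {t:3d}: w = {w:.4f}, f(w) = {(w - 3.0) ** 2:.6f}")

# w moves toward 3, the minimizer of f.
```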
Learning Rate Schedules

Even with Adam, learning rates can be adjusted during training.
1. Step decay:
Epochs 1-30: lr = 0.001
Epochs 31-60: lr = 0.0001
Epochs 61+: lr = 0.00001
Why: start with larger steps to quickly find the general region, then smaller steps to fine-tune.
2. Exponential decay:
lr(t) = lr₀ · e^(-kt)
Smoothly decreases the learning rate over time.
3. Cosine annealing:
lr(t) = lr_min + 0.5 · (lr_max - lr_min) · (1 + cos(πt/T))
Gradually reduces the learning rate following a cosine curve.
4. Warm restarts:
Periodically reset the learning rate to its initial value. This helps escape local minima by occasionally taking large steps again.
5. Learning rate warmup:
Start with a very small learning rate and gradually increase it to the target value over the first few epochs. This prevents instability early in training.

The Convergence Question: When to Stop?

The training loss keeps decreasing, but should we keep training?

Early Stopping
Concept: monitor performance on a validation set (data the model hasn't trained on).
Epoch 1: Train Loss = 2.5, Val Loss = 2.6
Epoch 5: Train Loss = 1.2, Val Loss = 1.3
Epoch 10: Train Loss = 0.8, Val Loss = 0.9
Epoch 15: Train Loss = 0.4, Val Loss = 0.85 [val loss has bottomed out]
Epoch 20: Train Loss = 0.2, Val Loss = 0.9 [val loss increasing!]
Stop around epoch 15: the model is starting to overfit (memorizing the training data rather than learning generalizable patterns).
Implementation:

```python
best_val_loss = float("inf")
patience = 5  # epochs to wait for improvement
patience_counter = 0

for epoch in range(num_epochs):
    train()
    val_loss = validate()
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_model()
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print("Early stopping!")
            break
```

Challenges in the Optimization Landscape

Local minima: the loss surface has multiple valleys. Gradient descent might settle into a shallow local minimum instead of the deep global minimum.
Solutions:
- Momentum (can roll over small bumps)
- Multiple random initializations
- Stochastic updates (noise helps escape)
Saddle points: points where the gradient is zero but which are neither a minimum nor a maximum, a "saddle" shape. They are more common than local minima in high dimensions.
Solutions:
- Momentum helps push through
- Second-order methods (Newton's method)
Plateaus: flat regions where gradients are nearly zero. Progress stalls.
Solutions:
- Adaptive learning rates (Adam)
- Patience (eventually gradients increase again)

Batching and Parallelization

Why batches matter for GPUs: modern GPUs have thousands of cores. Computing gradients for 32 samples independently is slow; computing them in parallel is fast.
Matrix operations on batches:
- Input batch: [32 × 784] (32 images, 784 pixels each)
- Weights: [784 × 128]
- Output: [32 × 128] (32 outputs, 128 neurons)
A single matrix multiplication computes all 32 samples simultaneously. This is why GPUs are essential for deep learning.
Batch size trade-offs:
Small batches (e.g., 8-32):
- More frequent updates
- More noise (helps generalization)
- Less memory
- Slower per epoch
Large batches (e.g., 256-1024):
- Fewer updates per epoch
- Smoother gradients
- More memory required
- Faster per epoch
- Risk of poor generalization (too smooth)
Sweet spot: usually 32-128 for most applications.

The Complete Training Loop: Putting It All Together

Now we understand all the pieces. Here's how they work together.

Initialization

```python
# Initialize weights (Xavier/He initialization)
for layer in network:
    layer.weights = random_normal(0, sqrt(2 / n_inputs))
    layer.biases = zeros()

# Initialize optimizer
optimizer = Adam(learning_rate=0.001)
```

Why careful initialization matters:
- Too large: exploding activations and gradients
- Too small: vanishing gradients
- Xavier/He initialization: scaled to maintain activation variance across layers
The Training Loop

```python
for epoch in range(num_epochs):
    # Shuffle data for randomness
    shuffle(training_data)

    for batch in create_batches(training_data, batch_size=32):
        # 1. FORWARD PROPAGATION
        x, y_true = batch
        z1 = W1 @ x + b1
        a1 = relu(z1)
        z2 = W2 @ a1 + b2
        a2 = relu(z2)
        z3 = W3 @ a2 + b3
        y_pred = softmax(z3)

        # 2. COMPUTE LOSS
        loss = cross_entropy(y_pred, y_true)

        # 3. BACKPROPAGATION
        dL_dz3 = y_pred - y_true
        dL_dW3 = dL_dz3 @ a2.T
        dL_db3 = sum(dL_dz3, axis=0)
        dL_da2 = W3.T @ dL_dz3
        dL_dz2 = dL_da2 * relu_derivative(z2)
        dL_dW2 = dL_dz2 @ a1.T
        dL_db2 = sum(dL_dz2, axis=0)
        dL_da1 = W2.T @ dL_dz2
        dL_dz1 = dL_da1 * relu_derivative(z1)
        dL_dW1 = dL_dz1 @ x.T
        dL_db1 = sum(dL_dz1, axis=0)

        # 4. OPTIMIZATION (using Adam)
        W3, b3 = optimizer.update(W3, b3, dL_dW3, dL_db3)
        W2, b2 = optimizer.update(W2, b2, dL_dW2, dL_db2)
        W1, b1 = optimizer.update(W1, b1, dL_dW1, dL_db1)

    # 5. VALIDATION
    val_loss = evaluate(validation_data)
    print(f"Epoch {epoch}: Train Loss = {loss:.4f}, Val Loss = {val_loss:.4f}")

    # 6. EARLY STOPPING CHECK
    if should_stop(val_loss):
        break

# 7. FINAL EVALUATION
test_accuracy = evaluate(test_data)
print(f"Final Test Accuracy: {test_accuracy:.2%}")
```

What Happens Over Time

Epoch 1:
- Weights are random
- Predictions are terrible (10% accuracy on 10 classes, i.e., random guessing)
- Loss is high (maybe 2.3)
- Large gradients
- Big weight updates
Epoch 10:
- The network has learned basic patterns
- Accuracy improved to 60%
- Loss decreased to 1.2
- Moderate gradients
- Steady learning
Epoch 50:
- The network has refined its understanding
- Accuracy at 92%
- Loss at 0.3
- Small gradients
- Fine-tuning of details
Epoch 100:
- Diminishing returns
- Accuracy 93% (validation starting to plateau)
- Risk of overfitting
- Time to stop

Monitoring Training: What to Watch

1. Training loss
- Should decrease steadily
- If it fluctuates wildly: the learning rate is too high
- If it barely moves: the learning rate is too low, or the optimizer is stuck in a minimum
2. Validation loss
- Should track the training loss initially
- If it diverges: overfitting
- If it is much higher from the start: a train/validation data distribution mismatch
3. Gradient norms
- Should be moderate (0.001-1.0)
- If very small (< 0.0001): vanishing gradients
- If very large (> 10): exploding gradients
4. Activation statistics
- The mean should be near zero
- The standard deviation should be moderate (~1)
- If activations saturate (all 0 or all at the maximum): an architectural problem
5. Learning rate
- Can be adjusted based on progress
- Too aggressive: divergence
- Too conservative: slow progress

Conclusion: The Symphony of Learning

Machine learning is not one algorithm; it's a carefully orchestrated system:
1. Architecture provides the capacity to represent complex functions (Universal Approximation Theorem)
2. Activation functions enable non-linear transformations
3. Forward propagation generates predictions
4. Loss functions quantify error
5. Backpropagation computes gradients efficiently
6. Gradient descent iteratively improves the weights
Each component is essential. Remove any one, and learning fails.
The beauty lies in the simplicity of each piece and the power of their combination. From these building blocks (matrix multiplications, non-linear functions, derivatives, and iterative updates) emerges the capability to:
- Recognize faces in photos
- Translate between languages
- Generate realistic images
- Play games at superhuman levels
- Predict protein structures
- Drive cars autonomously
All from the same fundamental algorithm, repeated billions of times, gradually sculpting random weights into a representation of the world's patterns.
This is how machines learn: not through magic, but through mathematics, iteration, and the elegant interplay of calculus and optimization across high-dimensional spaces.

Over the past few months I've been building a fully open-source voice agent, exploring the stack end to end and learning a ton along the way. Now I'm ready to share everything I discovered.

The best part? In 2025 you actually can build one yourself. With today's open-source models and frameworks you can piece together a real-time voice agent that listens, reasons, and talks back almost like a human, without relying on closed platforms.

Let's walk through the building blocks, step by step.

The Core Pipeline

At a high level, a modern voice agent looks like this: [overview diagram]
Pretty simple on paper, but each step has its own challenges. Let's dig deeper.

Speech-to-Text (STT)

Speech is a continuous audio wave; it doesn't naturally have clear sentence boundaries or pauses. That's where Voice Activity Detection (VAD) comes in.
VAD (Voice Activity Detection) detects when the user starts and stops talking. Without it, your bot either cuts you off too soon or stares at you blankly. Once the boundaries are clear, the audio is passed to an STT model for transcription.
Silero VAD is the gold standard, and Pipecat has built-in support for it, so that's what I chose:
- Sub-1ms per chunk on CPU
- Just 2MB in size
- Handles 6,000+ languages
- Works with 8kHz and 16kHz audio
- MIT license (unrestricted use)

Popular STT Options

What should we focus on when choosing an STT model for a voice agent?
Accuracy:
- Word Error Rate (WER): measures transcription mistakes (lower is better). Example: a WER of 5% means 5 mistakes per 100 words.
- Sentence-level correctness: some models get individual words right but fail on sentence structure.
- Multilingual support: if your users speak multiple languages, check language coverage.
- Noise tolerance: can it handle background noise, music, or multiple speakers?
- Accent and voice variation handling: works across accents, genders, and speech speeds.
Voice Activity Detection (VAD) integration: detects when speech starts and ends.
Streaming: most STT models work in batch mode (great for YouTube captions, bad for live conversations). For real-time agents, we need streaming output: words should appear while you're still speaking.
Low latency: even 300-500ms delays feel unnatural. Target sub-second responses.

Whisper often comes to mind first when discussing speech-to-text because it has a large community, numerous variants, and is backed by OpenAI.
OpenAI Whisper family:
- Whisper Large V3: state-of-the-art accuracy with multilingual support
- Faster-Whisper: optimized implementation using CTranslate2
- Distil-Whisper: lightweight for resource-constrained environments
- WhisperX: enhanced timestamps and speaker diarization
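For a quick feel of the Whisper family, here's a minimal transcription sketch using the faster-whisper package; the model size, device settings, and audio file name are illustrative choices of mine, not from the original setup:

```python
# pip install faster-whisper
from faster_whisper import WhisperModel

# "large-v3" gives the best accuracy; "small" or "base" run comfortably on CPU.
model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe("sample.wav", vad_filter=True)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:6.2f}s -> {seg.end:6.2f}s] {seg.text.strip()}")
```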
NVIDIA also offers some interesting STT models, though I haven't tried them yet since Whisper works well for my use case. I'm just listing them here for you to explore:
- Canary Qwen 2.5B: leading performance, 5.63% WER
- Parakeet TDT 0.6B V2: ultra-fast inference (3,386 RTFx)
(A comparison table of these models appears in the original post.)

Why I Chose FastWhisper

After testing, my pick is FastWhisper, an optimized inference engine for Whisper.
Key advantages:
- 12.5× faster than original Whisper
- 3× faster than Faster-Whisper with batching
- Sub-200ms latency possible with proper tuning
- Same accuracy as Whisper
- Runs on CPU and GPU with automatic fallback
It's built in C++ with CTranslate2, supports batching, and integrates neatly with VAD.
For more, check the Speech to Text AI Model & Provider Leaderboard.

Large Language Model (LLM)

Once speech is transcribed, the text goes into an LLM, the "brain" of your agent.
What we want in an LLM for voice agents:
- Understands prompts, history, and context
- Generates responses quickly
- Supports tool calls (search, RAG, memory, APIs)
Leading open-source LLMs:
Meta Llama family:
- Llama 3.3 70B: open-source leader
- Llama 3.2 (1B, 3B, 11B): scaled for different deployments
- 128K context window: remembers long conversations
- Tool calling support: built-in function execution
Others:
- Mistral 7B / Mixtral 8x7B: efficient and competitive
- Qwen 2.5: strong multilingual support
- Google Gemma: lightweight but solid
My choice: Llama 3.3 70B Versatile. Why?
- Large context window keeps conversations coherent
- Tool use built in
- Widely supported in the open-source community

Text-to-Speech (TTS)

Now the agent needs to speak back, and this is where quality can make or break the experience.
A poor TTS voice instantly ruins immersion. The key requirements are:
- Low latency: avoid awkward pauses
- Natural speech: no robotic tone
- Streaming output: start speaking mid-sentence

Open-Source TTS Models I've Tried

There are plenty of open-source TTS models available. Here's a snapshot of the ones I experimented with:
- Kokoro-82M: lightweight, #1 on the Hugging Face TTS Arena, blazing fast
- Chatterbox: built on Llama, fast inference, rising adoption
- XTTS-v2: zero-shot voice cloning, 17 languages, streaming support
- FishSpeech: natural dialogue flow
- Orpheus: scales from 150M to 3B parameters
- Dia: a TTS model capable of generating ultra-realistic dialogue in one pass

Why I Chose Kokoro-82M

Key advantages:
- 5-15× smaller than competing models while maintaining high quality
- Runs under 300MB, making it edge-device friendly
- Sub-300ms latency
- High-fidelity 24kHz audio
- Streaming-first design for natural conversation flow
Limitations:
- No zero-shot voice cloning (uses a fixed voice library)
- Less expressive than XTTS-v2
- Relatively new model with a smaller community
You can also check out my minimal Kokoro-FastAPI server to experiment with it.

Speech-to-Speech Models

Speech-to-Speech (S2S) models represent an exciting advancement in AI, combining speech recognition, language understanding, and text-to-speech synthesis into a single, end-to-end pipeline. These models allow natural, real-time conversations by converting speech input directly into speech output, reducing latency and minimizing intermediate processing steps.
Some notable models in this space include:
- Moshi: developed by Kyutai Labs, Moshi is a state-of-the-art speech-text foundation model designed for real-time full-duplex dialogue. Unlike traditional voice agents that process ASR, LLM, and TTS separately, Moshi handles the entire flow end to end.
- CSM (Conversational Speech Model): a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs.
Text-to-Speech (TTS)

Now the agent needs to speak back, and this is where quality can make or break the experience.

A poor TTS voice instantly ruins immersion. The key requirements are:

- Low latency: avoid awkward pauses
- Natural speech: no robotic tone
- Streaming output: start speaking mid-sentence

Open-Source TTS Models I've Tried

There are plenty of open-source TTS models available. Here's a snapshot of the ones I experimented with:

- Kokoro-82M: Lightweight, #1 on the HuggingFace TTS Arena, blazing fast
- Chatterbox: Built on Llama, fast inference, rising adoption
- XTTS-v2: Zero-shot voice cloning, 17 languages, streaming support
- FishSpeech: Natural dialogue flow
- Orpheus: Scales from 150M to 3B
- Dia: A TTS model capable of generating ultra-realistic dialogue in one pass

Why I Chose Kokoro-82M

Key advantages:

- 5-15x smaller than competing models while maintaining high quality
- Runs under 300MB: edge-device friendly
- Sub-300ms latency
- High-fidelity 24kHz audio
- Streaming-first design for natural conversation flow

Limitations:

- No zero-shot voice cloning (uses a fixed voice library)
- Less expressive than XTTS-v2
- Relatively new model with a smaller community

You can also check out my minimal Kokoro-FastAPI server to experiment with it.

Speech-to-Speech Models

Speech-to-Speech (S2S) models combine speech recognition, language understanding, and speech synthesis into a single end-to-end pipeline. They allow natural, real-time conversations by converting speech input directly into speech output, reducing latency and minimizing intermediate processing steps.

Some notable models in this space:

- Moshi: Developed by Kyutai Labs, Moshi is a state-of-the-art speech-text foundation model designed for real-time full-duplex dialogue. Unlike traditional voice agents that run ASR, LLM, and TTS separately, Moshi handles the entire flow end to end.
- CSM (Conversational Speech Model): A speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The architecture uses a Llama backbone and a smaller audio decoder that produces Mimi audio codes.
- VALL-E and VALL-E X (Microsoft): Support zero-shot voice conversion and speech-to-speech synthesis from limited voice samples.
- AudioLM (Google Research): Leverages language modeling on audio tokens to generate high-quality speech continuation and synthesis.

Among these, I've primarily worked with Moshi. I've implemented it on a FastAPI server with streaming support, which lets you test and interact with it in real time. You can explore the implementation here: FastAPI + Moshi GitHub.

Framework (The Glue)

Finally, you need something to tie all the pieces together: streaming audio, message passing, and orchestration.

Open-Source Frameworks

Pipecat
- Purpose-built for voice-first agents
- Streaming-first (ultra-low latency)
- Modular design: swap models easily
- Active community

Vocode
- Developer-friendly, good docs
- Direct telephony integration
- Smaller, less active community

LiveKit Agents
- Based on WebRTC
- Supports voice, video, and text
- Self-hosting options

Traditional Orchestration
- LangChain: great for docs, weak at streaming
- LlamaIndex: RAG-focused, not optimized for voice
- Custom builds: total control, but high overhead

Why I Recommend Pipecat

Voice-centric features:
- Streaming-first, frame-based pipeline (TTS can start before the text is done)
- Smart Turn Detection v2 (intonation-aware)
- Built-in interruption handling

Production ready:
- Sub-500ms latency achievable
- Efficient for long-running agents
- Excellent docs and examples
- Strong, growing community

Real-world performance:
- ~500ms voice-to-voice latency in production
- Works with Twilio and phone systems
- Supports multi-agent orchestration
- Scales to thousands of concurrent users

Lead-in to the Next Part

In this first part, we've covered the core tech stack and models needed to build a real-time voice agent.

In the next part of the series, we'll dive into integration with Pipecat, explore our voice architecture, and walk through deployment strategies. Later, we'll show how to enhance your agent with RAG (Retrieval-Augmented Generation), memory features, and other advanced capabilities to make your voice assistant truly intelligent.

Stay tuned: the next guide will turn all these building blocks into a working, real-time voice agent you can actually deploy.

I've created a GitHub repository, VoiceAgentGuide, for this series, where we keep our notes and related resources. Don't forget to check it out and share your feedback. Feel free to contribute or add missing content by submitting a pull request (PR).

Resources

Voice AI & Voice Agents: An Illustrated Primer

While browsing YouTube, I stumbled across a video titled "This Book Changed How I Think About AI." Curious, I clicked, and it introduced me to Empire of AI by Karen Hao, a book that dives deep into the evolution of OpenAI.

The book explores OpenAI's history, its culture of secrecy, and its almost single-minded pursuit of artificial general intelligence (AGI). Drawing on interviews with more than 260 people, along with correspondence and internal documents, Hao paints a revealing picture of the company.

After reading it, I uncovered 12 particularly fascinating facts about OpenAI that most people don't know. Let's dive in.

1. The "Open" in OpenAI Was More Branding Than Belief

The name sounds noble; who doesn't like the idea of "open" AI? But here's the catch: from the very beginning, openness was more narrative than commitment. Founders Sam Altman, Greg Brockman, and Elon Musk leaned into it because it helped them stand out. Behind closed doors, though, cofounder Ilya Sutskever was already suggesting they could scale it back once the story had served its purpose. In other words: open, until it wasn't convenient.

2. Elon Musk's Billion-Dollar Promise? Mostly Smoke and Mirrors

Remember Musk's flashy $1 billion funding pledge? Turns out, OpenAI only ever saw about $130 million of it, and less than $45 million came directly from Musk himself. His back-and-forth on funding almost pushed the organization into crisis, forcing Altman to hunt down new sources of money.

3. The For-Profit Shift Was More About Survival Than Vision

In 2019, OpenAI unveiled its "capped-profit" structure, pitching it as an innovative way to balance mission and money. But the truth is far less glamorous: the nonprofit model wasn't bringing in the billions needed to compete with tech giants. At one point, Brockman and Sutskever even discussed merging with a chip startup. Creating OpenAI LP wasn't a bold vision; it was a lifeline.

4. The "Capped-Profit" Model Looked Unlimited to Critics

Investors were told their returns would be capped at 100x. Sounds responsible, right? But do the math: a $10 million check could still turn into a $1 billion payout. Critics quickly called it "basically unlimited," arguing the cap only looked meaningful until you saw the actual numbers.

5. GPT-2's "Too Dangerous" Storyline Was a PR Masterstroke

In 2019, OpenAI said its GPT-2 model was so powerful it had to be withheld for safety reasons. Headlines exploded. But here's the twist: many researchers thought the risk claims were overblown and saw the whole thing as a publicity stunt engineered by Jack Clark, OpenAI's communications chief at the time. The stunt worked: the company was suddenly everywhere.

6. OpenAI's Culture Had Clashing "Tribes"

Inside OpenAI, things weren't exactly harmonious. Sam Altman himself described the organization as divided into three factions: research explorers, safety advocates, and startup-minded builders. He even warned of "tribal warfare" if they couldn't pull together. That's not just workplace tension; it's a sign of deep conflict over the company's direction.

7. ChatGPT's Global Debut Was Basically an Accident

Think ChatGPT's launch was carefully choreographed? Not at all. The product that made OpenAI a household name was released in just two weeks as a "research preview," right after Thanksgiving 2022. The rush was partly to get ahead of a rumored chatbot from Anthropic. Even Microsoft, OpenAI's biggest partner, was caught off guard and reportedly annoyed.
8. Training Data Included Pirated Books and YouTube Videos

Where do you get enough data to train something like GPT-3 or GPT-4? In OpenAI's case, by scraping almost everything it could. GPT-3 used a secret dataset nicknamed "Books2," which reportedly included pirated works from Library Genesis. GPT-4 went even further, with employees transcribing YouTube videos and scooping up anything online without explicit "do not scrape" warnings.

9. "AI Safety" Initially Ignored Social Harms

OpenAI loves to talk about AI safety now. But early on, executives resisted calls to broaden the term to include real-world harms like discrimination and bias. When pressed, one leader bluntly said, "That's not our role." The message was clear: safety meant existential risks, not everyday impacts.

10. Scaling Up Came with Hidden Environmental Costs

Bigger models require more compute and more resources. Training GPT-4 in Microsoft's Iowa data centers consumed roughly 11.5 million gallons of water in a single month, during a drought. Strikingly, Altman and other leaders reportedly never discussed these environmental costs in company-wide meetings.

11. "SummerSafe LP" Had a Dark Inspiration

Before OpenAI LP had its public name, it was secretly incorporated as "SummerSafe LP." The reference? An episode of Rick and Morty in which a car, tasked with keeping Summer safe, resorts to murder and torture. Internally, it was an ironic nod to how AI systems can twist well-meaning goals into dangerous outcomes.

12. Departing Employees Faced Equity Pressure

Leaked documents revealed OpenAI used a hardball tactic with departing employees: sign a strict nondisparagement agreement or risk losing vested equity. This essentially forced people into lifelong silence. Altman later said he didn't know this was happening and was embarrassed, but records show he had signed paperwork granting the company those rights a year earlier.

Final Thoughts

OpenAI's story is anything but straightforward. From broken promises and internal clashes to controversial data practices, the company has often operated in ways that don't match its public messaging. Whether you see that as savvy strategy, messy growing pains, or something more troubling depends on your perspective.

But one thing's clear: the "open" in OpenAI has always been complicated.

This blog was originally published here.

As regular readers of my blog may know, our primary technology stack is the MERN stack: MongoDB, Express, React, and Node.js. On the frontend we use React with TypeScript; on the backend, Node.js with TypeScript; and MongoDB serves as our database.

While this stack has served us well, we encountered significant challenges as our application scaled, particularly around build times, memory usage, and developer experience. In this post, I will outline two key areas where Rust-based tools helped us resolve these issues and substantially improved our team's development velocity.

Improving Frontend Performance

The Problem: Slow Builds and Poor Developer Experience

As our frontend codebase grew, we began facing several recurring issues:

- Local development startup times became painfully slow.
- Build processes consumed large amounts of memory.
- On lower-end machines, builds caused systems to hang or crash.
- Developers regularly raised concerns about delays and performance bottlenecks.

These issues were primarily due to our use of Create React App (CRA) with an ejected Webpack configuration. While powerful, this setup became increasingly inefficient for our scale and complexity.

First Attempt: Migrating to Vite

In search of a solution, I explored Vite, a build tool known for its speed and modern architecture.

Benefits:
- Faster initial load times due to native ES module imports.
- Noticeable improvement in development server startup.

Challenges:
- Migrating from an ejected CRA setup was complex due to custom Webpack configurations.
- Issues arose with lazy-loaded routes, SVG assets, and ESLint/type-checking delays.
- Certain runtime errors occurred during navigation, likely due to missing or incorrect Vite configuration.

Ultimately, while Vite offered some performance benefits, it did not fully resolve our problems and introduced new complications.

Final Solution: Adopting Rspack

After further research, we came across Rspack, a high-performance Webpack-compatible bundler written in Rust. What caught my attention was its focus on performance and ease of migration.

Key advantages of Rspack:
- Significantly faster build times: up to 70% improvement in our case.
- Reduced memory consumption during both build and development.
- Compatibility with existing Webpack plugins and configurations, which simplified migration.
- Designed as a drop-in replacement for Webpack.

After resolving a few initial issues, we successfully integrated Rspack into our frontend build system. The migration resulted in substantial improvements in build speed and developer satisfaction. The system is now in production with no reported issues, and developers are once again comfortable working on the frontend.

Accelerating Backend Testing

The Problem: A Slow Kubernetes-Based Testing Cycle

Our backend uses Kubernetes for deployment and testing. The typical development workflow looked like this:

1. A developer makes code changes.
2. A Docker image is built and pushed to a registry via GitHub Actions.
3. The updated image is deployed to the Kubernetes cluster.
4. Testers verify the changes.

This process, while standard, became inefficient.
Even small changes (such as adding a log statement) required a full image build and redeployment, resulting in delays of 15 minutes or more per test cycle.

Optimization: Runtime Code Sync

To address this, we wrote a shell script that runs whenever a pod starts or restarts: it pulls the latest changes from GitHub and then runs the code.

```sh
# Runs on pod start/restart: sync the working copy to the latest commit on the branch
git reset --hard origin/$BRANCH_NAME
git pull origin $BRANCH_NAME
```

This significantly reduced testing turnaround time for JavaScript-based services.

The TypeScript Bottleneck

For services written in TypeScript, however, the situation was more complex. After pulling the latest code, we needed to transpile TypeScript to JavaScript using tsc or npm run build. Unfortunately, this process:

- Consumed excessive memory.
- Took too long to complete.
- Caused pods to crash, especially in test environments with limited resources.

Solution: Integrating SWC

To solve this, we adopted SWC, a Rust-based TypeScript/JavaScript compiler. Unlike tsc, SWC focuses on speed and performance.

Results after integrating SWC:

- Compilation time dropped to approximately 250 milliseconds.
- Memory usage dropped significantly.
- We could support live code updates without full builds or redeployments.

Because SWC does not perform type checking, we use it only in test environments. This tradeoff allows testers to verify code changes rapidly, without impacting our production pipeline.

Conclusion: Rust's Impact on Team Efficiency

In both our frontend and backend workflows, Rust-based tools (Rspack and SWC) delivered substantial improvements:

- Frontend build times were reduced by more than 70%, with better memory efficiency.
- Testing cycles became significantly faster, especially for TypeScript services.
- Developer experience improved across the board, reducing frustration and increasing velocity.

Rust's performance characteristics, coupled with thoughtful tool design, played a critical role in resolving bottlenecks in our JavaScript-based systems. For teams facing similar challenges, especially around build performance and scalability, we strongly recommend exploring Rust-powered tools.

Hiring has always been one of those tasks that seems easy until you're knee-deep in resumes, trying to remember who did what and, more importantly, what to ask them during the interview.

A few weeks ago, I had this exact moment. I was preparing for an interview and had a resume open in one tab, a notepad in another, and ChatGPT somewhere in the background trying to help me brainstorm questions. That's when the idea hit me:

"Why am I juggling between tabs? What if this entire process could live in a single Chrome extension?"

So I built one. It's called HireZen.

The Problem I Kept Running Into

Every time I had to take an interview, I'd start by opening the candidate's resume. But even after reading it top to bottom, I wasn't always sure:

- What's the best way to dig deeper into their projects?
- Are they really comfortable with the tools they listed?
- What kinds of behavioral or situational questions would be relevant?

I'd often resort to generic questions or spend too much time prepping for a single resume. It felt repetitive and inefficient, and I knew there had to be a better way.

Enter HireZen

HireZen is a Chrome extension that does one simple thing: you upload a resume, and it generates personalized interview questions for you using AI.

That's it. No over-engineering. No login required. Just upload, generate, copy or print, done.

Here's what it currently supports:

- Reads and parses PDF resumes
- Uses LLMs (like GPT-4) to generate questions based on the candidate's experience (a conceptual sketch of this call follows below)
- Lets you print the generated questions or share them with HR

The idea is to take the mental load off interviewers and let AI handle the repetitive thinking.
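The extension itself is plain JavaScript running in the browser, but the core idea is easy to picture. Here's a rough Python sketch of the kind of LLM call it makes; the prompt wording, file names, and model are illustrative, not HireZen's actual code:

```python
# Illustrative sketch (not HireZen's real implementation): generate interview
# questions from extracted resume text via an OpenAI-compatible API.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

resume_text = open("resume.txt").read()  # text already extracted from the PDF

prompt = (
    "You are helping an interviewer prepare.\n"
    "Based on the resume below, write 5 technical questions about the "
    "candidate's projects, 3 questions probing the tools they claim to know, "
    "and 2 behavioral questions.\n\n"
    f"Resume:\n{resume_text}"
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; the extension lets you pick provider and model
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```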
How It Works

By default, when you visit Google Meet, HireZen auto-opens as a sidebar, so you can prep questions while you're on the call.

- Press Ctrl + M to hide or show the extension at any time (toggle view).
- Click the Settings icon to:
  - Choose your LLM provider (OpenAI, Claude, etc.)
  - Enter your API key
  - Select the model you prefer (e.g., GPT-4, GPT-3.5)

Everything is stored securely inside your browser. Once configured, just upload a resume and start generating questions instantly.

[preview screenshots]

Tech Behind It

Initially, I used a GitHub-hosted API to call OpenAI's models. It worked well, but it obviously wasn't scalable for other users. So I added a Settings page where anyone using the extension can:

- Choose their LLM provider (e.g., OpenAI or others)
- Set their preferred model
- Enter their own API key, which is stored securely in the browser (not sent to me or any server)

No backend. No database. Just local storage via Chrome's storage.local API.

It's simple and, more importantly, safe.

On Security

One thing I was very cautious about was handling the API key. I didn't want to store sensitive data anywhere outside the user's browser. So everything (model, provider, key) is stored locally and is only accessible to the extension.

You control your own usage. You bring your own key. I never see it.

What's Next?

This is just the beginning. I'm planning to:

- Add support for exporting question sets as PDF
- Build a small feedback form to help interviewers leave notes
- Eventually list it on the Chrome Web Store

Right now, it's all open and available to try.

Try It Out

Here's the link. It works on Chrome and Chromium-based browsers. Just open it, upload a resume, and let it do the rest.

Why I'm Sharing This

I'm a solo developer. I build things out of curiosity and from real-world pain points I face at work. HireZen is one of those small tools I wish I had earlier, so I built it and put it out there.

If it saves you time or makes your interviews a little smoother, that's all I hoped for.

And hey, if you found it helpful and want to support my work, you can buy me a coffee; it helps me keep building little tools like this and pushing updates.

Thanks for reading!

I love when companies roll out generous free tiers. It feels like they're saying, "Hey, go build your thing, we've got your back." And if you're a student, between jobs, or just tired of racking up charges for every API call (yep, been there too), free tiers can be a total game-changer.

That's exactly why GitHub Models stood out to me: it's like an AI candy shop, completely free to explore, as long as you have a GitHub account.

Here's what's on the shelf:

- OpenAI models like gpt-4o and o3-mini
- Research favorites like Phi and LLaMA
- Multimodal models like llama-vision-instruct
- Embeddings from Cohere and OpenAI
- Plus providers like Mistral, Jamba, Codestral, and more

Oh, and the best part? Many of these models support function calling, making them perfect for agent-style apps.

Now here's the real kicker: GitHub Models speak an OpenAI-compatible API, which means any Python framework that already works with OpenAI's ChatCompletion API just works out of the box.

Example 1: Connecting the openai SDK to GitHub Models

```python
import os
import openai

client = openai.OpenAI(
    api_key=os.environ["GITHUB_TOKEN"],
    base_url="https://models.inference.ai.azure.com"
)
```

Now go ahead and use it like you would with OpenAI:

```python
client.chat.completions.create(...)
```

Simple, clean, no surprises.

Example 2: Running AutoGen with GitHub Models

```python
import os
import autogen_ext.models.openai
import autogen_agentchat.agents

client = autogen_ext.models.openai.OpenAIChatCompletionClient(
    model="gpt-4o",
    api_key=os.environ["GITHUB_TOKEN"],
    base_url="https://models.inference.ai.azure.com"
)

math_teacher = autogen_agentchat.agents.AssistantAgent(
    name="Math_teacher",
    model_client=client,
    system_message="You only teach maths."
)
```

Just like that, your agent is ready to go.

You can plug GitHub Models into tons of other Python libraries too: LangGraph, PydanticAI, LlamaIndex, you name it.

Go build something fun. Happy tinkering!

If you liked this article, learned something, or found it useful, clap away! Each clap burns 0.1 calories, so do 50 and skip leg day. You're welcome.

If you've been following our recent Kubernetes migration blog, you already know the journey has been full of challenges. From configuring pods to tackling networking issues, it's been a rollercoaster. We've explored several tricky problems in previous posts, and today we invite you to put on your detective hat and join us as we investigate another Kubernetes mystery.

The Mysterious Case of NXDomain Errors

Imagine this: you're checking your Kubernetes observability tools, and suddenly you notice something strange: over a million NXDomain errors! What could be causing this? Let's break it down together.

What Are NXDomain Errors?

Before we jump in, let's test your DNS knowledge.

Pop quiz: What does an NXDomain error indicate?

A) A domain exists but is unreachable.
B) A domain doesn't exist.
C) A domain is experiencing high latency.

(Take a moment to think! Scroll down for the answer...)

The answer: If you guessed B) A domain doesn't exist, you're right! These errors occur when a DNS query is made for a non-existent domain.

Unraveling the Clues

We took a closer look at the logs and found something unusual: external domains were mysteriously gaining extra suffixes like .cluster.local or .internal.cloudapp.net. Here are two examples:

- gmail.googleapis.com.cluster.local
- oauth2.googleapis.com.es52e2p4cafzg4m1it5a.bx.internal.cloudapp.net

Now, let's put your troubleshooting skills to the test. What do you think is happening here?

A) These domains are being redirected intentionally.
B) Kubernetes is modifying external domains.
C) A rogue service is interfering with DNS.

(Think about it before scrolling!)

The answer: B) Kubernetes is modifying external domains. But why? Let's find out.

How Kubernetes Handles DNS Queries

To solve this puzzle, we need to understand how Kubernetes resolves DNS queries. When a pod performs a DNS lookup, Kubernetes doesn't always send the request as-is. Instead, it applies search domains and ndots rules to the query.

Here's a fun experiment: try running the following command inside a Kubernetes pod:

```sh
cat /etc/resolv.conf
```

What do you see? You should find an entry for search domains and an ndots value. These settings influence how Kubernetes resolves domain names.

Connecting the Dots

Because the ndots value was set to 5, Kubernetes treated gmail.googleapis.com as an incomplete domain and appended the search domains, turning it into:

- gmail.googleapis.com.svc.cluster.local
- gmail.googleapis.com.cluster.local

These domains don't exist, leading to the dreaded NXDomain errors!

Fixing the Problem

Now that we've cracked the case, let's apply the fix. Here's how you can customize DNS settings to prevent Kubernetes from modifying external domains:

```yaml
apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: dns-example
spec:
  containers:
    - name: test
      image: nginx
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
      - 1.2.3.4
    searches:
      - ns1.svc.cluster-domain.example
      - my.dns.search.suffix
    options:
      - name: ndots
        value: "2"
      - name: edns0
```

The Outcome: A Smooth DNS Experience

By adjusting the DNS configuration, we prevent Kubernetes from mistakenly modifying external queries. This eliminates the NXDomain errors and ensures external services resolve correctly.
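If you want to see the expansion logic for yourself, here's a small Python sketch that mimics what the resolver does with the search and ndots settings. It makes no real DNS queries, and the search list below is an assumed example of a typical pod's /etc/resolv.conf:

```python
# Sketch of resolver search-domain expansion (assumed search list, no real DNS calls).
def candidate_names(name: str, search_domains: list[str], ndots: int = 5) -> list[str]:
    dots = name.rstrip(".").count(".")
    expanded = [f"{name}.{domain}" for domain in search_domains]
    # Fewer dots than ndots: the resolver tries the search-domain expansions
    # first and the literal name only as a last resort.
    return expanded + [name] if dots < ndots else [name] + expanded

search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in candidate_names("gmail.googleapis.com", search, ndots=5):
    print(candidate)
# Every expanded candidate is a non-existent domain, so each one returns
# NXDomain before the real lookup finally succeeds.
```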

Migrating from Heroku to Kubernetes is no small feat. While Heroku provided a straightforward Platform-as-a-Service (PaaS) environment that handled many operational aspects for us, Kubernetes offers greater flexibility, scalability, and control. However, with great power comes great responsibility, and a host of new challenges. In this post, I will share the key lessons we learned during our migration and how we tackled common hurdles along the way.

1. Probe Issues: Cloudflare Errors

After migrating to Kubernetes, users reported Cloudflare error messages. The errors would disappear after refreshing the page, but they kept coming back. Our investigation traced the issue to our deployment configuration: the way our pods were deployed prevented Cloudflare from properly verifying pod health, leading to timeouts.

How We Fixed It

- Implementing pod probes: We added both readiness and liveness probes in Kubernetes to ensure that traffic was only routed to healthy pods.
- Enhancing resilience: This setup enabled Kubernetes to automatically restart unhealthy pods, preventing downtime.

Here is a sample YAML configuration for readiness and liveness probes (selector and labels added so the manifest applies cleanly):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app-container
          image: my-app:latest
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
```

The readiness probe ensures the pod is ready to accept traffic, while the liveness probe restarts it if it becomes unresponsive.

For a more detailed breakdown, check out my full blog post: Sherlock Holmes and the Case of the Cloudflare Timeout Mystery.

2. Retrieving Real User IPs

Another issue we encountered was losing access to real user IP addresses after migrating to Kubernetes. Instead of user IPs, our logs showed the pod proxy IP, making it difficult to track users or manage logs effectively.

Solution

By setting externalTrafficPolicy to Local, Kubernetes ensures that the real client IP is passed to your services, even when traffic is routed through a load balancer.

Here is a sample configuration:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-service
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
  type: LoadBalancer
  externalTrafficPolicy: Local
```

This configuration restores the real client IP by routing traffic only to local nodes.

For a step-by-step breakdown, read my blog post: Sherlock Holmes and the Case of the Missing User IPs.

3. Zombie State: Servers Becoming Unresponsive

An unexpected challenge we faced was servers entering a "zombie" state. After running smoothly for days, some servers became unresponsive without any clear cause.

Our Fix: A Scheduled Restart

Despite extensive troubleshooting, we could not pinpoint the root cause. However, a cron job that restarts the servers every 24 hours effectively mitigated the issue.

Here is how we configured it using Kubernetes' CronJob resource:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-my-server
spec:
  schedule: "0 0 * * *"  # Runs every day at midnight
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: restart-container
              image: my-app:latest
              command: ["sh", "-c", "echo Restarting server... && kill -HUP 1"]
          restartPolicy: OnFailure
```
This ensures a daily restart, keeping our services responsive.

4. Securing Internal Service Communication

A major advantage of Kubernetes is the ability to restrict internal service visibility for security reasons. We wanted to prevent all services from being externally accessible while still allowing internal communication.

Solution

We leveraged Kubernetes' internal DNS system, which allows services to communicate securely within the cluster. For example, a service in the my-namespace namespace can be reached at my-service.my-namespace.svc.cluster.local.

This setup isolates critical services, reducing the attack surface and enhancing security.

5. Setting Up Alerts: Proactive Monitoring

Without proper alerting, issues like crash loops or unexpected pod restarts can go unnoticed until they cause major downtime. We implemented Prometheus and Alertmanager to notify us when:

- A pod enters a crash loop
- CPU or memory usage spikes above thresholds

Here is a Prometheus alerting rule to detect crash loops:

```yaml
groups:
  - name: crash-loop-alerts
    rules:
      - alert: PodInCrashLoop
        expr: kube_pod_container_status_restarts_total{job="kubelet", container="my-app"} > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is in a crash loop"
```

This alert fires when a container has accumulated more than five restarts and the condition persists for five minutes.

6. Optimizing Node Utilization with Taints and Tolerations

To allocate resources efficiently, we used taints and tolerations to control pod placement on nodes. For example, we applied a taint to a node to prevent certain pods from being scheduled on it:

```sh
kubectl taint nodes node1 key=value:NoSchedule
```

To allow specific pods to run on the tainted node, we added this toleration:

```yaml
spec:
  tolerations:
    - key: "key"
      operator: "Equal"
      value: "value"
      effect: "NoSchedule"
```

This strategy ensured high-resource pods were assigned to powerful nodes while lightweight pods ran on less resource-intensive ones, optimizing cluster performance.

Wrapping Up

Migrating from Heroku to Kubernetes came with its challenges, but each hurdle made our system stronger. With better scalability, resilience, and control, the shift was well worth it. If you are on a similar journey, embrace the learning curve; it pays off.

Have insights or questions? Let's discuss.

Don't forget to smash 50 claps if you liked this article; it won't take much time.
