
In Part 1, we established a fundamental truth: LLMs are probability engines, not reasoning machines. They don't "know" anything; they predict the next likely token based on patterns seen during training.

Now, we move from theory to practice. If an LLM is a probability engine, then prompt engineering is the art of steering those probabilities. In this post, we'll cover the mechanics of how you do that:

- The "Butterfly Effect" of word choice and how to harness it.
- Why prompt structure (XML vs. Markdown) is a semantic signal, not just aesthetic.
- Understanding LLM "personality" and behavioral analysis.
- Why LLMs are bad at reasoning.

## How Word Choice Creates Dramatic Output Differences

You might think that "asking nicely" or changing a synonym shouldn't matter much to a massive AI model. You'd be wrong.

In the world of LLMs, we see what I call the "Butterfly Effect": minor, semantic-preserving changes to a prompt can lead to massive shifts in the model's output. This isn't just observation; it's researched fact. A study, The Butterfly Effect of Altering Prompts, demonstrated that small phrasing variations can drastically alter performance.

Recent research tested 26 prompt engineering principles and found significant patterns:

- Emotional stimuli ("This is very important to my career") can yield +20% accuracy in some cases.
- Reasoning language ("take a deep breath and work step-by-step") provides measurable improvement on complex tasks.
- Larger models show bigger improvements from these principles (10–100%+ boost).

### The Power of "Magic Words"

Specific phrases act as levers for the model's latent space.
This concept is explored further in How Prompt Keywords (Magic Words) Optimize Language Model Performance, which details how certain triggers activate high-competence pathways. Some proven triggers include:

- "Let's think step-by-step" (the famous zero-shot CoT trigger).
- "Let's work this out in a step-by-step way to be sure we have the right answer."
- "First, let's think about this logically," combined with grounding instructions like "Use only the facts provided."

From the research on what works surprisingly well, we know that:

- Starting with a brief greeting can set the tone, complexity, and demeanor of the response.
- Role-based framing activates relevant token relationships.
- Specific vocabulary choices influence output style more than length, purely by association.

### Vocabulary as Domain Anchoring

This brings us to a critical mechanic: domain anchoring. As discussed in Prompt Engineering: How Prompt Vocabulary Affects Domain Knowledge, using domain-specific jargon doesn't just make you sound professional; it forces the model to look into a specific "cluster" of its training data.

Positive vs. negative framing:

- Positive: "You are focused on accuracy and depth." (Activates desired behaviors.)
- Negative: "Do not provide shallow answers." (Less effective, as it primes the concept of "shallow answers.")

Token efficiency tips:

- Avoid filler: words like "please," "could you," and "thank you" consume tokens without adding information value (though they can affect tone).
- Punctuation primarily serves written text, not instruction clarity.
- Conciseness: 500 words of context can often be reduced to 50 words of clear objectives.

### Concrete Examples

**1. The Specificity Effect**

- Vague: "Tell me about Paris."
  Result: a generic overview with unclear intent. You get the Wikipedia summary.
- Specific: "Tell me about the best neighborhoods for a budget-conscious solo traveler interested in street art and local cafés in Paris."
  Result: targeted, actionable recommendations.
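To experiment with these levers yourself, it helps to A/B the same task with and without them. A minimal sketch (the prompt strings come from the examples above; the actual model call is omitted, so swap in any chat-completion client):

```python
# Two prompts for the same task, differing only in specificity, plus a helper
# that appends a zero-shot CoT trigger. Illustrative only; no model is called.

VAGUE = "Tell me about Paris."

SPECIFIC = (
    "Tell me about the best neighborhoods for a budget-conscious solo "
    "traveler interested in street art and local cafés in Paris."
)

COT_TRIGGER = "Let's think step-by-step."

def build_prompt(task: str, cot: bool = False) -> str:
    """Optionally append the zero-shot CoT trigger to a task prompt."""
    return f"{task}\n{COT_TRIGGER}" if cot else task
```

Running both variants against the same model and comparing outputs is the quickest way to see the "butterfly effect" on your own tasks.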
The specific tokens "budget-conscious," "street art," and "local cafés" activate entirely different clusters of associations in the model's latent space.

**2. Priming for Code**

Without priming: `# Write a simple python function that...`
Result: the model might generate pseudocode, C++, or just text explaining the logic.

With leading words:

```
Write a simple python function that...

import
```

Result: by forcefully starting the response with `import`, we immediately constrain the probability distribution to valid Python syntax. We effectively "shoved" the model down the correct path.

**3. Construct Definition (The 100% Gain)**

- Poor wording: "Does this text contain negative core beliefs? Yes or No."
  Accuracy: ~33%
- Better wording: "Using psychology research definitions, a negative core belief is a deeply held conviction about oneself or the world. Indicators include self-blame patterns, catastrophizing, or generalization from single events. Does the following text exhibit negative core beliefs?"
  Accuracy: ~66%

Why does this happen? Large language models do not reason over abstract concepts the way humans do. The phrase "negative core belief" does not exist as a single, grounded concept inside the model. Rather, it is represented implicitly as statistical associations. When the label is vague, the model guesses.

By adding a definition, we do three things:

1. Constrain the token space: we introduce lexical patterns (self-blame, catastrophizing) the model can match.
2. Align attention: the model's attention mechanism now has explicit anchors.
3. Shape the task: we turn "understanding psychology" into "pattern matching," which the model is actually good at.

Prompting works when you convert vague labels into explicit token patterns.

## Structure is Semantics

One of the biggest misconceptions is that formatting (headers, whitespace, brackets) is just for human readability. For an LLM, structure is a signal.

Research confirms that format matters immensely.
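To make "structure is a signal" concrete, here is one instruction rendered in two common structural conventions. These helpers are purely illustrative, not an official API of any provider:

```python
# The same (context, instruction) pair rendered two ways: XML-style tags
# versus Markdown section headers. Illustrative string builders only.

def as_xml(context: str, instruction: str) -> str:
    """Wrap each part in explicit XML tags."""
    return (
        f"<context>{context}</context>\n"
        f"<instruction>{instruction}</instruction>"
    )

def as_markdown(context: str, instruction: str) -> str:
    """Mark sections with Markdown headers."""
    return (
        f"### Context\n{context}\n\n"
        f"### Instructions\n{instruction}"
    )
```

The content is identical; only the delimiters change. Yet, as discussed next, that difference alone can swing benchmark performance.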
The paper Does Prompt Formatting Have Any Impact on LLM Performance? shows that identical content formatted differently can produce up to 40% performance variation on code generation tasks. Even more striking, as seen above, changing a definition's structure can yield a 100% performance improvement. Models are "overfit" to the formats they saw during training.

### Case Study A: Anthropic (Claude) & XML

Anthropic explicitly engineered their models to be "XML-native." During fine-tuning, they used datasets where instructions were wrapped in tags.

The engineering takeaway: for Claude, using XML is not a suggestion; it is a syntax requirement for peak performance.

- Bad: "Here is the context: [text]…"
- Optimized: `<context>[text]</context>`

When Claude sees `<context>`, it mathematically "weights" the tokens inside that tag differently.

### Case Study B: OpenAI (GPT-4) & Markdown

OpenAI's RLHF (Reinforcement Learning from Human Feedback) methodology heavily utilized Markdown.

The engineering takeaway: GPT-4 models are highly responsive to `#` and `##`. `### Instructions` works better than `<instruction>` for GPT-4 because `###` is the token sequence associated with a "new section" in its reinforcement-learning history.

The lesson: experiment with formatting. If a model struggles, try switching from plain text to specific markup. You aren't just changing the look; you are speaking the model's native language.

### Why It Works

Training data is not uniform. Code repositories (GitHub) often use specific conventions like Markdown headers or docstrings, while structured datasets (like the ones used to train Claude) use XML tags. When you match your prompt's structure to the model's training data, you reduce the "entropy," or confusion, for the model.

- Triggering attention: specific tokens (like `###` or `<instruction>`) act as "hooks" for the attention heads.
They signal, "Pay attention here; this is a rule."

- Reducing translation cost: if you force a model meant for Markdown to parse specific JSON structures without priming, it has to spend "cognitive budget" (probability mass) just trying to parse the format, leaving less capacity for the actual logic.

Takeaway: match the format to the model. Don't force an XML-native model to follow complex Markdown rules if it struggles. Speaking the model's "native language" frees up its computation for your actual task.

## LLM Personality and Behavioral Analysis

This sounds like sci-fi, but it's becoming a rigorous scientific field. Because models are trained on human data, they inherit "personalities": consistent behavioral patterns that bias their decisions. This is thoroughly explored in the study Do Chatbots Exhibit Personality Traits?, which compares systems like ChatGPT and Gemini through self-assessment. A study in Nature Machine Intelligence applied standard psychometric frameworks (like the Big Five) to LLMs.

### The Big Five Pattern

| Trait | Core Question | What it means for LLMs |
| --- | --- | --- |
| Openness | "Do you explore or prefer the familiar?" | Creativity vs. repetitiveness |
| Conscientiousness | "Do you regulate yourself well?" | Instruction following & formatting strictness |
| Extraversion | "Where does your energy go?" | Verbosity & assertiveness |
| Agreeableness | "How do you treat others?" | Refusal rates & sycophantic behavior |
| Neuroticism | "How stable are your emotions?" | Stability of outputs across multiple runs |

Recent comparisons have shown distinct "types":

- ChatGPT-3.5/4: often aligns with ENTJ (assertive, task-focused, sometimes confidently wrong).
- Claude 3: often aligns with INTJ (reserved, verbose, highly detail-oriented).
- Gemini: often leans towards INFJ (more "feeling-oriented" or nuanced in creative tasks).

### Why Does This Matter?

Just like humans, LLMs have distinct personas, and this matters for interaction. If you need a concise, matter-of-fact data extraction, an "Extraverted" model might give you too much fluff.
If you need a sensitive creative writing piece, a "Thinking"-dominant model might sound cold. We need to decide what persona our agent should adopt.

### Persona Prompting

Instead of fighting the model's nature, use persona prompting to temporarily override these baselines:

"You are a stoic, concise data analyst. Do not use filler words."

This instruction explicitly suppresses the "Extraversion" weights in the model's output generation.

### If You Don't Believe Personality Exists…

You might be reading this thinking, "It's just math. Stop anthropomorphizing it." But if you treat these models as pure logic engines, you cannot explain their failures. "Personality" is the user-facing manifestation of training-data bias, and when it drifts, it gets ugly. If you don't believe me, look at what happens when these "personalities" go unchecked:

- Grok: in 2025, Elon Musk's AI chatbot Grok reportedly started calling itself "MechaHitler" in a bizarre instance of persona drift (Source).
- Sycophancy: OpenAI had to address "sycophancy" in GPT-4o, where the model would agree with user errors just to be "nice" (Read more).

This is why we need rigorous science to measure it. To combat this, researchers like those at Anthropic have developed Persona Vectors.
These are mathematical patterns of activity inside the neural network that control traits like malice or flattery. You can read about how Anthropic automates the evaluation of these personas, and investigate persona vectors directly, to learn more about how LLM personas work under the hood.

Anthropic has also released recent research on the Assistant Axis (Situating and Stabilizing the Character of Large Language Models). Anthropic mapped this "persona space" by:

1. Prompting models to adopt hundreds of personas,
2. Recording the neural activations those prompts produce,
3. Running principal component analysis (PCA) to find the main dimensions of variation.

The key finding: there is one dominant direction, a vector in activation space, that strongly corresponds to how "assistant-like" the model's behavior is. This is the Assistant Axis.

- On one end, activations correspond to helpful, professional roles (assistant, analyst, consultant).
- On the other end, activations correspond to alternative characters (ghost, hermit, mystic).

To explore it further, check out the paper. I've tried chatting with both the Gemini-Flash-Latest and GPT-5 Mini models to understand their character and system instructions. I found that ChatGPT's instructions make it more friendly and helpful, while Gemini-Flash-Latest comes across as more assistant-like and professional. You can check the conversations here: gemini-flash-latest and GPT-5 mini.

Takeaway: treat model selection like hiring. Match the personality to the task.

- For creative writing: use a model with high "Openness" (like Gemini or high-temperature GPT).
- For strict code: use a model with high "Conscientiousness" (like Claude 3).
- For user interaction: use a persona prompt to set the "Agreeableness" level you need.

## Why LLMs Are Bad at Reasoning

At their core, large language models are next-token prediction systems. They do not manipulate symbols, execute algorithms, or maintain an internal model of truth.
They estimate: "Given everything I've seen so far, what token is most likely to come next?"

### Why Simple Questions Work vs. Trick Questions

- Simple: "What is 1 + 1?" This works because 1 + 1 = 2 is a massive pattern in the training data (a low-entropy completion).
- Tricky: "How many r's are in strawberry?" This question became a Reddit sensation because models failed it constantly. Why? Humans count characters; LLMs see tokens. The token for "strawberry" is a single unit; the model doesn't "see" the letters inside unless it breaks them down. It predicts the most statistically likely answer based on casual text, where people rarely count letters explicitly.

### The Core Failure: No Intermediate State

Reasoning requires a process: counting → state tracking → transformation → verification. LLMs, by default, have no explicit state, no loops, and no verification. They just predict.

### Example Problem: The Apple Test

Let's look at a classic logic trap to see this in action.

Problem: "A cafeteria has 12 apples. They use 3 apples to make pies and then buy 0 more apples. How many apples are left?"
Correct answer: 9

Case 1: No reasoning (single shot). The model sees: question → predict answer. Internally, it tries to do `f(problem_text) → Answer` in one go. If it fails to parse the "buy 0" trick or mixes up the numbers, it outputs a hallucination like 27 or 15. It fails because it compressed multiple logical steps into a single forward pass.

### Why Prompting Fixes Reasoning (The Mechanism)

When you ask the model to "think step by step" (Case 2), you are not improving its intelligence. You are reshaping the probability landscape. The model outputs:

1. The cafeteria starts with 12 apples.
2. It uses 3 apples.
3. 12 − 3 = 9.
4. It buys 0 apples.
5. Answer: 9.

Mechanism-level view:

- Single shot: `Answer = f(problem)` → high risk of error.
- Chain of Thought (CoT): `Step1 = f(problem)`, `Step2 = f(problem + Step1)`, `Answer = f(all_steps)`.

Each step becomes part of the context for the next step. The model is doing more forward passes, correcting itself iteratively.
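This mechanism-level view can be sketched as a loop in which each generated step is appended to the context before the next forward pass. Here `generate` is a stand-in for any LLM call, and the stopping convention ("Answer:") is an assumption for illustration:

```python
def chain_of_thought(problem: str, generate, max_steps: int = 5) -> str:
    """Iteratively grow the context: Step_n = f(problem + Steps_1..n-1)."""
    context = problem + "\nLet's think step by step.\n"
    for _ in range(max_steps):
        step = generate(context)           # one forward pass
        context += step + "\n"             # the step becomes new context
        if step.startswith("Answer:"):     # model signals it is done
            return step
    return generate(context + "Answer:")   # force a final answer
```

Each call sees everything produced so far, which is exactly why CoT buys accuracy: the answer token is conditioned on the intermediate steps rather than on the raw problem alone.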
This is iterative computation.

### Test-Time Compute: Being "Smart" by Voting

We can go further. Instead of one chain of thought, we generate many:

- Attempt 1: answer 9
- Attempt 2: answer 9
- Attempt 3: answer 15

Then we select the most frequent answer (self-consistency). This works because correct reasoning paths tend to converge on the same answer, while wrong paths scatter randomly.

### Final Mental Model

LLM reasoning is the controlled expansion of computation at test time. It is not magic; it is buying accuracy with more tokens (computation).

Takeaway: stop hoping for "smart" answers from zero-shot prompts.

- For complex logic: always force Chain-of-Thought ("Think step-by-step").
- For high stakes: use test-time compute (generate 3–5 responses and pick the most frequent answer).
- Mental shift: view tokens as "thinking time." If you restrict length, you restrict intelligence.

## Summary: The Mechanics of Control

We've covered the three levers you have to control the probability machine:

1. Word choice: use specific, domain-anchored vocabulary to steer the latent space.
2. Structure: use XML for Claude, Markdown for GPT, and respect the model's native training format.
3. Persona: understand the model's bias and explicitly prompt against it if necessary.

I've also set up a GitHub repository for this series, where I'll be sharing the code and additional resources. Make sure to check it out and give it a star!

Feel free to share your thoughts, comments, and insights below. Let's learn and grow together!

Welcome back! The waiting is over. In Part 3, we are going to see how to run the components of our voice agent locally, even on a CPU. At the end, you'll have homework: integrating all of these into generic code that works locally.

## The Performance Reality: Setting Expectations with Latency Budgets

Before we dive into running components, you need to understand what "fast" actually means in voice AI. Industry benchmarks show that users perceive natural conversation when end-to-end latency (time from the user finishing speaking to hearing the agent's response) is under 800ms, with the gold standard being under 500ms.

Let's break down where those milliseconds go:

*(figure: latency budget breakdown)*

Why this matters: if your STT alone takes 500ms, you've already exhausted most of your latency budget. This is why model choice and orchestration matter a lot. If you want more depth on latency, check out the Pipecat article Conversational Voice AI in 2025, which covers it in depth.

For local inference on a CPU or modest GPU:

- Expect 1.2–1.5s latency for the first response.
- Subsequent turns may hit 800–1000ms as models warm up.
- This is acceptable for local development; production requires better hardware or cloud providers.

## The Hardware Reality: CPU vs GPU

Before we run anything, we need to address the elephant in the room: computation. Why do models crave GPUs? AI models are essentially giant math problems involving billions of matrix multiplications.

- CPUs are like a Ferrari: insanely fast at doing one or two complex things at a time (sequential processing).
- GPUs are like a bus service: slower at individual tasks, but able to transport thousands of people (numbers) at once (parallel processing).

Since neural networks need to calculate billions of numbers simultaneously, GPUs are dramatically faster.

"But I only have a CPU!" Don't worry. We can still run these models using a technique called quantization. Standard models use 16-bit floating-point numbers (e.g., 3.14159...).
Quantization rounds these down to 4-bit or 8-bit integers (e.g., 3). This drastically reduces the size of the model and makes the math simple enough for a CPU to handle reasonably well, though it will practically always be slower than a GPU.

### Minimum System Requirements for Local Voice Agents

Here's what you actually need to get started:

## Speech-to-Text (STT)

First, we are going to see how to run the STT component. As mentioned in Part 1, we are using Whisper from OpenAI. But before we blindly pick a model, we need to know what to look for.

### The Blueprints of Hearing: STT Selection Criteria

When selecting a speech-to-text model for production, "it works" isn't enough. You need to verify specific metrics to ensure it won't break your conversational flow.

#### 1. Word Error Rate (WER)

This is the cornerstone accuracy metric. It calculates the percentage of incorrect words.

- Formula: WER = (Substitutions + Deletions + Insertions) / Total Words
- Goal: pro systems aim for 5–10% WER (90–95% accuracy).
- Reality check: for casual voice chats, anything under 15–20% is often acceptable.
- Context matters: a "digit recognition" task might have 0.3% WER, while "broadcast news" might have 15%. Don't blindly trust paper benchmarks; test on your own audio.

#### 2. Latency & Real-Time Factor (RTF)

Speed is more than just feeling fast; it's about physics.

- Time to First Byte (TTFB): time from "speech start" to "partial transcript." Target <300ms.
- Real-Time Factor (RTF): processing time / audio duration. If RTF > 1.0, the system is slower than real time (impossible for live agents). Target an RTF of 0.5 or lower (processing 10s of audio in 5s) to handle overheads.
- The "flush trick": advanced pipelines don't wait. When VAD detects silence, they "flush" the buffer immediately, cutting latency from ~500ms to ~125ms.

#### 3. Noise Robustness & SNR

Lab audio is clean; user audio is messy.
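Since you should measure WER on your own audio rather than trust paper benchmarks, here is a minimal word-level implementation of the formula above (a standard edit-distance sketch; production code would normalize punctuation and casing more carefully):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (S + D + I) / N, via word-level edit distance.

    Assumes a non-empty reference transcript.
    """
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Feed it a hand-checked transcript as the reference and your STT output as the hypothesis, over a sample of your real (noisy) audio.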
Performance drops sharply when the signal-to-noise ratio (SNR) falls below 3dB.

- "Talking" noise: background chatter usually doesn't break modern models like Whisper.
- "Crowded" noise: train stations and cafes are the hardest tests. If your users are mobile, prioritize noise-robust models (like distil-whisper) over pure-accuracy models.

#### 4. Critical Features for Agents

- Speaker diarization: "Who spoke when?" Essential if you want your agent to talk to multiple people, though it adds latency.
- Punctuation & capitalization: raw STT is a lowercase stream (hello world). Good models add punctuation (Hello, world.), which is critical for the LLM to understand semantics and mood.

### Model Selection for Real-Time Performance

From faster-whisper itself, we have used Systran/faster-distil-whisper-medium.en from Hugging Face, but feel free to explore others. RTF (Real-Time Factor) = time to process audio / length of audio; 0.05 means 50x faster than real time.

Recommendations for local voice agents:

- CPU-only: distil-medium or small.en (aim for <300ms latency)
- GPU with 8GB VRAM: medium.en (aim for 200–250ms latency)
- GPU with 16GB+ VRAM: large-v3 (aim for 150–200ms latency)

### The Interruptibility Problem: Barge-In and VAD

Here's something rarely discussed openly: VAD isn't just for silence detection; it's a critical component for interruption handling (barge-in). When a user speaks while your agent is talking, three things must happen instantly:

1. Echo cancellation (AEC): remove your agent's voice from the audio stream so the STT doesn't get confused hearing itself.
2. Voice activity detection (VAD): detect the user speaking (probability-based, not just a volume threshold).
3. Immediate TTS cancellation: stop the agent from continuing mid-sentence.

Typical barge-in detection requires:

- VAD latency: 85–100ms (using algorithms like Silero VAD, which is Bayesian/probability-based rather than energy-based)
- Barge-in stop latency: <200ms (the system must stop speaking within 200ms of a user interruption for a natural feel)
- Accuracy: 95%+ (must not false-trigger on background noise)

Without proper barge-in handling, your voice agent sounds robotic, because users can't interrupt; they must wait for the full response.

What's better: a simple energy-based VAD that misses some speech, or Silero VAD, which uses neural networks? Use Silero VAD. It has built-in support in Pipecat, so we don't need to worry much; it is handled for both CPU and GPU automatically. It trains models to understand "speech probability" rather than just volume, so it handles:

- Whispers and soft speech
- Background noise (it doesn't trigger on dog barks)
- Different accents and speech patterns
- Real-time streaming (10–20ms window processing)

### How to Run the STT

To serve this, we need a server or inference engine. While faster-whisper is a library, we need a server-like architecture (similar to Deepgram) where we connect to a WebSocket server, send audio, and receive text. I have written a simple WebSocket server that runs the model on either CPU or GPU, and I have dockerized everything to make our lives easier.

All the code for this component is located in code/Models/STT. Let's look at what's inside:

- server.py: the heart of the STT.
It starts a WebSocket server that receives audio chunks, runs them through the Whisper model, and streams back text.

- download_model.py: a helper script to download the specific faster-whisper model weights from Hugging Face.
- docker-gpu.dockerfile: the environment setup for NVIDIA GPU users (installs CUDA drivers).
- docker-cpu.dockerfile: the environment for CPU users (a lighter setup).

### Architecture Flow

1. WebSocket connection: we use WebSockets instead of a REST API because we need a persistent connection to stream audio continuously.
2. Audio chunking: the client (your browser/mic) records audio and chops it into small "chunks" (bytes).
3. Streaming: these chunks are sent over the WebSocket instantly.
4. Processing: the server receives the raw bytes (usually Int16 format), converts them to floating-point numbers (Float32), and feeds them into the Whisper model.
5. Voice activity detection (VAD): the server listens to your audio stream. When it detects silence (you stopped speaking), it commits the transcription and sends it out.

Example scenario: imagine you say "Hello Agent."

1. Your microphone captures 1 second of audio.
2. The browser slices this into 20 tiny audio packets and shoots them to the server one by one.
3. The server processes them in real time. It hears "He…", then "Hello…", then "Hello A…".
4. You stop talking. The VAD logic sees 500ms of silence.
5. It shouts "STOP!" and sends the final text "Hello Agent" to the next step.

### How to Run

On GPU (recommended):

```shell
docker build -f docker-gpu.dockerfile -t stt-gpu .
docker run --gpus all -p 8000:8000 stt-gpu
```

On CPU:

```shell
docker build -f docker-cpu.dockerfile -t stt-cpu .
docker run -p 8000:8000 stt-cpu
```

## Large Language Model (LLM)

Next, we need a brain. But before we just pick "Llama 3," we need to understand the physics of running a brain on your computer.

### The Blueprints of Thinking: LLM Selection Criteria

Choosing an LLM for voice isn't about choosing the smartest one; it's about choosing the one that fits.

#### 1. The VRAM Formula

Will it fit? Don't guess.
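The sizing check boils down to a single multiplication. A quick helper, using the same 1.2 overhead factor and the Llama 3 8B numbers worked through below:

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float,
                     overhead: float = 1.2) -> float:
    """VRAM (GB) ≈ params (billions) × precision (bytes/param) × overhead."""
    return params_billion * bytes_per_param * overhead

# Llama 3 8B at different precisions:
fp16_gb = estimate_vram_gb(8, 2.0)   # FP16: 2 bytes/param -> 19.2 GB
int4_gb = estimate_vram_gb(8, 0.5)   # INT4: 0.5 bytes/param -> 4.8 GB
```

Remember this estimates weights only; the KV cache for your context window comes on top.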
Use the math.

- Formula: VRAM (GB) ≈ Params (billions) × Precision (bytes) × 1.2 (overhead)
- Precision refresher:
  - FP16 (16-bit): 2 bytes/param (the standard).
  - INT8 (8-bit): 1 byte/param (half the size of FP16).
  - INT4 (4-bit): 0.5 bytes/param (the sweet spot for local use).

Example calculation (Llama 3 8B):

- @ FP16: 8 × 2 × 1.2 = 19.2 GB (needs an A100/3090/4090)
- @ INT4: 8 × 0.5 × 1.2 = 4.8 GB (runs on almost any modern GPU or laptop!)

Note: the context window (KV cache) adds variable memory; an 8K context is usually +1GB.

#### 2. Throughput vs. Latency

- Tokens per second (TPS): how fast the model reads and generates. Humans read/listen at ~4 TPS, so beyond ~8 TPS you hit diminishing returns for voice.
- Time to first token (TTFT): this is the king metric. Sub-200ms feels instant; 2s feels like "Is it broken?"
- Goal: optimize for TTFT, not max throughput.

#### 3. Benchmarks That Actually Matter

Don't just look at the leaderboard; look at the right columns.

- MMLU: general knowledge. A good baseline, but vague.
- IFEval (instruction following): crucial for agents. Can it follow your system-prompt instructions? Current small models (~2B) are getting good at this (80%+).
- GSM8K: logic/math. A good proxy for "reasoning" capability.

For a local voice agent, a high IFEval score is often more valuable than a high MMLU score, because if the agent ignores your "keep responses short" instruction, the user experience fails.

### Inference Engines

To run a model locally, we need an inference engine. If you search Google, you will find many options. From these, we are going to use SGLang to run our model on GPU; for CPU, we can go with Ollama, which is very simple and easy to set up. We are using Llama 3.1 8B, which is the current state of the art for small open-source models.

### Why TTFT (Time-to-First-Token) Is What Matters

When users wait for a response, what they perceive is how long until they hear the first word.
Here's why:

- Prefill phase: the model processes your entire prompt (100–500ms for 8B models).
- Decoding phase: the model generates one token at a time and streams it immediately to TTS.
- Key insight: TTS can start speaking as soon as token #1 arrives.

So if your TTFT is 150ms, users hear the first word in 150ms + TTS latency (75–150ms) = 225–300ms total. The full response might take 5 seconds to complete, but the user hears audio within 300ms. This is why raw token-generation throughput matters less than TTFT in conversational AI.

### Folder Structure

Code location: code/Models/LLM

- llama-gpu.dockerfile: setup for vLLM or SGLang (GPU).
- llama-cpu.dockerfile: setup for Ollama (CPU).

### Architecture Flow

The LLM server isn't just a text-in/text-out box. It handles queuing and batching to keep up.

1. Request queue: your prompt enters a waiting line.
2. Batching: the server groups your request with others (if any).
3. Prefill: it processes your input text (the prompt) to understand the context.
4. Decoding (token by token): it generates one word-part (token) at a time.
5. Streaming: as soon as a token is generated, it is sent back. It doesn't wait for the full sentence.

Example scenario, input "What is 2+2?":

1. Tokenizer: converts text to numbers, e.g. [123, 84, 99].
2. Inference: the model calculates the most likely next number.
3. Token 1: generates "It". Sends it immediately.
4. Token 2: generates "is". Sends it.
5. Token 3: generates "4". Sends it.
6. End: sends `<EOS>` (end of sequence).

### How to Run

1. On GPU (using SGLang/vLLM):

```shell
docker build -f llama-gpu.dockerfile -t llm-gpu .
docker run --gpus all -p 30000:30000 llm-gpu
```

Note: this exposes an OpenAI-compatible endpoint at port 30000.

2. On CPU (using Ollama):

```shell
# Easy method: just install Ollama from ollama.com
ollama run llama3.1
```

Or using our dockerfile:

```shell
docker build -f llama-cpu.dockerfile -t llm-cpu .
docker run -p 11434:11434 llm-cpu
```

## Text-to-Speech (TTS)

Finally, for the mouth, we use Kokoro. Kokoro is an open-weight TTS model with 82 million parameters.
Despite its lightweight architecture, it delivers quality comparable to larger models while being significantly faster and more cost-efficient.

### The Blueprints of Speaking: TTS Selection Criteria

Evaluating a "mouth" is tricky because it's both objective (speed) and subjective (beauty).

#### 1. Latency & Real-Time Factor

- TTFB (time to first byte): how fast does the first sound play? <100ms is the gold standard, <300ms is acceptable, and >500ms breaks immersion.
- Real-Time Factor (RTF): anything < 0.1 (generating 10s of audio in 1s) is amazing; production systems target < 0.5.

#### 2. Human Quality Metrics (MOS)

There isn't a "perfect" score, but we use the Mean Opinion Score (MOS), rated 1–5 by humans.

- 4.0–5.0: near-human (modern models like Kokoro/ElevenLabs).
- ~2.5: "robot voice" (old-school accessibility TTS).

#### 3. Naturalness & Prosody

"Prosody" is the rhythm and intonation.

- Context awareness: does it raise its pitch at a question mark? Does it pause for a period?
- SSML support: can you control it? (e.g. `<break time="500ms"/>` or `<emphasis>`).
- Voice cloning:
  - Zero-shot: a 3s audio clip → a new voice (good for dynamic users).
  - Fine-tuned: 3–5 hours of audio training (necessary for branded, professional voices).

### The Critical Part: TTS Context Window & Streaming

Here's a nuance many developers miss: TTS models like Kokoro need context windows to avoid sounding robotic when receiving partial text.

The problem without context awareness:

- LLM sends: "It" → Kokoro generates audio for just "It" → sounds like a grunt
- LLM sends: "is" → Kokoro generates audio for just "is" → new voice, disconnected
- LLM sends: "4" → Kokoro generates audio for just "4" → jumpy prosody

The solution, a context window in streaming TTS:

- LLM sends: "It" → Kokoro waits (buffering)
- LLM sends: "is" → Kokoro now has "It is" → generates better prosody
- LLM sends: "4" → Kokoro has "It is 4" → natural cadence
- Or Kokoro decides: "wait for punctuation before speaking"

Kokoro uses a 250-word context window internally.
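The buffer-until-punctuation idea can be sketched in a few lines. This is a toy illustration of the behaviour, not Kokoro's actual implementation; the 64-token cap mirrors the configurable threshold described in the text:

```python
PUNCTUATION = {".", "!", "?"}
MAX_BUFFER_TOKENS = 64  # illustrative cap, mirroring the configurable threshold

def stream_to_sentences(tokens):
    """Buffer streamed LLM tokens; flush a chunk at punctuation or when full."""
    buffer = []
    for tok in tokens:
        buffer.append(tok)
        if tok.strip() and tok.strip()[-1] in PUNCTUATION:
            yield "".join(buffer)   # enough context for natural prosody
            buffer = []
        elif len(buffer) >= MAX_BUFFER_TOKENS:
            yield "".join(buffer)   # safety valve for punctuation-free text
            buffer = []
    if buffer:
        yield "".join(buffer)       # flush the remainder at end-of-stream
```

Each yielded chunk is what would be handed to phonemization and synthesis, so the TTS always speaks sentence-sized units rather than single tokens.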
This means:It buffers incoming tokens until it reaches punctuation (., !, ?, or a configurable threshold)Once it has enough context, it generates audio with proper intonationAs more text arrives, it streams the audio bytes back without waiting for the full responseThis is why Kokoro excels at streaming it doesn’t try to speak partial fragments; it waits just enough to sound natural.Example:LLM stream: "Let me think... " (no punctuation yet) └─ Kokoro buffers silentlyLLM stream: "Let me think... 2+2 equals 4." (full sentence) └─ Kokoro now has context → generates natural speech with correct stress └─ Streams audio back in chunks (50-100ms windows)We’ll also use the Kokoro library and build a server to expose it as a service.Folder StructureCode location: code/Models/TTS/Kokoroserver.py: Takes text input and streams out audio bytes.download_model.py: Fetches the model weights (v0_19 weights).kokoro-gpu.dockerfile: GPU setup (Requires NVIDIA container toolkit).kokoro-cpu.dockerfile: CPU setup (Works on standard laptops).If you like A minimal Kokoro-FastAPI server impelementation you can check out hereArchitecture FlowThe TTS server receives a stream of text tokens from the LLM. It immediately starts converting them to Phonemes (sound units) and generating audio. It streams this audio back to the user before the LLM has even finished the sentence. This Streaming Pipeline is crucial for low latency and natural feel.How it works:Token Buffering: TTS receives token #1 from LLM. Checks if it’s punctuation.If no punctuation: buffer and wait for more tokens.If punctuation or buffer size > 64 tokens: proceed.2. Phonemization: Convert buffered text to phonetic units (e.g., “Hello” → /həˈloʊ/).3. Model Inference: Kokoro generates audio features (mel-spectrogram) from phonemes.4. Waveform Generation: iSTFTNet vocoder converts mel-spec to raw audio bytes.5. Streaming: Audio chunks (50–100ms windows) stream back immediately over WebSocket.6. 
Repeat: As LLM sends token #2, buffer grows, phonemization updates, new audio generates.
Example Scenario:
Input Stream: “It” → “is” → “4” → “.” (with timestamps)
T=0ms: LLM sends "It" Kokoro: "No punctuation, buffering..."
T=150ms: LLM sends " is" Kokoro: "Still buffering: 'It is'"
T=300ms: LLM sends " 4" Kokoro: "Still buffering: 'It is 4'"
T=400ms: LLM sends "." Kokoro: "Got punctuation! Phonemize: 'ɪt ɪz fɔr'" → Infer mel-spec (100ms) → Vocoder (50ms) → Stream chunk #1 (40ms audio) at T=550ms ✓ User hears "It"
T=550ms: More tokens arrive, regenerate from updated context "It is 4." → Refined mel-spec (includes proper prosody now) → Stream chunk #2 at T=650ms ✓ User hears "is" → Stream chunk #3 at T=750ms ✓ User hears "4"
Total latency: ~550ms to first audio, streaming continues until EOS token.
Performance Benchmarks
How to Run
1. On GPU:
docker build -f kokoro-gpu.dockerfile -t tts-gpu .
docker run --gpus all -p 8880:8880 tts-gpu
2. On CPU:
docker build -f kokoro-cpu.dockerfile -t tts-cpu .
docker run -p 8880:8880 tts-cpu
Putting It Together: End-to-End Latency
Now that we understand each component, here’s what your full local pipeline looks like:
Realistic Local Performance (8B LLM + Kokoro + Whisper on RTX 3060)
User speaks: "What is 2+2?"
 ↓
STT (faster-distil-whisper-medium) : 200ms ✓
LLM (Llama 3.1 8B, TTFT) : 120ms ✓
 └─ Token 1 "It" available at 120ms
 ↓
TTS (Kokoro buffering for punctuation) : 400ms ✓
 └─ Buffering tokens until "4."
(takes ~300ms for full sentence) └─ Phonemization + inference: 100ms ↓Streaming audio starts back to user : 120 + 400 = 520ms ✓User hears first word "It"Subsequent tokens stream in background: Token 2 "is" available at 180ms → Audio generated in parallel Token 3 "4" available at 250ms → User hears full "It is 4" by 650ms Token EOS at 300ms → Stop TTSTOTAL MOUTH-TO-EAR: ~650ms (acceptable for local, within production <800ms)Compare to production APIs:Deepgram STT + GPT-4 + ElevenLabs TTS (cloud): 200–300ms (optimized, lower variance)Your local setup: 650–800ms (good for dev, acceptable for many use cases)Homework: Integrate With PipecatSo now that all three components are up and running, it’s your turn to think through how we can integrate them with Pipecat and get a fully local “Hello World” working end to end.Challenge:Run all three Docker containers (STT, LLM, TTS) locallyCreate a Pipecat pipeline that:Accepts WebSocket audio from clientSends to STT server (port 8000)Streams STT output to LLM server (port 30000)Streams LLM tokens to TTS server (port 8880)Streams TTS audio back to client3. Implement barge-in handling: If user speaks while TTS is playing, cancel TTS and process new input4. Measure latency at each stepTips:Use asyncio and WebSocket for non-blocking streamingImplement a simple latency meter to log timestampsTest with quiet and noisy audio to validate VADStart with synchronous (blocking) for simplicity, then optimizeIf you’d like to share your implementation, feel free to raise a PR on our GitHub repo here:https://github.com/programmerraja/VoiceAgentGuide
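As a starting point for the homework, here is a minimal, framework-free sketch of the punctuation-based buffering strategy described earlier. The 64-token cap mirrors the threshold mentioned above; the `synthesize` callback is a hypothetical stand-in for a real TTS request, not part of any actual library:

```python
# Minimal sketch of punctuation-aware token buffering for streaming TTS.
# `synthesize` is a hypothetical placeholder for a real synthesis call.

SENTENCE_END = {".", "!", "?"}
MAX_BUFFER_TOKENS = 64  # flush even without punctuation past this size

def buffer_tokens(token_stream, synthesize):
    """Buffer LLM tokens until punctuation (or a size cap), then synthesize."""
    buffer = []
    for token in token_stream:
        buffer.append(token)
        stripped = token.strip()
        ends_sentence = bool(stripped) and stripped[-1] in SENTENCE_END
        if ends_sentence or len(buffer) >= MAX_BUFFER_TOKENS:
            synthesize("".join(buffer))  # full phrase -> natural prosody
            buffer = []
    if buffer:                           # flush any trailing partial text
        synthesize("".join(buffer))

chunks = []
buffer_tokens(["It", " is", " 4", "."], chunks.append)
print(chunks)  # ['It is 4.']
```

Real servers would do this asynchronously over a WebSocket, but the buffering decision itself is exactly this simple.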

This is my new series on Prompt Engineering and it’s different from everything else out there.Most blogs give you templates: “Try this prompt!” or “Use these 10 techniques!” That’s not what we’re doing here.We’re going deep: How do LLMs actually process your prompts? What makes a prompt effective at the mechanical level? Where do LLMs fail and why?This series will give you the mental models to engineer prompts yourself, not just copy someone else’s examples. Let’s dive in.We’re going to have 5 parts (so far; we may add more in the future):
The Foundation — How LLMs Really Work
The Art & Science of Prompting
Prompting techniques and optimization
Prompt Evaluation and Scaling
Tips, Tricks, and Experience
Let’s jump into Part 1.Do LLMs Think Like Humans?Let me ask you something: Do you think LLMs are intelligent like humans? Do they have a “brain” that understands your questions and thinks through answers?If you answered yes, you’re wrong.LLMs don’t think. They don’t understand. They’re just next-token predictors — sophisticated autocomplete machines that guess what word (or rather, “token”) should come next based on patterns they’ve seen before.Now, you might be thinking: “Wait, how can simple next-word prediction answer complex questions, write code, or have conversations?”That’s a great question, and the answer involves some fascinating engineering. But we’re not going to dive too deep into the theoretical computer science here; that would make this series endless. We’re focusing on what you need to know to write better prompts, nothing more, nothing less.If you’re really interested in understanding how machines learn, you can check out my detailed write-up here.Let’s start with the basics.How Does an LLM Process Your Prompt?When you type a prompt and hit enter, here’s the simplified workflow of what happens inside the model:Step 1: English Text → TokensYour text doesn’t go directly into the model.
First, it gets broken down into tokens.A token is roughly a chunk of text: sometimes a whole word, sometimes part of a word, sometimes punctuation.Examples:
"Hello world" → ["Hello", " world"] (2 tokens)
"apple" → ["apple"] (1 token)
"12345" → ["123", "45"] (2 tokens)
Why does this matter? Because:
Models have token limits (context windows), not word limits
The way text is tokenized affects how the model “sees” it
Some words the model handles better because they’re single tokens, while others are split up
Step 2: Tokens → Numbers (Embeddings)
The model can’t work with text directly; it only understands numbers. Each token gets converted into a long list of numbers called an embedding (basically a mathematical representation of that token).
Step 3: The Transformer Does Its Magic
Your tokens (now numbers) pass through the Transformer architecture: layers of neural network computations. Here’s where the attention mechanism kicks in, letting the model figure out which tokens relate to which.
Example: In the sentence “The bank of the river was muddy”, the model’s attention mechanism connects bank with river and muddy to understand that we're talking about a riverbank, not a financial institution.
Note: There are other emerging LLM architectures, such as Diffusion Models and State Space Models, but for the sake of simplicity I cover only Transformer-based models.
Step 4: Predict the Next Token (Probabilities)
At the end of all this processing, the model outputs a probability distribution over all possible next tokens in its vocabulary (which can be 50,000+ tokens).It looks something like this:
Paris: 0.85 (85% probability)
the: 0.05 (5% probability)
beautiful: 0.03 (3% probability)
London: 0.02 (2% probability)
[thousands of other tokens with tiny probabilities...]
The model doesn’t “know” Paris is the capital of France.
It just calculates that based on the patterns it learned during training, Paris has the highest probability of being the next token after "The capital of France is".Step 5: Select a Token & RepeatThe model picks a token based on these probabilities (we’ll talk about how it picks in a moment), adds it to the sequence, and repeats the whole process to generate the next token, then the next, until it decides to stop.That’s it. That’s all an LLM does: predict the next token, over and over, based on probability.But Wait… How Does This Answer My Questions?Here’s where it gets interesting. If LLMs are just probability machines playing “guess the next word,” how do they:Answer questions correctly?Write code that actually works?Hold coherent conversations?Follow complex instructions?The answer is training specifically, the two major phases that shape model behavior.Phase 1: Pre-Training (Learning Patterns from the Internet)In this phase, the model reads trillions of tokens from:Websites (Wikipedia, forums, blogs)BooksCode repositories (GitHub)Research papersSocial mediaWhat it learns: Statistical patterns. If it sees “The capital of France is Paris” thousands of times, it learns that Paris has a high probability of following "The capital of France is".What it doesn’t learn: How to answer questions like an assistant. A pre-trained “base model” has knowledge but no manners.Ask a base model: “What is the capital of France?”It might respond: “What is the capital of Germany? What is the capital of Spain?”Why? Because it’s just completing patterns it saw in training data probably quiz lists from forums. It has information, but no concept of “answering questions.”Phase 2: Post-Training (Teaching It to Be an Assistant)This is where base models become ChatGPT, Claude, or other chat assistants. Two key steps:1. 
Supervised Fine-Tuning (SFT):Humans write thousands of example conversations: questions and good answersThe model learns: “Oh, when I see a question, I should provide a helpful answer, not continue the question”2. Reinforcement Learning from Human Feedback (RLHF):Humans rate different model responses as “good” or “bad”The model learns to optimize for helpful, harmless, and honest responsesThis is why models refuse harmful requests or add disclaimersThe result: A model that not only predicts the next token, but predicts tokens that look like helpful assistant responses because that pattern now has the highest probability in its training.So when you ask “What is the capital of France?”, the model isn’t “thinking” about geography. It’s predicting that tokens forming a helpful answer have higher probability than tokens that continue the question because that’s what its training reinforced.It’s all still just next-token prediction. The training just shaped which predictions have high probability.Model Configuration: Controlling the OutputRemember that probability distribution we talked about? Here’s where you get control. 
The model gives you probabilities, but configuration parameters decide how tokens are actually selected from those probabilities.Temperature: The Creativity DialTemperature controls how “random” the model’s choices are.Example scenario: The model predicts:Paris: 85%beautiful: 3%London: 2%Low Temperature (e.g., 0.2):The model becomes more “confident” and almost always picks the top choiceParis might effectively become 95%+ likelyResult: Deterministic, focused, repetitive outputsUse for: Code generation, data extraction, factual answersHigh Temperature (e.g., 0.8):The model flattens the probability curveParis might drop to 60%, beautiful rises to 10%, London to 8%Result: More varied, creative, unpredictable outputsUse for: Creative writing, brainstorming, multiple perspectivesReal example:Prompt: “The sky is”Temperature 0.2: “blue” (almost always) Temperature 0.8: “blue” or “cloudy” or “vast” or “filled with stars” (varies)Top-P (Nucleus Sampling): Cutting Off the NonsenseTop-P (also called nucleus sampling) sets a probability threshold.If you set Top-P = 0.9, the model only considers tokens that together make up the top 90% of probability, ignoring everything else.Why this matters:Without Top-P, even with reasonable temperature, the model might occasionally pick a token with 0.001% probability resulting in complete nonsense.With Top-P = 0.9, those ultra-low-probability tokens are never even considered. 
The model stays coherent while still being creative.Practical combination:Temperature 0.7 + Top-P 0.9 = Creative but coherentTemperature 0.2 + Top-P 1.0 = Deterministic and focusedTop-K: Limiting ChoicesTop-K simply limits the model to considering only the K most likely tokens.Example: Top-K = 50 means the model only looks at the 50 highest-probability tokens and ignores the rest.This is a simpler version of Top-P and less commonly used in modern systems.Putting It TogetherLet’s trace through a complete example:Your prompt: “Explain photosynthesis in simple terms”Tokenization: ["Explain", " photosynthesis", " in", " simple", " terms"]Model processing: Transformer calculates relationships between tokensProbability distribution for next token:Photosynthesis: 40% It: 15% The: 12% In: 8% [...]Configuration applied (Temperature 0.3, Top-P 0.9):Low temperature sharpens: Photosynthesis → 65%Model picks PhotosynthesisRepeat: Now the sequence is “Explain photosynthesis in simple terms Photosynthesis”Calculate probabilities for the next tokenPick based on configurationContinue until complete answer is generatedThe model never “understood” photosynthesis. It predicted tokens that statistically form explanations based on patterns from its training data.Now You Have the Mental ModelYou now understand the fundamental truth: LLMs are probability engines, not reasoning machines. Every response is just a statistical prediction of the next token, shaped by training data and controlled by configuration parameters.But here’s where it gets powerful: If you understand the mechanism, you can engineer the probabilities.Your prompt doesn’t just ask a question it shapes the entire probability landscape the model uses to generate its response. Change a few words, reorder your instructions, add an example, and suddenly different tokens become more likely. 
Different tokens mean different outputs.In the next part, we’re going to explore The Art & Science of Prompting how to deliberately craft prompts that steer those probabilities in your favor.The foundation is set. Now let’s learn to build on it.I’ve also set up a GitHub repository for this series, where I’ll be sharing the code and additional resources. Make sure to check it out and give it a star!Feel free to share your thoughts, comments, and insights below. Let’s learn and grow together!
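To close the loop on configuration, here is a small self-contained sketch (pure Python, no ML libraries) of how temperature and Top-P reshape a next-token distribution. The toy probabilities are the illustrative numbers from this post, not real model outputs:

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a token distribution: log-probabilities are divided by the
    temperature, then re-normalized (softmax). Low T sharpens, high T flattens."""
    logits = {tok: math.log(p) / temperature for tok, p in probs.items()}
    z = sum(math.exp(v) for v in logits.values())
    return {tok: math.exp(v) / z for tok, v in logits.items()}

def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches top_p (nucleus sampling), then re-normalize."""
    kept, total = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[tok] = p
        total += p
        if total >= top_p:
            break
    return {tok: p / total for tok, p in kept.items()}

# Toy distribution from the post (the long tail of other tokens is omitted).
toy = {"Paris": 0.85, "the": 0.05, "beautiful": 0.03, "London": 0.02}

cold = apply_temperature(toy, 0.2)  # sharper: Paris dominates even more
hot = apply_temperature(toy, 2.0)   # flatter: alternatives gain mass
print(round(cold["Paris"], 4), round(hot["Paris"], 4))
print(top_p_filter(toy, 0.9))       # the low-probability tail is never considered
```

Running this shows exactly the behavior described above: at temperature 0.2 "Paris" becomes nearly certain, at 2.0 the distribution flattens, and Top-P 0.9 discards "beautiful" and "London" entirely before any sampling happens.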

GitHub Wrapped 2025 is LIVE! 🎉Your year of code wrapped and ready to explore!Ever wondered what your year looked like in code?GitHub Wrapped turns your GitHub activity into a visual story that celebrates the work you actually did not just green dots on a graph.What Is GitHub Wrapped?GitHub Wrapped is your personal year-in-review for GitHub activity.Inspired by Spotify Wrapped and other year review tools, it takes your contributions and turns them into insights you can see, analyze, and share.With it you can:📆 See your contribution graph for the year🏆 Unlock fun badges that reflect your coding persona📌 Discover your top repositories and where you spent most of your energy🎨 Apply themes like Cyberpunk and Sunset Vibes for personalized style🔐 Optionally include private contributions using a personal access token (token is used locally only) no tracking or storageWhy You’ll Love ItAs developers, GitHub has become our public portfolio showing commits, collaborations, issues, and code that tell a story of growth, consistency, and learning.But GitHub itself doesn’t wrap that story up for you at the end of the year. GitHub Wrapped fills that gap by:✔ giving you a snapshot of the year✔ surfacing patterns you might miss in the daily grind✔ making something shareable and funWhat You’ll GetOnce you generate your wrapped summary, you’ll see:✨ Your year’s contribution graph🎭 Your developer persona who you were in code this year🏆 Badges that celebrate your style and activity📊 Top projects that defined your GitHub year🎨 Themes to make it your ownHow It WorksVisit 👉 GithubwrapupEnter your GitHub usernameAdd a personal access token to include private contributionsPick your year & themeClick GenerateYour GitHub year in code appears ready to explore and share!Here’s my Github 2025 Wrapped summary 👇

Are you overpaying for AI because of your language? If you’re building LLM applications in Spanish, Hindi, or Greek, you could be spending up to 6 times more than English users for the exact same functionality.This blog is inspired by the research paper Do All Languages Cost the Same? Tokenization in the Era of Commercial Language ModelsThe Hidden Tokenization TaxWhen you send text to GPT-4, Claude, or Gemini, your input gets broken into tokens: chunks roughly 3–4 characters long in English. You pay per token for both input and output.The shocking truth: The same sentence costs wildly different amounts depending on your language.
Real Example: “Hello, my name is Sarah”
English: 7 tokens → baseline → $16,425/year
Spanish: 11 tokens → 1.5× higher → $24,638/year (+$8,213)
Hindi: 35 tokens → 5× higher → $82,125/year (+$65,700)
Greek: 42 tokens → 6× higher → $98,550/year (+$82,125)
That’s an $82,000 annual difference for the exact same chatbot purely because of language.
The Complete Language Cost Breakdown
tokenization cost by language
Research from ACL 2023 and recent LLM benchmarks reveals systematic bias in how models tokenize different languages.
Here’s what it costs to process 24 major languages:Tokenization cost comparison across 24 languages showing how many times more expensive each language is compared to English due to tokenization differencesMost Efficient Languages (1.0–1.5x English)English: 1.0x (baseline)French: 1.2xItalian: 1.2xPortuguese: 1.3xSpanish: 1.5xModerately Expensive (1.6–2.5x)Korean: 1.6xJapanese: 1.8xChinese (Simplified): 2.0xArabic: 2.0xRussian: 2.5xHighly Expensive (3.0–6.0x)Ukrainian: 3.0xBengali: 4.0xThai: 4.0xHindi: 5.0xTamil: 5.0xTelugu: 5.0xGreek: 6.0x (most expensive)Why Writing Systems MatterchartsComparison of tokenization costs and efficiency across different writing systems, showing why Latin-based languages are most cost-effective for LLM applicationsThe script your language uses creates dramatic efficiency gaps:Latin script: 1.4x average (73.5% efficient)Hangul (Korean): 1.6x (63% efficient)Han/Japanese: 1.8–2.0x (50–56% efficient)Cyrillic: 2.75x average (36.5% efficient)Indic scripts: 4–5x average (20% efficient)Greek: 6.0x (17% efficient — worst)Why This Inequality Exists1. Training Data BiasGPT-4, Claude, and Gemini are trained on English-dominant datasets. The Common Crawl corpus shows stark imbalance:~60% English~10–15% combined for Spanish/French/German<5% for most other languagesTokenizers learn to compress what they see most. English gets ultra-efficient encoding; everything else is treated as “foreign.”2. Morphological ComplexityLanguages with rich morphology generate far more word variationsEnglish: “run” → runs, running, ran (4 forms)Turkish: Single root → 50+ forms with suffixesArabic: Root system → thousands of variationsHindi: Complex verb conjugations with gender/number/tenseTokenizers can’t learn compact patterns for high-variation, low-data languages.3. 
Unicode Encoding OverheadDifferent scripts need different byte counts:Latin: 1 byte per characterCyrillic: 2 bytes per characterDevanagari/Tamil: 3+ bytes per characterMore bytes = more tokens = higher cost even for the same semantic content.Real-World Cost ImpactHere’s what tokenization inequality means for actual business applications:Customer Support Chatbot (10,000 messages/day)English: $16,425/yearSpanish: $24,638/year (+50%, +$8,213)Hindi: $82,125/year (+400%, +$65,700)Content Generation Platform (1M words/month)English: $14,400/yearSpanish: $21,600/yearHindi: $72,000/yearDocument Translation Service (100K words/day)English: $65,700/yearSpanish: $98,550/year (+$32,850)Hindi: $328,500/year (+$262,800)Code Assistant (50K queries/day)English: $91,250/yearSpanish: $136,875/yearHindi: $456,250/year (+$365,000)Bottom line: A company serving Hindi users pays $262,800-$365,000 more annually than an identical English service.The Socioeconomic DimensionResearch reveals a disturbing -0.5 correlation between a country’s Human Development Index and LLM tokenization cost.Translation: Less developed countries often speak languages that cost more to process.Users in developing nations pay premium ratesCommunities with fewer resources face higher AI barriersThis creates “double unfairness” in AI democratizationExample: A startup in India building a Hindi customer service bot pays 5x more than a US competitor despite likely having far less funding.The Future of Fair AILanguage should never determine how much intelligence costs. Yet today, the world’s most spoken tongues pay a silent premium just to access the same models. Fixing this isn’t about optimization it’s about fairness. Until every language is tokenized equally, AI remains fluent in inequality.
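The Unicode overhead is easy to verify yourself: UTF-8 really does spend more bytes per character outside the Latin script. The sketch below checks that, then applies the post's multipliers to the baseline chatbot cost (the multipliers and the $16,425 baseline are the figures quoted above, not fresh measurements):

```python
# Bytes per character in UTF-8 for different scripts, plus a toy
# annual-cost estimate using this post's quoted multipliers.

samples = {
    "Latin (a)": "a",
    "Greek (α)": "α",
    "Cyrillic (д)": "д",
    "Devanagari (न)": "न",
    "Tamil (த)": "த",
}
for script, ch in samples.items():
    print(f"{script}: {len(ch.encode('utf-8'))} byte(s)")

BASELINE_USD = 16_425  # English chatbot cost per year, from the example above
multipliers = {"English": 1.0, "Spanish": 1.5, "Hindi": 5.0, "Greek": 6.0}

for lang, m in multipliers.items():
    print(f"{lang}: ${BASELINE_USD * m:,.0f}/year")
```

The byte counts confirm the table: one byte for Latin, two for Greek and Cyrillic, three for Devanagari and Tamil, and the Hindi line reproduces the $82,125 figure from the chatbot example.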

Welcome to Part 2 of the 2025 Voice AI Guide How to Build Your Own Real-Time Voice Agent.In this section, we’ll dive deep into Pipecat and create a simple “Hello World” program to understand how real-time voice AI works in practice.If you have not read Part 1, you can read it here.
Pipecat
Pipecat is an open-source Python framework developed by Daily.co for building real-time voice and multimodal conversational AI agents. It provides a powerful yet intuitive way to orchestrate audio, video, AI services, and transport protocols to create sophisticated voice assistants, AI companions, and interactive conversational experiences.What makes Pipecat special is its voice-first approach combined with a modular, composable architecture. Instead of building everything from scratch, you can focus on what makes your agent unique while Pipecat handles the complex orchestration of real-time audio processing, speech recognition, language models, and speech synthesis.
What You Can Build with Pipecat
Pipecat enables a wide range of applications:
Voice Assistants — Natural, streaming conversations with AI
AI Companions — Coaches, meeting assistants, and interactive characters
Phone Agents — Customer support, intake bots, and automated calling systems
Multimodal Interfaces — Applications combining voice, video, and images
Business Agents — Customer service bots and guided workflow systems
Interactive Games — Voice-controlled gaming experiences
Creative Tools — Interactive storytelling with generative media
Pipecat Architecture: How It Works
Understanding Pipecat’s architecture is crucial for building effective voice agents. The framework is built around three core concepts:
1. Frames
Frames are data packages that move through your application.
Think of them as containers that hold specific types of information:Audio frames — Raw audio data from microphonesText frames — Transcribed speech or generated responsesImage frames — Visual data for multimodal applicationsControl frames — System messages like start/stop signals2. Frame ProcessorsFrame processors are specialized workers that handle specific tasks. Each processor:Receives specific frame types as inputPerforms a specialized transformation (transcription, language processing, etc.)Outputs new frames for the next processorPasses through frames it doesn’t handleCommon processor types include:STT (Speech-to-Text) processors that convert audio frames to text framesLLM processors that take text frames and generate response framesTTS (Text-to-Speech) processors that convert text frames to audio framesContext aggregators that manage conversation history3. PipelinesPipelines connect processors together, creating a structured path for data to flow through your application. They handle orchestration automatically and enable parallel processing — while the LLM generates later parts of a response, earlier parts are already being converted to speech and played back to users.Voice AI Processing FlowHere’s how a typical voice conversation flows through a Pipecat pipeline:Audio Input — User speaks → Transport receives streaming audio → Creates audio framesSpeech Recognition — STT processor receives audio → Transcribes in real-time → Outputs text framesContext Management — Context processor aggregates text with conversation historyLanguage Processing — LLM processor generates streaming response → Outputs text framesSpeech Synthesis — TTS processor converts text to speech → Outputs audio framesAudio Output — Transport streams audio to user’s device → User hears responseThe key insight is that everything happens in parallel — this parallel processing enables the ultra-low latency that makes conversations feel natural.Hello World Voice Agent: Complete ImplementationNow 
let’s build a complete “Hello World” voice agent that demonstrates all the core concepts. This example creates a friendly AI assistant you can have real-time voice conversations with.
Prerequisites
Before we start, you’ll need:
Python 3.10 or later
uv package manager (or pip)
API keys from three services:
Deepgram for Speech-to-Text
OpenAI for the language model
Cartesia for Text-to-Speech
Project Setup
First, let’s set up our project:
# Install Pipecat with required services
uv add "pipecat-ai[deepgram,openai,cartesia,webrtc]"
Environment Configuration
Create a .env file with your API keys:
# .env
DEEPGRAM_API_KEY=your_deepgram_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
CARTESIA_API_KEY=your_cartesia_api_key_here
The code is somewhat long, so I haven’t shared all of it here; you can check out the complete code here.
async def main():
    """Main entry point for the Hello World bot."""
    bot = HelloWorldVoiceBot()
    await bot.run_bot()
Understanding the Code Structure
Let’s break down the key components of our Hello World implementation:
1. Service Initialization
# Speech-to-Text service
self.stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
# Language Model service
self.llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-3.5-turbo")
# Text-to-Speech service
self.tts = CartesiaTTSService(api_key=os.getenv("CARTESIA_API_KEY"), voice_id="...")
Each service is a frame processor that handles a specific part of the voice AI pipeline.
2. Pipeline Configuration
pipeline = Pipeline([
    transport.input(),                    # Audio input from browser
    self.stt,                             # Speech → Text
    self.context_aggregator.user(),       # Add to conversation history
    self.llm,                             # Generate response
    self.tts,                             # Text → Speech
    transport.output(),                   # Audio output to browser
    self.context_aggregator.assistant(),  # Save response to history
])
The pipeline defines the data flow: each processor receives frames, transforms them, and passes them to the next processor.
3.
Event-Driven Interactions@transport.event_handler("on_first_participant_joined")async def on_participant_joined(transport, participant): # Trigger bot to greet the user await task.queue_frame(LLMMessagesFrame(self.messages))Event handlers manage the conversation lifecycle — when users join/leave, when they start/stop speaking, etc.The diagram below shows a typical voice assistant pipeline, where each step happens in real-time:Running Your Hello World BotSave the code as hello_world_bot.pyRun the bot: python hello_world_bot.pyOpen your browser to http://localhost:7860Click “Connect” and allow microphone accessStart talking! Say something like “Hello, how are you?”The bot will:Listen to your speech (STT)Process it with OpenAI (LLM)Respond with natural speech (TTS)Remember the conversation contextFor more examples and advanced features, check out the Pipecat documentation and example repository.*What Next?Now that you’re familiar with Pipecat and can build your own real-time voice agent, it’s time to take the next step.In the upcoming part, we’ll explore how to run all models locally even on a CPU and build a fully offline voice agent.I’ve created a GitHub repository VoiceAgentGuide for this series, where we can store our notes and related resources. Don’t forget to check it out and share your feedback. Feel free to contribute or add missing content by submitting a pull request (PR).Stay tuned for the next part of the 2025 Voice AI Guide!
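Before you go, one way to internalize the Frames → Processors → Pipeline mental model is a tiny dependency-free toy version. To be clear, this is NOT Pipecat's real API, only the concept: each processor transforms the frame types it understands and passes everything else through:

```python
# A toy model of Pipecat's core concepts: frames flow through a chain of
# processors; each handles the frame kinds it knows and passes the rest
# along untouched. This mimics the idea only -- not the real Pipecat API.

class Frame:
    def __init__(self, kind, data):
        self.kind, self.data = kind, data

class Processor:
    def process(self, frame):
        return frame  # default behavior: pass-through

class FakeSTT(Processor):
    def process(self, frame):
        if frame.kind == "audio":  # audio in -> text transcript out
            return Frame("text", f"transcript of {frame.data!r}")
        return frame

class FakeLLM(Processor):
    def process(self, frame):
        if frame.kind == "text":   # text in -> generated reply out
            return Frame("text", f"reply to {frame.data!r}")
        return frame

class Pipeline:
    def __init__(self, processors):
        self.processors = processors
    def run(self, frame):
        for p in self.processors:
            frame = p.process(frame)
        return frame

pipe = Pipeline([FakeSTT(), FakeLLM()])
out = pipe.run(Frame("audio", "mic-bytes"))
print(out.kind, out.data)
```

The real framework adds async streaming, bidirectional frame flow, and transports on top, but the shape of the abstraction is the same.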

Imagine trying to teach a child who’s never seen the world to recognize a face, feel that fire is hot, or sense when it might rain. How would you do it?For centuries, we thought intelligence required something mystical a soul, consciousness, divine spark. But what if it’s just pattern recognition at an extraordinary scale? What if learning is simply tuning millions of tiny parameters until inputs map correctly to outputs?That’s the bold idea behind deep learning: mathematical systems that can learn any pattern, approximate any function, and tackle problems once thought uniquely human.In 1989, mathematicians proved the Universal Approximation Theorem showing that even a single hidden layer neural network can approximate any continuous function. In theory, such a network can learn to translate, recognize, play, or predict anything.But theory alone isn’t enough. The theorem says such a network exists not how to build or train it. That’s where the real craft of deep learning begins: finding the right weights, training efficiently, and learning patterns that generalize.Let’s unpack the six core ideas that make this possible.Note: This is a deep dive not a skim. Grab a coffee, settle in, and take your time. By the end, you’ll understand neural networks from the ground up, not just in words but in logic.1. Neural Networks: Universal Function ApproximatorsWhat Are We Trying to Do?Before we understand neural networks, let’s start with something simpler: what is a function?In mathematics, a function is a relationship that maps inputs to outputs. f(x) = 2x + 1 is a function. You give it x = 3, it returns 7. Simple, deterministic, predictable.But real-world problems involve functions we can’t write down. Consider:f(image) = "cat" or "dog"f(email_text) = "spam" or "not spam"f(patient_symptoms) = disease_probabilityThese are still functions they map inputs to outputs but we don’t know their mathematical form. 
Traditional programming can’t help us here because we can’t write explicit rules for every possible image or email.Building Blocks: The Artificial NeuronLet’s build from the ground up. Start with a single neuron the atomic unit of a neural network.A neuron does three things:Receives multiple inputs (x₁, x₂, x₃, …)Multiplies each input by a weight (w₁, w₂, w₃, …)Sums everything up and adds a bias: z = w₁x₁ + w₂x₂ + w₃x₃ + ... + bWhy this structure? Because it’s the simplest way to combine multiple pieces of information into a single decision.Geometry of a Neuron: Drawing a LineLet’s ground this in a real example.Problem: You’re a bank deciding whether to approve loans. You have two pieces of information:x₁ = Annual income (in thousands)x₂ = Credit scoreGoal: Separate “approve” from “reject” applications.A Single Neuron Creates a Line (2D) or Hyperplane (Higher Dimensions)The equation z = w₁x₁ + w₂x₂ + b is actually the equation of a line! Let's see how:Example neuron with specific weights:z = 0.5·income + 2·credit_score - 150This neuron outputs positive values for “approve” and negative for “reject”. The decision boundary is where z = 0:0 = 0.5·income + 2·credit_score - 150credit_score = 75 - 0.25·incomeThis is a line! Let’s plot it:What the weights mean geometrically:w₁ = 0.5: For every $1000 increase in income, the decision shifts by 0.5 units toward approvalw₂ = 2.0: For every 1-point increase in credit score, the decision shifts by 2 units toward approval (4× more important than income!)b = -150: The bias shifts the entire line. 
Without it, the line would pass through origin (0,0)The learning process is finding the right line:Start with a random line (random weights)See which points it classifies wrongAdjust the weights to rotate and shift the lineRepeat until the line best separates the two groupsWhat One Neuron Can and Cannot DoCannot separate (non-linearly separable):XOR is the classic example: you need (0,1) and (1,0) to be class 1, but (0,0) and (1,1) to be class 0. No single line can achieve this separation.This is why we need multiple layers.Multiple Neurons, Multiple Lines: Building Complex BoundariesIf one neuron creates one line, what happens with multiple neurons in one layer?Example: 3 neurons in one layerNeuron 1: z₁ = w₁₁x₁ + w₁₂x₂ + b₁ [Line 1]Neuron 2: z₂ = w₂₁x₁ + w₂₂x₂ + b₂ [Line 2]Neuron 3: z₃ = w₃₁x₁ + w₃₂x₂ + b₃ [Line 3]Each neuron draws a different line! But without additional layers, we still can’t solve XOR. Why? Because we’re just drawing multiple lines without combining them in complex ways.The key insight: We need to combine these lines non-linearly. This is where activation functions and depth come in.The Layer AbstractionNow stack multiple neurons side by side that’s a layer. 
Each neuron in the layer:
Receives the same inputs
Has its own unique weights and bias
Produces its own output
A layer with 10 neurons transforms one input vector into 10 different outputs, each representing a different “feature” or “pattern” it has detected.
Solving XOR: A Complete Example
Let’s solve XOR step-by-step to understand how layers work together.
The XOR Problem:
Input (x₁, x₂) → Output
(0, 0) → 0
(0, 1) → 1
(1, 0) → 1
(1, 1) → 0
Two-Layer Solution:
Layer 1: Create useful features (2 neurons with ReLU)
Neuron 1: Detects “at least one input is 1”
z₁ = x₁ + x₂ - 0.5
a₁ = ReLU(z₁)
Testing:
(0,0): z₁ = -0.5, a₁ = 0
(0,1): z₁ = 0.5, a₁ = 0.5
(1,0): z₁ = 0.5, a₁ = 0.5
(1,1): z₁ = 1.5, a₁ = 1.5
Neuron 2: Detects “both inputs are 1”
z₂ = x₁ + x₂ - 1.5
a₂ = ReLU(z₂)
Testing:
(0,0): z₂ = -1.5, a₂ = 0
(0,1): z₂ = -0.5, a₂ = 0
(1,0): z₂ = -0.5, a₂ = 0
(1,1): z₂ = 0.5, a₂ = 0.5
Layer 2: Combine features (1 neuron with Sigmoid)
z₃ = a₁ - 3·a₂ - 0.25
output = Sigmoid(z₃) (above 0.5 counts as class 1, below 0.5 as class 0)
Testing:
(0,0): z₃ = 0 - 0 - 0.25 = -0.25 → ≈0 ✓
(0,1): z₃ = 0.5 - 0 - 0.25 = 0.25 → ≈1 ✓
(1,0): z₃ = 0.5 - 0 - 0.25 = 0.25 → ≈1 ✓
(1,1): z₃ = 1.5 - 1.5 - 0.25 = -0.25 → ≈0 ✓
What happened geometrically?
Layer 1 transformed the space:
The first layer created new features where the problem becomes linearly separable!
a₁ captures “OR-ness” (at least one is true)
a₂ captures “AND-ness” (both are true)
Layer 2 drew a simple line in this new space:
a₁ - 3·a₂ = 0.25 [decision boundary]
This line easily separates XOR in the transformed space!
The key insight:
Layer 1: Creates useful intermediate features by drawing multiple lines/planes
Layer 2: Combines these features with another line/plane
Together: They can represent any decision boundary!
The Complete Architecture
A typical neural network:
Input Layer (raw data) → Hidden Layer 1 (low-level features) → Hidden Layer 2 (mid-level features) → Hidden Layer 3 (high-level features) → Output Layer (predictions)
The power lies not in any single neuron, but in the billions of connections between them, each with its own
weight, collectively forming a function approximator of extraordinary flexibility.Universal Approximation TheoremThe Universal Approximation Theorem (1989) proves:A neural network with just one hidden layer can approximate any continuous function, given enough neurons.But “enough” might mean billions, which is impractical.Deep (multi-layer) networks achieve the same expressive power more efficiently through hierarchical composition like compression for abstractions.So, in theory, neural networks can learn any mapping; in practice, depth makes it tractable.2. Activation Functions: Breaking LinearityThe Linear Trap: A Fundamental ProblemImagine we build a neural network with three layers, but we don’t use activation functions. Let’s trace through what happens mathematically:Layer 1: z₁ = W₁x + b₁Layer 2: z₂ = W₂z₁ + b₂ = W₂(W₁x + b₁) + b₂ = W₂W₁x + W₂b₁ + b₂Layer 3: z₃ = W₃z₂ + b₃ = W₃(W₂W₁x + W₂b₁ + b₂) + b₃Simplifying: z₃ = (W₃W₂W₁)x + (W₃W₂b₁ + W₃b₂ + b₃)Notice what happened? No matter how many layers we add, we always end up with: Wx + b a simple linear function. Matrix multiplication of matrices is still a matrix. We've built an expensive way to do simple linear regression.This is catastrophic. Linear functions can only model linear relationships. The real world is non-linear. The path of a thrown ball, the spread of a virus, the relationship between study time and test scores — all non-linear.The Solution: Non-Linear Activation FunctionsAfter each neuron computes its weighted sum, we pass it through a non-linear activation function: a = σ(z)This single addition breaks the linear trap. 
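With non-linearity in place, the XOR solution from earlier can be checked directly. This is a minimal sketch with hand-picked weights; it uses -3 (rather than -2) as the output neuron’s weight on a₂ so that every input lands on the correct side of the 0.5 threshold:

```python
import math

def relu(z): return max(0.0, z)
def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))

def xor_net(x1, x2):
    a1 = relu(x1 + x2 - 0.5)       # "OR-ness" feature
    a2 = relu(x1 + x2 - 1.5)       # "AND-ness" feature
    z3 = a1 - 3.0 * a2 - 0.25      # combine features (a2 weight hand-picked as -3)
    return sigmoid(z3)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    p = xor_net(x1, x2)
    print((x1, x2), round(p, 2), "->", int(p > 0.5))
```

All four inputs land on the correct side of 0.5, which a single linear layer can never achieve for XOR.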
Now our layers actually do different things, building increasingly complex representations.What Makes a Good Activation Function?Let’s think about what properties we need:Non-linearity (obviously, or we’re back where we started)Differentiability (we’ll need derivatives for learning)Computational efficiency (we’ll apply it billions of times)Avoid saturation (outputs shouldn’t always be at extremes)Zero-centered or positive (depending on the problem)Common Activation FunctionsReLU (Rectified Linear Unit): f(x) = max(0, x)Why it works:Dead simple: if input is positive, output equals input; if negative, output is zeroNon-linear despite looking linear (the “kink” at zero creates non-linearity)Computationally trivial: just one comparison and zero multiplicationDoesn’t saturate for positive values (unlike sigmoid)Induces sparsity: many neurons output exactly zero, creating efficient representationsThe problem:“Dying ReLU”: if a neuron’s weights push it permanently into negative territory, its gradient becomes zero and it stops learning foreverNot zero-centered: all outputs are positive, which can slow convergenceVariants:Leaky ReLU: f(x) = max(0.01x, x) — allows small gradients when x < 0, preventing deathELU (Exponential Linear Unit): Smooth curve for negative values, better learning dynamicsSigmoid: f(x) = 1/(1 + e^(-x))Why it exists:Squashes any input into range (0, 1)Historically motivated by biological neurons (firing rates between 0 and 1)Output can be interpreted as probabilitywhat’s happening?For large positive x: e^(-x) approaches 0, so output approaches 1For large negative x: e^(-x) approaches infinity, so output approaches 0At x = 0: output is 0.5Why it’s problematic:Vanishing gradients: For large positive or negative inputs, the sigmoid is nearly flat. The derivative approaches zero. During backpropagation, gradients get multiplied across layers; zeros multiply to deepen zeros. 
Deep networks can’t learn.Not zero-centered: Outputs always positive (0 to 1), causing zig-zagging during optimizationComputationally expensive: Exponential functionWhere it’s still used:Output layer for binary classification (want probability between 0 and 1)Tanh: f(x) = (e^x - e^(-x))/(e^x + e^(-x))Advantages over sigmoid:Zero-centered: outputs range from -1 to 1Stronger gradients: derivative at zero is 1 (compared to 0.25 for sigmoid)Still suffers from:Vanishing gradients for extreme valuesComputational cost of exponentialsSoftmax: f(x_i) = e^(x_i) / Σe^(x_j)Completely different purpose:Not used between hidden layersExclusively for multi-class classification output layersIn simple termsTakes a vector of arbitrary values (logits)Converts them into probabilities that sum to 1Exponentiation ensures all values are positiveDivision by sum ensures they sum to 1Higher inputs get exponentially higher probabilitiesExample:Input: [2.0, 1.0, 0.1]After softmax: [0.659, 0.242, 0.099]Notice: still ordered the same way, but now they’re probabilitiesWhy Different Layers Need Different ActivationsHidden Layers: ReLU family (efficiency, avoiding vanishing gradients)Binary Classification Output: Sigmoid (get probability for one class)Multi-class Classification Output: Softmax (get probability distribution over all classes)Regression Output: Often no activation (or linear) — we want the raw value, not a bounded one3. Forward Propagation: The Prediction ProcessWhat is Propagation?“Propagation” is just a fancy word for “passing information through.” Forward propagation is the process of taking input data and pushing it through every layer until we get a prediction.Let’s build this concept from absolute scratch.The Single Neuron CaseYou have:Input: x = 3Weight: w = 2Bias: b = 1Step 1: Linear combination z = wx + b = 2(3) + 1 = 7Step 2: Activation a = ReLU(z) = max(0, 7) = 7That’s it. The neuron outputs 7. 
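The single-neuron computation above fits in a few lines of code:

```python
def relu(z):
    return max(0.0, z)

def neuron(x, w, b):
    z = w * x + b        # linear combination: 2*3 + 1 = 7
    return relu(z)       # activation: max(0, 7) = 7

print(neuron(x=3, w=2, b=1))  # 7
```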
This output might be the final prediction (if it’s the only neuron), or it might be input to the next layer.
Multiple Inputs, Single Neuron
Now you have three inputs:
Inputs: x = [x₁=2, x₂=3, x₃=1]
Weights: w = [w₁=0.5, w₂=-1, w₃=2]
Bias: b = 1
Step 1: Weighted sum
z = w₁x₁ + w₂x₂ + w₃x₃ + b
z = 0.5(2) + (-1)(3) + 2(1) + 1
z = 1 - 3 + 2 + 1 = 1
Step 2: Activation a = ReLU(1) = 1
Single Layer: Multiple Neurons
Now suppose we have 3 neurons in one layer, all receiving the same 3 inputs.
Neuron 1:
Weights: [w₁₁, w₁₂, w₁₃], Bias: b₁
Output: a₁ = ReLU(w₁₁x₁ + w₁₂x₂ + w₁₃x₃ + b₁)
Neuron 2:
Weights: [w₂₁, w₂₂, w₂₃], Bias: b₂
Output: a₂ = ReLU(w₂₁x₁ + w₂₂x₂ + w₂₃x₃ + b₂)
Neuron 3:
Weights: [w₃₁, w₃₂, w₃₃], Bias: b₃
Output: a₃ = ReLU(w₃₁x₁ + w₃₂x₂ + w₃₃x₃ + b₃)
The layer transforms input vector [x₁, x₂, x₃] into output vector [a₁, a₂, a₃]
Matrix Representation: Scaling to Thousands of Neurons
Writing out every neuron individually is tedious. We use matrix notation:
Weight Matrix W:
W = [w₁₁ w₁₂ w₁₃]
    [w₂₁ w₂₂ w₂₃]
    [w₃₁ w₃₂ w₃₃]
Each row represents one neuron’s weights.
Input Vector x:
x = [x₁]
    [x₂]
    [x₃]
Forward propagation for the layer:
z = Wx + b
a = ReLU(z)
This single matrix multiplication computes all neurons simultaneously. With modern GPUs optimized for matrix operations, we can process thousands of neurons in parallel.
Deep Networks: Chaining Layers
Now stack multiple layers.
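That stacking can be sketched in plain Python. The shapes and random weights here are made up for illustration; each layer is one matrix-vector product followed by an activation:

```python
import random

random.seed(0)

def layer(W, b, x):
    # z = Wx + b, one row of W per neuron, then ReLU
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

# Made-up architecture: 3 inputs -> 4 hidden neurons -> 2 outputs
W1, b1 = rand_matrix(4, 3), [0.0] * 4
W2, b2 = rand_matrix(2, 4), [0.0] * 2

x = [2.0, 3.0, 1.0]
a1 = layer(W1, b1, x)    # the output of layer 1...
a2 = layer(W2, b2, a1)   # ...becomes the input to layer 2
print(len(a1), len(a2))  # 4 2
```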
The output of layer 1 becomes the input to layer 2:Layer 1:z¹ = W¹x + b¹a¹ = ReLU(z¹)Layer 2:z² = W²a¹ + b²a² = ReLU(z²)Layer 3 (output):z³ = W³a² + b³ŷ = softmax(z³) [if classification]The final output ŷ is our prediction.Concrete Example: Digit RecognitionInput: 28×28 pixel image of a handwritten digit (flattened to 784 values)Architecture:Input layer: 784 neuronsHidden layer 1: 128 neurons (with ReLU)Hidden layer 2: 64 neurons (with ReLU)Output layer: 10 neurons (with softmax for digits 0–9)Forward propagation:z¹ = W¹x + b¹ [128 values]a¹ = ReLU(z¹) [128 values]z² = W²a¹ + b² [64 values]a² = ReLU(z²) [64 values]z³ = W³a² + b³ [10 values]ŷ = softmax(z³) [10 probabilities summing to 1]Output might be: [0.01, 0.02, 0.05, 0.7, 0.1, 0.05, 0.03, 0.02, 0.01, 0.01]The network predicts “3” with 70% confidence (index 3 has highest probability).Why “Forward”?Because information flows in one direction: from input → through hidden layers → to output. No loops, no feedback (in standard feedforward networks). Each layer only looks forward, never backward.Later, during learning, we’ll propagate in the opposite direction (backward) to adjust weights. But prediction is always forward.4. Loss Functions: Quantifying ErrorWhy Do We Need Loss?Imagine you’re teaching a child to draw circles. They draw something. How do you tell them how “wrong” it is? You need a measurement some way to quantify the difference between what they drew and a perfect circle.Neural networks face the same problem. After forward propagation, we have a prediction ŷ. We also have the true answer y. 
The loss function L(ŷ, y) measures how wrong the prediction is.This single number is crucial because:It tells us how well the model is performingIt guides the learning process (we’ll adjust weights to minimize this number)Different problems need different ways of measuring “wrongness”Property Requirements for Loss FunctionsNon-negative: L ≥ 0 always (can't be "negative wrong")Zero when perfect: L = 0 when ŷ = y exactlyIncreases with error: Worse predictions → higher lossDifferentiable: We need gradients for learning (calculus requirement)Appropriate for the task: Regression vs classification need different measuresMean Squared Error (MSE): For RegressionThe Problem: Predict a continuous value (house price, temperature, stock price)The most intuitive approach: absolute difference |ŷ - y|If true value is 100 and we predict 90, error = 10Simple, interpretableBut there’s a problem: absolute value isn’t differentiable at zero (the derivative has a discontinuity). This complicates learning algorithms.Better approach: Square the differenceL = (ŷ - y)²Why squaring?Always positive (negative errors don’t cancel positive ones)Differentiable everywhere: dL/dŷ = 2(ŷ - y)Penalizes large errors more (error of 10 contributes 100, but error of 1 contributes only 1)Mathematically convenient (leads to elegant solutions)For multiple predictions (a batch):MSE = (1/n) Σᵢ(ŷᵢ - yᵢ)²We average across all samples to get a single loss value.Concrete Example:Predicting house pricesTrue prices: [200k, 300k, 250k]Predicted: [210k, 280k, 255k]Errors: [10k, -20k, 5k]Squared errors: [100M, 400M, 25M]MSE = (100M + 400M + 25M) / 3 = 175MThe large middle error dominates the loss, signaling that’s where improvement is needed most.Variant: MAE (Mean Absolute Error)MAE = (1/n) Σᵢ|ŷᵢ - yᵢ|More robust to outliers (doesn’t square them)Less sensitive to large errorsHarder to optimize (non-smooth at zero)Cross-Entropy Loss: For ClassificationThe Problem: Predict discrete categories (cat vs dog, spam vs ham, 
digit 0–9)MSE doesn’t work well here. Why? Because classification outputs are probabilities, and we need to measure “how wrong” a probability distribution is.Binary Cross-Entropy (Two Classes)Setup:True label: y ∈ {0, 1} (e.g., 0 = not spam, 1 = spam)Predicted probability: ŷ ∈ [0, 1] (from sigmoid activation)If true label is 1 (positive class):If we predict ŷ = 1.0 (certain it’s positive): perfect, loss should be 0If we predict ŷ = 0.9 (very confident): small lossIf we predict ŷ = 0.5 (uncertain): moderate lossIf we predict ŷ = 0.1 (confident it’s negative): large lossIf we predict ŷ = 0.0 (certain it’s negative): infinite loss (catastrophically wrong)The formula that captures this:L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]Why this works:Case 1: y = 1 (true class is positive)L = -log(ŷ)If ŷ = 1: L = -log(1) = 0 ✓If ŷ = 0.5: L = -log(0.5) ≈ 0.69If ŷ = 0.1: L = -log(0.1) ≈ 2.30If ŷ → 0: L → ∞ (massive penalty for confident wrong answer)Case 2: y = 0 (true class is negative)L = -log(1-ŷ)If ŷ = 0: L = -log(1) = 0 ✓If ŷ = 0.5: L = -log(0.5) ≈ 0.69If ŷ = 0.9: L = -log(0.1) ≈ 2.30If ŷ → 1: L → ∞The logarithm creates the right penalty structure: small errors have small losses, but confident mistakes are punished severely.Why “cross-entropy”?It comes from information theory. Cross-entropy measures the average number of bits needed to encode data from one distribution using another distribution. 
Here, we’re measuring the “distance” between the true distribution (y) and predicted distribution (ŷ).Categorical Cross-Entropy (Multiple Classes)Setup:True label: one-hot encoded vector (e.g., [0, 0, 1, 0, 0] for class 3)Predicted: probability distribution from softmax (e.g., [0.1, 0.2, 0.5, 0.15, 0.05])Formula:L = -Σᵢ yᵢ·log(ŷᵢ)Since y is one-hot (only one element is 1, rest are 0), this simplifies to:L = -log(ŷ_true_class)Example: Digit classification (0–9)True label: 7 → one-hot: [0,0,0,0,0,0,0,1,0,0]Predicted: [0.05, 0.05, 0.1, 0.05, 0.05, 0.05, 0.1, 0.4, 0.1, 0.05]Loss = -log(0.4) ≈ 0.916If the model had predicted 7 with 0.9 probability: Loss = -log(0.9) ≈ 0.105 (much better)Intuition: We only care about the probability assigned to the correct class. The loss increases as this probability decreases.Choosing the Right Loss FunctionRegression (predicting continuous values):MSE: Standard choice, penalizes large errors heavilyMAE: More robust to outliersHuber Loss: Combines benefits of both (MSE for small errors, MAE for large)Binary Classification:Binary Cross-Entropy: Standard choice when using sigmoid outputMulti-class Classification:Categorical Cross-Entropy: When labels are one-hot encodedSparse Categorical Cross-Entropy: When labels are integers (more memory efficient)Custom Loss Functions: Sometimes you need domain-specific losses. For example:Medical diagnosis: False negatives might be more costly than false positivesImage generation: Perceptual losses that compare high-level features, not pixelsReinforcement learning: Reward-based lossesThe loss function is the objective we’re optimizing. Choose it carefully — your model will become excellent at minimizing it, for better or worse.5. Backpropagation: The Learning AlgorithmThis step is crucial it’s where the real learning happens.Our neural network has millions of tiny adjustable numbers called weights. We make a prediction, compare it with the correct answer, and realize we’re off. 
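The loss values quoted above are easy to reproduce; this sketch uses the natural logarithm, as is standard for cross-entropy:

```python
import math

def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def binary_cross_entropy(y, y_hat):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# House-price example, values in thousands (so 175.0 here is the text's 175M)
print(mse([210, 280, 255], [200, 300, 250]))   # 175.0

# Binary cross-entropy for a positive example (y = 1)
print(round(binary_cross_entropy(1, 0.5), 2))  # 0.69
print(round(binary_cross_entropy(1, 0.1), 2))  # 2.3

# Categorical cross-entropy reduces to -log(p assigned to the true class)
print(round(-math.log(0.4), 3))                # 0.916
print(round(-math.log(0.9), 3))                # 0.105
```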
The big question is: how do we tweak those millions of weights to make the next prediction better?It’s not as simple as it sounds. Each weight affects many others, and changing even one can ripple through the entire network. Should we increase it or decrease it? And by how much?That’s where backpropagation comes in a beautifully systematic way to figure out exactly how every single weight should change to reduce the overall error.To really grasp what’s happening here, you’ll need a bit of comfort with calculus, especially with derivatives and how small changes in one variable affect another.The Core Insight: The Chain Rule of CalculusEverything in backpropagation stems from one calculus concept: the chain rule.Simple example: If z = f(y) and y = g(x), then:dz/dx = (dz/dy) · (dy/dx)In words: The rate of change of z with respect to x equals the rate of change of z with respect to y, multiplied by the rate of change of y with respect to x.This might seem abstract, so let’s make it concrete.Concrete Example: A Tiny NetworkArchitecture:One input: x = 2One weight: w = 3One bias: b = 1Activation: ReLUTrue output: y = 15Forward pass:z = wx + b = 3(2) + 1 = 7a = ReLU(z) = 7L = (a - y)² = (7 - 15)² = 64Loss is 64. We want to reduce it. Should we increase or decrease w?Backward pass (backpropagation):We need dL/dw (how much does loss change when we change w?).Using the chain rule:dL/dw = (dL/da) · (da/dz) · (dz/dw)Let’s calculate each piece:Step 1: dL/da (how does loss change with activation?)L = (a - y)²dL/da = 2(a - y) = 2(7 - 15) = -16Step 2: da/dz (how does activation change with pre-activation?)a = ReLU(z) = max(0, z)For z > 0: da/dz = 1For z ≤ 0: da/dz = 0Since z = 7 > 0: da/dz = 1Step 3: dz/dw (how does pre-activation change with weight?)z = wx + bdz/dw = x = 2Combine them:dL/dw = (dL/da) · (da/dz) · (dz/dw)dL/dw = (-16) · (1) · (2) = -32Interpretation: The gradient is -32. 
This means:If we increase w by a tiny amount, the loss will decrease by approximately 32 times that amountThe negative sign tells us to increase w (move opposite to the gradient)The magnitude (32) tells us how sensitive the loss is to changes in wUpdate the weight:w_new = w_old - learning_rate · (dL/dw)w_new = 3 - 0.01 · (-32) = 3 + 0.32 = 3.32We’ve just learned! The network adjusted its weight to reduce the loss.Scaling to Deep NetworksIn real networks with many layers, we calculate gradients layer by layer, moving backward from the output.Example: 3-layer networkForward pass:Layer 1: z¹ = W¹x + b¹, a¹ = ReLU(z¹)Layer 2: z² = W²a¹ + b², a² = ReLU(z²)Layer 3: z³ = W³a² + b³, ŷ = softmax(z³)Loss: L = CrossEntropy(ŷ, y)Backward pass:Layer 3 (output layer):dL/dz³ = ŷ - y [derivative of softmax + cross-entropy]dL/dW³ = (dL/dz³) · a²ᵀdL/db³ = dL/dz³dL/da² = W³ᵀ · (dL/dz³) [pass gradient to previous layer]Layer 2:dL/dz² = (dL/da²) ⊙ ReLU'(z²) [⊙ is element-wise multiplication]dL/dW² = (dL/dz²) · a¹ᵀdL/db² = dL/dz²dL/da¹ = W²ᵀ · (dL/dz²)Layer 1:dL/dz¹ = (dL/da¹) ⊙ ReLU'(z¹)dL/dW¹ = (dL/dz¹) · xᵀdL/db¹ = dL/dz¹Notice the pattern:Calculate gradient with respect to pre-activation (z)Calculate gradient for weights: dL/dW = (dL/dz) · inputᵀCalculate gradient for bias: dL/db = dL/dzPass gradient backward: dL/d(previous_activation) = Wᵀ · (dL/dz)Why “Backpropagation”?Because we propagate gradients backward through the network, from output to input. Each layer receives the gradient from the layer ahead, computes its own gradients, and passes gradients to the layer behind.The Vanishing Gradient ProblemFundamental issue in deep networks:When we multiply many small numbers (gradients) together through many layers, the product can become vanishingly small — approaching zero.Example: If each layer has gradient 0.1, after 10 layers:0.1¹⁰ = 0.0000000001The early layers receive essentially zero gradient and stop learning. 
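Returning to the single-weight example for a moment, the whole forward-backward-update cycle fits in a few lines:

```python
def relu(z): return max(0.0, z)
def relu_grad(z): return 1.0 if z > 0 else 0.0

# The tiny network from the worked example: x = 2, w = 3, b = 1, target y = 15
x, w, b, y = 2.0, 3.0, 1.0, 15.0
lr = 0.01

# Forward pass
z = w * x + b            # 7
a = relu(z)              # 7
loss = (a - y) ** 2      # 64

# Backward pass (chain rule)
dL_da = 2 * (a - y)            # -16
da_dz = relu_grad(z)           # 1
dz_dw = x                      # 2
dL_dw = dL_da * da_dz * dz_dw  # -32

# Gradient descent update
w = w - lr * dL_dw
print(loss, dL_dw, round(w, 2))  # 64.0 -32.0 3.32
```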
The network is deep, but only the last few layers are actually training.
Solutions:
ReLU activation: Gradient is 1 for positive inputs (doesn’t shrink)
Residual connections: Skip connections that allow gradients to bypass layers
Batch normalization: Keeps activations in a healthy range
Careful initialization: Start with weights that don’t lead to extreme activations
The Exploding Gradient Problem
The opposite issue: gradients grow exponentially.
If each layer has gradient 2, after 10 layers:
2¹⁰ = 1024
Weights update by huge amounts, causing wild oscillations and instability. The model never converges.
Solutions:
Gradient clipping: Cap gradients at a maximum value
Careful initialization: Start with smaller weights
Batch normalization: Stabilizes the scale of activations and gradients
Lower learning rates: Smaller update steps
Computational Efficiency: Why Backpropagation is Brilliant
Naive approach to finding gradients: For each weight, we could:
Make a tiny change: w → w + ε
Recalculate the entire loss
Compute: (L_new - L_old) / ε
For a network with 1 million weights, this requires 1 million forward passes. Computationally prohibitive.
Backpropagation insight: Calculate all gradients in a single backward pass by reusing intermediate calculations. For N weights, we need:
1 forward pass
1 backward pass
That’s it. Backpropagation computes all million gradients with just two passes through the network. This is why deep learning became practical.
The Mathematics: Derivatives of Common Components
ReLU:
f(x) = max(0, x)
f'(x) = 1 if x > 0, else 0
Sigmoid:
σ(x) = 1/(1 + e^(-x))
σ'(x) = σ(x)(1 - σ(x))
Tanh:
tanh(x) = (e^x - e^(-x))/(e^x + e^(-x))
tanh'(x) = 1 - tanh²(x)
Softmax + Cross-Entropy (combined):
dL/dz = ŷ - y
This remarkably simple gradient is why we use softmax with cross-entropy.
MSE:
L = (ŷ - y)²
dL/dŷ = 2(ŷ - y)
Memory Requirements
Backpropagation requires storing all activations from the forward pass to compute gradients in the backward pass.
For a network with:Batch size: 324 layers with 1000 neurons eachWe must store: 32 × 4 × 1000 = 128,000 activation values in memory.This is why training large models requires substantial GPU memory, and why techniques like gradient checkpointing (recomputing some activations rather than storing them) become necessary.6. Gradient Descent: The Optimization AlgorithmImagine you’re standing on a mountain in thick fog. You can’t see the bottom of the valley, but you can feel the slope beneath your feet. Your goal: reach the lowest point.Strategy: Take a step in the direction of steepest descent.This is gradient descent. The “mountain” is the loss landscape — a high-dimensional surface where each dimension represents one weight, and the height represents the loss.The Mathematical FoundationAfter backpropagation, we have gradients: ∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙEach gradient tells us:Direction: Positive gradient means loss increases when weight increasesMagnitude: Large gradient means weight strongly affects lossGradient descent update rule:w_new = w_old - α · (∂L/∂w)Where α (alpha) is the learning rate.Why subtract? The gradient points in the direction of increasing loss. We want to decrease loss, so we move in the opposite direction (negative gradient).The Learning Rate: The Most Critical HyperparameterThe learning rate controls the step size. Choosing it is an art and science.Too large (α = 1.0):Iteration 1: Loss = 100Iteration 2: Loss = 250 [overshot the minimum]Iteration 3: Loss = 80Iteration 4: Loss = 300 [wild oscillations]...never convergesToo small (α = 0.000001):Iteration 1: Loss = 100.00Iteration 2: Loss = 99.99Iteration 3: Loss = 99.98...painfully slow, might get stuck in local minimumJust right (α = 0.01):Iteration 1: Loss = 100Iteration 2: Loss = 85Iteration 3: Loss = 73...steady progress toward minimumTypical ranges:Small networks: 0.001–0.01Large networks: 0.0001–0.001With Adam optimizer: 0.001 (default)Variants of Gradient Descent1. 
Batch Gradient DescentApproach: Use the entire dataset to compute one gradient update.for epoch in range(num_epochs): # Compute gradient using ALL training samples gradient = compute_gradient(all_data) weights = weights - learning_rate * gradientPros:Smooth convergenceGuaranteed to find the minimum (for convex functions)Cons:Slow: One update per epochMemory intensive: Must load entire datasetGets stuck in local minima (for non-convex functions)2. Stochastic Gradient Descent (SGD)Approach: Use one random sample at a time.for epoch in range(num_epochs): shuffle(data) for sample in data: # Compute gradient using ONE sample gradient = compute_gradient(sample) weights = weights - learning_rate * gradientPros:Fast updates: One update per sampleCan escape local minima (due to noise)Memory efficientCons:Noisy updates: path to minimum is erraticDoesn’t fully utilize parallel computing (GPUs)May oscillate around minimum without settling3. Mini-Batch Gradient Descent (Most Common)Approach: Use a small batch of samples (typically 32, 64, 128, or 256).for epoch in range(num_epochs): shuffle(data) for batch in create_batches(data, batch_size=32): # Compute gradient using BATCH of samples gradient = compute_gradient(batch) weights = weights - learning_rate * gradientPros:Balanced: More stable than SGD, faster than batch GDEfficient: Perfect for GPU parallelizationModerate memory usageNoise helps escape local minima, but not too muchCons:Another hyperparameter to tune (batch size)This is the standard in modern deep learning.Advanced Optimizers: Beyond Basic Gradient DescentBasic gradient descent treats all parameters equally and uses a fixed learning rate. Modern optimizers are more sophisticated.MomentumProblem with basic GD: Imagine a narrow valley: steep sides, gentle slope toward minimum. 
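Plain gradient descent on a well-behaved one-parameter loss (a made-up toy problem, L(w) = (w - 5)²) shows the baseline behavior these optimizers improve on:

```python
# Toy one-parameter problem (not from the article): minimize L(w) = (w - 5)^2
w, lr = 0.0, 0.1
for _ in range(50):
    grad = 2 * (w - 5)   # dL/dw
    w -= lr * grad       # step against the gradient
print(round(w, 3))        # 5.0: converged to the minimum
```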
Basic GD oscillates between sides while slowly progressing forward.Solution: Momentumvelocity = 0for iteration: gradient = compute_gradient() velocity = β * velocity - learning_rate * gradient weights = weights + velocityIntuition: Remember previous gradients. If we keep going in the same direction, accelerate. If we oscillate, dampen the movement.Effect:Faster convergence in consistent directionsReduced oscillationsCan roll through small local minimaTypical β: 0.9 (use 90% of previous velocity)RMSprop (Root Mean Square Propagation)Problem: Some parameters need large updates, others need small ones. A single learning rate is suboptimal.Solution: Adapt the learning rate for each parameter based on recent gradient magnitudes.squared_gradient_avg = 0for iteration: gradient = compute_gradient() squared_gradient_avg = β * squared_gradient_avg + (1-β) * gradient² adjusted_gradient = gradient / (sqrt(squared_gradient_avg) + ε) weights = weights - learning_rate * adjusted_gradientIntuition:Parameters with consistently large gradients get smaller effective learning rates (divided by large number)Parameters with small gradients get larger effective learning rates (divided by small number)Effect: Each parameter gets its own adaptive learning rate.Adam (Adaptive Moment Estimation)The gold standard: Combines momentum and RMSprop.m = 0 # first moment (momentum)v = 0 # second moment (RMSprop)for iteration: gradient = compute_gradient() # Update moments m = β₁ * m + (1-β₁) * gradient v = β₂ * v + (1-β₂) * gradient² # Bias correction (important in early iterations) m_corrected = m / (1 - β₁^t) v_corrected = v / (1 - β₂^t) # Update weights weights = weights - learning_rate * m_corrected / (sqrt(v_corrected) + ε)Why Adam dominates:Combines best of both worlds: momentum + adaptive learning ratesRobust to hyperparameter choices (default values work well)Efficient and converges quicklyWorks across diverse problem typesDefault hyperparameters:learning_rate = 0.001β₁ = 0.9 (momentum)β₂ = 
0.999 (RMSprop)ε = 1e-8 (numerical stability)Learning Rate SchedulesEven with Adam, learning rates can be adjusted during training.1. Step DecayEpochs 1-30: lr = 0.001Epochs 31-60: lr = 0.0001Epochs 61+: lr = 0.00001Why: Start with larger steps to quickly find the general region, then smaller steps to fine-tune.2. Exponential Decaylr(t) = lr₀ * e^(-kt)Smoothly decreases learning rate over time.3. Cosine Annealinglr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(πt/T))Gradually reduces learning rate following a cosine curve.4. Warm RestartsPeriodically reset learning rate to initial value. Helps escape local minima by occasionally taking large steps again.5. Learning Rate WarmupStart with very small learning rate, gradually increase to target value over first few epochs. Prevents instability in early training.The Convergence Question: When to Stop?Training loss keeps decreasing but should we keep training?Early StoppingConcept: Monitor performance on a validation set (data the model hasn’t trained on).Epoch 1: Train Loss = 2.5, Val Loss = 2.6Epoch 5: Train Loss = 1.2, Val Loss = 1.3Epoch 10: Train Loss = 0.8, Val Loss = 0.9Epoch 15: Train Loss = 0.4, Val Loss = 0.85 [val loss stopped decreasing]Epoch 20: Train Loss = 0.2, Val Loss = 0.9 [val loss increasing!]Stop at epoch 10: Model is starting to overfit (memorizing training data rather than learning generalizable patterns).Implementation:best_val_loss = infinitypatience = 5 # epochs to wait for improvementpatience_counter = 0for epoch: train() val_loss = validate() if val_loss < best_val_loss: best_val_loss = val_loss save_model() patience_counter = 0 else: patience_counter += 1 if patience_counter >= patience: print("Early stopping!") breakChallenges in the Optimization LandscapeLocal MinimaThe loss surface has multiple valleys. 
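The step-decay schedule shown above can be written as a small helper; the epoch thresholds are taken from the example:

```python
def step_decay(epoch, lr0=0.001):
    # Drop the learning rate 10x at each boundary (epochs 30 and 60)
    if epoch <= 30:
        return lr0
    if epoch <= 60:
        return lr0 / 10
    return lr0 / 100

print(step_decay(1), step_decay(45), step_decay(90))  # lr shrinks 10x per stage
```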
Gradient descent might settle into a shallow local minimum instead of the deep global minimum.Solutions:Momentum (can roll over small bumps)Multiple random initializationsStochastic updates (noise helps escape)Saddle PointsPoints where gradient is zero but it’s neither a minimum nor maximum — a “saddle” shape. More common than local minima in high dimensions.Solutions:Momentum helps push throughSecond-order methods (Newton’s method)PlateausFlat regions where gradients are nearly zero. Progress stalls.Solutions:Adaptive learning rates (Adam)Patience (eventually gradients increase again)Batching and ParallelizationWhy batches matter for GPUs:Modern GPUs have thousands of cores. Computing gradients for 32 samples independently is slow. Computing them in parallel is fast.Matrix operations on batches:Input batch: [32 × 784] (32 images, 784 pixels each)Weights: [784 × 128]Output: [32 × 128] (32 outputs, 128 neurons)Single matrix multiplication computes all 32 samples simultaneously. This is why GPUs are essential for deep learning.Batch size trade-offs:Small batches (e.g., 8–32):More frequent updatesMore noise (helps generalization)Less memorySlower per epochLarge batches (e.g., 256–1024):Fewer updates per epochSmoother gradientsMore memory requiredFaster per epochRisk of poor generalization (too smooth)Sweet spot: Usually 32–128 for most applications.The Complete Training Loop: Putting It All TogetherNow we understand all the pieces. 
Here’s how they work together:Initialization# Initialize weights (Xavier/He initialization)for layer in network: layer.weights = random_normal(0, sqrt(2/n_inputs)) layer.biases = zeros()# Initialize optimizeroptimizer = Adam(learning_rate=0.001)Why careful initialization matters:Too large: Exploding activations and gradientsToo small: Vanishing gradientsXavier/He initialization: Scaled to maintain activation variance across layersThe Training Loopfor epoch in range(num_epochs): # Shuffle data for randomness shuffle(training_data) for batch in create_batches(training_data, batch_size=32): # 1. FORWARD PROPAGATION x, y_true = batch z1 = W1 @ x + b1 a1 = relu(z1) z2 = W2 @ a1 + b2 a2 = relu(z2) z3 = W3 @ a2 + b3 y_pred = softmax(z3) # 2. COMPUTE LOSS loss = cross_entropy(y_pred, y_true) # 3. BACKPROPAGATION dL_dz3 = y_pred - y_true dL_dW3 = dL_dz3 @ a2.T dL_db3 = sum(dL_dz3, axis=0) dL_da2 = W3.T @ dL_dz3 dL_dz2 = dL_da2 * relu_derivative(z2) dL_dW2 = dL_dz2 @ a1.T dL_db2 = sum(dL_dz2, axis=0) dL_da1 = W2.T @ dL_dz2 dL_dz1 = dL_da1 * relu_derivative(z1) dL_dW1 = dL_dz1 @ x.T dL_db1 = sum(dL_dz1, axis=0) # 4. OPTIMIZATION (using Adam) W3, b3 = optimizer.update(W3, b3, dL_dW3, dL_db3) W2, b2 = optimizer.update(W2, b2, dL_dW2, dL_db2) W1, b1 = optimizer.update(W1, b1, dL_dW1, dL_db1) # 5. VALIDATION val_loss = evaluate(validation_data) print(f"Epoch {epoch}: Train Loss = {loss:.4f}, Val Loss = {val_loss:.4f}") # 6. EARLY STOPPING CHECK if should_stop(val_loss): break# 7. 
FINAL EVALUATIONtest_accuracy = evaluate(test_data)print(f"Final Test Accuracy: {test_accuracy:.2%}")What Happens Over TimeEpoch 1:Weights are randomPredictions are terrible (10% accuracy on 10 classes = random guessing)Loss is high (maybe 2.3)Large gradientsBig weight updatesEpoch 10:Network learned basic patternsAccuracy improved to 60%Loss decreased to 1.2Moderate gradientsSteady learningEpoch 50:Network refined understandingAccuracy at 92%Loss at 0.3Small gradientsFine-tuning detailsEpoch 100:Diminishing returnsAccuracy 93% (validation starting to plateau)Risk of overfittingTime to stopMonitoring Training: What to Watch1. Training LossShould decrease steadilyIf fluctuating wildly: learning rate too highIf barely moving: learning rate too low or stuck in minimum2. Validation LossShould track training loss initiallyIf diverging: overfittingIf much higher from start: train/val data distribution mismatch3. Gradient NormsShould be moderate (0.001–1.0)If very small (< 0.0001): vanishing gradientsIf very large (> 10): exploding gradients4. Activation StatisticsMean should be near zeroStd should be moderate (~1)If activations saturate (all 0 or all max): architectural problem5. Learning RateCan be adjusted based on progressToo aggressive: divergenceToo conservative: slow progressConclusion: The Symphony of LearningMachine learning is not one algorithm — it’s a carefully orchestrated system:Architecture provides the capacity to represent complex functions (Universal Approximation Theorem)Activation functions enable non-linear transformationsForward propagation generates predictionsLoss functions quantify errorBackpropagation computes gradients efficientlyGradient descent iteratively improves weightsEach component is essential. Remove any one, and learning fails.The beauty lies in the simplicity of each piece and the power of their combination. 
From these building blocks — matrix multiplications, non-linear functions, derivatives, and iterative updates — emerges the capability to:Recognize faces in photosTranslate between languagesGenerate realistic imagesPlay games at superhuman levelsPredict protein structuresDrive cars autonomouslyAll from the same fundamental algorithm, repeated billions of times, gradually sculpting random weights into a representation of the world’s patterns.This is how machines learn: not through magic, but through mathematics, iteration, and the elegant interplay of calculus and optimization across high-dimensional spaces.

Over the past few months I've been building a fully open-source voice agent, exploring the stack end to end and learning a ton along the way. Now I'm ready to share everything I discovered.

The best part? In 2025 you actually can build one yourself. With today's open-source models and frameworks you can piece together a real-time voice agent that listens, reasons, and talks back almost like a human, without relying on closed platforms.

Let's walk through the building blocks, step by step.

## The Core Pipeline

At a high level, a modern voice agent looks like this: audio in → Speech-to-Text → LLM → Text-to-Speech → audio out. Pretty simple on paper, but each step has its own challenges. Let's dig deeper.

## Speech-to-Text (STT)

Speech is a continuous audio wave; it doesn't naturally have clear sentence boundaries or pauses. That's where Voice Activity Detection comes in:

- VAD (Voice Activity Detection): detects when the user starts and stops talking. Without it, your bot either cuts you off too soon or stares at you blankly.

Once the boundaries are clear, the audio is passed to an STT model for transcription.

Silero VAD is the gold standard, and Pipecat has built-in support for it, so that's what I chose:

- Sub-1 ms per chunk on CPU
- Just 2 MB in size
- Handles 6000+ languages
- Works with 8 kHz and 16 kHz audio
- MIT license (unrestricted use)

### Popular STT Options

What should we focus on when choosing an STT model for a voice agent?

Accuracy:
- Word Error Rate (WER): measures transcription mistakes (lower is better). Example: a WER of 5% means 5 mistakes per 100 words.
- Sentence-level correctness: some models get individual words right but fail on sentence structure.
- Multilingual support: if your users speak multiple languages, check language coverage.
- Noise tolerance: can it handle background noise, music, or multiple speakers?
- Accent/voice variation handling: works across accents, genders, and speech speeds.
- Voice Activity Detection (VAD) integration: detects when speech starts and ends.

Streaming: most STT models work in batch mode (great for YouTube captions, bad for live conversations). For real-time agents, we need streaming output: words should appear while you're still speaking.

Low latency: even 300–500 ms delays feel unnatural. Target sub-second responses.

Whisper often comes to mind first when discussing speech-to-text because it has a large community, numerous variants, and is backed by OpenAI.

OpenAI Whisper family:
- Whisper Large V3 — state-of-the-art accuracy with multilingual support
- Faster-Whisper — optimized implementation using CTranslate2
- Distil-Whisper — lightweight, for resource-constrained environments
- WhisperX — enhanced timestamps and speaker diarization

NVIDIA also offers some interesting STT models, though I haven't tried them yet since Whisper works well for my use case. I'm listing them here for you to explore:
- Canary Qwen 2.5B — leading performance, 5.63% WER
- Parakeet TDT 0.6B V2 — ultra-fast inference (3,386 RTFx)

Here is the comparison table.

### Why I Chose FastWhisper

After testing, my pick is FastWhisper, an optimized inference engine for Whisper.

Key advantages:
- 12.5× faster than the original Whisper
- 3× faster than Faster-Whisper with batching
- Sub-200 ms latency possible with proper tuning
- Same accuracy as Whisper
- Runs on CPU and GPU with automatic fallback

It's built on C++ and CTranslate2, supports batching, and integrates neatly with VAD. For more, see the Speech to Text AI Model & Provider Leaderboard.

## Large Language Model (LLM)

Once speech is transcribed, the text goes into an LLM, the "brain" of your agent.

What we want in an LLM for voice agents:
- Understands prompts, history, and context
- Generates responses quickly
- Supports tool calls (search, RAG, memory, APIs)

Leading open-source LLMs:

Meta Llama family:
- Llama 3.3 70B — open-source leader
- Llama 3.2 (1B, 3B, 11B) — scaled for different deployments
- 128K context window — remembers long conversations
- Tool calling support — built-in function execution

Others:
- Mistral 7B / Mixtral 8x7B — efficient and competitive
- Qwen 2.5 — strong multilingual support
- Google Gemma — lightweight but solid

### My Choice: Llama 3.3 70B Versatile

Why?
- Large context window → keeps conversations coherent
- Tool use built in
- Widely supported in the open-source community

## Text-to-Speech (TTS)

Now the agent needs to speak back, and this is where quality can make or break the experience. A poor TTS voice instantly ruins immersion. The key requirements are:

- Low latency: avoid awkward pauses
- Natural speech: no robotic tone
- Streaming output: start speaking mid-sentence

### Open-Source TTS Models I've Tried

There are plenty of open-source TTS models available. Here's a snapshot of the ones I experimented with:

- Kokoro-82M — lightweight, #1 on the HuggingFace TTS Arena, blazing fast
- Chatterbox — built on Llama, fast inference, rising adoption
- XTTS-v2 — zero-shot voice cloning, 17 languages, streaming support
- FishSpeech — natural dialogue flow
- Orpheus — scales from 150M to 3B parameters
- Dia — capable of generating ultra-realistic dialogue in one pass

### Why I Chose Kokoro-82M

Key advantages:
- 5–15× smaller than competing models while maintaining high quality
- Runs under 300 MB — edge-device friendly
- Sub-300 ms latency
- High-fidelity 24 kHz audio
- Streaming-first design — natural conversation flow

Limitations:
- No zero-shot voice cloning (uses a fixed voice library)
- Less expressive than XTTS-v2
- Relatively new model with a smaller community

You can also check out my minimal Kokoro-FastAPI server to experiment with it.

## Speech-to-Speech Models

Speech-to-Speech (S2S) models represent an exciting advancement in AI, combining speech recognition, language understanding, and text-to-speech synthesis into a single, end-to-end pipeline. These models allow natural, real-time conversations by converting speech input directly into speech output, reducing latency and minimizing intermediate processing steps.

Some notable models in this space:

- Moshi: developed by Kyutai-Labs, Moshi is a state-of-the-art speech-text foundation model designed for real-time full-duplex dialogue.
Unlike traditional voice agents that process ASR, LLM, and TTS separately, Moshi handles the entire flow end-to-end.
- CSM (Conversational Speech Model): a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.
- VALL-E & VALL-E X (Microsoft): support zero-shot voice conversion and speech-to-speech synthesis from limited voice samples.
- AudioLM (Google Research): leverages language modeling on audio tokens to generate high-quality speech continuation and synthesis.

Among these, I've primarily worked with Moshi. I've implemented it on a FastAPI server with streaming support, which lets you test and interact with it in real time. You can explore the implementation here: FastAPI + Moshi GitHub.

## Framework (The Glue)

Finally, you need something to tie all the pieces together: streaming audio, message passing, and orchestration.

Open-source frameworks:

Pipecat
- Purpose-built for voice-first agents
- Streaming-first (ultra-low latency)
- Modular design — swap models easily
- Active community

Vocode
- Developer-friendly, good docs
- Direct telephony integration
- Smaller community, less active

LiveKit Agents
- Based on WebRTC
- Supports voice, video, and text
- Self-hosting options

Traditional orchestration:
- LangChain — great for docs, weak at streaming
- LlamaIndex — RAG-focused, not optimized for voice
- Custom builds — total control, but high overhead

### Why I Recommend Pipecat

Voice-centric features:
- Streaming-first, frame-based pipeline (TTS can start before the text is done)
- Smart Turn Detection v2 (intonation-aware)
- Built-in interruption handling

Production ready:
- Sub-500 ms latency achievable
- Efficient for long-running agents
- Excellent docs and examples
- Strong, growing community

Real-world performance:
- ~500 ms voice-to-voice latency in production
- Works with Twilio and phone systems
- Supports multi-agent orchestration
- Scales to thousands of concurrent users

## Lead-in to the Next Part

In this first part, we've covered the core tech stack and models needed to build a real-time voice agent.

In the next part of the series, we'll dive into integration with Pipecat, explore our voice architecture, and walk through deployment strategies. Later, we'll show how to enhance your agent with RAG (Retrieval-Augmented Generation), memory features, and other advanced capabilities to make your voice assistant truly intelligent.

Stay tuned: the next guide will turn all these building blocks into a working, real-time voice agent you can actually deploy.

I've created a GitHub repository, VoiceAgentGuide, for this series, where we can store our notes and related resources. Don't forget to check it out and share your feedback. Feel free to contribute or add missing content by submitting a pull request (PR).

## Resources

- Voice AI & Voice Agents: An Illustrated Primer
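As a closing mental model, the STT → LLM → TTS turn described in this post can be stubbed out in a few lines of plain Python. Every function here is an illustrative placeholder, not Pipecat's API; a real agent would stream audio frames through models like FastWhisper, Llama 3.3, and Kokoro:

```python
# Minimal sketch of one voice-agent turn: STT -> LLM -> TTS.
# All components are stubs standing in for real streaming models.
def stt(audio_chunks):
    """Pretend transcription: join 'audio' chunks into a transcript."""
    return " ".join(audio_chunks)

def llm(transcript, history):
    """Pretend LLM: record the turn and return a canned reply."""
    history.append(("user", transcript))
    reply = f"You said: {transcript}"
    history.append(("assistant", reply))
    return reply

def tts(text):
    """Pretend synthesis: yield 'audio' word by word (streaming-first)."""
    for word in text.split():
        yield word  # in reality: PCM frames, playable before the sentence ends

def voice_turn(audio_chunks, history):
    transcript = stt(audio_chunks)
    reply = llm(transcript, history)
    return list(tts(reply))

history = []
frames = voice_turn(["hello", "agent"], history)
print(frames)
```

The generator-based `tts` stub is the important part of the sketch: it is why streaming-first frameworks like Pipecat can start playing audio before the LLM has finished its sentence.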

While browsing YouTube, I stumbled across a video titled "This Book Changed How I Think About AI." Curious, I clicked, and it introduced me to Empire of AI by Karen Hao, a book that dives deep into the evolution of OpenAI.

The book explores OpenAI's history, its culture of secrecy, and its almost single-minded pursuit of artificial general intelligence (AGI). Drawing on interviews with more than 260 people, along with correspondence and internal documents, Hao paints a revealing picture of the company.

After reading it, I uncovered 12 particularly fascinating facts about OpenAI that most people don't know. Let's dive in.

## 1. The "Open" in OpenAI Was More Branding Than Belief

The name sounds noble: who doesn't like the idea of "open" AI? But here's the catch: from the very beginning, openness was more narrative than commitment. Founders Sam Altman, Greg Brockman, and Elon Musk leaned into it because it helped them stand out. Behind closed doors, though, cofounder Ilya Sutskever was already suggesting they could scale it back once the story had served its purpose. In other words: open, until it wasn't convenient.

## 2. Elon Musk's Billion-Dollar Promise? Mostly Smoke and Mirrors

Remember Musk's flashy $1 billion funding pledge? It turns out OpenAI only ever saw about $130 million of it, and less than $45 million came directly from Musk himself. His back-and-forth on funding almost pushed the organization into crisis, forcing Altman to hunt down new sources of money.

## 3. The For-Profit Shift Was More About Survival Than Vision

In 2019, OpenAI unveiled its "capped-profit" structure, pitching it as an innovative way to balance mission and money. But the truth is far less glamorous: the nonprofit model wasn't bringing in the billions needed to compete with tech giants. At one point, Brockman and Sutskever even discussed merging with a chip startup. Creating OpenAI LP wasn't a bold vision; it was a lifeline.

## 4. The "Capped-Profit" Model Looked Unlimited to Critics

Investors were told their returns would be capped at 100x. Sounds responsible, right? But do the math: a $10 million check could still turn into a $1 billion payout. Critics quickly called it "basically unlimited," arguing the cap only looked meaningful until you saw the actual numbers.

## 5. GPT-2's "Too Dangerous" Storyline Was a PR Masterstroke

In 2019, OpenAI said its GPT-2 model was so powerful it had to be withheld for safety reasons. Headlines exploded. But here's the twist: many researchers thought the risk claims were overblown and saw the whole thing as a publicity stunt engineered by Jack Clark, OpenAI's communications chief at the time. The stunt worked — the company was suddenly everywhere.

## 6. OpenAI's Culture Had Clashing "Tribes"

Inside OpenAI, things weren't exactly harmonious. Sam Altman himself described the organization as divided into three factions: research explorers, safety advocates, and startup-minded builders. He even warned of "tribal warfare" if they couldn't pull together. That's not just workplace tension; it's a sign of deep conflict over the company's direction.

## 7. ChatGPT's Global Debut Was Basically an Accident

Think ChatGPT's launch was carefully choreographed? Not at all. The product that made OpenAI a household name was released in just two weeks as a "research preview," right after Thanksgiving 2022. The rush was partly to get ahead of a rumored chatbot from Anthropic. Even Microsoft, OpenAI's biggest partner, was caught off guard and reportedly annoyed.

## 8. Training Data Included Pirated Books and YouTube Videos

Where do you get enough data to train something like GPT-3 or GPT-4? In OpenAI's case, by scraping almost everything it could. GPT-3 used a secret dataset nicknamed "Books2," which reportedly included pirated works from Library Genesis. GPT-4 went even further, with employees transcribing YouTube videos and scooping up anything online without explicit "do not scrape" warnings.

## 9. "AI Safety" Initially Ignored Social Harms

OpenAI loves to talk about AI safety now. But early on, executives resisted calls to broaden the term to include real-world harms like discrimination and bias. When pressed, one leader bluntly said, "That's not our role." The message was clear: safety meant existential risks, not everyday impacts.

## 10. Scaling Up Came with Hidden Environmental Costs

Bigger models require more compute and more resources. Training GPT-4 in Microsoft's Iowa data centers consumed roughly 11.5 million gallons of water in a single month, during a drought. Strikingly, Altman and other leaders reportedly never discussed these environmental costs in company-wide meetings.

## 11. "SummerSafe LP" Had a Dark Inspiration

Before OpenAI LP had its public name, it was secretly incorporated as "SummerSafe LP." The reference? An episode of Rick and Morty in which a car, tasked with keeping Summer safe, resorts to murder and torture. Internally, it was an ironic nod to how AI systems can twist well-meaning goals into dangerous outcomes.

## 12. Departing Employees Faced Equity Pressure

Leaked documents revealed that OpenAI used a hardball tactic with departing employees: sign a strict nondisparagement agreement or risk losing vested equity. This essentially forced people into lifelong silence. Altman later said he didn't know this was happening and was embarrassed, but records show he had signed paperwork granting the company those rights a year earlier.

## Final Thoughts

OpenAI's story is anything but straightforward. From broken promises and internal clashes to controversial data practices, the company has often operated in ways that don't match its public messaging. Whether you see that as savvy strategy, messy growing pains, or something more troubling depends on your perspective.

But one thing's clear: the "open" in OpenAI has always been complicated.

This blog was originally published here.

As regular readers of my blog may know, our primary technology stack is the MERN stack: MongoDB, Express, React, and Node.js. On the frontend, we use React with TypeScript; on the backend, Node.js with TypeScript; and MongoDB serves as our database.

While this stack has served us well, we encountered significant challenges as our application scaled, particularly around build times, memory usage, and developer experience. In this post, I will outline two key areas where Rust-based tools helped us resolve these issues and substantially improved our team's development velocity.

## Improving Frontend Performance

### The Problem: Slow Builds and Poor Developer Experience

As our frontend codebase grew, we began facing several recurring issues:

- Local development startup times became painfully slow.
- Build processes consumed large amounts of memory.
- On lower-end machines, builds caused systems to hang or crash.
- Developers regularly raised concerns about delays and performance bottlenecks.

These issues were primarily due to our use of Create React App (CRA) with an ejected Webpack configuration. While powerful, this setup became increasingly inefficient for our scale and complexity.

### First Attempt: Migrating to Vite

In search of a solution, I explored Vite, a build tool known for its speed and modern architecture.

Benefits:
- Faster initial load times due to native ES module imports.
- Noticeable improvement in development server startup.

Challenges:
- Migrating from an ejected CRA setup was complex due to custom Webpack configurations.
- Issues arose with lazy-loaded routes, SVG assets, and ESLint/type-checking delays.
- Certain runtime errors occurred during navigation, likely due to missing or incorrect Vite configurations.

Ultimately, while Vite offered some performance benefits, it did not fully resolve our problems and introduced new complications.

### Final Solution: Adopting Rspack

After further research, we came across Rspack, a high-performance Webpack-compatible bundler written in Rust.
What caught my attention was its focus on performance and ease of migration.

Key advantages of Rspack:
- Significantly faster build times: up to 70% improvement in our case.
- Reduced memory consumption during both build and development.
- Compatibility with existing Webpack plugins and configurations, which simplified migration.
- Designed as a drop-in replacement for Webpack.

After resolving a few initial issues, we successfully integrated Rspack into our frontend build system. The migration resulted in substantial improvements in build speed and developer satisfaction. The system is now in production with no reported issues, and developers are once again comfortable working on the frontend.

## Accelerating Backend Testing

### The Problem: Slow Kubernetes-Based Testing Cycle

Our backend uses Kubernetes for deployment and testing. The typical development workflow looked like this:

1. A developer makes code changes.
2. A Docker image is built and pushed to a registry via a GitHub Action.
3. The updated image is deployed to the Kubernetes cluster.
4. Testers verify the changes.

This process, while standard, became inefficient. Even small changes (such as adding a log statement) required a full image build and redeployment, resulting in delays of 15 minutes or more per test cycle.

### Optimization: Runtime Code Sync

To address this, we wrote a shell script that runs whenever the pod starts or restarts; it pulls the latest changes from GitHub before running the code:

```shell
git reset --hard origin/$BRANCH_NAME
git pull origin $BRANCH_NAME
```

This significantly reduced testing turnaround time for JavaScript-based services.

### The TypeScript Bottleneck

However, for services written in TypeScript, the situation was more complex. After pulling the latest code, we needed to transpile TypeScript to JavaScript using tsc or npm run build.
Unfortunately, this process:

- Consumed excessive memory.
- Took too long to complete.
- Caused pods to crash, especially in test environments with limited resources.

### Solution: Integrating SWC

To solve this, we adopted SWC, a Rust-based TypeScript compiler. Unlike tsc, SWC focuses on speed and performance.

Results after integrating SWC:
- Compilation time reduced to approximately 250 milliseconds.
- Memory usage dropped significantly.
- Live code updates became possible without full builds or redeployments.

Because SWC does not perform type checking, we use it only in test environments. This tradeoff allows testers to verify code changes rapidly, without impacting our production pipeline.

## Conclusion: Rust's Impact on Team Efficiency

In both our frontend and backend workflows, Rust-based tools (Rspack and SWC) delivered substantial improvements:

- Frontend build times were reduced by more than 70%, with better memory efficiency.
- Testing cycles became significantly faster, especially for TypeScript services.
- Developer experience improved across the board, reducing frustration and increasing velocity.

Rust's performance characteristics, coupled with thoughtful tool design, played a critical role in resolving bottlenecks in our JavaScript-based systems. For teams facing similar challenges, especially around build performance and scalability, we strongly recommend exploring Rust-powered tools as a viable solution.
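For readers who want to try the SWC approach described above, a minimal `.swcrc` looks roughly like the sketch below. This is an illustrative configuration, not our exact production file; adjust `target` and the module type to match your Node.js runtime:

```json
{
  "jsc": {
    "parser": {
      "syntax": "typescript",
      "tsx": false
    },
    "target": "es2020"
  },
  "module": {
    "type": "commonjs"
  },
  "sourceMaps": true
}
```

With `@swc/cli` and `@swc/core` installed, `npx swc src -d dist` then transpiles the `src` tree in milliseconds; remember that type checking still needs a separate `tsc --noEmit` pass if you want it.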
