data engineer × explorer

I build data pipelines, break things, and write about both.

Software data engineer. Distributed systems, real-time streaming, and making sense of the world through data and travel.

42 posts · 12 pipelines · 8 countries · 5PB+ processed
01 recent posts
3 Days in Manali — Snow, Chai & Solang Valley
Building Real-Time Pipelines with Kafka + Flink
Why I Stopped Chasing Productivity Hacks
Goa in December — Not What I Expected
Data Modeling Patterns for Analytics at Scale
On Being Comfortable With Not Knowing
02 projects
stream-engine
Real-time event processing system. Handles 100K events/sec with sub-second latency.
100K events/sec
kafka flink clickhouse docker
data-quality-fw
Automated data quality checks & monitoring across 200+ tables with alerting.
200+ tables
great-expectations dbt python airflow
query-optimizer
SQL analysis and optimization tool for Trino queries with execution insights.
5M+ queries analyzed
trino python react
this-blog
Personal portfolio. Auto-published via Claude AI with image optimization.
Live
astro typescript cloudinary vercel
3 Days in Manali — Snow, Chai & Solang Valley

January 12, 2025

Manali in January is a completely different beast. I've been to Himachal three times before, but this was my first winter trip, and honestly, I wasn't prepared for how magical snow-covered mountains could be when you're sipping hot chai at 7,000 feet.

Day 1: Solang Valley Surprise

We arrived late morning after a 15-hour drive from Delhi. The moment we hit the main market, it was clear this wasn't the crowded tourist trap I expected. January = local season, apparently.

There's something about standing in snow for the first time in a year. Everything slows down. Your breath becomes visible. Time feels different.

Solang Valley was the highlight—skiing wasn't on the agenda (prices were ridiculous), but walking through fresh powder with mountain peaks disappearing into clouds was surreal enough. The air was crisp, almost sharp in your lungs.

Day 2: Lost in Locals

We skipped the tourist circuit and just walked. Old Town Manali, tiny cafes, chai stalls with uncles debating politics, shops selling Himachali wool. Found this incredible restaurant run by a 60-year-old woman who's been making momos for 30 years.

She told us stories about Manali in the 80s when it was just a village. Now it's crawling with startups and digital nomads in peak season. But in January? It's back to being a real place.

Day 3: Rohtang & Reflections

Drove up to Rohtang Pass early morning. The road was treacherous—snow, ice, tight turns—but the view was worth every moment of clutching the car handle.

Looking down at three valleys simultaneously, I realized why people come to mountains to think. There's something about scale that puts everything in perspective. Your problems look smaller up there.

Filed under travel · 3,200 words · 12 min read

Building Real-Time Pipelines with Kafka + Flink

January 8, 2025

Real-time data pipelines are the new normal. If you're still batch-processing everything daily, you're essentially driving a sports car in first gear. Let's talk about why Kafka + Flink is the combo changing the game.

The Architecture

Here's what we built at scale (100K events/sec):

// Producer: Events stream into Kafka
const producer = KafkaProducer({
  brokers: ['kafka-1:9092', 'kafka-2:9092'],
  topic: 'events.raw',
  compression: 'snappy'
})

// Stream in, events out: no buffering
producer.send(event)

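Kafka acts as the buffer. Flink processes. ClickHouse stores. This separation of concerns is crucial: if Flink or ClickHouse hiccups, Kafka keeps the events (within its retention window) and the job resumes from its last committed offset, so you don't lose data.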
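
To make the buffering point concrete, here's a minimal sketch in plain Java of why a durable log protects you. It is not the Kafka client API: `EventLog` and its `append`, `poll`, and `commit` methods are hypothetical names made up for illustration, and a single in-memory list stands in for a partition. The consumer crashes after processing two events without committing, and on restart it picks up from the last committed offset.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for one Kafka partition; not the Kafka client API.
class EventLog {
    private final List<String> records = new ArrayList<>();
    private int committedOffset = 0; // next offset the consumer resumes from

    void append(String event) {
        records.add(event); // the log only grows; reading never removes anything
    }

    // Return up to maxRecords events starting at fromOffset.
    List<String> poll(int fromOffset, int maxRecords) {
        int end = Math.min(records.size(), fromOffset + maxRecords);
        if (fromOffset >= end) {
            return new ArrayList<>();
        }
        return new ArrayList<>(records.subList(fromOffset, end));
    }

    void commit(int nextOffset) {
        committedOffset = nextOffset;
    }

    int committedOffset() {
        return committedOffset;
    }

    // Process two events and commit, process two more, crash before committing, restart.
    static List<String> replayAfterCrash() {
        EventLog log = new EventLog();
        for (int i = 0; i < 5; i++) {
            log.append("event-" + i);
        }

        List<String> firstBatch = log.poll(log.committedOffset(), 2); // event-0, event-1
        log.commit(log.committedOffset() + firstBatch.size());        // offset is now 2

        log.poll(log.committedOffset(), 2); // event-2, event-3 processed, then the job crashes
        // No commit happened, so the restarted consumer resumes from offset 2.
        return log.poll(log.committedOffset(), 10);
    }

    public static void main(String[] args) {
        System.out.println(replayAfterCrash()); // prints [event-2, event-3, event-4]
    }
}
```

Nothing is lost, but `event-2` and `event-3` get processed twice. That is at-least-once delivery, which is why the sink has to tolerate duplicates or take part in a transaction.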

The Flink Job

Here's where the magic happens. Flink's windowing and state management make real-time aggregations concise to express:

// Lowercase names are placeholders for configured instances.
DataStream<Event> events = env
    .addSource(kafkaConsumer)
    .assignTimestampsAndWatermarks(watermarkStrategy);

events
    .keyBy("user_id")
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(aggregateFunction)
    .addSink(clickHouseSink);

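This processes events in 5-minute windows, aggregates by user, and pushes results to ClickHouse. Per-event processing stays sub-second, but each aggregate is emitted only after the watermark passes the end of its 5-minute window. With checkpointing enabled, the job is fault-tolerant and Flink updates its state exactly once; end-to-end exactly-once also needs an idempotent or transactional sink.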
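
What the windowing step computes can be sketched without Flink at all. The plain-Java example below is an illustration under assumptions, not the production job: the `Event` class, its `userId` and `timestampMs` fields, and the count-per-window aggregate are hypothetical stand-ins. It shows how a tumbling window puts each event into exactly one 5-minute bucket by aligning its timestamp down to a window boundary, then counts events per user and window.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TumblingWindowSketch {
    static final long WINDOW_MS = 5 * 60 * 1000L; // 5-minute windows

    // Hypothetical event shape; the real Event class is not shown in this post.
    static final class Event {
        final String userId;
        final long timestampMs;

        Event(String userId, long timestampMs) {
            this.userId = userId;
            this.timestampMs = timestampMs;
        }
    }

    // Align an event timestamp down to the start of its window.
    static long windowStart(long timestampMs) {
        return timestampMs - Math.floorMod(timestampMs, WINDOW_MS);
    }

    // Count events per (user, window); the map key is "userId@windowStart".
    static Map<String, Long> countPerUserWindow(List<Event> events) {
        Map<String, Long> counts = new TreeMap<>();
        for (Event e : events) {
            String key = e.userId + "@" + windowStart(e.timestampMs);
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("alice", 10_000L),   // window [0, 300000)
            new Event("alice", 299_999L),  // same window
            new Event("alice", 300_000L),  // next window [300000, 600000)
            new Event("bob", 450_000L));   // window [300000, 600000)
        // prints {alice@0=2, alice@300000=1, bob@300000=1}
        System.out.println(countPerUserWindow(events));
    }
}
```

The sketch runs over a finished list, so it can count everything at once. A streaming job never sees "the end" of the data, which is why Flink waits for the watermark to pass a window's end before emitting that window's result.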

Why This Stack?

  • Kafka: Distributed, durable, ridiculously reliable. Built for throughput.
  • Flink: True streaming (not micro-batches). Event-time semantics. Complex state.
  • ClickHouse: Column-oriented OLAP database, well suited to time-series data. Sub-second queries on billions of rows.

The Gotchas

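State management in Flink is tricky. RocksDB keeps its state in native memory outside the JVM heap, so if you get your RocksDB memory configs wrong, the job can blow past its container limit and get killed under high load, with no Java OutOfMemoryError pointing at the cause.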

Things we learned the hard way:

  • Watermarks need careful tuning: too aggressive and late events get dropped, too lenient and results are delayed
  • Exactly-once semantics come with overhead. Sometimes at-least-once is the right trade-off
  • Monitoring is non-negotiable. Flink's UI is great but instrument everything
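
The watermark trade-off from the first bullet can be sketched in plain Java. This is an illustration, not Flink's implementation, and the class and method names are hypothetical. It follows the bounded-out-of-orderness idea: the watermark trails the highest timestamp seen so far by a fixed delay, and an event counts as late when its window had already ended at or before the current watermark. Running the same out-of-order stream with a tight and a loose delay shows the difference.

```java
class WatermarkSketch {
    static final long WINDOW_MS = 5 * 60 * 1000L; // 5-minute tumbling windows

    // Count events that arrive after the watermark has passed the end of their window.
    static int countLateEvents(long[] timestampsInArrivalOrder, long maxOutOfOrdernessMs) {
        long maxSeen = Long.MIN_VALUE;
        long watermark = Long.MIN_VALUE;
        int late = 0;
        for (long ts : timestampsInArrivalOrder) {
            long windowEnd = ts - Math.floorMod(ts, WINDOW_MS) + WINDOW_MS;
            if (windowEnd - 1 <= watermark) {
                late++; // the window already fired; with no allowed lateness this event is dropped
            }
            // Advance the watermark: highest timestamp seen, minus the tolerated delay.
            maxSeen = Math.max(maxSeen, ts);
            watermark = maxSeen - maxOutOfOrdernessMs - 1;
        }
        return late;
    }

    public static void main(String[] args) {
        // Event timestamps in milliseconds, listed in arrival order (some out of order).
        long[] stream = {100_000L, 290_000L, 310_000L, 295_000L, 620_000L, 305_000L};
        System.out.println(countLateEvents(stream, 5_000L));  // prints 2
        System.out.println(countLateEvents(stream, 30_000L)); // prints 0
    }
}
```

With a 5-second bound, two stragglers miss their windows. With a 30-second bound nothing is dropped, but the first window cannot fire until an event with a timestamp of at least 330,000 ms shows up, so every result arrives later.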

Closing Thoughts

Building a real-time pipeline is like conducting an orchestra. Every component needs to be in sync. But when it works? You get live insights at scale. That's worth the complexity.

Filed under tech · 2,100 words · 8 min read