AI + Data + RAG Engineer

Building Reliable
AI Systems with
RAG, Data & LLMs

From raw data to intelligent insights — using production-grade pipelines, semantic retrieval, and thoughtfully integrated language models.

View Projects · Get in Touch
Python · RAG Pipelines · LLM Integration · MERN Stack · Vector Databases · Data Engineering
About
Harshal Shilwant
AI Systems Engineer
Experience: 2 Years
Domain: Cognitive Research Market
Previously: TechnoNexis
Focus: RAG · LLMs · Data Pipelines

I don't just integrate APIs — I architect systems that make AI actually work in the real world. That means obsessing over data quality before a single prompt is written, understanding retrieval semantics before vector embeddings are configured, and thinking about failure modes before deployment.

At TechnoNexis, I built end-to-end RAG pipelines for the cognitive research market — ingesting messy, real-world data from PDFs and spreadsheets, cleaning it, chunking it intelligently, embedding it, and serving it through LLMs that returned accurate, grounded answers.

My philosophy: garbage in, garbage out. Before any LLM touches data, it must be clean, structured, and semantically meaningful. I care about reducing hallucination, improving retrieval precision, and building backends that are reliable under production load.

🧬
Data-First Thinking
Every AI system starts with data quality — cleaning and schema design before any model work.
🎯
Grounded Outputs
Reducing hallucination through retrieval design, not prompt hacks.
⚙️
System Design
APIs, pipelines, and services built for scale — not just demos.
📐
Eval-Driven Dev
Output quality measured and improved systematically, not by feel.
Featured Work

Projects That Ship

Real systems solving real problems. Each project reflects a full engineering loop — from data to deployment.

01 / 04 RAG · NLP
AI Market Research RAG System
End-to-end pipeline ingesting PDF & Excel research reports. Chunks, embeds, and retrieves context for LLM-powered Q&A with drastically reduced hallucination versus vanilla GPT prompting.
Python · LangChain · Pinecone · OpenAI API · FastAPI · PyMuPDF
68%
Hallucination Reduction
~200ms
Query Latency
System Highlights
  • Semantic chunking with overlap strategy to preserve context across section boundaries
  • Hybrid retrieval: BM25 sparse + dense vector search, reranked via Cohere
  • Per-query citation extraction — every answer grounded to source document & page
  • Async FastAPI backend with request queue and rate-limiting for multi-user load
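The merge step in a hybrid BM25-plus-dense setup can be illustrated with reciprocal rank fusion, one common way to combine two rankings before a reranker sees them. This is an illustrative sketch, not the project's exact fusion logic:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.

    Each list is ordered best-first; a document earns 1/(k + rank + 1)
    per list it appears in, so agreement across retrievers is rewarded.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d3", "d1", "d7"]  # BM25 order
dense = ["d1", "d5", "d3"]   # dense vector-search order
fused = reciprocal_rank_fusion([sparse, dense])
# "d1" ranks first: it scored highly in both lists
```

A fused list like this is then handed to a cross-encoder reranker (Cohere, in the project above) for final ordering.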
02 / 04 Analytics · AI
Excel Analytics + AI Insight Platform
Upload any structured Excel file and receive auto-generated visualizations, data summaries, and natural-language insights. AI detects trends, outliers, and business signals automatically.
React · Node.js · Python · Pandas · Recharts · GPT-4o
3s
Avg. Insight Time
12+
Chart Types
System Highlights
  • Schema inference engine auto-detects numeric, categorical, and temporal columns
  • AI-generated chart recommendations based on data shape and column types
  • LLM summarizes each chart with business-level language, not technical output
  • MERN full-stack with file streaming — handles Excel files up to 50MB
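A minimal version of that schema-inference step might look like the following pandas sketch. The thresholds and category names here are illustrative assumptions, not the platform's actual rules:

```python
import pandas as pd

def infer_schema(df: pd.DataFrame, cat_ratio: float = 0.7) -> dict:
    """Classify each column as numeric, temporal, categorical, or free text."""
    schema = {}
    for col in df.columns:
        series = df[col].dropna()
        if pd.api.types.is_numeric_dtype(series):
            schema[col] = "numeric"
        # if >90% of values parse as dates, treat the column as temporal
        elif pd.to_datetime(series, errors="coerce").notna().mean() > 0.9:
            schema[col] = "temporal"
        # few distinct values relative to row count: categorical
        elif series.nunique() <= cat_ratio * len(series):
            schema[col] = "categorical"
        else:
            schema[col] = "text"
    return schema

df = pd.DataFrame({
    "revenue": [10.5, 20.0, 15.2],
    "region": ["NA", "EU", "NA"],
    "date": ["2023-01-05", "2023-02-05", "2023-03-05"],
})
schema = infer_schema(df)
```

The inferred schema then drives chart recommendation: numeric vs. temporal vs. categorical columns suggest very different visualizations.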
03 / 04 Search · Vector DB
Semantic Search Engine
Vector database-backed retrieval system replacing keyword search. Uses dense embeddings to match intent, not just vocabulary — significantly improving result relevance for domain-specific corpora.
Sentence Transformers · Qdrant · FastAPI · Docker · MongoDB
91%
Recall@10
40ms
P99 Latency
System Highlights
  • Fine-tuned bi-encoder on domain-specific query-document pairs for higher precision
  • HNSW index in Qdrant for sub-50ms approximate nearest neighbor search
  • Faceted filtering: combine semantic score with metadata filters in one query
  • Dockerized deployment with horizontal scaling support via load balancer
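Conceptually, combining a semantic score with metadata filters in one query works like this toy NumPy version. Qdrant does the equivalent server-side against its HNSW index; the vectors and field names below are made up for illustration:

```python
import numpy as np

def filtered_search(query_vec, doc_vecs, metadata, filters, top_k=3):
    """Cosine-similarity search restricted to documents whose metadata
    matches every key/value in `filters` (the facet constraint)."""
    mask = np.array([all(m.get(k) == v for k, v in filters.items())
                     for m in metadata])
    idx = np.where(mask)[0]
    if idx.size == 0:
        return []
    vecs = doc_vecs[idx]
    sims = vecs @ query_vec / (
        np.linalg.norm(vecs, axis=1) * np.linalg.norm(query_vec))
    order = np.argsort(-sims)[:top_k]
    return [(int(idx[i]), float(sims[i])) for i in order]

doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
metadata = [{"lang": "en"}, {"lang": "en"}, {"lang": "de"}]
hits = filtered_search(np.array([1.0, 0.0]), doc_vecs, metadata, {"lang": "en"})
# only the two "en" documents are candidates; the closest ranks first
```

Filtering before scoring, rather than post-filtering a top-k result set, is what keeps recall stable when filters are selective.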
04 / 04 Backend · LLM Ops
LLM API Backend Architecture
Production-grade backend for serving multiple LLM providers behind a unified API, with model routing, fallback logic, cost tracking, prompt versioning, and response caching.
Node.js · Express · Redis · PostgreSQL · OpenAI · Anthropic
60%
Cost Reduction (cache)
99.9%
Uptime (fallback)
System Highlights
  • Unified API gateway — swap providers (OpenAI → Anthropic → Mistral) via config
  • Semantic response caching with Redis: similar-intent queries hit the cache, not the provider API
  • Prompt version registry — rollback, A/B test, and track prompt performance
  • Per-tenant cost tracking and token-budget enforcement in real time
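The fallback behavior behind that uptime figure reduces to "try each provider in order, record failures, fall through." A minimal sketch, where the provider names and `ProviderError` type are placeholders rather than the real client code:

```python
class ProviderError(Exception):
    """Raised when an LLM provider call fails (timeout, rate limit, 5xx)."""

def call_with_fallback(prompt, providers):
    """Try each (name, callable) provider in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append((name, str(exc)))  # record and fall through
    raise ProviderError(f"all providers failed: {errors}")

def flaky(prompt):  # stands in for a rate-limited primary provider
    raise ProviderError("429 rate limited")

def stable(prompt):  # stands in for a healthy secondary provider
    return f"answer to: {prompt}"

used, answer = call_with_fallback("hello", [("openai", flaky), ("anthropic", stable)])
# the primary fails, so the request falls through to the secondary
```

A production version would add retries with backoff before falling through, and a queue as the terminal fallback rather than an exception.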
System Design

Architecture Thinking

These are the core systems I design and reason about. Clean flows, defined responsibilities, observable outputs.

Pipeline 01
RAG Pipeline — Ingestion to Answer
📄 Raw Document
🧹 Extraction & Cleaning
✂️ Chunking Strategy
🔢 Embeddings
🗄️ Vector Store
🔍 Retrieval + Rerank
🤖 LLM + Context
✅ Grounded Answer
The key design decision: chunking strategy determines retrieval quality more than model choice. I use semantic chunking with sliding overlap (128-token overlap on 512-token chunks) to preserve cross-boundary context. Retrieval uses hybrid BM25+dense, reranked before LLM injection.
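The sliding-overlap part of that strategy, reduced to its simplest fixed-window form: a real pipeline would count tokens with a tokenizer such as tiktoken and split on semantic boundaries first, but plain list slicing shows the overlap mechanics.

```python
def chunk_with_overlap(tokens, size=512, overlap=128):
    """Split a token list into windows of `size`, each sharing
    `overlap` tokens with the previous window so context survives
    chunk boundaries."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = list(range(1000))
chunks = chunk_with_overlap(tokens)
# 3 chunks; chunk 0 ends with the same 128 tokens that chunk 1 starts with
```

The overlap costs some index size (each boundary token is embedded twice) but buys retrieval that does not miss facts straddling a split.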
Pipeline 02
Data Processing Flow — Raw to Production-Ready
📊 Excel / CSV / PDF
🔎 Schema Inference
🧹 Dedup + Nulls
🔧 Type Normalization
✅ Validation Layer
🗃️ Clean Store
🚀 Downstream AI
Data quality gates catch problems before they propagate. Schema inference auto-detects column types; validation rules flag statistical anomalies (outliers beyond 3σ) and structural issues (missing required fields) before any AI system touches the data.
Pipeline 03
LLM Request Flow — Optimized for Cost & Reliability
📥 API Request
🔐 Auth + Rate Limit
💾 Semantic Cache?
📝 Prompt Builder
🤖 Model Router
⚡ LLM Provider
📊 Log + Track Cost
📤 Response
The model router selects provider based on task type, cost budget, and latency SLA. Cache hit rate of ~60% achieved by embedding incoming queries and checking cosine similarity against recent responses — not exact string match. Fallback chain: primary → secondary → queue.
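A semantic cache of that kind reduces to "embed the query, compare against stored embeddings, return the cached response above a similarity threshold." A minimal in-memory sketch, where the toy bag-of-words embedder stands in for a real embedding model and the 0.92 threshold is illustrative:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Cache keyed by embedding similarity, not exact string match."""
    def __init__(self, embed, threshold=0.92):
        self.embed = embed        # callable: text -> vector
        self.threshold = threshold
        self.entries = []         # list of (vector, response) pairs

    def get(self, query):
        qv = self.embed(query)
        for vec, response in self.entries:
            if cosine(qv, vec) >= self.threshold:
                return response   # similar-intent query seen before
        return None               # cache miss: go to the LLM provider

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

VOCAB = ["price", "plan", "weather", "today"]
def toy_embed(text):  # placeholder for a real sentence embedder
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

cache = SemanticCache(toy_embed)
cache.put("price of the plan today", "cached answer")
```

In production the linear scan is replaced by a vector index (Redis vector search or similar), and entries carry a TTL so stale answers age out.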
Capabilities

Skills & Tools

AI / LLM
RAG Pipelines
LangChain
OpenAI API
Prompt Engineering
LLM Evaluation
Data Engineering
Python / Pandas
Data Cleaning
ETL Pipelines
Vector DBs (Pinecone, Qdrant)
MongoDB
Backend
Node.js / Express
FastAPI
REST API Design
Redis (Caching)
Docker
Frontend
React.js
JavaScript (ES6+)
Tailwind CSS
Data Visualization
Tools & Ecosystem
Git & GitHub · VS Code · Postman · Jupyter · Vercel · Render · PyMuPDF · Cohere Reranker · Sentence Transformers · HuggingFace · HNSW Index · PostgreSQL
Engineering Perspective

How I Think

The mental models and design principles I apply when building AI systems.

01
How do I design a RAG system from scratch?
I start with the query, not the documents. What does a "good answer" look like? That drives everything — chunking size, retrieval strategy, and how much context the LLM actually needs.
1. Define answer quality first (what's a good vs bad response?)
2. Audit source documents — types, sizes, structure
3. Design chunking strategy around semantic boundaries
4. Choose retrieval type (dense, sparse, hybrid) based on query diversity
5. Add reranking — always improves precision for minimal cost
02
How do I reduce hallucination in LLM outputs?
Hallucination is mostly a retrieval problem, not a prompting problem. If the right context doesn't reach the LLM, no prompt will fix it. I focus on retrieval precision first.
Improve chunk quality — semantic coherence over arbitrary splits
Add citation constraints in system prompt — force source grounding
Use reranking to filter irrelevant retrieved context
Measure faithfulness score (RAGAs) on every release
03
How do I evaluate LLM output quality systematically?
You can't improve what you don't measure. I build eval pipelines that run on every code change — not manual vibe-checking before release.
Build a golden dataset of 50-100 query-answer pairs per domain
Track: faithfulness, answer relevance, context precision (RAGAs)
Use LLM-as-judge for semantic similarity scoring
Alert on metric regression — treat evals like unit tests
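"Treat evals like unit tests" can be made concrete as a regression gate run in CI: fail the build whenever any tracked metric drops below its baseline. The metric names follow the RAGAs ones above; the scores and tolerance are hypothetical:

```python
def check_regression(current, baseline, tolerance=0.02):
    """Return {metric: (baseline, current)} for every metric that dropped
    more than `tolerance` below its baseline -- a failing test, in effect."""
    failures = {}
    for name, base in baseline.items():
        score = current.get(name, 0.0)  # missing metric counts as a failure
        if score < base - tolerance:
            failures[name] = (base, score)
    return failures

baseline = {"faithfulness": 0.90, "answer_relevance": 0.85, "context_precision": 0.80}
this_run = {"faithfulness": 0.91, "answer_relevance": 0.80, "context_precision": 0.81}
failures = check_regression(this_run, baseline)
# only answer_relevance regressed beyond tolerance
```

Wiring this into the pipeline (exit nonzero when `failures` is non-empty) is what turns eval scores from a dashboard into a gate.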
Get in Touch

Let's Build
Something Intelligent

Have an AI system to build? Let's talk architecture first.

Available for AI & Data Engineering Projects