LLM Evaluation | RAG Evaluation | Voice AI

AI engineer building workflows that can be evaluated, traced, and improved.

I work on practical AI systems where model behavior matters: RLHF-style response evaluation, retriever comparison, grounded answers, voice-agent edge cases, and local-first LLM applications.

RLHF
Rubrics, preference checks, and annotator agreement
RAG
Retriever comparison, citations, and groundedness
Voice AI
Production STT/TTS plus scenario-tested demos

Profile

The gap that matters is between a working demo and evidence that the workflow holds up.

My background combines LLM evaluation work, production voice AI integration, and business-facing product experience. I care about whether a model followed the instruction, whether a RAG answer is grounded in evidence, and whether a voice pipeline can handle realistic failures such as silence, unclear transcripts, and interruptions.

I am currently targeting remote or relocation-friendly roles in AI evaluation, RLHF, RAG and GenAI applications, voice AI, conversational AI, and applied AI engineering.

Selected public work

Three anchor projects show the current direction.

LLM Evaluation | Python

llm-evaluation-lab

Config-driven workbench for pairwise response scoring, RLHF-style rubrics, JSONL validation, multi-annotator agreement, disagreement cases, failure tags, and Markdown/JSON reports.

  • Computed-vs-human preference checks
  • Pairwise annotator agreement metrics
  • CI plus unit-tested CLI workflow
Open repository
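
To make the agreement metric concrete, here is a minimal sketch of pairwise annotator agreement using Cohen's kappa over preference labels. It is an illustration only and not the code in llm-evaluation-lab; the function names (`cohen_kappa`, `pairwise_agreement`, `disagreement_cases`) and the `"A"`/`"B"` label format are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    # Fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def pairwise_agreement(annotations):
    """Kappa for every annotator pair; `annotations` maps name -> label list."""
    return {
        (a, b): cohen_kappa(annotations[a], annotations[b])
        for a, b in combinations(sorted(annotations), 2)
    }


def disagreement_cases(annotations):
    """Indices of items where the annotators did not all pick the same label."""
    columns = zip(*annotations.values())
    return [i for i, labels in enumerate(columns) if len(set(labels)) > 1]


if __name__ == "__main__":
    # Hypothetical preference labels: which response ("A" or "B") was better.
    votes = {
        "ann_1": ["A", "A", "B", "B"],
        "ann_2": ["A", "B", "B", "B"],
    }
    print(pairwise_agreement(votes))
    print(disagreement_cases(votes))
```

Kappa corrects raw agreement for chance, which matters when one label dominates; the disagreement indices are the items worth routing to a second review.
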
RAG Evaluation | Python

rag-evaluation-workbench

Measures retrieval recall, citation coverage, required-fact coverage, grounded answer rate, retriever ranking behavior, and failure tags over a small document corpus.

  • BM25 vs keyword vs hybrid comparison
  • Top-source expected-source checks
  • Generated comparison report
Open repository
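
The retrieval-side checks can be sketched in a few lines. This is a hedged illustration of recall at k, a top-source check, and citation coverage, not the implementation in rag-evaluation-workbench; the function names and the document-ID format are hypothetical.

```python
def recall_at_k(retrieved_ids, expected_ids, k):
    """Share of expected sources that appear in the top-k retrieved results."""
    expected = set(expected_ids)
    if not expected:
        raise ValueError("expected_ids must not be empty")
    hits = expected & set(retrieved_ids[:k])
    return len(hits) / len(expected)


def top_source_matches(retrieved_ids, expected_top_id):
    """True when the highest-ranked result is the expected source."""
    return bool(retrieved_ids) and retrieved_ids[0] == expected_top_id


def citation_coverage(cited_ids, expected_ids):
    """Share of expected sources that the answer actually cites."""
    expected = set(expected_ids)
    if not expected:
        raise ValueError("expected_ids must not be empty")
    return len(expected & set(cited_ids)) / len(expected)


if __name__ == "__main__":
    # Hypothetical ranking from one retriever and the gold sources for a query.
    retrieved = ["doc_b", "doc_a", "doc_c"]
    expected = ["doc_a", "doc_d"]
    print(recall_at_k(retrieved, expected, k=2))
    print(top_source_matches(retrieved, "doc_b"))
    print(citation_coverage(["doc_a"], expected))
```

Running the same functions over the rankings from each retriever (for example BM25, keyword, and hybrid) gives a like-for-like comparison on one query set.
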
Voice AI | Python

voice-ai-agent-demo

Local voice-agent framework with STT, assistant, and TTS provider boundaries, streaming-style artifacts, p50/p95 latency traces, batch scenarios, and reports.

  • Silence timeout and clarification paths
  • Barge-in emergency escalation scenario
  • Batch intent and latency report
Open repository
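
As an illustration of the p50/p95 latency summary, here is a minimal sketch using the nearest-rank percentile definition. It is not the code in voice-ai-agent-demo; the stage names (`stt`, `tts`), the millisecond values, and the function names are assumptions for this example, and other percentile definitions (such as linear interpolation) would give slightly different numbers.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(traces_ms):
    """p50 and p95 per pipeline stage; `traces_ms` maps stage -> latencies in ms."""
    return {
        stage: {"p50": percentile(samples, 50), "p95": percentile(samples, 95)}
        for stage, samples in traces_ms.items()
    }


if __name__ == "__main__":
    # Hypothetical per-turn latencies in milliseconds for two stages.
    traces = {
        "stt": [120, 80, 100, 300, 90, 110, 95, 105, 85, 115],
        "tts": [40, 45, 50, 42, 48],
    }
    print(latency_summary(traces))
```

Reporting p95 alongside p50 surfaces the slow tail (the single 300 ms turn above) that a mean or median alone would hide.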

Focus areas

Where I can contribute fastest.

AI Evaluation / RLHF

Prompt design, side-by-side response review, rubric-based scoring, preference data, coding/data-science evaluation, ground-truth corrections, annotator agreement, and failure-mode analysis.

RAG / GenAI Applications

Document ingestion, chunking, retrieval, retriever comparison, citations, groundedness checks, local LLM apps, LangChain, FAISS, Ollama, Streamlit, and Gradio workflows.

Voice AI / Conversational AI

STT/TTS integration, provider boundaries, latency tracing, intent routing, batch scenario testing, interruption handling, and LLM-backed assistant quality review.

Experience

Applied AI plus model-quality judgment.

Sep 2024 - Jan 2025

Jr. AI / Data Science Engineer | Engineer Master

Integrated TTS and STT models into PreCall AI and contributed to Turing-sourced RLHF pipelines for xAI and Meta Llama 4, evaluating model behavior across Python and data-science tasks.

2020 - 2022

Business Development Associate | Byju's

Developed customer-discovery, communication, and business-context skills through high-volume consultative education sales.

Resume variants

Different roles need different evidence.

Available for

Remote or relocation-friendly AI roles.

Best fit: AI evaluation, RLHF, RAG/GenAI applications, voice AI, conversational AI, and Python-based applied AI systems.