LLM Evaluation | RAG Evaluation | Voice AI

AI engineer building workflows that can be evaluated, traced, and improved.

I work on practical AI systems where model behavior matters: RLHF-style response evaluation, retriever comparison, grounded answers, voice-agent edge cases, and local-first LLM applications.

View GitHub | Email me

RLHF: Rubrics, preference checks, and annotator agreement
RAG: Retriever comparison, citations, and groundedness
Voice AI: Production STT/TTS plus scenario-tested demos

Profile

The gap I focus on is the one between a demo and evidence that the workflow holds up.

My background combines LLM evaluation work, production voice AI integration, and business-facing product experience. I care about whether a model followed the instruction, whether a RAG answer is grounded in evidence, and whether a voice pipeline can handle realistic failures such as silence, unclear transcripts, and interruptions.

I am currently targeting remote or relocation-friendly roles in AI evaluation, RLHF, RAG and GenAI applications, voice AI, conversational AI, and applied AI engineering.

Selected public work

Three anchor projects show the current direction.

LLM Evaluation | Python

llm-evaluation-lab

Config-driven workbench for pairwise response scoring, RLHF-style rubrics, JSONL validation, multi-annotator agreement, disagreement cases, failure tags, and Markdown/JSON reports.

Computed-vs-human preference checks
Pairwise annotator agreement metrics
CI plus unit-tested CLI workflow

Open repository
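As an illustration of the pairwise annotator agreement idea above, here is a minimal sketch using Cohen's kappa over preference labels. It is not taken from llm-evaluation-lab: the function name `cohen_kappa`, the label values ("A", "B", "tie"), and the sample data are assumptions made for the example, and the repository may use a different agreement metric.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Hypothetical pairwise preferences: which response each annotator preferred.
annotator_1 = ["A", "A", "B", "B", "A", "tie"]
annotator_2 = ["A", "B", "B", "B", "A", "tie"]

kappa = cohen_kappa(annotator_1, annotator_2)
# Indices where the annotators disagree, worth surfacing as disagreement cases.
disagreements = [i for i, (a, b) in enumerate(zip(annotator_1, annotator_2)) if a != b]
print(f"kappa={kappa:.3f} disagreements={disagreements}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label dominates the data.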

RAG Evaluation | Python

rag-evaluation-workbench

Measures retrieval recall, citation coverage, required-fact coverage, grounded answer rate, retriever ranking behavior, and failure tags over a small document corpus.

BM25 vs keyword vs hybrid comparison
Checks that the top-ranked source matches the expected source
Generated comparison report

Open repository
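To make the retrieval metrics above concrete, here is a small sketch of recall at k, citation coverage, and a top-source check across three retrievers. The function names, document IDs, and rankings are all hypothetical, and the metric definitions are common ones assumed for this example rather than the exact definitions used in rag-evaluation-workbench.

```python
def recall_at_k(retrieved_ids, expected_ids, k):
    """Fraction of expected documents that appear in the top-k retrieved list."""
    expected = set(expected_ids)
    if not expected:
        return 1.0  # nothing was required, so nothing was missed
    return len(expected & set(retrieved_ids[:k])) / len(expected)


def citation_coverage(cited_ids, expected_ids):
    """Fraction of expected sources that the answer actually cites (assumed definition)."""
    expected = set(expected_ids)
    if not expected:
        return 1.0
    return len(expected & set(cited_ids)) / len(expected)


def top_source_matches(retrieved_ids, expected_top_id):
    """True when the highest-ranked document is the expected source."""
    return bool(retrieved_ids) and retrieved_ids[0] == expected_top_id


# Hypothetical rankings for one question from three retrievers.
rankings = {
    "bm25": ["d3", "d1", "d7"],
    "keyword": ["d7", "d3", "d2"],
    "hybrid": ["d1", "d3", "d2"],
}
expected_sources = ["d1", "d3"]

report = {
    name: {
        "recall@2": recall_at_k(ids, expected_sources, k=2),
        "top_source_ok": top_source_matches(ids, "d1"),
    }
    for name, ids in rankings.items()
}
for name, row in report.items():
    print(name, row)
```

In this made-up data, BM25 and hybrid both reach full recall at k=2, but only hybrid puts the expected source first, which is the kind of ranking difference a comparison report can surface.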

Voice AI | Python

voice-ai-agent-demo

Local voice-agent framework with STT, assistant, and TTS provider boundaries, streaming-style artifacts, p50/p95 latency traces, batch scenarios, and reports.

Silence timeout and clarification paths
Barge-in emergency escalation scenario
Batch intent and latency report

Open repository
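The p50/p95 latency traces mentioned above can be summarized per pipeline stage roughly as follows. This is a sketch under stated assumptions: the stage names, the trace format (one dict of milliseconds per turn), and the nearest-rank percentile method are illustrative choices, not the implementation in voice-ai-agent-demo.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(traces):
    """Group per-turn latencies by stage and report p50 and p95 for each stage."""
    by_stage = {}
    for trace in traces:
        for stage, millis in trace.items():
            by_stage.setdefault(stage, []).append(millis)
    return {
        stage: {"p50": percentile(samples, 50), "p95": percentile(samples, 95)}
        for stage, samples in by_stage.items()
    }


# Hypothetical per-turn latencies in milliseconds for each provider boundary.
traces = [
    {"stt": 120, "assistant": 310, "tts": 90},
    {"stt": 135, "assistant": 290, "tts": 95},
    {"stt": 150, "assistant": 900, "tts": 85},
    {"stt": 110, "assistant": 305, "tts": 100},
    {"stt": 400, "assistant": 320, "tts": 92},
]

summary = summarize(traces)
for stage, stats in summary.items():
    print(stage, stats)
```

Reporting p95 next to p50 exposes the occasional slow turn, such as the 900 ms assistant call in this sample, that a median alone would hide.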

Focus areas

Where I can contribute fastest.

AI Evaluation / RLHF

Prompt design, side-by-side response review, rubric-based scoring, preference data, coding/data-science evaluation, ground-truth corrections, annotator agreement, and failure-mode analysis.

RAG / GenAI Applications

Document ingestion, chunking, retrieval, retriever comparison, citations, groundedness checks, local LLM apps, LangChain, FAISS, Ollama, Streamlit, and Gradio workflows.

Voice AI / Conversational AI

STT/TTS integration, provider boundaries, latency tracing, intent routing, batch scenario testing, interruption handling, and LLM-backed assistant quality review.

Experience

Applied AI plus model-quality judgment.

Sep 2024 - Jan 2025

Jr. AI / Data Science Engineer | Engineer Master

Integrated TTS and STT models into PreCall AI. Contributed to RLHF pipelines for xAI and Meta Llama 4, sourced through Turing, evaluating model behavior across Python and data-science tasks.

2020 - 2022

Business Development Associate | Byju's

Developed customer discovery and communication skills, along with business-context judgment, through high-volume consultative education sales.

Resume variants

Different roles need different evidence.

AI Evaluation / RLHF: LLM evaluator, AI trainer, RLHF, model quality

Preview PDF

RAG / LLM Applications: GenAI engineer, RAG, document QA, LangChain

Preview PDF

Voice AI / Conversational AI: STT/TTS, voice agents, conversational systems

Preview PDF

Available for

Remote or relocation-friendly AI roles.

Best fit: AI evaluation, RLHF, RAG/GenAI applications, voice AI, conversational AI, and Python-based applied AI systems.

yashtomar10122@gmail.com | LinkedIn | GitHub