AI · 2025-06-20 · 3 min read

A Practical Guide to AI Implementation: From LLM Integration to RAG in Production

Best practices and hard-won lessons for deploying machine learning models in production. A deep dive into LLM usage, prompt design, RAG implementation, and cost optimization.

髙木 晃宏

CEO / Engineer

TL;DR

  • Clarify ROI before adopting AI. The question isn't "how can we use AI?" — it's "what problem are we solving with AI?"
  • Prompt engineering is 90% of LLM success. Master Few-shot and Chain of Thought techniques.
  • Use RAG to put internal data to work. Your chunking strategy and vector DB choice make or break it.
  • In production, cost control, rate limit handling, and fallback logic are non-negotiable.

Introduction: The Reality of AI Adoption

"We want to implement AI."

If you're an engineer, you've probably heard this from leadership. But AI is a tool, not a goal. The real question is: what problem are you trying to solve?

Our team has worked on more than ten AI projects over the past two years. The gap between successful and failed projects was always clear.

What success looks like:

  • Starting from a concrete business problem
  • Starting small and validating impact early
  • Integrating naturally into existing human workflows

What failure looks like:

  • Vague goals like "do something with AI"
  • Large upfront investment before proving value
  • No plan for integrating with existing processes

This article walks through a practical approach to AI implementation, with a focus on LLMs (large language models).

Deciding Whether to Adopt AI

Identifying the Right Use Cases

Use cases where AI works well:

✅ Tasks AI is good at

  • Processing large volumes of text (summarization, classification, extraction)
  • Pattern recognition (anomaly detection, recommendations)
  • Natural language interfaces (chatbots, search)
  • Automating repetitive tasks that still require judgment

❌ Tasks AI is not good at

  • Processes requiring 100% accuracy (e.g., final review of legal documents)
  • Processes with extremely tight real-time requirements
  • Domains with very little data
  • High-stakes decisions with strict accountability requirements

An ROI Framework

interface AIProjectROI {
  // Cost factors
  developmentCost: number;       // Development cost
  apiCost: number;               // API usage fee (monthly)
  infrastructureCost: number;    // Infrastructure cost
  maintenanceCost: number;       // Operations and maintenance cost
  // Benefit factors
  timeSavingHours: number;       // Hours of work saved per month
  hourlyRate: number;            // Hourly rate equivalent
  qualityImprovement: number;    // Value from quality improvements
  newRevenueOpportunity: number; // New revenue opportunities
}

function calculateROI(project: AIProjectROI): {
  monthlyBenefit: number;
  monthlyCost: number;
  paybackMonths: number;
} {
  const monthlyBenefit =
    (project.timeSavingHours * project.hourlyRate) +
    project.qualityImprovement +
    project.newRevenueOpportunity;
  const monthlyCost =
    project.apiCost + project.infrastructureCost + project.maintenanceCost;
  const paybackMonths = project.developmentCost / (monthlyBenefit - monthlyCost);
  return { monthlyBenefit, monthlyCost, paybackMonths };
}

// Example: automated customer support response system
const supportBot = calculateROI({
  developmentCost: 2000000,    // ¥2,000,000
  apiCost: 50000,              // ¥50,000/month
  infrastructureCost: 10000,   // ¥10,000/month
  maintenanceCost: 30000,      // ¥30,000/month
  timeSavingHours: 160,        // 160 hours saved per month
  hourlyRate: 3000,            // ¥3,000/hour equivalent
  qualityImprovement: 50000,   // ¥50,000/month from quality gains
  newRevenueOpportunity: 0
});
// Result: ¥530,000 monthly benefit, ¥90,000 monthly cost, ~4.5-month payback period

LLM Usage Patterns

1. Text Classification and Sentiment Analysis

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

interface ClassificationResult {
  category: string;
  confidence: number;
  sentiment: 'positive' | 'neutral' | 'negative';
}

async function classifyCustomerInquiry(
  inquiry: string
): Promise<ClassificationResult> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: `You are a customer support inquiry classification system.
Classify the inquiry into one of the following categories:
- billing: billing and payment issues
- technical: technical problems
- account: account-related issues
- general: other general inquiries

Respond in JSON format with these keys:
- category: one of the categories above
- confidence: a number between 0 and 1
- sentiment: "positive", "neutral", or "negative"`
      },
      { role: 'user', content: inquiry }
    ],
    response_format: { type: 'json_object' },
    temperature: 0.1 // Keep temperature low for stable classification results
  });

  return JSON.parse(response.choices[0].message.content!);
}

// Usage
const result = await classifyCustomerInquiry(
  'It looks like my credit card was charged twice.'
);
// e.g. { category: 'billing', confidence: 0.95, sentiment: 'negative' }

2. Text Generation and Summarization

async function summarizeDocument(
  document: string,
  maxLength: number = 200
): Promise<string> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: `You are an expert document summarizer.
Summarize the given document in ${maxLength} characters or fewer.
Be concise while capturing all key points.`
      },
      { role: 'user', content: document }
    ],
    max_tokens: Math.ceil(maxLength * 1.5), // Japanese text is approximately 1.5 tokens per character
    temperature: 0.3
  });

  return response.choices[0].message.content!;
}

3. Structured Data Extraction

import { z } from 'zod';

// Type definition for extracted data
const InvoiceSchema = z.object({
  vendorName: z.string(),
  invoiceNumber: z.string(),
  invoiceDate: z.string(),
  dueDate: z.string(),
  totalAmount: z.number(),
  taxAmount: z.number(),
  lineItems: z.array(z.object({
    description: z.string(),
    quantity: z.number(),
    unitPrice: z.number(),
    amount: z.number()
  }))
});

type Invoice = z.infer<typeof InvoiceSchema>;

async function extractInvoiceData(invoiceText: string): Promise<Invoice> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are an invoice data extraction system.
Extract the following fields from the invoice text and return them as JSON:
- vendorName: issuing company name
- invoiceNumber: invoice number
- invoiceDate: invoice date (YYYY-MM-DD format)
- dueDate: payment due date (YYYY-MM-DD format)
- totalAmount: total amount (numeric)
- taxAmount: tax amount (numeric)
- lineItems: array of line item objects`
      },
      { role: 'user', content: invoiceText }
    ],
    response_format: { type: 'json_object' },
    temperature: 0
  });

  const data = JSON.parse(response.choices[0].message.content!);
  // Validate with Zod
  return InvoiceSchema.parse(data);
}

Prompt Engineering

Core Principles

  1. Be explicit: Eliminate ambiguity and give specific instructions
  2. Provide context: Supply the necessary background information
  3. Specify output format: Clearly state the expected format
  4. Use examples (Few-shot): Demonstrate with concrete examples to improve accuracy
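A minimal sketch of the four principles combined into one system prompt. The product-review triage scenario and every string below are hypothetical, purely to illustrate the structure:

```typescript
// Hypothetical prompt assembled from the four principles above.

// 1. Provide context
const context =
  'You are a product review triage assistant for an e-commerce site.';

// 2. Be explicit
const task =
  'Assign the review a priority: "high" (defect, safety issue, or refund request), ' +
  '"medium" (usability complaint), or "low" (everything else).';

// 3. Specify output format
const format =
  'Respond with a single JSON object: {"priority": "...", "reason": "..."}';

// 4. Use examples (Few-shot)
const example =
  'Example:\nReview: "The charger overheated and melted the cable."\n' +
  'Output: {"priority": "high", "reason": "safety issue"}';

const systemPrompt = [context, task, format, example].join('\n\n');
```

Passed as the system message, this gives the model background, an unambiguous task, a parseable output contract, and one worked example.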

Few-shot Learning

async function categorizeProduct(productName: string): Promise<string> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: 'Determine the category for a given product name.'
      },
      // Few-shot examples
      { role: 'user', content: 'iPhone 15 Pro Max 256GB' },
      { role: 'assistant', content: 'Smartphones' },
      { role: 'user', content: 'SONY WH-1000XM5' },
      { role: 'assistant', content: 'Headphones & Earphones' },
      { role: 'user', content: 'MacBook Air M3' },
      { role: 'assistant', content: 'Laptops' },
      // Actual input
      { role: 'user', content: productName }
    ],
    temperature: 0.1
  });

  return response.choices[0].message.content!;
}

Chain of Thought

For tasks requiring complex reasoning, walking the model through its thinking step by step significantly improves accuracy.

async function analyzeBusinessProblem(problem: string): Promise<{
  analysis: string;
  recommendations: string[];
}> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are a business consultant.
When analyzing a problem, work through the following steps:
1. Identify the core issue
2. List all contributing factors
3. Assess the impact of each factor
4. Consider possible solutions
5. Recommend the most effective course of action

Show your reasoning at each step before presenting your final recommendations.`
      },
      { role: 'user', content: problem }
    ],
    temperature: 0.7
  });

  // Parse and structure the response
  const content = response.choices[0].message.content!;
  // ... parsing logic
  return { analysis: content, recommendations: [] };
}

Implementing RAG (Retrieval-Augmented Generation)

RAG is an architecture where the LLM generates responses grounded in relevant information retrieved from an external knowledge base.

Architecture Overview

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    User     │────▶│ Query Embed-│────▶│  Vector DB  │
│  Question   │     │    ding     │     │ (Pinecone)  │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │ Similar docs
┌─────────────┐                                │
│     LLM     │◀───────────────────────────────┘
│   (GPT-4)   │
└──────┬──────┘
       │
┌──────▼──────┐
│  Response   │
│ Generation  │
└─────────────┘

Document Preprocessing and Chunking

interface DocumentChunk {
  id: string;
  content: string;
  metadata: {
    source: string;
    page?: number;
    section?: string;
  };
  embedding?: number[];
}

function chunkDocument(
  document: string,
  chunkSize: number = 1000,
  overlap: number = 200
): string[] {
  const chunks: string[] = [];
  let start = 0;

  while (start < document.length) {
    let end = start + chunkSize;

    // Avoid splitting in the middle of a sentence
    if (end < document.length) {
      const lastPeriod = document.lastIndexOf('。', end);
      const lastNewline = document.lastIndexOf('\n', end);
      const breakPoint = Math.max(lastPeriod, lastNewline);
      if (breakPoint > start) {
        end = breakPoint + 1;
      }
    }

    chunks.push(document.slice(start, end).trim());
    // Step forward with overlap; the Math.max guard ensures the window
    // always advances even when a break point lands near the chunk start
    start = Math.max(end - overlap, start + 1);
  }

  return chunks;
}

// Semantic chunking (a more advanced approach)
async function semanticChunk(
  document: string,
  maxChunkSize: number = 1500
): Promise<string[]> {
  // Split on headings and paragraphs
  const sections = document.split(/\n#{1,3}\s/);
  const chunks: string[] = [];
  let currentChunk = '';

  for (const section of sections) {
    if (currentChunk.length + section.length > maxChunkSize) {
      if (currentChunk) {
        chunks.push(currentChunk.trim());
      }
      currentChunk = section;
    } else {
      currentChunk += '\n' + section;
    }
  }

  if (currentChunk) {
    chunks.push(currentChunk.trim());
  }

  return chunks;
}
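To size the vector index up front, the chunk count of fixed-size chunking can be estimated from the stride (chunk size minus overlap). `estimateChunkCount` below is a hypothetical helper, not part of the pipeline above, and real counts shift slightly because boundaries snap to sentence ends:

```typescript
// Approximate chunk count for fixed-size chunking with overlap.
// Each chunk after the first advances by (chunkSize - overlap) characters.
function estimateChunkCount(
  docLength: number,
  chunkSize: number = 1000,
  overlap: number = 200
): number {
  const stride = chunkSize - overlap;
  return Math.max(1, Math.ceil((docLength - overlap) / stride));
}

// A 100-page PDF at roughly 2,000 characters per page:
estimateChunkCount(200000); // → 250 chunks to embed and store
```

Estimates like this make the embedding cost and index size of a new document set predictable before any API calls are made.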

Vector Embeddings and Search

import { Pinecone } from '@pinecone-database/pinecone';

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index('knowledge-base');

// Embed and store documents
async function indexDocuments(chunks: DocumentChunk[]): Promise<void> {
  // Generate embedding vectors using the OpenAI Embeddings API
  const embeddings = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunks.map(c => c.content)
  });

  // Store in Pinecone
  const vectors = chunks.map((chunk, i) => ({
    id: chunk.id,
    values: embeddings.data[i].embedding,
    metadata: { content: chunk.content, ...chunk.metadata }
  }));

  await index.upsert(vectors);
}

// Search for similar documents
async function searchSimilarDocuments(
  query: string,
  topK: number = 5
): Promise<Array<DocumentChunk & { score?: number }>> {
  // Convert the query to an embedding vector
  const queryEmbedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query
  });

  // Similarity search in Pinecone
  const results = await index.query({
    vector: queryEmbedding.data[0].embedding,
    topK,
    includeMetadata: true
  });

  return results.matches.map(match => ({
    id: match.id,
    content: match.metadata!.content as string,
    metadata: {
      source: match.metadata!.source as string
    },
    score: match.score
  }));
}

The Complete RAG Pipeline

async function ragQuery(userQuestion: string): Promise<string> {
  // 1. Retrieve relevant documents
  const relevantDocs = await searchSimilarDocuments(userQuestion, 5);

  // 2. Build the context
  const context = relevantDocs
    .map(doc => `[Source: ${doc.metadata.source}]\n${doc.content}`)
    .join('\n\n---\n\n');

  // 3. Generate a response with the LLM
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are an internal knowledge base assistant.
Answer questions based solely on the reference materials below.

Rules:
- Only use information found in the reference materials
- If the information is not available, say "No relevant information was found"
- Always cite the source that supports your answer

Reference materials:
${context}`
      },
      { role: 'user', content: userQuestion }
    ],
    temperature: 0.3
  });

  return response.choices[0].message.content!;
}

API Cost Optimization

Model Selection Strategy

type TaskComplexity = 'simple' | 'moderate' | 'complex';

function selectModel(task: TaskComplexity): string {
  const modelMap = {
    simple: 'gpt-4o-mini',   // Classification, simple extraction
    moderate: 'gpt-4o-mini', // Summarization, Q&A
    complex: 'gpt-4o'        // Complex reasoning, code generation
  };
  return modelMap[task];
}

// Cost estimation utility
function estimateCost(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const pricing: Record<string, { input: number; output: number }> = {
    'gpt-4o': { input: 0.0025, output: 0.01 },        // USD per 1K tokens
    'gpt-4o-mini': { input: 0.00015, output: 0.0006 }
  };
  const rate = pricing[model];
  return (inputTokens / 1000 * rate.input) + (outputTokens / 1000 * rate.output);
}
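As a worked example of the rate table above (the 100,000 calls/month volume is illustrative, not from a real workload):

```typescript
// Cost of one summarization call: 2,000 input tokens, 500 output tokens.
const inputTokens = 2000;
const outputTokens = 500;

// gpt-4o-mini, using the per-1K-token rates from the pricing table above
const miniCost = (inputTokens / 1000) * 0.00015 + (outputTokens / 1000) * 0.0006;
// gpt-4o, same call
const fullCost = (inputTokens / 1000) * 0.0025 + (outputTokens / 1000) * 0.01;

console.log(miniCost); // ~$0.0006 per call → ~$60 at 100k calls/month
console.log(fullCost); // ~$0.01 per call   → ~$1,000 at 100k calls/month
```

The roughly 17x price gap is why routing simple tasks to the smaller model is usually the single biggest cost lever.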

Caching Strategy

import { Redis } from 'ioredis';
import { createHash } from 'crypto';

const redis = new Redis(process.env.REDIS_URL!);

// Stable cache key derived from the prompt text
function hashString(s: string): string {
  return createHash('sha256').update(s).digest('hex');
}

async function cachedCompletion(
  prompt: string,
  options: { model: string; temperature: number }
): Promise<string> {
  // Only cache deterministic results (temperature === 0)
  if (options.temperature > 0) {
    return await directCompletion(prompt, options); // directCompletion: plain API call wrapper
  }

  const cacheKey = `llm:${options.model}:${hashString(prompt)}`;

  // Check cache
  const cached = await redis.get(cacheKey);
  if (cached) {
    return cached;
  }

  // Make the API call
  const result = await directCompletion(prompt, options);

  // Cache for 24 hours
  await redis.setex(cacheKey, 86400, result);

  return result;
}

Batch Processing to Reduce Costs

async function batchClassify(
  items: string[],
  batchSize: number = 20
): Promise<string[]> {
  const results: string[] = [];

  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);

    // Process multiple items in a single request
    const response = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [
        {
          role: 'system',
          content: `Classify each of the following ${batch.length} items.
Return a JSON object of the form {"classifications": ["...", ...]},
with one entry per item, in order.`
        },
        {
          role: 'user',
          content: batch.map((item, idx) => `${idx + 1}. ${item}`).join('\n')
        }
      ],
      response_format: { type: 'json_object' }
    });

    const batchResults = JSON.parse(response.choices[0].message.content!);
    results.push(...batchResults.classifications);
  }

  return results;
}

Error Handling and Fallbacks

Handling Rate Limits

import pRetry, { AbortError } from 'p-retry';

async function robustCompletion(
  messages: OpenAI.ChatCompletionMessageParam[]
): Promise<string> {
  return pRetry(
    async () => {
      try {
        const response = await openai.chat.completions.create({
          model: 'gpt-4o-mini',
          messages
        });
        return response.choices[0].message.content!;
      } catch (error: any) {
        if (error.status === 429) {
          // Retry on rate limit errors
          throw error;
        }
        if (error.status === 500 || error.status === 503) {
          // Retry on server errors
          throw error;
        }
        // All other errors fail immediately
        throw new AbortError(error);
      }
    },
    {
      retries: 3,
      minTimeout: 1000,
      maxTimeout: 10000,
      onFailedAttempt: (error) => {
        console.log(`Attempt ${error.attemptNumber} failed. Retrying...`);
      }
    }
  );
}

Fallback Strategy

async function completionWithFallback(
  messages: OpenAI.ChatCompletionMessageParam[]
): Promise<string> {
  const models = ['gpt-4o', 'gpt-4o-mini', 'gpt-3.5-turbo'];

  for (const model of models) {
    try {
      const response = await openai.chat.completions.create(
        { model, messages },
        { timeout: 30000 } // per-request timeout is a request option, not a body field
      );
      return response.choices[0].message.content!;
    } catch (error) {
      console.error(`${model} failed:`, error);
      continue;
    }
  }

  // All models failed
  throw new Error('All models failed');
}

Monitoring in Production

Key Metrics

interface LLMMetrics {
  requestId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  cost: number;
  success: boolean;
  errorType?: string;
}

async function trackedCompletion(
  messages: OpenAI.ChatCompletionMessageParam[],
  options: { model: string }
): Promise<{ result: string; metrics: LLMMetrics }> {
  const startTime = Date.now();
  const requestId = crypto.randomUUID();

  try {
    const response = await openai.chat.completions.create({
      model: options.model,
      messages
    });

    const metrics: LLMMetrics = {
      requestId,
      model: options.model,
      inputTokens: response.usage!.prompt_tokens,
      outputTokens: response.usage!.completion_tokens,
      latencyMs: Date.now() - startTime,
      cost: estimateCost(
        options.model,
        response.usage!.prompt_tokens,
        response.usage!.completion_tokens
      ),
      success: true
    };

    // Record metrics (recordMetrics: your metrics sink, e.g. Datadog or CloudWatch)
    await recordMetrics(metrics);

    return { result: response.choices[0].message.content!, metrics };
  } catch (error: any) {
    const metrics: LLMMetrics = {
      requestId,
      model: options.model,
      inputTokens: 0,
      outputTokens: 0,
      latencyMs: Date.now() - startTime,
      cost: 0,
      success: false,
      errorType: error.code || 'unknown'
    };

    await recordMetrics(metrics);
    throw error;
  }
}

Automated Quality Evaluation

// Automated response quality evaluation
async function evaluateResponseQuality(
  question: string,
  response: string,
  expectedTopics: string[]
): Promise<{
  relevance: number;
  completeness: number;
  accuracy: number;
}> {
  const evaluation = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are a response quality evaluator.
Score each dimension from 0 to 100:
- relevance: how well the answer addresses the question
- completeness: how thoroughly the answer covers the topic
- accuracy: how factually correct the information is

Respond in JSON format.`
      },
      {
        role: 'user',
        content: `Question: ${question}\n\nAnswer: ${response}\n\nExpected topics: ${expectedTopics.join(', ')}`
      }
    ],
    response_format: { type: 'json_object' }
  });

  return JSON.parse(evaluation.choices[0].message.content!);
}

Summary: Keys to Successful AI Implementation

Adoption Phase

  1. Clarify the problem: Identify the specific business problem AI will solve
  2. Evaluate ROI: Estimate costs and benefits quantitatively
  3. Start small: Validate impact with a proof of concept

Implementation Phase

  1. Prompt design: Leverage Few-shot and Chain of Thought techniques
  2. Build RAG: Create a knowledge base from your internal data
  3. Optimize costs: Choose the right model, use caching, and batch requests

Operations Phase

  1. Monitor: Continuously track latency, cost, and quality
  2. Feedback loop: Iterate based on user feedback
  3. Regular reviews: Periodically review accuracy and ROI

AI implementation is never a one-and-done effort. Continuous improvement and operational discipline are what separate projects that succeed from those that stall.


If you're struggling with your AI adoption journey, feel free to reach out.