AI · 2025-06-20 · 3 min read

A Practical Guide to AI Implementation: From LLM Integration to RAG in Production

Best practices and hard-won lessons for deploying machine learning models in production. A deep dive into LLM usage, prompt design, RAG implementation, and cost optimization.

髙木 晃宏

CEO / Engineer

TL;DR

  • Clarify ROI before adopting AI. The question isn't "how can we use AI?" — it's "what problem are we solving with AI?"
  • Prompt engineering is 90% of LLM success. Master Few-shot and Chain of Thought techniques.
  • Use RAG to put internal data to work. Your chunking strategy and vector DB choice make or break it.
  • In production, cost control, rate limit handling, and fallback logic are non-negotiable.

Introduction: The Reality of AI Adoption

"We want to implement AI."

If you're an engineer, you've probably heard this from leadership. But AI is a tool, not a goal. The real question is: what problem are you trying to solve?

Our team has worked on more than ten AI projects over the past two years. The gap between successful and failed projects was always clear.

What success looks like:

  • Starting from a concrete business problem
  • Starting small and validating impact early
  • Integrating naturally into existing human workflows

What failure looks like:

  • Vague goals like "do something with AI"
  • Large upfront investment before proving value
  • No plan for integrating with existing processes

This article walks through a practical approach to AI implementation, with a focus on LLMs (large language models).

Deciding Whether to Adopt AI

Identifying the Right Use Cases

Use cases where AI works well:

✅ Tasks AI is good at

  • Processing large volumes of text (summarization, classification, extraction)
  • Pattern recognition (anomaly detection, recommendations)
  • Natural language interfaces (chatbots, search)
  • Automating repetitive tasks that still require judgment

❌ Tasks AI is not good at

  • Processes requiring 100% accuracy (e.g., final review of legal documents)
  • Processes with extremely tight real-time requirements
  • Domains with very little data
  • High-stakes decisions with strict accountability requirements

An ROI Framework

interface AIProjectROI {
  // Cost factors
  developmentCost: number;       // Development cost
  apiCost: number;               // API usage fee (monthly)
  infrastructureCost: number;    // Infrastructure cost
  maintenanceCost: number;       // Operations and maintenance cost
  // Benefit factors
  timeSavingHours: number;       // Hours of work saved per month
  hourlyRate: number;            // Hourly rate equivalent
  qualityImprovement: number;    // Value from quality improvements
  newRevenueOpportunity: number; // New revenue opportunities
}

function calculateROI(project: AIProjectROI): {
  monthlyBenefit: number;
  monthlyCost: number;
  paybackMonths: number;
} {
  const monthlyBenefit =
    (project.timeSavingHours * project.hourlyRate) +
    project.qualityImprovement +
    project.newRevenueOpportunity;
  const monthlyCost =
    project.apiCost + project.infrastructureCost + project.maintenanceCost;
  const paybackMonths = project.developmentCost / (monthlyBenefit - monthlyCost);
  return { monthlyBenefit, monthlyCost, paybackMonths };
}

// Example: automated customer support response system
const supportBot = calculateROI({
  developmentCost: 2000000,    // ¥2,000,000
  apiCost: 50000,              // ¥50,000/month
  infrastructureCost: 10000,   // ¥10,000/month
  maintenanceCost: 30000,      // ¥30,000/month
  timeSavingHours: 160,        // 160 hours saved per month
  hourlyRate: 3000,            // ¥3,000/hour equivalent
  qualityImprovement: 50000,   // ¥50,000/month from quality gains
  newRevenueOpportunity: 0
});
// Result: ¥530,000 monthly benefit, ¥90,000 monthly cost, ~4.5-month payback period

LLM Usage Patterns

1. Text Classification and Sentiment Analysis

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

interface ClassificationResult {
  category: string;
  confidence: number;
  sentiment: 'positive' | 'neutral' | 'negative';
}

async function classifyCustomerInquiry(
  inquiry: string
): Promise<ClassificationResult> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: `You are a customer support inquiry classification system.
Classify the inquiry into one of the following categories:
- billing: billing and payment issues
- technical: technical problems
- account: account-related issues
- general: other general inquiries

Respond in JSON format with these keys:
- category: one of the categories above
- confidence: a number between 0 and 1
- sentiment: "positive", "neutral", or "negative"`
      },
      { role: 'user', content: inquiry }
    ],
    response_format: { type: 'json_object' },
    temperature: 0.1 // Keep temperature low for stable classification results
  });

  return JSON.parse(response.choices[0].message.content!);
}

// Usage
const result = await classifyCustomerInquiry(
  'It looks like my credit card was charged twice.'
);
// e.g. { category: 'billing', confidence: 0.95, sentiment: 'negative' }

2. Text Generation and Summarization

async function summarizeDocument(
  document: string,
  maxLength: number = 200
): Promise<string> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: `You are an expert document summarizer.
Summarize the given document in ${maxLength} characters or fewer.
Be concise while capturing all key points.`
      },
      { role: 'user', content: document }
    ],
    max_tokens: Math.ceil(maxLength * 1.5), // Japanese text is approximately 1.5 tokens per character
    temperature: 0.3
  });

  return response.choices[0].message.content!;
}

3. Structured Data Extraction

import { z } from 'zod';

// Type definition for extracted data
const InvoiceSchema = z.object({
  vendorName: z.string(),
  invoiceNumber: z.string(),
  invoiceDate: z.string(),
  dueDate: z.string(),
  totalAmount: z.number(),
  taxAmount: z.number(),
  lineItems: z.array(z.object({
    description: z.string(),
    quantity: z.number(),
    unitPrice: z.number(),
    amount: z.number()
  }))
});

type Invoice = z.infer<typeof InvoiceSchema>;

async function extractInvoiceData(invoiceText: string): Promise<Invoice> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are an invoice data extraction system.
Extract the following fields from the invoice text and return them as JSON:
- vendorName: issuing company name
- invoiceNumber: invoice number
- invoiceDate: invoice date (YYYY-MM-DD format)
- dueDate: payment due date (YYYY-MM-DD format)
- totalAmount: total amount (numeric)
- taxAmount: tax amount (numeric)
- lineItems: array of line item objects`
      },
      { role: 'user', content: invoiceText }
    ],
    response_format: { type: 'json_object' },
    temperature: 0
  });

  const data = JSON.parse(response.choices[0].message.content!);
  // Validate with Zod
  return InvoiceSchema.parse(data);
}

Prompt Engineering

Core Principles

  1. Be explicit: Eliminate ambiguity and give specific instructions
  2. Provide context: Supply the necessary background information
  3. Specify output format: Clearly state the expected format
  4. Use examples (Few-shot): Demonstrate with concrete examples to improve accuracy
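A minimal sketch of the four principles combined into one system prompt. The product-review triage scenario and every string below are hypothetical, purely to illustrate the structure:

```typescript
// Hypothetical prompt assembled from the four principles above.

// 1. Provide context
const context =
  'You are a product review triage assistant for an e-commerce site.';

// 2. Be explicit
const task =
  'Assign the review a priority: "high" (defect, safety issue, or refund request), ' +
  '"medium" (usability complaint), or "low" (everything else).';

// 3. Specify output format
const format =
  'Respond with a single JSON object: {"priority": "...", "reason": "..."}';

// 4. Use examples (Few-shot)
const example =
  'Example:\nReview: "The charger overheated and melted the cable."\n' +
  'Output: {"priority": "high", "reason": "safety issue"}';

const systemPrompt = [context, task, format, example].join('\n\n');
```

Passed as the system message, this gives the model background, an unambiguous task, a parseable output contract, and one worked example.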

Few-shot Learning

async function categorizeProduct(productName: string): Promise<string> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: 'Determine the category for a given product name.'
      },
      // Few-shot examples
      { role: 'user', content: 'iPhone 15 Pro Max 256GB' },
      { role: 'assistant', content: 'Smartphones' },
      { role: 'user', content: 'SONY WH-1000XM5' },
      { role: 'assistant', content: 'Headphones & Earphones' },
      { role: 'user', content: 'MacBook Air M3' },
      { role: 'assistant', content: 'Laptops' },
      // Actual input
      { role: 'user', content: productName }
    ],
    temperature: 0.1
  });

  return response.choices[0].message.content!;
}

Chain of Thought

For tasks requiring complex reasoning, walking the model through its thinking step by step significantly improves accuracy.

async function analyzeBusinessProblem(problem: string): Promise<{
  analysis: string;
  recommendations: string[];
}> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are a business consultant.
When analyzing a problem, work through the following steps:
1. Identify the core issue
2. List all contributing factors
3. Assess the impact of each factor
4. Consider possible solutions
5. Recommend the most effective course of action

Show your reasoning at each step before presenting your final recommendations.`
      },
      { role: 'user', content: problem }
    ],
    temperature: 0.7
  });

  // Parse and structure the response
  const content = response.choices[0].message.content!;
  // ... parsing logic
  return { analysis: content, recommendations: [] };
}

Implementing RAG (Retrieval-Augmented Generation)

RAG is an architecture where the LLM generates responses grounded in relevant information retrieved from an external knowledge base.

Architecture Overview

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    User     │────▶│ Query Embed-│────▶│  Vector DB  │
│  Question   │     │    ding     │     │ (Pinecone)  │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │ Similar docs
┌─────────────┐                                │
│     LLM     │◀───────────────────────────────┘
│   (GPT-4)   │
└──────┬──────┘
       │
┌──────▼──────┐
│  Response   │
│ Generation  │
└─────────────┘

Document Preprocessing and Chunking

interface DocumentChunk {
  id: string;
  content: string;
  metadata: {
    source: string;
    page?: number;
    section?: string;
  };
  embedding?: number[];
}

function chunkDocument(
  document: string,
  chunkSize: number = 1000,
  overlap: number = 200
): string[] {
  const chunks: string[] = [];
  let start = 0;

  while (start < document.length) {
    let end = start + chunkSize;

    // Avoid splitting in the middle of a sentence
    if (end < document.length) {
      const lastPeriod = document.lastIndexOf('。', end);
      const lastNewline = document.lastIndexOf('\n', end);
      const breakPoint = Math.max(lastPeriod, lastNewline);
      if (breakPoint > start) {
        end = breakPoint + 1;
      }
    }

    chunks.push(document.slice(start, end).trim());
    // Step forward with overlap; the Math.max guard ensures the window
    // always advances even when a break point lands near the chunk start
    start = Math.max(end - overlap, start + 1);
  }

  return chunks;
}

// Semantic chunking (a more advanced approach)
async function semanticChunk(
  document: string,
  maxChunkSize: number = 1500
): Promise<string[]> {
  // Split on headings and paragraphs
  const sections = document.split(/\n#{1,3}\s/);
  const chunks: string[] = [];
  let currentChunk = '';

  for (const section of sections) {
    if (currentChunk.length + section.length > maxChunkSize) {
      if (currentChunk) {
        chunks.push(currentChunk.trim());
      }
      currentChunk = section;
    } else {
      currentChunk += '\n' + section;
    }
  }

  if (currentChunk) {
    chunks.push(currentChunk.trim());
  }

  return chunks;
}
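To size the vector index up front, the chunk count of fixed-size chunking can be estimated from the stride (chunk size minus overlap). `estimateChunkCount` below is a hypothetical helper, not part of the pipeline above, and real counts shift slightly because boundaries snap to sentence ends:

```typescript
// Approximate chunk count for fixed-size chunking with overlap.
// Each chunk after the first advances by (chunkSize - overlap) characters.
function estimateChunkCount(
  docLength: number,
  chunkSize: number = 1000,
  overlap: number = 200
): number {
  const stride = chunkSize - overlap;
  return Math.max(1, Math.ceil((docLength - overlap) / stride));
}

// A 100-page PDF at roughly 2,000 characters per page:
estimateChunkCount(200000); // → 250 chunks to embed and store
```

Estimates like this make the embedding cost and index size of a new document set predictable before any API calls are made.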

Vector Embeddings and Search

import { Pinecone } from '@pinecone-database/pinecone';

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index('knowledge-base');

// Embed and store documents
async function indexDocuments(chunks: DocumentChunk[]): Promise<void> {
  // Generate embedding vectors using the OpenAI Embeddings API
  const embeddings = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunks.map(c => c.content)
  });

  // Store in Pinecone
  const vectors = chunks.map((chunk, i) => ({
    id: chunk.id,
    values: embeddings.data[i].embedding,
    metadata: { content: chunk.content, ...chunk.metadata }
  }));

  await index.upsert(vectors);
}

// Search for similar documents
async function searchSimilarDocuments(
  query: string,
  topK: number = 5
): Promise<Array<DocumentChunk & { score?: number }>> {
  // Convert the query to an embedding vector
  const queryEmbedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query
  });

  // Similarity search in Pinecone
  const results = await index.query({
    vector: queryEmbedding.data[0].embedding,
    topK,
    includeMetadata: true
  });

  return results.matches.map(match => ({
    id: match.id,
    content: match.metadata!.content as string,
    metadata: {
      source: match.metadata!.source as string
    },
    score: match.score
  }));
}

The Complete RAG Pipeline

async function ragQuery(userQuestion: string): Promise<string> {
  // 1. Retrieve relevant documents
  const relevantDocs = await searchSimilarDocuments(userQuestion, 5);

  // 2. Build the context
  const context = relevantDocs
    .map(doc => `[Source: ${doc.metadata.source}]\n${doc.content}`)
    .join('\n\n---\n\n');

  // 3. Generate a response with the LLM
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are an internal knowledge base assistant.
Answer questions based solely on the reference materials below.

Rules:
- Only use information found in the reference materials
- If the information is not available, say "No relevant information was found"
- Always cite the source that supports your answer

Reference materials:
${context}`
      },
      { role: 'user', content: userQuestion }
    ],
    temperature: 0.3
  });

  return response.choices[0].message.content!;
}

API Cost Optimization

Model Selection Strategy

type TaskComplexity = 'simple' | 'moderate' | 'complex';

function selectModel(task: TaskComplexity): string {
  const modelMap = {
    simple: 'gpt-4o-mini',   // Classification, simple extraction
    moderate: 'gpt-4o-mini', // Summarization, Q&A
    complex: 'gpt-4o'        // Complex reasoning, code generation
  };
  return modelMap[task];
}

// Cost estimation utility
function estimateCost(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const pricing: Record<string, { input: number; output: number }> = {
    'gpt-4o': { input: 0.0025, output: 0.01 },        // USD per 1K tokens
    'gpt-4o-mini': { input: 0.00015, output: 0.0006 }
  };
  const rate = pricing[model];
  return (inputTokens / 1000 * rate.input) + (outputTokens / 1000 * rate.output);
}
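As a worked example of the rate table above (the 100,000 calls/month volume is illustrative, not from a real workload):

```typescript
// Cost of one summarization call: 2,000 input tokens, 500 output tokens.
const inputTokens = 2000;
const outputTokens = 500;

// gpt-4o-mini, using the per-1K-token rates from the pricing table above
const miniCost = (inputTokens / 1000) * 0.00015 + (outputTokens / 1000) * 0.0006;
// gpt-4o, same call
const fullCost = (inputTokens / 1000) * 0.0025 + (outputTokens / 1000) * 0.01;

console.log(miniCost); // ~$0.0006 per call → ~$60 at 100k calls/month
console.log(fullCost); // ~$0.01 per call   → ~$1,000 at 100k calls/month
```

The roughly 17x price gap is why routing simple tasks to the smaller model is usually the single biggest cost lever.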

Caching Strategy

import { Redis } from 'ioredis';
import { createHash } from 'crypto';

const redis = new Redis(process.env.REDIS_URL!);

// Stable cache key derived from the prompt text
function hashString(s: string): string {
  return createHash('sha256').update(s).digest('hex');
}

async function cachedCompletion(
  prompt: string,
  options: { model: string; temperature: number }
): Promise<string> {
  // Only cache deterministic results (temperature === 0)
  if (options.temperature > 0) {
    return await directCompletion(prompt, options); // directCompletion: plain API call wrapper
  }

  const cacheKey = `llm:${options.model}:${hashString(prompt)}`;

  // Check cache
  const cached = await redis.get(cacheKey);
  if (cached) {
    return cached;
  }

  // Make the API call
  const result = await directCompletion(prompt, options);

  // Cache for 24 hours
  await redis.setex(cacheKey, 86400, result);

  return result;
}

Batch Processing to Reduce Costs

async function batchClassify(
  items: string[],
  batchSize: number = 20
): Promise<string[]> {
  const results: string[] = [];

  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);

    // Process multiple items in a single request
    const response = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [
        {
          role: 'system',
          content: `Classify each of the following ${batch.length} items.
Return a JSON object of the form {"classifications": ["...", ...]},
with one entry per item, in order.`
        },
        {
          role: 'user',
          content: batch.map((item, idx) => `${idx + 1}. ${item}`).join('\n')
        }
      ],
      response_format: { type: 'json_object' }
    });

    const batchResults = JSON.parse(response.choices[0].message.content!);
    results.push(...batchResults.classifications);
  }

  return results;
}

Error Handling and Fallbacks

Handling Rate Limits

import pRetry, { AbortError } from 'p-retry';

async function robustCompletion(
  messages: OpenAI.ChatCompletionMessageParam[]
): Promise<string> {
  return pRetry(
    async () => {
      try {
        const response = await openai.chat.completions.create({
          model: 'gpt-4o-mini',
          messages
        });
        return response.choices[0].message.content!;
      } catch (error: any) {
        if (error.status === 429) {
          // Retry on rate limit errors
          throw error;
        }
        if (error.status === 500 || error.status === 503) {
          // Retry on server errors
          throw error;
        }
        // All other errors fail immediately
        throw new AbortError(error);
      }
    },
    {
      retries: 3,
      minTimeout: 1000,
      maxTimeout: 10000,
      onFailedAttempt: (error) => {
        console.log(`Attempt ${error.attemptNumber} failed. Retrying...`);
      }
    }
  );
}

Fallback Strategy

async function completionWithFallback(
  messages: OpenAI.ChatCompletionMessageParam[]
): Promise<string> {
  const models = ['gpt-4o', 'gpt-4o-mini', 'gpt-3.5-turbo'];

  for (const model of models) {
    try {
      const response = await openai.chat.completions.create(
        { model, messages },
        { timeout: 30000 } // per-request timeout is a request option, not a body field
      );
      return response.choices[0].message.content!;
    } catch (error) {
      console.error(`${model} failed:`, error);
      continue;
    }
  }

  // All models failed
  throw new Error('All models failed');
}

Monitoring in Production

Key Metrics

interface LLMMetrics {
  requestId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  cost: number;
  success: boolean;
  errorType?: string;
}

async function trackedCompletion(
  messages: OpenAI.ChatCompletionMessageParam[],
  options: { model: string }
): Promise<{ result: string; metrics: LLMMetrics }> {
  const startTime = Date.now();
  const requestId = crypto.randomUUID();

  try {
    const response = await openai.chat.completions.create({
      model: options.model,
      messages
    });

    const metrics: LLMMetrics = {
      requestId,
      model: options.model,
      inputTokens: response.usage!.prompt_tokens,
      outputTokens: response.usage!.completion_tokens,
      latencyMs: Date.now() - startTime,
      cost: estimateCost(
        options.model,
        response.usage!.prompt_tokens,
        response.usage!.completion_tokens
      ),
      success: true
    };

    // Record metrics (recordMetrics: your metrics sink, e.g. Datadog or CloudWatch)
    await recordMetrics(metrics);

    return { result: response.choices[0].message.content!, metrics };
  } catch (error: any) {
    const metrics: LLMMetrics = {
      requestId,
      model: options.model,
      inputTokens: 0,
      outputTokens: 0,
      latencyMs: Date.now() - startTime,
      cost: 0,
      success: false,
      errorType: error.code || 'unknown'
    };

    await recordMetrics(metrics);
    throw error;
  }
}

Automated Quality Evaluation

// Automated response quality evaluation
async function evaluateResponseQuality(
  question: string,
  response: string,
  expectedTopics: string[]
): Promise<{
  relevance: number;
  completeness: number;
  accuracy: number;
}> {
  const evaluation = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are a response quality evaluator.
Score each dimension from 0 to 100:
- relevance: how well the answer addresses the question
- completeness: how thoroughly the answer covers the topic
- accuracy: how factually correct the information is

Respond in JSON format.`
      },
      {
        role: 'user',
        content: `Question: ${question}\n\nAnswer: ${response}\n\nExpected topics: ${expectedTopics.join(', ')}`
      }
    ],
    response_format: { type: 'json_object' }
  });

  return JSON.parse(evaluation.choices[0].message.content!);
}

Summary: Keys to Successful AI Implementation

Adoption Phase

  1. Clarify the problem: Identify the specific business problem AI will solve
  2. Evaluate ROI: Estimate costs and benefits quantitatively
  3. Start small: Validate impact with a proof of concept

Implementation Phase

  1. Prompt design: Leverage Few-shot and Chain of Thought techniques
  2. Build RAG: Create a knowledge base from your internal data
  3. Optimize costs: Choose the right model, use caching, and batch requests

Operations Phase

  1. Monitor: Continuously track latency, cost, and quality
  2. Feedback loop: Iterate based on user feedback
  3. Regular reviews: Periodically review accuracy and ROI

AI implementation is never a one-and-done effort. Continuous improvement and operational discipline are what separate projects that succeed from those that stall.


If you're struggling with your AI adoption journey, feel free to reach out.