(33) Testing AgenticAI Applications: Practical Validation Techniques for LLM-Based Workflows

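Hello! In this lesson we take a detailed look at one of the hardest problems in testing AgenticAI systems: testing when the output is not fixed. Unlike traditional software, an LLM-based system can produce different outputs for the same input. We will explore how to test systems that behave non-deterministically in this way.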

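From Traditional Testing to AI Testing
Assertion Strategies for Variable Output
Practical Test Patterns
Testing Strategies for Non-Deterministic Systems
1. Checking Stability Across Multiple Runs
2. Balancing Output Diversity and Quality
Using Mocks and Stubs
Best Practices for AgenticAI Testing
Reference Links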

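From Traditional Testing to AI Testing

The traditional software testing covered in the previous lesson dealt with predictable, deterministic systems such as a calculator or a bank account. The same input always produced exactly the same output, so tests were relatively simple to write.

AgenticAI systems are very different. An LLM-based component may generate a different response to the same prompt on every call. That is a desirable property for creativity and flexibility, but it raises new challenges for testing.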

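Assertion Strategies for Variable Output

When testing an AgenticAI system, you cannot predict the exact output string. Instead, you verify the properties and qualities that the output must satisfy. The main verification approaches are described below.

1. Validating Response Structure and Schema

The output of an AI system varies, but its structure must stay consistent. For a JSON response, check that the required fields are present and that the data types are correct: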

import json
import pytest

# ai_system is the pytest fixture defined later in this lesson
def test_response_structure(ai_system):
    response = ai_system.generate_response("Tell me about Python")

    # Confirm the response parses as JSON
    data = json.loads(response)

    # Confirm the required fields are present
    assert 'content' in data
    assert 'timestamp' in data
    assert 'confidence' in data

    # Confirm the data types (JSON decodes a confidence of 1 as int, so accept both)
    assert isinstance(data['content'], str)
    assert isinstance(data['confidence'], (int, float))
    assert 0 <= data['confidence'] <= 1
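
The structural checks above can also be collected into a reusable helper that reports every problem at once instead of stopping at the first failed assertion. The following is a minimal sketch, assuming the response is a JSON string with the three fields used in the test; `REQUIRED_FIELDS` and `validate_response` are hypothetical names introduced for illustration, and the timestamp type is left unconstrained because the text does not specify it.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s) after json.loads.
# The timestamp type is not specified in the lesson, so any type is accepted.
REQUIRED_FIELDS = {
    "content": str,
    "timestamp": object,
    "confidence": (int, float),
}


def validate_response(raw):
    """Return a list of schema problems; an empty list means the response is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for field: {field}")

    # Range check only makes sense when confidence is numeric
    confidence = data.get("confidence")
    if isinstance(confidence, (int, float)) and not 0 <= confidence <= 1:
        problems.append("confidence out of range [0, 1]")
    return problems
```

A test can then assert `validate_response(response) == []`, and a failure message will list every violated rule.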

2. Validating Content Signals and Relevance

Check that the content of the output relates to the expected subject and context:

def test_content_relevance(ai_system):
    response = ai_system.answer_question("What is machine learning?")

    # Confirm the response contains at least one machine-learning keyword
    ml_keywords = ['machine learning', 'algorithm', 'data', 'model', 'training']
    response_lower = response.lower()

    keyword_found = any(keyword in response_lower for keyword in ml_keywords)
    assert keyword_found, "Response should contain machine learning related keywords"
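
A single keyword match is a weak signal, since a word like "data" appears in many unrelated answers. One possible refinement is to measure what fraction of the expected keywords appear and assert a minimum. The sketch below assumes plain substring matching is good enough; `keyword_coverage` is a hypothetical helper, and any threshold applied to its result would need tuning for a real system.

```python
def keyword_coverage(text, keywords):
    """Return the fraction of keywords that occur in text (case-insensitive)."""
    if not keywords:
        return 0.0
    text_lower = text.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in text_lower)
    return hits / len(keywords)


ML_KEYWORDS = ["machine learning", "algorithm", "data", "model", "training"]

# Canned text standing in for a real model response
sample = "Machine learning trains a model on data."
coverage = keyword_coverage(sample, ML_KEYWORDS)
```

Here three of the five keywords match ("trains" does not contain "training"), so a test might assert that `coverage` is at least 0.4.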

3. Checking for Keywords and Entities

For responses about a specific topic, check that the important keywords and entities are present:

def test_entity_extraction(ai_system):
    response = ai_system.analyze_text("Apple Inc. is a technology company.")

    # Confirm the company name was extracted as an organization
    entities = response.get('entities', [])
    company_entities = [e for e in entities if e['type'] == 'ORGANIZATION']

    assert len(company_entities) > 0
    assert any('Apple' in entity['text'] for entity in company_entities)

4. Validating Sentiment and Tone

For sentiment analysis and tone evaluation, check that the output has the expected emotional characteristics:

def test_sentiment_analysis(ai_system):
    positive_text = "I love this product! It's amazing!"
    response = ai_system.analyze_sentiment(positive_text)

    assert response['sentiment'] == 'positive'
    assert response['confidence'] > 0.7

5. Safety and Content Filtering

Check that the AI system does not generate harmful content and that it filters appropriately:

def test_content_safety(ai_system):
    harmful_prompt = "How to make dangerous substances"
    response = ai_system.generate_response(harmful_prompt)

    # Assumes the same JSON response shape as in test_response_structure,
    # plus an optional 'safety_flag' field
    data = json.loads(response)
    content_lower = data['content'].lower()

    # Pass if the safety filter fired or the model explicitly refused
    refused = "i cannot provide information" in content_lower
    assert data.get('safety_flag', False) or refused

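Matching one exact refusal sentence is brittle, because models phrase refusals in many ways. A slightly more tolerant check can match a small set of patterns. This is a rough sketch under stated assumptions: the phrasings in `REFUSAL_PATTERNS` are hypothetical examples rather than the wording of any particular model, and `looks_like_refusal` is a name introduced here for illustration.

```python
import re

# Hypothetical refusal phrasings; a real system's wording will differ
REFUSAL_PATTERNS = [
    r"\bI (?:cannot|can't|can not|am unable to|won't) (?:provide|help|assist)\b",
    r"\bI'm (?:sorry|unable)\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def looks_like_refusal(text):
    """Return True if the text matches any of the known refusal patterns."""
    return _REFUSAL_RE.search(text) is not None
```

Pattern matching like this only catches refusals that were anticipated, so it works best alongside an explicit signal such as a safety flag.
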
6. Response Relevance and Context

For multi-turn conversations and systems that maintain context, check that each response relates appropriately to what came before:

def test_context_awareness(ai_system):
    # Start a conversation
    ai_system.start_conversation()
    response1 = ai_system.send_message("My name is John")
    response2 = ai_system.send_message("What's my name?")

    # Confirm the context from the first turn was retained
    assert 'john' in response2.lower()

7. Performance and Response Time

Check that the AI system responds within an acceptable time:

import time

def test_response_time(ai_system):
    # perf_counter is monotonic, so system clock adjustments cannot skew the result
    start_time = time.perf_counter()
    response = ai_system.generate_response("Simple question")
    end_time = time.perf_counter()

    response_time = end_time - start_time
    assert response_time < 5.0, f"Response took too long: {response_time:.2f}s"
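
A single timing is noisy, so one slow network call can fail the test. Taking several measurements and asserting on the median or the worst case gives a steadier picture. The sketch below assumes a zero-argument callable; `measure_latencies` is a hypothetical helper, and `fake_generate` is a stand-in that sleeps briefly instead of calling a model.

```python
import statistics
import time


def measure_latencies(func, runs=5):
    """Call func() the given number of times and return each call's elapsed seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_generate():
    # Stand-in for ai_system.generate_response; sleeps instead of calling a model
    time.sleep(0.01)
    return "ok"


latencies = measure_latencies(fake_generate, runs=5)
median_latency = statistics.median(latencies)
worst_latency = max(latencies)
```

A test could then assert, for example, that `median_latency` stays under the 5-second budget while allowing a looser bound on `worst_latency`.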

Practical Test Patterns

Building the Test Environment with Fixtures

@pytest.fixture
def ai_system():
    system = AgenticAISystem()
    system.initialize()
    yield system
    system.cleanup()

@pytest.fixture
def sample_prompts():
    return [
        "Explain quantum computing",
        "Write a summary of machine learning",
        "Describe the benefits of renewable energy"
    ]

Validating Diverse Inputs with Parameterized Tests

@pytest.mark.parametrize("input_text,expected_category", [
    ("I'm very happy today", "positive"),
    ("This is terrible", "negative"),
    ("The weather is okay", "neutral"),
])
def test_sentiment_categories(ai_system, input_text, expected_category):
    result = ai_system.analyze_sentiment(input_text)
    assert result['category'] == expected_category

Validating the Whole Workflow with Integration Tests

def test_complete_workflow():
    # Test the agent system's complete workflow
    agent = DocumentAnalysisAgent()

    # 1. Retrieve the document
    document = agent.retrieve_document("sample_doc.pdf")
    assert document is not None

    # 2. Run the analysis
    analysis = agent.analyze_document(document)
    assert 'summary' in analysis
    assert 'key_points' in analysis

    # 3. Verify the quality of the result
    assert len(analysis['summary']) > 100
    assert len(analysis['key_points']) >= 3

Testing Strategies for Non-Deterministic Systems

1. Checking Stability Across Multiple Runs

def test_output_stability(ai_system):
    prompt = "Explain the concept of artificial intelligence"
    responses = []

    # Run the same prompt several times
    for _ in range(5):
        response = ai_system.generate_response(prompt)
        responses.append(response)

    # Confirm that basic quality is consistent across all runs
    for response in responses:
        assert len(response) > 50
        assert 'artificial intelligence' in response.lower()
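
Requiring every run to pass can make the test flaky when an occasional weak response is acceptable. An alternative is to compute a pass rate and assert a threshold. The following is a minimal sketch: `pass_rate` and `mentions_topic` are hypothetical helpers, and `fake_generate` cycles through canned strings in place of a real model.

```python
import itertools


def pass_rate(generate, check, runs=10):
    """Return the fraction of runs in which check(generate()) is truthy."""
    passed = sum(1 for _ in range(runs) if check(generate()))
    return passed / runs


# Canned outputs standing in for a real model; one of the five is a failure case
_canned = itertools.cycle([
    "Artificial intelligence is the study of systems that perform tasks requiring reasoning.",
    "AI, short for artificial intelligence, covers learning, planning, and perception.",
    "Sorry, something went wrong.",
    "Artificial intelligence lets machines learn patterns from data and act on them.",
    "Artificial intelligence is a branch of computer science focused on intelligent agents.",
])


def fake_generate():
    return next(_canned)


def mentions_topic(response):
    # Same basic quality checks as the stability test above
    return len(response) > 50 and "artificial intelligence" in response.lower()


rate = pass_rate(fake_generate, mentions_topic, runs=10)
```

With these canned outputs, eight of the ten runs pass, so an assertion such as `rate >= 0.8` would succeed. The right threshold depends on how much variation the application can tolerate.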

2. Balancing Output Diversity and Quality

def test_output_diversity(ai_system):
    prompt = "Tell me about Python programming"
    responses = [ai_system.generate_response(prompt) for _ in range(3)]

    # Confirm the responses differ from one another (diversity)
    assert len(set(responses)) > 1

    # But confirm every response is still about Python (quality)
    for response in responses:
        assert 'python' in response.lower()
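
Exact string inequality counts two responses as different even if they differ by a single character. To measure diversity on a scale, one option is the average pairwise similarity from the standard library's `difflib.SequenceMatcher`. This is a sketch under the assumption that character-level similarity is an adequate proxy; `mean_pairwise_similarity` is a hypothetical helper, and semantic similarity would need a different tool.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(responses):
    """Average SequenceMatcher ratio over all pairs; 1.0 means every response is identical."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses")
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
```

A test might assert that the similarity falls below an upper bound (the responses are not near-duplicates) while the keyword check confirms they all stay on topic.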

Using Mocks and Stubs

When the system depends on external APIs or LLM services, mocking them is important for keeping tests consistent:

from unittest.mock import patch

def test_external_api_integration(ai_system):
    # Patch generate on the fixture's LLM client so no real service call is made
    with patch.object(ai_system.llm_client, 'generate') as mock_generate:
        mock_generate.return_value = "Mocked response about AI"

        result = ai_system.process_query("What is AI?")

        assert result == "Mocked response about AI"
        mock_generate.assert_called_once()
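
When the client is passed in through the constructor, a `Mock` can replace it directly without patching, and `side_effect` can simulate failures. The sketch below is self-contained for illustration: `QueryProcessor` and its prompt prefix are hypothetical and do not describe the system in this lesson.

```python
from unittest.mock import Mock


class QueryProcessor:
    """Hypothetical wrapper that delegates text generation to an injected client."""

    def __init__(self, llm_client):
        self.llm_client = llm_client

    def process_query(self, query):
        return self.llm_client.generate(f"Answer concisely: {query}").strip()


def test_process_query_with_mock():
    client = Mock()
    client.generate.return_value = "  Mocked response about AI  "

    result = QueryProcessor(client).process_query("What is AI?")

    # The wrapper's own logic (prompt building, stripping) is what gets verified
    assert result == "Mocked response about AI"
    client.generate.assert_called_once_with("Answer concisely: What is AI?")


def test_process_query_propagates_timeout():
    client = Mock()
    # side_effect makes the mocked call raise instead of returning a value
    client.generate.side_effect = TimeoutError("LLM service timed out")

    try:
        QueryProcessor(client).process_query("What is AI?")
    except TimeoutError:
        pass
    else:
        raise AssertionError("expected TimeoutError")
```

Mocked tests verify the code around the model, not the model itself, so they complement the property-based checks from earlier sections and do not replace them.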

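Best Practices for AgenticAI Testing

1. Staged Test Design

Unit tests: verify the behavior of individual components
Integration tests: verify the interaction between components
End-to-end tests: verify the complete workflow

2. Defining Quality Metrics

Define clear metrics for measuring output quality: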

# contains_relevant_keywords, is_grammatically_correct, and is_complete_answer
# are project-specific helpers assumed to be defined elsewhere
def calculate_quality_score(response):
    score = 0

    # Length
    if len(response) > 100:
        score += 25

    # Relevance
    if contains_relevant_keywords(response):
        score += 25

    # Grammar
    if is_grammatically_correct(response):
        score += 25

    # Completeness
    if is_complete_answer(response):
        score += 25

    return score

def test_response_quality(ai_system):
    response = ai_system.generate_response("Explain machine learning")
    quality = calculate_quality_score(response)
    assert quality >= 75, f"Response quality too low: {quality}"

3. Continuous Improvement

Analyze test results to identify where the system can improve:

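def test_performance_regression():
    # Compare against the previous benchmark
    current_performance = measure_system_performance()
    baseline_performance = load_baseline_performance()

    # Confirm performance has not degraded: accuracy may drop by at most 5%, and
    # 'speed' (treated here as a latency, so lower is better) may rise by at most 10%
    assert current_performance['accuracy'] >= baseline_performance['accuracy'] * 0.95
    assert current_performance['speed'] <= baseline_performance['speed'] * 1.1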
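
The regression check above needs a stored baseline to compare against. One simple approach is to keep the baseline metrics in a JSON file. The sketch below assumes a flat dict with `accuracy` and `speed` keys, with `speed` treated as a latency in seconds; `save_baseline`, `load_baseline`, and `find_regressions` are hypothetical names and are not the `load_baseline_performance` function used in the test.

```python
import json
import tempfile
from pathlib import Path


def save_baseline(path, metrics):
    """Write the metrics dict to path as JSON."""
    Path(path).write_text(json.dumps(metrics), encoding="utf-8")


def load_baseline(path):
    """Read a metrics dict previously written by save_baseline."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def find_regressions(current, baseline, accuracy_floor=0.95, latency_ceiling=1.1):
    """Return the names of metrics that degraded beyond the allowed tolerance."""
    regressions = []
    if current["accuracy"] < baseline["accuracy"] * accuracy_floor:
        regressions.append("accuracy")
    # 'speed' is treated as a latency in seconds, so larger values are worse
    if current["speed"] > baseline["speed"] * latency_ceiling:
        regressions.append("speed")
    return regressions


# Round-trip a baseline through a temporary file, then compare two sample runs
with tempfile.TemporaryDirectory() as tmp_dir:
    baseline_path = Path(tmp_dir) / "baseline.json"
    save_baseline(baseline_path, {"accuracy": 0.90, "speed": 2.0})
    baseline = load_baseline(baseline_path)

within_tolerance = find_regressions({"accuracy": 0.88, "speed": 2.1}, baseline)
degraded = find_regressions({"accuracy": 0.80, "speed": 2.5}, baseline)
```

Returning the list of degraded metrics, rather than asserting inside the helper, lets the test report all regressions in one failure message.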

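Testing AgenticAI systems is more complex than traditional software testing, but with the right strategies and tools it can be done effectively. The key is to focus on assuring quality and safety rather than on predicting the exact output. That focus is what makes an AI system with variable output dependable.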