Using Langfuse for LLM Observability

Complete guide to using Langfuse for LLM observability, tracing, and analytics in the Inkeep Agent Framework

Langfuse is an open-source LLM engineering platform that provides specialized observability for AI applications, including token usage tracking, model performance analytics, and detailed LLM interaction tracing.

Quick Start

1. Setup Langfuse Account

First, create a Langfuse account and get your API keys:

  1. Sign up at Langfuse Cloud (https://cloud.langfuse.com)
  2. Create a new project in your Langfuse dashboard
  3. Get your API keys from the project settings:
    • Public Key: pk-lf-xxxxxxxxxx
    • Secret Key: sk-lf-xxxxxxxxxx

2. Configure Langfuse

To integrate Langfuse with your Inkeep Agent Framework instrumentation, you need to modify your instrumentation file to include the Langfuse span processor.

You'll do this in two steps: set the Langfuse environment variables, then replace the default setup with a custom NodeSDK configuration.

Set your environment variables:

LANGFUSE_PUBLIC_KEY=pk-lf-xxxxxxxxxx
LANGFUSE_SECRET_KEY=sk-lf-xxxxxxxxxx
LANGFUSE_BASE_URL=https://us.cloud.langfuse.com

Update your instrumentation file:

import { 
  defaultSpanProcessors, 
  defaultContextManager, 
  defaultResource, 
  defaultTextMapPropagator, 
  defaultInstrumentations 
} from "@inkeep/agents-run-api/instrumentation";
import { NodeSDK } from "@opentelemetry/sdk-node";
import { LangfuseSpanProcessor } from "@langfuse/otel";

export const defaultSDK = new NodeSDK({
  resource: defaultResource,
  contextManager: defaultContextManager,
  textMapPropagator: defaultTextMapPropagator,
  spanProcessors: [...defaultSpanProcessors, new LangfuseSpanProcessor()],
  instrumentations: defaultInstrumentations,
});

defaultSDK.start();

What This Configuration Does

  • Preserves all default instrumentation: Uses the same resource, context manager, propagator, and instrumentations as the default setup
  • Adds Langfuse span processor: Extends the default span processors with Langfuse's processor for specialized LLM observability
  • Maintains compatibility: Your existing traces will continue to work while adding Langfuse-specific features
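
If you want Langfuse to be active only when credentials are configured, one option is to attach its span processor conditionally. This is a minimal sketch rather than part of the default setup; it uses only the imports shown above and checks the LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY environment variables:

import {
  defaultSpanProcessors,
  defaultContextManager,
  defaultResource,
  defaultTextMapPropagator,
  defaultInstrumentations
} from "@inkeep/agents-run-api/instrumentation";
import { NodeSDK } from "@opentelemetry/sdk-node";
import { LangfuseSpanProcessor } from "@langfuse/otel";

// Only attach the Langfuse processor when credentials are present,
// so environments without keys keep the default tracing behavior.
const langfuseEnabled =
  !!process.env.LANGFUSE_PUBLIC_KEY && !!process.env.LANGFUSE_SECRET_KEY;

export const defaultSDK = new NodeSDK({
  resource: defaultResource,
  contextManager: defaultContextManager,
  textMapPropagator: defaultTextMapPropagator,
  spanProcessors: langfuseEnabled
    ? [...defaultSpanProcessors, new LangfuseSpanProcessor()]
    : [...defaultSpanProcessors],
  instrumentations: defaultInstrumentations,
});

defaultSDK.start();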

Dataset Setup and Execution

Use the Inkeep Agent Cookbook repository, which provides ready-to-use scripts for creating and running Langfuse dataset evaluations programmatically.

1. Clone the Agent Cookbook Repository

git clone https://github.com/inkeep/agent-cookbook.git
cd agent-cookbook/evals/langfuse-dataset-example

Set up environment variables in a .env file:

# Langfuse configuration (required for both scripts)
LANGFUSE_PUBLIC_KEY=your_langfuse_public_key
LANGFUSE_SECRET_KEY=your_langfuse_secret_key
LANGFUSE_BASE_URL=https://cloud.langfuse.com

# Chat API configuration (for dataset runner)
INKEEP_AGENTS_RUN_API_KEY=your_api_key
INKEEP_AGENTS_RUN_API_URL=your_chat_api_base_url

# Execution context (for dataset runner)
INKEEP_TENANT_ID=your_tenant_id
INKEEP_PROJECT_ID=your_project_id
INKEEP_GRAPH_ID=your_graph_id

2. Initialize Dataset with Sample Data

Run the basic Langfuse example to initialize a dataset with sample user messages:

pnpm run langfuse-init-example

This script will:

  • Connect to your Langfuse project
  • Create a new dataset called "inkeep-weather-example-dataset" with sample dataset items
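
If you want to see roughly what this involves, the sketch below creates the same dataset and a couple of items with the Langfuse TypeScript SDK. It is an illustration of the flow, not the cookbook script itself, and the sample messages are made up:

import { Langfuse } from "langfuse";

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  baseUrl: process.env.LANGFUSE_BASE_URL,
});

async function initDataset() {
  // Create the dataset used by the cookbook example
  await langfuse.createDataset({ name: "inkeep-weather-example-dataset" });

  // Add a few sample user messages as dataset items (illustrative inputs only)
  const sampleMessages = [
    "What's the weather in San Francisco tomorrow?",
    "Will it rain in London this weekend?",
  ];
  for (const message of sampleMessages) {
    await langfuse.createDatasetItem({
      datasetName: "inkeep-weather-example-dataset",
      input: message,
    });
  }

  // Flush pending events before the script exits
  await langfuse.flushAsync();
}

initDataset();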

3. Run Dataset Items to Generate Traces

Run dataset items to generate traces that can be evaluated:

pnpm run langfuse-run-dataset

This script will:

  • Read items from your Langfuse dataset
  • Execute each item against your weather graph
  • Generate the data needed for evaluation
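
Conceptually, the runner fetches the dataset, executes each item against the graph, and links the resulting trace to a named dataset run so evaluators can score it later. The sketch below shows that flow with the Langfuse TypeScript SDK; the runChatAgainstGraph helper, its endpoint path, and its payload shape are placeholders for however you call your graph, not the cookbook's actual implementation. In practice the traces may come from the instrumented Run API rather than a manual langfuse.trace call; the manual trace here just keeps the sketch self-contained:

import { Langfuse } from "langfuse";

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  baseUrl: process.env.LANGFUSE_BASE_URL,
});

// Placeholder chat call: the endpoint path and payload shape are assumptions,
// not the real Run API contract — swap in your own client code.
async function runChatAgainstGraph(message: string): Promise<string> {
  const response = await fetch(`${process.env.INKEEP_AGENTS_RUN_API_URL}/chat`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.INKEEP_AGENTS_RUN_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      tenantId: process.env.INKEEP_TENANT_ID,
      projectId: process.env.INKEEP_PROJECT_ID,
      graphId: process.env.INKEEP_GRAPH_ID,
      message,
    }),
  });
  return await response.text();
}

async function runDataset() {
  const dataset = await langfuse.getDataset("inkeep-weather-example-dataset");
  const runName = `weather-graph-run-${new Date().toISOString()}`;

  for (const item of dataset.items) {
    const message = item.input as string;

    // Record a trace for this item, run the graph, and attach the output
    const trace = langfuse.trace({ name: "weather-dataset-item", input: message });
    const output = await runChatAgainstGraph(message);
    trace.update({ output });

    // Link the trace to the dataset item under a named run
    await item.link(trace, runName);
  }

  await langfuse.flushAsync();
}

runDataset();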

Running LLM Evaluations in Langfuse Dashboard

Langfuse provides a powerful web interface for running LLM evaluations without writing code. You can create datasets, set up evaluators, and run evaluations directly in the dashboard.

Accessing the Evaluation Features

  1. Log into your Langfuse dashboard: https://cloud.langfuse.com
  2. Navigate to your project where your agent traces are being collected
  3. Click "Evaluations" in the left sidebar
  4. Click "Set up evaluator" to begin creating evaluations

Setting Up LLM-as-a-Judge Evaluators

Set Up Default Evaluation Model

Before creating evaluators, you need to configure a default LLM connection for evaluations:

(Screenshot: Langfuse LLM Connection setup showing OpenAI provider configuration with API key field and advanced settings)

Setting up the LLM Connection:

  1. Navigate to "Evaluator Library" in your Langfuse dashboard
  2. Click "Set up" next to "Default Evaluation Model"
  3. Configure the LLM connection:
    • LLM Adapter: Select your preferred provider
    • Provider Name: Give it a descriptive name (e.g., "openai")
    • API Key: Enter your OpenAI API key (stored encrypted)
    • Advanced Settings: Configure the base URL and model parameters if needed
  4. Click "Create connection" to save
  1. Go to "Evaluations""Running Evaluators"
  2. Click "Set up evaluator" button
  3. You'll see two main steps: "1. Select Evaluator" and "2. Run Evaluator"

Choose Your Evaluator Type

You have two main options:

Option A: Langfuse Managed Evaluators

Langfuse provides a comprehensive catalog of pre-built evaluators.

To use a managed evaluator:

  1. Browse the evaluator list and find one that matches your needs
  2. Click on the evaluator to see its description and criteria
  3. Click "Use Selected Evaluator" button

Customizing Managed Evaluators for Dataset Runs

Once you've selected a managed evaluator, you can edit it to target your dataset runs. This is particularly useful for evaluating agent performance against known test cases.

Example: Customizing a Helpfulness Evaluator

  1. Select the "Helpfulness" evaluator from the managed list
  2. Under "Target", select dataset runs
  3. Configure variable mapping:
    • {{input}} → Object: Trace, Object Variable: Input
    • {{generation}} → Object: Trace, Object Variable: Output

Option B: Create Custom Evaluator

  1. Click "+ Create Custom Evaluator" button

  2. Fill in evaluator details:

    • Name: Choose a descriptive name (e.g., "weather_tool_used")
    • Description: Explain what this evaluator measures
    • Model: Select evaluation model
    • Prompt: Configure a custom prompt

Example: Creating a Weather Tool Evaluator

  1. Prompt:

     You are an expert evaluator for an AI agent system.
     Your task is to rate the correctness of tool usage on a scale from 0.0 to 1.0.

     Instructions:

     If the user’s question is not weather-related and the tool used is not get_weather_forecast, return 1.0.

     If the user’s question is not weather-related and the tool is get_weather_forecast, return 0.0.

     If the user’s question is weather-related, return 1.0 only if the tool used is get_weather_forecast; otherwise return 0.0.

     Input:
     User Question: {{input}}
     Tool Used: {{tool_used}}

  2. Configure variable mapping:
    • {{input}} → Object: Trace, Object Variable: Input
    • {{tool_used}} → Object: Span, Object Name: weather-forecaster.ai.toolCall, Object Variable: Metadata, JsonPath: $.attributes["ai.toolCall.name"]

(Screenshot: Langfuse helpfulness evaluator setup screen showing evaluator configuration with variable mapping and trace targeting options)

Enable and Monitor

  1. Click "Enable Evaluator" to start automatic evaluation
  2. Monitor evaluation progress in the dashboard
  3. View evaluation results as they complete