Harden your application with LangSmith evaluation

Don’t ship on “vibes” alone. Define a test suite to capture qualitative and quantitative performance metrics.

A proper evaluation framework gives you the confidence to put LLMs at the center of your application. LangSmith supports both off-the-shelf LangChain evaluators and fully custom evaluators, so you can measure application performance however you define it.

Test early, test often

LangSmith helps you test application code pre-release and while it runs in production.

Offline Evaluation

Test your application on reference LangSmith datasets. Use a combination of human review and auto-evals to score your results.
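
For a concrete picture, here is a minimal sketch of an offline evaluation using the LangSmith Python SDK's evaluate helper. The target function, dataset name, and evaluator are hypothetical, and exact signatures may vary by SDK version:

```python
from langsmith.evaluation import evaluate

def my_app(inputs: dict) -> dict:
    # Hypothetical stand-in for your chain or agent.
    return {"answer": f"Echo: {inputs['question']}"}

def exact_match(run, example) -> dict:
    # A minimal custom evaluator: compare the app's output to the reference.
    predicted = run.outputs["answer"]
    expected = example.outputs["answer"]
    return {"key": "exact_match", "score": int(predicted == expected)}

results = evaluate(
    my_app,
    data="support-bot-golden",    # a reference dataset stored in LangSmith
    evaluators=[exact_match],     # mix custom evaluators with off-the-shelf ones
    experiment_prefix="baseline",
)
```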

Integrate with CI

Understand how changes to your prompt, model, or retrieval strategy impact your app before they hit prod. Catch regressions in CI and prevent them from impacting users.
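
As one way to wire this into CI, a plain pytest file can gate deploys on a few pinned examples while the fuller LangSmith experiment runs against your reference dataset. The module and expectations below are hypothetical:

```python
# test_regressions.py -- run via `pytest` in CI before deploying.
import pytest

from my_app import my_app  # hypothetical import of your app's entry point

@pytest.mark.parametrize(
    "question, must_contain",
    [
        ("How do I reset my password?", "Settings"),
        ("What plans do you offer?", "plan"),
    ],
)
def test_answer_mentions_key_fact(question, must_contain):
    # Fail the build if a must-hold property regresses.
    answer = my_app({"question": question})["answer"]
    assert must_contain.lower() in answer.lower()
```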

Online Evaluation

Continuously monitor qualitative characteristics of your live application to spot problems or drift.
001

Dataset Construction

A prerequisite to any good testing framework is building a reference dataset to test on. This is commonly one of the more tedious parts of the LLM application development process, so LangSmith reduces the effort by making it easy to save debugging and production traces to datasets. Datasets are collections of either exemplary or problematic inputs and outputs that should be replicated or rectified, respectively.
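
A rough sketch with the LangSmith Python SDK (project and dataset names are hypothetical, and exact client methods may vary by SDK version):

```python
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

# Create a reference dataset to test against.
dataset = client.create_dataset(
    dataset_name="support-bot-golden",
    description="Exemplary and problematic Q&A pairs",
)

# Seed it with hand-written examples...
client.create_examples(
    inputs=[{"question": "How do I reset my password?"}],
    outputs=[{"answer": "Go to Settings > Security and click Reset."}],
    dataset_id=dataset.id,
)

# ...or copy interesting production traces straight into it.
for run in client.list_runs(project_name="support-bot-prod", limit=5):
    client.create_example(inputs=run.inputs, outputs=run.outputs, dataset_id=dataset.id)
```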

Go to Docs
002

Regression Testing

With so many moving parts in an LLM app, it can be hard to attribute a regression to a specific model, prompt, or other system change. LangSmith lets you track how different versions of your app stack up against the evaluation criteria you’ve defined.
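
One hedged sketch of what that looks like: evaluate two hypothetical app versions on the same dataset with the same criteria, labeling each experiment so LangSmith can show them side by side:

```python
from langsmith.evaluation import evaluate

# Two hypothetical versions of the same app, differing only in prompt wording.
def app_v1(inputs: dict) -> dict:
    return {"answer": f"v1: {inputs['question']}"}

def app_v2(inputs: dict) -> dict:
    return {"answer": f"v2: {inputs['question']}"}

def exact_match(run, example) -> dict:
    return {"key": "exact_match",
            "score": int(run.outputs["answer"] == example.outputs["answer"])}

# Same dataset, same evaluators -- only the app version changes.
for version, app in [("prompt-v1", app_v1), ("prompt-v2", app_v2)]:
    evaluate(
        app,
        data="support-bot-golden",
        evaluators=[exact_match],
        experiment_prefix=version,
        metadata={"app_version": version},  # recorded with the experiment
    )
```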

Go to Docs
003

Human Annotation

While LangSmith has many options for automatic evaluation, sometimes you need a human touch. LangSmith significantly speeds up the human labeling workflow: configure the feedback you want to collect, then let annotators work through a queue of traces, scoring application responses as they go.
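
Annotation queues are configured in the LangSmith app, and the scores reviewers assign land as feedback on the underlying runs. Feedback can also be recorded programmatically, as in this hedged sketch (the project name and feedback key are hypothetical):

```python
from langsmith import Client

client = Client()

# Pull a recent production trace to review.
run = next(client.list_runs(project_name="support-bot-prod", limit=1))

# Attach the human reviewer's judgment to that run as feedback.
client.create_feedback(
    run.id,
    key="human_correctness",
    score=1,  # e.g., 1 = correct, 0 = incorrect
    comment="Accurate answer; cites the right doc.",
)
```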

004

Online Evaluation

Testing needs to happen continuously for any live application. LangSmith helps you monitor not only latency, errors, and cost, but also qualitative measures to make sure your application responds effectively and meets company expectations.
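
The prerequisite for online evaluation is that production calls are traced. A minimal sketch with the langsmith SDK's traceable decorator follows; the function body is a hypothetical stand-in, and the destination project is typically set via the LANGSMITH_PROJECT environment variable:

```python
from langsmith import traceable

# Once runs stream into a LangSmith project, online evaluators configured in
# the app (e.g., an LLM judge) can score them continuously for drift.
@traceable(name="answer_question")
def answer_question(question: str) -> str:
    # Hypothetical stand-in for your real chain or agent call.
    return f"Echo: {question}"

if __name__ == "__main__":
    print(answer_question("How do I reset my password?"))
```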

Don’t fly blind. Easily benchmark performance.

Evaluation gives developers a framework to make trade-off decisions between cost, latency, and quality.

Go to Docs

Ready to start shipping reliable GenAI apps faster?

LangChain and LangSmith are critical parts of the reference architecture to get you from prototype to production.