Noveum.ai - Your One-Stop AI Evaluation Platform

Shashank Agarwal

3/2/2025

#noveum #ai #model-evaluation #gateway

Introduction

Artificial Intelligence has soared in popularity, bringing with it an avalanche of new models and vendors vying for attention. From GPT-4 to DeepSeek, every few months we see a new "best model" on the block. But how do you systematically evaluate these models for your specific use case? That's where Noveum.ai comes in.

In this post, I'll walk you through what Noveum.ai is, how it works, and why I believe it's a game-changer for companies serious about AI. My name is Shashank Agarwal—founder of Noveum.ai, AI enthusiast, and someone who's been building and scaling large AI/ML platforms for over a decade. Let's dive in.


Why Noveum.ai?

Organizations often get stuck deciding which AI model to use. Blind guesswork or minimal experimentation leads to suboptimal performance, high costs, and wasted development time. Noveum.ai solves that problem by automating the entire evaluation process—collecting performance data, cost metrics, and feedback so you always know which model truly shines for your specific application.

Key benefits of using Noveum.ai:

  • A single AI Gateway that routes calls to multiple providers without changing your existing code.
  • Automated data collection of latency, token usage, cost, and more.
  • Seamless creation of custom evaluation datasets from real-world logs.
  • Powerful side-by-side comparison dashboards for GPT, Claude, DeepSeek, or any AI model out there.
  • Future-proofed approach with upcoming features like one-click fine-tuning and support for image/video models.

How It Works

Step 1: Integrate the AI Gateway

At the heart of Noveum.ai lies our open-source AI Gateway. Instead of calling providers like OpenAI directly, you make a single-line code change to route requests through Noveum's gateway.

Quick Start with Docker

The gateway can be deployed in seconds using Docker:

docker run -p 3000:3000 noveum/ai-gateway:latest

OpenAI-Compatible Interface

The gateway provides a drop-in replacement for OpenAI's API. You can use your existing OpenAI client libraries by just changing the base URL:

import OpenAI from 'openai';

const openai = new OpenAI({
    baseURL: 'http://localhost:3000/v1',
    apiKey: 'your-provider-api-key',
    defaultHeaders: { 'x-provider': 'openai' }
});

const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: 'Hello!' }]
});

Provider-Specific Examples

The gateway supports multiple AI providers through a unified interface. Here are some examples:

Anthropic (Claude):

curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-provider: anthropic" \
  -H "Authorization: Bearer your-anthropic-api-key" \
  -d '{
    "model": "claude-3-sonnet-20240229-v1:0",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 1000
  }'

Groq:

curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-provider: groq" \
  -H "Authorization: Bearer your-groq-api-key" \
  -d '{
    "model": "mixtral-8x7b-32768",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 1000
  }'

Streaming Support

The gateway fully supports streaming responses for all providers:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="your-provider-api-key",
    default_headers={"x-provider": "anthropic"}  # or any other provider
)

stream = client.chat.completions.create(
    model="claude-3-sonnet-20240229-v1:0",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

Comprehensive Metrics

Every response includes detailed metrics to help you track performance and costs:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1709312768,
  "model": "gpt-4",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 9,
    "total_tokens": 19
  },
  "system_fingerprint": "fp_1234",
  "metrics": {
    "latency_ms": 450,
    "tokens_per_second": 42.2,
    "cost": {
      "input_cost": 0.0003,
      "output_cost": 0.0006,
      "total_cost": 0.0009
    }
  }
}
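The `metrics` block is a Noveum-specific extension alongside the standard OpenAI response fields. As a minimal sketch (field names taken from the sample payload above; adjust if your gateway version's schema differs), here is how you might pull those numbers out of a response in Python:

```python
# Sketch: reading Noveum's extra "metrics" field from a gateway response.
# The sample dict mirrors the JSON payload shown above.
response = {
    "model": "gpt-4",
    "usage": {"prompt_tokens": 10, "completion_tokens": 9, "total_tokens": 19},
    "metrics": {
        "latency_ms": 450,
        "tokens_per_second": 42.2,
        "cost": {"input_cost": 0.0003, "output_cost": 0.0006, "total_cost": 0.0009},
    },
}

def summarize(resp: dict) -> str:
    """One-line performance summary for a single gateway response."""
    m = resp["metrics"]
    return (
        f"{resp['model']}: {m['latency_ms']} ms, "
        f"{m['tokens_per_second']} tok/s, "
        f"${m['cost']['total_cost']:.4f} total"
    )

print(summarize(response))  # → gpt-4: 450 ms, 42.2 tok/s, $0.0009 total
```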

That's it! This gateway is lightweight and can be deployed on Cloudflare Workers, Kubernetes, or Docker with minimal overhead. As soon as you route your calls through this gateway, Noveum starts capturing metrics such as latency, cost per request, and token usage in real time.

Step 2: Collect Logs & Build Datasets

Whenever your app sends requests to an AI model, the gateway logs details like:

  • Request Payload (prompt, parameters, etc.)
  • Response Payload (the model output)
  • Cost and Token Utilization (so you can see how your budget is being spent)
  • Latency (time-to-first-byte, total duration, etc.)

Over time, you accumulate a large library of real-world interactions. Noveum.ai lets you review and curate these logs, marking errors or successes to build up a customized evaluation dataset. This dataset reflects the actual requests your users make, making it highly relevant for model evaluation.
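To make the curation step concrete, here is a hedged sketch of turning reviewed logs into an eval dataset. The log field names (`prompt`, `response`, `label`) and the JSONL output shape are illustrative assumptions, not Noveum's actual export schema:

```python
import json

# Illustrative reviewed log records; real logs come from the gateway.
logs = [
    {"prompt": "Reset my password", "response": "Sure, here's how...", "label": "good"},
    {"prompt": "Cancel my order", "response": "I cannot help.", "label": "bad"},
    {"prompt": "Track my shipment", "response": "Your package is...", "label": "good"},
]

# Keep only interactions marked as good to serve as reference answers.
eval_set = [
    {"input": r["prompt"], "reference": r["response"]}
    for r in logs
    if r["label"] == "good"
]

# Serialize one example per line (JSONL), a common eval-dataset format.
jsonl_lines = [json.dumps(row) for row in eval_set]
print(len(jsonl_lines))  # → 2
```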

Step 3: Run Evaluation Jobs

Once you've built an Eval Dataset, you can run an evaluation job directly within Noveum's platform:

  1. Pick Models: Select which AI models you'd like to benchmark (e.g., GPT-4, Claude 2, DeepSeek, or your in-house custom model).
  2. Set Criteria: Define what matters most—accuracy, cost, latency, or a combination.
  3. Launch Job: Noveum automatically sends your dataset prompts to each model and collects every response.
  4. Panel of LLM Judges: We then use a panel of large language models to qualitatively assess each response, scoring them based on your custom criteria (like correctness, style, or reliability).
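Under the hood, step 3 amounts to fanning each dataset prompt out to every candidate model and recording the responses. The sketch below shows that loop with a stubbed `call_model` function; in practice you would call each model through the gateway client shown earlier:

```python
import time

def run_eval(dataset, models, call_model):
    """Send every dataset prompt to every model; record output and latency."""
    results = []
    for item in dataset:
        for model in models:
            start = time.perf_counter()
            output = call_model(model, item["input"])
            latency_ms = (time.perf_counter() - start) * 1000
            results.append({
                "model": model,
                "input": item["input"],
                "output": output,
                "latency_ms": latency_ms,
            })
    return results

# Tiny demo with a stubbed model call instead of a live API.
dataset = [{"input": "Hello!"}, {"input": "Summarize this."}]
models = ["gpt-4", "claude-3-sonnet", "deepseek-chat"]
results = run_eval(dataset, models, lambda m, p: f"[{m}] reply to: {p}")
print(len(results))  # 2 prompts x 3 models = 6 rows
```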

Step 4: Compare and Optimize

After the evaluation job completes, you get a comprehensive dashboard showing:

  • Accuracy Score: How each model performed compared to the ground truth or your ideal outcome.
  • Latency Metrics: Which provider consistently returns results faster.
  • Cost Analysis: Tokens used, cost per request, and overall monthly projections.

With this data, you can quickly identify which model outperforms the rest under your unique requirements. Noveum even offers a one-click switch to change your production model to a better-performing option—no complicated re-engineering necessary.
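The comparison itself boils down to aggregating per-model metrics. A minimal sketch (the field names mirror the dashboard metrics above but are illustrative):

```python
from collections import defaultdict
from statistics import mean

# Illustrative per-response results from an evaluation job.
results = [
    {"model": "gpt-4", "score": 0.95, "latency_ms": 450, "cost": 0.0009},
    {"model": "gpt-4", "score": 0.90, "latency_ms": 500, "cost": 0.0010},
    {"model": "claude-3-sonnet", "score": 0.90, "latency_ms": 380, "cost": 0.0005},
    {"model": "claude-3-sonnet", "score": 0.88, "latency_ms": 400, "cost": 0.0006},
]

# Group rows by model, then average each metric.
by_model = defaultdict(list)
for r in results:
    by_model[r["model"]].append(r)

summary = {
    model: {
        "avg_score": round(mean(r["score"] for r in rows), 3),
        "avg_latency_ms": round(mean(r["latency_ms"] for r in rows), 1),
        "avg_cost": round(mean(r["cost"] for r in rows), 5),
    }
    for model, rows in by_model.items()
}

for model, stats in summary.items():
    print(model, stats)
```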

Step 5: Continuous Improvement

As your AI usage grows, so does your evaluation dataset. Noveum's gateway ensures you're always collecting logs, making it easy to re-run evaluation jobs whenever new models drop or your needs evolve. Plus, with fine-tuning on our near-term roadmap, you'll soon be able to refine the best models on your own data, all within Noveum.ai.


Real-World Example

Imagine you're running a customer support chatbot that fields thousands of requests daily. You started with GPT-4 but wonder if Claude 2 or DeepSeek might be cheaper or more accurate. With Noveum.ai:

  1. Integrate: One line of code points your chatbot's calls to our gateway.
  2. Collect Logs: Over a week, you gather real user queries and GPT-4's responses.
  3. Build Dataset: Flag some of the best and worst interactions, creating an eval set of real Q&A pairs.
  4. Evaluate: Let Noveum test the exact same queries on GPT-4, Claude 2, and DeepSeek.
  5. Compare: You discover Claude 2 is nearly 25% cheaper but only 5% less accurate, a trade-off you can live with. Or maybe DeepSeek is consistently faster for your region, saving time for your support team.
  6. Optimize: Switch to your new model (or keep GPT-4 if you care more about accuracy), and watch your logs to confirm results in production.

Why Teams Choose Noveum.ai

  1. Simplicity: No more ad-hoc scripts or manual spreadsheets. Our gateway and platform handle it automatically.
  2. Customization: Your dataset, your criteria, your choice of models. We're model-agnostic, so you have freedom.
  3. Future-Proofing: As new providers and models appear, Noveum keeps you updated without code rewrites.
  4. Cost Savings: Companies often discover more cost-efficient models—sometimes cutting inference costs by more than 10×—with minimal loss in accuracy.
  5. Time to Value: An entire evaluation cycle can take minutes, not weeks of developer time.

Wrapping Up

Noveum.ai is here to democratize AI model evaluation, empowering teams to make data-driven decisions rather than relying on guesswork or brand hype. The process is straightforward—integrate, collect, evaluate, optimize—and the benefits are enormous: cutting costs, boosting performance, and freeing up your team to focus on innovation, not grunt work.

If you're ready to try it for yourself, check out the Noveum.ai beta sign-up or contact us for more details. The AI landscape may be changing at lightning speed, but with Noveum.ai, you'll always be on top of the best model for your application.

Let's build smarter, better AI—together.

Exclusive Early Access

Get Early Access to Noveum.ai Platform

Join the select group of AI teams optimizing their models with our data-driven platform. We're onboarding users in limited batches to ensure a premium experience.

Early access members receive premium onboarding support and influence our product roadmap. Limited spots available.