NovaEval Framework
NovaEval is a comprehensive, extensible AI model evaluation framework designed for production use. It provides a unified interface for evaluating language models across various datasets, metrics, and deployment scenarios.
🚧 Development Status
⚠️ ACTIVE DEVELOPMENT - NOT PRODUCTION READY
NovaEval is currently in active development and not recommended for production use. We are actively working on improving stability, adding features, and expanding test coverage. APIs may change without notice.
We're looking for contributors! See the Contributing section below for ways to help.
🚀 Key Features
- Multi-Model Support: Evaluate models from OpenAI, Anthropic, AWS Bedrock, and custom providers
- Extensible Scoring: Built-in scorers for accuracy, semantic similarity, code evaluation, and custom metrics
- Dataset Integration: Support for MMLU, HuggingFace datasets, custom datasets, and more
- Production-Oriented: Docker support, Kubernetes deployment, and cloud integrations
- Comprehensive Reporting: Detailed evaluation reports, artifacts, and visualizations
- Secure: Built-in credential management and secret store integration
- Scalable: Designed for both local testing and large-scale production evaluations
- Cross-Platform: Tested on macOS, Linux, and Windows with comprehensive CI/CD
📦 Installation
From PyPI (Recommended)
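Assuming the distribution name matches the project name, the latest release installs with pip install novaeval.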
From Source
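Alternatively, clone the NovaEval GitHub Repository and install an editable checkout with pip install -e . from the project root.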
Docker
🏃‍♂️ Quick Start
Basic Evaluation
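The snippet below is an illustrative sketch of a programmatic evaluation. The module paths and class names (Evaluator, MMLUDataset, OpenAIModel, AccuracyScorer) are assumptions and may differ from the current API; consult the documentation for the exact imports.

```python
# Illustrative sketch; module paths and class names are assumptions, not the verified API.
from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel
from novaeval.scorers import AccuracyScorer

# Choose a dataset slice, a model, and a scoring metric.
dataset = MMLUDataset(subset="elementary_mathematics", num_samples=100)
model = OpenAIModel(model_name="gpt-4", temperature=0.0)
scorer = AccuracyScorer()

# Wire them into an evaluator and run; results land in the output directory.
evaluator = Evaluator(
    dataset=dataset,
    models=[model],
    scorers=[scorer],
    output_dir="./results",
)
results = evaluator.run()
```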
Configuration-Based Evaluation
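For declarative runs, the same evaluation can be described in a YAML or JSON file and loaded in code. The from_config helper below is a hypothetical illustration of that pattern, not a verified method name.

```python
# Hypothetical illustration of config-driven evaluation; the exact loader may differ.
from novaeval import Evaluator

evaluator = Evaluator.from_config("evaluation_config.yaml")
results = evaluator.run()
```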
Command Line Interface
NovaEval provides a comprehensive CLI for running evaluations from the terminal, using the same configuration files as the programmatic interface.
Example Configuration
🏗️ Architecture
NovaEval is built with extensibility and modularity in mind (see the conceptual sketch after the component list):
Core Components
- Datasets: Standardized interface for loading evaluation datasets
- Models: Unified API for different AI model providers
- Scorers: Pluggable scoring mechanisms for various evaluation metrics
- Evaluators: Orchestrates the evaluation process
- Reporting: Generates comprehensive reports and artifacts
- Integrations: Handles external services (S3, credential stores, etc.)
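Conceptually, these components compose into a single evaluation loop. The function below is illustrative pseudocode of that orchestration, not NovaEval's actual Evaluator implementation.

```python
def run_evaluation(dataset, models, scorers, reporter):
    """Conceptual orchestration loop (illustrative only, not the actual Evaluator code)."""
    for model in models:                      # Models: unified provider API
        for sample in dataset:                # Datasets: standardized sample loading
            prediction = model.generate(sample["input"])
            scores = {                        # Scorers: pluggable metrics
                scorer.name: scorer.score(prediction, sample["expected"])
                for scorer in scorers
            }
            reporter.record(model, sample, prediction, scores)
    return reporter.summarize()               # Reporting: summary report and artifacts
```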
📊 Supported Datasets
- MMLU: Massive Multitask Language Understanding
- HuggingFace: Any dataset from the HuggingFace Hub
- Custom: JSON, CSV, or programmatic dataset definitions
- Code Evaluation: Programming benchmarks and code generation tasks
- Agent Traces: Multi-turn conversation and agent evaluation
🤖 Supported Models
- OpenAI: GPT-3.5, GPT-4, and newer models
- Anthropic: Claude family models
- AWS Bedrock: Amazon's managed foundation model service
- Noveum AI Gateway: Integration with Noveum's model gateway
- Custom: Extensible interface for any API-based model
📏 Built-in Scorers
Accuracy-Based
- ExactMatch: Exact string matching
- Accuracy: Classification accuracy
- F1Score: F1 score for classification tasks
Semantic-Based
- SemanticSimilarity: Embedding-based similarity scoring
- BERTScore: BERT-based semantic evaluation
- RougeScore: ROUGE metrics for text generation
Code-Specific
- CodeExecution: Execute and validate code outputs
- SyntaxChecker: Validate code syntax
- TestCoverage: Code coverage analysis
Custom
- LLMJudge: Use another LLM as a judge
- HumanEval: Integration with human evaluation workflows
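As a quick illustration of how a scorer might be invoked directly, outside an evaluator (the import path and score() signature here are assumptions):

```python
# Illustrative sketch; the import path and score() signature are assumptions.
from novaeval.scorers import ExactMatchScorer

scorer = ExactMatchScorer()
result = scorer.score(prediction="Paris", ground_truth="Paris")
print(result)  # e.g. 1.0 for an exact match, 0.0 otherwise
```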
🚀 Deployment
Local Development
Docker
Kubernetes
🔧 Configuration
NovaEval supports configuration through:
- YAML/JSON files: Declarative configuration
- Environment variables: Runtime configuration
- Python code: Programmatic configuration
- CLI arguments: Command-line overrides
Environment Variables
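At a minimum, provider credentials are typically supplied through the environment, for example OPENAI_API_KEY for OpenAI models and ANTHROPIC_API_KEY for Anthropic models (these names follow the respective provider SDK conventions).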
CI/CD Integration
NovaEval includes optimized GitHub Actions workflows:
- Unit tests run on all PRs and pushes for quick feedback
- Integration tests run on main branch only to minimize API costs
- Cross-platform testing on macOS, Linux, and Windows
📈 Reporting and Artifacts
NovaEval generates comprehensive evaluation reports:
- Summary Reports: High-level metrics and insights
- Detailed Results: Per-sample predictions and scores
- Visualizations: Charts and graphs for result analysis
- Artifacts: Model outputs, intermediate results, and debug information
- Export Formats: JSON, CSV, HTML, PDF
Example Report Structure
🔌 Extending NovaEval
Custom Datasets
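As a sketch of what a custom dataset might look like (the BaseDataset name, import path, and required methods are assumptions, not the verified interface):

```python
# Illustrative sketch; BaseDataset and its expected interface are assumptions.
from novaeval.datasets import BaseDataset  # hypothetical import path

class InMemoryQADataset(BaseDataset):
    """Serves question/answer pairs from an in-memory list."""

    def __init__(self, samples):
        super().__init__(name="in_memory_qa")
        self._samples = samples

    def load_data(self):
        # Each sample pairs an input prompt with its expected answer.
        return [
            {"input": s["question"], "expected": s["answer"]}
            for s in self._samples
        ]
```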
Custom Scorers
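A custom scorer follows the same pattern; again, BaseScorer and the score() signature shown here are assumptions:

```python
# Illustrative sketch; BaseScorer and the score() signature are assumptions.
from novaeval.scorers import BaseScorer  # hypothetical import path

class KeywordScorer(BaseScorer):
    """Scores 1.0 when the prediction contains a required keyword."""

    def __init__(self, keyword):
        super().__init__(name="keyword_match")
        self.keyword = keyword.lower()

    def score(self, prediction, ground_truth=None, context=None):
        return 1.0 if self.keyword in prediction.lower() else 0.0
```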
Custom Models
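Custom model providers wrap an API behind the model interface. In this sketch, BaseModel, generate(), and the endpoint's request/response shape are all assumptions:

```python
# Illustrative sketch; BaseModel, generate(), and the endpoint payload are assumptions.
import requests

from novaeval.models import BaseModel  # hypothetical import path

class HTTPCompletionModel(BaseModel):
    """Wraps a generic HTTP completion endpoint behind the model interface."""

    def __init__(self, endpoint, api_key):
        super().__init__(name="http-completion-model")
        self.endpoint = endpoint
        self.api_key = api_key

    def generate(self, prompt, **kwargs):
        response = requests.post(
            self.endpoint,
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"prompt": prompt, **kwargs},
            timeout=60,
        )
        response.raise_for_status()
        return response.json()["completion"]  # response shape is illustrative
```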
🤝 Contributing
We welcome contributions! NovaEval is actively seeking contributors to help build a robust AI evaluation framework.
🎯 High-Priority Contribution Areas
We're actively looking for contributors in these key areas:
- 🧪 Unit Tests: Help us improve our test coverage (currently 23% overall, 90%+ for core modules)
- 📚 Examples: Create real-world evaluation examples and use cases
- 📝 Guides & Notebooks: Write evaluation guides and interactive Jupyter notebooks
- 📖 Documentation: Improve API documentation and user guides
- 🔍 RAG Metrics: Add more metrics specifically for Retrieval-Augmented Generation evaluation
- 🤖 Agent Evaluation: Build frameworks for evaluating AI agents and multi-turn conversations
Development Setup
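In broad strokes: fork and clone the repository, install it in editable mode with the development dependencies (for example, pip install -e ".[dev]" — the extras name is an assumption), and run pre-commit install so the hooks referenced in the guidelines below run on every commit.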
🏗️ Contribution Workflow
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes following our coding standards
- Add tests for your changes
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
📋 Contribution Guidelines
- Code Quality: Follow PEP 8 and use the provided pre-commit hooks
- Testing: Add unit tests for new features and bug fixes
- Documentation: Update documentation for API changes
- Commit Messages: Use conventional commit format
- Issues: Reference relevant issues in your PR description
🎉 Recognition
Contributors will be:
- Listed in our contributors page
- Mentioned in release notes for significant contributions
- Invited to join our contributor Discord community
📄 License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
🙏 Acknowledgments
- Inspired by evaluation frameworks like DeepEval, Confident AI, and Braintrust
- Built with modern Python best practices and industry standards
- Designed for the AI evaluation community
📞 Support
- Documentation: https://noveum.github.io/NovaEval
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
🔗 Related Resources
- NovaEval GitHub Repository - Source code and issues
- Noveum Trace SDK - Tracing and observability
- Noveum AI Platform - Complete AI evaluation platform
Made with ❤️ by the Noveum.ai team
Get Early Access to Noveum.ai Platform
Be the first to get notified when we open the Noveum Platform to more users. All users get free access to the Observability suite; early users also get free eval jobs and premium support for the first year.