The dataloaders module provides a unified interface for loading benchmark datasets, normalizing their formats, and converting them to framework-specific document objects.

Overview

Dataloaders abstract away dataset-specific formats and provide:
  • Catalog-based loader creation - Factory pattern for consistent instantiation
  • Normalized record format - All datasets produce DatasetRecord objects
  • Framework conversion - Automatic conversion to Haystack or LangChain documents
  • Evaluation queries - Ground-truth QA pairs for retrieval benchmarking
  • Streaming support - Memory-efficient iteration over large datasets
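
Taken together, a typical flow is: create a loader from the catalog, load the normalized records, and convert them for the target framework. A minimal sketch combining the APIs described in the sections below:
from vectordb.dataloaders import DataloaderCatalog
from vectordb.dataloaders.converters import DocumentConverter, records_to_items

# Create a loader, load normalized records, and convert them for indexing
loader = DataloaderCatalog.create(name="triviaqa", split="test", limit=100)
dataset = loader.load()
documents = DocumentConverter.to_haystack(records_to_items(dataset.records))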

Supported datasets

Dataset                          | Type              | Records     | Queries    | Description
TriviaQA (triviaqa)              | Open-domain QA    | ~500 index  | ~100 eval  | Trivia questions with evidence documents
ARC (arc)                        | Science QA        | ~1000 index | ~200 eval  | AI2 Reasoning Challenge questions
PopQA (popqa)                    | Entity-centric QA | ~500 index  | ~100 eval  | Entity-focused questions from Wikipedia
FActScore (factscore)            | Factuality QA     | ~500 index  | ~100 eval  | Factuality-focused evaluation dataset
Earnings Calls (earnings_calls)  | Financial QA      | ~300 index  | ~50 eval   | Financial QA from earnings call transcripts

Architecture

Core components

dataloaders/
├── catalog.py          # Factory for creating loaders
├── base.py             # Abstract base class defining the contract
├── types.py            # Shared types (DatasetRecord, EvaluationQuery)
├── converters.py       # Framework document conversion
├── dataset.py          # LoadedDataset wrapper
├── evaluation.py       # Evaluation query extraction
└── datasets/           # Per-dataset implementations
    ├── triviaqa.py
    ├── arc.py
    ├── popqa.py
    ├── factscore.py
    └── earnings_calls.py

Class hierarchy

BaseDatasetLoader (ABC)
    ├── TriviaQALoader
    ├── ARCLoader
    ├── PopQALoader
    ├── FactScoreLoader
    └── EarningsCallsLoader

Basic usage

Creating a loader

Use the DataloaderCatalog factory to create loaders:
from vectordb.dataloaders import DataloaderCatalog

loader = DataloaderCatalog.create(
    name="triviaqa",
    split="test",
    limit=500
)

Loading datasets

# Load normalized records
dataset = loader.load()

print(f"Loaded {len(dataset.records)} records")
print(f"Dataset type: {dataset.dataset_type}")

Converting to framework documents

from vectordb.dataloaders.converters import DocumentConverter, records_to_items

# Convert records to normalized items
items = records_to_items(dataset.records)

# Convert to Haystack documents
haystack_docs = DocumentConverter.to_haystack(items)

# Convert to LangChain documents
langchain_docs = DocumentConverter.to_langchain(items)

Data structures

DatasetRecord

Normalized document record with text and metadata:
@dataclass(frozen=True, slots=True)
class DatasetRecord:
    text: str                    # Document content to index
    metadata: dict[str, Any]     # Dataset-specific metadata
Example:
DatasetRecord(
    text="The Great Wall of China was built over several centuries...",
    metadata={
        "id": "doc_001",
        "source": "triviaqa",
        "title": "Great Wall of China"
    }
)
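
Because DatasetRecord is a frozen dataclass with plain text and metadata fields, records can be filtered or regrouped with ordinary comprehensions before indexing. A small sketch, using the illustrative metadata keys from the example above:
# Keep only TriviaQA-sourced records and collect their titles
triviaqa_records = [r for r in dataset.records if r.metadata.get("source") == "triviaqa"]
titles = [r.metadata.get("title") for r in triviaqa_records]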

EvaluationQuery

Evaluation query with ground-truth answers and relevant document IDs:
@dataclass(frozen=True, slots=True)
class EvaluationQuery:
    query: str                      # User/evaluation question
    answers: list[str]              # Ground-truth answers
    relevant_doc_ids: list[str]     # IDs of known relevant docs
    metadata: dict[str, Any]        # Additional metadata
Example:
EvaluationQuery(
    query="When was the Great Wall of China built?",
    answers=["over several centuries", "7th century BC"],
    relevant_doc_ids=["doc_001", "doc_045"],
    metadata={"difficulty": "easy", "category": "history"}
)
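
The relevant_doc_ids field makes simple retrieval metrics easy to compute. A minimal hit-rate sketch, assuming retrieve() is your own function that returns document IDs for a query (it is not part of this module):
def hit_rate(queries: list[EvaluationQuery], retrieve) -> float:
    """Fraction of queries for which at least one known-relevant document is retrieved."""
    hits = 0
    for q in queries:
        retrieved_ids = retrieve(q.query)  # e.g. top-k document IDs from your retriever
        if set(retrieved_ids) & set(q.relevant_doc_ids):
            hits += 1
    return hits / len(queries) if queries else 0.0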

LoadedDataset

Wrapper bundling the normalized records with their dataset type identifier:
class LoadedDataset:
    dataset_type: DatasetType      # "triviaqa", "arc", etc.
    records: list[DatasetRecord]   # Normalized documents

Dataset implementations

TriviaQA

Dataset ID: triviaqa
HuggingFace: trivia_qa
Structure: Questions with multiple evidence documents
loader = DataloaderCatalog.create("triviaqa", split="test", limit=500)
dataset = loader.load()
Record format:
  • text: Evidence document content
  • metadata.id: Document identifier
  • metadata.title: Document title
  • metadata.question: Associated question
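
These fields can be read directly off the loaded records, for example:
for record in dataset.records[:3]:
    print(record.metadata["id"], record.metadata["title"])
    print(record.text[:80])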

ARC (AI2 Reasoning Challenge)

Dataset ID: arc
HuggingFace: ai2_arc
Structure: Science questions with multiple-choice answers
loader = DataloaderCatalog.create("arc", split="test", limit=1000)
dataset = loader.load()
Record format:
  • text: Question + answer choices
  • metadata.id: Question identifier
  • metadata.question: Question text
  • metadata.answerKey: Correct answer

PopQA

Dataset ID: popqa
HuggingFace: akariasai/PopQA
Structure: Entity-centric questions from Wikipedia
loader = DataloaderCatalog.create("popqa", split="test", limit=500)
dataset = loader.load()
Record format:
  • text: Wikipedia passage
  • metadata.id: Passage identifier
  • metadata.entity: Entity mention
  • metadata.question: Associated question

FActScore

Dataset ID: factscore
HuggingFace: dskar/FActScore
Structure: Factuality-focused QA pairs
loader = DataloaderCatalog.create("factscore", split="test", limit=500)
dataset = loader.load()
Record format:
  • text: Factual statement
  • metadata.id: Statement identifier
  • metadata.topic: Topic/category

Earnings Calls

Dataset ID: earnings_calls
HuggingFace: lamini/earnings-calls-qa
Structure: Financial QA from earnings call transcripts
loader = DataloaderCatalog.create("earnings_calls", split="train", limit=300)
dataset = loader.load()
Record format:
  • text: Transcript excerpt
  • metadata.id: Excerpt identifier
  • metadata.company: Company name
  • metadata.quarter: Reporting quarter

Base loader interface

All loaders implement the BaseDatasetLoader abstract class:
class BaseDatasetLoader(ABC):
    def __init__(
        self,
        dataset_name: str,
        split: str,
        limit: int | None = None,
        streaming: bool = True,
    ) -> None:
        """Initialize the loader with dataset configuration."""

    @property
    @abstractmethod
    def dataset_type(self) -> DatasetType:
        """Return the supported dataset type identifier."""

    @abstractmethod
    def _load_dataset_iterable(self) -> Iterable[Mapping[str, Any]]:
        """Return the raw dataset rows as an iterable."""

    @abstractmethod
    def _parse_row(self, row: Mapping[str, Any]) -> list[DatasetRecord]:
        """Parse a dataset row into normalized records."""

    def load(self) -> LoadedDataset:
        """Load the dataset and return normalized records."""

Document conversion

The DocumentConverter class provides framework-specific conversion:

Haystack conversion

from vectordb.dataloaders.converters import DocumentConverter
from haystack import Document

items = [{"text": "content", "metadata": {"id": "1"}}]
haystack_docs = DocumentConverter.to_haystack(items)

# Result: List[Document]
# Document(content="content", meta={"id": "1"})
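
The converted documents can be written straight into a document store for indexing, for instance Haystack's in-memory store (assuming Haystack 2.x):
from haystack.document_stores.in_memory import InMemoryDocumentStore

store = InMemoryDocumentStore()
store.write_documents(haystack_docs)  # index the converted documents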

LangChain conversion

from vectordb.dataloaders.converters import DocumentConverter
from langchain_core.documents import Document

items = [{"text": "content", "metadata": {"id": "1"}}]
langchain_docs = DocumentConverter.to_langchain(items)

# Result: List[Document]
# Document(page_content="content", metadata={"id": "1"})

Configuration

Dataloaders integrate with YAML configuration:
dataloader:
  dataset: "triviaqa"    # Dataset identifier
  split: "test"          # Dataset split
  limit: 500             # Optional record limit
Load configuration and create loader:
from vectordb.utils.config import load_config
from vectordb.dataloaders import DataloaderCatalog

config = load_config("config.yaml")
dl_config = config["dataloader"]

loader = DataloaderCatalog.create(
    name=dl_config["dataset"],
    split=dl_config["split"],
    limit=dl_config.get("limit")
)

Streaming mode

By default, loaders stream rows from the underlying dataset (streaming=True in BaseDatasetLoader), so the full source dataset does not have to be downloaded before parsing begins:
loader = DataloaderCatalog.create(
    name="triviaqa",
    split="test",
    limit=None  # No limit; every available record is parsed
)

# Rows are streamed from the source; only the parsed records are held in memory
dataset = loader.load()

Custom loaders

Implement custom loaders by extending BaseDatasetLoader:
from vectordb.dataloaders.base import BaseDatasetLoader
from vectordb.dataloaders.types import DatasetRecord, DatasetType

class CustomLoader(BaseDatasetLoader):
    @property
    def dataset_type(self) -> DatasetType:
        return "custom"

    def _load_dataset_iterable(self):
        # Load raw dataset rows
        from datasets import load_dataset
        dataset = load_dataset(
            self.dataset_name,
            split=self.split,
            streaming=self.streaming
        )
        return dataset

    def _parse_row(self, row):
        # Parse row into DatasetRecord objects
        return [
            DatasetRecord(
                text=row["content"],
                metadata={"id": row["id"]}
            )
        ]
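
A custom loader can then be used directly; the constructor arguments follow the BaseDatasetLoader signature shown earlier (the dataset name below is a placeholder):
loader = CustomLoader(
    dataset_name="my-org/my-dataset",  # placeholder HuggingFace dataset ID
    split="train",
    limit=100,
)
dataset = loader.load()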

Error handling

Dataloaders raise specific exceptions for different error conditions:
from vectordb.dataloaders import DataloaderCatalog
from vectordb.dataloaders.types import (
    UnsupportedDatasetError,
    DatasetLoadError,
    DatasetValidationError
)

try:
    loader = DataloaderCatalog.create("invalid_dataset")
except UnsupportedDatasetError as e:
    print(f"Dataset not supported: {e}")

try:
    dataset = loader.load()
except DatasetLoadError as e:
    print(f"Failed to load dataset: {e}")
except DatasetValidationError as e:
    print(f"Dataset validation failed: {e}")

Best practices

Use the catalog

Always use DataloaderCatalog.create() rather than instantiating loader classes directly; the factory keeps loader construction consistent across datasets

Set limits during development

Use the limit parameter during prototyping to avoid loading full datasets

Convert once

Convert to framework documents once during indexing, not repeatedly during queries

Handle exceptions

Catch DatasetLoadError and DatasetValidationError for robust pipelines