Datasets

cca_zoo.datasets provides utilities for generating synthetic multiview data and loading small real-world datasets.


JointData — simulated multiview data

JointData generates data from a linear latent variable model:

$$ X_i = Z W_i^\top + E_i $$

where:

  • $Z \in \mathbb{R}^{n \times k}$ is the shared latent variable ($n$ = n_samples, $k$ = latent_dimensions)
  • $W_i \in \mathbb{R}^{p_i \times k}$ is the view-specific loading matrix ($p_i$ = number of features in view $i$, drawn once at construction)
  • $E_i$ is independent Gaussian noise, with variance controlled by signal_to_noise

The loading matrices are fixed at construction so that successive calls to sample() share the same generative parameters.

from cca_zoo.datasets import JointData

data = JointData(
    n_views=2,
    n_samples=200,
    n_features=[50, 40],       # different feature counts per view
    latent_dimensions=2,
    signal_to_noise=2.0,       # higher = less noise
    random_state=0,
)

train_views = data.sample()    # list of 2 arrays: (200, 50), (200, 40)
test_views  = data.sample()    # independent sample from the same model
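
To make the generative model concrete, the following is a minimal NumPy-only sketch of $X_i = Z W_i^\top + E_i$. It is not the JointData implementation. The helper names (draw_loadings, sample_views), the standard-normal draws for $Z$ and $W_i$, and the mapping from signal_to_noise to the noise standard deviation are all assumptions made for illustration.

```python
import numpy as np


def draw_loadings(n_features, k, rng):
    """Draw one loading matrix W_i of shape (p_i, k) per view (hypothetical helper)."""
    return [rng.standard_normal((p, k)) for p in n_features]


def sample_views(loadings, snrs, n_samples, rng):
    """Sample every view from X_i = Z W_i^T + E_i using fixed loadings."""
    k = loadings[0].shape[1]
    # ASSUMPTION: the shared latent variable Z is standard normal, shape (n, k).
    Z = rng.standard_normal((n_samples, k))
    views = []
    for W, snr in zip(loadings, snrs):
        signal = Z @ W.T  # shape (n, p_i)
        # ASSUMPTION: SNR = average per-feature signal variance / noise variance.
        signal_var = np.sum(W**2) / W.shape[0]
        noise_std = np.sqrt(signal_var / snr)
        views.append(signal + noise_std * rng.standard_normal(signal.shape))
    return views


rng = np.random.default_rng(0)
loadings = draw_loadings([50, 40], k=2, rng=rng)  # fixed once, like W_i above

train_views = sample_views(loadings, [2.0, 2.0], 200, rng)
test_views = sample_views(loadings, [2.0, 2.0], 200, rng)  # same W_i, fresh Z and E_i
print([v.shape for v in train_views])  # [(200, 50), (200, 40)]
```

Because the loadings are created once and reused, the two samples share the same generative parameters while their latent variables and noise are drawn independently, which mirrors the train/test pattern shown above.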

Parameters

Parameter          Type                  Default  Description
n_views            int                   2        Number of views
n_samples          int                   100      Observations per call to sample()
n_features         int or list[int]      10       Features per view (scalar broadcasts)
latent_dimensions  int                   1        Shared latent dimension
signal_to_noise    float or list[float]  1.0      SNR per view (higher = less noise)
random_state       int or None           None     Seed for reproducibility

Usage patterns

# Same features for all views (scalar broadcasts)
data = JointData(n_views=3, n_samples=100, n_features=20, latent_dimensions=2, random_state=0)

# Different SNR per view
data = JointData(n_views=2, n_samples=100, n_features=20,
                 signal_to_noise=[4.0, 1.0], random_state=0)

# Callable shorthand (same as sample())
views = data()
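
The scalar-or-list behaviour of n_features and signal_to_noise can be pictured with a small helper. This is a sketch of the general broadcasting idea only. The function broadcast_per_view is hypothetical and is not part of cca_zoo, and the length check is an assumption about how mismatched lists might reasonably be handled.

```python
def broadcast_per_view(value, n_views, name="value"):
    """Expand a scalar to one entry per view, or validate a list's length.

    Hypothetical helper for illustration; not a cca_zoo function.
    """
    if isinstance(value, (list, tuple)):
        if len(value) != n_views:
            raise ValueError(
                f"{name} has {len(value)} entries, expected {n_views}"
            )
        return list(value)
    return [value] * n_views


n_features = broadcast_per_view(20, 3, "n_features")
signal_to_noise = broadcast_per_view([4.0, 1.0], 2, "signal_to_noise")
print(n_features)       # [20, 20, 20]
print(signal_to_noise)  # [4.0, 1.0]
```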

Toy real-world datasets

Two small real-world datasets are included for quick experimentation:

load_linnerud

Wraps sklearn.datasets.load_linnerud. Returns two arrays:

  • View 1: exercise measurements (chin-ups, sit-ups, jumps) — shape (20, 3)
  • View 2: physiological measurements (weight, waist, pulse) — shape (20, 3)

from cca_zoo.datasets import load_linnerud

exercise, physiological = load_linnerud()
print(exercise.shape)        # (20, 3)
print(physiological.shape)   # (20, 3)
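
For reference, the two views correspond to the data and target arrays of the underlying scikit-learn loader. The sketch below calls scikit-learn directly. How the cca_zoo wrapper is implemented internally is an assumption here, and the alias sk_load_linnerud is only used to avoid a name clash.

```python
from sklearn.datasets import load_linnerud as sk_load_linnerud

bunch = sk_load_linnerud()

# In scikit-learn, `data` holds the exercise variables and `target`
# holds the physiological variables.
exercise = bunch.data
physiological = bunch.target

print(bunch.feature_names)   # ['Chins', 'Situps', 'Jumps']
print(bunch.target_names)    # ['Weight', 'Waist', 'Pulse']
print(exercise.shape, physiological.shape)  # (20, 3) (20, 3)
```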

load_breast_cancer

Wraps sklearn.datasets.load_breast_cancer. Splits the 30 features into two halves to create a two-view dataset:

  • View 1: first 15 features — shape (569, 15)
  • View 2: last 15 features — shape (569, 15)

from cca_zoo.datasets import load_breast_cancer

view1, view2 = load_breast_cancer()
print(view1.shape)   # (569, 15)
print(view2.shape)   # (569, 15)
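
The column split described above can be reproduced by hand from the scikit-learn loader. This is a minimal sketch assuming the split is a plain slice of the feature matrix at the midpoint. The wrapper's actual implementation may differ, and the alias sk_load_breast_cancer is only used to avoid a name clash.

```python
from sklearn.datasets import load_breast_cancer as sk_load_breast_cancer

X = sk_load_breast_cancer().data   # full feature matrix, shape (569, 30)

# Split the columns at the midpoint to form two views of the same samples.
half = X.shape[1] // 2
manual_view1 = X[:, :half]   # first 15 features
manual_view2 = X[:, half:]   # last 15 features

print(manual_view1.shape, manual_view2.shape)  # (569, 15) (569, 15)
```

Both views keep all 569 rows, so row j of each array refers to the same sample.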