cca_zoo.datasets

Utilities for generating and loading multiview datasets.


JointData

JointData(n_views: int = 2, n_samples: int = 100, latent_dimensions: int = 1, n_features: int | list[int] = 10, signal_to_noise: float | list[float] = 1.0, random_state: int | None = None)

Generate multiview data from a linear latent variable model.

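Each view is generated as:

x_i = Z @ W_i.T + noise_i

where Z is the shared latent variable of shape (n_samples, k) with rows drawn from N(0, I_{k x k}), W_i is the view-specific loading matrix of shape (n_features_i, k) with i.i.d. N(0, 1) entries drawn once at construction, and noise_i ~ N(0, sigma_i^2 I) is independent noise with sigma_i^2 = 1 / signal_to_noise_i. Here k is latent_dimensions.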

Parameters:

n_views (int): Number of views to generate. Default is 2.
n_samples (int): Number of observations to generate. Default is 100.
latent_dimensions (int): Dimension of the shared latent space. Default is 1.
n_features (int | list[int]): Number of features per view. May be a single integer (same for all views) or a list of integers with one entry per view. Default is 10.
signal_to_noise (float | list[float]): Signal-to-noise ratio. May be a single float (same for all views) or a list of floats with one entry per view. A higher value means less noise. Default is 1.0.
random_state (int | None): Integer seed for reproducible output, or None to seed the generator from fresh OS entropy. Default is None.
Example

>>> import numpy as np
>>> data = JointData(
...     n_views=2, n_samples=100, latent_dimensions=2, random_state=0
... )
>>> views = data.sample()
>>> len(views)
2
>>> views[0].shape
(100, 10)
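The generative model described above can also be sketched without the library. The function below is a standalone illustration in plain NumPy, not the cca_zoo implementation; the name sample_linear_latent_views and its arguments are made up for this example, and it assumes every signal-to-noise value is positive.

```python
import numpy as np


def sample_linear_latent_views(
    n_samples, latent_dimensions, features_per_view, snr_per_view, seed=0
):
    """Standalone sketch of x_i = Z @ W_i.T + noise_i (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # One loading matrix per view, shape (p_i, k), entries i.i.d. N(0, 1).
    weights = [
        rng.standard_normal((p, latent_dimensions)) for p in features_per_view
    ]
    # Shared latent variable, shape (n_samples, k).
    z = rng.standard_normal((n_samples, latent_dimensions))
    views = []
    for w, snr in zip(weights, snr_per_view):
        noise_std = 1.0 / np.sqrt(snr)  # assumption: snr > 0
        noise = rng.standard_normal((n_samples, w.shape[0])) * noise_std
        views.append(z @ w.T + noise)
    return views


views = sample_linear_latent_views(
    n_samples=100,
    latent_dimensions=2,
    features_per_view=[10, 5],
    snr_per_view=[1.0, 4.0],
)
print([v.shape for v in views])  # [(100, 10), (100, 5)]
```

All views share the same z, so any dependence between them comes from the latent variable alone; the per-view noise terms are drawn independently.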

Source code in cca_zoo/datasets/_simulated.py
def __init__(
    self,
    n_views: int = 2,
    n_samples: int = 100,
    latent_dimensions: int = 1,
    n_features: int | list[int] = 10,
    signal_to_noise: float | list[float] = 1.0,
    random_state: int | None = None,
) -> None:
    self.n_views = n_views
    self.n_samples = n_samples
    self.latent_dimensions = latent_dimensions
    self.n_features = n_features
    self.signal_to_noise = signal_to_noise
    self.random_state = random_state

    self._rng = np.random.default_rng(random_state)
    self._features_per_view: list[int] = self._broadcast_param(
        n_features, n_views, "n_features"
    )
    self._snr_per_view: list[float] = self._broadcast_param(
        signal_to_noise, n_views, "signal_to_noise"
    )

    # Pre-generate weight matrices so successive calls to sample()
    # share the same generative parameters.
    self._weights: list[np.ndarray] = [
        self._rng.standard_normal((p, latent_dimensions))
        for p in self._features_per_view
    ]
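The constructor relies on a private helper, _broadcast_param, whose body is not shown on this page. The sketch below shows one plausible way such a helper could expand a scalar into a per-view list. The function name broadcast_param, the accepted types, and the error raised on a length mismatch are all assumptions made for illustration and may differ from the library's behaviour.

```python
def broadcast_param(value, n_views, name):
    """Hypothetical stand-in for _broadcast_param (real helper not shown)."""
    if isinstance(value, (list, tuple)):
        # Assumption: a sequence must supply exactly one entry per view.
        if len(value) != n_views:
            raise ValueError(
                f"{name} must have {n_views} entries, got {len(value)}"
            )
        return list(value)
    # A scalar is repeated so that every view gets the same setting.
    return [value] * n_views


features = broadcast_param(10, 3, "n_features")
snrs = broadcast_param([1.0, 2.0, 4.0], 3, "signal_to_noise")
print(features)  # [10, 10, 10]
print(snrs)      # [1.0, 2.0, 4.0]
```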

sample

sample() -> list[np.ndarray]

Draw a new set of multiview samples from the generative model.

Returns:

list[ndarray]: List of numpy arrays, one per view, each of shape (n_samples, n_features_i).

Source code in cca_zoo/datasets/_simulated.py
def sample(self) -> list[np.ndarray]:
    """Draw a new set of multiview samples from the generative model.

    Returns:
        List of numpy arrays, one per view, each of shape
        (n_samples, n_features_i).
    """
    z = self._rng.standard_normal((self.n_samples, self.latent_dimensions))
    views: list[np.ndarray] = []
    for w, snr in zip(self._weights, self._snr_per_view):
        signal = z @ w.T  # (n_samples, p_i)
        noise_std = 1.0 / np.sqrt(snr) if snr > 0 else 1.0
        noise = self._rng.standard_normal(signal.shape) * noise_std
        views.append(signal + noise)
    return views
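The noise scale in sample() follows from the signal_to_noise setting: the standard deviation is 1 / sqrt(snr), and a non-positive value silently falls back to unit noise. The snippet below copies that one expression into a small function (the name noise_std_for is made up for this example) and checks it against an empirical estimate.

```python
import numpy as np


def noise_std_for(snr):
    # Same expression as in sample(): non-positive snr falls back to 1.0.
    return 1.0 / np.sqrt(snr) if snr > 0 else 1.0


for snr in (0.25, 1.0, 4.0, 0.0):
    print(snr, noise_std_for(snr))

# Empirical check: scaled standard-normal draws have the expected spread.
rng = np.random.default_rng(0)
noise = rng.standard_normal((100_000, 1)) * noise_std_for(4.0)
empirical = noise.std()
print(round(float(empirical), 2))
```

Because the loading matrices have unit-variance entries, the signal variance of each feature also grows with latent_dimensions, so signal_to_noise sets the noise scale rather than an exact ratio of signal variance to noise variance.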

load_linnerud

load_linnerud() -> tuple[np.ndarray, np.ndarray]

Load the Linnerud dataset as two views.

The Linnerud dataset (from scikit-learn) contains two sets of measurements on 20 middle-aged men: exercise performance and physiological measurements. This function returns them as a pair of numpy arrays suitable for two-view CCA.

Returns:

tuple[ndarray, ndarray]: Tuple (exercise, physiological). exercise is shape (20, 3) with chin-up, sit-up, and jump counts. physiological is shape (20, 3) with weight, waist, and pulse measurements.

Example

>>> X1, X2 = load_linnerud()
>>> X1.shape
(20, 3)
>>> X2.shape
(20, 3)

Source code in cca_zoo/datasets/_toy.py
def load_linnerud() -> tuple[np.ndarray, np.ndarray]:
    """Load the Linnerud dataset as two views.

    The Linnerud dataset (from scikit-learn) contains two sets of
    measurements on 20 middle-aged men: exercise performance and
    physiological measurements.  This function returns them as a pair of
    numpy arrays suitable for two-view CCA.

    Returns:
        Tuple ``(exercise, physiological)``. ``exercise`` is shape (20, 3)
        with chin-up, sit-up, and jump counts. ``physiological`` is shape
        (20, 3) with weight, waist, and pulse measurements.

    Example:
        >>> X1, X2 = load_linnerud()
        >>> X1.shape
        (20, 3)
        >>> X2.shape
        (20, 3)
    """
    from sklearn.datasets import load_linnerud as _load

    dataset = _load()
    # dataset.data = exercise, dataset.target = physiological
    return np.asarray(dataset.data), np.asarray(dataset.target)
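The returned arrays carry no column labels. As a rough illustration, the names can be read from the underlying scikit-learn loader, which this function wraps. The snippet calls scikit-learn directly rather than cca_zoo and assumes scikit-learn is installed.

```python
import numpy as np
from sklearn.datasets import load_linnerud

dataset = load_linnerud()
exercise = np.asarray(dataset.data)         # first view
physiological = np.asarray(dataset.target)  # second view

# Column labels for each view, in array order.
exercise_names = list(dataset.feature_names)
physiological_names = list(dataset.target_names)

print(exercise.shape, exercise_names)
print(physiological.shape, physiological_names)
```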

load_breast_cancer

load_breast_cancer() -> tuple[np.ndarray, np.ndarray]

Load the Wisconsin breast cancer dataset split into two feature views.

The 30 features of the Wisconsin Diagnostic Breast Cancer dataset are split by column order into two halves of 15 features each (the first 15 columns form view 1, the remaining 15 form view 2), providing a simple two-view dataset for benchmarking multiview methods.

Returns:

tuple[ndarray, ndarray]: Tuple (view1, view2) where each array has shape (569, 15).

Example

>>> X1, X2 = load_breast_cancer()
>>> X1.shape
(569, 15)
>>> X2.shape
(569, 15)

Source code in cca_zoo/datasets/_toy.py
def load_breast_cancer() -> tuple[np.ndarray, np.ndarray]:
    """Load the Wisconsin breast cancer dataset split into two feature views.

    The 30 features of the Wisconsin Diagnostic Breast Cancer dataset are
    split into two equal halves of 15 features each, providing a simple
    two-view dataset for benchmarking multiview methods.

    Returns:
        Tuple ``(view1, view2)`` where each array has shape (569, 15).

    Example:
        >>> X1, X2 = load_breast_cancer()
        >>> X1.shape
        (569, 15)
        >>> X2.shape
        (569, 15)
    """
    from sklearn.datasets import load_breast_cancer as _load

    dataset = _load()
    x: np.ndarray = np.asarray(dataset.data)
    midpoint = x.shape[1] // 2
    return x[:, :midpoint], x[:, midpoint:]
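To see which features end up in each view, the same midpoint split can be applied to the feature names from scikit-learn. This is a sketch that calls scikit-learn directly and assumes it is installed; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
x = np.asarray(dataset.data)
names = list(dataset.feature_names)

# Same split as above: integer midpoint of the column count.
midpoint = x.shape[1] // 2
view1_names, view2_names = names[:midpoint], names[midpoint:]

print(x.shape, midpoint)
print(view1_names)
print(view2_names)
```

scikit-learn orders the columns as ten "mean" features, ten "error" features, and ten "worst" features, so the first view holds the mean features plus the first five error features. The split is positional rather than semantic.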