cca_zoo.datasets

Utilities for generating and loading multiview datasets.

JointData

JointData(n_views: int = 2, n_samples: int = 100, latent_dimensions: int = 1, n_features: int | list[int] = 10, signal_to_noise: float | list[float] = 1.0, random_state: int | None = None)

Generate multiview data from a linear latent variable model.

Each view is generated as:

x_i = Z @ W_i.T + noise_i

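where Z is the shared latent variable, with rows drawn i.i.d. from N(0, I_k) and k = latent_dimensions; W_i is the view-specific loading matrix of shape (n_features_i, k) with i.i.d. N(0, 1) entries; and noise_i ~ N(0, sigma_i^2 I) is independent noise with sigma_i^2 = 1 / signal_to_noise_i (a non-positive signal_to_noise falls back to sigma_i = 1).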
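
The model above can be sketched directly in NumPy. This is a minimal illustration rather than the library's implementation: the variable names and the example values for `n_features` and `snr` are assumptions chosen for the sketch, and the noise scale follows the `sample` source listed on this page.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, k = 100, 2      # observations and latent dimensions
n_features = [10, 6]       # p_i for each of two views (illustrative)
snr = [1.0, 4.0]           # signal_to_noise per view (illustrative)

# Shared latent variable: one k-dimensional standard normal row per sample.
Z = rng.standard_normal((n_samples, k))

views = []
for p, s in zip(n_features, snr):
    W = rng.standard_normal((p, k))                      # loadings W_i
    sigma = 1.0 / np.sqrt(s)                             # sigma_i^2 = 1 / snr
    noise = sigma * rng.standard_normal((n_samples, p))  # noise_i
    views.append(Z @ W.T + noise)                        # x_i = Z @ W_i.T + noise_i

print([v.shape for v in views])  # [(100, 10), (100, 6)]
```

All views are built from the same `Z`, which is what makes them correlated; the loadings and noise are drawn separately per view.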

Parameters:

Name	Type	Description	Default
`n_views`	`int`	Number of views to generate.	`2`
`n_samples`	`int`	Number of observations to generate.	`100`
`latent_dimensions`	`int`	Dimension of the shared latent space.	`1`
`n_features`	`int \| list[int]`	Number of features per view. May be a single integer (same for all views) or a list of integers with one entry per view.	`10`
`signal_to_noise`	`float \| list[float]`	Signal-to-noise ratio. May be a single float (same for all views) or a list of floats with one entry per view. A higher value means less noise.	`1.0`
`random_state`	`int \| None`	Seed for the random number generator. Pass an integer for reproducible output, or `None` for an unpredictable seed.	`None`
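
The scalar-or-list convention for `n_features` and `signal_to_noise` can be illustrated with a small helper. The constructor calls a private `_broadcast_param` whose body is not shown on this page, so `broadcast_param` below is a hypothetical stand-in, and raising on a length mismatch is an assumption about reasonable behaviour rather than documented fact.

```python
from typing import Sequence, Union


def broadcast_param(
    value: Union[float, Sequence[float]], n_views: int, name: str
) -> list:
    """Expand a scalar to one entry per view, or validate a per-view list.

    Hypothetical helper for illustration; not the library's code.
    """
    if isinstance(value, (list, tuple)):
        if len(value) != n_views:
            raise ValueError(
                f"{name} has {len(value)} entries but n_views is {n_views}"
            )
        return list(value)
    # A single scalar is repeated for every view.
    return [value] * n_views


print(broadcast_param(10, 3, "n_features"))               # [10, 10, 10]
print(broadcast_param([0.5, 2.0], 2, "signal_to_noise"))  # [0.5, 2.0]
```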

Example

>>> import numpy as np
>>> data = JointData(
...     n_views=2, n_samples=100, latent_dimensions=2, random_state=0
... )
>>> views = data.sample()
>>> len(views)
2
>>> views[0].shape
(100, 10)

Source code in cca_zoo/datasets/_simulated.py

def __init__(
    self,
    n_views: int = 2,
    n_samples: int = 100,
    latent_dimensions: int = 1,
    n_features: int | list[int] = 10,
    signal_to_noise: float | list[float] = 1.0,
    random_state: int | None = None,
) -> None:
    self.n_views = n_views
    self.n_samples = n_samples
    self.latent_dimensions = latent_dimensions
    self.n_features = n_features
    self.signal_to_noise = signal_to_noise
    self.random_state = random_state

    self._rng = np.random.default_rng(random_state)
    self._features_per_view: list[int] = self._broadcast_param(
        n_features, n_views, "n_features"
    )
    self._snr_per_view: list[float] = self._broadcast_param(
        signal_to_noise, n_views, "signal_to_noise"
    )

    # Pre-generate weight matrices so successive calls to sample()
    # share the same generative parameters.
    self._weights: list[np.ndarray] = [
        self._rng.standard_normal((p, latent_dimensions))
        for p in self._features_per_view
    ]

sample

sample() -> list[np.ndarray]

Draw a new set of multiview samples from the generative model.

Returns:

Type	Description
`list[ndarray]`	List of numpy arrays, one per view, each of shape (n_samples, n_features_i).

Source code in cca_zoo/datasets/_simulated.py

def sample(self) -> list[np.ndarray]:
    """Draw a new set of multiview samples from the generative model.

    Returns:
        List of numpy arrays, one per view, each of shape
        (n_samples, n_features_i).
    """
    z = self._rng.standard_normal((self.n_samples, self.latent_dimensions))
    views: list[np.ndarray] = []
    for w, snr in zip(self._weights, self._snr_per_view):
        signal = z @ w.T  # (n_samples, p_i)
        noise_std = 1.0 / np.sqrt(snr) if snr > 0 else 1.0
        noise = self._rng.standard_normal(signal.shape) * noise_std
        views.append(signal + noise)
    return views
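
To see how `signal_to_noise` maps to the noise level, the noise-scale expression from `sample` can be checked on its own. This sketch uses only NumPy; the function name `noise_std`, the sample size, and the chosen `snr` values are assumptions for illustration.

```python
import numpy as np


def noise_std(snr: float) -> float:
    """Noise standard deviation for a given signal_to_noise value."""
    # Mirrors the expression in sample(): 1 / sqrt(snr) for positive snr,
    # otherwise a fallback of 1.0.
    return 1.0 / np.sqrt(snr) if snr > 0 else 1.0


rng = np.random.default_rng(0)
empirical = {}
for snr in (0.25, 1.0, 4.0):
    noise = rng.standard_normal(200_000) * noise_std(snr)
    empirical[snr] = float(noise.var())

for snr, var in empirical.items():
    print(f"snr={snr}: empirical variance {var:.3f}, expected {1.0 / snr:.3f}")
```

The empirical variances should land close to 1 / snr. Note that the signal variance of each feature depends on the corresponding row of W_i, so `signal_to_noise` fixes the noise scale rather than guaranteeing an exact per-feature ratio.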

load_linnerud

load_linnerud() -> tuple[np.ndarray, np.ndarray]

Load the Linnerud dataset as two views.

The Linnerud dataset (from scikit-learn) contains two sets of measurements on 20 middle-aged men: exercise performance and physiological measurements. This function returns them as a pair of numpy arrays suitable for two-view CCA.

Returns:

Type	Description
`tuple[ndarray, ndarray]`	Tuple `(exercise, physiological)`. `exercise` is shape (20, 3) with chin-up, sit-up, and jump counts. `physiological` is shape (20, 3) with weight, waist, and pulse measurements.

Example

>>> X1, X2 = load_linnerud()
>>> X1.shape
(20, 3)
>>> X2.shape
(20, 3)

Source code in cca_zoo/datasets/_toy.py

def load_linnerud() -> tuple[np.ndarray, np.ndarray]:
    """Load the Linnerud dataset as two views.

    The Linnerud dataset (from scikit-learn) contains two sets of
    measurements on 20 middle-aged men: exercise performance and
    physiological measurements.  This function returns them as a pair of
    numpy arrays suitable for two-view CCA.

    Returns:
        Tuple ``(exercise, physiological)``. ``exercise`` is shape (20, 3)
        with chin-up, sit-up, and jump counts. ``physiological`` is shape
        (20, 3) with weight, waist, and pulse measurements.

    Example:
        >>> X1, X2 = load_linnerud()
        >>> X1.shape
        (20, 3)
        >>> X2.shape
        (20, 3)
    """
    from sklearn.datasets import load_linnerud as _load

    dataset = _load()
    # dataset.data = exercise, dataset.target = physiological
    return np.asarray(dataset.data), np.asarray(dataset.target)
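
The same two views can be inspected with scikit-learn alone, which may help when checking which columns end up in which view. The sketch below assumes scikit-learn is installed; the column-wise standardisation at the end is an optional step added for illustration (the two views are measured in different units) and is not something `load_linnerud` does.

```python
import numpy as np
from sklearn.datasets import load_linnerud

dataset = load_linnerud()
exercise = np.asarray(dataset.data)          # first view
physiological = np.asarray(dataset.target)   # second view

# Column names for each view, as provided by scikit-learn.
print(dataset.feature_names, exercise.shape)
print(dataset.target_names, physiological.shape)


def standardise(x: np.ndarray) -> np.ndarray:
    """Centre each column and scale it to unit variance."""
    return (x - x.mean(axis=0)) / x.std(axis=0)


exercise_std = standardise(exercise)
physiological_std = standardise(physiological)
```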

load_breast_cancer

load_breast_cancer() -> tuple[np.ndarray, np.ndarray]

Load the Wisconsin breast cancer dataset split into two feature views.

The 30 features of the Wisconsin Diagnostic Breast Cancer dataset are split into two equal halves of 15 features each, providing a simple two-view dataset for benchmarking multiview methods.

Returns:

Type	Description
`tuple[ndarray, ndarray]`	Tuple `(view1, view2)` where each array has shape (569, 15).

Example

>>> X1, X2 = load_breast_cancer()
>>> X1.shape
(569, 15)
>>> X2.shape
(569, 15)

Source code in cca_zoo/datasets/_toy.py

def load_breast_cancer() -> tuple[np.ndarray, np.ndarray]:
    """Load the Wisconsin breast cancer dataset split into two feature views.

    The 30 features of the Wisconsin Diagnostic Breast Cancer dataset are
    split into two equal halves of 15 features each, providing a simple
    two-view dataset for benchmarking multiview methods.

    Returns:
        Tuple ``(view1, view2)`` where each array has shape (569, 15).

    Example:
        >>> X1, X2 = load_breast_cancer()
        >>> X1.shape
        (569, 15)
        >>> X2.shape
        (569, 15)
    """
    from sklearn.datasets import load_breast_cancer as _load

    dataset = _load()
    x: np.ndarray = np.asarray(dataset.data)
    midpoint = x.shape[1] // 2
    return x[:, :midpoint], x[:, midpoint:]
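
The midpoint split used above can be tried on a stand-in array without loading the dataset. The array contents here are arbitrary placeholders; only the shape matches the real feature matrix.

```python
import numpy as np

# Stand-in for the 569 x 30 feature matrix; values are arbitrary.
x = np.arange(569 * 30, dtype=float).reshape(569, 30)

midpoint = x.shape[1] // 2                       # 30 // 2 == 15
view1, view2 = x[:, :midpoint], x[:, midpoint:]  # first and last 15 columns

print(view1.shape, view2.shape)    # (569, 15) (569, 15)

# Basic slicing returns views onto x rather than copies.
print(np.shares_memory(view1, x))  # True
```

Because the split is by column position, the first view holds the first 15 columns and the second view the remaining 15. With an odd number of columns, integer division would give the second view the extra column. If independent arrays are needed, calling `.copy()` on each slice is one option.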