cca_zoo.datasets
Utilities for generating and loading multiview datasets.
JointData
JointData(n_views: int = 2, n_samples: int = 100, latent_dimensions: int = 1, n_features: int | list[int] = 10, signal_to_noise: float | list[float] = 1.0, random_state: int | None = None)
Generate multiview data from a linear latent variable model.
Each view is generated as:

    x_i = Z @ W_i.T + noise_i

where Z ~ N(0, I_k) is the shared latent variable with k = latent_dimensions,
W_i ~ N(0, I) is the view-specific loading matrix, and
noise_i ~ N(0, sigma_i^2 I) is independent noise whose variance is
controlled by signal_to_noise.
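The model above can be sketched in plain NumPy. This is an illustration only, not the library's implementation: `sketch_joint_views` is a hypothetical name, and the mapping from `signal_to_noise` to the noise standard deviation (`sigma = 1 / sqrt(snr)`) is an assumption, since the text does not state the exact relationship.

```python
import numpy as np


def sketch_joint_views(n_views=2, n_samples=100, latent_dimensions=1,
                       n_features=10, signal_to_noise=1.0, random_state=None):
    """Hypothetical sketch of x_i = Z @ W_i.T + noise_i (not cca_zoo code)."""
    rng = np.random.default_rng(random_state)

    # Broadcast scalar settings to one entry per view.
    if isinstance(n_features, int):
        n_features = [n_features] * n_views
    if isinstance(signal_to_noise, (int, float)):
        signal_to_noise = [float(signal_to_noise)] * n_views

    # Shared latent variable: one k-dimensional standard normal draw per sample.
    Z = rng.standard_normal((n_samples, latent_dimensions))

    views = []
    for p, snr in zip(n_features, signal_to_noise):
        # View-specific loading matrix with standard normal entries.
        W = rng.standard_normal((p, latent_dimensions))
        # ASSUMPTION: noise std is 1 / sqrt(snr); the real mapping may differ.
        sigma = 1.0 / np.sqrt(snr)
        noise = sigma * rng.standard_normal((n_samples, p))
        views.append(Z @ W.T + noise)
    return views


views = sketch_joint_views(n_views=2, n_samples=100, latent_dimensions=2,
                           n_features=[10, 5], signal_to_noise=[1.0, 4.0],
                           random_state=0)
print([v.shape for v in views])  # [(100, 10), (100, 5)]
```

Because every view is built from the same `Z`, the views are correlated through the latent variable, while the loadings and noise are drawn independently per view.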
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_views` | `int` | Number of views to generate. Default is 2. | `2` |
| `n_samples` | `int` | Number of observations to generate. Default is 100. | `100` |
| `latent_dimensions` | `int` | Dimension of the shared latent space. Default is 1. | `1` |
| `n_features` | `int \| list[int]` | Number of features per view. May be a single integer (same for all views) or a list of integers with one entry per view. Default is 10. | `10` |
| `signal_to_noise` | `float \| list[float]` | Signal-to-noise ratio. May be a single float (same for all views) or a list of floats. A higher value means less noise. Default is 1.0. | `1.0` |
| `random_state` | `int \| None` | Integer seed or `None`. | `None` |
Example

    >>> import numpy as np
    >>> from cca_zoo.datasets import JointData
    >>> data = JointData(
    ...     n_views=2, n_samples=100, latent_dimensions=2, random_state=0
    ... )
    >>> views = data.sample()
    >>> len(views)
    2
    >>> views[0].shape
    (100, 10)

Source code in cca_zoo/datasets/_simulated.py
sample
Draw a new set of multiview samples from the generative model.
Returns:
| Type | Description |
|---|---|
| `list[ndarray]` | List of numpy arrays, one per view, each of shape (n_samples, n_features_i). |
Source code in cca_zoo/datasets/_simulated.py
load_linnerud
Load the Linnerud dataset as two views.
The Linnerud dataset (from scikit-learn) contains two sets of measurements on 20 middle-aged men: exercise performance and physiological measurements. This function returns them as a pair of numpy arrays suitable for two-view CCA.
Returns:
| Type | Description |
|---|---|
| `tuple[ndarray, ndarray]` | Tuple of two arrays: the first has shape (20, 3) with chin-up, sit-up, and jump counts; the second has shape (20, 3) with weight, waist, and pulse measurements. |
Example

    >>> from cca_zoo.datasets import load_linnerud
    >>> X1, X2 = load_linnerud()
    >>> X1.shape
    (20, 3)
    >>> X2.shape
    (20, 3)

Source code in cca_zoo/datasets/_toy.py
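For orientation, the same two blocks of measurements can be read directly from scikit-learn. The sketch below uses `sklearn.datasets.load_linnerud`, whose `data` attribute holds the exercise variables and whose `target` attribute holds the physiological variables. Whether this function wraps scikit-learn in exactly this way is an assumption.

```python
from sklearn.datasets import load_linnerud as sk_load_linnerud

# Load the dataset bundled with scikit-learn (no download required).
bunch = sk_load_linnerud()

exercise = bunch.data          # chin-up, sit-up, and jump counts
physiological = bunch.target   # weight, waist, and pulse

print(bunch.feature_names)     # names of the exercise columns
print(bunch.target_names)      # names of the physiological columns
print(exercise.shape, physiological.shape)
```

Each array has one row per subject, so the pair can be passed to any two-view method that expects row-aligned views.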
load_breast_cancer
Load the Wisconsin breast cancer dataset split into two feature views.
The 30 features of the Wisconsin Diagnostic Breast Cancer dataset are split into two equal halves of 15 features each, providing a simple two-view dataset for benchmarking multiview methods.
Returns:
| Type | Description |
|---|---|
| `tuple[ndarray, ndarray]` | Tuple of two arrays, each of shape (569, 15). |
Example

    >>> from cca_zoo.datasets import load_breast_cancer
    >>> X1, X2 = load_breast_cancer()
    >>> X1.shape
    (569, 15)
    >>> X2.shape
    (569, 15)

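The split into two halves can be illustrated with a small column-slicing helper. This is a sketch under assumptions: `split_into_two_views` is a hypothetical name, a zero-filled stand-in array replaces the real data, and taking the first 15 columns as one view and the last 15 as the other is a guess, since the text does not say which features go to which view.

```python
import numpy as np


def split_into_two_views(X):
    """Hypothetical helper: split the columns of X into two equal halves."""
    n_features = X.shape[1]
    if n_features % 2 != 0:
        raise ValueError("expected an even number of features")
    half = n_features // 2
    # ASSUMPTION: first half of the columns is view 1, second half is view 2.
    return X[:, :half], X[:, half:]


# Stand-in with the same shape as the 569 x 30 feature matrix.
X = np.zeros((569, 30))
X1, X2 = split_into_two_views(X)
print(X1.shape, X2.shape)  # (569, 15) (569, 15)
```

Slicing keeps the rows aligned, so sample i in the first view corresponds to sample i in the second.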