metaclean3

Submodules

Package Contents

Classes

MetaClean

Cleans a given data set.

MetaCleanFCS

Given compensated (flow cytometry) / unmixed (spectral cytometry) and transformed cytometry data, prepares the data for cleaning via MetaClean.

FCSfile

Prepares an FCS file for MetaClean3.0.

class metaclean3.MetaClean(seg_method: str = 'pelt', cost_model_seg: str = 'rbf', jump_per_seg: int = 2, min_seg_size: int = 2, seg_no: int = 40, mean_no_per_seg: int = 50, min_no_per_seg_limit: int = 10, pelt_penalty: float = 5.0, merge_method: str = 'sequential', merge_signif_test: str = 'wilcox', p_thres: float = 0.05, signif_strict: bool = True, min_ref_percent: float = 0.5, cost_model_gain: str = 'rank', gain_diff: bool = True, percent_diff: float = 0.05, percent_shifts: float = [0.15, 0.2, 0.25, 0.3, 0.35, 0.4], small_no_per_seg0: int = 50, ignore_no: int = 10, ref_percents: list = [1, 0.5], ref_percent: float = 0.1, exception_no: int = 2, keep_skipped: str = 'all', pd_lenient: int = 2, min_ref_percent_to_keep: float = 0.4, random_seed: int = 623, verbose: bool = True, png_dir: str = '')

Cleans a given data set.

clean(data: pandas.DataFrame, val_cols: list | None = None, val_col_final: str | None = None)

Conduct cleaning of data (i.e. remove irregular objects/rows).

Args:

data (pandas.DataFrame): An object x feature matrix.

val_cols (list | None): List of feature column names in data to be used for cleaning. Defaults to None.

val_col_final (str | None): A summary feature column name in data (e.g. the sum of val_cols values). val_cols and val_col_final can overlap. Defaults to None.

Returns:
tuple:
numpy.ndarray: A 1D boolean array with True/False labels indicating which rows in data to keep/remove.

numpy.ndarray: A 1D integer array containing refined segment labels.

numpy.ndarray: A 1D integer array containing raw segment labels.
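
The three returned arrays are aligned row by row with data, so they can be consumed together. The following is a minimal sketch of that pattern, not output from MetaClean: the arrays are hand-written stand-ins, plain lists replace the NumPy arrays so it runs without extra packages, and summarize_keep is a hypothetical helper.

```python
from collections import Counter


def summarize_keep(keep, segments):
    """Count (kept, removed) rows for each refined segment label."""
    kept, removed = Counter(), Counter()
    for flag, seg in zip(keep, segments):
        (kept if flag else removed)[seg] += 1
    return {seg: (kept[seg], removed[seg]) for seg in sorted(set(segments))}


# Hand-written stand-ins for the tuple returned by clean(); real values
# would come from the library and be numpy arrays.
keep = [True, True, True, False, False, True, True]
segments = [0, 0, 0, 1, 1, 0, 0]        # refined segment labels
segments_raw = [0, 0, 1, 2, 2, 3, 3]    # raw segment labels

rows = ["r0", "r1", "r2", "r3", "r4", "r5", "r6"]  # one entry per data row
kept_rows = [row for row, flag in zip(rows, keep) if flag]
summary = summarize_keep(keep, segments)
```

With a pandas.DataFrame, the equivalent of kept_rows would be boolean indexing with the first returned array.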

get_segments(values: list | numpy.ndarray | pandas.Series)

Given an array, break the array into segments (find changepoints) and merge those segments based on significance tests to obtain the initial “raw” segments.

Args:
values (list | numpy.ndarray | pandas.Series): A 1D array of numeric values.

Returns:

numpy.ndarray: A 1D integer array containing “raw” segment labels.
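
To make the idea of merging segments by significance test concrete, here is a rough sketch of a sequential merge: adjacent segments are compared left to right and merged when the test finds no significant difference at p_thres. This is not MetaClean's implementation. The constructor's default test is 'wilcox'; this sketch swaps in a permutation test on the difference of means so that it needs only the standard library. The names permutation_p_value and merge_sequential are hypothetical.

```python
import random
from statistics import fmean


def permutation_p_value(a, b, n_perm=2000, seed=623):
    """Two-sided permutation test on the absolute difference of means."""
    rng = random.Random(seed)
    observed = abs(fmean(a) - fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:len(a)]) - fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


def merge_sequential(values, labels, p_thres=0.05):
    """Merge each segment into the running one unless the test is significant."""
    # Split values into runs of identical consecutive labels.
    runs = []
    for value, label in zip(values, labels):
        if runs and runs[-1][0] == label:
            runs[-1][1].append(value)
        else:
            runs.append((label, [value]))

    current = list(runs[0][1])
    new_label = 0
    merged = [new_label] * len(current)
    for _, seg in runs[1:]:
        if permutation_p_value(current, seg) >= p_thres:
            current.extend(seg)   # no significant difference: merge
        else:
            new_label += 1        # significant shift: start a new segment
            current = list(seg)
        merged.extend([new_label] * len(seg))
    return merged


values = [1.0, 1.1, 0.9, 1.05, 0.95,
          1.0, 0.9, 1.1, 1.0, 1.05,
          5.0, 5.1, 4.9, 5.2, 5.0]
labels = [0] * 5 + [1] * 5 + [2] * 5
merged = merge_sequential(values, labels)
```

In this toy input the first two segments share the same level and collapse into one, while the third sits far from them and stays separate.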

merge_segments_raw(values: list | numpy.ndarray | pandas.Series, chpts: list | numpy.ndarray | pandas.Series, gains: list | numpy.ndarray | pandas.Series)

Merge raw segments using statistical significance tests.

Args:
values (list | numpy.ndarray | pandas.Series): A 1D array of numeric values.

chpts (list | numpy.ndarray | pandas.Series): A 1D array of changepoints not including 0 and length of values as the first and last elements.

gains (list | numpy.ndarray | pandas.Series): List of gains corresponding to chpts.

Returns:

numpy.ndarray: A 1D integer array containing “raw” segment labels.
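
The chpts convention (interior changepoints only, with 0 and the length of values left implicit) can be illustrated by expanding changepoints into a per-value label array. This is a sketch under one assumption that the text above does not confirm: each changepoint is taken to be the index of the first value of a new segment. labels_from_changepoints is a hypothetical helper, not part of metaclean3.

```python
def labels_from_changepoints(n_values, chpts):
    """Expand interior changepoints into one integer segment label per value.

    Assumption: a changepoint c means values[c] starts a new segment.
    """
    # Re-attach the implicit first (0) and last (n_values) boundaries.
    bounds = [0] + sorted(chpts) + [n_values]
    labels = []
    for seg_id, (start, stop) in enumerate(zip(bounds[:-1], bounds[1:])):
        labels.extend([seg_id] * (stop - start))
    return labels


# Eight values with changepoints at indices 3 and 5 give three segments.
example_labels = labels_from_changepoints(8, [3, 5])
```

With no changepoints at all, every value receives label 0, which corresponds to a single unbroken segment.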

merge_segments(values: list | numpy.ndarray | pandas.Series, vlist: list, segments: list | numpy.ndarray | pandas.Series)

Merge segments based on changepoint gain and segment quantile ranges.

Args:
values (list | numpy.ndarray | pandas.Series): A 1D array of numeric values.

vlist (list): A list of 1D arrays with the same length as values.

segments (list | numpy.ndarray | pandas.Series): Segment label vector.

Returns:
tuple:

numpy.ndarray: Sorted 1D integer segment label array.

numpy.ndarray: Sorted indices.

refine_segments(values: list | numpy.ndarray | pandas.Series, segments: list | numpy.ndarray | pandas.Series, ref_segment: int)

Refine segment to be kept.

Args:
values (list | numpy.ndarray | pandas.Series): A 1D array of numeric values.

segments (list | numpy.ndarray | pandas.Series): Segment label vector.

ref_segment (int): Label of the reference segment.

Returns:

numpy.ndarray: Refined segment label vector.

class metaclean3.MetaCleanFCS(clean_chans: list | numpy.ndarray | pandas.Series | None = None, clean_chans_no: int = 4, candidate_chans_type: str = 'fluo', corr_type: str = 'max', strict_remove_duplicates: bool = False, n_cores: int = -1, dens_agg_type: str = 'max', dens_k_dtm: int = 15, p: int = 2, eps: float = 0.1, outlier_thresh: float = 0.01, outlier_trees: int = 500, outlier_drop_and: bool = True, rm_outliers: str = 'all', random_seed: int = 623, verbose: bool = True, png_dir: str = '', metaclean: type[MetaClean] | None = None, **kwargs)

Given compensated (flow cytometry) / unmixed (spectral cytometry) and transformed cytometry data, prepares the data for cleaning via MetaClean.

apply(fcs: type[metaclean3.fcs.FCSfile] | None = None, randomize_duplicates_tf: bool = True, return_binned_data: bool = False, **kwargs)

Prepare given fcs and apply MetaClean.

Args:
fcs (FCSfile | None, optional): Initialized FCSfile data class. Defaults to None.

randomize_duplicates_tf (bool, optional): See apply_features().

return_binned_data (bool, optional): Whether or not to return binned data. Defaults to False.

**kwargs: See FCSfile class attributes, namely data (event x feature pandas.DataFrame).

Returns:
pandas.DataFrame: A data frame with the same number of rows as fcs.data. Columns with the prefix val_ contain feature values, outlier_keep contains boolean values for bins considered (True) as outliers, bin contains bin labels, segments_raw contains the raw segment labels, segments contains the merged and refined segment labels, and clean_keep contains boolean values for bins to keep (True). clean_keep is the final recommendation given by MetaClean.

apply_features(fcs: metaclean3.fcs.FCSfile | None = None, randomize_duplicates_tf: bool = True, calculate_outlier2: bool = False, **kwargs)

Calculate bins and features using raw values.

Args:

fcs (FCSfile | None, optional): Initialized FCSfile data class. Defaults to None.

randomize_duplicates_tf (bool, optional): Whether or not to randomize duplicates. Only set this to False if duplicates in the input data have already been dealt with. Defaults to True.

calculate_outlier2 (bool, optional): Whether or not to identify outliers again more leniently if the user wishes to remove only some outliers. Defaults to False.

Returns:
pandas.DataFrame: A data matrix containing bins (bin column), feature values (prefix val_ column(s)), and outliers (outlier_keep column).

apply_clean(data: pandas.DataFrame)

Apply cleaning.

Args:

data (pandas.DataFrame): Object x feature matrix.

Returns:

pandas.DataFrame: See MetaClean.

remove_outliers(data_merged: pandas.DataFrame)

After cleaning is done, revisits outliers to either keep all, some, or none of the outliers.

Args:

data_merged (pandas.DataFrame): object/row x feature matrix.

Returns:
pandas.DataFrame: data_merged where the clean_keep column is edited such that all, some, or no outliers are kept.

class metaclean3.FCSfile

Prepares an FCS file for MetaClean3.0.

Attributes:

data (pandas.DataFrame): FCS file data matrix sorted by time.

time_step (float | None, optional):

See binning.get_time_binned(). Defaults to None.

time_chan (str): Time channel name. Defaults to ‘time’.

bin_chan (str, optional): Name of the bin column to be added. Defaults to ‘bin’.

fluo_chans (list | np.ndarray, optional):

Fluorescent channels that can be used for MetaClean. Defaults to np.array([]).

phys_chans (list | np.ndarray, optional):

Physical morphological channels that can be used for MetaClean. Defaults to np.array([]).

data : pandas.DataFrame
time_step : float | None
time_chan : str = 'time'
bin_chan : str = 'bin'
fluo_chans : list | numpy.ndarray
phys_chans : list | numpy.ndarray
time_bin : dataclasses.InitVar[str | None]
channel_unique_no : dataclasses.InitVar[int] = 25
min_bin_size : dataclasses.InitVar[int] = 2000
max_bin_size : dataclasses.InitVar[int] = 10000
min_events_per_bin : dataclasses.InitVar[int] = 50
__post_init__(time_bin, channel_unique_no, min_bin_size, max_bin_size, min_events_per_bin)
Args:
time_bin (str, optional):

See binning.get_time_binned(). Defaults to ‘1S’.

channel_unique_no (int):

See channels.get_clean_fp_channels(). Defaults to 25.

min_bin_size (int, optional):

See binning.get_time_binned(). Defaults to 2000.

max_bin_size (int, optional):

See binning.get_time_binned(). Defaults to 10000.

min_events_per_bin (int, optional):

See binning.get_time_binned(). Defaults to 50.
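
The dataclasses.InitVar fields listed above are passed to __post_init__ as arguments and are not stored as regular fields on the instance. The sketch below illustrates that standard-library mechanism only. BinnedEvents is a hypothetical stand-in rather than FCSfile, and its fixed-count binning is placeholder logic, not the behavior of binning.get_time_binned().

```python
from dataclasses import InitVar, dataclass, field, fields


@dataclass
class BinnedEvents:
    """Hypothetical stand-in showing the InitVar / __post_init__ pattern."""

    times: list
    bin_chan: str = "bin"
    bins: list = field(default_factory=list, init=False)
    # Init-only parameter: consumed by __post_init__, not kept as a field.
    min_events_per_bin: InitVar[int] = 50

    def __post_init__(self, min_events_per_bin):
        # Sort by time, then assign a fixed number of events to each bin.
        # (Placeholder logic; the real binning rules are not documented here.)
        self.times = sorted(self.times)
        self.bins = [i // min_events_per_bin for i in range(len(self.times))]


events = BinnedEvents([3.0, 1.0, 2.0, 0.5], min_events_per_bin=2)
field_names = [f.name for f in fields(events)]
```

Because init-only values are not fields, they do not appear in fields() or in the generated repr, which keeps tuning parameters out of the stored data.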