metaclean3¶
Submodules¶
Package Contents¶
Classes¶
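MetaClean | Cleans a given data set.
MetaCleanFCS | Given compensated (flow cytometry) / unmixed (spectral cytometry) and transformed cytometry data, prepares the data for cleaning via MetaClean3.0.
FCSfile | Prepares an FCS file for MetaClean3.0.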
- class metaclean3.MetaClean(seg_method: str = 'pelt', cost_model_seg: str = 'rbf', jump_per_seg: int = 2, min_seg_size: int = 2, seg_no: int = 40, mean_no_per_seg: int = 50, min_no_per_seg_limit: int = 10, pelt_penalty: float = 5.0, merge_method: str = 'sequential', merge_signif_test: str = 'wilcox', p_thres: float = 0.05, signif_strict: bool = True, min_ref_percent: float = 0.5, cost_model_gain: str = 'rank', gain_diff: bool = True, percent_diff: float = 0.05, percent_shifts: float = [0.15, 0.2, 0.25, 0.3, 0.35, 0.4], small_no_per_seg0: int = 50, ignore_no: int = 10, ref_percents: list = [1, 0.5], ref_percent: float = 0.1, exception_no: int = 2, keep_skipped: str = 'all', pd_lenient: int = 2, min_ref_percent_to_keep: float = 0.4, random_seed: int = 623, verbose: bool = True, png_dir: str = '')¶
Cleans a given data set.
- clean(data: pandas.DataFrame, val_cols: list | None = None, val_col_final: str | None = None)¶
Conduct cleaning of data (i.e. remove irregular objects/rows).
- Args:
- data (pandas.DataFrame): An object x feature matrix.
- val_cols (list | None): List of feature column names in data to be used for cleaning. Defaults to None.
- val_col_final (str | None): Name of a summary feature column in data (e.g. the sum of the val_cols values). val_cols and val_col_final can overlap. Defaults to None.
- Returns:
- tuple:
- numpy.ndarray: A 1D boolean array with True/False labels indicating which rows in data to keep/remove.
- numpy.ndarray: A 1D integer array containing refined segment labels.
- numpy.ndarray: A 1D integer array containing raw segment labels.
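As an illustration of how the three returned arrays line up row for row with data, the sketch below uses hand-written stand-in values (plain Python lists, not real MetaClean output) to filter rows with the keep mask and count how many rows each refined segment retains. All variable names and values here are hypothetical.

```python
from collections import Counter

# Hypothetical stand-ins for the three arrays returned by clean();
# real output would be numpy arrays with one entry per row of `data`.
keep = [True, True, False, False, True, True]
segments = [0, 0, 1, 1, 0, 0]        # refined segment labels
segments_raw = [0, 0, 1, 1, 2, 2]    # raw segment labels

rows = ["r0", "r1", "r2", "r3", "r4", "r5"]  # stand-in for the rows of `data`

# Keep only the rows flagged True in the mask.
kept_rows = [row for row, k in zip(rows, keep) if k]

# Count the kept rows per refined segment.
kept_per_segment = Counter(seg for seg, k in zip(segments, keep) if k)

print(kept_rows)
print(dict(kept_per_segment))
```

With a real pandas.DataFrame, the same filter would typically be written as `data[keep]`, since the mask has one boolean per row.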
- get_segments(values: list | numpy.ndarray | pandas.Series)¶
Given an array, break the array into segments (find changepoints) and merge those segments based on significance tests to obtain the initial “raw” segments.
- Args:
- values (list | numpy.ndarray | pandas.Series): A 1D array of numeric
values.
- Returns:
numpy.ndarray: A 1D integer array containing “raw” segment labels.
- merge_segments_raw(values: list | numpy.ndarray | pandas.Series, chpts: list | numpy.ndarray | pandas.Series, gains: list | numpy.ndarray | pandas.Series)¶
Merge raw segments using statistical significance tests.
- Args:
- values (list | numpy.ndarray | pandas.Series): A 1D array of numeric
values.
- chpts (list | numpy.ndarray | pandas.Series): A 1D array of
changepoints; 0 and the length of values are not included as the first and last elements.
- gains (list | numpy.ndarray | pandas.Series): List of gains
corresponding to chpts.
- Returns:
numpy.ndarray: A 1D integer array containing “raw” segment labels.
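The changepoint convention described above (interior changepoints only, without 0 and the length of values) can be sketched as follows. The helper `labels_from_changepoints` is hypothetical and not part of metaclean3; it also assumes that each changepoint is the index at which a new segment starts, which this page does not confirm.

```python
def labels_from_changepoints(n_values, chpts):
    """Return one integer segment label per value.

    `chpts` lists interior changepoints only: it excludes 0 and n_values.
    Assumption: a changepoint is the index where a new segment starts.
    """
    bounds = [0, *sorted(chpts), n_values]
    labels = []
    for seg_id, (start, stop) in enumerate(zip(bounds[:-1], bounds[1:])):
        labels.extend([seg_id] * (stop - start))
    return labels


labels = labels_from_changepoints(10, [3, 7])
print(labels)  # [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
```

Two interior changepoints produce three segments, so the number of labels is always one more than the number of changepoints.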
- merge_segments(values: list | numpy.ndarray | pandas.Series, vlist: list, segments: list | numpy.ndarray | pandas.Series)¶
Merge segments based on changepoint gain and segment quantile ranges.
- Args:
- values (list | numpy.ndarray | pandas.Series): A 1D array of numeric
values.
- vlist (list): A list of 1D arrays with the same length as values.
- segments (list | numpy.ndarray | pandas.Series): Segment label vector.
- Returns:
- tuple:
- numpy.ndarray: Sorted 1D integer segment label array.
- numpy.ndarray: Sorted indices.
- refine_segments(values: list | numpy.ndarray | pandas.Series, segments: list | numpy.ndarray | pandas.Series, ref_segment: int)¶
Refine segment to be kept.
- Args:
- values (list | numpy.ndarray | pandas.Series): A 1D array of numeric
values.
- segments (list | numpy.ndarray | pandas.Series): Segment label
vector.
ref_segment (int): Label of the reference segment.
- Returns:
numpy.ndarray: Refined segment label vector.
- class metaclean3.MetaCleanFCS(clean_chans: list | numpy.ndarray | pandas.Series | None = None, clean_chans_no: int = 4, candidate_chans_type: str = 'fluo', corr_type: str = 'max', strict_remove_duplicates: bool = False, n_cores: int = -1, dens_agg_type: str = 'max', dens_k_dtm: int = 15, p: int = 2, eps: float = 0.1, outlier_thresh: float = 0.01, outlier_trees: int = 500, outlier_drop_and: bool = True, rm_outliers: str = 'all', random_seed: int = 623, verbose: bool = True, png_dir: str = '', metaclean: type[MetaClean] | None = None, **kwargs)¶
Given compensated (flow cytometry) / unmixed (spectral cytometry) and transformed cytometry data, prepares the data for cleaning via MetaClean3.0.
- apply(fcs: type[metaclean3.fcs.FCSfile] | None = None, randomize_duplicates_tf: bool = True, return_binned_data: bool = False, **kwargs)¶
Prepare the given fcs and apply MetaClean.
- Args:
- fcs (FCSfile | None, optional): Initialized FCSfile data class. Defaults to None.
- randomize_duplicates_tf (bool, optional): See apply_features().
- return_binned_data (bool, optional): Whether or not to return binned data. Defaults to False.
- **kwargs: See FCSfile class attributes, namely data (event x feature pandas.DataFrame).
- Returns:
- pandas.DataFrame: A data frame with the same number of rows as fcs.data. Columns with the prefix val_ contain feature values, outlier_keep contains boolean values for bins considered (True) as outliers, bin contains bin labels, segments_raw contains the raw segment labels, segments contains the merged and refined segment labels, and clean_keep contains boolean values for bins to keep (True). clean_keep is the final recommendation given by MetaClean3.0.
- apply_features(fcs: metaclean3.fcs.FCSfile | None = None, randomize_duplicates_tf: bool = True, calculate_outlier2: bool = False, **kwargs)¶
Calculate bins and features using raw values.
- Args:
- fcs (FCSfile | None, optional): Initialized FCSfile data class. Defaults to None.
- randomize_duplicates_tf (bool, optional): Whether or not to randomize duplicates. Only set this to False if duplicates in the input data have already been dealt with. Defaults to True.
- calculate_outlier2 (bool, optional): Whether or not to identify outliers again more leniently if the user wishes to remove only some outliers. Defaults to False.
- Returns:
- pandas.DataFrame: A data matrix containing bins (bin column), feature values (prefix val_ column(s)), and outliers (outlier_keep column).
- apply_clean(data: pandas.DataFrame)¶
Apply cleaning.
- Args:
data (pandas.DataFrame): Object x feature matrix.
- Returns:
pandas.DataFrame: See MetaClean.
- remove_outliers(data_merged: pandas.DataFrame)¶
After cleaning is done, revisits outliers to either keep all, some, or none of the outliers.
- Args:
data_merged (pandas.DataFrame): object/row x feature matrix.
- Returns:
- pandas.DataFrame: data_merged where the clean_keep column is
edited such that all, some, or no outliers are kept.
- class metaclean3.FCSfile¶
Prepares an FCS file for MetaClean3.0.
- Attributes:
- data (pandas.DataFrame): FCS file data matrix sorted by time.
- time_step (float | None, optional): See binning.get_time_binned(). Defaults to None.
- time_chan (str): Time channel name. Defaults to ‘time’.
- bin_chan (str, optional): Name of the bin column to be added.
- fluo_chans (list | np.ndarray, optional): Fluorescent channels that can be used for MetaClean3.0. Defaults to np.array([]).
- phys_chans (list | np.ndarray, optional): Physical morphological channels that can be used for MetaClean3.0. Defaults to np.array([]).
- data : pandas.DataFrame¶
- time_step : float | None¶
- time_chan : str = 'time'¶
- bin_chan : str = 'bin'¶
- fluo_chans : list | numpy.ndarray¶
- phys_chans : list | numpy.ndarray¶
- time_bin : dataclasses.InitVar[str | None]¶
- channel_unique_no : dataclasses.InitVar[int] = 25¶
- min_bin_size : dataclasses.InitVar[int] = 2000¶
- max_bin_size : dataclasses.InitVar[int] = 10000¶
- min_events_per_bin : dataclasses.InitVar[int] = 50¶
- __post_init__(time_bin, channel_unique_no, min_bin_size, max_bin_size, min_events_per_bin)¶
- Args:
- time_bin (str, optional): See binning.get_time_binned(). Defaults to ‘1S’.
- channel_unique_no (int): See channels.get_clean_fp_channels(). Defaults to 25.
- min_bin_size (int, optional): See binning.get_time_binned(). Defaults to 2000.
- max_bin_size (int, optional): See binning.get_time_binned(). Defaults to 10000.
- min_events_per_bin (int, optional): See binning.get_time_binned(). Defaults to 50.
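Several FCSfile parameters are declared as dataclasses.InitVar, which means they are handed to __post_init__ at construction time and are not kept as dataclass fields. The sketch below shows that general Python mechanism with a hypothetical `BinnedTable` class. The class, its field names other than `bin_chan`, and its toy binning rule are assumptions for illustration only and do not reproduce what binning.get_time_binned() does.

```python
from dataclasses import InitVar, dataclass, field, fields


@dataclass
class BinnedTable:  # hypothetical class, not part of metaclean3
    times: list
    bin_chan: str = "bin"
    bins: list = field(default_factory=list, init=False)
    # Init-only parameter: passed to __post_init__, not a dataclass field.
    min_events_per_bin: InitVar[int] = 50

    def __post_init__(self, min_events_per_bin):
        # Toy binning rule: consecutive chunks of `min_events_per_bin` events.
        self.bins = [i // min_events_per_bin for i in range(len(self.times))]


table = BinnedTable(times=[0.1, 0.2, 0.3, 0.4, 0.5], min_events_per_bin=2)
field_names = [f.name for f in fields(table)]
print(table.bins)    # [0, 0, 1, 1, 2]
print(field_names)   # ['times', 'bin_chan', 'bins']
```

The init-only parameter shapes the derived `bins` value but does not appear among the fields, which mirrors how the InitVar entries above are listed separately from the stored attributes.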