metaclean3.utils

Module Contents

Functions

get_close_match(word, list_of_words)

Returns string from list_of_words that is the closest match to word.

align_list(list1, list2)

Converts list2 into list1.

arg_lim(x[, xmin, xmax])

Sets the lower and upper limit of a given numeric variable.

is_monotonic(x[, strict])

Determines whether a vector is increasing or decreasing monotonically.

mode(x)

Returns the mode in vector x.

str_to_time(time_bin)

Converts string to time format.

randomize_duplicates(arr[, noise_prop, strict, ...])

If there are any duplicate rows, adds random noise to last column.

duplicated_rows(arr[, strict, lenient_percent])

Identifies duplicate rows in the given 2D array.

round_to_1(x)

Round a number to the nearest power of 10.

seq_array(size[, max_val])

Creates an evenly-spaced integer numpy array.

order_value_bins(vlist, segments[, percentile_diff])

Order segments by their geometric means and quantiles.

get_timestep(meta[, timestep_key])

Extract timestep from flow cytometry standard file meta data.

_clean_spil_index(spil_df)

_clean_spil_names(spil_names[, keep_left])

_clean_spillover(spillover[, data_col, keep_left])

get_spillover_raw(meta, dat_columns)

Extract spillover matrix from fcs meta data.

apply_compensation_matrix(data, spillover)

Apply spillover compensation on data (from fcmdata_helpers).

metaclean3.utils.get_close_match(word: str, list_of_words: list)

Returns string from list_of_words that is the closest match to word.

Args:

word (str): A string.

list_of_words (list[str]): A list of strings word will match to.

Returns:

str: string from list_of_words that is the closest match to word.
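The behaviour described above can be approximated with the standard library. This is a minimal sketch, not the library's implementation: `closest_match_sketch` is a hypothetical name, and the use of `difflib` with a zero cutoff is an assumption about how "closest" is measured.

```python
import difflib


def closest_match_sketch(word, list_of_words):
    """Return the entry of list_of_words most similar to word (hypothetical helper)."""
    # cutoff=0.0 means some candidate is returned whenever the list is non-empty.
    matches = difflib.get_close_matches(word, list_of_words, n=1, cutoff=0.0)
    return matches[0] if matches else None


# A channel name with a stray trailing space still maps to the intended entry.
result = closest_match_sketch("FSC-A ", ["FSC-A", "SSC-A", "Time"])
```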

metaclean3.utils.align_list(list1: list, list2: list)

Converts list2 into list1.

We assume both lists should be the same and we are only cleaning up list2. If some element in list2 does not exist in list1, this function finds the most similar string in list1 and replaces that element.

Args:

list1 (list[str]): List to compare to.

list2 (list[str]): List to convert into list1.

Returns:

list2_ (list[str]): Cleaned-up version of list2.
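The replacement rule described above might look like the following. This is a sketch under assumptions: `align_list_sketch` is a hypothetical name, and string similarity is measured with `difflib`, which may differ from what the library uses.

```python
import difflib


def align_list_sketch(list1, list2):
    """Replace entries of list2 that are missing from list1 (hypothetical helper)."""
    aligned = []
    for item in list2:
        if item in list1:
            # Already a valid entry, keep it unchanged.
            aligned.append(item)
        else:
            # Fall back to the most similar string in the reference list.
            best = difflib.get_close_matches(item, list1, n=1, cutoff=0.0)
            aligned.append(best[0] if best else item)
    return aligned


reference = ["FSC-A", "SSC-A", "FITC-A"]
messy = ["FSC-A", "SSC_A", "FITC A"]
cleaned = align_list_sketch(reference, messy)
```

Note that this sketch can map two different entries of list2 onto the same entry of list1; whether the library guards against that is not stated here.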

metaclean3.utils.arg_lim(x: int | float, xmin: int | float = -np.Inf, xmax: int | float = np.Inf)

Sets the lower and upper limit of a given numeric variable.

Args:

x (int | float): Numeric variable.

xmin (int | float, optional): Lower limit. Defaults to -numpy.Inf.

xmax (int | float, optional): Upper limit. Defaults to numpy.Inf.

Returns:

int | float: x within the given lower and upper limit.
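Limiting a value to a range is commonly called clamping. A minimal sketch of that idea, using a hypothetical `clamp_sketch` and plain Python floats in place of the numpy defaults:

```python
def clamp_sketch(x, xmin=float("-inf"), xmax=float("inf")):
    """Return x limited to the closed interval [xmin, xmax] (hypothetical helper)."""
    return max(xmin, min(x, xmax))


above = clamp_sketch(5, 0, 3)      # larger than xmax, so xmax is returned
below = clamp_sketch(-2, 0, 3)     # smaller than xmin, so xmin is returned
untouched = clamp_sketch(1.5)      # no limits given, value is unchanged
```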

metaclean3.utils.is_monotonic(x: numpy.ndarray, strict: bool = True)

Determines whether a vector is increasing or decreasing monotonically.

Args:

x: Any integer/float vector that can be converted to a 1D numpy array.

strict (bool, optional): Whether or not to test strict monotonicity. Defaults to True.

Returns:

bool: Given vector is increasing or decreasing monotonically.
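The difference between strict and non-strict monotonicity is whether repeated neighbouring values are allowed. A pure-Python sketch of the check (hypothetical `is_monotonic_sketch`; the library works on numpy arrays and may handle edge cases such as empty input differently):

```python
def is_monotonic_sketch(x, strict=True):
    """Check whether x only increases or only decreases (hypothetical helper)."""
    values = list(x)
    pairs = list(zip(values, values[1:]))  # consecutive neighbours
    if strict:
        increasing = all(a < b for a, b in pairs)
        decreasing = all(a > b for a, b in pairs)
    else:
        # Ties between neighbours are tolerated in the non-strict case.
        increasing = all(a <= b for a, b in pairs)
        decreasing = all(a >= b for a, b in pairs)
    return increasing or decreasing


strictly_up = is_monotonic_sketch([1, 2, 3])
tie_strict = is_monotonic_sketch([1, 2, 2])
tie_lenient = is_monotonic_sketch([1, 2, 2], strict=False)
zigzag = is_monotonic_sketch([1, 3, 2], strict=False)
```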

metaclean3.utils.mode(x: int | float)

Returns the mode in vector x.

Args:

x: Any integer/float vector that can be converted to a 1D numpy array.

Returns:

int | float: The mode in vector x.
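The mode is the most frequent value in the vector. A small sketch with the standard library (hypothetical `mode_sketch`; how the library breaks ties between equally frequent values is not documented here, so the example avoids ties):

```python
from collections import Counter


def mode_sketch(x):
    """Return the most frequent value in x (hypothetical helper)."""
    counts = Counter(x)
    value, _count = counts.most_common(1)[0]
    return value


most_frequent = mode_sketch([1, 2, 2, 3, 2, 1])
```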

metaclean3.utils.str_to_time(time_bin: str)

Converts string to time format.

Args:

time_bin (str): Time in string format.

Returns:
tuple:

float: Time value.

str: Time unit.
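The accepted string formats are not listed here. Purely as an illustration of the (value, unit) return shape, the sketch below assumes inputs such as "0.5s" or "10 ms"; both the format and the name `str_to_time_sketch` are assumptions, not the library's behaviour.

```python
import re


def str_to_time_sketch(time_bin):
    """Split a string like "0.5s" into a number and a unit (hypothetical helper)."""
    # Assumed format: a decimal number, optional whitespace, then a letter-only unit.
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]+)\s*", time_bin)
    if match is None:
        raise ValueError(f"Cannot parse time string: {time_bin!r}")
    return float(match.group(1)), match.group(2)


value, unit = str_to_time_sketch("0.5s")
```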

metaclean3.utils.randomize_duplicates(arr: numpy.ndarray, noise_prop: float = 1 / 50000, strict: bool = False, lenient_percent: float = 0.99, seed: int = 123)

If there are any duplicate rows, adds random noise to last column.

Args:

arr (numpy.ndarray): 2D array.

noise_prop (float, optional): Proportion of noise. Defaults to 1/50000.

strict (bool, optional): Whether to find exact duplicate rows. This is slow, so if False, only rows with duplicated row sums are found. Defaults to False.

lenient_percent (float): If this percent of rows are unique, the function returns the original data. Defaults to 0.99.

seed (int): Random seed. Defaults to 123.

Returns:

numpy.ndarray: 2D array with (almost, if not strict) unique rows.

metaclean3.utils.duplicated_rows(arr: numpy.ndarray, strict: bool = False, lenient_percent: float = 1.0)

Identifies duplicate rows in the given 2D array.

Args:

arr (numpy.ndarray): A 2D numpy array.

strict (bool, optional): If set to True, finds rows with duplicate values. If set to False, finds rows whose sum is duplicated. Defaults to False.

lenient_percent (float): If this percent of rows are unique, the function returns the original data. Defaults to 1.0.

Returns:
tuple:

numpy.ndarray: 1D bool numpy array indicating whether each row in arr is a duplicate.

int: Number of duplicate rows in arr.
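The trade-off behind strict is that comparing row sums is cheap but can flag rows that merely share a sum. The sketch below illustrates this with plain lists standing in for the 2D array. It is not the library's code: `duplicated_rows_sketch` is hypothetical, it ignores lenient_percent, and it assumes a row counts as a duplicate when an earlier row has the same key.

```python
def duplicated_rows_sketch(rows, strict=False):
    """Flag rows whose key was already seen in an earlier row (hypothetical helper)."""
    # strict: compare full rows; otherwise compare only the row sums.
    keys = [tuple(row) if strict else sum(row) for row in rows]
    seen = set()
    flags = []
    for key in keys:
        flags.append(key in seen)
        seen.add(key)
    return flags, sum(flags)


rows = [[1, 2], [2, 1], [1, 2]]
strict_flags, strict_count = duplicated_rows_sketch(rows, strict=True)
lenient_flags, lenient_count = duplicated_rows_sketch(rows, strict=False)
```

In this example only the third row repeats an earlier row exactly, but all three rows sum to 3, so the row-sum shortcut also flags the second row.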

metaclean3.utils.round_to_1(x: int | float)

Round a number to the nearest power of 10.

Args:

x (int | float): The number to round.

Returns:

float: The rounded number.

metaclean3.utils.seq_array(size: int, max_val: int = 2000)

Creates an evenly-spaced integer numpy array.

Args:

size (int): Length of desired output.

max_val (int, optional): Desired frequency of output. Defaults to 2000.

Returns:

numpy.ndarray: An evenly-spaced integer numpy array.

metaclean3.utils.order_value_bins(vlist: list | numpy.ndarray | pandas.Series, segments: list | numpy.ndarray | pandas.Series, percentile_diff: float = 0.9)

Order segments by their geometric means and quantiles.

Args:

vlist (list | numpy.ndarray | pandas.Series): 2D array to be sorted.

segments (list | numpy.ndarray | pandas.Series): Segment label of values.

percentile_diff (float, optional): Percentile to be evaluated while sorting. For example, if set to 0.9, we evaluate the 10th and 90th percentiles. Defaults to 0.9.

Returns:

numpy.ndarray: list of sorted indices.

metaclean3.utils.get_timestep(meta: dict, timestep_key: str = '$TIMESTEP')

Extract timestep from flow cytometry standard file meta data.

Args:

meta (dict): Flow cytometry standard file meta data.

timestep_key (str, optional): Timestep key in meta data. Defaults to '$TIMESTEP'.

Returns:

float | None: Timestep if it exists, otherwise None.
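The lookup described above amounts to reading one keyword from the metadata dictionary. A minimal sketch, assuming the value is stored as a string or number that `float` can parse (`get_timestep_sketch` is a hypothetical name, and the example metadata is made up for illustration):

```python
def get_timestep_sketch(meta, timestep_key="$TIMESTEP"):
    """Return the timestep as a float, or None if the key is absent (hypothetical helper)."""
    value = meta.get(timestep_key)
    if value is None:
        return None
    return float(value)


# Illustrative metadata only; real FCS headers contain many more keywords.
with_key = get_timestep_sketch({"$TIMESTEP": "0.01", "$TOT": "10000"})
without_key = get_timestep_sketch({"$TOT": "10000"})
```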

metaclean3.utils._clean_spil_index(spil_df)

metaclean3.utils._clean_spil_names(spil_names, keep_left=True)

metaclean3.utils._clean_spillover(spillover, data_col=None, keep_left=True)

metaclean3.utils.get_spillover_raw(meta: dict, dat_columns: list)

Extract spillover matrix from fcs meta data.

Args:

meta (dict): FCS meta data.

dat_columns (list): Column names of data.

Returns:

pandas.DataFrame: Spillover matrix.

metaclean3.utils.apply_compensation_matrix(data: pandas.DataFrame, spillover: pandas.DataFrame | None)

Apply spillover compensation on data (from fcmdata_helpers).

Spillover columns should match some subset of data columns.

Args:

data (pandas.DataFrame): An event x feature matrix.

spillover (pandas.DataFrame | None): Spillover matrix used to compensate data.

Returns:

pandas.DataFrame: Compensated data for columns specified in spillover.
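Spillover compensation is conventionally computed by multiplying the observed events by the inverse of the spillover matrix. The sketch below shows that convention with plain numpy arrays and made-up two-channel values. It is an assumption that this function follows the same convention, and the sketch omits the column matching between spillover and data.

```python
import numpy as np

# Illustrative values only: rows are fluorochromes, columns are detectors.
spillover = np.array([[1.0, 0.1],
                      [0.2, 1.0]])

# Two hypothetical events, each with signal from one fluorochrome only.
true_signal = np.array([[100.0, 0.0],
                        [0.0, 50.0]])

# What the detectors would record once signal spills into the other channel.
observed = true_signal @ spillover

# Compensation undoes the mixing: observed @ inverse(spillover).
compensated = observed @ np.linalg.inv(spillover)
```

For larger or poorly conditioned matrices, `numpy.linalg.solve` is usually preferred over forming the inverse explicitly.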