triku.tl._triku_functions¶
Module Contents¶
- triku.tl._triku_functions.return_knn_array(object_triku, dist_conn, knn)¶
- triku.tl._triku_functions.get_n_divisions(arr_counts: scipy.sparse.csr.csr_matrix) -> int¶
- triku.tl._triku_functions.return_knn_expression(arr_expression: scipy.sparse.csr.csr_matrix, knn_indices: scipy.sparse.csr.csr_matrix) -> scipy.sparse.csr.csr_matrix¶
Returns an array with the knn expression per gene and cell. The expression per gene is computed as the matrix product of the neighbor indices (the mask) and the expression matrix.
That is, with n_g genes and n_c cells, the matrix product is:
   Mask          Expr         Result
(n_c x n_c) · (n_c x n_g) = (n_c x n_g)
Each entry of the Result matrix then holds the summed expression of that gene over the cell's k nearest neighbors (including the cell itself).
The array is not masked in this step. Previously, after calculating the knn expression, we masked it so that only the cells originally expressing a gene kept their knn expression. That is, for any gene, the knn expression of the cells that were originally not expressing that gene was set to zero. We mask because leaving those cells in produced “dirtier” EMD calculations. However, since the matrices are csr, constructing a masked array requires either building a new matrix and selecting elements from the knn matrix, or deleting existing elements based on the count array; both are time-consuming.
To save that time, the mask is applied per gene in the convolution step, where the expression values have to be selected anyway.
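The matrix product described above can be sketched on toy data. This is a minimal illustration, not the library's implementation: the 4 x 3 expression matrix and the neighbor mask are made up, and the variable names `knn_mask` and `masked_knn_gene` are hypothetical.

```python
import numpy as np
from scipy import sparse

# Toy expression matrix: 4 cells x 3 genes (values invented for illustration).
arr_expression = sparse.csr_matrix(np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 0.0, 5.0],
]))

# Neighbor mask, n_c x n_c: row i has a 1 for cell i itself and for each of
# its neighbors (here one arbitrary neighbor per cell).
knn_mask = sparse.csr_matrix(np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]))

# (n_c x n_c) . (n_c x n_g) = (n_c x n_g): summed expression per neighborhood.
knn_expression = knn_mask.dot(arr_expression)

# Deferred masking: for a single gene, keep only the cells that express it.
gene = 0
expressing = arr_expression[:, gene].toarray().ravel() > 0
masked_knn_gene = knn_expression[:, gene].toarray().ravel()[expressing]
```

Selecting per gene, as in the last three lines, avoids ever materializing a masked copy of the full sparse matrix.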
- triku.tl._triku_functions.compute_conv_idx(counts_gene: numpy.ndarray, knn: int, p_zeros: float) -> Tuple[numpy.ndarray, numpy.ndarray]¶
Given a GENE x CELL matrix, and an index to select from, calculates the convolution of reads for that gene index. The function returns the
- triku.tl._triku_functions.calculate_emd(knn_counts: numpy.ndarray, x_conv: numpy.ndarray, y_conv: numpy.ndarray, n_divisions: int) -> Tuple[numpy.ndarray, numpy.ndarray]¶
Returns the “normalized” earth mover's distance (EMD). The function calculates the x positions and probabilities of the “real” dataset from knn_counts; the x positions and probabilities of the convolution are passed in as arguments (x_conv and y_conv).
The distance is normalized by dividing it by the standard deviation of the convolution. Since the convolution is given as a distribution rather than as a sample, its mean and variance have to be calculated “by hand”.
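The “by hand” moments and the normalization can be sketched as follows. This is an illustration under stated assumptions: both distributions live on the same evenly spaced grid, the helper names `distribution_mean_std` and `emd_same_grid` are hypothetical, and the EMD is computed with the standard 1-D cumulative-sum formula, which may differ from what `calculate_emd` does internally.

```python
import numpy as np

def distribution_mean_std(x, p):
    """Mean and standard deviation of a discrete distribution (positions x, probabilities p)."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                        # make sure the probabilities sum to 1
    mean = np.sum(x * p)
    var = np.sum(p * (x - mean) ** 2)
    return mean, np.sqrt(var)

def emd_same_grid(p, q, step=1.0):
    """1-D EMD between two distributions on a shared evenly spaced grid."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # In one dimension the EMD is the area between the two cumulative distributions.
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * step

x_conv = np.array([0.0, 1.0, 2.0, 3.0])
y_conv = np.array([0.25, 0.25, 0.25, 0.25])   # toy convolution distribution
y_real = np.array([0.0, 0.5, 0.5, 0.0])       # toy observed distribution

mean_conv, std_conv = distribution_mean_std(x_conv, y_conv)
emd = emd_same_grid(y_real, y_conv)
normalized_emd = emd / std_conv
```

In this toy case the mean is 1.5, the variance 1.25 and the unnormalized EMD 0.5 (a quarter of the mass moves one step inward from each end).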
- triku.tl._triku_functions.compute_convolution_and_emd(array_counts_csc: scipy.sparse.csc.csc_matrix, array_knn_counts_csc: scipy.sparse.csc.csc_matrix, idx: int, knn: int, min_knn: int, n_divisions: int) -> numpy.ndarray¶
Calculate the convolution and emd given the array with counts and with knn counts. To do the convolution we will select the gene column from each array.
From the array of counts we will simply select the values, and from the array of knn counts we will select the values of the indices from the array of counts (arr_counts[:, idx].indices).
Then the array is converted to integers using the n_divisions argument, which splits each unit of expression into n_divisions bins. For instance, if the expression of a gene is 5.23 and 5 bins are set, the new expression is int(5.23 * 5) = int(26.15) = 26, which corresponds to 26 / 5 = 5.2 on the original scale (so we lose 0.03 of expression). This is a scaling step.
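The binning arithmetic in the example can be checked directly. The helper name `bin_expression` is hypothetical and only mirrors the int(value * n_divisions) step described here.

```python
def bin_expression(value, n_divisions):
    """Scale a value to an integer bin index, truncating toward zero."""
    return int(value * n_divisions)

n_divisions = 5
binned = bin_expression(5.23, n_divisions)   # integer used after scaling
recovered = binned / n_divisions             # same value on the original scale
lost = 5.23 - recovered                      # expression discarded by truncation
```

A larger n_divisions gives finer bins and therefore a smaller truncation loss, at the cost of larger integer values.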
To make sparse array access faster we work directly with the csr_matrix.data, csr_matrix.indptr and csr_matrix.indices attributes. This makes the code a bit obscure, but makes the selection faster (or at least guarantees it is not slower).
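As a rough sketch of what working with these attributes looks like, a column of a CSC matrix can be read straight from data, indptr and indices (CSR matrices expose the same three attributes, with rows and columns swapped). The helper `csc_column` and the toy matrices are hypothetical, and the sketch assumes the stored indices are sorted within each column, which holds for matrices built from dense arrays as below.

```python
import numpy as np
from scipy import sparse

def csc_column(mat_csc, idx):
    """Return (row_indices, values) of the stored entries in column idx of a CSC matrix."""
    start, end = mat_csc.indptr[idx], mat_csc.indptr[idx + 1]
    return mat_csc.indices[start:end], mat_csc.data[start:end]

# Toy cells x genes matrices (values invented for illustration).
counts = sparse.csc_matrix(np.array([
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 0.0],
]))
knn_counts = sparse.csc_matrix(np.array([
    [4.0, 2.0],
    [1.0, 2.0],
    [3.0, 0.0],
]))

idx = 0
rows, gene_counts = csc_column(counts, idx)

# knn counts of the same gene, restricted to the cells that express it.
knn_rows, knn_vals = csc_column(knn_counts, idx)
knn_gene_counts = knn_vals[np.isin(knn_rows, rows)]
```

Slicing the three arrays this way avoids constructing an intermediate sparse matrix for every gene.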
- triku.tl._triku_functions.emd_calculation(array_counts_csc: scipy.sparse.csr.csr_matrix, array_knn_counts_csc: scipy.sparse.csr.csr_matrix, knn: int, min_knn: int, n_divisions: int) -> Tuple[list, list, numpy.ndarray]¶
Calculation of convolution for each gene, and its emd. To do that we call compute_convolution_and_emd which, in turn, calls compute_conv_idx to calculate the convolution of the reads; and calculate_emd, to calculate the emd between the convolution and the knn_counts.
Since we are working with the counts of each gene, instead of each cell, we convert array_knn_counts and array_counts to csc form. This conversion takes some time and memory, but it saves a lot of time afterwards when doing the column indexing. For example, with a 50000 x 10000 matrix, converting csr -> csc and indexing the csc matrix takes 8 s, whereas indexing the csr matrix directly takes 30 min.
To make things faster we use ray parallelization. Ray selects the counts and knn counts on each gene, and computes the convolution and distance. The output result is, for each gene, the convolution distribution (x, and probabilities), and the distances.
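The overall pattern (convert once to csc, then process each gene column independently) can be sketched as below. Two things here are assumptions: a standard-library thread pool stands in for ray so the example has no extra dependency, and `per_gene_summary` is a hypothetical placeholder task that only sums each column instead of computing a convolution and a distance.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from scipy import sparse

def per_gene_summary(counts_csc, knn_counts_csc, idx):
    """Placeholder per-gene task: sums the counts and knn counts of gene idx."""
    col = counts_csc[:, idx]
    knn_col = knn_counts_csc[:, idx]
    return float(col.sum()), float(knn_col.sum())

# Toy cells x genes matrices in CSR form (values invented for illustration).
counts_csr = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]))
knn_counts_csr = sparse.csr_matrix(np.array([[1.0, 3.0, 2.0], [1.0, 3.0, 2.0]]))

# Convert once; column slicing is cheap on CSC and expensive on CSR.
counts_csc = counts_csr.tocsc()
knn_counts_csc = knn_counts_csr.tocsc()

n_genes = counts_csc.shape[1]
with ThreadPoolExecutor(max_workers=2) as pool:
    # One independent task per gene; results come back in gene order.
    results = list(pool.map(
        lambda i: per_gene_summary(counts_csc, knn_counts_csc, i),
        range(n_genes),
    ))
```

Because each gene is processed independently, the same structure maps onto any task-parallel backend.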
- triku.tl._triku_functions.subtract_median(x, y, n_windows, distance_correction)¶
When working with EMD, we want to find genes whose EMD deviates more than that of other genes with similar mean expression. EMD tends to increase with expression. To remove that baseline, we subtract the median EMD from the genes within each of a number of windows. The approach is quite reliable between 15 and 80 windows.
Too many windows can over-normalize and lose genes that have a high EMD but are alone in their window.
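A minimal sketch of windowed median subtraction follows. It assumes that x is the mean expression and y the EMD, and that windows are formed by splitting the genes, sorted by x, into equally sized groups; the helper name `subtract_window_median` is hypothetical, and the distance_correction argument of the real function is not modeled.

```python
import numpy as np

def subtract_window_median(x, y, n_windows):
    """Subtract from y the median of y within each window of genes sorted by x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)                    # genes sorted by mean expression
    corrected = np.empty_like(y)
    for window in np.array_split(order, n_windows):
        if window.size:                      # skip empty windows
            corrected[window] = y[window] - np.median(y[window])
    return corrected

mean_expression = np.array([0.1, 0.2, 0.3, 5.0, 6.0, 7.0])
emd = np.array([1.0, 1.2, 3.0, 10.0, 10.5, 11.0])

corrected = subtract_window_median(mean_expression, emd, n_windows=2)

# With one gene per window every gene equals its own median, so all signal is lost.
over_normalized = subtract_window_median(mean_expression, emd, n_windows=6)
```

The second call shows the over-normalization risk: a gene that is alone in its window always ends up with a corrected value of zero, however high its EMD.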
- triku.tl._triku_functions.get_cutoff_curve(y, s) -> float¶
Plots a curve and finds the best point by joining the extremes of the curve with a straight line and selecting the point on the curve that lies farthest from that line. The distance between a point on the curve and the line is measured through the point's orthogonal projection onto the line: if (u, v) is the point on the curve and y = mx + b is the line, the projection has x coordinate x_opt = (u - mb + mv) / (1 + m^2).
Here the y argument refers to the EMD distances (preferably after median subtraction). Those distances are sorted, and the curve is extracted from there.
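The geometric step can be sketched as follows. This is an illustration only: the helper name `cutoff_by_max_distance` is hypothetical, the curve is taken to be the sorted values plotted against their rank, no plotting is done, and the s argument of the real function is not modeled.

```python
import numpy as np

def cutoff_by_max_distance(y):
    """Return (index, value) of the sorted-curve point farthest from the line joining its extremes."""
    v = np.sort(np.asarray(y, dtype=float))      # curve heights
    u = np.arange(v.size, dtype=float)           # curve positions (rank)
    m = (v[-1] - v[0]) / (u[-1] - u[0])          # slope of the line joining the extremes
    b = v[0] - m * u[0]                          # intercept of that line
    x_opt = (u - m * b + m * v) / (1 + m ** 2)   # projection of each (u, v) onto the line
    y_opt = m * x_opt + b
    dist = np.sqrt((u - x_opt) ** 2 + (v - y_opt) ** 2)
    i = int(np.argmax(dist))
    return i, float(v[i])

emd_values = [0.3, 5.0, 0.1, 0.4, 0.2]           # toy distances, unsorted
index, cutoff = cutoff_by_max_distance(emd_values)
```

On this toy input the selected point is the last value before the sharp rise, which is the usual "elbow" of the sorted curve. Because the two axes have different units, rescaling either one can change which point is selected.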