Run a k-nearest neighbors analysis on one specific profile

The function runs a k-nearest neighbors analysis on one specific profile. The function uses the knn() function from the 'class' package.

computeKNNRefSample(
  listEigenvector,
  listCatPop = c("EAS", "EUR", "AFR", "AMR", "SAS"),
  spRef,
  fieldPopInfAnc = "SuperPop",
  kList = seq(2, 15, 1),
  pcaList = seq(2, 15, 1)
)

Arguments

listEigenvector: a list with 3 entries: 'sample.id', 'eigenvector.ref' and 'eigenvector'. The list represents the PCA done on the 1KG reference profiles and one specific profile projected onto it. The 'sample.id' entry must contain only one identifier (one profile).
listCatPop: a vector of character strings representing the list of possible ancestry assignments. Default: c("EAS", "EUR", "AFR", "AMR", "SAS").
spRef: a vector of character strings representing the known super population ancestry for the 1KG profiles. The 1KG profile identifiers are used as names for the vector.
fieldPopInfAnc: a character string representing the name of the column that will contain the inferred ancestry for the specified profile. Default: "SuperPop".
kList: a vector of integers representing the list of values tested for the K parameter. The K parameter represents the number of neighbors used in the k-nearest neighbors analysis. If NULL, the value seq(2, 15, 1) is assigned. Default: seq(2, 15, 1).
pcaList: a vector of integers representing the list of values tested for the D parameter. The D parameter represents the number of PCA dimensions used in the analysis. If NULL, the value seq(2, 15, 1) is assigned. Default: seq(2, 15, 1).
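
The roles of the K and D parameters can be illustrated with a minimal sketch. It is not the package's implementation: it assumes the classifier is knn() from the 'class' package, and the reference coordinates (refPC), reference labels (refPop) and projected profile (profilePC) are hypothetical simulated values, not 1KG data.

```r
library(class)

set.seed(42)

## Hypothetical reference eigenvectors: 20 profiles x 3 principal components,
## simulated as two well-separated groups (assumption, not real 1KG data)
refPC <- rbind(matrix(rnorm(30, mean = 0, sd = 0.1), ncol = 3),
               matrix(rnorm(30, mean = 5, sd = 0.1), ncol = 3))
rownames(refPC) <- paste0("ref", seq_len(20))

## Hypothetical known ancestry for the reference profiles
refPop <- factor(rep(c("EUR", "AFR"), each = 10), levels = c("EUR", "AFR"))

## One hypothetical profile projected close to the second group
profilePC <- matrix(c(5, 5, 5), nrow = 1, dimnames = list("profile1", NULL))

## D selects how many principal components are used;
## K is the number of neighbors that vote on the ancestry
D <- 2
K <- 5
pred <- knn(train = refPC[, seq_len(D), drop = FALSE],
            test = profilePC[, seq_len(D), drop = FALSE],
            cl = refPop, k = K)

as.character(pred)
```

Here the five nearest reference profiles in the first two dimensions all belong to the second group, so the profile is assigned "AFR". Repeating the call for each tested D and K yields one inference per pair.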

Value

a list containing the following entries:

sample.id

a vector of character strings representing the identifier of the profile analysed.

matKNN

a data.frame containing the super population inference for the profile for different numbers of PCA dimensions D and numbers of neighbors K. The title of the fourth column corresponds to the value of the fieldPopInfAnc parameter. The data.frame contains 4 columns:

sample.id: a character string representing the identifier of the profile analysed.
D: a numeric representing the number of PCA dimensions used to infer the ancestry.
K: a numeric representing the number of neighbors used to infer the ancestry.
fieldPopInfAnc: a character string representing the inferred ancestry.
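
As a rough guide to the size and ordering of matKNN, the tested pairs can be sketched with expand.grid(). This assumes one row per (D, K) combination with K varying fastest, which is what the example output in the Examples section suggests; the object name grid is only illustrative.

```r
## Default values of the two parameters
kList <- seq(2, 15, 1)
pcaList <- seq(2, 15, 1)

## expand.grid() varies its first argument fastest, so K changes
## within each value of D; the columns are then reordered as D, K
grid <- expand.grid(K = kList, D = pcaList)[, c("D", "K")]

## Number of (D, K) pairs tested with the default values
nrow(grid)

head(grid)
```

With the default values, 14 values of D and 14 values of K give 196 pairs.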

Author

Pascal Belleau, Astrid Deschênes and Alexander Krasnitz

Examples


## Load the demo PCA on the synthetic profiles projected on the
## demo 1KG reference PCA
data(demoPCASyntheticProfiles)

## Load the known ancestry for the demo 1KG reference profiles
data(demoKnownSuperPop1KG)

## The PCA with 1 profile projected on the 1KG reference PCA
## Only one profile is retained
pca <- demoPCASyntheticProfiles
pca$sample.id <- pca$sample.id[1]
pca$eigenvector <- pca$eigenvector[1, , drop=FALSE]

## Run the k-nearest neighbors analysis on the profile projected on the 1KG PCA
results <- computeKNNRefSample(listEigenvector=pca,
    listCatPop=c("EAS", "EUR", "AFR", "AMR", "SAS"),
    spRef=demoKnownSuperPop1KG, fieldPopInfAnc="SuperPop",
    kList=seq(10, 15, 1), pcaList=seq(10, 15, 1))

## The ancestry assigned to the profile for different values of D and K
head(results$matKNN)
#>         sample.id  D  K SuperPop
#> 1 1.ex1.HG00246.1 10 10      SAS
#> 2 1.ex1.HG00246.1 10 11      SAS
#> 3 1.ex1.HG00246.1 10 12      SAS
#> 4 1.ex1.HG00246.1 10 13      SAS
#> 5 1.ex1.HG00246.1 10 14      SAS
#> 6 1.ex1.HG00246.1 10 15      EAS
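
The inferred ancestry can differ between (D, K) pairs, as rows 5 and 6 above show. One possible way to inspect the calls is sketched below. So that the snippet runs without the package, it rebuilds only the six rows printed above in a data.frame; with the real output, results$matKNN would be used instead.

```r
## The six rows copied from the head() output above
matKNN <- data.frame(
    sample.id = rep("1.ex1.HG00246.1", 6),
    D = rep(10, 6),
    K = 10:15,
    SuperPop = c("SAS", "SAS", "SAS", "SAS", "SAS", "EAS"),
    stringsAsFactors = FALSE)

## Count how often each ancestry is called across the listed (D, K) pairs
callCounts <- table(matKNN$SuperPop)
callCounts

## Extract the call for one specific (D, K) pair
callD10K12 <- matKNN[matKNN$D == 10 & matKNN$K == 12, "SuperPop"]
callD10K12
```

For these six rows, "SAS" is called five times and "EAS" once, and the call for D = 10 and K = 12 is "SAS".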