Run a k-nearest neighbors analysis on a subset of the synthetic dataset

The function runs a k-nearest neighbors analysis on a subset of the synthetic data set. The classification relies on the knn function from the 'class' package.
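As a rough illustration of the underlying technique, the sketch below classifies two query points from labelled reference points with class::knn. The coordinates, labels and variable names (refCoord, refPop, queryCoord) are invented for this example and stand in for PCA coordinates; this is not the code used inside the function.

```r
# Toy illustration only: hypothetical 2-D "PCA" coordinates, not RAIDS data.
library(class)

set.seed(42)
# Reference profiles: two well-separated clusters, one per label
refCoord <- rbind(
    matrix(rnorm(20, mean = 0, sd = 0.1), ncol = 2),
    matrix(rnorm(20, mean = 5, sd = 0.1), ncol = 2))
refPop <- factor(rep(c("EUR", "AFR"), each = 10))

# Two profiles to classify, one placed near each cluster
queryCoord <- rbind(c(0.05, -0.02), c(4.9, 5.1))

# Each query profile receives the majority label among its k nearest
# reference profiles (Euclidean distance)
inferred <- knn(train = refCoord, test = queryCoord, cl = refPop, k = 3)
print(as.character(inferred))
```

With clusters this far apart, the first query point is labelled from the first cluster and the second from the second cluster, whatever the value of k up to the cluster size.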

computeKNNRefSynthetic(
  gdsProfile,
  listEigenvector,
  listCatPop = c("EAS", "EUR", "AFR", "AMR", "SAS"),
  studyIDSyn,
  spRef,
  fieldPopInfAnc = "SuperPop",
  kList = seq(2, 15, 1),
  pcaList = seq(2, 15, 1)
)

Arguments

gdsProfile: an object of class SNPRelate::SNPGDSFileClass, the opened Profile GDS file.
listEigenvector: a list with 3 entries: 'sample.id', 'eigenvector.ref' and 'eigenvector'. The list represents the PCA done on the 1KG reference profiles and the synthetic profiles projected onto it.
listCatPop: a vector of character strings representing the possible ancestry assignments. Default: c("EAS", "EUR", "AFR", "AMR", "SAS").
studyIDSyn: a character string corresponding to the study identifier. The study identifier must be present in the Profile GDS file.
spRef: a vector of character strings representing the known super population ancestry for the 1KG profiles. The 1KG profile identifiers are used as names for the vector.
fieldPopInfAnc: a character string representing the name of the column that will contain the inferred ancestry for the specified data set. Default: "SuperPop".
kList: a vector of integers representing the values tested for the K parameter. The K parameter represents the number of neighbors used in the k-nearest neighbors analysis. If NULL, the value seq(2, 15, 1) is assigned. Default: seq(2, 15, 1).
pcaList: a vector of integers representing the values tested for the D parameter. The D parameter represents the number of PCA dimensions used in the analysis. If NULL, the value seq(2, 15, 1) is assigned. Default: seq(2, 15, 1).
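To get a feel for how kList and pcaList combine, the sketch below enumerates the (D, K) pairs implied by the default values. It assumes that every value of D is paired with every value of K, which is consistent with the one-row-per-profile-per-(D, K) layout of matKNN; the name paramGrid is made up for this example.

```r
# Illustration only: enumerate the (D, K) combinations implied by the defaults.
kList <- seq(2, 15, 1)
pcaList <- seq(2, 15, 1)

# expand.grid varies its first factor fastest, so K cycles within each D
paramGrid <- expand.grid(K = kList, D = pcaList)[, c("D", "K")]

# 14 values of D times 14 values of K
nrow(paramGrid)
head(paramGrid, 3)
```

Under that assumption, the defaults lead to 196 classifications per synthetic profile, so narrowing either vector reduces the size of the output proportionally.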

Value

a list containing 4 entries:

sample.id

a vector of character strings representing the identifiers of the synthetic profiles analysed.

sample1Kg

a vector of character strings representing the identifiers of the 1KG reference profiles used to generate the synthetic profiles.

sp

a vector of character strings representing the known super population ancestry of the 1KG reference profiles used to generate the synthetic profiles.

matKNN

a data.frame containing the super population inference for each synthetic profile, for each combination of the number of PCA dimensions D and the number of neighbors K. The title of the fourth column corresponds to the fieldPopInfAnc parameter. The data.frame contains 4 columns:

sample.id: a character string representing the identifier of the synthetic profile analysed.
D: a numeric value representing the number of PCA dimensions used to infer the super population.
K: a numeric value representing the number of neighbors used to infer the super population.
fieldPopInfAnc value: a character string representing the inferred ancestry.
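One possible way to use a table shaped like matKNN is to score each (D, K) pair by the fraction of synthetic profiles whose inferred ancestry matches the known ancestry. The sketch below does this on a small hand-made data.frame; toyKNN, knownPop and the sample identifiers are hypothetical, and how the known ancestry is matched to each synthetic profile in real output is an assumption here.

```r
# Illustration only: a hand-made table shaped like matKNN, not real output.
toyKNN <- data.frame(
    sample.id = rep(c("syn1", "syn2"), each = 4),
    D = rep(c(2, 2, 3, 3), times = 2),
    K = rep(c(2, 3, 2, 3), times = 2),
    SuperPop = c("EUR", "EUR", "EUR", "AFR",
                 "AFR", "EUR", "AFR", "AFR"),
    stringsAsFactors = FALSE)

# Hypothetical known ancestry of the synthetic profiles, named by sample.id
knownPop <- c(syn1 = "EUR", syn2 = "AFR")

# Flag each call as correct or not, then average within each (D, K) pair
toyKNN$correct <- unname(toyKNN$SuperPop == knownPop[toyKNN$sample.id])
accuracy <- aggregate(correct ~ D + K, data = toyKNN, FUN = mean)
print(accuracy)
```

The resulting table has one row per (D, K) pair, which makes it straightforward to pick the pair with the highest agreement.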

Author

Pascal Belleau, Astrid Deschênes and Alexander Krasnitz

Examples


## Required libraries
library(gdsfmt)
library(SNPRelate)

## Load the demo PCA on the synthetic profiles projected on the
## demo 1KG reference PCA
data(demoPCASyntheticProfiles)

## Load the known ancestry for the demo 1KG reference profiles
data(demoKnownSuperPop1KG)

## Path to the demo Profile GDS file included in this package
dataDir <- system.file("extdata/demoKNNSynthetic", package="RAIDS")

## Open the Profile GDS file
gdsProfile <- snpgdsOpen(file.path(dataDir, "ex1.gds"))

## The name of the synthetic study
studyID <- "MYDATA.Synthetic"

## Run the k-nearest neighbors analysis on the synthetic profiles
## for each combination of D and K
results <- computeKNNRefSynthetic(gdsProfile=gdsProfile,
    listEigenvector=demoPCASyntheticProfiles,
    listCatPop=c("EAS", "EUR", "AFR", "AMR", "SAS"), studyIDSyn=studyID,
    spRef=demoKnownSuperPop1KG)

## The inferred ancestry for the synthetic profiles for different values
## of D and K
head(results$matKNN)
#>         sample.id D K SuperPop
#> 1 1.ex1.HG00246.1 2 2      SAS
#> 2 1.ex1.HG00246.1 2 3      AMR
#> 3 1.ex1.HG00246.1 2 4      AMR
#> 4 1.ex1.HG00246.1 2 5      EUR
#> 5 1.ex1.HG00246.1 2 6      EAS
#> 6 1.ex1.HG00246.1 2 7      EAS

## Close Profile GDS file (important)
closefn.gds(gdsProfile)