R/processStudy.R
computeKNNRefSynthetic.Rd
The function runs a k-nearest neighbors analysis on a subset of the synthetic data set. The function uses the knn() function from the 'class' package.
gdsProfile: an object of class
SNPRelate::SNPGDSFileClass, the
opened Profile GDS file.
listEigenvector: a list with 3 entries:
'sample.id', 'eigenvector.ref' and 'eigenvector'. The list represents
the PCA done on the 1KG reference profiles and the synthetic profiles
projected onto it.
listCatPop: a vector of character strings
representing the list of possible ancestry assignments. Default:
c("EAS", "EUR", "AFR", "AMR", "SAS").
studyIDSyn: a character string corresponding to the study
identifier.
The study identifier must be present in the Profile GDS file.
spRef: a vector of character strings representing the
known super population ancestry for the 1KG profiles. The 1KG profile
identifiers are used as names for the vector.
fieldPopInfAnc: a character string representing the name of
the column that will contain the inferred ancestry for the specified
data set. Default: "SuperPop".
kList: a vector of integers representing the list of
values tested for the K parameter. The K parameter represents the
number of neighbors used in the K-nearest neighbors analysis. If
NULL, the value seq(2, 15, 1) is assigned.
Default: seq(2, 15, 1).
pcaList: a vector of integers representing the list of
values tested for the D parameter. The D parameter represents the
number of dimensions used in the PCA analysis. If NULL,
the value seq(2, 15, 1) is assigned.
Default: seq(2, 15, 1).
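The way the K and D grids combine can be illustrated with a minimal sketch on simulated data. It is not the package's implementation: it assumes the classifier is knn() from the 'class' package, and every object name ending in "Toy" (as well as the simulated eigenvectors) is hypothetical. For each number of PCA dimensions D and each number of neighbors K, the reference profiles serve as the training set and the synthetic profiles as the test set.

```r
library(class)

set.seed(42)

## Simulated stand-ins for the PCA entries (not RAIDS data):
## 40 reference profiles and 6 synthetic profiles on 5 PCA dimensions
nRef <- 40
nSyn <- 6
eigenvectorRefToy <- matrix(rnorm(nRef * 5), nrow=nRef,
    dimnames=list(paste0("ref", seq_len(nRef)), NULL))
eigenvectorSynToy <- matrix(rnorm(nSyn * 5), nrow=nSyn,
    dimnames=list(paste0("syn", seq_len(nSyn)), NULL))

## Known super population of each reference profile, named by profile
popToy <- c("EAS", "EUR", "AFR", "AMR", "SAS")
spRefToy <- setNames(rep(popToy, each=8), rownames(eigenvectorRefToy))

## Shortened, hypothetical grids for D and K
pcaListToy <- seq(2, 4, 1)
kListToy <- seq(2, 5, 1)

## One row per synthetic profile for every (D, K) combination
matKNNToy <- do.call(rbind, lapply(pcaListToy, function(pcaD) {
    do.call(rbind, lapply(kListToy, function(k) {
        ## Classify the synthetic profiles using the first pcaD dimensions
        pred <- knn(
            train=eigenvectorRefToy[, seq_len(pcaD), drop=FALSE],
            test=eigenvectorSynToy[, seq_len(pcaD), drop=FALSE],
            cl=factor(unname(spRefToy[rownames(eigenvectorRefToy)]),
                levels=popToy),
            k=k, prob=FALSE)
        data.frame(sample.id=rownames(eigenvectorSynToy), D=pcaD, K=k,
            SuperPop=as.character(pred), stringsAsFactors=FALSE)
    }))
}))

head(matKNNToy)
```

With 3 values of D, 4 values of K and 6 synthetic profiles, the toy table has 72 rows, one inferred ancestry per profile and parameter pair.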
a list containing 4 entries:
sample.id: a vector of character strings
representing the identifiers of the synthetic profiles analysed.
sample1Kg: a vector of character strings
representing the identifiers of the 1KG reference profiles used to
generate the synthetic profiles.
sp: a vector of character strings representing
the known super population ancestry of the 1KG reference profiles used
to generate the synthetic profiles.
matKNN: a data.frame containing the super population
inference for each synthetic profile for different values of PCA
dimensions D and k-neighbors values K. The fourth column title
corresponds to the fieldPopInfAnc parameter.
The data.frame contains 4 columns:
sample.id: a character string representing
the identifier of the synthetic profile analysed.
D: a numeric representing
the value of the PCA dimension used to infer the super population.
K: a numeric representing
the value of the k-neighbors used to infer the super population.
fieldPopInfAnc value: a character string representing
the inferred ancestry.
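One possible use of the returned list is to compare the inferred ancestry with the known ancestry for each (D, K) pair. The sketch below works on a small made-up list that only mimics the documented structure. The identifiers and values are hypothetical, and it assumes that the sample.id and sp entries are in the same order.

```r
## Made-up list mimicking the documented structure (hypothetical values)
resultsToy <- list(
    sample.id=c("syn1", "syn2"),
    sample1Kg=c("ref1", "ref2"),
    sp=c("EUR", "AFR"),
    matKNN=data.frame(
        sample.id=rep(c("syn1", "syn2"), times=4),
        D=rep(c(2, 2, 3, 3), each=2),
        K=rep(c(2, 3, 2, 3), each=2),
        SuperPop=c("EUR", "AFR", "EUR", "EUR",
            "EAS", "AFR", "EUR", "AFR"),
        stringsAsFactors=FALSE))

## Known ancestry indexed by synthetic profile identifier
## (assumes sample.id and sp are parallel vectors)
knownSP <- setNames(resultsToy$sp, resultsToy$sample.id)

## Flag each inference as correct or not
matKNNToy <- resultsToy$matKNN
matKNNToy$correct <- matKNNToy$SuperPop ==
    unname(knownSP[matKNNToy$sample.id])

## Fraction of correct calls for every (D, K) combination
accuracyToy <- aggregate(correct ~ D + K, data=matKNNToy, FUN=mean)
accuracyToy <- accuracyToy[order(accuracyToy$D, accuracyToy$K), ]
accuracyToy
```

A table like this makes it straightforward to pick the D and K values that best recover the known ancestry of the synthetic profiles.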
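## Required libraries
## (snpgdsOpen() comes from SNPRelate; closefn.gds() comes from gdsfmt)
library(RAIDS)
library(gdsfmt)
library(SNPRelate)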
## Load the demo PCA on the synthetic profiles projected on the
## demo 1KG reference PCA
data(demoPCASyntheticProfiles)
## Load the known ancestry for the demo 1KG reference profiles
data(demoKnownSuperPop1KG)
## Path to the demo Profile GDS file located in this package
dataDir <- system.file("extdata/demoKNNSynthetic", package="RAIDS")
## Open the Profile GDS file
gdsProfile <- snpgdsOpen(file.path(dataDir, "ex1.gds"))
## The name of the synthetic study
studyID <- "MYDATA.Synthetic"
## Run the k-nearest neighbors analysis on the synthetic profiles
## projected onto the 1KG PCA
results <- computeKNNRefSynthetic(gdsProfile=gdsProfile,
listEigenvector=demoPCASyntheticProfiles,
listCatPop=c("EAS", "EUR", "AFR", "AMR", "SAS"), studyIDSyn=studyID,
spRef=demoKnownSuperPop1KG)
## The inferred ancestry for the synthetic profiles for different values
## of D and K
head(results$matKNN)
#> sample.id D K SuperPop
#> 1 1.ex1.HG00246.1 2 2 SAS
#> 2 1.ex1.HG00246.1 2 3 AMR
#> 3 1.ex1.HG00246.1 2 4 AMR
#> 4 1.ex1.HG00246.1 2 5 EUR
#> 5 1.ex1.HG00246.1 2 6 EAS
#> 6 1.ex1.HG00246.1 2 7 EAS
## Close Profile GDS file (important)
closefn.gds(gdsProfile)