Participants
Study participants came from two clinically-ascertained Cardiff University cohorts, the National Centre for Mental Health (NCMH) and CardiffCOGS, and from the UK Biobank. All participants provided written informed consent. Table 1 provides information on the assessments used to determine diagnosis in each sample. A flowchart of the samples, methods and number of participants recruited is shown in Fig. 1.
NCMH
NCMH participants were recruited via health care services, voluntary organisations or public advertisement29. Trained researchers administered a brief standardized assessment to gather demographic and clinical information, and participants were asked to provide a sample for DNA extraction and genetic analyses. Participants self-reporting a schizophrenia, psychosis or affective diagnosis were invited to take part in a research interview based on the Schedules for Clinical Assessment in Neuropsychiatry (SCAN)30. NCMH received approval from the Health Research Authority and Wales Research Ethics Committee (REC) 2 (16/WA/0323).
CardiffCOGS
CardiffCOGS participants were recruited from community, in-patient and voluntary sector mental health services across the UK31. All participants completed a SCAN-based research interview30, underwent a case-note review and were asked to provide a sample for DNA extraction and genetic analyses. CardiffCOGS received approval from Southeast Wales REC (07/WSE03/110). CardiffCOGS participants were included to increase the sample size for the genetic analysis. These participants were not included in comparisons of self-report and research diagnoses, as self-report diagnoses are not available in this sample.
UK Biobank
UK Biobank is a population-based UK cohort of around 500,000 participants, aged 40–69 at recruitment32. Participants completed a range of assessments and provided a sample for genetic analysis. Ethical approval was granted by the Northwest Multi-Centre Ethics Committee. This study was conducted under UK Biobank project number 13310.
Diagnosis definitions
Table 1 provides an overview of the self-reported, research interview and medical record diagnosis definitions used in this study.
Self-reported diagnosis
In NCMH, participants were asked whether a doctor or health professional had ever told them that they had a mental health diagnosis and were prompted with a list of psychiatric diagnoses to choose from (Supplementary Fig. 1). In UK Biobank, participants were asked in the initial assessment to report whether a doctor had told them they had any serious medical condition. A subset of participants in the UK Biobank (31%) completed the Mental Health Questionnaire (MHQ), where they were prompted with a list of psychiatric diagnoses to choose from (Supplementary Fig. 2). For both NCMH and UK Biobank, participants who chose schizophrenia from the list or verbally self-reported a schizophrenia diagnosis were assigned a schizophrenia self-reported diagnosis in this study. Table 1 describes the subtypes of self-reported diagnoses available in the clinically-ascertained sample. A self-reported schizoaffective disorder diagnosis was excluded from analyses, as it was not possible to differentiate between the depressed and manic subtypes.
Research interview diagnosis
In the clinically-ascertained samples (NCMH and CardiffCOGS), DSM-IV, DSM-5, and ICD-10 research diagnoses were derived from a SCAN-based clinical interview and note review where available. A research interview diagnosis of schizophrenia was given in this study if either DSM or ICD schizophrenia criteria were met. Participants who met criteria for schizoaffective disorder depressed-type (SA-D) were included alongside participants with schizophrenia, given evidence that these participants do not differ on a range of phenotypic and genotypic measures, including symptoms, cognition and polygenic risk33. ‘Other psychotic disorders’ in this study refers to the following diagnoses: psychosis not otherwise specified, schizophreniform disorder, delusional disorder, brief psychotic disorder, acute polymorphic disorder, and other psychotic illness.
Medical record diagnosis
In UK Biobank, a medical record diagnosis of schizophrenia or SA-D was defined as an F20 or F25.1 ICD-10 code, respectively, from national hospital admission records or death records, or an equivalent Read code from primary care (Supplementary Table 1). Hospital records date back to 1997 for England, 1998 for Wales and 1981 for Scotland and contain coded data on admissions, operations, and procedures. Primary care data were obtained for approximately 45% of the UK Biobank cohort. In secondary analyses, hospital admissions for schizophrenia were further subdivided into primary and secondary admissions. Primary ICD-10 codes represent conditions that caused the admission, and secondary ICD-10 codes represent conditions that coexist at the time of admission, affect the treatment received, or develop after admission.
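The code-based definition above can be illustrated with a minimal Python sketch. It is not the study's code: the helper names, the record layout (a code paired with a "primary" or "secondary" label) and the handling of dotted versus undotted code formats are all assumptions made for illustration.

```python
# Illustrative sketch only; names and record layout are hypothetical.
SCHIZOPHRENIA_PREFIX = "F20"  # F20 and its subcodes (schizophrenia)
SA_D_PREFIX = "F251"          # F25.1, schizoaffective disorder, depressive type


def normalise(code: str) -> str:
    """Upper-case an ICD-10 code and drop the dot, so 'f25.1' becomes 'F251'."""
    return code.strip().upper().replace(".", "")


def is_scz_or_sad(code: str) -> bool:
    """True if the code falls under F20 or F25.1."""
    c = normalise(code)
    return c.startswith(SCHIZOPHRENIA_PREFIX) or c.startswith(SA_D_PREFIX)


def count_admissions(admissions):
    """Count qualifying admissions by diagnostic position.

    admissions: iterable of (icd10_code, position) pairs, where position
    is 'primary' or 'secondary' (an assumed layout).
    """
    counts = {"primary": 0, "secondary": 0}
    for code, position in admissions:
        if is_scz_or_sad(code):
            counts[position] += 1
    return counts
```

With this sketch, an F25.0 code (schizoaffective disorder, manic type) is not counted, which mirrors the restriction to F20 and F25.1 described in the text.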
Unaffected controls
Unaffected controls for the clinically-ascertained samples were NCMH participants with no history of a mental health diagnosis and who were recruited through participants with a psychiatric diagnosis (e.g., a family member/partner) or via advertisements. Unaffected controls for the UK Biobank analyses consisted of participants in UK Biobank who did not have a psychotic disorder diagnosis (F21-F29 inclusive) from admission records, death records, primary care records, or from self-reported sources.
Phenotypic data
The phenotypes compared across diagnostic groups included sex, age at interview (in years), educational attainment, and employment status. Educational attainment was dichotomised in two ways: (i) GCSE (General Certificate of Secondary Education, usually achieved at 16 years upon completing high school) and above versus below GCSE/no qualification, consistent with previous research34, and (ii) degree versus no degree. Employment status was dichotomised as in current paid employment or not, and was restricted to participants under the age of 65 who did not report being retired.
Genetic data
Clinically-ascertained sample
The clinically-ascertained participants were genotyped on the Illumina OmniExpress (Infinium OmniExpress-24 Kit), Illumina PsychArray (Infinium PsychArray-24 Kit) or Illumina GSA (Infinium Global Screening Array-24 Kit) genotyping platforms. Quality control and imputation using the Haplotype Reference Consortium (HRC)35 were performed as part of the DRAGON-Data protocol36. Datasets containing participants from the clinically-ascertained samples were restricted to those with the diagnoses described above and who did not carry a neurodevelopmental CNV36. These samples were combined with samples from 1000 Genomes European phase 3 (ref. 37) using PLINK v1.9 (ref. 38) after restricting to overlapping SNPs. The 1000 Genomes sample was included to provide a population reference to allow studies using different arrays to be directly compared39. The following quality control exclusion criteria were subsequently applied to SNPs: minor allele frequency (MAF) < 0.05, genotyping rate < 0.05, and Hardy-Weinberg equilibrium p ≤ 10⁻⁶. Linkage disequilibrium-pruned SNPs (500 variant count window size, 20 variant count to shift the window at the end of each step, a pairwise r² threshold of 0.2) were used to identify related individuals and to derive principal components (PCs). One individual from each pair assumed to be duplicates (kinship coefficient > 0.98) or related (kinship coefficient > 0.1875) was removed at random. The first 5 PCs were used to perform multi-dimensional clustering to identify an ancestrally homogeneous subsample of individuals40. The first 5 PCs explained the majority of the variance in the principal components, and adding further PCs did not change the classifications. Individuals within a 90% threshold from the most central point were included for analyses. There were insufficient numbers of participants of non-European ancestries in NCMH and CardiffCOGS to allow us to analyse PRS in different ancestries.
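The relatedness step can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name, the input layout (a list of pairwise kinship estimates) and the fixed random seed are assumptions; only the 0.1875 threshold and the "remove one of each pair at random" rule come from the text.

```python
import random

# Threshold from the text; duplicate pairs (kinship > 0.98) also exceed it.
RELATED_THRESHOLD = 0.1875


def prune_related(sample_ids, kinship_pairs, seed=0):
    """Return the set of retained IDs after relatedness pruning.

    kinship_pairs: iterable of (id_a, id_b, kinship) tuples (assumed layout).
    For each pair above the threshold in which both members are still
    retained, one member is dropped at random.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    kept = set(sample_ids)
    for id_a, id_b, kinship in kinship_pairs:
        if kinship > RELATED_THRESHOLD and id_a in kept and id_b in kept:
            kept.discard(rng.choice([id_a, id_b]))
    return kept
```

Processing pairs one at a time is simple but can remove more individuals than strictly necessary when relatives form larger clusters; whether the original analysis handled such clusters differently is not stated in the text.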
UK Biobank
Imputed genetic data were provided by UK Biobank. Pre-imputation quality control and imputation have been described elsewhere41. Briefly, participants were assayed at the Affymetrix Research Services laboratory using the UK Biobank Axiom or UK BiLEVE Axiom purpose-built arrays. Imputation was completed using the HRC panel35. We applied additional quality control procedures using the same thresholds used in our clinically-ascertained sample and detailed elsewhere39,42. To mirror the clinically-ascertained sample, genetic analyses were restricted to participants with European ancestry using the method described above (see also Legge et al.42).
Polygenic risk scores
In the clinically-ascertained sample and UK Biobank, PRSice v2 (ref. 43) was used to calculate PRS for schizophrenia using GWAS de-duplicated summary statistics that were derived separately from our clinical sample and UK Biobank28. PRS were also calculated for bipolar disorder13 and major depressive disorder44. Summary statistics underwent quality control36, and SNPs with MAF > 0.01 outside of the major histocompatibility complex region were used in the PRS analysis. PRS were calculated using relatively independent SNPs (r² < 0.1 within a 500 kb window) at a p-value threshold of 0.05 (ref. 28). Polygenic risk scores were standardised within samples prior to analysis.
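As a rough illustration of what a thresholded PRS is, the sketch below sums effect-size-weighted allele counts over SNPs passing the p-value threshold and then standardises the scores within a sample. It does not reproduce PRSice: clumping and quality control are assumed to have been done already, the score is a plain weighted sum (software may scale scores differently), the inclusive `<=` comparison is an assumption, and all names and SNP identifiers are hypothetical.

```python
from statistics import mean, pstdev

P_THRESHOLD = 0.05  # p-value threshold stated in the text


def raw_prs(dosages, weights):
    """Weighted allele-count score for one individual.

    dosages: {snp_id: effect-allele count between 0 and 2}
    weights: {snp_id: (effect_size, p_value)} from GWAS summary statistics,
             assumed to be already clumped and QC-filtered.
    """
    score = 0.0
    for snp, (beta, p_value) in weights.items():
        if p_value <= P_THRESHOLD and snp in dosages:
            score += beta * dosages[snp]
    return score


def standardise(scores):
    """Z-score a list of scores within one sample (requires non-zero spread)."""
    mu, sd = mean(scores), pstdev(scores)
    return [(s - mu) / sd for s in scores]
```

Standardising within each sample, as the text describes, puts effect estimates on a per-standard-deviation scale within that sample.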
Analysis
In NCMH, positive predictive values (PPV) were used to assess the proportion of participants with a self-reported schizophrenia diagnosis from a health professional who had a concordant DSM/ICD research interview diagnosis. We also considered a research interview diagnosis of schizophrenia and schizoaffective disorder depressive-type (SA-D) together, as there is evidence these two groups do not substantially differ with respect to genetic liability to schizophrenia28,33. It was not possible to assess negative predictive values (NPV), sensitivity and specificity in the clinically-ascertained sample because of the recruitment methods: participants were only approached to complete a SCAN-based research interview if they self-reported a mood or psychotic disorder diagnosis.
In the UK Biobank, PPV, NPV, sensitivity and specificity were used to assess how predictive a self-reported clinical diagnosis from a health professional was of a medical record diagnosis. We scaled the PPV and NPV to the population point prevalence of schizophrenia (0.6%) (Supplementary Note 1). We could not calculate PPV related to a medical record diagnosis of schizophrenia and SA-D together due to a very low prevalence of SA-D in the UK Biobank.
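The four metrics, and one standard way of rescaling PPV and NPV to an external prevalence, can be sketched as below. The rescaling shown is the usual Bayes' theorem form; the exact procedure used in the study is described in Supplementary Note 1 and may differ. The function name is hypothetical and any counts passed to it are purely illustrative.

```python
def diagnostic_metrics(tp, fp, fn, tn, prevalence=None):
    """Sensitivity, specificity, PPV and NPV from a 2x2 table.

    tp, fp, fn, tn: counts of true/false positives and negatives, with the
    medical record diagnosis treated as the reference standard.
    prevalence: if given, PPV and NPV are rescaled to this prevalence via
    Bayes' theorem instead of using the sample's own case fraction.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    if prevalence is None:
        ppv = tp / (tp + fp)
        npv = tn / (tn + fn)
    else:
        true_pos = sensitivity * prevalence
        false_pos = (1 - specificity) * (1 - prevalence)
        true_neg = specificity * (1 - prevalence)
        false_neg = (1 - sensitivity) * prevalence
        ppv = true_pos / (true_pos + false_pos)
        npv = true_neg / (true_neg + false_neg)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "npv": npv}
```

For example, `diagnostic_metrics(80, 20, 20, 880, prevalence=0.006)` rescales to the 0.6% point prevalence quoted above. Because PPV depends strongly on prevalence, a sample enriched for cases yields a higher raw PPV than the same test would achieve in the general population.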
In both NCMH and the UK Biobank, logistic regressions were used to test for phenotypic differences between individuals who only self-reported a diagnosis and those who had a research interview diagnosis/medical record diagnosis (some of whom also self-reported). Year of birth and sex were included as covariates.
Due to the limited number of genotyped participants in NCMH, the genetic analyses included participants from both NCMH and CardiffCOGS. In both the clinically-ascertained sample and the UK Biobank, logistic regressions were used to test for genetic differences in schizophrenia between the self-report-only and the research interview diagnosis/medical record diagnosis groups.
We compared the variance explained by schizophrenia PRS on the liability scale (r², assuming 1% lifetime risk) in schizophrenia case/control status in the clinically-ascertained sample and UK Biobank, separated by diagnosis definitions, against the variances reported by other samples of European genetic ancestry in the PGC3 schizophrenia GWAS. The r² values refer to the variance explained by the schizophrenia PRS in comparison to a covariates-only baseline model. In addition, we calculated the variance explained in schizophrenia case/control status in UK Biobank for bipolar disorder13 and major depressive disorder44 PRS.
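The text does not say which transformation was used to obtain liability-scale r². One widely used option is the conversion proposed by Lee et al. (2012), which adjusts an observed-scale r² for the population prevalence and for the proportion of cases in the sample. The sketch below implements that conversion under the assumption that it applies here; the function and parameter names are hypothetical.

```python
from statistics import NormalDist


def liability_r2(r2_observed, population_prevalence, sample_case_proportion):
    """Convert observed-scale r2 to the liability scale (Lee et al. 2012 form).

    r2_observed: variance explained on the observed (0/1) scale.
    population_prevalence: K, e.g. 0.01 for the 1% lifetime risk in the text.
    sample_case_proportion: P, the fraction of cases in the analysed sample.
    """
    k, p = population_prevalence, sample_case_proportion
    std_normal = NormalDist()
    t = std_normal.inv_cdf(1 - k)  # liability threshold for prevalence K
    z = std_normal.pdf(t)          # normal density at the threshold
    m = z / k                      # mean liability of cases
    c = (k * (1 - k) / z ** 2) * (k * (1 - k) / (p * (1 - p)))
    theta = m * (p - k) / (1 - k) * (m * (p - k) / (1 - k) - t)
    return r2_observed * c / (1 + r2_observed * theta * c)
```

Putting estimates on the liability scale is what makes them comparable across samples with very different case fractions, such as a case-enriched clinical cohort and a population cohort like UK Biobank.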
In the UK Biobank sample, further logistic regressions were used to assess if schizophrenia PRS was associated with the number of times a diagnosis was reported, the number of admissions and type of admission (primary and secondary). These PRS analyses were covaried for the first 5 PCs, array, age at assessment, and sex.
All statistical tests were two-sided. Unless otherwise specified, data analysis was conducted in R.