Analytical methods and arrays for use in the same

11324787 · 2022-05-10

Abstract

The present invention relates to a method for identifying the skin sensitizer potency of a test agent, and arrays and analytical kits for use in such methods.

Claims

1. An in vitro method for measuring biomarkers for human skin sensitizers consisting of the steps of: a) exposing a population of dendritic cells or a population of dendritic-like cells to a test agent; and b) measuring in the exposed cells of step a) the expression of nucleic acid molecules encoding each of the following 51 biomarkers for human skin sensitizers: histone cluster 1, H2bm (HIST1H2BM); histone cluster 1, H4b (HIST1H4B); histone cluster 1, H1d (HIST1H1D); phosphoribosylaminoimidazole carboxylase, phosphoribosylaminoimidazole succinocarboxamide synthetase (PAICS); histone cluster 1, H4d (HIST1H4D); histone cluster 1, H2al/histone cluster 1, H2bn (HIST1H2AL/HIST1H2BN); polo-like kinase 1 (PLK1); phosphoglycerate dehydrogenase (PHGDH); minichromosome maintenance complex component 2 (MCM2); minichromosome maintenance complex component 7 (MCM7); CD53; kinesin family member C1 (KIFC1); WEE1 G2 checkpoint kinase (WEE1); thymopoietin (TMPO); minichromosome maintenance complex component 4 (MCM4); proline/serine-rich coiled-coil 1 (PSRC1); ring finger protein 149 (RNF149); minichromosome maintenance complex component 6 (MCM6); minichromosome maintenance complex component 5 (MCM5); Fanconi anemia complementation group A (FANCA); carbamoyl-phosphate synthetase 2, aspartate transcarbamylase, and dihydroorotase (CAD); MRT4 homolog, ribosome maturation factor (MRTO4); target of EGR1, member 1 (TOE1); cell division cycle 20 (CDC20); peptidase M20 domain containing 2 (PM20D2); NmrA-like family domain containing 1 pseudogene (LOC344887); uracil DNA glycosylase (UNG); cyclin-dependent kinase inhibitor 1A (CDKN1A); histone cluster 1, H2bf (HIST1H2BF); histone cluster 1, H1b (HIST1H1B); jade family PHD finger 1 (JADE1); KIAA0125; cholinergic receptor, nicotinic alpha 5 (CHRNA5); chromatin assembly factor 1, subunit B (p60)/MORC family CW-type zinc finger 3 (CHAF1B/MORC3); sel-1 suppressor of lin-12-like 3 (SEL1L3); structure specific recognition protein 1 (SSRP1); forkhead box M1 (FOXM1); lamin B1 (LMNB1); histone cluster 1, H2ak (HIST1H2AK); centromere protein A (CENPA); non-SMC condensin I complex subunit H (NCAPH); histone cluster 1, H2bb (HIST1H2BB); ATPase, H+ transporting, lysosomal 56/58 kDa, V1 subunit B2 (ATP6V1B2); TNF receptor-associated protein 1 (TRAP1); phosphoribosylformylglycinamidine synthase (PFAS); transmembrane protein 97 (TMEM97); 24-dehydrocholesterol reductase (DHCR24); ferritin, heavy polypeptide 1 (FTH1); histone cluster 1, H2ae (HIST1H2AE); NAD(P)H dehydrogenase, quinone 1 (NQO1); and CD44.

2. The method according to claim 1, wherein measuring the expression of the biomarkers in step (b) is performed using binding moieties, each binding selectively to a nucleic acid molecule encoding one of the biomarkers for human skin sensitizers, and wherein the binding moieties each comprise a nucleic acid molecule.

3. An in vitro method for measuring biomarkers for human skin sensitizers comprising the steps of: a) exposing a population of dendritic cells or a population of dendritic-like cells to a test agent; and b) measuring in the exposed cells of step a) the expression of nucleic acid molecules encoding each of the following 51 biomarkers for human skin sensitizers: histone cluster 1, H2bm (HIST1H2BM); histone cluster 1, H4b (HIST1H4B); histone cluster 1, H1d (HIST1H1D); phosphoribosylaminoimidazole carboxylase, phosphoribosylaminoimidazole succinocarboxamide synthetase (PAICS); histone cluster 1, H4d (HIST1H4D); histone cluster 1, H2al/histone cluster 1, H2bn (HIST1H2AL/HIST1H2BN); polo-like kinase 1 (PLK1); phosphoglycerate dehydrogenase (PHGDH); minichromosome maintenance complex component 2 (MCM2); minichromosome maintenance complex component 7 (MCM7); CD53; kinesin family member C1 (KIFC1); WEE1 G2 checkpoint kinase (WEE1); thymopoietin (TMPO); minichromosome maintenance complex component 4 (MCM4); proline/serine-rich coiled-coil 1 (PSRC1); ring finger protein 149 (RNF149); minichromosome maintenance complex component 6 (MCM6); minichromosome maintenance complex component 5 (MCM5); Fanconi anemia complementation group A (FANCA); carbamoyl-phosphate synthetase 2, aspartate transcarbamylase, and dihydroorotase (CAD); MRT4 homolog, ribosome maturation factor (MRTO4); target of EGR1, member 1 (TOE1); cell division cycle 20 (CDC20); peptidase M20 domain containing 2 (PM20D2); NmrA-like family domain containing 1 pseudogene (LOC344887); uracil DNA glycosylase (UNG); cyclin-dependent kinase inhibitor 1A (CDKN1A); histone cluster 1, H2bf (HIST1H2BF); histone cluster 1, H1b (HIST1H1B); jade family PHD finger 1 (JADE1); KIAA0125; cholinergic receptor, nicotinic alpha 5 (CHRNA5); chromatin assembly factor 1, subunit B (p60)/MORC family CW-type zinc finger 3 (CHAF1B/MORC3); sel-1 suppressor of lin-12-like 3 (SEL1L3); structure specific recognition protein 1 (SSRP1); forkhead box M1 (FOXM1); lamin B1 (LMNB1); histone cluster 1, H2ak (HIST1H2AK); centromere protein A (CENPA); non-SMC condensin I complex subunit H (NCAPH); histone cluster 1, H2bb (HIST1H2BB); ATPase, H+ transporting, lysosomal 56/58 kDa, V1 subunit B2 (ATP6V1B2); TNF receptor-associated protein 1 (TRAP1); phosphoribosylformylglycinamidine synthase (PFAS); transmembrane protein 97 (TMEM97); 24-dehydrocholesterol reductase (DHCR24); ferritin, heavy polypeptide 1 (FTH1); histone cluster 1, H2ae (HIST1H2AE); NAD(P)H dehydrogenase, quinone 1 (NQO1); and CD44, wherein step b) is performed using an array comprising binding moieties, wherein the binding moieties consist of different binding moieties each capable of binding selectively to a nucleic acid molecule encoding one of the 51 biomarkers for human skin sensitizers.

Description

(1) Preferred, non-limiting examples which embody certain aspects of the invention will now be described, with reference to the following figures:

(2) FIG. 1. Binary predictions using the GARD assay. ROC evaluation for (A) binary predictions of benchmark chemicals (filled line) and of 37 new chemicals (dotted line). (B) GARD SVM decision values (DVs) correlate with CLP potency (37 chemicals, 11 1A, 19 1B, 7 no cat). Increasing potency is associated with increasing DVs.

(3) FIG. 2. Visualization of the training dataset used to develop the random forest model, using the 52 variables. (A) PCA plot of the training set with separate biological replicates coloured according to CLP classifications of the chemicals. (B) Heatmap of the training set with replicates of the samples hierarchically clustered, where the grey scale represents the relative gene expression intensity.

(4) FIG. 3. Visualization of the test dataset using the 52 variables. (A) PCA plot of the test set with separate biological replicates coloured according to CLP classifications. The PCA was built on the training set and the test set plotted without influencing the PCA. (B) Heatmap of the test set with samples hierarchically clustered, where the grey scale represents relative gene expression intensity.

(5) FIG. 4. The CLP potency model contains information related to human potency. (A) PCA plot of replicate training and test set samples with available human potency classifications colored according to human potency. The PCA is based on the 52 Random Forest variables as input and the PCA was built on the training set. (B) PCA plot visualizing test set only.

(6) FIG. 5. Pathway analysis based on an input of the 883 most significant genes from a multigroup comparison of the three CLP classes.

(7) FIG. 6. Pathways unique to each protein reactivity group, as identified by pathway analysis.

(8) FIG. 7. Venn diagrams [82] of common genes (A) and biological pathways (B) for the three different protein reactivity classes. MA=Michael acceptors; SB=Schiff base formation; SN=bi-molecular nucleophilic substitution/nucleophilic aromatic substitution.

(9) FIG. 8: Genomic Allergen Rapid Detection (GARD) workflow. The SVM model calculates decision values for each RNA sample derived from chemically treated cells based on the gene expression data of 200 genes that represent the GARD prediction signature. The decision values are then used for binary classification of the chemicals [3].

(10) FIG. 9: Relationship between GARD decision values and human potency classes. The chemicals have been sorted according to human potency class, and the respective average SVM decision value derived from triplicates is shown on the y-axis. The number on the x-axis refers to the number of the chemical in Table 8.

(11) FIG. 10: Scatter plot of O-PLS model (Y=human potency and sensitizer/non-sensitizer) based on 197 samples. Human potency classes as described in Basketter et al. (2014), class 1 representing highest potency, 6 representing non-sensitizers.

(12) FIG. 11: Permutation plot of O-PLS model. Comparison of the goodness of fit and prediction of the original model with those of several models in which the order of the Y-observations has been randomly permuted. The plot strongly indicates that the model has not been obtained by chance.

(13) FIG. 12: Observed Y values plotted against predicted Y values. The RMSEE (Root Mean Square Error of Estimation) indicates the fit error of the observations to the model. RMSEcv is a similar measure, but estimated using cross validation.

(14) FIG. 13: PCA plots and heat maps. PCA of training set (A), training and test set (B) and test set only (C) based on an input of 18 variables identified by random forest modeling. Each sphere represents a chemical, colored according to CLP classifications (yellow=1A, pink=1B, blue=no category). (D) Heat map of training set and (E) heat map of test set based on the expression of the 18 identified biomarkers.

EXAMPLE A

(15) Introduction

(16) Skin sensitization and associated diseases such as contact allergy affect a substantial portion of the general population, with an estimated prevalence of 15-20% in industrialized countries [1]. Allergic contact dermatitis (ACD), a type IV hypersensitivity reaction, is common among certain occupational groups, such as those regularly exposed to chemicals or involved in wet work [2]. However, cosmetics and household products can also contain numerous skin sensitizing chemicals. Through different legal frameworks, the European Union has prohibited animal testing for cosmetics and their ingredients [3], and imposed requirements for testing of >60,000 chemicals in the context of REACH [4]. Information on both the skin sensitizing capacity and the potency of a chemical has to be provided to meet the regulatory requirements for classification and sub-categorization. Animal-free alternative assays that meet these requirements are urgently needed.

(17) The molecular events leading to skin sensitization and consequently to ACD can be characterized by a number of sequential key events (KE) triggered by a chemical, and have been summarized in an adverse outcome pathway (AOP) as described by the OECD [5]. The initiating event (KE1) is defined as covalent protein modification by the skin sensitizing chemical after it has gained access to deeper skin layers. The following KE2 represents the inflammatory responses upon activation of keratinocytes. KE3 corresponds to the activation of dendritic cells, which in turn leads to activation and proliferation of T cells (KE4). Upon re-exposure to the sensitizer, the development of ACD may be triggered, which as the adverse outcome is characterized by skin lesions induced by specific Th1 and CD8+ T cells. While the KE in the AOP are well described, a detailed mechanistic understanding of the underlying biology of the individual key events is still missing [5].

(18) The murine Local Lymph Node Assay (LLNA) [6] has for many years been the preferred alternative for skin sensitization testing, as it provides data for both hazard identification and characterization, including skin sensitizer potency information. However, it has certain limitations, such as susceptibility to vehicle effects and false-positive results [7]. Several non-animal predictive methods have been developed to reduce animal experimentation in chemical testing, including computational approaches that integrate data from different test platforms for hazard identification, as recently reviewed [8]. Three test methods for skin sensitization are accepted as test guidelines at the OECD: the ARE-NRF2 luciferase method (KeratinoSens™) [9, 10], the Direct Peptide Reactivity Assay (DPRA) [11] and the human Cell Line Activation Test (h-CLAT) [12]. In addition to hazard identification, information on skin sensitizer potency is imperative in order to allow quantitative risk assessment and to define exposure limits. Few approaches for the prediction of skin sensitizer potency have been published; they were recently reviewed in [8] and include assays targeting KE2 (epidermal equivalent sensitizer potency assay [13], SENS-IS [14]) and the U-SENS assay modelling KE3 [15]. Furthermore, in silico models, often combining information from several in vitro methods, have been described, for example QSAR [16], artificial neural network [17], probabilistic models and integrated testing strategy (ITS) approaches including a Bayesian model [18-21].

(19) The alternative assay Genomic Allergen Rapid Detection (GARD) for the assessment of the skin sensitization capacity of chemicals is based on global transcriptomic analysis of differential expression in a human myeloid cell line, induced by sensitizing chemicals in comparison to non-sensitizing controls. The resulting biomarker signature, the GARD prediction signature (GPS), consists of 200 transcripts, which are used as input into a support vector machine (SVM) model trained on a set of reference chemicals [22]. The changes in transcription can be linked to the maturation and activation of dendritic cells (KE3) during sensitization. In an in-house study based on 26 blinded chemicals, the accuracy of the assay was estimated at 89% [23]. Previous observations indicated that the GARD assay is also able to provide information relevant for potency assessment. Firstly, signalling pathways were differentially regulated depending on the potency of a subset of chemical reactivity groups [24]. Secondly, we observed that more potent sensitizers were generally assigned higher GARD SVM decision values than weaker sensitizers, indicating that there were genes within the signature contributing potency information (unpublished observations). However, the information in the GARD prediction signature was not sufficient to completely stratify chemicals into the well-defined potency groups described by the Classification, Labelling and Packaging (CLP) Regulation [25].

(20) The CLP regulation is based on the Globally Harmonised System [25] and uses three categories for chemical classification: no category (no cat) for non-sensitizers, category 1B for weak sensitizers and category 1A for strong sensitizers. In light of the observations described above, it was hypothesized that GARD can be developed further into a tool for the prediction of chemical skin sensitizer potency, targeting the CLP categories. As the established GARD Support Vector Machine model cannot be applied to multiclass problems, we used another approach based on random forest modelling [26]. Random forest is a decision-tree based method and well-suited for microarray data [27]. It divides the dataset internally and repeatedly into a training and test set through random sub-sampling (bootstrapping). Samples in the test set, referred to as out-of-bag samples, comprise approximately one third of the entire dataset and are used to estimate the out-of-bag (OOB) error, i.e. the classification error.
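
The "approximately one third" figure follows from sampling with replacement: the chance that a given sample is never drawn in n draws is (1 − 1/n)^n, which tends to e^−1 ≈ 0.368. The following is a minimal base-R sketch of that arithmetic; the sample count `n` and the helper name `oob_fraction` are illustrative assumptions and do not come from the original analysis.

```r
# Minimal sketch (base R only): fraction of samples left out-of-bag by one
# bootstrap draw. The sample count n is an illustrative assumption.
oob_fraction <- function(n) {
  drawn <- sample.int(n, size = n, replace = TRUE)  # one bootstrap resample
  1 - length(unique(drawn)) / n                     # share never drawn
}

set.seed(1)
n <- 200
simulated <- mean(replicate(2000, oob_fraction(n)))
expected  <- (1 - 1 / n)^n   # exact expectation; tends to exp(-1) for large n

print(c(simulated = simulated, expected = expected, limit = exp(-1)))
```

Each tree in a forest sees a different bootstrap resample, so every sample is out-of-bag for roughly a third of the trees and can be predicted by those trees alone.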

(21) Here, we present a new approach to predict skin sensitizer potency according to CLP categories based on supervised machine learning using a random forest model. Firstly, the global gene expression data from a training set comprising 68 unique chemicals and 2 vehicle control samples were used as input into a random forest model. The random forest model was subsequently combined with an algorithm for backward variable elimination. The algorithm initially ranked the variable importance of each gene from the microarrays, and then iteratively fitted new random forests, while removing the least important variables from the previous iteration. Using this strategy, we were able to identify a set of 52 genes with the smallest OOB error rate when predicting the out-of-bag samples from the training set. The predictive performance of the 52 genes was challenged with an independent test set containing 18 chemicals previously unseen by the model. The chemicals in this test set could be predicted with an overall accuracy of 78%. In addition to the predictive model, we also demonstrated the versatility of analyzing whole transcriptomes of cells by performing pathway analysis to further improve the mechanistic understanding of skin sensitizing potency on a cellular level, confirming the hypothesis that different chemical reactivity classes induce distinct signalling pathways.

(22) Materials and Methods

(23) Cells and Flow Cytometry

(24) The myeloid cell line used in this study was derived from MUTZ-3 (DSMZ, Braunschweig, Germany) and maintained as described in [22, 28]. A phenotypic control analysis of the cells was carried out by flow cytometry prior to each experiment in order to confirm the cells' immature state. The following monoclonal antibodies were used: CD1a (DakoCytomation, Glostrup, Denmark), CD34, CD86, HLA-DR (BD Biosciences, San Jose, USA), all FITC-conjugated; CD14 (DakoCytomation), CD54, CD80 (BD Biosciences), all PE-conjugated. FITC- and PE-conjugated mouse IgG1 (BD Biosciences) served as isotype controls and propidium iodide (BD Biosciences) as a marker for non-viable cells. Three batches of cells were exposed for 24 hours in independent experiments, and viability and CD86 expression were assessed by flow cytometry. All FACS samples were analyzed on a FACSCanto II instrument with FACSDiva software for data acquisition. 10,000 events were acquired and further analysis was performed in FCS Express V4 (De Novo Software, Los Angeles, Calif.). Cells for RNA extraction were lysed in TRIzol® (Life Technologies/Thermo Fisher Scientific, Waltham, USA) and stored at −20° C. until further use.

(25) Chemicals and Stimulations

(26) All chemicals were purchased from Sigma Aldrich (Saint Louis, USA) at high purity or were provided by Cosmetics Europe. All chemicals were stored according to the recommendations of the supplier. The chemical stimulations of cells were performed as described earlier [28]. In short, GARD input concentrations were defined by the solubility and cytotoxicity characteristics of the chemicals. An end concentration of 500 μM was targeted for non-cytotoxic and soluble chemicals, and the highest possible concentration for chemicals with limited solubility (lower than 500 μM in medium). Cytotoxic chemicals were used at a concentration targeting a relative cell viability of 90%. Most chemicals were used from a 1000× pre-dilution in dimethyl sulfoxide (DMSO) or autoclaved MilliQ water. The DMSO concentration as vehicle never exceeded 0.1%. DMSO and MilliQ samples were included as vehicle controls in this study and thus belong to the group of non-sensitizer samples.
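
The relationship between the 1000× pre-dilution, the 500 μM end concentration and the 0.1% vehicle limit can be checked with a short calculation. This is only a worked arithmetic sketch in base R; the variable names are hypothetical.

```r
# Worked arithmetic for a 1000x pre-dilution (variable names are hypothetical).
final_conc_uM   <- 500    # targeted end concentration in the medium
dilution_factor <- 1000   # stock is diluted 1:1000 into the medium

stock_conc_mM   <- final_conc_uM * dilution_factor / 1000  # uM -> mM
vehicle_percent <- 100 / dilution_factor                   # % v/v vehicle in medium

print(c(stock_conc_mM = stock_conc_mM, vehicle_percent = vehicle_percent))
```

Under these assumptions a 500 mM stock gives the 500 μM end concentration, and the vehicle makes up 0.1% of the final volume, consistent with the stated DMSO limit.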

(27) RNA Extraction, cDNA and Array Hybridization

(28) RNA isolation from cells lysed in TRIzol® was performed according to the manufacturer's instructions. Labeled sense DNA was synthesized according to Affymetrix (Affymetrix, Cleveland, USA) protocols using the recommended kits and controls. The cDNA was hybridized to Human Gene 1.0 ST arrays (Affymetrix) and further processed and scanned as recommended by the supplier.

(29) Binary Classifications

(30) Binary classifications of the 37 chemicals summarized in Table 1 into sensitizers or non-sensitizers were performed with the previously established model based on SVM, using SCAN-normalized [29, 30] expression data from the GPS as variable input into the learning algorithm [22]. Prior to model construction, potential batch effects between the training set and the test chemicals were eliminated by scaling array expression values for test chemicals against the training set. A scaling factor was generated by calculating the ratio of the average expression value for each gene in DMSO vehicle control samples of the training set to the average expression value for the same gene in DMSO samples in the batch from which the test chemical originated. The gene expression values of the test chemical were then multiplied by the scaling factor of the corresponding gene. SVM predictions were performed as described previously [22, 31]. In short, an SVM model based on a linear kernel was trained on reference chemical stimulations from the original training set used during identification of the GPS [22]. The trained model was subsequently applied to assign each test chemical an SVM Decision Value (SVM DV). The resulting SVM DVs for all test chemicals were used to construct a receiver operating characteristic (ROC) curve, and the resulting area under the curve (AUC) was used as a classification measure [32]. SVM modeling and ROC curve visualizations were performed in the R statistical environment, using the additional packages e1071 [33] and ROCR [34]. Prior to evaluating the final predictive performance of the model, SVM DVs for each individual replicate of the test chemicals were calibrated against the cut-off for maximal predictive performance obtained during classification of benchmark samples in Table 2, as described in [31]. The calibrated SVM DVs were subsequently used for final classifications, and test chemicals were classified as sensitizers if the median output value of the replicates was >0. Accuracy, sensitivity and specificity were estimated using Cooper statistics [35]. The non-parametric two-sample Wilcoxon test was performed in order to determine whether the SVM DV distributions of CLP categories 1A, 1B and no cat differed significantly.
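
The scaling step, the median-based call and the Cooper statistics described above can be sketched in a few lines of base R. This is a minimal illustration on made-up numbers, not the original analysis code: all object names (`train_dmso`, `batch_dmso`, `test_chem`, `dv`, `truth`) and all values are hypothetical.

```r
# Minimal sketch (base R only). All object names and numbers are made up for
# illustration and are not taken from the original analysis.

# 1) Per-gene scaling factor: mean DMSO expression in the training set divided
#    by mean DMSO expression in the batch the test chemical came from.
genes      <- c("geneA", "geneB", "geneC")
train_dmso <- matrix(c(8, 6, 10,  8, 6, 10), nrow = 3, dimnames = list(genes, NULL))
batch_dmso <- matrix(c(4, 6, 5,   4, 6, 5),  nrow = 3, dimnames = list(genes, NULL))
test_chem  <- matrix(c(5, 7, 6), nrow = 3, dimnames = list(genes, "rep1"))

scaling <- rowMeans(train_dmso) / rowMeans(batch_dmso)
scaled  <- test_chem * scaling   # row i is multiplied by scaling[i]

# 2) A chemical is called a sensitizer if the median calibrated decision value
#    of its replicates is above zero.
dv <- data.frame(
  chemical = rep(c("chem1", "chem2", "chem3", "chem4"), each = 3),
  value    = c(1.2, 0.8, -0.1,  -0.9, -0.4, 0.2,  0.5, 0.7, 0.3,  -1.1, 0.1, -0.6)
)
median_dv <- sapply(split(dv$value, dv$chemical), median)
predicted <- median_dv > 0

# 3) Cooper statistics against hypothetical reference classifications.
truth <- c(chem1 = TRUE, chem2 = TRUE, chem3 = TRUE, chem4 = FALSE)
truth <- truth[names(predicted)]          # align by chemical name
tp <- sum(predicted & truth);  tn <- sum(!predicted & !truth)
fp <- sum(predicted & !truth); fn <- sum(!predicted & truth)
cooper <- c(sensitivity = tp / (tp + fn),
            specificity = tn / (tn + fp),
            accuracy    = (tp + tn) / length(truth))

print(scaled)
print(cooper)
```

Taking the median rather than the mean of the replicate decision values makes the call robust to a single outlying replicate, as for the hypothetical `chem1` above.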

(31) Data Handling and Statistical Analysis

(32) In order to build a random forest model, 68 chemicals and two vehicle control samples were defined as a training set, and 18 chemicals, six from each CLP category, were included in the independent test set. Chemicals in the test set were not included in the construction of the model. The aim was to obtain a balanced training set representing all three CLP categories (Table 3) and different chemical reactivity groups (Table 4). Most of the chemicals in the test set (14 out of 18) originated from the latest experimental campaign (Table 1), comprising 37 chemicals previously not investigated using the GARD assay. In the training set, roughly one third of samples (23 out of 68) were from this latest dataset. The vehicle samples were part of all projects and are thus present with higher replicate numbers.

(33) The new microarray data were merged with historical data [22, 23] and subjected to quality control. Four arrays were removed due to poor quality; however, every chemical remained represented by at least biological duplicates. Array data were imported into the R statistical environment and normalized using the SCANfast algorithm [29, 30]. As several experimental campaigns needed to be combined, the dataset was then adjusted using the ComBat method [36, 37] in order to remove batch effects between samples. At this point, the samples in the training set were separated from the samples in the test set. To avoid overfitting, only samples in the designated training set were used during identification of the predictive biomarker signature and for fitting of parameters to the classifier, and samples in the test set were set aside to validate the performance of the identified signature and the specified classifier. The predictive biomarker signature was identified by feeding normalized and batch-corrected transcript intensities from individual samples in the training set into a random forest model [26] combined with a backward elimination procedure in the varSelRF package [38] in R/Bioconductor version 3.1.2. The initial forest used for ranking of variable importance was grown to 2000 trees and all other parameters were kept at the default settings. The package iteratively fits and evaluates random forest models, at each iteration dropping the least important 20% of the variables. The best performing set of variables was selected based on OOB error rates from all fitted random forests as the smallest number of genes within one standard error of the minimal error solution (i.e., the 1 s.e. rule). The variable selection procedure was validated by estimating the prediction error rate by the 0.632+ bootstrap method of the varSelRF package using 100 bootstrap samples, and the importance of individual transcripts in the biomarker signature was validated by their frequency of appearance in bootstrap samples (referred to as Validation Call Frequencies (VCF)). The predictive performance of the identified biomarker signature was validated by building a new forest in the randomForest package [39], based on the previous parameters, using only the samples in the training set and the selected transcripts in the biomarker signature as variable input. The model was applied to assign each individual replicate sample in the test set to a CLP category, and the majority vote across the biological replicate stimulations for each chemical was accepted as the predicted category. Heatmaps and Principal Component Analysis (PCA) plots were constructed in Qlucore Omics Explorer (Qlucore AB, Lund, Sweden).
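
Two of the decision rules in this paragraph, the 1 s.e. rule and the majority vote across replicates, can be illustrated without fitting any forest. The sketch below uses base R only and a hypothetical elimination trajectory; the helper names `one_se_rule` and `majority_vote`, the numbers, and the use of a binomial standard error are assumptions made for illustration rather than a reproduction of the varSelRF internals.

```r
# Hypothetical backward-elimination trajectory: number of variables kept at
# each iteration and the OOB error of the forest fitted on them.
n_samples <- 200   # assumed number of training samples
n_vars    <- c(100, 80, 64, 51, 41, 33, 26)
oob_err   <- c(0.26, 0.25, 0.22, 0.20, 0.21, 0.23, 0.30)

# 1 s.e. rule: smallest variable set whose OOB error lies within one standard
# error of the minimum error (binomial standard error is an assumption here).
one_se_rule <- function(n_vars, oob_err, n_samples) {
  best <- min(oob_err)
  se   <- sqrt(best * (1 - best) / n_samples)
  min(n_vars[oob_err <= best + se])
}
selected_size <- one_se_rule(n_vars, oob_err, n_samples)

# Majority vote across replicate-level predictions for each chemical.
majority_vote <- function(labels) {
  counts <- table(labels)
  names(counts)[which.max(counts)]   # ties go to the first level (assumption)
}
replicate_calls <- data.frame(
  chemical  = rep(c("chemA", "chemB"), each = 3),
  predicted = c("1A", "1A", "1B", "no cat", "1B", "no cat"),
  stringsAsFactors = FALSE
)
per_chemical <- sapply(split(replicate_calls$predicted, replicate_calls$chemical),
                       majority_vote)

print(selected_size)
print(per_chemical)
```

In this toy trajectory the minimum error is reached with 51 variables, but the rule prefers the smaller 41-variable set because its error is statistically indistinguishable from the minimum.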

(34) Pathway Analysis

(35) Pathway analysis was performed with the Key Pathway Advisor (KPA) tool [40] version 16.6, which provides a pathway analysis workflow to investigate e.g. gene expression data. It associates differentially expressed genes with both upstream and downstream processes in order to allow biological interpretation. The investigated dataset, consisting of the SCANfast- and ComBat-normalized expression values of the test set and the training set, in total 308 samples, was first variance-filtered in order to remove variables with consistently low variance until approximately a third was left (10009 variables, σ/σmax=0.1478). A multigroup comparison (ANOVA) comparing samples belonging to CLP no cat, 1B, and 1A was then applied in order to identify transcripts that were differentially regulated. The most significant 883 genes (false discovery rate FDR=10⁻⁹; p=8.53×10⁻¹¹) were used as input into KPA (Affymetrix Exon IDs and respective p-value, overconnectivity analysis). In order to identify pathways associated with protein reactivity, the same variance-filtered dataset was filtered based on two-group comparisons (t-tests): Toxtree binding class “no binding” non-sensitizers (81 samples) versus “Michael acceptor” sensitizers (MA, 63 samples), “no binding” versus “Schiff base formation” (SB, 29 samples), and “no binding” versus combined “bi-molecular nucleophilic substitution/nucleophilic aromatic substitution” (SN, 25 samples). Lists with the 500 most significantly regulated variables from each comparison, together with p-value and fold change (causal reasoning analysis), were then entered into the KPA tool for each protein reactivity group. The lowest p-value was reached when comparing MA samples to “no binding” (FDR=3.65×10⁻⁷), followed by SB (FDR=2.3×10⁻⁵) and SN (FDR=2.44×10⁻⁵).
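
The filtering steps that precede the KPA input can be sketched on simulated data. This is a minimal base-R illustration under stated assumptions: the matrix `expr`, the group sizes and the 0.5 threshold are invented, the filter is read as keeping variables whose standard deviation relative to the largest standard deviation (σ/σmax) reaches the threshold, and Benjamini-Hochberg is assumed as the FDR method because the text does not name one.

```r
# Simulated expression matrix (genes x samples); everything here is invented.
set.seed(42)
n_genes <- 300
groups  <- factor(rep(c("no cat", "1B", "1A"), each = 6))
expr    <- matrix(rnorm(n_genes * length(groups)), nrow = n_genes,
                  dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
# Add a group effect to the first 20 genes so that some signal exists.
expr[1:20, groups == "1A"] <- expr[1:20, groups == "1A"] + 4

# 1) Variance filter: keep variables with sigma / sigma_max above a threshold.
sigma    <- apply(expr, 1, sd)
keep     <- sigma / max(sigma) >= 0.5          # illustrative threshold
filtered <- expr[keep, , drop = FALSE]

# 2) Multigroup comparison: one-way ANOVA p-value per remaining gene.
p_values <- apply(filtered, 1, function(x) {
  summary(aov(x ~ groups))[[1]][["Pr(>F)"]][1]
})

# 3) Multiple-testing correction (Benjamini-Hochberg, an assumption).
fdr <- p.adjust(p_values, method = "BH")
top <- names(sort(fdr))[seq_len(min(10, length(fdr)))]

print(sum(keep))
print(top)
```

The ranked gene identifiers in `top`, together with their p-values, correspond to the kind of list that would be passed on to a pathway tool.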

(36) Scripts

(37) Listed below are the scripts used for normalization and potency predictions:

(38) Script 1: Used for combating batch effects between different projects prior to model construction:

(39) #This script is used for combating batch effects between different projects prior to model construction!

(40) library(sva)

(41) library(caret)

(42) data_set <- read.delim("/home/andy/alifiles-scannorm.txt", stringsAsFactors=F, header=F)

(43) #Create numerical data set with colnames/rownames and appropriate design matrices and factors

(44) annotation_headers <- data_set[1:17,1] #Header comprises row 1-17

(45) annotations <- data_set[1:17,2:ncol(data_set)]

(46) colnames(annotations) <- annotations[1,]

(47) rownames(annotations) <- annotation_headers

(48) rownames_data_set <- data_set[,1]

(49) data_set <- data_set[,-1]

(50) #Convert the data to numerical values

(51) data_set <- data.frame(sapply(data_set[18:nrow(data_set),],as.numeric))

(52) colnames(data_set) <- annotations[1,] #The colnames correspond to the IDs

(53) rownames(data_set) <- rownames_data_set[18:length(rownames_data_set)]

(54) projects <- factor(as.character(annotations[3,]))

(55) #CLP USENS is used in the combat model?

(56) groups <- as.character(annotations[7,])

(57) #Make the NA unknown

(58) groups[is.na(groups)] <- "unknown"

(59) groups <- factor(groups)

(60) group_model <- model.matrix(~groups)

(61) #Remove near-zero-variance probes (could potentially interfere with ComBat)

(62) nzv <- nearZeroVar(t(data_set))

(63) if (length(nzv) > 0) data_set <- data_set[-nzv,] #Guard: indexing with an empty nzv would drop every row

(64) #The data consists of 5 different projects with pronounced batch effects

(65) #ComBat is used to remove the batch effects

(66) normalized_data <- ComBat(data_set,projects,mod=group_model)

(67) normalized_data <- rbind(annotations,normalized_data)
write.table(normalized_data,"/home/andy/processed_data_sets/ComBat/allfiles SCAN_ComBat.txt",col.names=F,sep="\t",quote=F)
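
The input layout assumed by Script 1 (row labels in the first column, a block of annotation rows on top, probe intensities below) can be illustrated on a toy table. This is a base-R sketch with invented data: it uses 2 annotation rows instead of 17, and the names `raw`, `n_header` and `expr` are hypothetical.

```r
# Toy stand-in for the tab-delimited input: the first column holds row labels,
# the first n_header rows are annotations, the rest are probe intensities.
n_header <- 2   # Script 1 uses 17 header rows; 2 keeps the example small
raw <- data.frame(
  V1 = c("ID", "project", "probe1", "probe2", "probe3"),
  V2 = c("s1", "P1", "7.1", "5.0", "6.2"),
  V3 = c("s2", "P2", "7.4", "5.0", "6.0"),
  stringsAsFactors = FALSE
)

# Annotation block, labelled by the first column.
annotations <- raw[seq_len(n_header), -1]
rownames(annotations) <- raw[seq_len(n_header), 1]

# Numeric block: everything was read as text, so convert column by column.
values <- raw[-seq_len(n_header), -1]
expr   <- data.frame(sapply(values, as.numeric))
colnames(expr) <- unlist(annotations["ID", ])
rownames(expr) <- raw[-seq_len(n_header), 1]

# Batch factor of the kind passed to a batch-correction routine.
projects <- factor(unlist(annotations["project", ]))

# Pitfall: negative indexing with an empty index vector selects zero rows,
# which matters when a filter flags nothing to remove.
no_drop <- integer(0)
rows_left <- nrow(expr[-no_drop, , drop = FALSE])

print(expr)
print(rows_left)
```

Because the whole file is read as character data, the annotation rows and the numeric rows have to be separated before conversion, which is what the row ranges 1:17 and 18:nrow(data_set) do in the script.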

(68) Script 2: Used for construction of the initial potency model:

(69) #Construction of the initial model potencymodel

(70) require(pROC)

(71) require(ROCR)

(72) require(varSelRF)

(73) require(ggplot2)

(74) require(reshape)

(75) require(randomForest)

(76) require(caret)

(77) library(sva)

(78) library(parallel)

(79) library(doMC)

(80) library(foreach)

(81) ##Load dataset

(82) data1 <- read.table(file="allfiles-scannorm-cleaned_ComBat.txt",stringsAsFactors=FALSE,row.names=1,header=F,sep="\t")

(83) ##Define trainingset. Row 10 corresponds to annotations with info train/test

(84) trainingset <- as.matrix(data1[,which(data1[10,]=="training set")])

(85) trainingsetannot <- as.vector(trainingset[7,])

(86) trainingsetannot[which(trainingsetannot=="no cat")] <- 0

(87) trainingsetclass <- as.factor(trainingsetannot)

(88) trainingset <- as.matrix(trainingset[-(1:17),])

(89) class(trainingset) <- "numeric"

(90) training <- as.data.frame(t(trainingset))

(91) ##Load testdataset 1

(92) testset1 <- as.matrix(data1[,which(data1[10,]=="test set 1")])

(93) testset1annot <- as.vector(testset1[7,])

(94) testset1annot[which(testset1annot=="no cat")] <- 0

(95) sampleannot <- as.vector(testset1[2,])

(96) testset1 <- as.matrix(testset1[-(1:17),])

(97) class(testset1) <- "numeric"

(98) testset1 <- as.data.frame(t(testset1))

(99) testsetclass <- as.factor(testset1annot)

(100) ##load testdataset 2

(101) testset2 <- as.matrix(data1[,which(data1[10,]=="test set 2")])

(102) sampleannottest2 <- as.vector(testset2[2,])

(103) testset2annot <- as.vector(testset2[7,])

(104) testset2annot[which(testset2annot=="no cat")] <- 0

(105) testset2annot <- as.factor(testset2annot)

(106) testset2 <- as.matrix(testset2[-(1:17),])

(107) class(testset2) <- "numeric"

(108) testset2 <- as.data.frame(t(testset2))

(109) ##Construct the model

(110) set.seed(2)

(111) vsrf <- varSelRF(training, Class=trainingsetclass, ntree=2000, ntreeIterat=2000, vars.drop.frac=0.2, keep.forest=TRUE, whole.range=TRUE, verbose=TRUE)
RFmodel <- randomForest(y=trainingsetclass, x=subset(training, select=vsrf$selected.vars), ntree=vsrf$ntreeIterat, importance=TRUE)

(112) rfpr.pred <- predict(RFmodel, newdata=subset(testset1, select=vsrf$selected.vars))

(113) confusionMatrix(rfpr.pred, testsetclass)
a <- data.frame(stimulant=sampleannot, group=testsetclass, predicted=rfpr.pred)
write.table(a, file="VarSelRF_combat_Test1.txt", sep=",", row.names=TRUE)

(114) rfpr.pred2 <- predict(RFmodel, newdata=subset(testset2, select=vsrf$selected.vars))

(115) b <- data.frame(stimulant=sampleannottest2, group=testset2annot, predicted=rfpr.pred2)
write.table(b, file="VarSelRF_combat_Test2.txt", sep=",", row.names=TRUE)

(116) #Extract variable importance

(117) write.table(vsrf$selected.vars, file="VarSelRF_ComBat.txt", sep=",", row.names=TRUE)

(118) ##Bootstrapping of variable selection procedure

(119) forkCL <- makeForkCluster(12)

(120) clusterSetRNGStream(forkCL, iseed=(100))

(121) clusterEvalQ(forkCL, library(varSelRF))

(122) vsrf.vsb <- varSelRFBoot(training, trainingsetclass, usingCluster=TRUE, srf=vsrf, TheCluster=forkCL, ntree=2000, ntreeIterat=2000, vars.drop.frac=0.2, bootnumber=100)

(123) stopCluster(forkCL)

(124) #######################Shuffle train and test#######################

(125) #Shuffle train and testset 18 times to make sure primary selection is not biased.
#Combine train and test dataset

(126) notest2 <- as.matrix(data1[, -which(data1[10,]=="test set 2")])

(127) notest2annot <- as.vector(notest2[7,])

(128) notest2annot[which(notest2annot=="no cat")] <- 0

(129) notest2classes <- as.factor(notest2annot)

(130) ##notest2compounds <- levels(factor(unlist(as.vector(notest2[2,]))))

(131) ##notest2 <- as.matrix(notest2[-(1:17),])

(132) ##class(notest2) <- "numeric"

(133) #Generate a new sampling matrix without DMSO and Water! These are controls and should not be used for sampling to test set.

(134) notestDMSO <- as.matrix(notest2[, -which(notest2[2,]=="dimethyl sulfoxide")])

(135) notestDMSO_Unstim <- as.matrix(notestDMSO[, -which(notestDMSO[2,]=="unstimulated")])

(136) notest2DU <- as.vector(notestDMSO_Unstim[7,])

(137) notest2DU[which(notest2DU=="no cat")] <- 0

(138) notest2DUclasses <- as.factor(notest2DU)

(139) #################

(140) ##construct the shuffling loops

(141) ##Shuffle train and test 18 times! The results are stored in testcomp, which are accessed through typing testcomp[[i]]

(142) testcomp <- c()

(143) looptest <- NULL

(144) looptrain <- NULL

(145) looptests <- list()

(146) trainingsetannot <- list()

(147) trainingsetclasses <- list()

(148) testsetannot <- list()

(149) testsetclasses <- list()

(150) testsetnames <- list()

(151) vsrf <- list()

(152) rfvsrf <- list()

(153) rfpr.pred <- list()

(154) cf <- list()

(155) results <- list()

(156) testcomploc <- list()

(157) rfpr.predtest2 <- list()

(158) resultstest2 <- list()

(159) vsrf.vsb <- list()

(160) #cl=makeCluster(4)

(161) #registerDoSNOW(cl)

(162) registerDoMC(18)

(163) startTime <- Sys.time()

(164) writeLines(paste('Starting shuffle: ', startTime, sep=''))

(165) loop_results = foreach(i=1:18, .packages=c('varSelRF','caret','randomForest')) %dopar% {
  set.seed((10+i))
  testcomp <- (sample(sort(unique(levels(factor(notestDMSO_Unstim[2, which(notest2DUclasses=="1A")]))), method="radix"), 6))
  testcomp <- c(testcomp, sample(sort(unique(levels(factor(notestDMSO_Unstim[2, which(notest2DUclasses=="1B")]))), method="radix"), 6))
  testcomp <- c(testcomp, sample(sort(unique(levels(factor(notestDMSO_Unstim[2, which(notest2DUclasses=="0")]))), method="radix"), 6))
  testcomploc <- c()
  for(n in 1:length(testcomp)) {
    testcomploc <- c(testcomploc, which(notest2[2,]==testcomp[n]))
  }
  ##Get Ids from original matrix containing water and DMSO samples.
  #testcomploc <- unlist(testcomploc)
  looptest <- notest2[, testcomploc]
  looptrain <- notest2[, -testcomploc]
  trainingsetannot <- as.vector(looptrain[7,])
  trainingsetannot[which(trainingsetannot=="no cat")] <- 0
  trainingsetclasses <- as.factor(trainingsetannot)
  looptrain <- as.matrix(looptrain[-(1:17),])
  class(looptrain) <- "numeric"
  looptrain <- as.data.frame(t(looptrain))
  testsetannot <- as.vector(looptest[7,])
  testsetannot[which(testsetannot=="no cat")] <- 0
  testsetclasses <- as.factor(testsetannot)
  testsetnames <- as.vector(looptest[2,])
  looptest <- as.matrix(looptest[-(1:17),])
  class(looptest) <- "numeric"
  looptest <- as.data.frame(t(looptest))
  vsrf <- varSelRF(looptrain, Class=trainingsetclasses, ntree=2000, ntreeIterat=2000, vars.drop.frac=0.2, keep.forest=TRUE, whole.range=TRUE, verbose=FALSE)
  rfvsrf <- randomForest(y=trainingsetclasses, x=subset(looptrain, select=vsrf$selected.vars), ntree=vsrf$ntreeIterat, importance=TRUE)
  rfpr.pred <- predict(rfvsrf, newdata=subset(looptest, select=vsrf$selected.vars))
  cf <- confusionMatrix(rfpr.pred, testsetclasses)
  results <- t(rbind(stimulant=testsetnames, group=testsetclasses, predicted=rfpr.pred))
  #List contains all results
  #Results can be found through loop_results[[i]]$
  list(Results=results, VarSelRF=vsrf, RFValSelRF=rfvsrf, Predictions=rfpr.pred, Confusion.Matrix=cf, TestSet=testsetnames, TestClasses=testsetclasses, Train=looptrain, Trainclass=trainingsetclasses)

(166) }

(167) print(loop_results[[2]]$TestSet)

(168) print(loop_results[[2]]$TestSet)

(169) #stopCluster(cl)

(170) ##Run Bootstrapping

(171) vsrf.vsb <- list()

(172) for(i in 1:18){
  forkCL <- makeForkCluster(12)
  clusterSetRNGStream(forkCL, iseed=(100+i))
  clusterEvalQ(forkCL, library(varSelRF))
  vsrf.vsb[[i]] <- varSelRFBoot(loop_results[[i]]$Train, Class=loop_results[[i]]$Trainclass, usingCluster=TRUE, srf=loop_results[[i]]$VarSelRF, TheCluster=forkCL, ntree=2000, ntreeIterat=2000, vars.drop.frac=0.2, bootnumber=100)
  stopCluster(forkCL)
  print(Sys.time())

(173) }
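The shuffling loop above draws a fixed number of chemicals per CLP class and moves all replicates of each drawn chemical into the test set. The following is a minimal base R sketch of that idea on a toy annotation table. The data frame `annot`, its column names, and the helper `split_by_chemical` are hypothetical and not part of the original script.

```r
# Toy annotation: one row per replicate sample (illustrative names only)
annot <- data.frame(
  chemical = rep(paste0("chem", 1:12), each = 3),
  clp      = rep(c("1A", "1B", "0"), each = 12),
  stringsAsFactors = FALSE
)

# Draw n_per_class chemicals from each class. All replicates of a drawn
# chemical go to the test set, so no chemical is split across the two sets.
split_by_chemical <- function(annot, n_per_class, seed) {
  set.seed(seed)
  per_class <- split(annot$chemical, annot$clp)
  test_chems <- unlist(lapply(per_class, function(ch) {
    sample(sort(unique(ch)), n_per_class)
  }), use.names = FALSE)
  is_test <- annot$chemical %in% test_chems
  list(test = annot[is_test, ], train = annot[!is_test, ],
       test_chems = test_chems)
}

s <- split_by_chemical(annot, n_per_class = 2, seed = 11)
```

Sampling at the level of chemicals rather than replicates keeps the number of chemicals per class constant across shuffles, while the number of samples may vary with the number of replicates per chemical.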

(174) Results

(175) Binary Classifications of 37 Chemicals

(176) A novel dataset comprising 37 well-characterized chemicals (Table 1) was selected in order to complement historical GARD data for 51 chemicals and to represent a relevant choice of chemicals, balanced in terms of chemical reactivity class and use in consumer products. The novel chemicals were selected based on European Chemical Agency CLP databases and literature [15, 41] and in cooperation with the Skin Tolerance Task Force of Cosmetics Europe (CE), who kindly provided 27 chemicals. The 37 novel chemicals were predicted as sensitizers or non-sensitizers using the GPS and previously established protocols based on SVM classifications. The SVM model was applied to assign each individual replicate sample an SVM DV. Prior to final classifications, SVM DVs from the 37 samples were first calibrated against 11 benchmark samples (Table 2) included in the same sample batch as the test chemicals. For the purpose of evaluating binary predictions, we here decided to prioritize human data [41] when available, where classes one to four correspond to sensitizers and five and six to non-sensitizers, instead of CLP classifications. Model performance in predicting the 37 chemicals is summarized by an AUC ROC of 0.88, indicating a good discriminatory ability, as illustrated in FIG. 1A (benchmark chemicals, filled line; 37 chemicals, dotted line). The sensitivity, specificity and accuracy based on Cooper statistics were estimated at 73%, 80% and 76%, respectively. In combination with previously published data [23], the updated predictive accuracy of GARD for binary classification of skin sensitizers is estimated at 84% based on a dataset comprising a total of 74 chemicals. When the chemicals in Table 1 were grouped according to CLP classification, and the respective SVM DV values obtained during classification of the 37 chemicals were summarized in a boxplot as presented in FIG. 1B, a potency gradient emerged, as the stronger sensitizers were assigned higher SVM DVs in comparison to the weaker sensitizers in category 1B and the non-sensitizers in no cat. According to non-parametric two-sample Wilcoxon tests comparing SVM DV sample distributions, these groups differed significantly (no cat vs 1B: p=2.8.sup.−5; 1B vs 1A: p=2.8.sup.−6; no cat vs 1A: p=3.5.sup.−12). Although the differences between groups were significant, some overlap existed between individual chemicals, indicating that the information was not sufficient to completely stratify samples into well-defined potency groups.
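For readers unfamiliar with Cooper statistics, the sketch below shows how sensitivity, specificity and accuracy follow from binary reference and predicted labels. The labels are invented toy values, not the study data, and the function name `cooper_stats` is an assumption made for illustration.

```r
# Sensitivity, specificity and accuracy from binary labels
cooper_stats <- function(truth, predicted, positive = "sensitizer") {
  tp <- sum(truth == positive & predicted == positive)  # true positives
  tn <- sum(truth != positive & predicted != positive)  # true negatives
  fp <- sum(truth != positive & predicted == positive)  # false positives
  fn <- sum(truth == positive & predicted != positive)  # false negatives
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy    = (tp + tn) / length(truth))
}

# Toy example: 6 sensitizers and 4 non-sensitizers, with one error in each group
truth     <- c(rep("sensitizer", 6), rep("non-sensitizer", 4))
predicted <- c(rep("sensitizer", 5), "non-sensitizer",
               rep("non-sensitizer", 3), "sensitizer")
res <- cooper_stats(truth, predicted)
```

With these toy labels, sensitivity is 5/6, specificity is 3/4 and accuracy is 8/10.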

(177) A Random Forest Model for the Prediction of CLP Categories

(178) In order to establish a biomarker signature for prediction of CLP categories, the dataset comprising the 37 novel chemicals was merged with the historical dataset comprising 51 chemicals. In total, the dataset consisted of 86 unique chemicals (Table 4) and two vehicle controls, balanced with regard to categories 1A and 1B and non-sensitizers (no cat) as described by CLP (Table 3). For four chemicals, CLP classification 1B was changed to no category/non-sensitizer according to the sources indicated in Table 4 for one of three reasons: i) for retaining consistency with previous GARD projects (benzaldehyde, xylene), ii) for being used as vehicle in non-sensitizing concentration (DMSO), and iii) for being a well-described false-positive in the LLNA (sodium dodecyl sulfate).

(179) A random forest model for the prediction of three CLP categories was developed based on a training set consisting of 70 unique samples, including two vehicle controls. 52 predictive variables (Table 5) were identified as optimal for CLP classification. The model's prediction error rate, derived from bootstrapping, was estimated at 0.225, which provides an indication of model performance. In order to visualize the dataset used to develop the model, principal component analysis (PCA) was performed. The 52 variables identified by random forest based on a whole-genome array analysis were used as input, and the PCA was built on the training set (FIG. 2A). FIG. 2A is based on chemicals with biological replicates colored according to CLP categories, and a clear gradient from no cat to strong sensitizers (1A) can be observed along the first principal component. The heatmap of the training set with hierarchical clustering of the variables in FIG. 2B illustrates the regulation of genes in relation to the respective chemical and CLP category.
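A PCA that is built on the training set only can later be used to place new samples on the same axes without refitting. The sketch below illustrates this with base R `prcomp` on random toy matrices. The matrix sizes, the gene names and the choice to center and scale are assumptions, since the text does not state the preprocessing used for the figures.

```r
set.seed(1)
genes <- paste0("gene", 1:5)
# Toy expression matrices: rows are samples, columns are selected variables
train <- matrix(rnorm(20 * 5), nrow = 20, dimnames = list(NULL, genes))
test  <- matrix(rnorm(6 * 5),  nrow = 6,  dimnames = list(NULL, genes))

# Fit the principal components on the training samples only
pca <- prcomp(train, center = TRUE, scale. = TRUE)

# Project new samples onto the fixed components; the loadings in
# pca$rotation are not changed by this step
test_scores <- predict(pca, newdata = test)

# Equivalent manual projection using the stored centering and scaling
manual <- scale(test, center = pca$center, scale = pca$scale) %*% pca$rotation
```

Here `test_scores` holds the coordinates of the toy test samples in the space defined by the training set.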

(180) Prediction of an Independent Test Set

(181) The model performance was further evaluated by predicting the CLP categories of an independent test set, which comprised 18 chemicals previously unseen to the model. The test set colored according to CLP categories was visualized in the PCA plot (FIG. 3A), without influencing the PCA components based on the 52 identified variables. FIG. 3B visualizes the regulation of the 52 genes in the normalized test set in the form of a heatmap. When replicates were predicted separately and majority voting was used to classify the respective chemicals, 14 out of 18 chemicals were assigned to the right CLP category (Table 6, Suppl. Table 1), resulting in an overall accuracy of 78% (Table 7). The four misclassified chemicals were diethyl maleate, butyl glycidyl ether, lyral and cyanuric chloride. The only false-negative prediction was cyanuric chloride, which is classified as 1A in CLP and as no cat in our model, whereas the remaining three chemicals are classified as CLP 1B but were predicted as 1A. In a subsequent step to confirm that our selections of training and test set were unbiased and that the predictive model was not entirely dependent on the composition of the training set, we constructed 18 alternative random forest models, where the composition of chemicals in the training and test set was randomly shuffled. For each new model, we repeated the complete process of variable selection as described above. Depending on the number of replicate samples available for each chemical stimulation, the total number of samples in each training and test set varied, but the number of chemicals in each set and their CLP distribution were kept constant. The alternative models were all significant and the average prediction error rate obtained from the bootstrapping procedure was identical to the initial model at 0.22, which supports that the presented model was not obtained due to a biased choice of training and test sets.
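The majority voting step, in which replicate-level predictions are collapsed into one call per chemical, can be sketched in base R as follows. The chemical labels and predictions are toy values, and the helper name `majority_vote` is hypothetical. Ties are resolved here by taking the first class in alphabetical order, which is an assumption and not a rule stated in the text.

```r
# Collapse replicate-level predictions into one class per chemical
majority_vote <- function(chemical, predicted) {
  vapply(split(predicted, chemical), function(p) {
    counts <- table(p)
    # which.max returns the first maximum, so ties go to the
    # alphabetically first class
    names(counts)[which.max(counts)]
  }, character(1))
}

# Toy replicate predictions for three chemicals
chemical  <- c("A", "A", "A", "B", "B", "B", "C", "C")
predicted <- c("1A", "1A", "1B", "0", "0", "0", "1B", "1B")
votes <- majority_vote(chemical, predicted)

# Chemical-level accuracy against toy reference classes
truth <- c(A = "1A", B = "0", C = "1A")
accuracy <- mean(votes == truth[names(votes)])
```

In this toy case two of the three chemicals receive the reference class, so `accuracy` is 2/3.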

(182) The CLP Potency Model Contains Information Relevant for Human Potency Prediction

(183) Next, PCA was utilized in order to investigate how the 52 variables (Table 5) perform using information related to human potency categories [41]. The 70 training and 18 test set samples were colored according to human potency (class 1-6, 6=true non-sensitizer, 1=strongest sensitizer), and samples for which no human potency category was available were removed (see Table 4). Although the model has been developed to predict CLP potency categories, it also contains information related to human potency as illustrated by FIG. 4A, B.

(184) Identity of the Random Forest Model Variables

(185) The 52 variables identified by random forest (Table 5) represent transcripts belonging to different cell compartments and different functional roles. Five of them overlap with the GARD prediction signature. Five of the top 10 markers, which were most frequently chosen in the bootstrapping process and have the highest validation call frequencies (Table 5), are histone cluster 1 members, such as HIST1H2AB [42]. Histones are highly conserved and play an important role not only for maintaining chromatin structure but also in gene regulation [43]. PFAS and PAICS are involved in purine biosynthesis [44, 45], and TMEM97 is a regulator of cholesterol levels [46], which is further described to be involved in cell cycle regulation, cell migration and invasion in a glioma cell model according to RNA interference experiments [47]. DHCR24 is a multifunctional enzyme localized to the endoplasmic reticulum (ER) that catalyzes the final step in cholesterol synthesis [48] but also possesses anti-apoptotic activity, as shown for example for neuronal cells under ER stress [49]. PLK1, a kinase, has been shown to be phosphorylated in response to TLR activation, and results from RNA interference suggested that PLK1 signaling was involved in the TLR-induced inflammatory response [50]. PLK1 was further reported to be involved in cell cycle regulation by inhibiting TNF-induced cyclin D1 expression, and it could reduce TNF-induced NF-κB activation [51]. Many of the remaining transcripts encode nuclear proteins and are thus likely involved in DNA-dependent processes such as replication, transcription, splicing and cell cycle regulation. There are further several transcripts in the signature that code for proteins known for their involvement in immune responses and/or sensitization, such as NQO1, which is well-described for its role in the cellular response to skin sensitizers [52]. CD53 belongs to the tetraspanin family, transmembrane proteins which have multiple functions in e.g. cell adhesion, migration and signaling, and has been shown to be elevated on FcεRI-positive skin DCs from atopic dermatitis patients [53], similarly as on peripheral blood-derived monocytes from patients with atopic eczema [54], in comparison to the respective healthy controls. CD44 is a cell surface glycoprotein, adhesion and hyaluronan receptor [55] expressed by numerous cell types and involved, for example, in inflammatory responses [56], e.g. by mediating leukocyte migration into inflamed tissues, which has been shown in a mouse model of allergic dermatitis [57].

(186) Common and Unique Regulated Pathways are Induced by Sensitizers Differing in their Protein Reactivity

(187) The 33 key pathways identified with an input of the 883 most significantly regulated genes after a multigroup comparison (FDR=10.sup.−9) between CLP categories in KPA analysis (FIG. 5), mirror several of the functional groups of the 52 variables defined by Random Forest, such as gene regulation, cell cycle control and metabolism. Immune response-associated pathways such as “IL-4-induced regulators of cell growth, survival, differentiation and metabolism” and “Immune response_IL-3 signaling via JAK/STAT, p38, JNK and NF-κB” were among the 50% most significantly regulated ones.

(188) The analyses described subsequently focused on the three largest chemical reactivity groups in the present dataset: nucleophilic substitution (SN), Michael addition (MA) and Schiff base (SB) formation. Among the included chemicals, the majority of chemicals labeled as “no cat” possessed no protein binding properties; however, a few SB formation and SN chemicals were present. In category 1B, almost all protein reactivity types were represented, whereas there was a clear dominance of MA chemicals in category 1A.

(189) For each associated protein reactivity, unique pathways could be identified for sensitizing chemicals belonging to the respective protein reactivity group as presented in FIG. 6. These results combine differentially regulated genes from the input data with so-called key hubs, molecules that are able to regulate the expression level of the input genes. They cannot necessarily be identified themselves by gene expression experiments as their regulation may either be visible on other biological levels, such as activity changes (e.g. for kinases), or the changes may be very short-termed or of low magnitude. In total, 173 genes were common for all three reactivity groups (FIG. 7A) and six pathways were present in all three reactivity groups (FIG. 7B): “Cell cycle: Role of APC in cell cycle regulation”, “Cell cycle: Role of SCF complex in cell cycle regulation”, “Development: Transcription regulation of granulocyte development”, “Cell cycle: Cell cycle (generic schema)”, “DNA damage: ATM/ATR regulation of G1/S checkpoint”, and “Mitogenic action of Estradiol/ESR1 (nuclear) in breast cancer”. Again, cell cycle pathways were highly represented. Oxidative stress responses were identified as part of the key pathway results only for MA and SB chemicals (FIG. 6). In the MA sensitizer chemical group, KEAP1 and NRF2 were found as key hubs as well as their target genes NQO1 and HMOX1 [52, 58] and AHR [59, 60]. For MA chemicals, the target genes NQO1, HMOX1 and CES1 [61] were even present on the input gene level. On the input level, CES1 was present for SN chemicals as well, but only NQO1, NRF2 and AHR were identified as key hubs. KEAP1 was not found as key hub in SB and SN KPA analysis and AHR was the only key hub identified for all three protein reactivity groups. NF-κB subunits (RelB or p52) were predicted key hubs in all reactivity groups except in MA chemicals.

(190) In summary, there seem to be common mechanistic responses to chemical exposures per se such as cell cycle and DNA damage-related, but the pathway analysis results also support the hypothesis that different chemical reactivity classes induce distinct signalling pathways as observed earlier by us [24] and in other experimental systems [14, 21, 62, 63]. Several pathways are linked to processes known to be relevant in skin sensitization.

(191) Discussion

(192) The amount of chemical per exposed skin area that induces sensitization varies significantly [41]; thus, skin sensitizer potency information is imperative for accurate risk assessment. Developers of alternative test methods rely on human clinical data in order to achieve high predictivity of human sensitization; however, this type of data is rather scarce and most available data are derived from the LLNA [64]. Despite the fact that animal models reflect the complexity of systemic diseases such as skin sensitization, in vitro data have so far been shown to correlate well and perform even better than animal models, especially when combined in an ITS [65]. Furthermore, alternative test systems may provide mechanistic insights that tests using whole animals cannot provide [66].

(193) Here, an approach to predict skin sensitizer potency is presented using the CLP system based on a dendritic cell (DC) model and transcriptional profiling. CLP categories are empirically determined and arbitrarily defined categories, which do not represent the diversity of different chemicals, their molecular features and mechanisms responsible for their sensitizing characteristics or the lack thereof. They are, however, what legislation currently requires in order to classify and label chemicals. We therefore investigated 37 additional chemicals previously not tested in the standard GARD assay, in order to combine these new data with historical datasets. In the binary classifications of these new chemicals according to the established GARD model, four misclassified sensitizing chemicals were close to the cut-off as defined by the benchmark samples, namely aniline, benzocaine, limonene and butyl glycidyl ether. Three of these belong to human potency class 4, which shows that the model cut-off is critical in order to translate the SVM values, often correlating well with potency, into accurate classifications. Together with historical predictions, GARD still shows an overall high accuracy of 84% for binary classifications.

(194) We then used both new and historical data in order to develop a random forest model for each CLP category, which displays balanced accuracies [67] of 96% for no category, 79% for category 1A and 75% for category 1B (based on majority votes; for performance on replicate basis see Table 7). Butyl glycidyl ether, diethyl maleate, cyanuric chloride and lyral were misclassified; the only false-negative prediction was no category instead of 1A for cyanuric chloride. However, cyanuric chloride reacts exothermally with water, forming hydrogen chloride and possibly other reaction products. Due to this hydrolysis reaction, probably already occurring in DMSO (containing water), the amounts of cyanuric chloride and reaction products present in the assay are unknown. This chemical may fall outside the applicability domain of GARD platform-based assays. However, nothing in the quality control or other pre-modelling analyses motivated a removal of these samples. Diethyl maleate and lyral are classified as 1B by CLP, but as 1A by our model, which again seems to fit better with their human potency category, category 2, as described by Basketter et al. [41]. The fourth misclassified chemical, butyl glycidyl ether, is a human potency category 3. Predicting 1B, i.e. weak sensitizers, seemed the most challenging part. Also in the LLNA, potency predictions of weak sensitizers vary more than those of strong sensitizers [8, 68, 69]. Furthermore, 1B is a very heterogeneous group, both considering the range of LLNA EC3 concentrations and human potency categories associated with chemicals summarized in category 1B.
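Per-class balanced accuracy for a three-class model is commonly computed one-versus-rest as the mean of sensitivity and specificity for each class. The sketch below assumes that definition, which may differ in detail from the one in reference [67], and uses invented toy labels rather than the study data. The function name `balanced_accuracy` is hypothetical.

```r
# One-versus-rest balanced accuracy per class:
# (sensitivity + specificity) / 2 for each class in turn
balanced_accuracy <- function(truth, predicted) {
  classes <- sort(unique(c(truth, predicted)))
  vapply(classes, function(k) {
    sens <- sum(truth == k & predicted == k) / sum(truth == k)
    spec <- sum(truth != k & predicted != k) / sum(truth != k)
    (sens + spec) / 2
  }, numeric(1))
}

# Toy chemical-level calls for the three CLP classes
truth     <- c("0", "0", "1A", "1A", "1B", "1B")
predicted <- c("0", "0", "1A", "1B", "1A", "1B")
ba <- balanced_accuracy(truth, predicted)
```

For the toy labels, class "0" reaches a balanced accuracy of 1, while "1A" and "1B" each reach 0.625 because one chemical of each class is confused with the other. A class with no reference samples would yield `NaN` under this definition.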

(195) The U-SENS™ assay, formerly MUSST, uses another myeloid cell line, U937, and CD86 measurements in order to distinguish sensitizers and non-sensitizers. When the authors combined CD86 with cytotoxicity data and certain cut-off levels in order to predict CLP categories, correct predictions of 82% of Cat. 1A (41/50) and 73% of Cat. 1B/No Cat (85/116) were reported [15]. However, it remains unclear how the more challenging discrimination between no cat and 1B would turn out. Cottrez et al. [70] have recently published a study, where they report that their alternative assay SENS-IS, a 3D reconstituted epidermis based model, performs very well for the prediction of skin sensitizer potency; however, they do not target CLP categories. Judging from FIG. 4, purely based on the 52 variable input, which was defined in order to predict CLP categories, also our model seems to contain information relevant for human potency classification. Once more chemicals receive human potency classifications, it should be possible to smoothly develop a human potency model based on the GARD platform.

(196) KPA pathway analysis identified biologically relevant events in the presented dataset as several pathways regulated have a known role in skin sensitization, e.g. cytokine signalling and oxidative stress responses (FIGS. 5-6). Although DCs are not the primary target for protein modification in vivo, we hypothesized that different protein reactivity classes influence the DC transcriptome differentially. Protein reactivity is one of the most important features of chemicals defining their skin sensitizing capacity and potency with certain limitations [63]. Protein reactivity-specific patterns were detectable as revealed by the comparison of the most significantly regulated genes induced by the reactivity groups MA, SB, and SN. Interestingly, NF-κB subunits were predicted key hubs for all reactivity groups except for MA chemicals, which may reflect the described inhibitory effect on NF-κB signalling of this type of chemicals [71]. Although some pathways do not seem to fit into the context, such as “Mitogenic action of Estradiol/ESR1 (nuclear) in breast cancer”, a closer look at regulated molecules reveals that those are certainly relevant also for other pathways. In this case, for example p21, c-myc, E2F1, SGOL2 (shugoshin 2), and CAD (carbamoyl phosphate synthetase) were involved, whereof the first ones are known cell cycle regulators/transcription factors [72] and play a role in chromosome segregation (SGOL2) [73]. CAD, an enzyme, which is rate-limiting in the biosynthesis of pyrimidine nucleotides, on the other hand, has more recently also been implicated in cooperation with cell signaling pathways [74] and seems to inhibit the bacterial sensor NOD2 (nucleotide-binding oligomerization domain 2) antibacterial function in human intestinal epithelial cells [75]. 
These examples may illustrate that our transcriptomic data deserves further attention and more detailed analyses and this type of analysis, using different bioinformatics tools and finally, functional analyses, may contribute to elucidate mechanisms underlying biological processes and diseases.

(197) As already discussed above, assigning correct potency classes to chemicals with weak or intermediate potency seems to be a more general problem. Benigni et al. [76] presented data showing that even experimental in vivo systems, though in general correlating well with human data, perform less well for sensitizers of intermediate potency. They further argue that the protein modification step is the rate-limiting step of the whole sensitization process and that in vitro tests targeting other AOP events do not add much information. On the other hand, there is only a weak relationship between the rate constant of MA sensitizers as determined by kinetic profiling with a model peptide and their potency in the LLNA [71]. This was described to be linked to the anti-inflammatory effect of MA chemicals by inhibiting NF-κB signalling, which increases with reactivity. However, considering the key events in the skin sensitization AOP, the sensitization process as such can be understood as a continuum and would thus not just be characterized by isolated events. Of course, in reality the process must start with the chemical penetrating the skin. In this context it should be noted that the concept that a chemical's ability to efficiently penetrate the stratum corneum is crucial for its skin sensitization capacity and potency has recently been proven wrong [77]. Furthermore, access to lower skin layers may in reality be greatly facilitated by impaired skin barrier function, due to e.g. wet work and small wounds. Interestingly, many strong sensitizers possess irritant properties, which correlate with cytotoxicity, and cytotoxicity in turn seems to contribute to sensitizer potency [18], which is also reflected in our dataset (data not shown). Cytotoxicity is also connected to the protein reactivity of the chemical: chemicals, which are strongly cysteine-reactive are in general cytotoxic, and thus may interfere with vital enzyme function [18, 78]. 
At the same time, irritant effects can generate danger signals such as extracellular ATP or hyaluronic acid degradation products [79, 80], which may serve to activate DCs and consequently, to prime naïve T cells. Additionally, other factors such as pre-existing inflammation and co-exposures to other substances, which certainly play a role in allergic sensitization, may be hard to implement in any test system. As in vitro assays further have to face the demand of being cost-effective and easy to perform, there are obvious limitations to what can be achieved in vitro, but the performances of tests so far are very encouraging [76]. Although protein reactivity may be very important, cell-based systems should be capable of recapitulating certain additional events on top of peptide reactivity. Alternative assay performance can most likely be further improved as more mechanistic details of skin sensitization are revealed, which will allow identifying both applicability domains and pitfalls more easily.

(198) In conclusion, we have identified a predictive biomarker signature comprising 52 transcripts for classification of skin sensitizing compounds into CLP groups, as required by current legislation. When challenged with 18 independent test compounds, the assay provided accurate results for 78% of potency predictions. It further identified 11/12 sensitizers correctly, which indicates that it is rather conservative, i.e. avoids false-negative predictions. Since the presented biomarker signature is optimized for potency predictions, and not only for binary hazard classifications, we suggest a possible application for our model within an Integrated Testing Strategy for accurate potency predictions, similar as suggested by [18]. In addition, the results effectively illustrate the flexibility and versatility of the GARD setup. Measuring complete transcriptomes of cells provides the opportunity to perform mode-of-action based studies to identify pathways of sensitization, but also to identify predictive biomarker signatures, which we have previously shown also for binary skin sensitization predictions and respiratory sensitization. Thus, we here present an initial proof of concept of a potency model targeting the CLP groups, which can be modified and improved as more samples are analyzed and more accurate human reference data emerges.

REFERENCES

(199) 1. Peiser, M., et al., Allergic contact dermatitis: epidemiology, molecular mechanisms, in vitro methods and regulatory aspects. Current knowledge assembled at an international workshop at BfR, Germany. Cell Mol Life Sci, 2012. 69(5): p. 763-81. 2. Behroozy, A. and T. G. Keegel, Wet-work Exposure: A Main Risk Factor for Occupational Hand Dermatitis. Saf Health Work, 2014. 5(4): p. 175-80. 3. European Parliament, C.o.t.E.U., REGULATION (EC) No 1223/2009 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL. 2009. 4. Hartung, T. and C. Rovida, Chemical regulators have overreached. Nature, 2009. 460(7259): p. 1080-1. 5. OECD, The Adverse Outcome Pathway for Skin Sensitisation Initiated by Covalent Binding to Proteins. Part 1: Scientific Evidence. 2012: p. 1-59. 6. Gerberick, G. F., et al., Local lymph node assay (LLNA) for detection of sensitization capacity of chemicals. Methods, 2007. 41(1): p. 54-60. 7. Anderson, S. E., P. D. Siegel, and B. J. Meade, The LLNA: A Brief Review of Recent Advances and Limitations. J Allergy (Cairo), 2011. 2011: p. 424203. 8. Ezendam, J., H. M. Braakhuis, and R. J. Vandebriel, State of the art in non-animal approaches for skin sensitization testing: from individual test methods towards testing strategies. Arch Toxicol, 2016. 9. Andreas, N., et al., The intra- and inter-laboratory reproducibility and predictivity of the KeratinoSens assay to predict skin sensitizers in vitro: Results of a ring-study in five laboratories. Toxicology in Vitro, 2011. 25(3): p. 733-744. 10. Natsch, A. and R. Emter, Skin sensitizers induce antioxidant response element dependent genes: application to the in vitro testing of the sensitization potential of chemicals.

(200) Toxicol Sci, 2008. 102(1): p. 110-9. 11. Gerberick, G. F., et al., Development of a peptide reactivity assay for screening contact allergens. Toxicol Sci, 2004. 81(2): p. 332-43. 12. Ashikaga, T., et al., Development of an in vitro skin sensitization test using human cell lines: the human Cell Line Activation Test (h-CLAT). I. Optimization of the h-CLAT protocol.

(201) Toxicol In Vitro, 2006. 20(5): p. 767-73. 13. Teunis, M. A., et al., International ring trial of the epidermal equivalent sensitizer potency assay: reproducibility and predictive-capacity. ALTEX, 2014. 31(3): p. 251-68. 14. Cottrez, F., et al., Genes specifically modulated in sensitized skins allow the detection of sensitizers in a reconstructed human skin model. Development of the SENS-IS assay. Toxicol In Vitro, 2015. 29(4): p. 787-802. 15. Piroird, C., et al., The Myeloid U937 Skin Sensitization Test (U-SENS) addresses the activation of dendritic cell event in the adverse outcome pathway for skin sensitization. Toxicol In Vitro, 2015. 29(5): p. 901-16. 16. Dearden, J. C., et al., Mechanism-Based QSAR Modeling of Skin Sensitization. Chem Res Toxicol, 2015. 28(10): p. 1975-86. 17. Tsujita-Inoue, K., et al., Skin sensitization risk assessment model using artificial neural network analysis of data from multiple in vitro assays. Toxicol In Vitro, 2014. 28(4): p. 626-39. 18. Jaworska, J. S., et al., Bayesian integrated testing strategy (ITS) for skin sensitization potency assessment: a decision support system for quantitative weight of evidence and adaptive testing strategy. Arch Toxicol, 2015. 89(12): p. 2355-83. 19. Jaworska, J., et al., Bayesian integrated testing strategy to assess skin sensitization potency: from theory to practice. J Appl Toxicol, 2013. 33(11): p. 1353-64. 20. Luechtefeld, T., et al., Probabilistic hazard assessment for skin sensitization potency by dose-response modeling using feature elimination instead of quantitative structure-activity relationships. J Appi Toxicol, 2015. 35(11): p. 1361-71. 21. Natsch, A., et al., Predicting skin sensitizer potency based on in vitro data from KeratinoSens and kinetic peptide binding: global versus domain-based assessment. Toxicol Sci, 2015. 143(2): p. 319-32. 22. Johansson, H., et al., A genomic biomarker signature can predict skin sensitizers using a cell-based in vitro alternative to animal tests. 
BMC Genomics, 2011. 12: p. 399. 23. Johansson, H., et al., Genomic allergen rapid detection in-house validation—a proof of concept. Toxicol Sci, 2014. 139(2): p. 362-70. 24. Albrekt, A. S., et al., Skin sensitizers differentially regulate signaling pathways in MUTZ-3 cells in relation to their individual potency. BMC Pharmacol Toxicol, 2014. 15: p. 5. 25. European Parliament, C.o.t.E.U., http://echa.europa.eu/sv/regulations/clp/. accessed Jul. 13, 2016. 26. Breiman, L., Random forests. Machine Learning, 2001. 45. 27. Diaz-Uriarte, R. and S. Alvarez de Andres, Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 2006. 7(1): p. 1-13. 28. Johansson, H., et al., The GARD assay for assessment of chemical skin sensitizers. Toxicol In Vitro, 2013. 27(3): p. 1163-9. 29. Piccolo, S. R., et al., A single-sample microarray normalization method to facilitate personalized-medicine workflows. Genomics, 2012. 100(6): p. 337-44. 30. Stephen R. Piccolo, A. H. B., W. Evan Johnson. https://www.bioconductor.org/packages/release/bioc/html/SCAN.UPC.html. [cited 2016 Oct. 14, 2016]; Bioconductor version: Release (3.3):[ 31. Forreryd, A., et al., From genome-wide arrays to tailor-made biomarker readout—Progress towards routine analysis of skin sensitizing chemicals with GARD. Toxicol In Vitro, 2016. 32. Lasko, T. A., et al., The use of receiver operating characteristic curves in biomedical informatics. J Biomed Inform, 2005. 38(5): p. 404-15. 33. Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., Weingessel, A., e1071: Misc functions of the Department of statistics (e1071). TU Wien. R package version 1.6, 2011. http://CRAN.R-project.org/package=e1071. 34. Sing, T., et al., ROCR: visualizing classifier performance in R. Bioinformatics, 2005. 21(20): p. 3940-1. 35. Cooper, J. A., 2nd, R. Saracci, and P. Cole, Describing the validity of carcinogen screening tests. Br J Cancer, 1979. 39(1): p. 87-9. 36. Jeffrey T. Leek, W. E. J., Hilary S. 
Parker, Andrew E. Jaffe, John D. Storey, sva: Surrogate Variable Analysis. R package version 3.10.0., 2014. 37. Johnson, W. E., C. Li, and A. Rabinovic, Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 2007. 8(1): p. 118-27. 38. Diaz-Uriarte, R., GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest. BMC Bioinformatics, 2007. 8: p. 328. 39. Liaw, A. and M. Wiener, Classification and Regression by randomForest. R News, 2002. 2(3): p. 18-22. 40. Key Pathway Advisor by Clarivate Analytics (Formerly the IP & Science business of Thomson Reuters). http://ipscience.thomsonreuters.com/product/metacore/. 2016. 41. Basketter, D. A., et al., Categorization of chemicals according to their relative human skin sensitizing potency. Dermatitis, 2014. 25(1): p. 11-21. 42. Singh, R., et al., Increasing the complexity of chromatin: functionally distinct roles for replication-dependent histone H2A isoforms in cell proliferation and carcinogenesis. Nucleic Acids Res, 2013. 41(20): p. 9284-95. 43. Harshman, S. W., et al., H1 histones: current perspectives and challenges. Nucleic Acids Res, 2013. 41(21): p. 9593-609. 44. Lane, A. N. and T. W.-M. Fan, Regulation of mammalian nucleotide metabolism and biosynthesis. Nucleic Acids Research, 2015. 45. Li, S. X., et al., Octameric structure of the human bifunctional enzyme PAICS in purine biosynthesis. J Mol Biol, 2007. 366(5): p. 1603-14. 46. Bartz, F., et al., Identification of cholesterol-regulating genes by targeted RNAi screening. Cell Metab, 2009. 10(1): p. 63-75. 47. Qiu, G., et al., RNA interference against TMEM97 inhibits cell proliferation, migration, and invasion in glioma cells. Tumour Biol, 2015. 36(10): p. 8231-8. 48. Waterham, H. R., et al., Mutations in the 3beta-hydroxysterol Delta24-reductase gene cause desmosterolosis, an autosomal recessive disorder of cholesterol biosynthesis. Am J Hum Genet, 2001. 69(4): p. 685-94. 49. 
Lu, X., et al., 3 beta-hydroxysteroid-Delta 24 reductase (DHCR24) protects neuronal cells from apoptotic cell death induced by endoplasmic reticulum (ER) stress. PLoS One, 2014. 9(1): p. e86753. 50. Hu, J., et al., Polo-like kinase 1 (PLK1) is involved in toll-like receptor (TLR)-mediated TNF-alpha production in monocytic THP-1 cells. PLoS One, 2013. 8(10): p. e78832. 51. Higashimoto, T., et al., Regulation of I(kappa)B kinase complex by phosphorylation of (gamma)-binding domain of I(kappa)B kinase (beta) by Polo-like kinase 1. J Biol Chem, 2008. 283(51): p. 35354-67. 52. Ade, N., et al., HMOX1 and NQO1 genes are upregulated in response to contact sensitizers in dendritic cells and THP-1 cell line: role of the Keap1/Nrf2 pathway. Toxicol Sci, 2009. 107(2): p. 451-60. 53. Peng, W. M., et al., Tetraspanins CD9 and CD81 are molecular partners of trimeric FcεRI on human antigen-presenting cells. Allergy, 2011. 66(5): p. 605-11. 54. Jockers, J. J. and N. Novak, Different expression of adhesion molecules and tetraspanins of monocytes of patients with atopic eczema. Allergy, 2006. 61(12): p. 1419-22. 55. Lee-Sayer, S. S., et al., The where, when, how, and why of hyaluronan binding by immune cells. Front Immunol, 2015. 6: p. 150. 56. Johnson, P. and B. Ruffell, CD44 and its role in inflammation and inflammatory diseases. Inflamm Allergy Drug Targets, 2009. 8(3): p. 208-20. 57. Gonda, A., et al., CD44, but not L-selectin, is critically involved in leucocyte migration into the skin in a murine model of allergic dermatitis. Exp Dermatol, 2005. 14(9): p. 700-8. 58. Natsch, A., The Nrf2-Keap1-ARE toxicity pathway as a cellular sensor for skin sensitizers—functional relevance and a hypothesis on innate reactions to skin sensitizers. Toxicol Sci, 2010. 113(2): p. 284-92. 59. Schulz, V. J., et al., Aryl hydrocarbon receptor activation affects the dendritic cell phenotype and function during allergic sensitization. Immunobiology, 2013. 218(8): p. 1055-62. 60. Kohle, C. 
and K. W. Bock, Coordinate regulation of Phase I and II xenobiotic metabolisms by the Ah receptor and Nrf2. Biochem Pharmacol, 2007. 73(12): p. 1853-62. 61. Roberts, D. W., A. O. Aptula, and G. Patlewicz, Electrophilic chemistry related to skin sensitization. Reaction mechanistic applicability domain classification for a published dataset of 106 chemicals tested in the mouse local lymph node assay. Chem Res Toxicol, 2007. 20(1): p. 44-60. 62. Migdal, C., et al., Reactivity of chemical sensitizers toward amino acids in cellulo plays a role in the activation of the Nrf2-ARE pathway in human monocyte dendritic cells and the THP-1 cell line. Toxicol Sci, 2013. 133(2): p. 259-74. 63. Chipinda, I., J. M. Hettick, and P. D. Siegel, Haptenation: chemical reactivity and protein binding. J Allergy (Cairo), 2011. 2011: p. 839682. 64. Basketter, D. A., et al., Nothing is perfect, not even the local lymph node assay: a commentary and the implications for REACH. Contact Dermatitis, 2009. 60(2): p. 65-9. 65. Urbisch, D., et al., Assessing skin sensitization hazard in mice and men using non-animal test methods. Regul Toxicol Pharmacol, 2015. 71(2): p. 337-51. 66. Natsch, A., et al., Chemical basis for the extreme skin sensitization potency of (E)-4-(ethoxymethylene)-2-phenyloxazol-5(4H)-one. Chem Res Toxicol, 2010. 23(12): p. 1913-20. 67. Brodersen, K. H., et al., The Balanced Accuracy and Its Posterior Distribution, in Proceedings of the 2010 20th International Conference on Pattern Recognition. 2010, IEEE Computer Society. p. 3121-3124. 68. Dumont, C., et al., Analysis of the Local Lymph Node Assay (LLNA) variability for assessing the prediction of skin sensitisation potential and potency of chemicals with non-animal approaches. Toxicol In Vitro, 2016. 34: p. 220-8. 69. Hoffmann, S., LLNA variability: An essential ingredient for a comprehensive assessment of non-animal skin sensitization test methods and strategies. ALTEX, 2015. 32(4): p. 379-83. 70. 
Cottrez, F., et al., SENS-IS, a 3D reconstituted epidermis based model for quantifying chemical sensitization potency: Reproducibility and predictivity results from an inter-laboratory study. Toxicol In Vitro, 2016. 32: p. 248-60. 71. Natsch, A., T. Haupt, and H. Laue, Relating skin sensitizing potency to chemical reactivity: reactive Michael acceptors inhibit NF-kappaB signaling and are less sensitizing than S(N)Ar- and S(N)2-reactive chemicals. Chem Res Toxicol, 2011. 24(11): p. 2018-27. 72. Buchmann, A. M., S. Swaminathan, and B. Thimmapaya, Regulation of cellular genes in a chromosomal context by the retinoblastoma tumor suppressor protein. Mol Cell Biol, 1998. 18(8): p. 4565-76. 73. Xu, Z., et al., Structure and function of the PP2A-shugoshin interaction. Mol Cell, 2009. 35(4): p. 426-41. 74. Huang, M. and L. M. Graves, De novo synthesis of pyrimidine nucleotides; emerging interfaces with signal transduction pathways. Cell Mol Life Sci, 2003. 60(2): p. 321-36. 75. Richmond, A. L., et al., The nucleotide synthesis enzyme CAD inhibits NOD2 antibacterial function in human intestinal epithelial cells. Gastroenterology, 2012. 142(7): p. 1483-92 e6. 76. Benigni, R., C. Bossa, and O. Tcheremenskaia, A data-based exploration of the adverse outcome pathway for skin sensitization points to the necessary requirements for its prediction with alternative methods. Regul Toxicol Pharmacol, 2016. 78: p. 45-52. 77. Fitzpatrick, J. M., D. W. Roberts, and G. Patlewicz, What determines skin sensitization potency: Myths, maybes and realities. The 500 molecular weight cut-off: An updated analysis. J Appl Toxicol, 2016. 78. Bohme, A., et al., Kinetic glutathione chemoassay to quantify thiol reactivity of organic electrophiles—application to alpha,beta-unsaturated ketones, acrylates, and propiolates. Chem Res Toxicol, 2009. 22(4): p. 742-50. 79. Esser, P. R., et al., Contact sensitizers induce skin inflammation via ROS production and hyaluronic acid degradation. PLoS One, 2012. 
7(7): p. e41340. 80. Martin, S. F., et al., Mechanisms of chemical-induced innate immunity in allergic contact dermatitis. Allergy, 2011. 66(9): p. 1152-63. 81. Roggen, E. L. and B. J. Blaauboer, Sens-it-iv: A European Union project to develop novel tools for the identification of skin and respiratory sensitizers. Toxicology in Vitro, 2013. 27(3): p. 1121. 82. Heberle, H., et al., InteractiVenn: a web-based tool for the analysis of sets through Venn diagrams. BMC Bioinformatics, 2015. 16: p. 169.

EXAMPLE 1

(202) Introduction

(203) Based on previous studies, we hypothesized that the GARD assay is capable of predicting skin sensitizer potency. Two approaches were pursued in parallel to develop potency prediction models. The first uses our established support vector machine (SVM), which was trained to provide binary classifications [1-3]. The second is based on an orthogonal projections to latent structures (O-PLS) linear regression model (Simca, Umetrics, Sweden). O-PLS [5] is a projection method related to principal component analysis and is therefore well suited to matrices with more variables than observations (samples), as is the case for whole-genome RNA microarray data (>29,000 transcripts). It is a modification of the PLS method (Wold, 1975) designed to divide the structured variability in the matrix X into predictive information (correlated with Y) and orthogonal information (not correlated with Y), plus residual variation. This separation may improve the interpretability of the latent variables (linear combinations of the original variables).

(204) Results

(205) The relationship between GARD SVM decision values and human potency classes for a set of 34 chemicals and controls is illustrated in FIG. 9. The SVM model was developed for binary classification (sensitizer versus non-sensitizer).

(206) We investigated several models in a multivariate approach using the Simca software. Human potency [6] is a classification of chemicals based on available human data. Each potency category may combine substances that are chemically very different. Class 1 represents the highest potency and class 6 represents true non-sensitizers.

(207) Here, we present an O-PLS model, which is designed to predict two Ys, namely human potency and sensitizer/non-sensitizer. The scatter plot in FIG. 10 illustrates the separation of sensitizers and non-sensitizers, and the grouping of sensitizers along a second axis into a high- and a low-potency cluster.

(208) The fraction of the total variation that can be explained by the model after cross-validation (Q2cum) is 0.478, a value regarded as acceptable to good. A comparison of the goodness of fit and prediction of the original model with those of several models built on permuted Y-observations (FIG. 11) strongly indicates that the original model is valid. Observed Y versus predicted Y is plotted in FIG. 12.

(209) Discussion

(210) Our analyses show a clear relationship between our microarray data and human potency. Further model development is ongoing, and performance is expected to improve as the number and types of chemicals tested increase. We will also investigate multivariate analysis methods as a feature selection tool, which may ultimately lead to new insights into the mechanisms associated with sensitizer potency and provide a means of predicting the human potency of chemicals with high accuracy.

EXAMPLE 1 REFERENCES

(211)
1. Johansson H et al. A genomic biomarker signature can predict skin sensitizers using a cell-based in vitro alternative to animal tests. BMC Genomics 2011.
2. Johansson H et al. The GARD assay for assessment of chemical skin sensitizers. Toxicology in Vitro 2013.
3. Johansson H et al. GARD in-house validation—A proof of concept. Toxicological Sciences 2014.
4. Albrekt et al. Skin sensitizers differentially regulate signaling pathways in MUTZ-3 cells in relation to their individual potency. BMC Pharmacology and Toxicology 2014.
5. Trygg and Wold. Orthogonal projections to latent structures (O-PLS). Journal of Chemometrics 2002.
6. Basketter et al. Categorization of chemicals according to their relative human skin sensitizing potency. Dermatitis 2014.

EXAMPLE 2

(212) Introduction

(213) Based on previous studies, we hypothesized that the GARD assay is capable of predicting skin sensitizer potency. Two approaches were pursued in parallel to develop potency prediction models. The first uses our established Support Vector Machine (SVM), which was trained to provide binary classifications [1-3]. Its GARD SVM decision values correlate with human potency classes [4] for a set of 34 chemicals and controls, as illustrated in FIG. 9. The SVM model was developed for binary classification (sensitizer versus non-sensitizer).

(214) The other approach is based on Random Forest (RF) modeling, a decision tree-based method well suited for data sets with more variables than observations. It yields good predictive performance even when variables are noisy (no pre-selection required), and returns variable importance [5].

(215) Results

(216) With regard to the sensitizing potency of chemicals, the European Chemicals Agency (ECHA) proposes a categorization according to Regulation (EC) No 1272/2008 on classification, labelling and packaging of chemical substances and mixtures (CLP), consisting of 1A (strong sensitizers), 1B (weak sensitizers) and no category (non-sensitizers) [7]. Here, we present an RF model built on the arithmetic mean of transcript intensities from replicates in the training data, using the random forest varSelRF package [6] in R/Bioconductor version 3.1.2, with error rates estimated by the supplied 0.632+ bootstrap method. The model may also be built directly from the transcript intensities of the individual replicates. The model was trained using the training set described in Table 9 (70 substances) and then challenged with the test set in Table 10 (18 substances). A summary of the model performance is presented in Table 11. FIG. 13 comprises principal component analyses and heat map results of the described random forest model. A list of identified potency biomarkers is presented in Table 12.
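The replicate-averaging step described above can be sketched as follows. This is a minimal illustration only, not the original R workflow: the chemical names and intensity values are made up, the helper `mean_profiles` is hypothetical, and only the two transcript cluster IDs are taken from the tables in this document.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical replicate measurements: (chemical, {transcript cluster ID: intensity}).
# Intensities are invented for illustration; the IDs appear in Table 5.
replicates = [
    ("chemical A", {"8117594": 7.0, "8124385": 5.0}),
    ("chemical A", {"8117594": 8.0, "8124385": 6.0}),
    ("chemical A", {"8117594": 9.0, "8124385": 7.0}),
    ("chemical B", {"8117594": 4.0, "8124385": 3.0}),
    ("chemical B", {"8117594": 6.0, "8124385": 5.0}),
]


def mean_profiles(samples):
    """Collapse replicates into one arithmetic-mean profile per chemical."""
    grouped = defaultdict(lambda: defaultdict(list))
    for chemical, intensities in samples:
        for transcript, value in intensities.items():
            grouped[chemical][transcript].append(value)
    return {
        chemical: {t: fmean(values) for t, values in transcripts.items()}
        for chemical, transcripts in grouped.items()
    }


profiles = mean_profiles(replicates)
for chemical, profile in profiles.items():
    print(chemical, profile)
```

Each resulting profile would then form one row of the training matrix. Building the model from the individual replicates instead simply skips this step.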

(217) Discussion

(218) Our analyses show a clear relationship between our microarray data and potency information, comprising both human potency and CLP. Using multivariate analysis methods as a feature selection tool, new insights into the mechanisms associated with sensitizer potency are acquired. Further, this approach enables prediction of the sensitizer potency of chemicals with high accuracy.

EXAMPLE 2 REFERENCES

(219)
1. Johansson H et al. A genomic biomarker signature can predict skin sensitizers using a cell-based in vitro alternative to animal tests. BMC Genomics 2011.
2. Johansson H et al. The GARD assay for assessment of chemical skin sensitizers. Toxicology in Vitro 2013.
3. Johansson H et al. GARD in-house validation—A proof of concept. Toxicological Sciences 2014.
4. Basketter et al. Categorization of chemicals according to their relative human skin sensitizing potency. Dermatitis 2014.
5. Diaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006.
6. Diaz-Uriarte R. GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest. BMC Bioinformatics 2007.
7. https://echa.europa.eu/documents/10162/13562/clp_en.pdf.

(220) Tables

(221) TABLE-US-00001 TABLE A
Transcript cluster ID | VCF (%) | Gene Title | Gene Symbol | Gene assignment
Table A(i)
8117594 | 93 | histone cluster 1, H2bm | HIST1H2BM | NM_003521
8124385 | 86 | histone cluster 1, H4b | HIST1H4B | NM_003544
8124430 | 81 | histone cluster 1, H1d | HIST1H1D | NM_005320
8095221 | 80 | phosphoribosylaminoimidazole carboxylase, phosphoribosylaminoimidazole succinocarboxamide synthetase | PAICS | NM_001079524
8124413 | 69 | histone cluster 1, H4d | HIST1H4D | NM_003539
8117608 | 56 | histone cluster 1, H2al///histone cluster 1, H2bn | HIST1H2AL///HIST1H2BN | NM_003511
7994109 | 51 | polo-like kinase 1 | PLK1 | NM_005030
7904433 | 44 | phosphoglycerate dehydrogenase | PHGDH | ENST00000369407
8082350 | 44 | minichromosome maintenance complex component 2 | MCM2 | NM_004526
8141395 | 43 | minichromosome maintenance complex component 7 | MCM7 | NM_001278595
7903893 | 41 | CD53 molecule | CD53 | NM_000560
8118669 | 41 | kinesin family member C1 | KIFC1 | NM_002263
7938348 | 40 | WEE1 G2 checkpoint kinase | WEE1 | NM_001143976
7957737 | 34 | thymopoietin | TMPO | NM_001032283
8146357 | 34 | minichromosome maintenance complex component 4 | MCM4 | NM_005914
7918300 | 33 | proline/serine-rich coiled-coil 1 | PSRC1 | NM_001005290
8054329 | 31 | ring finger protein 149 | RNF149 | NM_173647
8055426 | 31 | minichromosome maintenance complex component 6 | MCM6 | NM_005915
8072687 | 29 | minichromosome maintenance complex component 5 | MCM5 | NM_006739
8003503 | 20 | Fanconi anemia complementation group A | FANCA | NM_000135
Table A(ii)
8040843 | 44 | carbamoyl-phosphate synthetase 2, aspartate transcarbamylase, and dihydroorotase | CAD | NM_004341
7898549 | 42 | MRT4 homolog, ribosome maturation factor | MRTO4 | NM_016183
7901091 | 41 | target of EGR1, member 1 (nuclear) | TOE1 | NM_025077
7900699 | 40 | cell division cycle 20 | CDC20 | NM_001255
8121087 | 36 | peptidase M20 domain containing 2 | PM20D2 | NM_00101085
8084630 | 35 | NmrA-like family domain containing 1 pseudogene | LOC344887 | NR_033752
7958455 | 30 | uracil DNA glycosylase | UNG | NM_003362
8119088 | 27 | cyclin-dependent kinase inhibitor 1A (p21, Cip1) | CDKN1A | NM_000389
8117395 | 26 | histone cluster 1, H2bf | HIST1H2BF | NM_003522
8124527 | 25 | histone cluster 1, H1b | HIST1H1B | NM_005322
7896697 | 21 | unknown | unknown | unknown
8097417 | 20 | jade family PHD finger 1 | JADE1 | NM_001287441
7977445 | 18 | KIAA0125 | KIAA0125 | NR_026800
7985213 | 17 | cholinergic receptor, nicotinic alpha 5 | CHRNA5 | NM_000745
8068478 | 17 | chromatin assembly factor 1, subunit B (p60)///MORC family CW-type zinc finger 3 | CHAF1B///MORC3 | NM_005441
8099721 | 16 | sel-1 suppressor of lin-12-like 3 (C. elegans) | SEL1L3 | NM_015187
7948192 | 14 | structure specific recognition protein 1 | SSRP1 | NM_003146
7960340 | 14 | forkhead box M1 | FOXM1 | NM_001243088
8107706 | 14 | lamin B1 | LMNB1 | NM_001198557
8124524 | 14 | histone cluster 1, H2ak | HIST1H2AK | NM_003510
8040712 | 11 | centromere protein A | CENPA | NM_001042426
8043602 | 10 | non-SMC condensin I complex subunit H | NCAPH | NM_001281710
8124394 | 7 | histone cluster 1, H2bb | HIST1H2BB | NM_021062
8144931 | 7 | ATPase, H+ transporting, lysosomal 56/58 kDa, V1 subunit B2 | ATP6V1B2 | NM_001693
7999025 | 5 | TNF receptor-associated protein 1 | TRAP1 | NM_001272049
Table A(iii)
8004804 | 83 | phosphoribosylformylglycinamidine synthase | PFAS | NM_012393
8005839 | 63 | transmembrane protein 97 | TMEM97 | NM_014573
7916432 | 61 | 24-dehydrocholesterol reductase | DHCR24 | NM_014762
7948656 | 30 | ferritin, heavy polypeptide 1 | FTH1 | NM_002032
8117408 | 30 | histone cluster 1, H2ae | HIST1H2AE | NM_021052
8002303 | 17 | NAD(P)H dehydrogenase, quinone 1 | NQO1 | NM_000903
7939341 | 8 | CD44 molecule (Indian blood group) | CD44 | NM_000610

(222) TABLE-US-00002 TABLE 1 37 novel chemicals with CLP annotations used to complement existing GARD data. S = sensitizer, NS = non-sensitizer.
Name | CAS# | CLP | cytotox | GARD input [M] | binary class (HP*) | GARD binary prediction
2,4-Dinitrofluorobenzene | 70-34-8 | 1A | yes | 0.00001 | S | S
3-Methylcatechol | 488-17-5 | 1A | yes | 0.00004 | S | S
bisphenol A-diglycidyl ether | 1675-54-3 | 1A | yes | 0.00005 | S | S
chlorpromazine | 50-53-3 | 1A | yes | 0.0000125 | S | S
cyanuric chloride | 108-77-0 | 1A | yes | 0.00005 | S | NS
glutaraldehyde | 111-30-8 | 1A | yes | 0.00002 | S | S
hexyl salicylate | 6259-76-3 | 1A | yes | 0.00007 | S | S
iodopropynyl butylcarbamate | 55406-53-6 | 1A | yes | 0.00001 | S | S
methyl heptine carbonate | 111-12-6 | 1A | yes | 0.0001 | S | S
p-benzoquinone | 106-51-4 | 1A | yes | 0.00005 | S | S
propyl gallate | 121-79-9 | 1A | yes | 0.000125 | S | S
abietic acid | 514-10-3 | 1B | no | 0.000125 | S | S
amylcinnamyl alcohol | 101-85-9 | 1B | yes | 0.0003 | S | S
anethole | 104-46-1 | 1B | no | 0.0005 | NS | S
aniline | 62-53-3 | 1B | no | 0.0005 | S | NS
anisyl alcohol | 105-13-5 | 1B | no | 0.0005 | NS | NS
benzocaine | 94-09-7 | 1B | no | 0.0005 | S | NS
benzyl benzoate | 120-51-4 | 1B | yes | 0.0003 | NS | S
butyl glycidyl ether | 2426-08-6 | 1B | yes | 0.0005 | S | NS
citral | 5392-40-5 | 1B | yes | 0.0000625 | S | S
citronellol | 106-22-9 | 1B | no | 0.0005 | NS | S
diethanolamine | 111-42-2 | 1B | no | 0.0005 | NS | NS
imidazolidinyl urea | 39236-46-9 | 1B | yes | 0.00005 | S | S
isopropyl myristate | 110-27-0 | 1B | no | 0.0005 | NS | NS
lilial | 80-54-6 | 1B | yes | 0.0001875 | S | S
limonene | 5989-27-5 | 1B | no | 0.0005 | S | NS
linalool | 78-70-6 | 1B | no | 0.0005 | S | NS
lyral | 31906-04-4 | 1B | yes | 0.0001 | S | S
pentachlorophenol | 87-86-5 | 1B | no | 0.0000625 | NS | NS
pyridine | 110-86-1 | 1B | no | 0.0005 | NS | NS
1-bromobutane | 109-65-9 | no cat | no | 0.0005 | NS | NS
benzoic acid | 65-85-0 | no cat | no | 0.0005 | NS | NS
benzyl alcohol | 100-51-6 | no cat | no | 0.0005 | NS | NS
citric acid | 77-92-9 | no cat | no | 0.0005 | NS | NS
dextran | 9004-54-0 | no cat | no | 0.00003 | NS | NS
kanamycin A | 25389-94-0 | no cat | no | 0.000125 | NS | NS
tartaric acid | 87-69-4 | no cat | no | 0.0005 | NS | NS
*based on [41] where available; HP 1-4 = S; 5-6 = NS. Otherwise according to CLP. See also Table 4.

(223) TABLE-US-00003 TABLE 2 11 benchmark chemicals.
Chemical | CAS | CLP | Binary class | HP | GARD input [M]
2,4-dinitrochlorobenzene | 97-00-7 | 1A | sens | 1 | 0.000004
p-phenylenediamine | 106-50-3 | 1A | sens | 1 | 0.000075
2-hydroxyethyl acrylate | 818-61-1 | 1A | sens | na | 0.0001
2-nitro-1,4-phenylenediamine | 5307-14-2 | 1A | sens | 2 | 0.0003
2-aminophenol | 95-55-6 | 1A | sens | 2 | 0.0001
resorcinol | 108-46-3 | 1B | sens | 4 | 0.0005
geraniol | 106-24-1 | 1B | sens | 4 | 0.0005
hexylcinnamic aldehyde | 101-86-0 | 1B | sens | 5 | 0.000032
benzaldehyde* | 100-52-7 | no cat | non-sens | 5 | 0.00025
chlorobenzene | 108-90-7 | no cat | non-sens | 6 | 0.000098
1-butanol | 71-36-3 | no cat | non-sens | 6 | 0.0005
*non-sens according to [41].

(224) TABLE-US-00004 TABLE 3 Training and test set composition.
Set | Total number | CLP 1A | CLP 1B | CLP no cat
Training set | 70 | 23 | 25 | 22
Test set | 18 | 6 | 6 | 6

(225) TABLE-US-00005 TABLE 4 Controls and 86 unique chemicals used to train and test the Random Forest model for the prediction of CLP categories.
Stimulation | HP | CLP | binary class | set | Toxtree protein binding class
1-bromobutane | na | no cat | non-sens | test | SN2
anethole | 5 | 1B | sens | test | MA
benzoic acid | na | no cat | non-sens | test | No binding
benzyl benzoate | 5 | 1B | sens | test | AT
bisphenol A-diglycidyl ether | 3 | 1A | sens | test | SN2
butyl glycidyl ether | 3 | 1B | sens | test | SN2
citric acid | na | no cat | non-sens | test | No binding
cyanuric chloride | na | 1A | sens | test | SNAr
diethyl maleate | 2 | 1B | sens | test | MA
diethyl phthalate | 6 | no cat | non-sens | test | No binding
ethyl vanillin | nf | no cat | non-sens | test | SB
glutaraldehyde | 2 | 1A | sens | test | SB
iodopropynyl butylcarbamate | 4 | 1A | sens | test | AT
linalool | 4 | 1B | sens | test | No binding
lyral | 2 | 1B | sens | test | SB
p-benzoquinone | na | 1A | sens | test | MA
propyl gallate | 2 | 1A | sens | test | MA
xylene^1 | 6 | no cat | non-sens | test | No binding
1-butanol | 6 | no cat | non-sens | train | No binding
2,4-dinitrochlorobenzene | 1 | 1A | sens | train | SNAr
2,4-dinitrofluorobenzene | na | 1A | sens | train | SNAr
2-aminophenol | 2 | 1A | sens | train | MA
2-hydroxyethyl acrylate | 3 | 1A | sens | train | MA
2-mercaptobenzothiazole | 3 | 1A | sens | train | AT
2-nitro-1,4-phenylenediamine | 2 | 1A | sens | train | MA
3-methylcatechol | na | 1A | sens | train | MA
4-methylaminophenol sulfate | 3 | 1A | sens | train | MA
4-nitrobenzylbromide | na | 1A | sens | train | SN2
abietic acid | 3 | 1B | sens | train | No binding
amylcinnamyl alcohol | 4 | 1B | sens | train | MA
aniline | 4 | 1B | sens | train | No binding
anisyl alcohol | 5 | 1B | sens | train | MA/SN2
benzaldehyde^2 | 5 | no cat | non-sens | train | SB
benzocaine | 4 | 1B | sens | train | No binding
benzyl alcohol | na | no cat | non-sens | train | No binding
chloroaniline | na | 1B | sens | train | No binding
chlorobenzene | na | no cat | non-sens | train | No binding
chlorpromazine | 3 | 1A | sens | train | SB
cinnamaldehyde | 2 | 1A | sens | train | MA
cinnamyl alcohol | 3 | 1B | sens | train | MA
citral | 3 | 1B | sens | train | SB
citronellol | 5 | 1B | sens | train | No binding
dextran | 6 | no cat | non-sens | train | SB
diethanolamine | 5 | 1B | sens | train | No binding
dimethyl formamide | nf | no cat | non-sens | train | nf
dimethyl sulfoxide^3 | 6 | no cat | non-sens | train | No binding
diphenylcyclopropenone | 1 | 1A | sens | train | MA
ethylenediamine | 3 | 1B | sens | train | SB
eugenol | 3 | 1B | sens | train | MA
formaldehyde | 2 | 1A | sens | train | SB
geraniol | 4 | 1B | sens | train | SB
glycerol | 6 | no cat | non-sens | train | No binding
glyoxal | 2 | 1A | sens | train | No binding
hexane | 6 | no cat | non-sens | train | No binding
hexyl salicylate | 4 | 1A | sens | train | No binding
hexylcinnamic aldehyde | 5 | 1B | sens | train | MA
hydroquinone | 3 | 1A | sens | train | MA
hydroxycitronellal | 4 | 1B | sens | train | SB
imidazolidinyl urea | 3 | 1B | sens | train | AT
isoeugenol | 2 | 1A | sens | train | MA
isopropanol | 5 | no cat | non-sens | train | No binding
isopropyl myristate | 5 | 1B | sens | train | No binding
kanamycin A | 6 | no cat | non-sens | train | No binding
Kathon CG | 1 | 1A | sens | train | nf
lactic acid | 6 | no cat | non-sens | train | No binding
lauryl gallate | 2 | 1A | sens | train | MA
lilial | 4 | 1B | sens | train | SB
limonene | 4 | 1B | sens | train | No binding
methyl heptine carbonate | 2 | 1A | sens | train | MA
methyl salicylate | 5 | no cat | non-sens | train | No binding
methyldibromo glutaronitrile | 2 | 1A | sens | train | MA/SN2
octanoic acid | 6 | no cat | non-sens | train | No binding
pentachlorophenol | 5 | 1B | sens | train | SNAr
phenol | 6 | no cat | non-sens | train | No binding
phenyl benzoate | 3 | 1B | sens | train | AT
phenylacetaldehyde | na | 1B | sens | train | SB
p-hydroxybenzoic acid | nf | no cat | non-sens | train | No binding
potassium dichromate | 1 | 1A | sens | train | No binding
potassium permanganate | nf | no cat | non-sens | train | nf
p-phenylenediamine | 1 | 1A | sens | train | MA
pyridine | 5 | 1B | sens | train | No binding
resorcinol | 4 | 1B | sens | train | MA
salicylic acid | 6 | no cat | non-sens | train | No binding
sodium dodecyl sulfate^4 | 6 | no cat | non-sens | train | SN2
tartaric acid | na | no cat | non-sens | train | No binding
tetramethylthiuram disulfide | 3 | 1B | sens | train | No binding
Tween 80 | 6 | no cat | non-sens | train | na
unstimulated | 6 | no cat | non-sens | train | nf
HP - Human potency; ^1, ^3, ^4 non-sens according to Basketter et al.; ^2 non-sens according to sens-it-iv project [81]; MA - Michael acceptor; SB - Schiff base formation; AT - Acyl transfer agent; SN2 - bi-molecular nucleophilic substitution; SNAr - nucleophilic aromatic substitution; na - not available; nf - not found.

(226) TABLE-US-00006 TABLE 5 The 52 variables identified by random forest modelling as optimal for CLP classification. VCF = variable call frequency.
Transcript cluster ID | VCF (%) | Gene Title | Gene Symbol
8117594 | 93 | histone cluster 1, H2bm | HIST1H2BM
8124385 | 86 | histone cluster 1, H4b | HIST1H4B
8004804 | 83 | phosphoribosylformylglycinamidine synthase | PFAS
8124430 | 81 | histone cluster 1, H1d | HIST1H1D
8095221 | 80 | phosphoribosylaminoimidazole carboxylase, phosphoribosylaminoimidazole succinocarboxamide synthetase | PAICS
8124413 | 69 | histone cluster 1, H4d | HIST1H4D
8005839 | 63 | transmembrane protein 97 | TMEM97
7916432 | 61 | 24-dehydrocholesterol reductase | DHCR24
8117608 | 56 | histone cluster 1, H2al///histone cluster 1, H2bn | HIST1H2AL///HIST1H2BN
7994109 | 51 | polo-like kinase 1 | PLK1
7904433 | 44 | phosphoglycerate dehydrogenase | PHGDH
8040843 | 44 | carbamoyl-phosphate synthetase 2, aspartate transcarbamylase, and dihydroorotase | CAD
8082350 | 44 | minichromosome maintenance complex component 2 | MCM2
8141395 | 43 | minichromosome maintenance complex component 7 | MCM7
7898549 | 42 | MRT4 homolog, ribosome maturation factor | MRTO4
7901091 | 41 | target of EGR1, member 1 (nuclear) | TOE1
7903893 | 41 | CD53 molecule | CD53
8118669 | 41 | kinesin family member C1 | KIFC1
7900699 | 40 | cell division cycle 20 | CDC20
7938348 | 40 | WEE1 G2 checkpoint kinase | WEE1
8121087 | 36 | peptidase M20 domain containing 2 | PM20D2
8084630 | 35 | NmrA-like family domain containing 1 pseudogene | LOC344887
7957737 | 34 | thymopoietin | TMPO
8146357 | 34 | minichromosome maintenance complex component 4 | MCM4
7918300 | 33 | proline/serine-rich coiled-coil 1 | PSRC1
8054329 | 31 | ring finger protein 149 | RNF149
8055426 | 31 | minichromosome maintenance complex component 6 | MCM6
7948656 | 30 | ferritin, heavy polypeptide 1 | FTH1
7958455 | 30 | uracil DNA glycosylase | UNG
8117408 | 30 | histone cluster 1, H2ae | HIST1H2AE
8072687 | 29 | minichromosome maintenance complex component 5 | MCM5
8119088 | 27 | cyclin-dependent kinase inhibitor 1A (p21, Cip1) | CDKN1A
8117395 | 26 | histone cluster 1, H2bf | HIST1H2BF
8124527 | 25 | histone cluster 1, H1b | HIST1H1B
7896697 | 21 | unknown | unknown
8003503 | 20 | Fanconi anemia complementation group A | FANCA
8097417 | 20 | jade family PHD finger 1 | JADE1
7977445 | 18 | KIAA0125 | KIAA0125
7985213 | 17 | cholinergic receptor, nicotinic alpha 5 | CHRNA5
8002303 | 17 | NAD(P)H dehydrogenase, quinone 1 | NQO1
8068478 | 17 | chromatin assembly factor 1, subunit B (p60)///MORC family CW-type zinc finger 3 | CHAF1B///MORC3
8099721 | 16 | sel-1 suppressor of lin-12-like 3 (C. elegans) | SEL1L3
7948192 | 14 | structure specific recognition protein 1 | SSRP1
7960340 | 14 | forkhead box M1 | FOXM1
8107706 | 14 | lamin B1 | LMNB1
8124524 | 14 | histone cluster 1, H2ak | HIST1H2AK
8040712 | 11 | centromere protein A | CENPA
8043602 | 10 | non-SMC condensin I complex subunit H | NCAPH
7939341 | 8 | CD44 molecule (Indian blood group) | CD44
8124394 | 7 | histone cluster 1, H2bb | HIST1H2BB
8144931 | 7 | ATPase, H+ transporting, lysosomal 56/58 kDa, V1 subunit B2 | ATP6V1B2
7999025 | 5 | TNF receptor-associated protein 1 | TRAP1

(227) TABLE-US-00007 TABLE 6 Test set predictions using majority voting.
Chemical | True CLP | GARD predicted CLP | Human potency [41] | Protein reactivity
1-bromobutane | no cat | no cat | na | SN2
benzoic acid | no cat | no cat | na | No binding
citric acid | no cat | no cat | na | No binding
diethyl phthalate | no cat | no cat | 6 | No binding
ethyl vanillin | no cat | no cat | nf | Schiff base formation
xylene | no cat | no cat | 6 | No binding
anethole | 1B | 1B | 5 | Michael acceptor
benzyl benzoate | 1B | 1B | 5 | Acyl transfer agent
linalool | 1B | 1B | 4 | No binding
lyral | 1B | 1A | 2 | Schiff base formation
butyl glycidyl ether | 1B | 1A | 3 | SN2
diethyl maleate | 1B | 1A | 2 | Michael acceptor
cyanuric chloride | 1A | no cat | na | SNAr
propyl gallate | 1A | 1A | 2 | Michael acceptor
bisphenol A-diglycidyl ether | 1A | 1A | 3 | SN2
glutaraldehyde | 1A | 1A | 2 | Schiff base formation
iodopropynyl butylcarbamate | 1A | 1A | 4 | Acyl transfer agent
p-benzoquinone | 1A | 1A | na | Michael acceptor

(228) TABLE-US-00008 TABLE 7 Statistics by class for separate replicates in the test set predictions.
Statistic | no cat | 1A | 1B
Sensitivity | 0.889 | 0.833 | 0.556
Specificity | 0.917 | 0.806 | 0.917
Pos predictive value | 0.842 | 0.682 | 0.769
Neg predictive value | 0.943 | 0.902 | 0.805
Prevalence | 0.333 | 0.333 | 0.333
Detection rate | 0.296 | 0.278 | 0.185
Detection prevalence | 0.352 | 0.407 | 0.241
Balanced accuracy | 0.903 | 0.819 | 0.736
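The per-class statistics in Table 7 follow from a one-versus-rest reading of the replicate-level confusion matrix. The sketch below is illustrative only and is not the software used for the original analysis: the counts were tallied from the replicate predictions listed in SUPPL. TABLE 1, and the helper name `class_stats` is hypothetical. It computes sensitivity, specificity, positive predictive value and balanced accuracy for each CLP class.

```python
# Replicate-level confusion matrix tallied from SUPPL. TABLE 1
# (outer key = true CLP class, inner key = predicted CLP class).
CLASSES = ["no cat", "1A", "1B"]
CONFUSION = {
    "no cat": {"no cat": 16, "1A": 0, "1B": 2},
    "1A": {"no cat": 2, "1A": 15, "1B": 1},
    "1B": {"no cat": 1, "1A": 7, "1B": 10},
}


def class_stats(confusion, cls):
    """One-versus-rest statistics for a single class (hypothetical helper)."""
    total = sum(sum(row.values()) for row in confusion.values())
    tp = confusion[cls][cls]                      # true class cls, predicted cls
    fn = sum(confusion[cls].values()) - tp        # true class cls, predicted other
    fp = sum(confusion[t][cls] for t in confusion if t != cls)
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


for cls in CLASSES:
    stats = class_stats(CONFUSION, cls)
    print(cls, {name: round(value, 3) for name, value in stats.items()})
```

Balanced accuracy is the mean of sensitivity and specificity, so a class such as 1B can have a high specificity and still score lower overall when many of its replicates are assigned to a neighbouring class.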

(229) TABLE-US-00009 SUPPL. TABLE 1 Predictions of replicates in the test set.
Chemical | true CLP | predicted CLP
bisphenol A-diglycidyl ether | 1A | 1A
bisphenol A-diglycidyl ether | 1A | 1A
bisphenol A-diglycidyl ether | 1A | 1A
cyanuric chloride | 1A | 1B
cyanuric chloride | 1A | no cat
cyanuric chloride | 1A | no cat
glutaraldehyde | 1A | 1A
glutaraldehyde | 1A | 1A
glutaraldehyde | 1A | 1A
iodopropynyl butylcarbamate | 1A | 1A
iodopropynyl butylcarbamate | 1A | 1A
iodopropynyl butylcarbamate | 1A | 1A
p-benzoquinone | 1A | 1A
p-benzoquinone | 1A | 1A
p-benzoquinone | 1A | 1A
propyl gallate | 1A | 1A
propyl gallate | 1A | 1A
propyl gallate | 1A | 1A
anethole | 1B | 1B
anethole | 1B | 1B
anethole | 1B | 1B
benzyl benzoate | 1B | 1B
benzyl benzoate | 1B | 1B
benzyl benzoate | 1B | 1B
butyl glycidyl ether | 1B | 1A
butyl glycidyl ether | 1B | 1B
butyl glycidyl ether | 1B | 1A
diethyl maleate | 1B | 1A
diethyl maleate | 1B | 1A
diethyl maleate | 1B | 1A
linalool | 1B | no cat
linalool | 1B | 1B
linalool | 1B | 1B
lyral | 1B | 1A
lyral | 1B | 1B
lyral | 1B | 1A
1-bromobutane | no cat | no cat
1-bromobutane | no cat | no cat
1-bromobutane | no cat | no cat
benzoic acid | no cat | no cat
benzoic acid | no cat | no cat
benzoic acid | no cat | no cat
citric acid | no cat | no cat
citric acid | no cat | 1B
citric acid | no cat | no cat
diethyl phthalate | no cat | no cat
diethyl phthalate | no cat | no cat
diethyl phthalate | no cat | no cat
ethyl vanillin | no cat | no cat
ethyl vanillin | no cat | 1B
ethyl vanillin | no cat | no cat
xylene | no cat | no cat
xylene | no cat | no cat
xylene | no cat | no cat
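The replicate predictions above can be collapsed into the per-chemical calls of Table 6 by majority voting. The following is a minimal sketch under stated assumptions: the replicate labels for five chemicals are copied from SUPPL. TABLE 1, the function name `majority_vote` is hypothetical, and the handling of ties (raising an error) is an assumption, since the source does not describe a tie-breaking rule.

```python
from collections import Counter

# Replicate-level predicted CLP classes for selected test chemicals,
# copied from SUPPL. TABLE 1.
REPLICATE_PREDICTIONS = {
    "cyanuric chloride": ["1B", "no cat", "no cat"],
    "butyl glycidyl ether": ["1A", "1B", "1A"],
    "linalool": ["no cat", "1B", "1B"],
    "lyral": ["1A", "1B", "1A"],
    "citric acid": ["no cat", "1B", "no cat"],
}


def majority_vote(labels):
    """Return the most frequent label; a tie raises an error (assumed behaviour)."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        raise ValueError("tie between classes; a tie-breaking rule is needed")
    return counts[0][0]


consensus = {
    chemical: majority_vote(labels)
    for chemical, labels in REPLICATE_PREDICTIONS.items()
}
for chemical, label in consensus.items():
    print(f"{chemical}: {label}")
```

With three replicates and three classes a three-way tie is possible in principle, which is why the sketch fails loudly rather than choosing a class silently.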

(230) TABLE-US-00010 TABLE B

| Gene Title | Gene Symbol | Transcript Cluster ID | Rank |
| --- | --- | --- | --- |
| Table B(i) | | | |
| Cyclin A2 | CCNA2 | 8102643 | 2 |
| Unknown | Unknown | 8151252 | 4 |
| Phosphatidylinositol glycan anchor biosynthesis, class W | PIGW | 8006634 | 5 |
| Small nuclear ribonucleoprotein D1 polypeptide 16kDa | SNRPD1 | 8020411 | 6 |
| Rho GTPase activating protein 19///ARHGAP19-SLIT1 readthrough (NMD candidate) | ARHGAP19///ARHGAP19-SLIT1 | 7935403 | 7 |
| Histone cluster 1, H2ab | HIST1H2AB | 8124391 | 8 |
| leucine-rich repeats and calponin homology (CH) domain containing 2 | LRCH2 | 8174610 | 9 |
| plakophilin 4///plakophilin 4 | PKP4///PKP4 | 8045860 | 11 |
| ribonucleoprotein, PTB-binding 2///ribonucleoprotein, PTB-binding 2 | RAVER2///RAVER2 | 7902023 | 12 |
| deoxyuridine triphosphatase///deoxyuridine triphosphatase | DUT///DUT | 7983594 | 13 |
| aurora kinase A | AURKA | 8067167 | 14 |
| Unknown | Unknown | 8055309 | 15 |
| NDC1 transmembrane nucleoporin | NDC1 | 7916316 | 16 |
| kinesin family member 22///kinesin family member 22 | KIF22///KIF22 | 7994620 | 17 |
| OK/SW-CL.58 | OK/SW-CL.58 | 7970828 | 18 |
| Unknown | Unknown | 7994343 | 19 |
| Table B(ii) | | | |
| Histone cluster 1, H2ab///histone cluster 1, H2ae | HIST1H2AB///HIST1H2AE | 8117408 | 1 |
| high mobility group box 3 | HMGB3 | 8170468 | 10 |

(231) TABLE-US-00011 TABLE 8 Chemical list. Human potency class 1 represents the highest potency; class 6 represents non-sensitizers.

| Number | Chemical | SVM DV | Human potency |
| --- | --- | --- | --- |
| 1 | Potassium dichromate | 8.45 | 1 |
| 2 | Dinitrochlorobenzene | 6.31 | 1 |
| 3 | PPD | 5.32 | 1 |
| 4 | Kathon CG | 3.76 | 1 |
| 5 | 2-Aminophenol | 5.98 | 2 |
| 6 | Formaldehyde | 2.32 | 2 |
| 7 | Glyoxal | 1.08 | 2 |
| 8 | Iso-eugenol | 1.91 | 2 |
| 9 | 2-Hydroxyethyl acrylate | 8.87 | 3 |
| 10 | Cinnamic alcohol | 2.94 | 3 |
| 11 | 2-Mercaptobenzothiazole | 2.52 | 3 |
| 12 | Ethylenediamine | 1.12 | 3 |
| 13 | Penicillin G | 1.06 | 3 |
| 14 | Eugenol | 1.02 | 3 |
| 15 | Geraniol | 1.76 | 4 |
| 16 | Resorcinol | 1.03 | 4 |
| 17 | Hexylcinnamic aldehyde | 1.24 | 5 |
| 18 | Isopropanol | −1.37 | 5 |
| 19 | Methyl salicylate | −1.53 | 5 |
| 20 | PABA | −1.04 | 5 |
| 21 | Propylene glycol | −1.18 | 5 |
| 22 | Benzaldehyde | −2.13 | 5 |
| 23 | Phenol | −1.08 | 6 |
| 24 | Octanoic acid | −1.19 | 6 |
| 25 | Tween 80 | −1.39 | 6 |
| 26 | Salicylic acid | −1.41 | 6 |
| 27 | Sodium lauryl sulfate | −1.52 | 6 |
| 28 | Chlorobenzene | −1.63 | 6 |
| 29 | Glycerol | −2.05 | 6 |
| 30 | 1-Butanol | −2.07 | 6 |
| 31 | Diethyl phthalate | −2.15 | 6 |
| 32 | Unstimulated | −2.20 | 6 |
| 33 | DMSO | −2.31 | 6 |
| 34 | Lactic acid | −2.38 | 6 |

(232) TABLE-US-00012 TABLE 9 Chemicals training set

| Chemical | Cas | CLP | Set | Vehicle | GARD input [M] |
| --- | --- | --- | --- | --- | --- |
| 2,4-dinitrochlorobenzene | 97-00-7 | 1A | training set | DMSO | 0.000004 |
| p-phenylenediamine | 106-50-3 | 1A | training set | DMSO | 0.0003 |
| sodium dodecyl sulfate | 151-21-3 | no cat | training set | H2O | 0.0002 |
| salicylic acid | 69-72-7 | no cat | training set | DMSO | 0.0005 |
| phenol | 108-95-2 | no cat | training set | H2O | 0.0005 |
| glycerol | 56-81-5 | no cat | training set | H2O | 0.0005 |
| lactic acid | 50-21-5 | no cat | training set | H2O | 0.0005 |
| chlorobenzene | 108-90-7 | no cat | training set | DMSO | 0.000098 |
| p-hydroxybenzoic acid | 99-96-7 | no cat | training set | DMSO | 0.00025 |
| benzaldehyde | 100-52-7 | no cat | training set | DMSO | 0.00025 |
| octanoic acid | 124-07-2 | no cat | training set | DMSO | 0.0005 |
| unstimulated | | no cat | training set | H2O | vehicle |
| dimethyl sulfoxide | 67-68-5 | no cat | training set | DMSO | vehicle |
| glyoxal | 107-22-2 | 1A | training set | H2O | 0.0003 |
| 2-mercaptobenzothiazole | 149-30-4 | 1A | training set | DMSO | 0.00025 |
| resorcinol | 108-46-3 | 1B | training set | H2O | 0.0005 |
| isoeugenol | 97-54-1 | 1A | training set | DMSO | 0.0003 |
| eugenol | 97-53-0 | 1B | training set | DMSO | 0.0003 |
| cinnamyl alcohol | 104-54-1 | 1B | training set | DMSO | 0.0005 |
| geraniol | 106-24-1 | 1B | training set | DMSO | 0.0005 |
| 2-nitro-1,4-phenylenediamine | 5307-14-2 | 1A | training set | DMSO | 0.0003 |
| isopropanol | 67-63-0 | no cat | training set | H2O | 0.0005 |
| Tween 80 | 9005-65-6 | no cat | training set | DMSO | 0.0005 |
| 2-hydroxyethyl acrylate | 818-61-1 | 1A | training set | H2O | 0.0001 |
| formaldehyde | 50-00-0 | 1A | training set | H2O | 0.00008 |
| Kathon CG | 96118-96-6 | 1A | training set | H2O | 0.0035% |
| hexylcinnamic aldehyde | 101-86-0 | 1B | training set | DMSO | 0.00603224 |
| 2-aminophenol | 95-55-6 | 1A | training set | DMSO | 0.0001 |
| methyl salicylate | 119-36-8 | no cat | training set | DMSO | 0.0005 |
| ethylenediamine | 107-15-3 | 1B | training set | H2O | 0.0005 |
| potassium dichromate | 7778-50-9 | 1A | training set | H2O | 0.0000015 |
| dimethyl formamide | 66-12-2 | no cat | training set | H2O | 0.0005 |
| 1-butanol | 71-36-3 | no cat | training set | DMSO | 0.0005 |
| potassium permanganate | 7722-64-7 | no cat | training set | H2O | 0.000038 |
| phenylacetaldehyde | 122-78-1 | 1B | training set | DMSO | 0.00005 |
| hydroquinone | 123-31-9 | 1A | training set | DMSO | 0.00002 |
| 4-nitrobenzylbromide | 100-11-8 | 1A | training set | DMSO | 0.0000015 |
| diphenylcyclopropenone | 886-38-4 | 1A | training set | DMSO | 0.000005 |
| hexane | 110-54-3 | no cat | training set | DMSO | 0.0005 |
| phenyl benzoate | 93-99-2 | 1B | training set | DMSO | 0.0002-0.0005 |
| 4-methylaminophenol sulfate | 55-55-0 | 1A | training set | DMSO | 0.000007 |
| hydroxycitronellal | 107-75-5 | 1B | training set | DMSO | 0.0005 |
| chloroanilin | 106-47-8 | 1B | training set | DMSO | 0.0005 |
| Tetramethylthiuram disulfide | 137-26-8 | 1B | training set | DMSO | 0.0000001 |
| lauryl gallate | 1166-52-5 | 1A | training set | DMSO | 0.000003 |
| cinnamaldehyde | 104-55-2 | 1A | training set | DMSO | 0.00005 |
| methyldibromo glutaronitrile | 35691-65-7 | 1A | training set | DMSO | 0.00002 |
| 2,4-dinitrofluorobenzene | 70-34-8 | 1A | training set | DMSO | 0.00001 |
| 3-methylcatechol | 488-17-5 | 1A | training set | DMSO | 0.00004 |
| abietic acid | 514-10-3 | 1B | training set | DMSO | 0.000125 |
| amylcinnamyl alcohol | 101-85-9 | 1B | training set | DMSO | 0.0003 |
| aniline | 62-53-3 | 1B | training set | DMSO | 0.0005 |
| anisyl alcohol | 105-13-5 | 1B | training set | DMSO | 0.0005 |
| benzocaine | 94-09-7 | 1B | training set | DMSO | 0.0005 |
| benzyl alcohol | 100-51-6 | no cat | training set | DMSO | 0.0005 |
| chlorpromazine | 50-53-3 | 1A | training set | DMSO | 0.0000125 |
| citral | 5392-40-5 | 1B | training set | DMSO | 0.0000625 |
| citronellol | 106-22-9 | 1B | training set | DMSO | 0.0005 |
| dextran | 9004-54-0 | no cat | training set | DMSO | 0.00003 |
| diethanolamine | 111-42-2 | 1B | training set | H2O | 0.0005 |
| hexyl salicylate | 6259-76-3 | 1A | training set | DMSO | 0.00007 |
| imidazolidinyl urea | 39236-46-9 | 1B | training set | DMSO | 0.00005 |
| isopropyl myristate | 110-27-0 | 1B | training set | DMSO | 0.0005 |
| kanamycin A | 25389-94-0 | no cat | training set | H2O | 0.000125 |
| lilial | 80-54-6 | 1B | training set | DMSO | 0.0001875 |
| limonene | 5989-27-5 | 1B | training set | DMSO | 0.0005 |
| methyl heptine carbonate | 111-12-6 | 1A | training set | DMSO | 0.0001 |
| pentachlorophenol | 87-86-5 | 1B | training set | DMSO | 0.0000625 |
| pyridine | 110-86-1 | 1B | training set | H2O | 0.0005 |
| tartaric acid | 87-69-4 | no cat | training set | DMSO | 0.0005 |

(233) TABLE-US-00013 TABLE 10 Chemicals test set

| Chemical | Cas | CLP | Set | Vehicle | GARD input [M] |
| --- | --- | --- | --- | --- | --- |
| diethyl phthalate | 84-66-2 | no cat | test set 1 | DMSO | 0.00005 |
| ethyl vanillin | 121-32-4 | no cat | test set 1 | DMSO | 0.0005 |
| diethyl maleate | 141-05-9 | 1B | test set 1 | DMSO | 0.00012 |
| xylene | 1330-20-7 | no cat | test set 1 | DMSO | 0.0005 |
| 1-brombutane | 109-65-9 | no cat | test set 1 | DMSO | 0.0005 |
| anethole | 104-46-1 | 1B | test set 1 | DMSO | 0.0005 |
| benzoic acid | 65-85-0 | no cat | test set 1 | DMSO | 0.0005 |
| benzyl benzoate | 120-51-4 | 1B | test set 1 | DMSO | 0.0003 |
| bisphenol A-diglycidyl ether | 1675-54-3 | 1A | test set 1 | DMSO | 0.00005 |
| butyl glycidyl ether | 2426-08-6 | 1B | test set 1 | DMSO | 0.0005 |
| citric acid | 77-92-9 | no cat | test set 1 | DMSO | 0.0005 |
| cyanuric chloride | 108-77-0 | 1A | test set 1 | DMSO | 0.00005 |
| glutaraldehyde | 111-30-8 | 1A | test set 1 | H2O | 0.00002 |
| iodopropynyl butylcarbamate | 55406-53-6 | 1A | test set 1 | DMSO | 0.00001 |
| linalool | 78-70-6 | 1B | test set 1 | DMSO | 0.0005 |
| lyral | 31906-04-4 | 1B | test set 1 | DMSO | 0.0001 |
| p-benzochinone | 106-51-4 | 1A | test set 1 | DMSO | 0.00005 |
| propyl gallate | 121-79-9 | 1A | test set 1 | DMSO | 0.000125 |

(234) TABLE-US-00014 TABLE 11 Model performance - external test set

| Statistic | CLP 1A | CLP 1B | CLP no cat |
| --- | --- | --- | --- |
| Accuracy | 0.93 | 0.76 | 0.90 |
| Sensitivity | 0.86 | 0.63 | 1.0 |
| Specificity | 1.0 | 0.90 | 0.80 |

(235) TABLE-US-00015 TABLE 12

| Ranking | Variable frequencies | Transcript Cluster ID | Gene Title | Gene Symbol |
| --- | --- | --- | --- | --- |
| 1 | 0.335 | 8117408 | histone cluster 1, H2ab///histone cluster 1, H2ae | HIST1H2AB///HIST1H2AE |
| 2 | 0.235 | 8102643 | cyclin A2 | CCNA2 |
| 3 | 0.22 | 8004804 | phosphoribosylformylglycinamidine synthase | PFAS |
| 4 | 0.21 | 8151252 | — | — |
| 5 | 0.175 | 8006634 | phosphatidylinositol glycan anchor biosynthesis, class W | PIGW |
| 6 | 0.165 | 8020411 | small nuclear ribonucleoprotein D1 polypeptide 16 kDa | SNRPD1 |
| 7 | 0.16 | 7935403 | Rho GTPase activating protein 19///ARHGAP19-SLIT1 readthrough (NM | ARHGAP19///ARHGAP19 |
| 8 | 0.145 | 8124391 | histone cluster 1, H2ab | HIST1H2AB |
| 9 | 0.1 | 8174610 | leucine-rich repeats and calponin homology (CH) domain containing 2 | LRCH2 |
| 10 | 0.085 | 8170468 | high mobility group box 3 | HMGB3 |
| 11 | 0.08 | 8045860 | plakophilin 4///plakophilin 4 | PKP4///PKP4 |
| 12 | 0.07 | 7902023 | ribonucleoprotein, PTB binding 2///ribonucleoprotein, PTB binding 2 | RAVER2///RAVER2 |
| 13 | 0.06 | 7983594 | deoxyuridine triphosphatase///deoxyuridine triphosphatase | DUT///DUT |
| 14 | 0.06 | 8067167 | aurora kinase A | AURKA |
| 15 | 0.045 | 8055309 | — | — |
| 16 | 0.03 | 7916316 | NDC1 transmembrane nucleoporin | NDC1 |
| 17 | 0.03 | 7994620 | kinesin family member 22///kinesin family member 22 | KIF22///KIF22 |
| 18 | 0.025 | 7970828 | OK/SW-CL.58 | OK/SW-CL.58 |
| 19 | 0.195 | 7994343 | | |
| 20 | 0.15 | 8141395 | minichromosome maintenance complex component 7 | MCM7 |

| Transcript Cluster ID | Transcript ID |
| --- | --- |
| 8117408 | ENST00000303910///GENSCAN00000029843///BC093862///NM_021052///BC093836 |
| 8102643 | ENST00000618014///ENST00000274026///GENSCAN00000033473///CR407692///BC104787///BC104783///AK291931///NM_001237 |
| 8004804 | ENST00000314666///ENST00000546020///ENST00000580356///BC031807///GENSCAN00000001671///BC167158///BC146768///AK295895///AK292804///XM_006721546///AK292402///NM_012393 |
| 8151252 | ENST00000521867///GENSCAN00000046362 |
| 8006634 | ENST00000620233///ENST00000614443///ENST00000619326///ENST00000616581///AB097518///NM_178517 |
| 8020411 | ENST00000582475///ENST00000579618///ENST00000303413///BC001721///BC072427///J03798///NM_001291916///NM_006938 |
| 7935403 | ENST00000358308///ENST00000466484///ENST00000371027///ENST00000487035///ENST00000358531///ENST00000492211///ENST00000479633///DQ338460///GENSCAN00000038278///BC114490///BC113888///AY336750///AK303055///AK093316///NR_037909///AK090447///NM_001256423///NM_032900///NM_001204300 |
| 8124391 | ENST00000615868///GENSCAN00000029813///AK311785///BC125140///NM_003513 |
| 8174610 | ENST00000538422///ENST00000317135///GENSCAN00000029788///BC125224///NM_020871///NM_001243963 |
| 8170468 | ENST00000325307///BX537505///NM_005342 |
| 8045860 | ENST00000480171///ENST00000421462///ENST00000452162///ENST00000389759///ENST00000389757///ENST00000426248///GENSCAN00000017165///BC034473///BC050308///AK124217///AK055823///AK054911///NM_001005476///NM_003628 |
| 7902023 | ENST00000294426///ENST00000371072///GENSCAN00000026769///NM_018211///BC065303 |
| 7983594 | ENST00000558978///ENST00000455976///ENST00000558472///ENST00000331200///ENST00000559416///U90223///GENSCAN00000024989///U62891///U31930///CR541781///M89913///BC110377///CR541720///BC070339///BC033645///AK298464///AB049113///AK291515///NM_001948///NM_001025248 |
| 8067167 | ENST00000395914///ENST00000456249///ENST00000422322///ENST00000441357///ENST00000395907///ENST00000395913///ENST00000395915///ENST00000312783///ENST00000395911///ENST00000347343///ENST00000371356///D84212///BC027464///BC006423///BC002499///BC001280///AK301769///AF011468///AF008551///NM_198437///NM_198434///NM_198435///NM_198433///NM_003600///NM_198436 |
| 8055309 | — |
| 7916316 | DQ141696///ENST00000371429///BC003082///AK302910///AK298909///AK295000///NR_033142///NM_018087///XM_006710762///NM_001168551 |
| 7994620 | ENST00000563263///ENST00000569382///ENST00000400751///ENST00000570173///ENST00000569636///ENST00000160827///ENST00000561482///L29096///GENSCAN00000029536///BT007259///BC004352///BC028155///AK316389///AK316050///NM_007317///AB017430///AK294380///NM_001256270///NM_001256269 |
| 7970828 | AB064667 |
| 7994343 | NONHSAT141477///ENST00000363268 |
| 8141395 | ENST00000354230///ENST00000621318///ENST00000425308///ENST00000463722///ENST00000477372///ENST00000303887///ENST00000485286///ENST00000343023///ENST00000489841///GENSCAN00000001629///D86748///D55716///D28480///BC013375///BC009398///NM_182776///AK293172///NM_005916///NM_001278595 |
