MAC: Methylation Array Classifier


MAC will classify Illumina 450k and EPIC medulloblastoma methylation array data into one of four molecular subgroups.

Download test data to try the classifier: Test ZIP file

Array processing and classification takes ~100 seconds for this test set of 24 arrays.

Unclassifiable samples are those for which a confident subgroup call could not be made
Download table as .csv



Unclassifiable and Array QC failed samples are not shown in this plot. Boxes show the confidence interval for subgroup assignment generated by bootstrapping, and the individual data points represent the final probability associated with each subgroup call.
Download plot as high-res .png



MAC: Methylation Array Classifier 1.3.2-2.1


Machine learning model and classifier code: Reza Rafiee

Shiny web app code and adaptation: Matthew Bashton

Metagene model: Ed Schwalbe


Overview

MAC: Methylation Array Classifier will classify Illumina 450k and EPIC medulloblastoma methylation array data into one of four molecular subgroups: WNT, SHH, Group 3 and Group 4.

In summary the classifier works as described below:

  1. Illumina 450k or EPIC data is normalised using the preprocessNoob function from minfi.
  2. The detection p-value for each probe on each array is obtained using minfi. Any array in which more than 5% of probes have a detection p-value greater than 0.05 is considered to have failed Array QC and will not be classified.
  3. The β-values of the 10,000 probes used in our classifier are then extracted from the normalised data for each sample.
  4. The 10,000 β-values for each sample are projected into the metagene space of our existing model, which is derived from the 450k data of our discovery cohort of 434 medulloblastoma samples.
  5. A multi-class optimised Support Vector Machine (SVM), validated and trained on 220 samples from our 450k medulloblastoma cohort, is used to robustly assign a subgroup to each sample in the projected test set produced in the previous step.
    • Our SVM is validated by bootstrapping: 1,000 iterations, each drawing a random 80% of the training set. The confidence interval derived from this is plotted on the Classification Graph as a box plot.
    • The final probability assignment for a subgroup call is made by creating an SVM model with the whole (220 sample) 450k training set; these probabilities are given in the Classification Table in the initial tab.
    • Calls made with a probability below our predefined threshold are considered unreliable, and these samples are labelled as Unclassifiable in the Classification Table. They are not plotted in the Classification Graph.
  6. Various post-processing and formatting operations are then applied to the data. The interactive website is implemented in the R Shiny reactive web application framework.
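
The Array QC rule in step 2 can be illustrated with a short sketch. This is written in Python for illustration only and is not MAC's R code; the function names and the toy p-values are hypothetical, and only the two 0.05 cut-offs come from the description above.

```python
from typing import Dict, List

P_VALUE_CUTOFF = 0.05       # a probe counts as failed if its detection p-value exceeds this
MAX_FAILED_FRACTION = 0.05  # an array fails QC if more than 5% of its probes failed


def failed_fraction(p_values: List[float]) -> float:
    """Return the fraction of probes whose detection p-value exceeds the cutoff."""
    if not p_values:
        raise ValueError("an array must have at least one probe")
    failed = sum(1 for p in p_values if p > P_VALUE_CUTOFF)
    return failed / len(p_values)


def array_qc(detection_p: Dict[str, List[float]]) -> Dict[str, bool]:
    """Map each array name to True (passes Array QC) or False (fails Array QC)."""
    return {
        name: failed_fraction(p_values) <= MAX_FAILED_FRACTION
        for name, p_values in detection_p.items()
    }


# Toy data with 100 probes per array (hypothetical values, far fewer probes
# than a real 450k or EPIC array).
example = {
    "array_ok": [0.001] * 96 + [0.2] * 4,    # 4% of probes fail
    "array_bad": [0.001] * 90 + [0.2] * 10,  # 10% of probes fail
}
qc_result = array_qc(example)
```

In this sketch an array with exactly 5% failed probes still passes, since the rule as stated applies to arrays with more than 5%.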

Reference

A manuscript is in preparation.


Download

The R code for this Shiny-based website, including the training and validation cohorts, can be downloaded from GitHub. The website can also be run locally using RStudio; instructions and dependencies are outlined on GitHub.


Funding

MAC: Methylation Array Classifier development was funded by a Cancer Research UK program grant.

How to use our Classifier

MAC: Methylation Array Classifier will classify 450k or EPIC microarray medulloblastoma methylation data into one of four molecular subgroups. To use the classifier, follow the steps outlined below:

  1. A compressed ZIP (.zip) file containing all Illumina red and green channel .idat files, as well as a sampleSheet.csv file, is needed as input to the classifier (see Input file format below for details). If you would like to test drive the classifier, or would like to see what an archive should contain, a test ZIP file can be downloaded using the link in the grey box on the left.

  2. A ZIP file can then be uploaded by clicking on the 'Choose File' or 'Browse...' (browser dependent) button on the left. Once the file is uploaded, classification happens automatically.

  3. By default the Classification Table output is preselected and will present you with a four-subgroup medulloblastoma classification for each of your samples. Tabs presenting other information can then be accessed by clicking their names at the top of the main panel.

  4. The contents of tables can be downloaded by clicking the grey download button. These .csv files can then be loaded into Excel or other spreadsheet software if required.

  5. The Classification Plot can also be downloaded as a .png by clicking on the grey Download button at the bottom of the Classification Plot tab.


Input file format

Each array on an Illumina Infinium chip produces two corresponding files, such as 9403904132_R01C01_Grn.idat and 9403904132_R01C01_Red.idat; there will be a total of 12 (450k) or 8 (EPIC) such pairs for each chip. The first part of the file name corresponds to the Sentrix barcode and the second to the location of the sample on the chip. All pairs of idats for each array need to be zipped up into a ZIP archive, along with a CSV file SampleSheet.csv, in order to be uploaded to the classifier.
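
The naming convention can be checked with a short Python sketch. The regular expression is an assumption inferred from the two example file names above (a numeric barcode, an RxxCxx position, and a Grn or Red channel); it is not taken from MAC's code, and the names IDAT_PATTERN, IdatName and parse_idat_name are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: <Sentrix barcode>_<RxxCxx position>_<Grn|Red>.idat
IDAT_PATTERN = re.compile(
    r"^(?P<sentrix_id>\d+)_(?P<position>R\d{2}C\d{2})_(?P<channel>Grn|Red)\.idat$"
)


class IdatName(NamedTuple):
    sentrix_id: str  # Sentrix barcode of the chip
    position: str    # row/column location of the sample on the chip
    channel: str     # "Grn" or "Red"


def parse_idat_name(file_name: str) -> Optional[IdatName]:
    """Split an idat file name into its parts, or return None if it does not match."""
    match = IDAT_PATTERN.match(file_name)
    if match is None:
        return None
    return IdatName(match["sentrix_id"], match["position"], match["channel"])


parsed = parse_idat_name("9403904132_R01C01_Grn.idat")
```

Here `parsed.sentrix_id` and `parsed.position` are the two strings that must also appear in the Sentrix_ID and Sentrix_Position columns of the sample sheet.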

The SampleSheet.csv file takes the format of:


[Header],,,,,,
Investigator_Name,,,,,,
Project_Name,,,,,,
Experiment_Name,,,,,,
Date,16/02/14,,,,,
,,,,,,
,,,,,,
[Data],,,,,,
Sample_Name,Sample_Plate,Sample_Group,Sample_Well,Pool_ID,Sentrix_ID,Sentrix_Position
Sample1,11,,D01,,9421912041,R04C01
Sample2,11,,E01,,9421912041,R05C01
Sample3,11,,F01,,9421912041,R06C01
Sample4,11,,A02,,9421912041,R01C02
Sample5,11,,B02,,9421912041,R02C02
Sample6,11,,C02,,9421912041,R03C02

The header section, whatever its size and content, will be skipped automatically by MAC: Methylation Array Classifier as long as the column ID row is present and formatted like this:

Sample_Name,Sample_Plate,Sample_Group,Sample_Well,Pool_ID,Sentrix_ID,Sentrix_Position

Following on from the header, each line corresponds to a sample on the chip(s) for the .idat files present, with the first column containing the sample name and columns six and seven containing the Sentrix ID and the row/column position on the chip; these two strings must match up with the file naming convention mentioned above.
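
As an illustration of this layout, the Python sketch below locates the column ID row, skips everything before it, and derives the pair of idat file names each sample row implies. It is a minimal sketch and not MAC's own parser; the function names are hypothetical, and it assumes the column ID row matches the seven column names exactly.

```python
import csv
import io
from typing import Dict, List

COLUMN_ROW = ["Sample_Name", "Sample_Plate", "Sample_Group", "Sample_Well",
              "Pool_ID", "Sentrix_ID", "Sentrix_Position"]


def read_samples(sheet_text: str) -> List[Dict[str, str]]:
    """Return one dict per sample row, skipping whatever precedes the column ID row."""
    rows = list(csv.reader(io.StringIO(sheet_text, newline="")))
    if COLUMN_ROW not in rows:
        raise ValueError("column ID row not found in sample sheet")
    start = rows.index(COLUMN_ROW)
    samples = []
    for row in rows[start + 1:]:
        if not any(cell.strip() for cell in row):
            continue  # ignore blank or comma-only lines
        samples.append(dict(zip(COLUMN_ROW, row)))
    return samples


def expected_idat_files(sample: Dict[str, str]) -> List[str]:
    """Build the green and red channel file names implied by one sample row."""
    base = f"{sample['Sentrix_ID']}_{sample['Sentrix_Position']}"
    return [f"{base}_Grn.idat", f"{base}_Red.idat"]


example_sheet = (
    "[Header],,,,,,\n"
    "Date,16/02/14,,,,,\n"
    ",,,,,,\n"
    "[Data],,,,,,\n"
    "Sample_Name,Sample_Plate,Sample_Group,Sample_Well,Pool_ID,Sentrix_ID,Sentrix_Position\n"
    "Sample1,11,,D01,,9421912041,R04C01\n"
    "Sample2,11,,E01,,9421912041,R05C01\n"
)
samples = read_samples(example_sheet)
```

For the first row of the example, `expected_idat_files(samples[0])` gives 9421912041_R04C01_Grn.idat and 9421912041_R04C01_Red.idat.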

The SampleSheet.csv file must be included in the ZIP archive in order for MAC: Methylation Array Classifier to associate each pair of idats with each sample; if it is missing, the classifier will not work. Finally, the last line in the sampleSheet.csv must be terminated with a carriage return.
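
Before uploading, an archive can be checked against these requirements with a sketch like the one below. It is a hypothetical pre-upload check in Python, not part of MAC. It assumes the sample sheet may sit inside a folder in the archive, matches its name case-insensitively, and treats a trailing line feed as well as a carriage return as a terminated last line; whether MAC is equally lenient is not stated here.

```python
import csv
import io
import zipfile
from typing import List

COLUMN_ROW = ["Sample_Name", "Sample_Plate", "Sample_Group", "Sample_Well",
              "Pool_ID", "Sentrix_ID", "Sentrix_Position"]


def check_archive(zip_bytes: bytes) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Map base file names to archive members, ignoring any folder prefix.
        members = {name.rsplit("/", 1)[-1]: name for name in archive.namelist()}
        sheet = next(
            (member for base, member in members.items()
             if base.lower() == "samplesheet.csv"),
            None,
        )
        if sheet is None:
            return ["sample sheet missing from archive"]
        text = archive.read(sheet).decode("utf-8")

    if not text.endswith(("\n", "\r")):
        problems.append("last line of sample sheet is not terminated")

    rows = list(csv.reader(io.StringIO(text, newline="")))
    if COLUMN_ROW not in rows:
        return problems + ["column ID row not found"]

    for row in rows[rows.index(COLUMN_ROW) + 1:]:
        if not any(cell.strip() for cell in row):
            continue  # ignore blank or comma-only lines
        sample = dict(zip(COLUMN_ROW, row))
        base = f"{sample['Sentrix_ID']}_{sample['Sentrix_Position']}"
        for channel in ("Grn", "Red"):
            file_name = f"{base}_{channel}.idat"
            if file_name not in members:
                problems.append(f"{sample['Sample_Name']}: missing {file_name}")
    return problems
```

A call such as `check_archive(open("arrays.zip", "rb").read())` (the file name is only an example) returns an empty list when the sheet is present, terminated, and every sample has both of its idat files.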


Support

If you have any issues with using MAC: Methylation Array Classifier please contact Reza Rafiee.


WARNING: MAC: Methylation Array Classifier is for research use only, and should only be used on samples with a confirmed histopathological diagnosis of medulloblastoma.