Fourth Annual DNA Grantees' Workshop
Monday, June 23, 2003
AFTERNOON SESSION
Automated STR Mixture Analysis
Mark W. Perlin
Biography
MS. TOMSEY: Our next speaker is Dr. Mark Perlin. All of us know his dedication toward the expert systems and helping the throughput of the massive volumes of data that we generate on both convicted-offender samples and casework samples.
Dr. Perlin develops biomedical information and automated technologies. He's been working in genetics for more than 10 years. Dr. Perlin invented linear mixture analysis and deconvolution-based STR (short tandem repeat) genotyping technologies and directed the development of TrueAllele automated data scoring software.
Dr. Perlin is the chief executive officer and founder of Cybergenetics in Pittsburgh. He holds an adjunct faculty appointment in computer science at Carnegie Mellon University and in human genetics at the University of Pittsburgh. He received his bachelor of arts in chemistry from State University of New York in Binghamton, a Ph.D. in mathematics from the City University of New York Graduate School, an M.D. from the University of Chicago's Pritzker School of Medicine, and a Ph.D. in computer science from Carnegie Mellon University; and he completed his transitional residency at Mercy Hospital in Pittsburgh and is a fellow at IBM's Watson research facility. Quite a bit of pedigree here.
Today he will talk about automated STR mixture analysis. I'd like to welcome Dr. Mark Perlin. Perlin: Slide 1
DR. PERLIN: I'd like to thank our hosts for inviting our group to talk about automated STR mixture analysis. The problem that we're dealing with is the increasing casework load, which probably isn't going to get any better now that we're throwing more money at it. From some of this morning's talks, you know that there are bottlenecks in collecting and generating data. Perlin: Slide 2
In Joe Valero and Barry Duceman's article on dealing with increasing casework demands and DNA analysis [Valero, J., and B. Duceman, "Dealing with Increasing Casework Demands for DNA Analysis" Profiles in DNA 502 (2002): 3–6], it states that the next bottleneck is the interpretation bottleneck. So that's what I'll be talking about for the duration.
Our goal in this is what I like to term simplicity. There are four main pillars of this. The first is time. You want to get results back with good turnaround time. It would be nice to hand a result back to the police in days, not months. This isn't science fiction. It's actually happening in parts of the world, and we're working on this, but bear with me. So turnaround time is great. If you can get a 24-hour turnaround time of casework to police, that's something very good when working with active investigations. Perlin: Slide 3
Second is information. Information is something that the prosecutor, defense attorneys, and police work with. I'll be talking about information in terms of discriminating power.
The trier of fact is interested in understanding what these results might mean and, as we heard this morning, maybe the attorneys would like to understand what some of this means, too. So understandability or clarity is a nice goal to have for that audience.
In the end, the reliability of forensic results is good for society and for admissibility in courts.
So these are the four themes. I'll be touching on them as we go along.
We've developed an expert computer system. Actually, I'll be talking about version 14 of the Casework Mixture Analysis System. The goal of it is to handle the complexities of the case: looking at all the thousands of peaks and figuring out what they mean, putting that under the hood with mathematical modeling that can capture the biology and chemistry of the STR process with statistical modeling that can give probability results that are actually more understandable than many other things. I'll be describing that. A difference, for example, between version 14 and version 15, which we're finishing up now, is increasing the statistical modeling from hundreds of equations to thousands of equations and providing the computer processing that can use the math and statistics to automate this interpretation. Perlin: Slide 4
In order to demonstrate scientific reliability, it's good to do a validation study. Let me first begin with the design of the study. This was designed in collaboration with several laboratories, including Virginia and Palm Beach County Sheriff's Office. Perlin: Slide 5
The colors blue, green, and red represent the three axes of the study. In blue, you can see that we did a series of mixture proportions (90 percent and 70 percent)—90 percent being a clear major contributor down to 10 percent, which is sometimes a less than clear minor contributor. Green was the dilution axis. You can see we did 1 nanogram, 1/2, 1/4, and 1/8, moving down by a factor of 2. Then to make sure there wasn't any bias in the particular pair of male-female individuals that we put together for the mock rape kits, we created two different sets: set 1 with all these samples and then set 2. That was the overall design that was sent out to a number of labs to look at.
Next came the actual data generation. NIST (National Institute of Standards and Technology) was kind enough to volunteer to generate all of these mixed DNA templates, and then Cybergenetics developed laboratory protocols. We actually sent out about 50 pages of laboratory protocols, unsuccessfully thinking of absolutely everything that could go wrong: "Hold the pipette" (indicating). Perlin: Slide 6
What happened is that NIST sent out stock solutions to each lab, who would do their own dilutions, depending on the volumes that they needed, and prepare their own samples. The data generation was done across 10 laboratories. Some States have more than one lab in them, including Florida, Maryland, New York, Ohio, Pennsylvania, Virginia, and I guess the State of the United Kingdom, if I'm going to be consistent.
All sequencers and chemistries in current forensic use were represented, which was good. We added a few more groups at the end to make sure that happened. In addition to doing these 56 samples, we had each lab send in approximately 100 samples, just regular single-source samples, which I won't talk about today, but it'll be in a paper, where the computer calibrated the stutter properties, the pref-am (preferential amplification) properties, the peak variation, and all the different sorts of variation that it could find, because we used that PCR (polymerase chain reaction) variation as a way to model and understand the uncertainty in the data.
The interpretation went first with the data analysis. In the analysis, we used a program called TrueAllele. It takes a lot of knowledge about different processes and the original DNA sequencer files and transforms them into a database of quality-checked peaks. Any peak that appears to be of a height greater than zero is in the database. Perlin: Slide 7
That takes 3 to 5 minutes of total time, mainly quality-checking the computer's quality check. TrueAllele has other applications. There's another module that I won't talk about further, but it's also good for interpreting reference samples, which is useful for CODIS (Combined DNA Index System) data review.
Last year in the United States, I talked about our NIJ study, which we're getting ready for publication, where we validated and showed that TrueAllele eliminated most of the time and error and that maybe only 10 or 15 percent of the data actually needed to be looked at by a person.
In the U.K., when they deployed TrueAllele—they now have it in production—their staff went from 75 people down to 2. They're now running on one or two computers that examine 35,000 samples a month, and it's a nicely validated system. They're writing up their papers now on a study of it. They're actually using the extension of TrueAllele for their burglary process, examining 100,000 samples a year.
So it's good for that type of single-source sample. You'll be hearing from Barry Duceman about the New York experience. They've been waiting now for about a year to use TrueAllele and are starting to get on board with it. Again, it's nicely validated. Barry may raise the ethical question: If you have a system that makes no error, then why are people doing something? I'll leave that for Barry; I'd rather not go there.
DR. DUCEMAN: Thanks, Mark.
DR. PERLIN: You're welcome.
So moving on now to casework interpretation. We're currently working with a system called JustAllele that takes the quality-checked peak database and automatically transforms it into the forensic results that we'd like to have. Perlin: Slide 8
Mixing weights are good for scientists to look at. Genotype confidence sets, in the end, are what give you all the information, and that's what I'll focus on. In looking at the study for scientists, it's nice to have some good visualization of the results. These take a few minutes of computer time per mock rape kit but no human time at all because it's fully automated. That just gives you a sense of the time that analysts have to spend on these things.
I'm going to go through an example now of a typical two-contributor unknown-suspect mock rape kit, which is how we designed the study. There are a lot of data in the study, but the mixtures were all two-person mixtures, two-contributor mixtures. Perlin: Slide 9
So here we have two samples. This is from our lab. It's 1 nanogram of DNA (PowerPlex 16) run out on an ABI (Applied Biosystems) 310. The two samples were sample A, which was a reference sample, and sample C, which in the study design has a 30-percent minor contributor. It happens to be individual G, but the computer doesn't know that. It just knows for each of these cases that it has a reference and there's some mixture and it's trying to solve a two-contributor case. It can do more, but that's what this study was about.
This is what the data look like. On the left you see an ordinary reference sample, and on the right is something that should be instantly recognizable as a 2:1 mixture of two contributors. Perlin: Slide 10
One thing that comes out of the system is a set of spreadsheet files. These are the mixture weights. In the study we know that the reference sample is a reference sample. We can think of the contributors as the rows and the samples as the columns. There are two contributors. The second contributor will end up being the unknown contributor. If you go to the second column, then you see that the mixing proportions are, in this case, 68 plus or minus 2 percent, and so on, which is roughly a 2:1 mixture. Perlin: Slide 11
There are error bars because this is a very statistical system and there's no such thing as an exact estimate from data, no matter how perfect the data might be. So you have to have some sort of notion of a confidence interval or confidence set, which will appear again when we talk about what genotypes look like.
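As a rough illustration of what a mixing weight is, the following Python sketch estimates one weight at a single locus by least squares. It assumes a simple linear model in which the mixture's normalized peak heights are a weighted sum of the two contributors' allele patterns, and it assumes both genotypes are already known. The function name and all the numbers are hypothetical. This is not the system's actual method, which also infers the unknown genotype and reports error bars.

```python
def estimate_mixing_weight(mixture, contributor_a, contributor_b):
    """Least-squares weight w of contributor A, assuming
    mixture ~= w * A + (1 - w) * B at one locus (an illustrative model)."""
    diff = [a - b for a, b in zip(contributor_a, contributor_b)]
    resid = [m - b for m, b in zip(mixture, contributor_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("contributors have identical genotypes at this locus")
    w = sum(r * d for r, d in zip(resid, diff)) / denom
    # Keep the estimate inside the physically meaningful range [0, 1].
    return min(1.0, max(0.0, w))


# Hypothetical allele patterns over four allele positions (each sums to 1).
victim = [0.5, 0.5, 0.0, 0.0]
unknown = [0.0, 0.0, 0.5, 0.5]
# Hypothetical normalized peak heights with a little noise.
peaks = [0.17, 0.15, 0.33, 0.35]

w_victim = estimate_mixing_weight(peaks, victim, unknown)
print(f"victim weight: {w_victim:.2f}, unknown weight: {1 - w_victim:.2f}")
```

With these made-up peaks, the unknown contributor comes out at roughly twice the victim's share, which is the kind of 2:1 split discussed above.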
So what do the genotypes look like? Well, let me go through an example of what all this looks like. Take a look at D18. The first two columns are the two alleles; 13 and 15 are the designations. The third column gives a probability, and the fourth column gives the cumulative probability, which I'll go into in a second. Then, to make the study a bit easier, there's a "1" that appears as the correct answer, which makes it easier for us to figure out what's going on. This replaced lots of human time of going back and looking at a sheet of paper. Perlin: Slide 12
Now, if you look at something that's a bit more ambiguous—in this particular case, the 1—this is the profile of the inferred 30-percent minor contributor. That's what you're looking at.
Then in this case, there were three genotypes for TH01 in the confidence set: 9, 9.3; 6, 9.3; and 9.3, 9.3. We basically stop once we hit around 99 percent. So in the fourth column there's the cumulative probability after you have the first genotype, and then you go down to the second one and the third one, and then you stop. It's just like any other statistics where you have a confidence set. It's just that, because it's not continuous, you don't draw an interval. You just keep adding to the list until you reach your confidence level.
In this case there are three genotypes that appear and that become the confidence set for the unknown contributor at the TH01 locus. Does that make sense? It'll make even more sense in a second, I hope. But the idea is that the data are not perfect enough, given the 30 percent level. In fact, there's only one sample, and it has inherent statistical uncertainty. The computer has gone over these equations hundreds of thousands of times looking at it again and again, and when it's done it says: You know, that confidence level is a bit fuzzy, and that's what the fuzziness is. It's very objective, but that's what it is.
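The stopping rule described above (list genotypes from most to least probable and stop once the cumulative probability reaches about 99 percent) can be sketched in a few lines of Python. The function name and the probabilities below are hypothetical and are not the values from the study. The sketch only shows how a discrete confidence set is assembled, not how the system computes the probabilities.

```python
def confidence_set(genotype_probs, level=0.99):
    """Return (genotype, probability, cumulative) rows, most probable first,
    stopping once the cumulative probability reaches `level`."""
    ranked = sorted(genotype_probs.items(), key=lambda kv: kv[1], reverse=True)
    rows, cumulative = [], 0.0
    for genotype, prob in ranked:
        cumulative += prob
        rows.append((genotype, prob, cumulative))
        if cumulative >= level:
            break
    return rows


# Hypothetical genotype probabilities at one locus (illustrative only).
locus_probs = {
    ("9", "9.3"): 0.80,
    ("6", "9.3"): 0.12,
    ("9.3", "9.3"): 0.075,
    ("6", "9"): 0.005,
}

result = confidence_set(locus_probs)
for genotype, prob, cumulative in result:
    print(genotype, prob, round(cumulative, 3))
```

With these made-up numbers the list stops after three genotypes, mirroring the three-row confidence set in the example.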
This is what the data look like. I'm going to refer back to them in a second. On top (in black) is a bar chart of the actual profile. The designations are at the bottom. The RFUs are on the Y axis. Then there's a model that happens to be the correct answer, but it's just the first answer. In blue is the contribution from the victim and in red is the contribution from the unknown suspect. Perlin: Slide 13
By eye, you see there's a reasonably good match, but there is some variation. There is some garbage on the bottom; the peaks are noisy. So instead of giving a perfect answer with no ambiguity, it says it thinks there might be some.
Now, what does it mean when it thinks there might be some ambiguity? This is where the whole information concept comes in of what we're trying to do with STRs, whether they're mixtures, whether there is one sample, whether there are 100 samples, whether there are no reference samples, or whatever comes up in the case.
By way of analogy, think of it like this. Suppose someone tells you, "I've got a picture of the perpetrator." Well, that might mean something to a jury or to some finder of fact. If the picture is perfectly sharp and taken on a sunny day, then there is the perpetrator. You're done. But if you make the statement, "I have a picture of the perpetrator," and it was taken from 100 feet away in the middle of a rainy, foggy night and the person is wearing a sweatshirt, well, maybe there are a million people who would be consistent with that picture. When is a picture a picture? A picture is not always a picture.
The same thing is true with DNA. When you have a perfect, wonderful, high-signal, clean, single-source, nondegraded piece of DNA, odds of 1 in 10^17 will be expected. When you have some ambiguity in the data, there may be more individuals who could be included and statistics reflect that. And you're saying, what is the DNA, what does it mean, how discriminating is it, to what extent does it put a suspect at a crime scene? That's a nice thing to do objectively.
With methods like CPE (combined probability of exclusion), because there were four peaks in the black (as you see on top), what people might do is say: Look, I have this grid, there are 4 alleles, so there are 10 possible allele pairs. What do you mean when you say there are 10 possibilities? You're saying, I have to add in all the population ambiguity from each of those possibilities. If all allele combinations were possible, there would be no information at all: the probability that a person at random would match would be 1. Perlin: Slide 14
If you had an allelic ladder and that's what it looked like, everything's possible. There's no information. Does that make sense? I need to see some nodding heads. Thank you down in the front. They're very good over here. I appreciate that.
If you have only one allele pair and one designation, you may have odds of 1 in 100. That's a lot of information. If you have all possible allele pairs, then you have no information. It's not telling you anything that's discriminating.
Then typically with methods like CPE you get something in between: There are 10 possible allele pairs. So you add up the population frequencies of those and now maybe your probability is 1 in 10 instead of 1 in 100. You lose information. The discriminating power is telling you about the information.
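The arithmetic behind "4 alleles, so 10 possibilities" and the resulting loss of information can be checked with a short Python sketch. It assumes Hardy-Weinberg genotype frequencies (p squared for a homozygote, 2pq for a heterozygote) and uses hypothetical allele frequencies for a toy five-allele locus. The function names are illustrative and are not taken from any forensic software.

```python
from itertools import combinations_with_replacement


def allele_pairs(alleles):
    """All unordered allele pairs (genotypes), including homozygotes."""
    return list(combinations_with_replacement(sorted(alleles), 2))


def genotype_frequency(pair, freq):
    """Hardy-Weinberg genotype frequency (an assumption of this sketch)."""
    a, b = pair
    return freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]


def match_probability(pairs, freq):
    """Chance that a random person has one of the listed genotypes."""
    return sum(genotype_frequency(p, freq) for p in pairs)


# Hypothetical allele frequencies for a toy locus (they sum to 1).
freq = {"6": 0.2, "7": 0.2, "8": 0.2, "9": 0.1, "9.3": 0.3}

observed = allele_pairs(["6", "7", "9", "9.3"])    # four peaks seen
single = [("9", "9.3")]                            # one designated pair
everything = allele_pairs(freq.keys())             # like an allelic ladder

print(len(observed))                               # number of allele pairs
print(match_probability(single, freq))
print(match_probability(observed, freq))
print(match_probability(everything, freq))
```

The fewer allele pairs that remain possible, the smaller the match probability and the more discriminating the result. When every pair is possible, the probability is 1 and there is no information.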
Well, even though we may be happy with 10 because it's giving us some information, what the method actually said is that there is probably only 1, but at that confidence level you can't rule out the other 2. So what we're doing is reducing the loss of information from having 10 possibilities to having only 3. So we're increasing the focus of our DNA camera. We're getting more information by letting the computer do its millions of operations and think about the statistics.
Now, what happens when you start adding it all up? If you look at the light blue on the left, that's what a full profile would look like. The odds all the way at the end are 1 in 10^17, or 17 on the log scale. You can multiply probabilities, which means you can add their logs. What you see in blue is the contribution of each locus, but with that bit of ambiguity that we've introduced, you lose some information, and that information loss is reflected in the purple. If you have no information, then you go down to zero, and now you end up with 12. Perlin: Slide 15
What you're saying is that the 30-percent minor contributor has odds of one in a trillion. A lot better than CPE and very objective. For a forensic scientist, we can use the number 12, but for juries, maybe one in a trillion is easier.
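The idea of adding logs across loci to get a single number can be sketched as follows in Python. The per-locus match probabilities are hypothetical, and the sketch assumes the loci are independent so that their probabilities multiply. The function name is illustrative.

```python
import math


def log10_information(per_locus_match_probs):
    """Sum of -log10(match probability) over loci, assuming independence.
    A result of 12 corresponds to odds of 1 in 10**12 (one in a trillion)."""
    return sum(-math.log10(p) for p in per_locus_match_probs)


# Hypothetical match probabilities at three loci (illustrative only).
locus_probs = [0.1, 0.01, 0.05]
info = log10_information(locus_probs)
print(f"log10 information: {info:.2f}")
print(f"odds: about 1 in {10 ** info:,.0f}")

# A locus where every genotype is possible (probability 1) adds nothing.
no_info = log10_information([1.0])
```

A locus with an ambiguous confidence set has a larger match probability and therefore contributes a smaller term to the sum, which is the information loss described above.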
What can you do once you have an objective, understandable measure of information (i.e., the discriminating power or the log, the exponent of the discriminating power)? You can use it for lots of things, for example, data quality. Here's a comparison of five of the labs running ABI 310s and all their own different panels. The axis, if you look at the top bar as you go across, is the percentage of DNA from the major contributor (i.e., from the unknown suspect). Perlin: Slide 16
So 90 percent is the major contributor. You see it falling off as information down to 10 percent pretty uniformly across all the groups. But when you go down to the second pair of individuals, you see that the amount of information isn't all the way up at 10^17. It starts falling down, and the group on the right in blue just disappears all the way out.
Now, when they reran it everything came back up again. We found that a very good measure of quality is running a two-person mixture and looking at the 30 percent. We get tremendous variation if there's any issue in data quality. The computer will just give you a number.
You should be expecting a number like 12 from this data set. One in a trillion is what the information content is. If the lab's doing superbly, the number goes up. If the lab's having a bad day, the number goes down.
Something that was scientifically interesting, and that will be described more in the papers, is how information (i.e., discriminating power) relates to the mixing weight and to the quantity of DNA that was present. The mixing weight is reflected in the x-axis and you see, as you'd expect, some gradual falloff of information, in discriminating power. But what's interesting is that the theory predicts that as you keep halving the amount of DNA inside each cluster of bar graphs, you will get a logarithmic falloff in information. You do. That's why you're getting nice step functions falling off. Perlin: Slide 17
So you can sort of predict when you're collecting evidence how much of what quality evidence you might need in order to reach a certain amount of information to identify a suspect, known or unknown, with a certain amount of information. Do you want one in a million, do you want one in a trillion, do you want one in a trillion trillion? How many samples do you need and of what quality should they be? Where do I stop? How do I continue?
This was a very nice result. Here are three of the two-person mixtures—the 30 percent, the 50 percent, and the 70 percent. There's no reference sample among these three different mixtures, and when you get down to the point where there's one-eighth of a nanogram, the bar chart starts falling down: 1 nanogram, 1/2, 1/4, and 1/8. Perlin: Slide 18
Most of the peaks wouldn't even be included in any review because there are thresholds. But our system has no thresholds. It's completely self-regulating. Any peak that has a height over zero is included in the analysis. Actually, when we move to low copy number, peaks that don't have a height of anything will probably be included as well, for a bit more work.
So instead of throwing out the information, we can take different mixtures down to 1/8 nanogram with peaks you wouldn't normally look at. The point is, if there's enough of the stuff that you scrape off the walls and the floors that looks like garbage, the computer can statistically synthesize it back into a tremendous amount of information, and it doesn't take weeks to do it. The computer takes minutes. In this case, I think it may have taken close to 1 hour, but it was not an easy problem. Version 15 is designed to take minutes, so we'll see.
The Federal Rules of Evidence have been discussed a lot today. Rule 702 is that you need reliable data, which our labs work on, and reliable methods of, in this case, interpretation, say interpreting complex mixtures in cases with 50 samples and so on; and you need to reliably apply that method to the data. Perlin: Slide 19
What is meant by "reliable"? Fortunately, there are guidelines on reliable. The old guideline was Frye [Frye v. United States, 293 F. 1013 (D.C. Cir. 1923)], which was general acceptance, which can still be used in many States. But the trend over time is more and more to Daubert [Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993)], which is to have the judge be the gatekeeper and determine how reliable something is. It's not the responsibility of experts outside of what has been; it's what is the state of the art.
Daubert requires testability, and fortunately, expert systems, computer systems, statistical models, mathematical models are quite testable. In fact, you just keep testing stuff on them over and over again. They have an error rate and, because this is science, we look for error rates. We try to find little things in the system that bother us and then we write another 100 equations or something to deal with that and make the models more specific. We use the error rates to improve the quality of our systems, much like the British use error rates to maintain the quality of their national DNA database. Error is a very powerful thing in science, and it's required under Daubert.
Peer review has many meanings under the law, but we're working with many crime labs, publishers, and so on, and in the end, once it gets into the system, we have general acceptance and so on. The nice things about automated methods are they can be described and be rendered quite admissible by doing the proper studies and determining their reliability.
So in conclusion, we've developed a casework interpretation system, JustAllele Version 14, soon to be replaced by Version 15. It's objective. It has to be objective. It's just math. There are no people involved. It's unbiased. Well, what do I mean by "unbiased"? We treat two types of cases in the world, no-suspect and suspect. To make it easier, we treat everything as no-suspect cases. It's the computer's job to infer everything that it could possibly know about a suspect before a match is ever done. There's no question, if you go in front of a trier of fact, that there could have been any bias as you looked at the suspect. The suspect's profile is there. It's not available and it's provably not available because the computer never had it. Perlin: Slide 20
The results are reproducible, which is fortunate, because every time you keep running the system you keep getting the same results. They're scientifically reliable, which leads to legal admissibility, and the system is available. We're working with several groups (e.g., New York State), and we have any number of concordance studies under way for automated rape kit interpretation. We're doing a number of studies on serial crime with the British. Actually, they have a list of 22 projects that we're working on, many of which involve quality assurance, automated troubleshooting, and maintaining the quality of the process.
The results of the system, if you think about it, are simple and understandable: You take this entire mass of data and you can reduce it to 12; the odds are one in a trillion that a random person could match this. Ah, the suspect matches it—that must mean something. You don't need to look at the peaks, you don't need to cut off the peaks. The system will look at every peak over and over and over again until it converges to a solution that it's happy with. Not because somebody asked to find a particular solution, but because it's mathematically happy with it.
And it achieves these four pillars: time (it takes minutes), information (it extracts a lot more discriminating power than people would do from mixtures), clarity (the results can be summed up in a single two-digit number, which is a lot better than standing up there with lots of pictures of peaks and defense attorneys and who knows what), and we believe that it's ultimately admissible because of the care we're taking to make it science. I know you've been waiting for the system to come out now for 2 or 3 years, but we really want it to be admissible. So it's coming out this year. But again, we're working to make it science, not marketing. That's why we're where we are.
I'd like to thank the National Institute of Justice for funding this project, and Dr. Lisa Forman for encouraging me to submit, for explaining to me what admissibility was in the first place and why I wanted to do it, and for her staff. Perlin: Slide 21
I'd also like to thank our collaborators, and there are many, many of them. I have to mention people who helped in the design and did a lot of work. I'm leaving out many, many people, but let me at least mention Cecelia Kraus, Barry Duceman, and John Butler. Margaret Kline generated the samples and so on. Jeff Ban has been involved; Dave Wheret; the labs, some of the smaller labs, like in Pittsburgh; and Tom Meyer's group and Mark Squibb's group at Ohio generated absolutely stunning data. It's amazing what the small local labs can generate in terms of quality. Not only can they generate it, but we can show bar graphs that show that their data is amongst the highest quality anywhere. And of course, I have to mention Cybergenetics, where we had people writing protocols, generating data, and doing all sorts of things with software.
Thank you very much.
MS. TOMSEY: Thank you, Dr. Perlin.

