
Fourth Annual DNA Grantees' Workshop

Wednesday, June 25, 2003

MORNING SESSION

Research Update Briefings: Ongoing Projects (Continued)
Ted Staples, Moderator

MR. STAPLES: Okay everyone, if you could go ahead and take your seats. We're going to go ahead and start. We're a little bit ahead of time, which is okay. We'll just get to lunch a little bit sooner.

My name is Ted Staples. I'm from the Georgia Bureau of Investigation, and we'll be continuing this morning with our research update briefings on ongoing projects.

Improved Resolution of Y-Chromosome Markers for Forensic Analyses
Alan J. Redd

MR. STAPLES: Our first speaker is Dr. Alan Redd. He attained his Ph.D. in anthropology from Penn State University and is currently doing a postdoc at the University of Arizona.

Please welcome Dr. Redd.

DR. REDD: Thank you. It's a pleasure to be here. I'm going to talk today about some of the work that we are doing to develop markers on the Y chromosome for forensic applications. I think I'll talk for about 25 minutes. Redd: Slide 1

Before I start, I'd like to acknowledge many of the people who contributed to some of the data that I'm going to present. First, I'd like to acknowledge Jeff Ban, Rex Riis, and Russell Vossbrink for providing population samples from Virginia, South Dakota, and Arizona. Redd: Slide 2

I'd like to acknowledge some of the team members at the University of Arizona. Veronica Contreras and Veronica Kearney do a lot of the DNA typing. I'd like to thank Al Agellon for his computer assistance. Tatiana Karafet has been helpful with SNP (single nucleotide polymorphism) typing, Michael Hammer makes most of the work possible, and of course, the National Institute of Justice funds the projects, which makes it possible as well.

We also have some great collaborators. We've got John Butler and his team at NIST (National Institute of Standards and Technology) and Pete and Richard Shotsky. We also plan to do some validation with Sue Narveson.

So here's a brief outline of the talk I'm going to give. First, I'm going to tell you a bit about the Y chromosome and why it's useful and, perhaps, what promise it has for forensic applications. Then I'm going to tell you about a study that we published last year, in which we found 14 novel Y–STRs (Y-chromosome short tandem repeats) that dramatically improved lineage resolution above current levels. I'm also going to describe our growing population database and show you samples. Redd: Slide 3

I'll also discuss a couple of things related to databasing. In order to structure your database, it's important to understand the genetic structure and variation between populations. I'll show you some data in that regard. Then I'm going to quantify levels of admixture in different U.S. populations, and that's again very important for knowing which populations you can lump together and determining whether you need different regional databases. Finally, I'm going to talk briefly about a new bit of research that I'm getting into, in which I'm testing the ability to predict Y-chromosome ethnicity or phylogeographic ethnicity.

This slide shows our sex chromosomes, an X and a Y. Approximately 300 million years ago, these chromosomes didn't look like this at all. They looked like any other autosomal pair. It wasn't until around that time that the SRY (sex-determining region of the Y chromosome) gene evolved on the Y chromosome. Suppressing recombination was one of many changes throughout its evolution. Today, 95 percent of the Y chromosome doesn't recombine with the X chromosome. Redd: Slide 4

In a recent paper published in Nature, John Butler said that the Y chromosome is more complex than we thought before. Well, that doesn't mean that men are any more complex than we thought they were. Actually, the major gist of the paper is that the Y chromosome is highly duplicated, and there is a lot of gene conversion going on. So it's as if the Y chromosome is recombining with itself through gene conversion. Instead of the non-recombining portion of the Y chromosome, now there is a new term called the male-specific portion of the Y chromosome.

The Y chromosome is a great tool because we know of tons of markers. It took a long time to find them, but we now have an explosion of new markers, both SNPs and microsatellites, and of course, it's paternally inherited. This particular guy got his Y from dad and his X from mom.

Perhaps you have seen this before, but I think it's a useful slide to show. It shows inheritance patterns in four compartments of the genome. Here at the bottom, representing ego, is one male. He has a short Y chromosome (in blue), a mitochondrial genome (circled), and a single autosomal pair (left). If you trace the inheritance of his Y chromosome, he got it from his father and from his grandfather, and ultimately it traces back to a single ancestor, his great-grandfather. Likewise, he inherited his mitochondrial DNA from his great-grandmother, so it's a single female ancestor. Redd: Slide 5

If you were to trace the history of autosomal markers, it would be much more complicated. For example, you'd find that the yellow portion in this autosome traces back to this particular ancestor, and so on down the line. You wouldn't be able to reconstruct that unless you had the full pedigree, and it would be very tricky.

This shows you the simplicity of inheritance for both the Y chromosome and mitochondria. But it also points to a limitation of the Y chromosome compared with autosomes. If you were to type autosomal markers in these 15 individuals, you would find that they all would be unique. You would have a genetic fingerprint. They would all be individualized.

But if you were to type some Y-chromosome markers, you would probably be able to differentiate the unrelated ones (i.e., the ones with different color), but the Y chromosomes that are of the same color (i.e., the related ones) would be the same. So when you're dealing with the Y chromosome, you're typing paternal lineages; you're not typing individuals. That's a limitation for the Y system and for mitochondrial DNA. You don't get the same level of resolution, but it's also a strength in other applications. Redd: Slide 6

So when will the male-specific DNA typing system be important? Well, sexual assault cases are the obvious application because it can simplify your analysis when you have either single or multiple male donors. It can also work when you have vasectomized or azoospermic men; that is, men who don't produce sperm. In these cases, you can't do a differential extraction, so you could go straight to the Y chromosome. Redd: Slide 7

Of course, it's the same case for other body fluid mixtures, because you can still use the Y chromosome. You can also use the Y chromosome to identify missing persons. I don't know whether this is true, but I heard some talk that people were trying to find out whether or not Saddam Hussein or Osama bin Laden was dead or alive by comparing samples of tissue collected from paternal male relatives. So that's one example of how you could use the Y chromosome.

As I mentioned earlier, I think the Y chromosome can be used to determine Y-chromosome ethnicity using phylogenetic patterns of variation. Actually, the Y chromosome has the highest and most varied level of Fst (a measure of genetic distance) compared with autosomes or mitochondrial DNA. Therefore, you would actually have a stronger signal for this particular purpose.

Y–STRs and Y–SNPs are two classes of markers that are commonly found in the Y chromosome. They have completely different patterns of evolution and mutation rates. STRs have been extremely valuable markers in a huge number of disciplines, predominantly because they're so abundant in eukaryotic genomes and their mutation rate is so fast that they provide a high level of resolution—so fast that you can actually observe quite easily the mutations in pedigrees. SNPs come in different forms and different flavors (e.g., a single nucleotide polymorphism or insertions and deletions), and they have much slower mutation rates. Redd: Slide 8

Here's a bit of terminology. A bunch of microsatellites on a Y chromosome define what we call a haplotype, whereas a bunch of SNP markers on the Y chromosome define what we call a haplogroup.

This is a tree that was published last year. My boss actually published it. Mike Hammer is the first author. He actually didn't put his name there. He put "YCC" (Y Chromosome Consortium) as the first author, because it was a joint effort with his lab and Cavalli-Sforza's. They typed 247 SNPs in a global sample and found a total of 153 lineages. There are really 18 major lineages (A through R), and that's the shape of the Y tree for humans. This tree unifies about seven or eight nomenclature systems that were in the literature, so it's a very useful road map for anybody who wants to do Y–SNPs. It's still very confusing even with this map, but it was much more confusing before it was published. Redd: Slide 9

Now I'm going to tell you a bit about the 14 novel markers that we found on the Y chromosome and how they improved resolution. John Butler mentioned that this set of markers, which has been called the minimal haplotype, is commonly used in Europe.

This slide shows the relative positions of all markers on the Y chromosome. One of the markers is a multicopy marker, DYS385, so with a single pair of primers you get two bands. This was the state of affairs when we started to do our research. With the addition of another multicopy marker, YCAII, this set of markers defines core loci known as the extended haplotype. Redd: Slide 10

Well, when we started out we thought perhaps we needed to identify some more microsatellites, predominantly because we didn't think that the lineage resolution was very good. This slide shows a database of nearly 13,000 Europeans, and in that database only 42 percent of those lineages are unique. That means a huge chunk of them are shared by at least one other male or many others. Redd: Slide 11

The most common lineage is found in 3 percent of the database. Actually, that's my lineage. So you can think about it like this: It's possible that I'm related to 3 percent of the males in Europe, but it's also possible that these microsatellites really don't differentiate some related but different lineages.

So we thought we needed to improve upon that, because I don't think I'm related to 3 percent of the Europeans. But actually, in the United States the most common lineage is one step away, and that lineage is found in 2 percent of the population database in Europe. We also needed to identify some microsatellites that had longer repeats. YCAII suffers from stutter, which you saw in John's slide perhaps yesterday or the day before.

For further improvement, we wanted to compare microsatellites in a big geographic database using some of the microsatellites that were available but weren't being used. Most importantly, we wanted to directly compare microsatellites in the same set of samples, so you could let them compete against each other to see which ones were best at resolving paternal lineages. Redd: Slide 12

So we published this manuscript in Forensic Science International last year [2002], and in it we characterized and provided a nomenclature system for 14 microsatellites and typed them in 2 samples. One was the YCC panel, which I'll describe in a minute, and the other was a European-American or Caucasian sample from South Dakota. Redd: Slide 13

This is the first sample in which we typed these markers and others. It's called the Y Chromosome Consortium sample. These are cell lines that were established by Mike Hammer and Nathan Ellis in 1991, and it's a great sample. It's nearly an infinite resource of DNA that you can use over and over again. They come from different geographic locations, from Africa, Europe, the Middle East, and Asia. They're from local tribal populations, including the Khoisan and the Pygmies, some Native Americans, and Surui from Brazil. Redd: Slide 14

It's important to note that with this sample we probably have mostly unrelated paternal lineages, but it's very likely that we have some related ones as well. The probability of finding related individuals increases especially when you start to sample tribal populations.

Here's a slide showing the locations of the 14 novel markers that we found on the Y chromosome. It was based on a BLAT search. You can see that we found eight tetranucleotides, five pentanucleotides, and the first hexanucleotide repeat, so these have longer repeat motifs. Redd: Slide 15

Two of these loci are multi-copy loci. DYS464 is actually found in four positions: A, B, C, and D. It's actually found in the DAZ region. These are the palindromes on the Y chromosome. Palindromes are repeated DNA that gets flipped and inserted in the opposite direction. DYS459 is the other multi-copy marker. It's found in two positions. They're separated by a large number of megabases.

This slide shows how we characterized these loci. I don't want you to look too closely at it. I do want to tell you, though, that we found some loci that had a large number of repeats, and it's known in the literature that microsatellites with longer repeat stretches are more prone to mutation. So it's a good thing to look for if you want to find markers that are more variable. Redd: Slide 16

This is a slide that shows the history of Y–STR discovery, and I'm also using it to show you that I compared many of the markers in the YCC panel. The markers in red are the extended haplotype. Most of them were discovered in the early 1990s. We typed those markers. Of course we typed our 14 novel markers, and we typed a majority of the others here in green. Redd: Slide 17

In this slide, I ranked gene diversity from highest to lowest. The highest is up here on the left, and this is the top 50 percent, this is the lower 50 percent. You can see that 10 of the 14 novel markers are in the top 50 percent. The novel ones are in blue. Redd: Slide 18

Then you can see that the extended haplotype has some on both sides. Some of our markers were slow as well. The most variable markers were the multicopy markers, those with the asterisks. The most variable marker was DYS464. That was the multi-copy marker that we discovered in the palindromic region.
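[Editorial note: Gene diversity at a Y-linked locus is commonly estimated with Nei's unbiased formula, n/(n-1) multiplied by (1 minus the sum of squared allele frequencies). The sketch below assumes that estimator, since the talk does not state which one was used. The locus names and allele calls are hypothetical and are not data from the study.]

```python
from collections import Counter


def gene_diversity(alleles):
    """Nei's unbiased gene diversity: n/(n-1) * (1 - sum of squared allele frequencies)."""
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two typed males")
    counts = Counter(alleles)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Hypothetical allele calls (repeat counts) for two illustrative loci.
typed = {
    "locus_A": [13, 14, 14, 15, 16, 17],
    "locus_B": [10, 10, 10, 10, 11, 11],
}

# Rank loci from highest to lowest gene diversity.
ranking = sorted(typed, key=lambda locus: gene_diversity(typed[locus]), reverse=True)
```

Sorting loci by this value from highest to lowest gives a ranking of the kind described for the slide.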

Here's a slide showing profiles from five males with the DYS464 locus. In some males you can have anywhere from one to four different peaks. There are different ways that you can score this marker. You could either use peak number or peak height. For example, this particular male could be called 13,14,17 or 13,14,17,17. This peak here is twice as high as these other two here. We actually chose to do peak number, the more conservative scoring. Redd: Slide 19
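[Editorial note: The two scoring conventions for a multi-copy locus can be contrasted in a few lines. The sketch below assumes that peak-number scoring records each observed allele once, and that peak-height scoring repeats an allele according to its height relative to the smallest peak. The function names, the rounding rule, and the peak heights are hypothetical.]

```python
def score_by_peak_number(peaks):
    """Record each observed allele once, ignoring peak height."""
    return tuple(sorted(peaks))


def score_by_peak_height(peaks):
    """Repeat each allele according to its height relative to the smallest peak."""
    smallest = min(peaks.values())
    calls = []
    for allele in sorted(peaks):
        # Assumed rule: round the height ratio to the nearest whole copy number.
        copies = max(1, round(peaks[allele] / smallest))
        calls.extend([allele] * copies)
    return tuple(calls)


# Hypothetical profile: allele -> peak height; the 17 peak is about twice as high.
profile = {13: 1000, 14: 1050, 17: 2100}
```

With this made-up profile, peak-number scoring yields 13,14,17 and peak-height scoring yields 13,14,17,17, the two alternatives mentioned in the talk.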

It's interesting to look at the resolving power of this single microsatellite in comparison to many SNPs. If you type the YCC panel (remember, that was the cell line panel, 73 males), you see 41 haplotypes, 41 unique bar codes. However, if you type 245 SNPs in the same sample, you only see 38 haplogroups. So you get fewer haplogroups with 245 Y–SNPs compared with a single microsatellite. That shows you the power of resolution of a highly polymorphic microsatellite in comparison to SNPs. It doesn't mean SNPs aren't valuable. They have a different use, which I'll show you later.

When we sent that manuscript to get published, one of our reviewers said that because the YCC is not a real population (it contains a couple of individuals plucked from here and a couple plucked from there), you can't tell anything about gene diversity of those microsatellites. So we demonstrated it in a population sample. We thank Rex Riis for providing DNA and plenty of it from a South Dakota European-American population. Redd: Slide 20

When you start doing population sampling, you need multiplexes. We've been designing some multiplexes, and we're trying to improve them by learning more from John Butler. These wouldn't be called manly-plexes by any means because the pectorals are a little bit deflated. [For more on manly plexes, see John Butler's discussion on "NIST Y-Chromosome Work and Standard Reference Material 2395" from the first day of this meeting.]

But one thing is for sure: These multiplexes have gone through a lot of testing. We've probably typed about 10,000 males with these two sets of multiplexes that we amplify separately and load together. We've probably typed about 5,000 males with this multiplex in particular; and with this really sort of wimpy multiplex, not a manly multiplex at all, we've probably done about 1,000 or 2,000 individuals.

I couldn't get a couple of Y–STRs in the multiplexes, so I uniplexed them. This slide shows you that I typed 26 Y–STRs, which gives 31 bands because some of them were multicopy. Redd: Slide 21

So this is the real population, and the gene diversity rankings with this real population compared to the cell line panel are rather similar. The correlation was 0.8. You can see that many of our markers—our novel ones in blue—are up at the top. In fact, 7 are in the top 10. The minimal haplotypes, which are shown in red, are sort of going to the bottom. 385 is still a great marker, and many of the others are still fine. They're just not winning this particular competition. Redd: Slide 22

This slide shows how we completely resolved the paternal lineages in these two samples, and it shows the rate at which that occurs. The discrimination capacity is along the Y-axis. That's simply the number of unique haplotypes divided by the total sample size in percent. Redd: Slide 23

Along the bottom here, I started with the minimal haplotype, then I added a single Y–STR at a time in order of highest diversity to lowest diversity, recalculated discrimination capacity, and plotted it, so the markers were added in a cumulative fashion. You can see that both the European-American sample and the YCC sample jumped up to nearly 95 percent with just the addition of DYS464 (the highest diversity one).
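[Editorial note: The cumulative calculation described here can be sketched as follows. The sketch assumes "unique haplotypes" means the number of distinct haplotypes observed, which is one reading of the definition given in the talk. The locus names and repeat counts are hypothetical.]

```python
def discrimination_capacity(haplotypes):
    """Distinct haplotypes divided by the number of males typed, in percent."""
    return 100.0 * len(set(haplotypes)) / len(haplotypes)


def cumulative_capacity(samples, base_loci, extra_loci):
    """Recompute discrimination capacity as each extra locus is added in order."""
    loci = list(base_loci)
    curve = []
    # First point uses the base loci alone; each later point adds one more locus.
    for locus in [None] + list(extra_loci):
        if locus is not None:
            loci.append(locus)
        haplotypes = [tuple(sample[name] for name in loci) for sample in samples]
        curve.append(discrimination_capacity(haplotypes))
    return curve


# Four hypothetical males typed at two "base" loci and two additional loci.
samples = [
    {"base_1": 14, "base_2": 11, "extra_1": 15, "extra_2": 9},
    {"base_1": 14, "base_2": 11, "extra_1": 16, "extra_2": 9},
    {"base_1": 14, "base_2": 11, "extra_1": 16, "extra_2": 10},
    {"base_1": 15, "base_2": 12, "extra_1": 15, "extra_2": 9},
]

# extra_loci is assumed to be pre-sorted from highest to lowest gene diversity.
curve = cumulative_capacity(samples, ["base_1", "base_2"], ["extra_1", "extra_2"])
```

In this toy data set the curve rises from 50 to 75 to 100 percent as the two extra loci are added.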

With the addition of only three more markers after that, they were all distinct. So not very many more markers completely resolved all those paternal lineages. If you were to look just at the minimal haplotypes, you'd think that they were relatives, but in fact they're not.

When you look at all 26 microsatellites, the males actually differ: Only one of them differed by 2 steps and all the rest of them differed by 3 or more steps, so they're not related.
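[Editorial note: One simple way to count "steps" between two haplotypes is to sum the absolute differences in repeat number across loci. The sketch below assumes that convention and single-copy loci only, since multi-copy loci such as DYS464 need separate handling. The locus names and repeat counts are hypothetical.]

```python
def step_difference(hap_a, hap_b):
    """Sum of absolute repeat-count differences across the same single-copy loci."""
    if hap_a.keys() != hap_b.keys():
        raise ValueError("haplotypes must be typed at the same loci")
    return sum(abs(hap_a[locus] - hap_b[locus]) for locus in hap_a)


# Two hypothetical males that match at one locus and differ at two others.
male_1 = {"locus_A": 14, "locus_B": 11, "locus_C": 22}
male_2 = {"locus_A": 14, "locus_B": 12, "locus_C": 24}

steps = step_difference(male_1, male_2)
```

Here the two made-up haplotypes differ by three steps in total.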

It took a little bit longer in the YCC panel. Again, that's probably because they have a high probability of having paternal relatives when you look at tribal samples. In fact, there were four pairs that didn't resolve very quickly. It took 14 additional markers on top of the minimal haplotype to get them split out.

John Butler typed the pairs that didn't split with autosomal markers, and these were pairs of males from the same population: two Khoisan and two Native Americans. We found that they weren't brothers because they differed at both alleles of a particular locus for 2 to 5 out of the 15 loci typed, so they were probably paternal cousins.

Using a coalescent method and a Bayesian statistical method, we estimated when they last shared an ancestor. These pairs of individuals probably shared an ancestor about 200 years ago, with a wide confidence interval of about 2 to 40 generations.

SWGDAM (Scientific Working Group on DNA Analysis Methods) met and they chose some microsatellites that could be used in the United States and they dubbed this the U.S. haplotype. They took the minimal haplotype from Europe, threw out YCAII, and added a couple of additional markers, DYS438 and 439. Although they didn't include any of our novel markers, perhaps they will in the future. Redd: Slide 24

It's worth noting that many of the DNA genealogy companies are already using many of our markers. Although they're not taking off yet in the United States, they're certainly taking off elsewhere and perhaps in the future they'll be used in the United States.

I'm going to show you a bit of information on the U.S. population database that we're building on the Y chromosome. This slide shows the location of the samples and the people who gave them to us. I'm going to present some data on samples from South Dakota, Arizona, and Virginia. Redd: Slide 25

We actually have data on three ethnic groups: African-Americans, European-Americans, and Hispanics. We also have three populations of Native Americans: Sioux, Navajo, and Apache. Again I'd like to thank Jeff Ban, Rex Riis, and Russell Vossbrink for giving us these samples.

From about 965 samples, nearly 1,000 males, we typed 26 Y–STRs and about 50 Y–SNPs. That's about 80,000 data points. It's still too small, unfortunately. We want to make it bigger. At the end of the slide show, I'm going to invite and ask some of you to give us some more samples from different regions of the United States.

This is the diversity rank in bigger samples of real U.S. populations. Again, it shows that the gene diversity rank of our blue markers here did quite well. Eight of them are in the top 10. Redd: Slide 26

This shows lineage resolution in the U.S. ethnic groups using the minimal haplotype. Using the minimal haplotype plus DYS464, it jumped up to the 90s for these three ethnic groups. It didn't go very high for the Native American populations: It stayed at about 67 percent. With all 26 Y–STRs, we got complete resolution. Redd: Slide 27

We get nearly 100 percent for the three populations of African Americans, European Americans, and Hispanics. Only the Hispanics had a couple of individuals that were the same, and with Native Americans, this bigger sample shows the difficulty of discriminating paternal lineage patterns among tribal populations.

So how does the genetic structure look in the United States? U.S. populations have been called sort of a melting pot. Well, how much ancestry do we share? Do U.S. populations cluster with the same ethnic group or do they cluster with populations from the same State? Do they cluster with populations from different ethnic groups? These are the types of questions that are important when building a database. Redd: Slide 28

This is an MDS plot, a multidimensional scaling plot. It's the same concept as a mileage chart in the beginning of a geographic atlas. The mileage chart shows the distances between all pairs of cities. Essentially, I used a distance matrix to build this plot. Redd: Slide 29

Populations that are similar should be close together, and those that are different should be far away. It turns out that populations generally cluster by ethnic group. You've got a cluster of African Americans here [indicating], Hispanics are here in the middle, the European Americans are not too far away; and two of the Native American populations are clustered up here to the right. The Sioux population, however, didn't cluster with the Native Americans, and I'll tell you more about why I think that is later.

I performed an analysis of molecular variance and found that the variation between populations is about 15 percent. The variation between populations in any of these circles is not significant. In fact, it's very small, which supports the clustering and the database. You can pool African Americans, European Americans, Hispanics, and some Native Americans. I didn't include the Sioux in this analysis because they really require separate treatment.

This slide shows a Y–SNP tree of the world. It's a global survey of 62 SNPs and 2,000 samples. Again, there are 18 major haplogroups. The size of the circles corresponds to the frequency. I just want to highlight the frequent haplogroups in the U.S. population. Native Americans are basically Q haplogroup, European Americans are predominantly this RP25 haplogroup, although they have others as well, and African haplogroups A, B, and E are very common. E–P1 is the most common African-American haplogroup. E–P2 is found both in populations from southern Europe and northern Africa. These SNPs have a very high degree of ethnic geographic specificity. Redd: Slide 30

This is an MDS plot based on a different data set—the slowly evolving Y–SNPs—and the results are identical. You still see the four clusters, with the Sioux being the outlier. This time the analysis of variance shows that the variation between populations is larger. The variation between the groups is about 32 percent, and the variation between populations within any of those ethnic groups is very small. So these are completely different types of markers. They're both on the Y chromosome, but they're giving very consistent results. Redd: Slide 31

This slide, and a few that I'll show later, deal with admixture in U.S. populations: What does Y-chromosome admixture look like in U.S. populations? Here you will see that I color coded haplogroup origins in Arizona, South Dakota, and Virginia. Green represents African-origin haplogroups, blue is for European origin, and red is for Native American. About 25 to 30 percent of the African-American populations have Y chromosomes of European origin. That's actually higher than what is found on the autosomes. For autosomes it's about 12 to 20 percent. Redd: Slide 32
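[Editorial note: A color-coded breakdown of this kind amounts to tallying haplogroups by assumed geographic origin. The sketch below uses a deliberately coarse, hypothetical mapping from major haplogroup letter to origin. Real assignments depend on sub-lineages (for example, the talk notes that E–P2 occurs in both Europe and Africa), and the sample shown is invented.]

```python
from collections import Counter

# Hypothetical, coarse mapping from major haplogroup letter to origin.
ORIGIN = {
    "A": "African",
    "B": "African",
    "E": "African",
    "R": "European",
    "Q": "Native American",
}


def origin_proportions(haplogroups):
    """Percentage of Y chromosomes assigned to each origin category."""
    counts = Counter(ORIGIN[h] for h in haplogroups)
    total = sum(counts.values())
    return {origin: 100.0 * n / total for origin, n in counts.items()}


# Invented sample of ten males: seven carry E and three carry R.
sample = ["E"] * 7 + ["R"] * 3
proportions = origin_proportions(sample)
```

For this invented sample, 70 percent of the Y chromosomes fall in the African category and 30 percent in the European category.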

This slide shows admixture and ethnicity in the three Native American populations. You can see that European admixture in the Navajo and Apache is very small, about 0 to 5 percent. But European lineages are quite high in the Sioux, about 53 percent. They also have a good number of Native American lineages as well. Redd: Slide 33

When I went to the literature and looked at the history of the Sioux, they started to trade with Europeans at an early date. In particular, they traded with French trappers for guns and horses, and documents indicate that French men married Sioux women. It may be that these chromosomes are actually predominantly French. It would be interesting to test for whether you can tell that or not.

This shows the haplogroup profiles of Hispanics. They're by far the most colorful. They have predominantly European haplogroups, about 60 to 70 percent (blue). They have Native American haplogroups as well, about 12 to 20 percent (red), and African-American Y chromosomes, about 9 to 30 percent. Redd: Slide 34

I put a tilde here [indicating] because the E–P2 haplogroup is actually found in Europe and Africa, so this may be an overestimate. I need to resolve those haplogroups further using additional SNPs.

What about the European Americans? Well, they don't have very much admixture. These particular populations contain about 1 to 3 percent Native American genes and perhaps about 2 to 6 percent African-American genes, on the Y chromosome of course. Redd: Slide 35

I received a phone call from a forensic caseworker who asked me if I could predict ethnicity using some DNA from one of his cases. He said it was a high-profile case. My first response was kind of, "Well, I don't know. I mean, you better be careful because the Y is only one fraction of your genome and it may not correlate at all with the phenotypic characteristics that are used in, I guess, dragnets." So I said yes, in theory it can be done, but there may be a lot of qualifications, caveats, and limitations. So I wanted to test whether or not I could do it and to what degree I could do it well. Redd: Slides 36 and 37

So here is a discriminant analysis that addresses that very question. You just look at the microsatellite data or the SNP data and predict what ethnic group it came from. Given that I had the self-assigned ethnic group, I could test the success of this particular process. Here is the percentage of correct assignments along the Y-axis and then the ethnic groups along the X-axis. As you can see, with the STRs and SNPs you can do it about 35 to 65 percent of the time, although there are plenty of mistakes. Hispanic data were the worst, but that's not surprising when you remember what the haplogroup profile looks like for Hispanics.

If you remember the multidimensional scaling plot, the Hispanics were close to the European Americans. So I repeated the analysis with those two populations lumped together. That's the only additional analysis I did.

As you can see, it definitely improved the discrimination and the correct assignment for the other populations. The correct assignment basically jumped from 65 to 95 percent for the three ethnic groups. Redd: Slide 38

So, given all the caveats of not knowing what the Y ethnicity means with regard to phenotypic variation and the rest of the genome, it looks like there's a lot of information there that may be useful. I guess that kind of question could be better answered by caseworkers. They might know better whether this kind of information could be valuable. Of course, it's important to note that this is just a preliminary analysis. Redd: Slide 39

So in conclusion, I think the Y chromosome can be an additional tool used in forensic applications. I know that ReliaGene has been doing work with this, and they've got a multiplex out there. I know Promega has got a multiplex on the way. I even know that ABI (Applied Biosystems) is gearing up too, but I don't know how far along they are. These kits are going to really help push the Y chromosome into more routine use in forensic labs.

Our 14 novel markers dramatically improved the resolution of paternal lineages regardless of the samples we used, whether it was the YCC or the population samples.

The genetic structure in U.S. populations mostly follows ethnic groups, although there are exceptions. Remember the Sioux? They didn't really follow that pattern.

The admixture varies greatly in U.S. ethnic groups and populations. Hispanics were the most admixed, followed by African Americans and then European Americans. Of course, the Sioux were very admixed.

Predicting Y-chromosome ethnicity is possible in many cases, although the usefulness of this may need to be evaluated in more detail.

This slide shows some of the locations of the samples that I presented. I just want to thank the people who gave us the samples. These samples in blue are the ones that I presented today, and the ones in yellow are samples that we have. Mickey Prinz, Mark Nelson, and Heather Coyle have already given us samples that we're planning to type. We want to thank them, and we are still hopeful that we will receive samples from the other locations. Redd: Slide 40

Of course, I'd like to invite other people to donate samples. We would really appreciate it. If you would like to do this, please contact my boss (his name and e-mail are in the manual), or you can come and contact me in person today. Again, we would love to include your samples and build a bigger database that could be more valuable.

Thank you for your time.


Date Entered: January 17, 2008