Cohort study

Cohort studies

Suppose we want to know whether an exposure causes a disease, for example whether smoking causes adenocarcinoma of the lung. We could set up a randomized controlled trial in which we recruit 1000 people and divide them randomly into two groups of 500, group A and group B. Group A has to start smoking and group B must not smoke. That is not possible: you cannot force people to smoke if they do not want to, and doing so is considered unethical. However, we can use a different study design: the cohort study. In this design we recruit a group of smokers and a group of non-smokers. The researcher does not decide who smokes; participants fall into two groups based on what they already do: one that is exposed (to smoking) and one that is not exposed (to smoking). We follow them over time and measure outcomes such as death, disease incidence or number of hospitalizations. We call the most important outcome we measure the primary outcome. Studies usually also measure secondary outcomes, but these are of less importance.

Benefits & limitations

In short, in a cohort study you have an exposed group and a non-exposed group, and you follow them to see whether the exposure is related to the primary outcome. We can do this prospectively, meaning that at the start of the study participants have been exposed (or not) but have not yet developed the disease or died, and we follow them forward in time. We can also do this retrospectively, meaning that both the exposure and the outcome have already happened and we use existing records to reconstruct who was exposed and what happened to them. For example, 1000 former workers of an asbestos factory have died and you go back in time to see whether there is a difference in exposure. You find out that 900 of them were exposed to asbestos while 100 were not. To turn these counts into risks, you also need to know how many exposed and non-exposed workers there were in total. The key feature of retrospective cohort research is that the researcher goes back in time to find out what might be associated with an outcome.
To put it differently, in a cohort study participants are placed into two groups and followed over a period of time, and we measure the outcomes that interest us: how many people died, how many people got a disease? We then try to correlate those outcomes with the factors or exposures the participants were exposed to. But how do we know that the disease they developed is due to the specific exposure and not due to differences in age, sex, income, education level, etc.? For example, the group of smokers might have less money than the group of non-smokers and therefore visit the GP less often. Or the non-smokers could be 70 years old while the smokers are 40 years old. Factors like these are called confounders, and the researcher tries to correct for them. Researchers try to cancel them out or, even better, to compare groups that have everything in common and differ only in the exposure or risk factor. That would be ideal, but such groups are almost impossible to find in real life. Thus, one of the biggest limitations of cohort studies is that it is hard to assess whether an association between a risk factor and an outcome is causal. In other words: how sure can we be that exposure X really causes disease Z without other factors playing a role, such as age, sex, the place where one lives, the food one eats, etc.?
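
The effect of a confounder such as age can be illustrated with a small calculation. The sketch below uses made-up counts (not data from any real study) in which smokers are mostly younger and non-smokers mostly older, and in which smoking makes no difference within either age group. It compares the crude relative risk, which ignores age, with a Mantel-Haenszel summary, which is one common way to combine age-specific comparisons. The class name, function names and all numbers are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    """Counts for one level of a confounder (hypothetical numbers)."""
    name: str
    exposed_cases: int
    exposed_total: int
    unexposed_cases: int
    unexposed_total: int


def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk in the exposed group divided by risk in the unexposed group."""
    return (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)


def crude_risk_ratio(strata):
    """Risk ratio after pooling all strata, i.e. ignoring the confounder."""
    return risk_ratio(
        sum(s.exposed_cases for s in strata),
        sum(s.exposed_total for s in strata),
        sum(s.unexposed_cases for s in strata),
        sum(s.unexposed_total for s in strata),
    )


def mantel_haenszel_risk_ratio(strata):
    """Mantel-Haenszel risk ratio: a weighted summary across strata."""
    numerator = 0.0
    denominator = 0.0
    for s in strata:
        n = s.exposed_total + s.unexposed_total
        numerator += s.exposed_cases * s.unexposed_total / n
        denominator += s.unexposed_cases * s.exposed_total / n
    return numerator / denominator


# Hypothetical cohort: smokers are mostly younger, non-smokers mostly older,
# and within each age group the risk is the same for both.
strata = [
    Stratum("younger", exposed_cases=40, exposed_total=800,
            unexposed_cases=10, unexposed_total=200),
    Stratum("older", exposed_cases=40, exposed_total=200,
            unexposed_cases=160, unexposed_total=800),
]

crude = crude_risk_ratio(strata)
adjusted = mantel_haenszel_risk_ratio(strata)

for s in strata:
    rr = risk_ratio(s.exposed_cases, s.exposed_total,
                    s.unexposed_cases, s.unexposed_total)
    print(f"{s.name}: relative risk = {rr:.2f}")
print(f"crude (age ignored): relative risk = {crude:.2f}")
print(f"Mantel-Haenszel (age-adjusted): relative risk = {adjusted:.2f}")
```

In this invented example the crude comparison suggests that smoking halves the risk, purely because the smokers are younger, while the age-adjusted comparison shows no difference. Real studies have to deal with many confounders at once, which is why regression models are normally used instead of a single stratification.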

A cohort study also usually takes a long time, because the outcome needs time to develop and enough cases are needed to find significant results and to correct for confounders. Adenocarcinoma of the lung, for instance, only develops some time after someone starts smoking. This long period of time causes another limitation: people's circumstances change. Some in the smoking group decide to stop smoking, others die, and others move to another country and drop out of the study. What about social factors? A society might come to look down on smokers, who might then underreport their smoking. Conversely, smoking might come to be considered cool, and some in the non-smoker group might start to smoke. There are many variables that can change and that make the results hard to interpret. Researchers may then prefer to select people who are unlikely to move or who are committed to the study, and this leads to selection bias.


The word cohort comes from the Latin cohors, meaning a group of warriors advancing together. The term cohort study is attributed to Frost, who studied tuberculosis in the 1930s. In the 1950s the Dutch scientist Korteweg used this method to analyse the incidence of lung cancer in the Netherlands. Yet this does not mean that no cohort studies were performed before the 1930s. They just had a different name, such as longitudinal, follow-up or simply prospective studies. In the late 1800s, data on health became necessary to make effective policy. Not only policymakers required data, but insurance companies as well. They recorded, for example, the number of deaths for specific occupations. Around the 1950s, several landmark studies were started that helped us gain insight into risk factors for certain outcomes. They continue until today. One example is the Framingham study, which follows the residents of a town in the US to find out what the risk factors for cardiovascular disease are. Closer to home, we have the Generation R study of the Erasmus MC, which follows children over a period of time.

A lot of our knowledge in medicine (cancer and radiation, asbestos and mesothelioma, high blood pressure and heart attack) comes from cohort studies. Cohort studies express an association as a relative risk. For example, high cholesterol gives a 4 times greater risk of a heart attack compared to low cholesterol. We have to note that cohort studies do not prove anything in the strict scientific sense; they only express risks or probabilities of association. High cholesterol is associated with heart attack, but causation is not proven in the strict scientific sense. But what if researchers take large groups of people and examine them? What if not only a researcher in the USA, but also researchers in India, Thailand, Sydney and Greece examine the same outcome for the same exposure and reach the same conclusion and (approximately) the same relative risk? Or what if 90% of the smokers develop lung carcinoma and only 5% of the non-smokers? Is a consistent relative risk of roughly 20 for a certain outcome enough proof?
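
To make the arithmetic behind a relative risk concrete, the following sketch computes it for the 90% versus 5% example, together with an approximate 95% confidence interval based on the usual log-scale standard error. The group sizes of 1000 smokers and 1000 non-smokers are an assumption added here, since the text only gives percentages, and the function name is made up for illustration.

```python
import math


def relative_risk_with_ci(exposed_cases, exposed_total,
                          unexposed_cases, unexposed_total, z=1.96):
    """Return (relative risk, lower bound, upper bound).

    The interval uses the normal approximation on the log scale;
    z=1.96 corresponds to roughly 95% confidence.
    """
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    rr = risk_exposed / risk_unexposed

    # Standard error of log(RR) for cumulative incidence data.
    se_log_rr = math.sqrt(
        1 / exposed_cases - 1 / exposed_total
        + 1 / unexposed_cases - 1 / unexposed_total
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Assumed group sizes: 1000 smokers (90% develop the disease)
# and 1000 non-smokers (5% develop the disease).
rr, lower, upper = relative_risk_with_ci(900, 1000, 50, 1000)
print(f"relative risk = {rr:.1f} (95% CI {lower:.1f} to {upper:.1f})")
```

With these assumed numbers the relative risk is 0.90 / 0.05 = 18, which is the "roughly 20 times" mentioned above. The width of the interval depends on the group sizes, so smaller cohorts would give a less precise estimate for the same percentages.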

What is meant by the strict scientific sense are Koch's postulates, which date from before the cohort study era. Koch formulated criteria to establish a hard causal relationship between a disease and a microbe, since at the end of the 19th century people were mostly concerned with communicable diseases (diseases that spread, caused by for example viruses and bacteria). Koch stated, for example, that a microbe that causes a disease must be present in all sick individuals and not in healthy ones, and that when the microbe is introduced into a healthy individual, that individual must become sick as well. These postulates could not really be used for risk factors. With them, you could not prove that smoking causes lung cancer, because people who had never smoked got the same lung cancer too, and many smokers never did. Therefore, with the rise of non-communicable (chronic) diseases, the need for a different type of 'proof' emerged.

Over the years, scientists have used different analytical models to improve cohort studies and to make predictions, for example regression and proportional hazards models. Results are accompanied by many analyses, p-values and confidence intervals, which we will discuss in other articles.