Selecting an Evaluation Design

Observational and experimental designs

An evaluation design is the structure or framework you decide to use to conduct your evaluation. There are two main types of evaluation designs: observational and experimental.

An observational study is a design in which participants are not assigned by the evaluator to participate in the program (intervention group) or not (control/comparison group).

An experimental study involves the intentional assignment of participants to the experimental/intervention group and the control/comparison group. This allows the evaluator to alter the independent variable (the program or activity) and to control for external factors that could influence the outcome variable. Experimental designs are not always easy to implement, but they are the best option for reducing threats to internal validity (see the "Internal Validity" section below for a definition).

When to collect data

Deciding when to collect evaluation data is an important part of selecting an evaluation design. Options include the following:

  • Collect data only once, usually in a post-test. A post-test is when you collect data after the program or intervention.
  • Conduct a pre/post-test, where you collect data before and after the program takes place.
  • Collect data multiple times throughout the evaluation process.
  • Conduct a retrospective pre-test that is administered at the same time as the post-test.

More information about the benefits and limitations of each of these options is provided below in the table "Evaluation Designs – Description, Benefits, and Limitations."

Internal Validity

When deciding how to design your evaluation and when to collect data, it is important to think about minimizing threats to internal validity that could bias your data and evaluation findings. Internal validity refers to the extent to which you can ensure or demonstrate that external factors, other than the independent variable or your program, did not influence the outcome variables [2]. This can influence how true or accurate your findings and conclusions are, so it is important to protect against internal validity threats to ensure your evaluation is sound and reliable.

Below are descriptions [2] and examples of different types of threats to internal validity:

  • Maturation occurs when participants have matured or developed mentally or emotionally throughout the evaluation process, which influences the outcome variable.
    • Example: Kids perform better on a post-test foodborne pathogens quiz than on the pre-test simply because they are older and have become better test takers, not because of the new food safety curriculum at their school (education program/independent variable).
  • History threats happen when events that take place in participants’ lives during the program or evaluation process influence the outcome variable.
    • Example: Participants score high on a household audit because most of them recently read a newspaper article that highlighted consequences of foodborne illness, not because of the “Food Safety at Home Reminders” magnet they received in the mail.
  • Testing effect occurs when the participants’ post-test data is influenced by their experience of taking the pre-test.
    • Example: Participants’ understanding about the importance of separating raw meat, poultry, seafood, and eggs from other foods in their shopping cart improved in the post-test because of being exposed to that information in the pre-test, not because they read new signage on cross contamination in their local grocery store.
  • Instrumentation takes place when data on the outcome variable is influenced by differences in the way the pre-test and post-test assessments are administered or collected. Pre-test and post-test assessments need to be the same to prevent instrumentation from occurring.
    • Example: Participants more positively describe their safe food handling practices in the post-test interview because of the way the new interviewer described and interpreted the questions, not because a new training program encouraged them to adopt new safe food handling practices at home.
  • Recall bias takes place when participants do not accurately remember events they have experienced in the past and this influences the accuracy of the data collected.
    • Example: Participants do not remember how long they generally take to wash their hands so they guess a number of seconds that is inaccurate. The finding that most participants wash their hands for at least 20 seconds is due to participants incorrectly recalling how long they wash their hands, not due to reading new handwashing messages posted all over social media.
  • Social desirability bias occurs when participants provide responses they believe will be pleasing to the interviewer and will make them seem more favorable. This can lead to over-reporting of perceived “positive” behaviors and under-reporting of perceived “negative” behaviors.
    • Example: In one-on-one interviews participants share that foodborne illness is of great concern to them and that they always try their best to practice safe food handling practices at home because they want to impress the interviewer and think that is the “correct” answer, not because it is actually true.
  • Attrition bias happens when a loss of participants in the experimental or control/comparison group influences the evaluation data. Attrition can be due to reasons such as loss to follow-up, death, or moving away.
    • Example: The experimental/intervention group loses about a third of its participants before the post-test survey, and this negatively impacts the overall knowledge score on safe storage of foods. As a result, the change in scores largely reflects the loss of participants rather than the educational video they watched.
  • Selection bias occurs when differences in the data collected from the experimental and control/comparison group are due to differences between the individuals in each group, not because one group participated in the program and the other did not.
    • Example: Participants in the experimental group demonstrate greater motivation to adopt safe cooking practices at home because the group is composed of more risk-averse personalities, not because the experimental group was exposed to interactive TV ads on safe cooking.

Use of a comparison/control group

One way to protect your evaluation from validity threats is to use a comparison or control group of individuals who do not participate in the program to compare to individuals who do participate in the program.

The best way to choose a control group and prevent selection bias is to randomly choose both who will participate in the program and who will be in the control group. This is called random assignment and through this method both groups will be theoretically alike [2].

If random assignment is not possible, you can collect demographic information about individuals in each group so that, when analyzing data, you can adjust for differences between the groups [2]. Remember, the longer you wait to collect data after the program or intervention, the more likely it is that both groups will regress toward the mean and show fewer differences in the outcome variable [2]. It is generally best to collect your post-test data soon after the program is implemented. However, there are also times when a delay in collecting post-test data is necessary to allow participants enough time to implement a new behavior.
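
As a concrete illustration of random assignment (not part of the original guide), the short Python sketch below shuffles a hypothetical participant list and splits it into an intervention group and a control group; the names and group sizes are made up for the example.

```python
import random

def randomly_assign(participants, seed=42):
    """Shuffle a participant list and split it into two groups of (roughly) equal size."""
    pool = list(participants)
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced and documented
    rng.shuffle(pool)
    midpoint = len(pool) // 2
    return pool[:midpoint], pool[midpoint:]  # (intervention group, control group)

# Hypothetical example: six workshop registrants
intervention, control = randomly_assign(["Ana", "Ben", "Cho", "Dee", "Eli", "Fay"])
print("Intervention:", intervention)
print("Control:", control)
```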

"How They Did It" in a magenta box.To evaluate the effectiveness of Web-based and print materials developed to improve food safety practices and reduce the risk of foodborne illness among older adults, a randomized control design was used. One hundred participants were in the website intervention group, 100 in the print materials group, and 100 in the control group. Participants took a Web-based survey that was emailed to them before the intervention and about 2 months following the intervention.

To measure food safety behavior, participants were asked to report their behaviors when they last prepared specific types of food. To assess perception of risk of foodborne illness, participants were asked to rate their agreement with the following statement on a 4-point Likert scale: “Because I am 60 years or older, I am at an increased risk of getting food poisoning or foodborne illness.” Participants were also asked how satisfied they were with the educational materials and how informative and useful they found them. Overall, findings showed no significant differences in change between groups, indicating that the materials did not impact food safety behavior.

Kosa, K. M., Cates, S. C., Godwin, S. L., Ball, M., & Harrison, R. E. (2011). Effectiveness of educational interventions to improve food safety practices among older adults. Journal of Nutrition in Gerontology and Geriatrics, 30(4), 369-383.

Now that you have learned about threats to internal validity, you can weigh the benefits and limitations of various evaluation designs based on your resources and evaluation needs. Below is a table of common evaluation designs and the benefits and limitations of each option [Table based on information from: 2]. In general, the design options increase in rigor as you go down the table. It is important to know that some of these designs are frequently used to evaluate health programs, but are generally weak in terms of being able to tell you whether change in the outcome variable can be attributed to the program.

Note: Selecting a rigorous evaluation design may not be possible for you, since more rigorous designs are usually more expensive, time-consuming, and complex. If this is the case, consider one of the first design options listed, such as a one-group post-test only design or a one-group pre/post-test approach.

Evaluation Designs – Description, Benefits, and Limitations

One group post-test only
Description/Example:
  • Collect data after implementing the program.
  • Example: You implement a food safety workshop and then hand out a survey before participants leave.
Benefits:
  • Good to use if a pre-test might bias the data collected/findings, or when you are unable to collect pre-test/baseline data.
  • Generally inexpensive.
  • Easy to understand and easy for staff with little training to implement.
  • May be the only design option if you did not plan ahead and decide to evaluate once the program has already begun.
Limitations:
  • Weak design because you do not have baseline data from which to determine change.
  • Not very useful for understanding the actual effect of the program.
  • Examples of potential validity threats: history and maturation, because you only have information about a single point in time and do not know whether other events or maturation influenced the outcome.
One group, retrospective pre/post-test
Description/Example:
  • Collect both pre- and post-test data after implementing the program. The pre-test asks participants to think back to their experience before the program.
  • Example: You implement a food safety workshop and then hand out a two-part (pre- and post-) survey before participants leave.
Benefits:
  • Good to use when you are unable to collect traditional pre-test/baseline data.
  • Generally inexpensive.
  • May be the only design option if you did not plan ahead and decide to evaluate once the program has already begun.
  • May demonstrate more accurately how much participants feel they have benefited from the program [1].
Limitations:
  • Without a comparison group, this design is not very useful for understanding whether a change in the outcome variable is actually due to the program.
  • Examples of potential validity threats: recall bias and social desirability. Social desirability can be more influential in a retrospective pre-test than in a traditional pre-test [1].
Comparison group post-test only
Description/Example:
  • You have a comparison group of individuals who do not participate in the program. Following the program, you collect data from both the program participants and the comparison group.
  • Example: You implement a food safety workshop and then hand out a survey to participants before they leave. You also give the same survey to individuals in the comparison group who have not participated in the workshop. You later compare the survey results of program participants and the comparison group.
Benefits:
  • Good to use if a pre-test might bias the data collected, or when you are unable to collect pre-test/baseline data.
  • May be the only design option if you did not plan ahead and decide to evaluate once the program has already begun.
  • Statistical analysis to compare both groups is fairly simple.
Limitations:
  • A comparison group must be available.
  • The target audience must be large enough to have both an experimental and a comparison group.
  • Weak design because you do not have baseline data to determine change or whether differences between groups are actually due to the program.
  • Example of a potential validity threat: selection bias.
One group pre/post-test
Description/Example:
  • You collect data before and after the program or intervention takes place. Usually data is linked for each individual to assess the amount of change (see the analysis sketch following this table).
  • Example: You implement a food safety workshop and survey participants before and after they participate in the workshop. A survey knowledge score is calculated for each individual participant to find out whether scores improved after participating in the workshop.
Benefits:
  • Able to identify change before and after the program.
  • Generally easy to understand and calculate.
  • Demonstrates greater evaluation rigor and validity when seeking funders or sharing outcome findings with partners than relying on a single post-test or a retrospective pre/post-test [1].
Limitations:
  • Must be able to collect pre-test/baseline data.
  • Without a comparison group, this design is not very useful for understanding whether a change in the outcome variable is actually due to the program.
  • Examples of potential validity threats: testing effect and instrumentation.
One group, repeated measures or time series
Description/Example:
  • You collect data more than once before program implementation and at least two more times after the intervention, over time. The optimal number of times to collect data is five before and five after the program, but this will vary depending on your sample and evaluation needs [4].
  • Example: You implement a food safety workshop and survey participants a few times before and a few times after they participate in the workshop, over the following months. A survey knowledge score is calculated for each time participants took the survey to find out how scores improved after participating in the workshop and how much information was retained over time.
Benefits:
  • Able to identify change before and after the program.
  • By tracking change repeatedly over time, you have a greater opportunity to observe external factors that might influence findings and to address threats to internal validity.
  • Generally beneficial for large aggregates like schools or populations.
Limitations:
  • Must be able to collect pre-test/baseline data.
  • Example of a potential validity threat: history.
Two group pre/post-test
Description/Example:
  • You collect data from program participants and a comparison group before and after the program takes place. Usually data is linked for each individual to assess the amount of change.
  • Example: You implement a food safety workshop and survey workshop participants and the comparison group before and after the workshop takes place. A survey knowledge score is calculated for each individual participant in both groups to find out if or how scores changed and how the scores of program participants and the comparison group differ.
Benefits:
  • Able to identify change before and after the program/intervention.
  • Statistical analysis to compare both groups is fairly simple.
  • Demonstrates greater evaluation rigor and validity when seeking funders or sharing outcome findings with partners than relying on a single post-test or a retrospective pre/post-test [1].
Limitations:
  • A comparison group must be available.
  • The target audience must be large enough to have both an experimental and a comparison group.
  • Must be able to collect pre-test/baseline data.
Two or more group time series
Description/Example:
  • You collect data from program participants and at least one comparison group more than once before the program and at least two more times after the program, over time. The optimal number of times to collect data is five before and five after the program, but this will vary depending on your sample and evaluation needs [4].
  • Example: You implement a food safety workshop and survey workshop participants and two comparison groups a few times before and a few times after workshop implementation, over the following months. A survey knowledge score is calculated for each time individuals in both groups took the survey to find out how scores change over time and how the scores of program participants and the comparison groups differ.
Benefits:
  • Able to identify change before and after the program/intervention.
  • By tracking change repeatedly over time, you have a greater opportunity to observe external factors that might influence findings and to address threats to internal validity.
  • Generally beneficial for large aggregates like schools or populations.
Limitations:
  • A comparison group must be available.
  • Must be able to collect pre-test/baseline data.
  • The target audience must be large enough to have both an experimental and a comparison group.
  • Can require more complex statistical analysis to interpret data.
  • Examples of potential validity threats: history and selection.
Two group pre/post-test, with random assignment
Description/Example:
  • You randomly assign who will participate in the program and who will not (comparison group). You then collect data from program participants and the comparison group before and after the program takes place. Usually data is linked for each individual to assess the amount of change.
  • Example: You randomly choose who will participate in the food safety workshop and who will be in the comparison group. You provide a survey to workshop participants and the comparison group before and after the workshop takes place. A survey knowledge score is calculated for each individual participant in both groups to find out if or how scores changed and how the scores of program participants and the comparison group differ.
Benefits:
  • Able to identify change before and after the program/intervention.
  • Best option for outcome evaluation and for preventing internal threats to validity.
  • Demonstrates greater evaluation rigor and validity when seeking funders or sharing outcome findings with partners [1].
  • Statistical analysis to compare both groups is fairly simple.
  • Greater chance that the comparison group and program participants are equivalent, and reduced risk that differences between the groups might bias findings.
Limitations:
  • Random assignment must be possible.
  • A comparison group must be available.
  • The target audience must be large enough to have both an experimental and a comparison group.
  • Must be able to collect pre-test/baseline data.
  • When using random assignment, you must consider ethical concerns about not including high-risk individuals who are more vulnerable to foodborne illness in the program if they wish to participate. One way to address this problem is to provide the comparison group with the opportunity to participate in the program once the post-test data is collected or when the evaluation is complete.
  • Possible challenge: differences in attrition between groups.
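
For the pre/post-test designs above, the analysis can be as simple as computing each participant's change score and testing whether the average change differs from zero. The sketch below is a minimal illustration with made-up scores, using SciPy's paired t-test; it is one common option, not the analysis prescribed by the guide.

```python
from scipy import stats

# Hypothetical paired knowledge scores for the same eight participants (same order)
pre_scores = [55, 60, 48, 72, 65, 58, 70, 62]
post_scores = [68, 66, 55, 80, 70, 61, 78, 75]

# Individual change scores and their average
change = [post - pre for pre, post in zip(pre_scores, post_scores)]
print("Mean change:", sum(change) / len(change))

# Paired t-test: did scores change significantly after the workshop?
t_stat, p_value = stats.ttest_rel(post_scores, pre_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```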

When selecting your evaluation design, consider what is feasible and realistic given your resources, time, staff support, and access to your target audience. Your evaluation questions and the outcome variables or indicators you wish to observe should also influence your decision. It is also important to think about the best timing to measure your indicators, because measuring too early or too late could lead to inaccurate data and incorrect conclusions about how effective your program is.

"How They Did It" in a magenta box.

An exhibit about food safety and thermometer use was evaluated using a retrospective pre- and post-test evaluation of food safety knowledge and behavior. A retrospective pre-test was selected to examine change in self-reported food safety knowledge and skills. Prior to the evaluation, the survey was pilot tested at a statewide Extension conference poster session. Evaluation data was collected from 75 participants at three different events: a community hospital staff health fair, a county fair, and a county health fair for employees. Questions asked participants to rate their agreement with statements describing their knowledge and behavior before and after seeing the exhibit.

Survey results showed an increase in knowledge about thermometer use and planned behavior changes after seeing the exhibit. For example, when self-reporting retrospectively on knowledge about thermometers, 55% of participants agreed with the statement “I knew about instant-read thermometers” before seeing the exhibit, and 93% agreed with the statement “I know about instant-read thermometers” after seeing the exhibit. In addition, 29% of participants agreed with the statement “I tested the internal temperature of hamburger patties with a food thermometer” before seeing the exhibit, and 83% agreed with the statement “I will test the internal temperature of hamburger patties with a food thermometer” after seeing the exhibit.

McCurdy, S. M., Johnson, S., Hampton, C., Peutz, J., Sant, L., & Wittman, G. (2010). Ready-to-go exhibits expand consumer food safety knowledge and action. Journal of Extension, 48(5).

Sampling

How you collect your sample, the level of participation in the evaluation, and your sample size can all influence the external validity of your findings.

External validity refers to how accurately your evaluation findings can be generalized to the broader population or target audience [2]. It is generally better to have a large sample size, and it is important to make your sample as representative of (similar to) the general population or target audience as possible.

To figure out how to collect your sample, it can be helpful to start by identifying your theoretical population, the portion of that population you have access to, and then the sampling frame from which you will select your final sample.

[Graphic: Identifying Your Sample]

Probability and non-probability sampling

There are two different types of sampling methods you could use to select your sample: probability or non-probability. A probability sample, one in which the members of your sample have an equal chance of being selected, is usually the best option for a rigorous evaluation and to assess a causal relationship between the program and the outcome variables [3]. However, probability samples are usually time consuming and expensive. In addition, when working with a small, very specific, or hard-to-reach target audience, a probability sample may not be possible. In this case, you would choose a non-probability sample in which members do not have an equal chance of being selected [3]. Non-probability samples are generally easier to select but are not as representative of the population, which can make research findings less generalizable.

Below are descriptions [based on information from: 3] of the different types of probability and non-probability methods you could use to collect your sample:

Probability sampling:

  • Simple random sampling refers to when each person in the sampling frame has an equal chance of being chosen. For example, you randomly select names from a hat or use a random number table. This method is easy to use when the sampling frame is homogeneous and easy to access.
  • Systematic random sampling is when you create a list of individuals in your sampling frame and then select every Kth individual on the list. For example, you select every fourth individual.
  • Stratified random sampling refers to when you divide the sampling frame into subgroups based on factors of interest (factors that you think might influence food handling practices, such as age, gender, or socioeconomic status). Each group is homogeneous with regard to the factor or characteristic you choose, and you then randomly select individuals from each group. This can help you ensure that each characteristic is represented in your sample.
  • Cluster (area) sampling is when you divide the accessible population into subgroups or clusters (e.g., based on geography or different schools) and then randomly select clusters. (The sketch below illustrates the first three methods.)
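
To make the mechanics of these probability methods concrete, here is a minimal Python sketch using a hypothetical 20-person sampling frame; the stratifying characteristic (age group) and the sample sizes are illustrative assumptions only.

```python
import random

rng = random.Random(0)  # fixed seed for a reproducible selection

# Hypothetical sampling frame: 20 people, each tagged with an age group
frame = [{"id": i, "age_group": "under 40" if i % 2 else "40 and over"} for i in range(1, 21)]

# Simple random sampling: every person has an equal chance of being chosen
simple_sample = rng.sample(frame, k=5)

# Systematic random sampling: pick every Kth person after a random start
k = 4
start = rng.randrange(k)
systematic_sample = frame[start::k]

# Stratified random sampling: group by age group, then randomly select within each stratum
strata = {}
for person in frame:
    strata.setdefault(person["age_group"], []).append(person)
stratified_sample = [p for members in strata.values() for p in rng.sample(members, k=3)]

print(len(simple_sample), len(systematic_sample), len(stratified_sample))
```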

Non-probability sampling:

  • Convenience – You select participants who are the easiest or most convenient to choose.
  • Purposive – You sample participants who are easy to access “purposively,” with target characteristics in mind, to address your evaluation needs.
    • Modal instance: You sample individuals who you think are “typical” of the target audience. It can sometimes be challenging to define what characteristics make up a typical or average case.
    • Expert: You recruit a team or panel of experts on the topic of interest, such as food safety researchers with expertise in handwashing, to be included in your sample.
    • Quota: You pre-determine main characteristics of the target audience, and then proportionally or non-proportionally select individuals with those characteristics until the sample quotas are filled.
    • Heterogeneity: The main aim in this approach is to ensure your sample is diverse. You select individuals that represent different views or characteristics (factors that you think might influence food handling practices) without considering whether representation within the sample is proportional to the population.
    • Snowball: You find a few individuals that fulfill your pre-determined criteria to participate in the sample and ask them to suggest potential sample participants who are then contacted and recruited. This approach can be beneficial when the target audience is hard to access or reach.

Sample Size

One way to determine your sample size is to find the minimum size needed to detect change with a certain degree of confidence by conducting a statistical power analysis. You can make the calculation using programs such as G*Power (http://download.cnet.com/G-Power/3000-2054_4-10647044.html) or SAS (http://www.sas.com/en_us/home.html), or work with a statistician to conduct the power analysis. If you do not have sufficient funds to pay a statistician, consider reaching out anyway, explaining your situation and the purpose of your evaluation, and asking whether a statistician might be willing to volunteer his or her time for this task. You can reach out to statistics professors, teachers, or even graduate students in your area. You can also ask program stakeholders or partners whether they or someone they know has expertise or experience conducting a power analysis.

If you are not able to conduct a power analysis, you can determine your sample size based on the size of the population or target audience to which you are generalizing by using the table displayed on the following website: http://www.foodsafetysite.com/educators/course/sampling.html. The table is based on a 5% error rate.
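
If you work in Python rather than the tools named above, the statsmodels power module offers one way to run the same kind of calculation. The sketch below assumes a two-group comparison of mean knowledge scores with a medium effect size (Cohen's d = 0.5), a 5% significance level, and 80% power; these inputs are illustrative assumptions, and the result is a starting point rather than a substitute for consulting a statistician.

```python
from statsmodels.stats.power import TTestIndPower

# Assumed inputs: medium effect size (d = 0.5), alpha = 0.05, power = 0.80, equal group sizes
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80, ratio=1.0)
print(f"Minimum participants per group: {n_per_group:.0f}")  # roughly 64 per group
```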

Nonresponse

Response levels can influence the external validity of your findings. One way to address nonresponse issues is to select a larger sample size than the minimum required. This can help you make up for nonresponse issues such as death, loss to follow-up, or dropouts, which all contribute to attrition [2]. In addition, it is important to remember that not all members of the target audience will be eligible to participate in the program or evaluation, and not all of those eligible will be willing to participate. Keep this in mind as you recruit individuals for your sample, and aim for a larger sample size to help offset these challenges.
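
A common rule of thumb (an assumption here, not drawn from the guide's references) is to inflate the minimum sample size by the level of nonresponse or attrition you expect, as in this small sketch.

```python
import math

def adjust_for_nonresponse(minimum_n, expected_attrition):
    """Inflate a minimum sample size to allow for expected dropout or nonresponse.

    expected_attrition is a proportion between 0 and 1 (e.g., 0.20 for 20%).
    """
    return math.ceil(minimum_n / (1 - expected_attrition))

# Example: a power analysis calls for 64 participants per group and you expect 20% attrition
print(adjust_for_nonresponse(64, 0.20))  # recruit 80 per group
```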

"How They Did It" in a magenta box.

To evaluate the impact of a food safety curriculum, Hands On: Real-World Lessons for Middle School Classrooms, researchers wanted to find out (1) to what extent the curriculum impacted students’ food safety self-efficacy, and (2) to what extent a relationship existed between self-efficacy and food safety behavior change.

When selecting participants for the evaluation, special attention was paid to ensure the sample was diverse in terms of race and ethnicity, in order to promote external validity. A total sample of 1,743 students and 48 teachers participated in the evaluation. Participation was voluntary, dependent on parental informed consent, and did not include any incentives.

A previously validated assessment was used in a pre-test administered one week prior to implementing the program, one week after program implementation, and in a follow-up test 6-8 weeks after the program. To measure self-efficacy, a reliable scale with strong internal consistency, the Self-Efficacy of Food Safety Scale (SEFSS), was used. This scale was designed to measure students’ confidence in six areas: personal hygiene, sanitation, cross-contamination, cooking and cooling temperatures, foodborne illness, and high-risk behaviors. Knowledge was evaluated through 40 multiple-choice questions included in a student assessment. To assess food safety behavior, participants were asked how often they followed safe food handling practices on a 5-point Likert scale from Never to Always.

Findings demonstrated a strong predictive relationship between self-efficacy and positive behavior change and indicated that the Hands On program did increase students’ self-efficacy of food safety.

Beavers, A. S., Murphy, L., & Richards, J. K. (2015). Investigating change in adolescent self-efficacy of food safety through educational interventions. Journal of Food Science Education, 14(2), 54-59.

In Summary

When thinking about selecting an evaluation design and how to apply what you have learned in this chapter to your program, you may want to ask:

  • Is it feasible for me to select an experimental design for my program evaluation?
  • Do I have the resources needed to include a control/comparison group in the evaluation?
  • When will data be collected and how often?
  • How can I reduce threats to validity when designing my evaluation?
  • What is the best evaluation design option for my program evaluation given my evaluation needs, time, and resources? What are the benefits and limitations of this design?
  • Who will be included in my evaluation sample? List theoretical population, accessible population, sampling frame, and sample.
  • How will my sample be selected (sampling method)?
  • What sample size do I need for the evaluation? What sample size is feasible given my resources?
  • What can I do to reduce non-response rates?

References

  1. Betz, D. L., & Hill, L. G. (2006). Real world evaluation. Journal of Extension, 44(2).
  2. Issel, L. (2004). Health Program Planning and Evaluation: A Practical, Systematic Approach for Community Health. London: Jones and Bartlett Publishers.
  3. Trochim, W. M. K. (2006). Sampling. Retrieved from: http://www.socialresearchmethods.net/kb/sampling.php
  4. Tukey, J. W. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.