4. Manage development of the evaluation design

The design of a program evaluation sets out the combination of research methods that will be used to provide evidence for the key evaluation questions being addressed in the evaluation brief.

The design defines the data that is needed for the evaluation, and when and how it will be collected. The evaluation design needs to ensure that the evaluation will be as rigorous and systematic as possible, while meeting needs for utility, feasibility and ethics.

This section outlines some evaluation design issues for process, outcome and economic evaluations.

Who should develop the evaluation design?

Good evaluation design is critical to the overall credibility, defensibility and utility of the evaluation. While a program team may have the expertise to develop designs for smaller scale evaluations that mainly address descriptive questions, other designs may need evaluation expertise from evaluation units in the agency or cluster, from elsewhere in government, or from external evaluation providers. The manager responsible for the evaluation should also consider getting advice on the evaluation design, including who should develop it, from government evaluation units or from the steering committee or advisory group for the evaluation. These groups can also review the quality of the proposed evaluation design.

Specialist expertise might be needed to gather data about hard-to-measure outcomes or from hard-to-reach populations, or to develop an evaluation design that adequately addresses causal attribution in outcome and/or economic evaluations. Experience is also needed to judge the feasibility of applying particular designs within the context of a program, and to be sensitive to likely ethical and cultural issues.

If you use external providers to develop the design, they might provide an initial evaluation design in their response to the request for tender (RFT). This initial design is then reviewed and revised when developing the workplan for the evaluation. Alternatively, the development of the design may be commissioned as a separate project, so that the design becomes part of the information included in the RFT.

What type of evaluation should be used?

The key evaluation questions will influence the type of evaluation and the methods for data collection and analysis.

Different evaluation types

The evaluation project may use one or more types of evaluation, including process, outcome or economic. The data and findings from individual types of evaluation should inform the other types.

For example, a process evaluation may consider interim outcome data to assess program implementation. An outcome evaluation may rely on evidence from implementation gathered during a process evaluation, to better understand how program delivery contributed to program outcomes. And findings from process and outcome evaluation may inform the design of economic evaluations.

Quantitative, qualitative or mixed methods

The methods for data collection and analysis should be appropriate to the purpose and scope of the evaluation. Most program evaluations will collect both quantitative data (numbers) and qualitative data (text, images) in a mixed methods design to produce a more complete understanding of a program. A combination of qualitative and quantitative data can improve a program evaluation by ensuring that the limitations of one type of data are balanced by the strengths of another. It is important to plan in advance how these will be combined.

Quantitative methods are used to measure the extent and pattern of outcomes across a program using surveys, outcome measures and administrative data. Qualitative methods use observation, in-depth interviews, and focus groups to explore in detail the behaviour of people and organisations and enrich quantitative findings. They help to understand the 'how and why' including explaining whether the program is likely to be the cause of any measured change. In cases where outcomes are not achieved, qualitative data can help understand whether this is a case of program failure or implementation failure.

Balancing rigour, utility, feasibility and ethics

An important part of evaluation design is investigating questions of rigour, utility, feasibility and ethical safeguards so that the final design is as rigorous as possible while delivering a useful, practical evaluation that protects participants from harm.

The evaluation design needs to balance these four elements, and so design is often an iterative process. For example, there may be a trade-off between rigour and utility. A very accurate and comprehensive evaluation might not be completed in time to inform key decisions. In this case it might be better to include both short-term and long-term outcomes, so that an initial assessment of whether the program is working can be followed up by a more comprehensive assessment. However, this would require greater resources, especially for tracking client outcomes over time if these are not already being collected, which might affect feasibility.

There may also be a trade-off between rigour and feasibility. Decision-makers may be interested in the effectiveness of a new program and seek an outcome evaluation. This may not be feasible in the first two years of the program while program processes are still being developed and rolled out. A feasible scenario may be a process evaluation followed by a longer time frame for an outcome evaluation.

Designs for process evaluation

Process evaluations explore evaluation questions about program implementation.  They may describe implementation processes and the pattern of uptake of or engagement with services, check whether a program is being implemented as expected, and differentiate bad design (theory failure) from poor implementation (implementation failure).

Process evaluations can be used periodically to undertake cycles of program improvement by informing adjustments to delivery or testing alternative program delivery processes.  For pilots, new programs and innovations within a program, process evaluations document how the program is being implemented.

Key evaluation questions, with the evidence required and possible methods or data sources for each:

How well has the program been established?

  • Evidence required: Description of program development compared with client needs, timeframes. Quality of governance, relationships. Influence of different factors and contexts. Initial evidence of uptake.
  • Possible methods or data sources: Program reports, key informant interviews, consultations with managers or service providers, program forums.

How is the program being implemented?

  • Evidence required: Description of implementation processes by different providers and in different circumstances. The extent that implementation processes met milestones and targets for outputs, timeliness, cost, participation and immediate outcomes. The quality of outputs and immediate outcomes measured against standards and targets. The pattern of outputs, uptake and immediate outcomes, by different sub-groups or in different contexts. Client or customer satisfaction.
  • Possible methods or data sources: Program monitoring data and other program records. Observation including photography and video. Interviews, surveys or focus groups with managers, staff, program clients, referring agencies. Consultations with managers or service providers.

Is the program being implemented well?

  • Evidence required: As above plus information about good practice in implementation processes.
  • Possible methods or data sources: Expert review of program documents, or observations during site visits.

Process evaluations are often designed using the program logic to collect evidence that describes the outputs and immediate outcomes. This may cover:

  • Program reach and uptake across intended target groups.
  • Actual implementation processes.
  • Participant satisfaction.
  • Standards of implementation such as quality, efficiency and cost.
  • The influence of different contexts and other factors on implementation.

A rigorous and systematic process evaluation should bring together evidence from different data sources to answer the evaluation questions. The design for a process evaluation will depend upon the size of the program, the scale of the evaluation, and the extent to which data on program implementation and uptake is collected through the program's monitoring system.

Process evaluation uses quantitative and qualitative data collection and analysis methods. Quantitative methods typically involve analysing program reach or staff/consumer experiences using surveys and administrative data. Qualitative methods include observation studies, interviews, group processes, audits, expert reviews, and case studies.

Designs for outcome evaluation

Outcome evaluation (sometimes called impact or results evaluation) aims to determine whether the program caused demonstrable effects on the defined target outcomes. Most significant programs should seek a rigorous outcome evaluation to demonstrate that the investment in the program is worthwhile and that there are no major unintended consequences.

An outcome evaluation should identify the pattern of outcomes achieved (for whom, in what ways, and in what circumstances), and any unintended impacts (positive and negative).  It should examine the ways the program contributed to outcomes, and the influence of other factors.

Depending on the scale and maturity of the program, it may be possible to build in strong evaluation designs when the program itself is being designed. This is ideal as it can facilitate a more rigorous outcome evaluation after the program has become operational.

Before embarking on the design for an outcome evaluation, it may be helpful to work through the following key questions:

  • What are the outcomes the program aims to achieve? (See Step 1. Develop program logic and assess needs.)
  • Are there suitable existing data that actually measure the outcome(s) of interest? Consider an evaluability assessment. (See Step 2. Develop the evaluation brief.)
  • If not, would it be possible to collect data on outcomes?
  • Can the counterfactual be estimated in some way? Is there scope to use data from a comparison group? If not, what alternative approach to causal inference should be used?

Key aspects of an outcome evaluation are:

  • Measuring or describing the outcomes (and other important variables)
  • Explaining whether the intervention was the cause of observed outcomes

Measuring or describing the outcomes (and other important variables)

An outcome evaluation relies on valid and systematic evidence for program outcomes. It is useful to identify any data already available from existing sources, such as program monitoring data, relevant statistics, and previous evaluation and research projects. Additional data can be gathered to fill in gaps or improve the quality of existing data using methods such as interviews (individual and group; structured, semi-structured or unstructured), questionnaires and direct measurement.

Descriptions of outcomes should not only report the average effect, but also how varied the results were, and in particular the patterns for key variables of interest, such as different participant characteristics. It is important to show in which contexts the program is more effective, which target groups benefit most, and what environmental settings influence the outcomes.
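As a minimal sketch of reporting variation as well as the average, the example below summarises outcome change scores by participant group. The group names and scores are hypothetical, and the function `summarise_by_group` is an illustrative helper rather than part of any evaluation toolkit.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarise_by_group(records):
    """Return {group: (n, mean, standard deviation)} for outcome change scores."""
    by_group = defaultdict(list)
    for group, change in records:
        by_group[group].append(change)
    return {g: (len(v), mean(v), pstdev(v)) for g, v in by_group.items()}


# Hypothetical change scores (after minus before) for two participant groups.
records = [
    ("urban", 4), ("urban", 6), ("urban", 5), ("urban", 5),
    ("regional", 0), ("regional", 10), ("regional", -2), ("regional", 12),
]

overall = mean(change for _, change in records)
summary = summarise_by_group(records)

print(f"overall mean change: {overall}")
for group, (n, avg, sd) in summary.items():
    print(f"{group}: n={n}, mean change={avg}, sd={sd:.2f}")
```

In this made-up data both groups have the same average change, but results in the regional group are far more varied. Reporting only the average would hide that the program worked very well for some regional participants and not at all for others.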

An outcome evaluation may rely on evidence from a process evaluation about program implementation and experiences to gain a better understanding of the drivers affecting program outcomes. Information is also needed about the different contexts in which the program was implemented to understand if a program only works in particular situations.

Explaining whether the intervention was the cause of observed outcomes

An important feature of outcome evaluation is that it does not simply gather evidence of outcomes, but seeks to assess and understand the program's role in producing them. The program is rarely the sole cause of changes; it usually works in combination with other programs or activities and other environmental factors. Therefore, 'causal attribution' does not usually refer to total attribution (that is, the program was the only cause), but to partial attribution or to analysing the program's contribution. This is sometimes referred to as 'plausible contributions'.

In agricultural research, for example, outcomes in terms of improved productivity can be due to a combination of basic and applied research, product development and communication programs. An investment in any one of these might not bear sole responsibility for the productivity outcomes. Each investment might be essential, but would not have been able to produce the outcomes without the other programs. In other words, any one program may have been necessary but not sufficient to bring about that outcome.

Three approaches to investigating causal attribution or plausible contribution are:

  • the counterfactual – comparing the outcomes with an estimate of what would have happened in the absence of the program
  • the factual – analysing the patterns of outcomes, and comparing how actual results match what was expected
  • alternative explanations – investigating and ruling out other explanations.

In some cases all three approaches to causal attribution can be included in the same evaluation design. In complex situations, it might not be possible to estimate a counterfactual, and causal analysis will rely on other approaches. Selecting an outcome evaluation design involves systematically deciding between the options.

Designs for economic evaluation

When it is used for program evaluation, economic evaluation addresses questions of efficiency by standardising outcomes in terms of their dollar value, an approach sometimes referred to as assessing value for money.

Economic evaluation is used in a summative way to determine whether the program has been cost-effective or whether the benefits exceed the costs, drawing upon the findings of outcome evaluation. Economic evaluation is also used with a formative purpose during the program design stage to compare different potential options, using modelling of the likely outputs and outcomes, referred to as ex ante evaluation.

Economic evaluation stands at the intersection of program evaluation and economic appraisal (PDF, 494 KB), and the concepts and terminology are sometimes used differently in the two fields. These differences are set out in a recent paper by the Productivity Commission (PDF, 518 KB).

The main forms of economic evaluation used in program evaluation are:

  • efficiency analysis
  • cost-effectiveness analysis
  • cost-benefit analysis

Each form of economic evaluation relies on costing or valuation studies to assign monetary costs to the range of program inputs. The forms differ in what those costs are compared with: outputs (efficiency analysis), outcomes (cost-effectiveness analysis) or monetised benefits (cost-benefit analysis).

Efficiency analysis focuses on the relationship between inputs and outputs, and can bring useful insights into delivery processes that point to opportunities for cost-optimisation. For example, a program designed to reduce recidivism through a number of different clinical support models could use efficiency analysis to compare the cost per person assisted for each of the support models.

Efficiency analysis can explore the factors associated with these differences in costs and establish benchmarks to monitor future costs for different delivery situations.

Cost-effectiveness extends the analysis to intended outcomes. Cost-effectiveness analysis is used where the outcomes are not readily measurable in monetary terms, for example in areas of health, education or social welfare. It can be used to compare the cost-effectiveness of different programs with the same outcomes, or to determine the most cost-effective delivery options within the same program.

For example, a program designed to reduce recidivism through a number of different clinical support models could use cost-effectiveness analysis to compare the cost of service delivery to the reduction in recidivism for each of the support models.
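The two recidivism examples above can be sketched side by side. All figures below are hypothetical, and "reoffences avoided" is assumed to come from an outcome evaluation that has already estimated the reduction in recidivism for each model; the helper `cost_per_unit` is illustrative only.

```python
def cost_per_unit(total_cost, units):
    """Divide a cost by a count of outputs or outcomes; None if the count is zero."""
    return total_cost / units if units else None


# Hypothetical figures for two clinical support models.
models = {
    "model_a": {"cost": 500_000, "people_assisted": 250, "reoffences_avoided": 20},
    "model_b": {"cost": 600_000, "people_assisted": 200, "reoffences_avoided": 40},
}

results = {}
for name, m in models.items():
    results[name] = {
        # Efficiency analysis: cost per output (person assisted).
        "cost_per_person": cost_per_unit(m["cost"], m["people_assisted"]),
        # Cost-effectiveness analysis: cost per unit of outcome.
        "cost_per_reoffence_avoided": cost_per_unit(m["cost"], m["reoffences_avoided"]),
    }

for name, r in results.items():
    print(name, r)
```

With these made-up numbers, model A is cheaper per person assisted, while model B costs less per reoffence avoided. The two analyses answer different questions and can rank options differently.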

Cost-benefit analysis is the most comprehensive of the economic appraisal techniques. It quantifies in money terms all the major costs and benefits of a program with a view to determining whether the benefits exceed the costs, and if so by how much (expressed as a ratio of benefits to costs). It compares the present value of the program's costs with the present value of its benefits, using a discount rate to reduce the value of future costs or benefits to today's costs and benefits. The difference between the two is the net present value (NPV).
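The discounting step can be illustrated with a short sketch. The yearly costs and benefits and the 7 per cent discount rate below are hypothetical and chosen only for illustration; they are not recommended values, and the function `present_value` is an assumed helper name.

```python
def present_value(flows, rate):
    """Discount yearly amounts to today's dollars; flows[0] occurs in year 0."""
    return sum(amount / (1 + rate) ** year for year, amount in enumerate(flows))


# Hypothetical program: set-up cost in year 0, then running costs and benefits.
costs = [1_000_000, 100_000, 100_000, 100_000]
benefits = [0, 450_000, 450_000, 450_000]
rate = 0.07  # illustrative discount rate only

pv_costs = present_value(costs, rate)
pv_benefits = present_value(benefits, rate)
npv = pv_benefits - pv_costs   # net present value
bcr = pv_benefits / pv_costs   # ratio of benefits to costs

print(f"undiscounted net benefit: {sum(benefits) - sum(costs):,.0f}")
print(f"present value of costs:    {pv_costs:,.0f}")
print(f"present value of benefits: {pv_benefits:,.0f}")
print(f"net present value:         {npv:,.0f}")
print(f"benefit-cost ratio:        {bcr:.2f}")
```

In this example the undiscounted benefits exceed the undiscounted costs, but once future amounts are discounted the net present value is negative and the ratio falls below 1. This is why the choice of discount rate and the timing of costs and benefits matter to the conclusion.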

Cost-benefit analysis is more readily applied to programs producing outputs that generate revenue (for example water supply and electricity), or else where the major benefits can be quantified fairly readily (for example roads).

One form of cost-benefit analysis that is being used more commonly is a measure of the social return on investment.

Synthesising evidence into an evaluative judgment

In any type of evaluation it is important to bring together all the relevant data and analysis to answer each evaluation question. It is rare to base the overall evaluative judgment on a single performance measure. It usually requires synthesising evidence about performance across different dimensions.

Methods of synthesis include:

  • weighted scale
  • global assessment scale or rubric
  • evaluative argument.

A weighted scale is where a percentage of the overall performance rating is based on each evaluative criterion. However, a numeric weighted scale (PDF, 1.38 MB) often has problems, including arbitrary weights and lack of attention to essential elements.
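A small sketch shows how a weighted scale is calculated and how it can mask failure on an essential element. The criteria, weights and ratings (on a 1 to 5 scale) are hypothetical, and `weighted_score` is an illustrative helper.

```python
def weighted_score(ratings, weights):
    """Combine criterion ratings into one score using percentage weights."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(ratings[criterion] * weight for criterion, weight in weights.items())


# Hypothetical criteria, weights and ratings on a 1 (poor) to 5 (excellent) scale.
weights = {"reach": 0.3, "quality": 0.3, "satisfaction": 0.3, "safety": 0.1}
ratings = {"reach": 5, "quality": 5, "satisfaction": 5, "safety": 1}

score = weighted_score(ratings, weights)
print(f"overall weighted score: {score:.1f} out of 5")
```

Here the program scores highly overall even though it received the lowest possible rating on safety, which most users would treat as essential. The weights themselves are also a judgment call, which is the arbitrariness problem noted above.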

A global assessment scale or rubric can be developed with intended users and then used to synthesise evidence transparently. A rubric clearly sets out criteria and standards for assessing different levels of performance. The scale must include a label for each point (for example, "unsuccessful," "somewhat successful," "very successful") and a description of what each of these looks like.

A more general method of synthesis is evaluative argument. This is a reasoned approach to reaching conclusions about specific evaluation questions by weighing up the strength of evidence against the causal links set out in the program theory, and describing the degree of certainty of these conclusions. A related method is contribution analysis, which is a systematic approach to developing a contribution story where a counterfactual has not been used. Evaluative argument is suited to synthesising findings for more complicated programs, where there are a number of external factors to consider, or where there is a mix of evidence and different degrees of certainty.

Research design issues

Research design refers to when and how data will be collected to address key evaluation questions and is critical to the rigour of the findings, and the feasibility of the methods for data collection. Two key issues are:

  • Sampling
  • Timing

Sampling – in some cases it may not be possible, desirable or appropriate to collect data from all sites, all people and all time periods. In these cases a systematic approach to sampling may be needed, so the sample data can be appropriately generalised. For outcome evaluations, the sample needs to be large enough for the results to be statistically valid. Power calculations provide an indication of the minimum sample size needed to assess the impact of a program.
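As a rough sketch of a power calculation, the example below uses a standard normal approximation for comparing the average outcome of two equally sized groups. It assumes a two-sided test and an effect size expressed in standard deviation units; the default significance level (0.05) and power (0.80) are common conventions, not requirements from this guidance. A statistician should confirm the calculation for a real design.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate participants needed per group to compare two means.

    effect_size is the expected difference between groups divided by the
    standard deviation of the outcome (a standardised difference).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: about {sample_size_per_group(d)} per group")
```

The smaller the effect the program is expected to have, the larger the sample needed to detect it, which is one reason sampling decisions affect both rigour and feasibility.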

Timing – an important aspect of research design is when data are collected:

  • Snapshot – collecting data at one point of time. It doesn't allow for analysis of changes over time, except by asking people to report these retrospectively.
  • Before and after – comparing baseline data to a later stage, such as health indicators before and after treatment, or program performance measures before and after a policy change. While this can provide evidence that a change has occurred, by itself it doesn't answer questions about the effect of a program. Without a comparison/counterfactual we have no way of knowing whether changes would have occurred anyway.
  • Time series – collecting data at multiple points over time.
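The limitation of a simple before and after comparison can be illustrated with a sketch that adds a comparison group. The technique shown, often called difference-in-differences, is one possible way to estimate a counterfactual and is an addition for illustration, not a design prescribed here. It assumes the comparison group shows what would have happened to program participants without the program. All scores are hypothetical.

```python
def before_after_change(before, after):
    """Change in the average outcome between baseline and follow-up."""
    return after - before


def difference_in_differences(prog_before, prog_after, comp_before, comp_after):
    """Program group change minus comparison group change over the same period."""
    return (before_after_change(prog_before, prog_after)
            - before_after_change(comp_before, comp_after))


# Hypothetical average outcome scores at baseline and follow-up.
program = {"before": 50, "after": 62}
comparison = {"before": 49, "after": 57}

naive_change = before_after_change(program["before"], program["after"])
estimated_effect = difference_in_differences(
    program["before"], program["after"],
    comparison["before"], comparison["after"],
)

print(f"before and after change in program group: {naive_change}")
print(f"change after subtracting comparison group trend: {estimated_effect}")
```

In this made-up case most of the improvement in the program group also occurred in the comparison group, so attributing the full before and after change to the program would overstate its effect.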

Ethical and cultural issues

Ethics in program evaluation refers to the potential risk of harm to people participating in the evaluation, whether as informants or as evaluators. Types of harm include loss of privacy or benefits for program participants, damage to vulnerable groups, and physical or mental harm to informants or researchers. Ethics in program evaluation comes under the broader topic of ethics in human research.

The Australasian Evaluation Society has produced Guidelines for the Ethical Conduct of Evaluation.

The potential risk of harm varies with different evaluation designs, and is an important consideration for the quality and ethics of the evaluation project.

During the evaluation design step, it is critical to identify:

  • whether external ethics review is required, and whether the agency has a policy or guidelines relating to external ethics review
  • whether vulnerable or culturally distinct groups are involved
  • whether linked data is involved, with different consent and privacy issues.

An application for an external ethics review can involve substantial work and time. Research involving animal or human participants requires approval from a recognised ethics committee. Other data collection may be considered "continuous improvement" rather than research and not require external ethics approval. In each case it is important to consider the potential benefits as well as how an ethics approval process will affect the cost and timeframe.

An associated issue is the cultural appropriateness of an evaluation, particularly in relation to services and programs for minority or vulnerable groups such as Aboriginal people or refugees. For example, there are guides and standards for involving Aboriginal communities in research projects that cover community participation, culturally appropriate methods, and providing suitable feedback to the community. You should look at guides and standards within your agency and from peak organisations in the relevant policy field. You should also consider engaging an Aboriginal or culturally and linguistically diverse (CALD) consultant to be involved in the evaluation at an appropriate level, including evaluation design, planning, data collection and facilitating or co-facilitating consultations.

Taking account of cultural issues can influence the rigour and feasibility of an evaluation project. For example, additional logistics may need to be factored in, such as using interpreters, or time spent working face to face with remote communities.

Sources of advice for evaluations to meet ethical and cultural standards for working with Aboriginal communities include:

Evaluation design setting out how data will be collected, analysed and reported to answer key evaluation questions.