
Nordic Economic Policy Review 2026

Learning to Experiment: Practical Lessons from Finland’s Two-Year Preschool Experiment


Ramin Izadi and Matti Sarvimäki

Abstract

This paper describes how Finland implemented a nationwide randomised field experiment to evaluate a potential reform of its early childhood education and care (ECEC) system. The two-year preschool experiment involved 37,357 children at 956 centres and was enabled by temporary legislation enacted specifically for this purpose. The trial was designed and executed in close collaboration between researchers, civil servants, and teachers, with two successive governments committing to wait for the results before making policy decisions. We document how the experiment was planned, legislated, and implemented, and analyse the institutional arrangements that made it possible. We highlight the value of sustained co-operation between policymakers and researchers, the benefits of a multidisciplinary research team, and how the experiment both relied on and reinforced Finland’s data infrastructure. We conclude by discussing lessons for other countries seeking to embed rigorous evaluation into their policy processes.
Acknowledgements: We would like to thank Morten Visby Krægpøth, Ingeborg Foldøy Solli, Roope Uusitalo, Antti Kauhanen, and participants at the Advancing Policy Through Randomized Experiments, Nordic Economic Policy Review Conference, for helpful comments and suggestions. All remaining errors are our own. We gratefully acknowledge financial support from the Research Council of Finland (#358924, #358946).

1 Introduction

In 2021, Finland launched an exceptionally large and ambitious policy experiment. The initiative reflected a long-standing debate about the role of early childhood education and care (ECEC) and preschool as the foundation for lifelong learning. Among policymakers, a broad consensus had emerged that extending preschool from one to two years could be a promising way to improve children’s skills and promote equality of opportunity. In practice, this would mean that preschool would begin at age five rather than six, while children would still start primary school at age seven.
Before implementing such a reform, the government decided to generate evidence on its potential effects. As a result, we implemented a large-scale randomised field experiment designed to assess the impacts of extending preschool on children’s academic and socioemotional skills, as well as on educational practices, parental experiences, and the functioning of early education centres. Preparations for the experiment began in spring 2020, and it was launched in autumn 2021. The government-commissioned final report, focusing on effects measured at the start of primary education, was published in January 2026. However, the project is planned to continue far into the future. Skill assessments are planned until the end of comprehensive schooling in the early 2030s, and register-based follow-up will allow long-term analysis of children’s eventual educational and labour market outcomes. 
The trial covered 37,357 children born in 2016 and 2017 at 956 early education centres in 148 municipalities, corresponding to roughly one-third of all Finnish children in those birth cohorts. To enable the experiment, the Finnish Parliament enacted special temporary legislation that specified the randomisation procedure, mandated municipal participation, and authorised the collection of extensive data. The government also committed to taking the results of this experiment into account when deciding whether to reform preschool. This commitment was subsequently upheld by the succeeding government from the opposite end of Finland’s political spectrum, making the initiative an exceptional example of policy continuity across governments from opposing political coalitions.
This paper documents how the experiment was planned, legislated, and implemented. Section 2 provides background on how Finnish policymakers gradually moved from a cautious attitude toward randomisation to accepting rigorous experimentation as a legitimate tool for policy development. We describe how early academic trials and often discouraging discussions with government officials eventually gave way to a change. This account is necessarily somewhat subjective and incomplete, but we hope it will inform and inspire similar developments in other countries.
Section 3 describes our collaboration with the Ministry of Education and Culture in setting up the two-year preschool experiment. The process began with a single sentence fragment in the 2019 government programme of Prime Minister Antti Rinne (later replaced by Sanna Marin), which stated that the government would “pilot two-year preschool.” We outline how these few words were first translated into a research proposal and subsequently into temporary legislation. Again, we offer a necessarily partial and subjective account of this process, which we hope will be useful for both policymakers and researchers seeking to undertake similar initiatives.
Section 4 details our randomisation process. We discuss the challenges of moving from the simple textbook idea of random assignment to implementing it in a complex real-world policy setting. In our case, the treatment had to be assigned at the centre level, even though children are often moved between centres, and some did not attend centre-based early education at the time of randomisation. These features created complications that required both technical and administrative solutions, which we describe in some detail. We also share several practical lessons learned along the way, such as how re-randomisation procedures can be used not only to improve covariate balance but also to ensure that the experiment remains within budget and administratively feasible. Finally, we discuss how randomisation can be communicated to participating municipalities and parents in ways that maintain transparency and trust. In particular, we emphasise the importance of allowing some degree of non-compliance to facilitate smooth implementation. 
In Section 5, we discuss the data collected as part of the experiment. Working closely with colleagues from educational sciences, psychology, and learning analytics, we conducted large-scale assessments of children’s academic and socioemotional skills at ages 5, 6, and 7. Academic skills were measured through direct assessments, in which each child completed a structured session with a teacher using a tablet-based test in ECEC and a computer-assisted test with headphones in primary school. Socioemotional skills were assessed by teachers using standardised rating instruments. These assessments were used to construct six preregistered measures: two capturing academic skills (numeracy and literacy) and four capturing socioemotional skills (task performance, social skills, emotional regulation, and peer relationships).
These data were linked to extensive administrative records covering children, their parents, teachers, peers, schools, and early education centres. For each individual, we observe detailed information on education, labour market outcomes, demographics, and family composition. In addition, our colleagues collected rich survey data that provided a much more detailed understanding of how the treatment was implemented in practice than would have been possible using register data alone. Finally, we assembled new administrative information that should ideally be incorporated into national registers – such as data on which teacher was assigned to which group – but is not yet systematically recorded. 
Section 6 describes the nature of the treatment. Using large-scale survey and administrative data collected as part of the experiment, we show that the reform altered children’s learning environments in measurable but modest ways. Treatment centres had more highly qualified teachers, more homogeneous age groups, and slightly more guided learning time, while overall resources, group sizes, and daily schedules remained largely unchanged. 
Section 7 summarises the main findings on children’s skills. We show that extending preschool increased participation in formal early education at age five by shifting a small group of children from home care into centre-based provision. For these children, two-year preschool generated sizeable, short-term gains in academic and socioemotional skills, but the persistence of these gains remains uncertain. For the large majority of children, however, the relevant alternative was standard daycare. For this group, extending preschool had no effect on skills at school entry, and the scale of the experiment allows us to rule out even modest policy-relevant impacts with confidence. As a result, average effects at the population level are essentially zero.
We conclude with a discussion of broader lessons from the experiment, emphasising how similar large-scale randomised evaluations could be embedded into policy processes in other countries.

2 Randomised field experiments in Finland

This section places the two-year preschool experiment within Finland’s gradual shift toward government-sponsored randomised policy trials. Over the past decade, legal and administrative frameworks have been adapted to support experimentation. While gaps remain, rigorous evaluation is now broadly accepted, and even encouraged, across major political parties. We present a brief discussion of how this shift occurred, highlighting details likely to be useful to researchers, civil servants, and policymakers in contexts where randomisation is still considered problematic.

2.1 Early steps

Around the turn of the millennium, economists in Finland, much like their peers elsewhere, began advocating more forcefully for policy evaluation based on credible research designs. Much of the early push focused on bringing quasi-experimental methods to administrative data and programme evaluation, but there were also early calls to run government-sponsored randomised experiments (e.g., Hämäläinen and Uusitalo 2005).
For much of the 2000s and early 2010s, however, progress was limited, and randomised policy trials were often dismissed as constitutionally problematic. This objection appeared to be largely based on a cursory reading of the Constitution’s equality clause rather than on careful legal analysis. Instead, ministries and agencies tended to favour regional pilots selected administratively or via applications by willing participants. Although these pilots also led to differential treatment for citizens depending on where they lived, they were not generally perceived as raising the same constitutional concerns as randomisation. This interpretation of the law appeared illogical and was deeply frustrating for the research community, particularly given that such designs were unlikely to yield credible evidence.
While most policymakers remained cautious, a few government agencies took part in relatively light-touch randomised experiments. In the late 1990s, the Ministry of Labour and public employment offices collaborated with psychologists at the Finnish Institute of Occupational Health on a series of small-scale randomised field experiments evaluating job-search training programmes (Vuori et al. 2002; Hämäläinen et al. 2008). In 2012, the Finnish Tax Administration participated in a study that randomised whether firms received letters providing information about VAT rules (Kosonen and Ropponen 2015). In 2014, Finnish Customs partnered with researchers on an experiment in the used-car import sector, in which randomly selected potential importers were informed that odometer readings would be verified through inspections (Harju et al. 2020). Meanwhile, academics continued to implement RCTs within their own research programmes. A prominent example is the KiVa antibullying programme, evaluated in a large RCT in 2007–2009 and later scaled nationally (Kärnä et al. 2011; Salmivalli and Poskiparta 2012). Another example is an information intervention in high schools in 2011–2012 (Pekkala Kerr et al. 2020).

2.2 The basic income experiment

The launch of the basic income experiment marked an important step toward governmental involvement in randomised trials. The aim was to examine whether simplifying the benefit system and eliminating the withdrawal of benefits as income rises would increase employment. The experiment was part of Prime Minister Juha Sipilä’s government programme and became the first randomised experiment to be implemented under a dedicated piece of legislation. It also led the Constitutional Law Committee to spell out the constitutional preconditions for such experiments. The Ministry of Justice later codified these principles in a 2020 guide for drafting legislation enabling experiments. The topic of this paper, the two-year preschool experiment, is the second implemented via this legislative route, which is why we discuss the first – the basic income experiment – in some detail.
The final design of the basic income experiment was narrower than the one proposed by the researchers commissioned to prepare a proposal for it (Kangas and Pulkka 2016). As a result, it was widely viewed as ill-suited to evaluating basic income as a comprehensive reform (e.g., De Wispelaere et al. 2018; Hämäläinen and Verho 2022). Nevertheless, it created a clean research design for assessing how long-term unemployed individuals respond to stronger financial work incentives and for studying the welfare effects of streamlining the welfare system.
For the experiment, 2,000 individuals on the minimum unemployment benefit were randomly assigned to a tax-free basic income of €560 per month for 2017–2018. The comparison group comprised the remaining eligible beneficiaries (about 175,000 individuals). The basic income equalled the minimum unemployment benefit, meaning that disposable income while unemployed remained largely unchanged. Importantly, however, the benefit depended neither on earnings nor on meeting job-search requirements. As a consequence, participation tax rates for full-time work fell by an average of roughly 23 percentage points.
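The text does not define the participation tax rate. As a reminder, under its standard definition it is the share of gross earnings from taking a job that is lost to higher taxes and withdrawn benefits. The symbols below are our own notation, not taken from the studies cited:

```latex
\[
\text{PTR} \;=\; 1 - \frac{Y^{\text{net}}_{\text{work}} - Y^{\text{net}}_{\text{no work}}}{w}
\]
```

Here $w$ denotes gross earnings from full-time work, and $Y^{\text{net}}_{\text{work}}$ and $Y^{\text{net}}_{\text{no work}}$ denote disposable income in and out of work. Because the basic income was not withdrawn when a recipient took a job, $Y^{\text{net}}_{\text{work}}$ rose for any given $w$ while $Y^{\text{net}}_{\text{no work}}$ stayed roughly constant, which lowers the PTR.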
The dramatic reduction in the participation tax rate had little effect on employment. During the first year, the estimated impact on annual days employed was statistically insignificant, with a point estimate of only 1.5 days (95% confidence interval: –2.3 to 5.4 days) relative to a baseline of 49 days in the control group (Verho et al. 2022). In the second year, the estimated effect was 6.6 days and statistically significant. However, the research design for this second year was compromised by another policy reform that affected only the control group.
On the other hand, survey evidence indicated greater life satisfaction, lower perceived stress, and fewer bureaucratic hassles for recipients (Tuulio-Henriksson and Simanainen 2020). However, the low overall response rate and the substantial difference in response rates between the treatment (31%) and control (20%) groups warrant caution in interpreting these differences as causal effects. These findings are nevertheless consistent with a follow-up study using administrative health registers, which found that receiving basic income reduced the use of psychotropic medication by 8–11% (Hämäläinen et al. 2025).

2.3 Towards a culture of experimentation

The basic income experiment was part of a broader ambition of the Sipilä government to foster a “culture of experimentation”. The government’s strategic programme emphasised experimentation and established a dedicated Policy Experimentation Unit in the Prime Minister’s Office to coordinate and support pilot projects. Under this agenda, a large number of trials were launched in several policy areas.
While this approach reflected strong institutional enthusiasm for experimentation, systematic evaluation was often omitted. According to an official review of the experimentation agenda, many projects labelled as experiments did not meet the criteria of genuine experimentation: they lacked clear hypotheses, comparison groups, and plans for scaling or policy learning (Antikainen 2019). As such, despite the broad rhetoric of experimentation, few initiatives were implemented in ways that could generate robust evidence. Beyond the basic income experiment, we are aware of only two other randomised field experiments conducted during this period in collaboration with government agencies: (i) an experiment with the Finnish Tax Administration that first identified suspected landlords using administrative register data and then randomised them into receiving informational letters about rental income taxation and enforcement (Eerola et al. 2025), and (ii) a trial of a new approach to immigrant integration services with the Ministry of Economic Affairs and Employment (Karinen et al. 2024; Pesola et al. 2025). Both were implemented independently of the Prime Minister’s Office and the experimentation agenda, but were in line with the increasing openness to randomised experiments.
The subsequent Marin government made less fuss but continued to develop government-sponsored RCTs. Most notably, it launched the Two-Year Pre-primary Education Trial that is the subject of this paper. Other examples include a collaboration with the Ministry of Economic Affairs and Employment, in which randomly selected entrepreneurs received a subsidy for hiring their first employee (Einiö and Nivala 2025), and a collaboration with the Prime Minister’s Office and the Ministry of Justice, in which text messages were sent to randomly selected young voters to encourage electoral participation (Hirvonen et al. 2024).
Government-sponsored randomised trials have continued under Prime Minister Petteri Orpo’s government. A recent example is an experiment conducted in collaboration with the Ministry of Social Affairs and Health, in which access to a digital nurse reception and chat service was randomised among roughly 170,000 residents to evaluate how such services affect the use of primary and emergency care (Haaga et al. 2025). Similarly, the City of Helsinki launched a trial in which second-grade classrooms were randomly assigned to receive an additional teacher for four weekly Finnish lessons (Holvio et al. 2024).

3 Setting up the two-year preschool experiment

The origins of the two-year preschool experiment trace back to the negotiations to form a government after the general election in spring 2019. In the run-up to the elections, extending preschool featured in the platforms of most major political parties. The elections resulted in a left-leaning coalition that adopted an ambitious agenda for educational reform. Its main initiative was to extend compulsory schooling from 16 to 18, while the plan to expand preschool was reduced to a brief line in the government programme stating that the government would “pilot two-year preschool.” Although this reference might have seemed minor at the time, it provided the political mandate that ultimately enabled a nationwide randomised experiment.

3.1 From mandate to proposal

Once the new government was formed, preparations for implementing its programme began. As part of this process, the Ministry of Education and Culture invited representatives from the Aalto Economic Institute in February 2020 to discuss how the pilot could be designed and evaluated in practice. These initial meetings developed into a series of detailed discussions between researchers and civil servants. The ministry subsequently requested a formal research proposal, which we submitted as Izadi et al. (2020).
The proposal set out a preliminary research design, a power calculation, and an initial plan for measuring children’s skills. It also explained why strict compliance was neither necessary nor desirable. We aimed to avoid situations where participation would clearly conflict with a child’s best interests and to allow reasonable flexibility when logistical considerations or parents’ beliefs about the suitability of two-year preschool strongly conflicted with the results of the randomisation. We also stressed that non-compliance reduces the statistical power and representativeness, underscoring the importance of making compliance as easy and pleasant as possible.
We further noted that the standard way of conducting pilots at the ministry, based on voluntary participation, detracts from the credibility of causal inference. By contrast, random assignment would ensure comparability between treatment and control groups and provide a sound basis for evaluating the policy’s effects. To maximise statistical power, we proposed assignment at the centre level rather than at the municipality level.
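The power argument for centre-level assignment can be illustrated with the textbook design-effect formula for cluster-randomised trials. The sketch below is not the power calculation from the proposal. The cluster counts, cluster sizes, and intra-cluster correlation are hypothetical round numbers, and using the same correlation at both levels is a simplifying assumption.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_clusters, cluster_size, icc,
                              treated_share=0.5, alpha=0.05, power=0.8):
    """Minimum detectable effect, in standard-deviation units, when whole
    clusters of equal size are randomised (two-sided test)."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Design effect: variance inflation from randomising clusters
    # rather than individual children.
    design_effect = 1 + (cluster_size - 1) * icc
    n_children = n_clusters * cluster_size
    variance = design_effect / (treated_share * (1 - treated_share) * n_children)
    return multiplier * sqrt(variance)


# The same hypothetical 36,000 children, grouped in two different ways.
mde_centres = minimum_detectable_effect(n_clusters=900, cluster_size=40, icc=0.05)
mde_municipalities = minimum_detectable_effect(n_clusters=150, cluster_size=240, icc=0.05)
print(f"centre-level assignment:       {mde_centres:.3f} SD")
print(f"municipality-level assignment: {mde_municipalities:.3f} SD")
```

With the number of children held fixed, many small clusters give a smaller design effect than a few large ones, so the detectable effect shrinks when assignment moves from municipalities to centres.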

3.2 From proposal to legislation

Following the research proposal, the Ministry of Education and Culture, with assistance from the research team, prepared a government bill for a fixed-term act to enable the experiment. The draft was circulated for public comments in summer 2020 via the standard consultation portal and received 56 written statements. The government then submitted its proposal (HE 149/2020) to Parliament, introducing a temporary act and minor amendments to the Early Childhood Education Act. The bill specified the randomisation protocol, defined the 2016–2017 birth cohorts as the target group, and authorised the collection and linkage of evaluation and administrative data for research purposes.
The proposal proceeded through the ordinary committee process. The Constitutional Law Committee (PeVL 37/2020) concluded that the experiment was compatible with the constitutional provisions on equality and the right to education because participation was voluntary, the intervention was time-limited, all children retained access to standard one-year preschool, and the differential treatment rested on an objective and acceptable reason, i.e., producing evidence to inform future policy. The Committee also emphasised the need for clear statutory definitions of participants’ rights and obligations and for adequate data-protection provisions.
The Parliament’s Education and Culture Committee (SiVM 11/2020) similarly endorsed the bill with minor technical amendments. Its report highlighted the value of evaluating major education reforms before permanent adoption, supported randomisation, and stressed the importance of explicit provisions enabling rigorous evaluation alongside safeguards for personal data. Finally, in the plenary debate (PTK 154/2020), members expressed broad support across party lines, with the only critical remarks coming from the Finns Party. Parliament adopted the law, and the Temporary Act 1046/2020 entered into force on 23 December 2020.

4 Experimental design

This section describes the design and implementation of the experiment and explains why randomised controlled trials offer the most reliable basis for assessing policy impacts. We also discuss several challenges we encountered and how we addressed them.

4.1 Rationale for randomisation

The primary aim in designing the experiment was to assess how implementing legislation that would extend preschool to two years in Finland would affect children’s development. In practice, this required comparing children who attended two years of preschool with similar children who remained in the current system of regular daycare (or home care) and then entered the existing one-year preschool programme. Without an experimental design, such comparisons would very likely be biased, as families who voluntarily choose longer preschool are likely to differ systematically from those who do not.
Random assignment solves this problem. By randomly allocating children to either the new two-year preschool or the standard one-year system, the trial ensured that, on average, the groups were equivalent in all relevant respects before the intervention. Any subsequent differences in outcomes can, therefore, be attributed to the policy itself rather than to pre-existing differences between families.
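The argument can be stated compactly in potential-outcomes notation. The notation is ours and is introduced only for illustration: $Y_i(1)$ and $Y_i(0)$ denote child $i$'s outcome with and without two-year preschool, and $D_i$ indicates attendance.

```latex
\begin{align*}
\underbrace{E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]}_{\text{observed difference}}
  &= \underbrace{E[Y_i(1) - Y_i(0) \mid D_i = 1]}_{\text{effect on the treated}} \\
  &\quad + \underbrace{E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\text{selection bias}}
\end{align*}
```

When families choose for themselves, the selection-bias term is generally non-zero. When $D_i$ is assigned by lottery, it is independent of $(Y_i(0), Y_i(1))$, so the selection-bias term vanishes and the observed difference identifies the average effect of the policy.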

4.2 Randomisation design

The goal was to build a credible and administratively feasible research design while minimising the burden on municipalities and families. Treatment assignment needed to be implemented in a way that ensured all children were treated fairly and lawfully. The final design was a randomised field experiment in which municipalities or centres were selected by lottery.

4.2.1 Practical constraints that shaped the design

The research design had to satisfy several practical and institutional requirements:
  1. The intervention could only be implemented at the centre level, but each child's treatment status needed to be defined. Parents had to be personally informed whether their child was assigned to the treatment or control group and asked to enrol their child accordingly.
  2. The target population encompassed all children born in 2016 and 2017, including those not enrolled in centre-based care.
  3. Only centres offering both preschool and early childhood education were deemed eligible so that children could remain in the same centre throughout the day. Arrangements in which preschool was provided in schools were, therefore, excluded from the experiment.
  4. The design needed to accommodate municipalities of all sizes, including those with only a few centres.

4.2.2 Solutions to 1–3

Every child had to be linked to a specific early childhood education centre before randomisation so that treatment status could be determined unambiguously. For enrolled children, the link was determined by their current centre. The remaining 15% were not yet enrolled but were of particular interest to policymakers, who sought to increase participation. To balance participation with capacity constraints, we first assigned each unenrolled child to their nearest centre based on straight-line distance. Municipalities then selected as many children as each centre could accommodate. Children not selected were excluded from the experimental sample, as were those assigned to ineligible centres. Randomisation was conducted at the centre level, and the treatment group comprised children assigned to treatment centres. Similarly, the control group consisted of children assigned to a control centre. Figure 1 illustrates this approach.
Figure 1. Experimental design in large municipalities
Panel A: Pre-randomisation  
Panel B: Post-randomisation
Note: Illustration of the experimental design in municipalities with five or more eligible daycare centres. Children are assigned to the centre they are enrolled in or, if not enrolled, to the centre closest to their home. Eligible daycare centres are randomised into treatment and control groups, and children are invited to participate.
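The nearest-centre step for unenrolled children can be sketched as follows. This is a minimal illustration rather than the code used in the experiment: the coordinates and identifiers are made up, and computing straight-line distance with the haversine formula is our assumption. The subsequent capacity-based selection by municipalities is not modelled.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def straight_line_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two points, in km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_centre(home, centres):
    """Return the id of the centre closest to `home`, a (lat, lon) pair."""
    return min(centres, key=lambda cid: straight_line_km(*home, *centres[cid]))


# Hypothetical centre locations and home addresses of unenrolled children.
centres = {"centre_a": (60.17, 24.94), "centre_b": (60.21, 25.08)}
children = {"child_1": (60.18, 24.95), "child_2": (60.20, 25.05)}

assignment = {child: nearest_centre(home, centres) for child, home in children.items()}
print(assignment)
```

Fixing the child-to-centre link before the lottery is what makes each child's treatment status well defined once centres are randomised.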

4.2.3 Solution to 4

Finnish municipalities vary greatly in size, and some have only a handful of eligible centres. Because participation in the treatment involved a fixed cost at the municipal level, it would have been unreasonable to ask municipalities to bear this cost if they had only a few treatment centres. At the same time, nationwide coverage was deemed important. To address this, we divided municipalities into two groups. In large municipalities with five or more eligible centres, randomisation occurred between centres within each municipality, as discussed above. Roughly 40% of centres were assigned to treatment and 60% to control. In small municipalities with one to four eligible centres, the entire municipality was randomised to treatment or control. Treatment municipalities implemented two-year preschool in all eligible centres, while control municipalities continued with the existing system. Panel (a) of Figure 2 shows the participating municipalities on the map. It also shows that many municipalities lacked eligible centres, which excluded them from the experiment. (These are municipalities where preschool takes place in primary schools.) Additionally, to stay within budget, some large municipalities had to be excluded from the trial. Panel (b) shows which randomly sampled large municipalities were included in the trial and which were excluded. To ensure broad regional coverage, sampling was stratified geographically by Regional State Administrative Agency (AVI) area and municipality size.
Figure 2. Sampling frame
Panel A: Municipalities with eligible daycare centres
Panel B: Large municipalities' sampling
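The two-tier assignment rule can be sketched as below. It is a simplified illustration: the municipality and centre names are hypothetical, the 40% treated share among small municipalities is our assumption (the text reports the share only for centres in large municipalities), and the sampling of large municipalities, geographic stratification, and re-randomisation are omitted.

```python
import random


def assign_treatment(centres_by_municipality, treated_share=0.4, seed=0):
    """Return {centre_id: True/False} under a two-tier lottery."""
    rng = random.Random(seed)
    status = {}
    small = []
    for municipality, centres in centres_by_municipality.items():
        if len(centres) >= 5:
            # Large municipality: randomise centres within the municipality.
            n_treated = round(treated_share * len(centres))
            treated = set(rng.sample(centres, n_treated))
            for centre in centres:
                status[centre] = centre in treated
        else:
            small.append(municipality)
    # Small municipalities: the whole municipality is the unit of assignment.
    n_treated_small = round(treated_share * len(small))
    treated_small = set(rng.sample(small, n_treated_small))
    for municipality in small:
        for centre in centres_by_municipality[municipality]:
            status[centre] = municipality in treated_small
    return status


municipalities = {
    "large_town": [f"large_{i}" for i in range(10)],
    "village_a": ["a_1", "a_2"],
    "village_b": ["b_1"],
    "village_c": ["c_1", "c_2", "c_3"],
    "village_d": ["d_1"],
    "village_e": ["e_1", "e_2", "e_3", "e_4"],
}
status = assign_treatment(municipalities)
print(status)
```

Because every centre in a small municipality shares one status, the fixed municipal cost of participating is incurred only where all eligible centres are treated.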
Lessons Learned: Making Better Use of Oversubscription
Many centres were capacity-constrained when admitting unenrolled children. In retrospect, rather than asking municipalities to select children from a pre-assigned nearest-centre list, we could have (i) collected only seat capacities from municipalities, and then (ii) randomised unenrolled children individually to fill those seats. Children who could not be accommodated would have formed a natural control group. This alternative would have increased effective sample size and statistical power without adding administrative burden.
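The alternative described in this box amounts to a seat lottery. A minimal sketch, with hypothetical centre names, applicant lists, and seat counts, might look like this. How unenrolled children would be matched to a centre before the lottery (for example by nearest centre) is left as an assumption.

```python
import random


def seat_lottery(applicants_by_centre, seats_by_centre, seed=0):
    """Randomly order applicants at each centre and offer the free seats.

    Returns two dicts keyed by centre: children offered a seat and
    children who could not be accommodated (a lottery-based control group).
    """
    rng = random.Random(seed)
    offered, not_offered = {}, {}
    for centre, applicants in applicants_by_centre.items():
        order = list(applicants)
        rng.shuffle(order)  # every applicant has the same chance of a seat
        seats = seats_by_centre.get(centre, 0)
        offered[centre] = order[:seats]
        not_offered[centre] = order[seats:]
    return offered, not_offered


applicants = {"centre_a": ["k1", "k2", "k3", "k4", "k5"], "centre_b": ["k6", "k7"]}
seats = {"centre_a": 2, "centre_b": 3}
offered, not_offered = seat_lottery(applicants, seats)
print(offered)
print(not_offered)
```

Municipalities would only need to report `seats`, and within each oversubscribed centre the comparison between offered and non-offered children would be experimental.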

4.3 Randomisation

Randomisation was based on administrative data from spring 2021 covering children born in 2016 and their families. To avoid unnecessary organisational disruption, centres randomised to treatment in 2021 (for the 2016 birth cohort) retained their status the following year when the 2017 cohort began the first year of two-year preschool. A foreseeable consequence was that some families of children born in 2017 could sort into treated centres. We anticipated this risk at the design stage and committed to addressing any resulting selection empirically in the analysis (see the balance results in Section 4.4).

4.3.1 Data

Constructing the data infrastructure required to implement the experiment was itself a major policy innovation. Finland’s renowned centralised administrative registers did not yet cover early childhood education, so creating a national sampling frame required building one from scratch. Working with the Ministry of Education and Culture, we compiled enrolment lists from every municipality and verified them locally. This collaboration effectively transformed a fragmented set of local records into Finland’s first national ECEC register. The temporary legislation authorising the experiment also allowed us to use children’s social security numbers directly for randomisation, which greatly facilitated the process of informing parents about their child’s treatment status. These personal identifiers also made it possible to link the enrolment data to the national population register, allowing us to assign nearby centres to unenrolled children and ensure complete coverage. The entire sampling process was carefully documented and archived in the National Archives for future replication and research use. 

4.3.2 Re-randomisation

We implemented a stratified re-randomisation procedure following Morgan and Rubin (2012) and Johansson et al. (2021). Specifically, we simulated one million potential treatment allocations, retained the 2,000 that achieved the best balance across key covariates, verified that these allocations were not deterministic, and randomly selected the final assignment from among them. The procedure served three purposes. First, it ensured good covariate balance between treatment and control centres with respect to characteristics observed at the time of randomisation. Second, it allowed us to calculate exact p-values, as we know all possible treatment assignments that could have been selected. Third, and most importantly, it enabled us to impose a strict budget constraint: because government reimbursements to municipalities depended on centre size and the expected enrolment of unenrolled children, we excluded allocations that would have exceeded the fixed €30 million budget.
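The logic of the procedure can be sketched in a few lines. This is a deliberately simplified illustration, not the implementation used in the experiment: it is unstratified, uses a single made-up covariate with the absolute difference in means as the balance criterion, and uses far fewer draws than the one million reported above. All names, costs, and the budget are hypothetical.

```python
import random


def imbalance(allocation, covariate):
    """Absolute treatment-control difference in the covariate mean."""
    treated = [covariate[c] for c, t in allocation.items() if t]
    control = [covariate[c] for c, t in allocation.items() if not t]
    return abs(sum(treated) / len(treated) - sum(control) / len(control))


def rerandomise(centres, covariate, cost, budget, n_treated,
                n_draws=10_000, n_keep=200, seed=0):
    """Draw allocations, drop those over budget, keep the best balanced,
    and pick the final allocation at random among those kept."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_draws):
        treated = set(rng.sample(centres, n_treated))
        if sum(cost[c] for c in treated) > budget:
            continue  # allocation would exceed the fixed budget
        allocation = {c: c in treated for c in centres}
        candidates.append((imbalance(allocation, covariate), allocation))
    if not candidates:
        raise ValueError("no allocation satisfies the budget constraint")
    candidates.sort(key=lambda pair: pair[0])
    acceptable = [allocation for _, allocation in candidates[:n_keep]]
    return rng.choice(acceptable), acceptable


def fixed_status_centres(acceptable):
    """Centres with the same status in every acceptable allocation.
    A non-empty result means assignment is deterministic for them."""
    return [c for c in acceptable[0] if len({a[c] for a in acceptable}) == 1]


data_rng = random.Random(1)
centres = [f"centre_{i}" for i in range(20)]
covariate = {c: data_rng.gauss(0, 1) for c in centres}  # hypothetical covariate
cost = {c: data_rng.randint(1, 5) for c in centres}     # hypothetical cost units
budget = 0.4 * sum(cost.values())

final, acceptable = rerandomise(centres, covariate, cost, budget, n_treated=8)
print(sum(final.values()), "treated centres;",
      "fixed-status centres:", fixed_status_centres(acceptable))
```

Keeping the full `acceptable` set matters beyond balance: it is the reference distribution against which exact p-values are later computed.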

4.4 Assessing randomisation balance

Random assignment should, in expectation, create treatment and control groups that are identical in average terms. In practice, chance variation can lead to small imbalances, which is why randomised evaluations typically begin by testing whether background characteristics are balanced between groups. In our case, the fact that centres were randomised only during the first year of the experiment provides an additional rationale for checking balance.
Table 1. Covariate balance by cohort

| | 2016 Cohort: Control | 2016 Cohort: Trt – Ctrl | 2017 Cohort: Control | 2017 Cohort: Trt – Ctrl |
|---|---|---|---|---|
| Mother’s earnings (€) | 23,253 | 167 | 23,617 | 992** |
| | (294) | (399) | (282) | (446) |
| Father’s earnings (€) | 43,372 | –14 | 44,064 | 928 |
| | (621) | (863) | (621) | (876) |
| Mother’s education (years) | 13.33 | 0.10 | 13.34 | 0.15** |
| | (0.05) | (0.07) | (0.05) | (0.07) |
| Father’s education (years) | 12.83 | 0.10 | 12.86 | 0.10 |
| | (0.05) | (0.07) | (0.05) | (0.07) |
| Single parent | 0.20 | 0.00 | 0.20 | 0.00 |
| | (0.01) | (0.01) | (0.01) | (0.01) |
| Number of siblings | 1.59 | –0.03 | 1.65 | –0.09** |
| | (0.04) | (0.04) | (0.04) | (0.04) |
| Girl | 0.49 | 0.00 | 0.48 | 0.01* |
| | (0.00) | (0.01) | (0.00) | (0.01) |
| Mother’s domestic language | 0.84 | 0.02* | 0.83 | 0.02** |
| | (0.01) | (0.01) | (0.01) | (0.01) |
| Father’s domestic language | 0.84 | 0.01 | 0.83 | 0.02** |
| | (0.01) | (0.01) | (0.01) | (0.01) |
| Child’s domestic language | 0.88 | 0.01 | 0.87 | 0.02** |
| | (0.01) | (0.01) | (0.01) | (0.01) |
| Immigrant parents | 0.13 | –0.01 | 0.14 | –0.02** |
| | (0.01) | (0.01) | (0.01) | (0.01) |
| Foreign-born child | 0.03 | 0.00 | 0.03 | 0.00 |
| | (0.00) | (0.00) | (0.00) | (0.00) |
| Immigration age (years) | 1.98 | –0.08 | 2.05 | 0.09 |
| | (0.07) | (0.11) | (0.06) | (0.11) |
| Exact Wald p-value | | | | |
| Separate regressions | | 0.409 | | 0.040 |
| Joint regression | | 0.562 | | 0.084 |
| Asymptotic Wald p-value | | 0.629 | | 0.073 |
| Sample size | 10,899 | 18,932 | 10,461 | 18,305 |
Note: Standard errors in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.
Table 1 reports the balance checks. For the 2016 cohort, randomisation produced highly comparable treatment and control groups. For the 2017 cohort, some modest but statistically significant differences emerged: children in the treatment group tended to come from slightly higher-income and more educated families and were more often native speakers of Finnish, Swedish, or Sámi. These minor deviations likely arose because treatment and control centres retained their status between cohorts. Families of children born in 2016 could not influence their assignment, as randomisation took place after their preschool enrolment decisions. In contrast, families of 2017-born children could, in principle, move their children to a treated or control centre before their children’s treatment status was determined.
Although such selection could threaten causal interpretation, its magnitude appears limited. Our research design also allows us to assess this possibility directly. Most importantly, we can compare estimates using data on the 2016 and 2017 birth cohorts separately. In addition, we can examine how sensitive the results are to the inclusion of observable characteristics in the specifications. In practice, these analyses suggest that potential selection bias in the 2017 cohort is unlikely to affect our conclusions.
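The exact p-values reported in Table 1 rest on the fact that the set of admissible treatment allocations is known. A minimal sketch of this type of randomisation test is given below. It is illustrative only: the test statistic (an absolute difference in means for a single covariate, rather than a Wald statistic), the toy data, and all names are assumptions, and the admissible set is simulated here rather than taken from the actual re-randomisation.

```python
import numpy as np


def exact_p_value(x, chosen, accepted):
    """Share of admissible allocations whose imbalance is at least as
    large as the imbalance under the realised allocation."""
    def stat(d):
        return abs(x[d].mean() - x[~d].mean())

    observed = stat(chosen)
    draws = np.array([stat(d) for d in accepted])
    return float((draws >= observed).mean())


rng = np.random.default_rng(0)
n_centres, n_alloc = 60, 2000
x = rng.normal(size=n_centres)  # toy centre-level covariate

# Stand-in for the set of admissible allocations (one row per allocation).
accepted = np.zeros((n_alloc, n_centres), dtype=bool)
for row in accepted:
    row[rng.choice(n_centres, size=n_centres // 2, replace=False)] = True

chosen = accepted[rng.integers(n_alloc)]  # realised assignment
p = exact_p_value(x, chosen, accepted)
print(p)
```

Because the realised allocation is itself a member of the admissible set, the p-value can never be smaller than one over the number of admissible allocations.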

5 Children’s skills assessments

5.1 Purpose and rationale

Once the research design was finalised and randomisation completed, the next challenge was to measure child development to examine how starting preschool a year earlier affected the foundations for learning and social participation at school entry. To answer this question credibly, the study needed precise, developmentally appropriate measures that could be administered at scale across almost a thousand early education centres. 
While economists could design the experiment and analyse the results, assessing five-year-olds required a different kind of expertise, and the research group was extended to include early childhood education specialists, developmental psychologists and learning analysts. This multidisciplinary collaboration was essential to ensure that the chosen instruments captured genuine developmental progress.
The assessment design sought to cover the full range of abilities within each age group. To follow children over time, each test wave became slightly more demanding, with overlapping items linking the results into a consistent developmental scale. In short, the assessments were designed to provide reliable evidence on how children learn and grow during the preschool years, and to do so in a way that could inform national policy decisions about early education.

5.2 Assessment design and timeline

Children’s skills were assessed three times between the ages of five and seven to capture both immediate and lasting effects of the two-year preschool programme. The assessments took place in the autumns of 2021–2024 and followed the progression of the two birth cohorts included in the experiment. Each wave used the same core domains, enabling consistent tracking of changes over time.
The first assessment served as a baseline. It was conducted when treatment-group children began preschool at age five and control-group children remained in daycare or at home. The second round took place a year later, when all treatment children had completed their first preschool year, and most control children had just entered the regular one-year preschool. The final assessment was carried out at school entry, when all children were seven years old and enrolled in first grade. Figure 3 shows the timeline.
Figure 3. RCT Timeline

5.3 Assessment at scale

Assessing more than 30,000 children across the country required a system that was both uniform and practical for everyday use. The solution was to rely on teachers to conduct the assessments and to use the ViLLE learning analytics platform developed by the Turku Research Institute for Learning Analytics at the University of Turku. The platform enabled standardised administration, automatic scoring, and secure data transfer, substantially reducing both workload and error compared with paper-based testing.
Figure 4. Data collection
Figure 4 summarises the data collection method. Teachers supervised the sessions in familiar classroom settings. In preschool, assessments were conducted one-on-one with a tablet, while in first grade, entire classes completed them during two sessions under teacher supervision using tablets/laptops and audio instructions from headphones. Each child received a personal login linked to their national identifier, enabling seamless linkage to register data for later analysis. To make participation easy, even for young children, the system featured a picture-based sign-in interface that allowed them to log in independently. 
Before testing, teachers completed short online training modules and received detailed written and video instructions. The platform provided immediate quality checks, time stamps, and error logs, ensuring consistent implementation across municipalities. Teachers recorded socioemotional ratings directly within the same system, making ViLLE a single digital hub for all parts of the assessment.
This digital infrastructure turned what could have been a logistical bottleneck into a strength. It enabled uniform, high-quality measurement across hundreds of centres while creating a permanent, linkable dataset that will support long-term follow-up studies.

5.4 What was measured

The assessments covered both socioemotional and academic skills using instruments developed or adapted in collaboration with Finnish and international experts. Table 2 lists all instruments and their sources.
Table 2. Assessment instruments

| Acronym | Source | Description / Notes |
|---|---|---|
| Social-Emotional Skills (Teacher Rated) | | |
| BSRS | Onatsu-Arvilommi and Nurmi (2000) | Measures task avoidance and engagement strategies. |
| MASCS | Merrell (1993) | Short version assessing prosociality; excluded in grade 1. |
| CBRS | Bronson et al. (1990) | Short version assessing self-regulation. |
| SDQ | Goodman (1997) | Five subscales (emotional symptoms, conduct, hyperactivity, peer problems, prosocial behaviour). |
| SSES | OECD (2021) | Measures curiosity and happiness (three items each). |
| Language Skills | | |
| PPVT | Dunn and Dunn (2007) | Picture vocabulary assessment. |
| ARMI | Lerkkanen et al. (2006) | Finnish early learning battery covering phonological awareness, letter knowledge, and reading skills. |
| Numeric Skills | | |
| | Project-specific | Counting numbers (two tasks at age 5, four at age 6); quantity comparison (45-second limit per task); quantity production (quantities 3, 7, 13, 21); arithmetic operations (four tasks at age 5, six at age 6). |
| General Cognitive Skills | | |
| WJ | Woodcock and Johnson (1977) | Spatial relations test with 3-minute time limit. |
Note: All measures were administered through an online platform. Teacher ratings use Likert-type scales. Project-specific measures were developed and validated for this study. Time limits are indicated where applicable. Acronyms: BSRS = Behavioural Strategy Rating Scale; MASCS = Multisource Assessment of Social Competence; CBRS = Child Behaviour Rating Scale; SDQ = Strengths and Difficulties Questionnaire; SSES = Survey on Social and Emotional Skills; PPVT = Peabody Picture Vocabulary Test; ARMI = Finnish early learning assessment battery; WJ = Woodcock–Johnson Spatial Relations Test.

5.4.1 Academic development

The academic assessments were designed to capture the skills that form the foundation of early learning: understanding quantities, manipulating numbers, recognising letters and words, and linking sounds to text. They drew on established Finnish early-learning batteries (ARMI) and international measures. Each task was piloted and calibrated to span the entire skill distribution. The most basic items asked children to identify letters or count small sets of objects, while more advanced ones required reading simple words or solving short arithmetic problems. As children grew older, the tasks became more challenging, with overlapping items anchoring results across waves to ensure comparability. Items were presented as short, game-like tasks that children could complete within 20–25 minutes. Automated scoring provided precise and comparable indicators for every child.

5.4.2 Socioemotional development

To capture children’s behavioural and emotional adjustment, teachers completed standardised rating scales covering self-regulation, task focus, empathy, co-operation, and emotional well-being. The instruments combined well-validated international measures from several sources (see Table 2). Teachers entered their responses directly into the ViLLE platform using each child’s individual account, ensuring consistency and secure linkage to academic outcomes.

5.5 Pre-analysis plan: Defining outcomes and analysis in advance

We filed a pre-analysis plan outlining the main research questions, outcome measures, and statistical methods in advance. The plan identified six primary outcomes: two for academic skills (numeracy and literacy) and four for socioemotional development (task skills, social skills, peer relations, and emotion regulation). Each outcome combined several underlying test items or teacher-rated scales into a single, interpretable measure. These definitions were agreed upon and registered before the first child was assessed at age six, ensuring that the outcome categories were fixed in advance rather than shaped by the data.
The value of this approach is to improve credibility. A pre-analysis plan does not alter the results, but it strengthens confidence that the findings reflect what the experiment was designed to test rather than what later appeared interesting. Our plan also specified the main estimation strategy, subgroup analyses, and robustness checks to be conducted. Pre-analysis plans are increasingly standard in experiments because they reduce the scope for selective reporting and make analytical decisions transparent. They serve both scientific and policy purposes by facilitating replication, guiding interpretation, and signalling that the evaluation followed a clearly articulated design rather than post hoc choices. 

5.6 Participation and data quality

Assessments covered all children born in 2016 and 2017 who were enrolled in participating treatment and control centres, with families retaining the right to opt out. Participation was high, with coverage rates ranging from 83% to 94% depending on the age group, cohort, and skill domain. Table 3 reports detailed figures by age and cohort. The median completion time per child was 22 minutes for the cognitive tasks and 8 minutes for the teacher evaluations. Children not enrolled in any centre at the time of data collection, primarily five-year-olds in home care, were not assessed. High completion rates were maintained through careful field coordination. A centralised call centre staffed by trained research assistants monitored real-time completion data from the ViLLE platform and followed up systematically with reminders to centres with unfinished assessments. This approach, combined with digital administration and standardised procedures, minimised missing data and ensured consistent quality across municipalities.
Attrition was modest and largely unrelated to treatment assignment. The main reasons for missing data were relocation outside participating municipalities or temporary absences during the testing window. From age six onward, when both treatment and control groups attended mandatory preschool or school, differential attrition effectively disappeared.
Table 3. Participation in Child Skill Assessments by Age, Cohort, and Domain

| Age group | Skill domain | 2016 cohort: N | 2016 cohort: Coverage (%) | 2017 cohort: N | 2017 cohort: Coverage (%) |
|---|---|---|---|---|---|
| Age 5 | Literacy | 16,401 | 91.3 | 15,197 | 84.7 |
| | Numeracy | 16,373 | 91.2 | 15,172 | 84.5 |
| | Socioemotional | 16,200 | 90.2 | 14,931 | 83.2 |
| Age 6 | Literacy | 19,854 | 87.6 | 19,283 | 87.5 |
| | Numeracy | 19,816 | 87.4 | 19,274 | 84.5 |
| | Socioemotional | 19,226 | 84.8 | 18,638 | 84.6 |
| Grade 1 (Age 7) | Literacy | 31,492 | 91.9 | 30,512 | 94.3 |
| | Numeracy | 30,324 | 88.5 | 29,361 | 90.7 |
| | Socioemotional | 29,983 | 87.5 | 29,120 | 90.0 |

5.7 Building a long-term research asset

Although the current analyses focus on short-term effects, the data were designed to serve as the foundation for a long-term research asset. Recognising the scale of the effort, both researchers and policy partners saw the project as an opportunity to build a national resource: a longitudinal dataset on early childhood development that combines rich skills assessments with administrative records. Importantly, the subsequent government incorporated this objective into its policy agenda and committed to continuing the skill assessments through the end of comprehensive education. As a result, the value of the data will extend far beyond the immediate experiment, enabling future research on how early skills shape later outcomes.
Lessons Learned: Putting the Field into Field Experiments
Running an experiment in real classrooms means stepping away from the desk. Members of our team visited daycare centres, watched the new structured morning sessions, and even helped run assessments during pilot rounds. These visits offered a glimpse of how the reform took shape beyond the spreadsheets and gave context to what the numbers would later show. We also held seminars and workshops with teachers, principals, and municipal coordinators across the Helsinki region to hear how the two-year preschool worked in practice.

6 Treatment

To interpret the estimated effects on children’s skills, it is essential to understand how the experiment altered the everyday experiences of children participating in the two-year preschool relative to the existing system. To document these changes, the research team conducted large-scale surveys and collected and analysed administrative records, as well as national and local curricula. As a result, we can describe the treatment in considerably greater detail than is typically possible in policy evaluations. We next discuss these findings.

6.1 Legislation

The intervention advanced the start of preschool by one year, so that children in the treatment group attended preschool at ages five and six instead of only at six. The Act on the Two-Year Preschool Trial (1046/2020) defined the target population, randomisation, and implementation rules. Participation was mandatory in principle: families were expected to ensure that their child achieved preschool learning goals, either by attending the assigned programme or by other means.
The new two-year preschool followed a national curriculum prepared by the Finnish National Agency for Education and implemented by municipalities. Preschool was free of charge for four hours each morning during the school year, with optional afternoon daycare subject to income-based fees. Children living more than five kilometres away were entitled to free transportation.

6.2 Curricular content and pedagogical approach

The two-year preschool curriculum built directly on Finland’s existing framework for six-year-olds. It remained play-based and child-centred, and in most respects closely resembled the early childhood education and care (ECEC) curriculum followed in the control group. The main goals were similar to those in ordinary daycare, but the two-year preschool curriculum placed somewhat greater emphasis on purposeful group activities led by preschool-qualified teachers. Daily sessions typically combined guided play with early literacy and numeracy tasks. Teachers were encouraged to integrate learning into play, and classroom-style teaching was virtually absent.
Responsibility for implementation followed Finland’s three-tier governance model. The National Agency for Education set national objectives; municipal education boards translated them into local curricula; and centres developed their own pedagogical plans. This structure promoted local ownership but also introduced variation across municipalities. Some centres emphasised academic readiness, others focused more on social interaction and emotional growth, reflecting differences in local priorities and staff expertise.

6.3 Changes in educational inputs

The experiment altered children’s learning environments in measurable ways. Survey and administrative data reveal systematic differences in teacher qualifications, group composition, and daily time use between treatment and control centres. Table 4 summarises the main contrasts.
Table 4. Changes in educational inputs: Treatment (preschool) vs. Control (daycare)

| | Treatment (Preschool) | Control (Daycare) | Difference (T – C) |
|---|---|---|---|
| A. Highest qualification of teacher(s) in group | | | |
| Bachelor’s or Master’s degree in ECEC (%) | 65.4 | 44.3 | +21.1 |
| Bachelor’s in social pedagogy/social services (%) | 25.1 | 40.5 | –15.4 |
| No tertiary degree in ECEC (%) | 9.5 | 15.2 | –5.7 |
| B. Staffing | | | |
| Number of teachers in group (avg.) | 1.32 | 1.23 | +0.09 |
| C. Child group’s age composition | | | |
| Only 5-year-olds (%) | 45.3 | 19.3 | +26.0 |
| 5–6-year-olds (%) | 21.6 | 13.9 | +7.7 |
| 5-year-olds and younger children (%) | 27.3 | 53.0 | –25.7 |
| 5–6-year-olds and younger (%) | 5.8 | 13.7 | –7.9 |
| D. Weekly time use | | | |
| Guided activities (hours/week) | 7.6 | 6.5 | +1.1 |
| Routines (hours/week) | 12.0 | 14.0 | –2.0 |
Note: Differences are expressed in percentage points (pp) or hours. Routines include dressing, meals, rest, and similar activities.
Teacher qualifications improved in treatment centres as a direct consequence of the reform. Owing to randomisation, there were no systematic differences between treatment and control centres prior to the experiment. During the experiment, however, treatment centres had about a six-percentage-point lower share of children taught by staff without any formal qualifications. The composition of qualified teachers also differed between treatment and control groups: children in the treatment group were 21 percentage points more likely to have a teacher with a university degree in early childhood education and 15 percentage points less likely to have a teacher with a degree in social pedagogy from a university of applied sciences. These differences reflected both new hires and the reshuffling of teachers within municipalities. Several municipalities reported reallocating their most experienced staff to treatment classrooms. This response also revealed the limits of national capacity, as only about half of municipalities were able to fill all posts with fully qualified teachers.
Group composition also changed. Treatment centres largely shifted from mixed-age daycare groups (ages three to five) to age-specific preschool groups, typically composed of five-year-olds only. Children were about 26 percentage points more likely to have only same-age peers, enabling a more uniform pace for activities and reducing the time spent assisting younger children. Instructional time increased modestly: teacher time-use surveys indicate roughly one additional guided session per week, or about a 15–20% rise, alongside a two-hour weekly reduction in routine activities such as dressing, eating, and rest. The overall length of the day remained unchanged.
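The time-use figures can be checked with simple arithmetic on the weekly hours reported in Table 4. The short sketch below is only a consistency check on the published numbers; the variable names are illustrative.

```python
# Weekly hours from Table 4 (treatment vs. control).
guided_treat, guided_ctrl = 7.6, 6.5
routine_treat, routine_ctrl = 12.0, 14.0

guided_diff = guided_treat - guided_ctrl        # additional guided hours per week
guided_pct = 100 * guided_diff / guided_ctrl    # relative increase in guided time
routine_diff = routine_treat - routine_ctrl     # change in routine hours per week
routine_pct = 100 * routine_diff / routine_ctrl

print(f"Guided activities: {guided_diff:+.1f} h/week ({guided_pct:+.1f}%)")
print(f"Routines: {routine_diff:+.1f} h/week ({routine_pct:+.1f}%)")
```

The relative increase in guided time implied by these figures falls within the 15–20% range cited above.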
Aside from these dimensions, most features of provision remained stable. Group sizes and child–teacher ratios did not change, daily schedules continued to alternate between play, meals, and rest, and centres relied on the same facilities and materials as before. In sum, the reform primarily reallocated existing resources toward more qualified teachers, more homogeneous age groups, and slightly more guided learning time. These adjustments should be understood as an integral part of the treatment. They also highlight that a nationwide transition to two-year preschool, if undertaken without a substantial expansion in the supply of qualified teachers, would likely also affect staffing and pedagogical conditions in groups serving younger children.

7 Effects on children’s skills

We now turn to the effects on children’s skills. This section provides a brief overview of the main findings, drawing on the final evaluation report published (in Finnish) by the Ministry of Education and Culture (Sarvimäki et al. 2026) and a companion academic paper. We refer readers interested in methodological details and a full presentation of the estimates to these companion papers.

7.1 Intention-to-treat estimates

The primary objective of the experiment was to assess whether a reform replacing voluntary early childhood education and care (ECEC) at age five with a first year of mandatory two-year preschool would better prepare children for primary education. The experiment was designed to approximate as closely as possible the overall effects of such a reform. The scale of the reform was central to this aim: in addition to providing statistical power, it allowed us to study outcomes under realistic conditions in which the estimates capture the combined effects of increased participation, changes in the content of ECEC, and the challenges associated with implementing the reform within the existing system. We begin with a treatment–control comparison that estimates intention-to-treat effects on children’s skills. The next subsections trace these differences to changes in participation, content, and implementation.
Table 5 reports estimates of the effects of assignment to the treatment group on indices of academic and socioemotional skills. At age six, children in the treatment group scored slightly higher in academic skills and slightly lower in socioemotional skills than those in the control group. Both estimates are individually marginally statistically significant, with adjusted p-values of 0.069 that account for multiple hypothesis testing across both outcomes. In addition, we reject the joint null hypothesis of no effect on either skill, with a p-value of 0.004. 
Taken at face value, the estimates suggest that in the short term, the reform may have involved a trade-off between academic and socioemotional skills. However, we caution against drawing strong conclusions regarding socioemotional outcomes at age six due to the challenges involved in quantifying them. Unlike academic skills, which were assessed using computer-based tests administered directly to children, socioemotional skills were rated by teachers. At age six, treated children were typically evaluated by teachers who had known them for substantially longer than control children, and teachers generally evaluated children from only one experimental condition, either treatment or control. In addition, teachers in treatment centres were directly involved in implementing the experiment, which may have shaped their perceptions or reporting in ways that are difficult to disentangle from true skill differences. As a result, observed differences may partly reflect differential familiarity or reference effects rather than underlying changes in socioemotional skills.
Table 5. Intention-to-treat estimates

| | Age 6: Academic | Age 6: Socioemotional | Age 7: Academic | Age 7: Socioemotional |
|---|---|---|---|---|
| Treatment group | 0.031 | -0.039 | -0.012 | 0.003 |
| | (0.014) | (0.020) | (0.014) | (0.016) |
| Cluster p-value | 0.032 | 0.051 | 0.382 | 0.846 |
| Exact p-value | 0.037 | 0.059 | 0.403 | 0.847 |
| Romano-Wolf p-value | 0.069 | 0.069 | 0.633 | 0.847 |
| Wald p-value (joint, by age) | 0.004 | | 0.666 | |
| N | 30,913 | 29,969 | 30,940 | 29,454 |
Note: This table reports estimates from regressions of children’s assessed skills on an indicator for assignment to the treatment group, parental socioeconomic and demographic characteristics, and block-by-cohort fixed effects. Standard errors (in parentheses) are clustered at the level of randomisation (municipality or administrative region). Exact p-values are based on permutation tests. Romano–Wolf adjusted p-values account for testing two outcomes. The Wald p-value tests the joint null hypothesis of no effect on either skill.
A more important observation is that the age-six estimates are statistically significant only because of the high statistical power of the experiment. The point estimate is 0.03 standard deviations for academic skills and -0.04 standard deviations for socioemotional skills. As a benchmark, effects below 0.05 standard deviations are typically classified as small in the education literature (Kraft 2020). Another way to interpret the magnitude is that an effect of 0.03 standard deviations in academic skills corresponds to roughly two weeks of typical developmental progress at this age.
Most importantly, by the start of primary education, the skills of children assigned to the treatment and control groups had evolved identically. At age seven, the point estimates for average effects are -0.012 and 0.003 standard deviations for academic and socioemotional skills, respectively, and neither estimate is statistically significant. This is not simply an absence of evidence: the estimates are sufficiently precise to conclude that any actual effect is minimal. For the academic skills index, the 95% confidence interval ranges from -0.04 to 0.02 standard deviations. For the socioemotional skills index, we can rule out effect sizes larger than 0.03 standard deviations in either direction.
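The age-seven intervals can be approximated from the point estimates and clustered standard errors reported in Table 5. The sketch below uses a normal approximation with a critical value of 1.96, which is an assumption made for illustration; the intervals in the underlying analysis may have been computed differently.

```python
def ci95(estimate, se, z=1.96):
    """Normal-approximation 95% confidence interval (illustrative)."""
    return estimate - z * se, estimate + z * se


# Age-seven point estimates and standard errors from Table 5.
academic = ci95(-0.012, 0.014)
socioemotional = ci95(0.003, 0.016)

print("Academic:       [%.2f, %.2f]" % academic)
print("Socioemotional: [%.2f, %.2f]" % socioemotional)
```

Rounded to two decimals, the academic interval runs from -0.04 to 0.02 and the socioemotional interval from -0.03 to 0.03, in line with the ranges stated in the text.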
In our companion work, we also find no evidence that the experiment achieved its stated objective of levelling the playing field at the start of primary education. The intention-to-treat effects remain statistically insignificant, with small point estimates, across subgroups defined by family composition and parental income, education, country of birth, and employment status. We also examined directly whether two-year preschool compressed the distribution of skills. While there is modest evidence of compression at age six, this pattern does not persist to school entry. If anything, the variance of academic skills at the start of primary education is slightly larger in the treatment group than in the control group. In short, we show that extending preschool did not reduce disparities in skills at school entry.
Interestingly, our findings differ markedly from those of Rege et al. (2024), who study a broadly similar expansion of early education for five-year-olds in Norway. In that study, the estimated effect on a composite index combining all measured skills was 0.12 standard deviations at age six and 0.13 standard deviations at age seven. These estimates are substantially, and statistically significantly, larger than those observed in our setting. Several factors may account for these differences. First, the relevant counterfactual (standard provision for five-year-olds) may differ between Norway and Finland. Second, the interventions themselves were not identical: the Norwegian reform involved new curricula that appear to have been more detailed and prescriptive than those used in Finland’s two-year preschool trial. Third, the scale and mode of implementation differed sharply. The Norwegian experiment was conducted as a pilot involving 30 municipalities, of which 15 chose to participate, with randomisation across 71 units. By contrast, Finland’s experiment was implemented in 148 randomly selected municipalities and 956 centres. As a result, our estimates reflect outcomes under conditions in which challenges related to system-wide scaling are fully present, including implementation in municipalities and centres that would not have volunteered to participate. This distinction is important given recent discussions on how treatment effects identified in small-scale pilots may attenuate when policies are implemented at scale (e.g., List 2022).
Lessons Learned: When Null Results Are Informative
A common misconception in interpreting statistical results is that the absence of statistically significant effects implies that a policy has no impact. In many settings, this inference is unwarranted: small or underpowered studies with insignificant results often yield wide confidence intervals and therefore cannot rule out meaningful effects. In our case, however, the situation is different. The large scale of the experiment yields narrow confidence intervals, providing sufficient statistical power to rule out even modest effects. When designing future trials, investing in sample size pays off not only when effects are detected, but especially when tight confidence intervals allow policymakers to conclude that meaningful effects are unlikely.
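One way to make this point concrete is a textbook minimum detectable effect (MDE) calculation. The sketch below is an ex post illustration, not part of the original analysis: it assumes a two-sided 5% test with 80% power and plugs in the age-seven standard errors from Table 5.

```python
from statistics import NormalDist


def minimum_detectable_effect(se, alpha=0.05, power=0.80):
    """Textbook MDE: (z_{1 - alpha/2} + z_{power}) * standard error."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * se


# Age-seven standard errors from Table 5.
mde_academic = minimum_detectable_effect(0.014)
mde_socioemotional = minimum_detectable_effect(0.016)

print(f"MDE, academic skills:       {mde_academic:.3f} SD")
print(f"MDE, socioemotional skills: {mde_socioemotional:.3f} SD")
```

Under these assumptions, the design could detect effects of roughly 0.04 standard deviations, which is below the 0.05 threshold for small effects cited from Kraft (2020).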

7.2 Effect on childcare choices

The intention-to-treat estimates discussed above reflect both how the experiment altered participation across different forms of childcare and the relative effectiveness of two-year preschool compared with the existing alternatives. To better understand these findings, we first document how the experiment affected childcare arrangements and then discuss the implications for interpreting the effects on children’s skills. 
Figure 5. Childcare choices through ages 4-6. Treatment N = 15,272; Control N = 20,475 
Figure 5 shows that before the experiment, centre-based ECEC was already common: in the spring preceding assignment, about 87% of children in our sample were enrolled in ECEC, while the remaining 13% were in home care. The experiment strongly shifted children into two-year preschool. In the autumn, 85% of children assigned to the treatment group enrolled in two-year preschool, compared with only 1.5% of children assigned to the control group. Correspondingly, regular ECEC participation at age five was much more common in the control group (92.3%) than in the treatment group (10.7%). Importantly, the experiment also moved some children from home care into centre-based provision: 4.2% of children in the treatment group did not participate in ECEC at age five, compared with 7.7% in the control group. By age six, participation in preschool was almost universal in both groups.
These patterns are informative about who was affected by the reform and how. Roughly 84% of children were compliers (comparison 1 in Figure 5), meaning that their ECEC form was determined by the randomisation. Among these compliers, about 95% would otherwise have attended standard ECEC at age five, while only around 5% would have remained in home care. This distinction is central to interpreting the results: for the vast majority of children, two-year preschool replaced an already common and in many ways similar form of ECEC rather than a home environment.
Nevertheless, it is important to note that the experiment increased centre-based ECEC participation at age five by about 3.5 percentage points (comparison 2 in Figure 5) from a baseline of roughly 92%. This increase was concentrated among children with Finnish-born parents; we find no corresponding rise in participation among children with immigrant backgrounds. The participation response was also stronger in the second year of implementation: assignment increased age-five participation by about two percentage points for the 2016 cohort and five percentage points for the 2017 cohort. Despite this increase, we find no effects on parental employment.
Taken together, these participation patterns highlight that the experiment changed childcare arrangements in different ways for different children. For most participants, two-year preschool replaced standard ECEC at age five, while for a smaller minority, it replaced home care. This distinction is crucial for interpreting the effects on children’s skills: treatment effects depend not only on what the intervention provides, but also on what it replaces. In the next subsection, we therefore examine effects separately by counterfactual care environment, rather than relying solely on average treatment effects.

7.3 Counterfactual-specific estimates

Figure 6 reports counterfactual-specific treatment effects for the two indices discussed above, as well as for the six preregistered subcomponents of academic and socioemotional skills. When the counterfactual is standard ECEC, the results are clear and closely mirror the intention-to-treat estimates discussed above. At age seven, the point estimate for the academic skills index is -0.01 standard deviations, with a 95% confidence interval ranging from -0.05 to 0.03 standard deviations. The corresponding point estimate for the socioemotional skills index is -0.002, and the confidence interval rules out effect sizes larger than 0.04 standard deviations in either direction. This similarity to the intention-to-treat results is expected, given the high compliance with treatment assignment and the fact that, for most children, two-year preschool replaced standard ECEC rather than home care. In short, replacing standard ECEC with two-year preschool did not alter the pace of children’s skill development. 
The results are very different for the small group of children whose alternative to two-year preschool would have been home care at age five. For this group, the point estimates at age six suggest sizeable and positive effects on both academic and socioemotional skills. The point estimate for the academic skills index is 0.36 standard deviations, corresponding to roughly half a year of typical developmental progress. The corresponding point estimate for the socioemotional skills index is very similar at 0.35 standard deviations, driven primarily by improvements in peer relations and emotion regulation.
Figure 6. Treatment effects relative to daycare (Panel A) and home alternatives (Panel B) 
By the start of primary education, however, these short-term differences appear to fade. At age seven, the point estimates are -0.04 and 0.15 standard deviations for the indices of academic and socioemotional skills, respectively. These estimates are imprecise: relatively few children moved from home care to two-year preschool, so statistical power for this group is limited. The 95% confidence interval ranges from -0.36 to 0.28 standard deviations for the academic skills index and from -0.19 to 0.49 standard deviations for the socioemotional skills index. In short, the data do not allow firm conclusions about whether the sizeable initial gains for this group persist at school entry.
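The difference in precision between the two counterfactual margins can be made concrete by backing out the standard errors implied by the reported intervals. The sketch below assumes the intervals are symmetric normal-approximation intervals (estimate ± 1.96 × standard error), which is an assumption of this illustration and not a description of how the intervals were actually computed. The function name `implied_se` is hypothetical.

```python
# Implied standard errors from the reported 95% confidence intervals,
# assuming symmetric normal-approximation intervals (estimate +/- 1.96 x SE).
Z_95 = 1.96


def implied_se(ci_low: float, ci_high: float) -> float:
    """Standard error implied by a symmetric 95% confidence interval."""
    return (ci_high - ci_low) / (2 * Z_95)


# Age-seven intervals quoted in the text (standard deviation units)
intervals = {
    "academic, counterfactual standard ECEC": (-0.05, 0.03),
    "academic, counterfactual home care": (-0.36, 0.28),
    "socioemotional, counterfactual home care": (-0.19, 0.49),
}

standard_errors = {
    label: implied_se(low, high) for label, (low, high) in intervals.items()
}

for label, se in standard_errors.items():
    print(f"{label}: implied SE of about {se:.3f}")

# How much noisier the home-care estimate is than the standard-ECEC estimate
precision_ratio = (
    standard_errors["academic, counterfactual home care"]
    / standard_errors["academic, counterfactual standard ECEC"]
)
print(f"Home-care SE is about {precision_ratio:.0f} times the standard-ECEC SE")
```

Under this assumption, the implied standard error for the academic skills index is roughly eight times larger on the home-care margin than on the standard-ECEC margin, which is why the former cannot distinguish fade-out from persistence.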
Lessons Learned: Counterfactuals Matter
Our results illustrate a general principle: the effect of an intervention depends critically on what it replaces. Two-year preschool produced large, short-term gains for children who would otherwise have been in home care, but essentially zero effects for children who would have attended standard daycare. Since 95% of the affected children fell into the latter category, the overall effect was close to zero. Future evaluations of early childhood programmes should anticipate this pattern and, where possible, design studies that can separately identify effects for different counterfactual margins.
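The arithmetic behind this statement can be sketched as a weighted average. Assuming that the average effect among compliers decomposes into counterfactual-specific effects weighted by the complier shares (the symbols \tau and \omega are notation introduced for this illustration only), the age-seven point estimates for the academic skills index reported above give approximately:

```latex
\[
\tau_{\text{compliers}}
  = \omega_{\text{ECEC}}\,\tau_{\text{ECEC}} + \omega_{\text{home}}\,\tau_{\text{home}}
  \approx 0.95 \times (-0.01) + 0.05 \times (-0.04)
  \approx -0.01
\]
```

By the same logic, even the age-six point estimate of 0.36 standard deviations on the home-care margin would contribute only about 0.05 × 0.36 ≈ 0.02 standard deviations to the average.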

8 Discussion

This paper discussed the practical implementation and the key results of Finland’s nationwide randomised experiment on extending preschool from one to two years. The reform increased participation in formal early education at age five, moving some children from home care into centre-based provision. For these children, participating in two-year preschool had large positive short-term effects on both academic and socioemotional skills, although we cannot determine whether these gains persist into primary school. However, the potential of this channel is limited by the already high baseline enrolment rate. For most children, two-year preschool replaced standard daycare. For this large majority, extending preschool did not affect academic or socioemotional skills at school entry. Importantly, the scale of the experiment allows us to rule out even modest policy-relevant impacts for this group with confidence.
Beyond the specific question of preschool duration, this experiment provides a concrete example of how ambitious, system-wide reforms can be evaluated rigorously before permanent adoption. Randomised field experiments are not feasible or appropriate for every policy decision. However, this case illustrates that large-scale randomisation can be both feasible and informative even in complex policy environments such as early childhood education, in which implementation spans hundreds of centres and tens of thousands of children.
The aim of this paper was to support the design and implementation of similar efforts in the future. In our view, the main practical lessons from Finland’s two-year preschool experiment are the following. First, the experiment required complementary expertise from multiple disciplines. Economists devised the research design and conducted most of the statistical analysis; educational scientists and psychologists developed the assessments, surveys, and documentation of pedagogical change; learning analytics experts implemented the large-scale testing infrastructure; and civil servants contributed essential legal, administrative, and institutional knowledge. This division of labour was critical for ensuring that the experiment was both policy-relevant and reliable. 
Second, the experiment underscored the importance of close collaboration with civil servants and teachers. This co-operation went beyond securing formal consent or compliance. The trial succeeded because researchers and civil servants jointly developed the design, and municipalities and teachers implemented the reform under real operational constraints. This shared ownership supported high participation and realistic implementation at hundreds of centres. Communication was itself a key input into design quality: clear explanations of randomisation to parents and staff, transparency about participation requirements, and tolerance for limited non-compliance when children’s perceived interests or logistical constraints required it helped preserve trust while maintaining sufficient compliance for informative estimates.
Third, this trial illustrates what statistical power and careful design can buy. Many interventions are evaluated with samples too small to distinguish “no effect” from “the evidence is inconclusive”, which leaves decision-makers free to project their prior beliefs onto noisy results. In contrast, the scale of this experiment delivered sufficiently tight confidence intervals to rule out effects that would be meaningful for the vast majority of children. The design also highlights the importance of targeting the right estimands: because treatment effects depend on what the intervention replaces, our counterfactual-specific estimates show that the relevant policy question is not whether preschool works in the abstract, but relative to which alternative and for which children. For countries considering similar reforms, the key implication is to design experiments that distinguish impacts relative to the main counterfactuals faced by the affected population.
We believe that these features have also been important for how the results have been received. Before the trial, extending preschool to two years enjoyed broad political support in Finland and featured in the campaign platforms of most major political parties. Parents and teachers likewise reported largely positive experiences with the two-year preschool during the experiment. Against this backdrop, the absence of improvements in children’s skills relative to the existing system was probably an unwelcome surprise to many. Although the results have been public for only a month at the time of writing, early indications suggest that they are already shaping the policy debate. Importantly, the dominant response has not been to dismiss the findings, but rather to acknowledge the value of having conducted the experiment and of subjecting even popular and well-received reforms to rigorous evaluation.
Finally, we emphasise that large-scale experiments can strengthen data infrastructure even in countries like Finland, where researchers already have access to exceptionally rich administrative data. As a by-product of the trial, we developed detailed measures of children’s skills and new information on educational inputs that had not previously been systematically recorded. The value of these data has been recognised by the current government, which has committed to continuing the assessments through primary and lower secondary school. This will transform the experiment into the foundation of a new cohort study that can be maintained at relatively low marginal cost. As in other Nordic countries, the ability to link these new data to existing administrative registers also makes it possible to track children over the course of their lives. Such long-term research assets will inform not only the reform studied in this article, but a broad range of future educational and social policy questions.

References

Abdulkadiroğlu, A., Angrist, J. D., Hull, P. D., & Pathak, P. A. (2016). Charters without lotteries: Testing takeovers in New Orleans and Boston. American Economic Review, 106(7), 1878–1920. Retrieved from: https://doi.org/10.1257/aer.20150479
Antikainen, R. (2019). Kokeilukulttuuri Suomessa – nykytilanne ja kehittämistarpeet (Report 2/2019). Valtioneuvoston kanslia. Retrieved from: https://julkaisut.valtioneuvosto.fi/bitstream/handle/10024/161281/2-2019-KOKSU_raportti_.pdf
Bronson, M. B., Goodson, B. D., Layzer, J. I., & Love, J. M. (1990). Child behavior rating scale. Abt Associates.
De Wispelaere, J., Halmetoja, A., & Pulkka, V.-V. (2018). The rise (and fall) of the basic income experiment in Finland. CESifo Forum, 19, 15–19.
Dunn, L. M., & Dunn, D. M. (2007). Peabody Picture Vocabulary Test–Fourth Edition (PPVT-4). Pearson.
Eerola, E., Kosonen, T., Kotakorpi, K., & Lyytikäinen, T. (2025). Tax compliance in the rental housing market: Evidence from a field experiment. American Economic Journal: Economic Policy.
Einiö, E., & Nivala, A. (2025). Rekrytointitukikokeilun loppuarviointi. Retrieved from: https://julkaisut.valtioneuvosto.fi/bitstream/handle/10024/166364/TEM_2025_28.pdf
Goodman, R. (1997). The strengths and difficulties questionnaire: A research note. Journal of Child Psychology and Psychiatry, 38, 581–586.
Haaga, T., Kortelainen, M., Nokso-Koivisto, O., Saxell, T., Seppä, M., & Sääksvuori, L. (2025). Ostrobothnia digital clinic experiment. ClinicalTrials.gov. U.S. National Library of Medicine. Retrieved from: https://clinicaltrials.gov/study/NCT06904469
Hämäläinen, K., Simanainen, M., & Verho, J. (2025). Health effects of cash transfers: Evidence from the Finnish basic income experiment. Journal of Public Economics, 250, 105480. Retrieved from: https://doi.org/10.1016/j.jpubeco.2025.105480
Hämäläinen, K., & Uusitalo, R. (2005). Kannattaisi kokeilla: Kokeelliset menetelmät työvoimapoliittisten toimenpiteiden vaikutusten arvioinnissa. Unpublished manuscript.
Hämäläinen, K., Uusitalo, R., & Vuori, J. (2008). Varying biases in matching estimates: Evidence from two randomized job search training experiments. Labour Economics, 15(4), 604–618. Retrieved from:  https://doi.org/10.1016/j.labeco.2008.04.009
Hämäläinen, K., & Verho, J. (2022). Design and evaluation of the Finnish basic income experiment. National Tax Journal, 75(3), 573–596. Retrieved from: https://doi.org/10.1086/720737
Harju, J., Kosonen, T., & Slemrod, J. (2020). Missing miles: Evasion responses to car taxes. Journal of Public Economics, 181, 104108. Retrieved from: https://doi.org/10.1016/j.jpubeco.2019.104108
Hirvonen, S., Lassander, M., Sääksvuori, L., & Tukiainen, J. (2024). Mobilizing young voters with short text messages in nationwide field experiments. Retrieved from: https://ace-economics.fi/kuvat/dp166.pdf
Holvio, A., Karhunen, H., Lerkkanen, M.-K., Nokso-Koivisto, O., Sarvimäki, M., & Torppa, M. (2024). Reading gains through teacher pairs: A co-teaching experiment. Open Science Framework (OSF). Retrieved from: https://osf.io/es3ng/overview
Hull, P. (2018). Isolating: Identifying counterfactual-specific treatment effects with cross-stratum comparisons. SSRN Working Paper. Retrieved from: https://ssrn.com/abstract=2705108
Izadi, R., Luukkonen, E., Nokso-Koivisto, O., & Sarvimäki, M. (2020). Kaksivuotinen esiopetus -kokeilu: Tutkimussuunnitelma. Unpublished manuscript.
Johansson, P., Rubin, D. B., & Schultzberg, M. (2021). On optimal rerandomization designs. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83(2), 395–403.
Kangas, O., & Pulkka, V.-V. (2016). From idea to experiment: Report on universal basic income experiment in Finland.Kela Working Papers No. 106. The Social Insurance Institution of Finland (Kela). Retrieved from: https://helda.helsinki.fi/server/api/core/bitstreams/7cf55de2-261e-481a-9242-adbf7f032a5f/content.
Karinen, R., Koivula, N., Koivula, T., et al. (2024). Impacts and practical implementation of the Koto-SIB experiment.Retrieved from: https://julkaisut.valtioneuvosto.fi/handle/10024/166065
Kärnä, A., Voeten, M., Little, T. D., Poskiparta, E., Kaljonen, A., & Salmivalli, C. (2011). A large-scale evaluation of the KiVa anti-bullying program: Grades 4–6. Child Development, 82(1), 311–330. Retrieved from: https://doi.org/10.1111/j.1467-8624.2010.01557.x
Kosonen, T., & Ropponen, O. (2015). The role of information in tax compliance: Evidence from a natural field experiment. Economics Letters, 129, 18–21. Retrieved from: https://doi.org/10.1016/j.econlet.2015.01.026
Kraft, M. A. (2020). Interpreting effect sizes of education interventions. Educational Researcher, 49(4), 241–253.
Lerkkanen, M.-K., Poikkeus, A.-M., & Ketonen, R. (2006). ARMI: Luku- ja kirjoitustaidon arviointimateriaali 1. luokalle. Helsinki: WSOY.
List, J. A. (2022). The voltage effect: How to make good ideas great and great ideas scale. Crown Currency.
Merrell, K. W. (1993). Using behavior rating scales to assess social skills and antisocial behavior in school settings: Development of the School Social Behavior Scales. School Psychology Review, 22(1), 115–133.
Morgan, K. L., & Rubin, D. B. (2012). Rerandomization to improve covariate balance in experiments. The Annals of Statistics, 40(2), 1263–1282. Retrieved from:  https://doi.org/10.1214/12-AOS1008
OECD. (2021). Beyond academic learning: First results from the survey of social and emotional skills. Paris: OECD.
Onatsu-Arvilommi, T., & Nurmi, J. E. (2000). The role of task-avoidant and task-focused behaviors in the development of reading and mathematical skills during the first school year: A cross-lagged longitudinal study. Journal of Educational Psychology, 92, 478–491.
Pekkala Kerr, S., Pekkarinen, T., Sarvimäki, M., & Uusitalo, R. (2020). Post-secondary education and information on labor market prospects: A randomized field experiment. Labour Economics, 66, 101888. Retrieved from: https://doi.org/10.1016/j.labeco.2020.101888
Pesola, H., Sarvimäki, M., & Virkola, T. (2025). Randomization as an incentive device: Evidence from public procurement of immigrant integration services. Unpublished manuscript.
Rege, M., Størksen, I., Solli, I. F., et al. (2024). The effects of a structured curriculum on preschool effectiveness: A field experiment. Journal of Human Resources, 59(2), 576–603.
Salmivalli, C., & Poskiparta, E. (2012). KiVa anti-bullying program: Overview of evaluation studies based on a randomized controlled trial and national rollout in Finland. International Journal of Conflict and Violence, 6(2), 294–302. Retrieved from: https://doi.org/10.4119/ijcv-2920
Sarvimäki, M., Holvio, A., Kuusiholma-Linnamäki, J., et al. (2026). Kaksivuotisen esiopetuksen kokeilu: Loppuraportti.Unpublished manuscript.
Tuulio-Henriksson, A., Simanainen, M., & Simanainen, M. (2020). Koettu terveys, psyykkinen hyvinvointi ja kognitiivinen toimintakyky. In O. Kangas, S. Jauhiainen, M. Simanainen, & M. Ylikännö (Eds.), Suomen perustulokokeilun arviointi (Reports and Memorandums of the Ministry of Social Affairs and Health 2020:15). Ministry of Social Affairs and Health. Retrieved from: https://julkaisut.valtioneuvosto.fi/bitstream/handle/10024/162219/STM_2020_15_rap.pdf
Verho, J., Hämäläinen, K., & Kanninen, O. (2022). Removing welfare traps: Employment responses in the Finnish basic income experiment. American Economic Journal: Economic Policy, 14(1), 501–522. Retrieved from:  https://doi.org/10.1257/pol.20200143
Vuori, J., Silvonen, J., Vinokur, A. D., & Price, R. H. (2002). The Työhön job search program in Finland: Benefits for the unemployed with risk of depression or discouragement. Journal of Occupational Health Psychology, 7(1), 5–19. Retrieved from: https://doi.org/10.1037/1076-8998.7.1.5
Woodcock, R. W., & Johnson, M. B. (1977). Woodcock-Johnson psycho-educational battery. Riverside Publishing.