
Data Science for Education

Educational institutions and learning processes generate rich data, and they concern problems of great importance to society and the social good, so education is an especially well-suited domain for data science.

Educational data spans K-12 school and district records, digital archives of instructional materials and gradebooks, and student responses on course surveys. Data science of actual classroom interaction is also a growing interest and an emerging reality: there one can capture how classroom management and instruction are accomplished. As video and voice recordings grow more prevalent, they may become a prime data source to analyze via novel computational means. The richness of educational data extends to higher education, where an increasing number of online courses are being offered. It even extends to the private sector, where online forums, threads, and distributed forms of problem-solving are used to educate employees and resolve task problems.

All these new data sources are replete with information on communication (text), relations (links), and accruing behavioral profiles (careers). All of this information can be mined and analyzed in an effort to understand and solve persistent educational problems.

  • Educational data science would train educators, broadly conceived, to study these forms of educational data, to make sense of educational systems and their problems, and to develop empirically grounded solutions.
  • Educational data science would empower educators to perform data visualization, data reduction and description, and prediction tasks.
  • Data visualization can render information more intuitive and immediately digestible for practitioners.
  • Data reduction can be used to make sense of many complex records and fields of data on students (e.g., grade books and assignments).

Even basic descriptions of the educational system, such as tracking structures and key turning points in careers, also fit here. All are feasible with data science applied to school records.

A variety of educational problems suggest potential modeling and prediction tasks (for instance, using random forests). Examples include: student attrition and dropout, and student attendance; detentions, office referrals, and arrests; learning delays, progression problems, and failure; assignment completion; bias and prejudice in grading; and the diversification bonus.
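To make the idea of a prediction task concrete, here is a minimal sketch of a dropout classifier. It is not a full random forest: it uses a bagged ensemble of one-split decision stumps, a deliberately simplified stand-in that needs only the Python standard library. The student records are synthetic, and the feature names (`attendance`, `gpa`), the 0.8 attendance cutoff, and the 10% label noise are all assumptions made up for illustration, not findings.

```python
import random

def make_synthetic_students(n, rng):
    """Return made-up records: ((attendance_rate, gpa), dropped_out)."""
    rows = []
    for _ in range(n):
        attendance = rng.uniform(0.5, 1.0)
        gpa = rng.uniform(1.0, 4.0)
        label = 1 if attendance < 0.8 else 0  # hypothetical rule, not a finding
        if rng.random() < 0.1:                # flip 10% of labels as noise
            label = 1 - label
        rows.append(((attendance, gpa), label))
    return rows

def fit_stump(rows):
    """Find the (feature, threshold, label_below) split with the fewest errors."""
    best = None
    for feature in range(len(rows[0][0])):
        for x, _ in rows:
            threshold = x[feature]
            for label_below in (0, 1):
                errors = sum(
                    (label_below if xi[feature] <= threshold else 1 - label_below) != yi
                    for xi, yi in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, label_below)
    return best[1:]

def predict_stump(stump, x):
    """Apply one stump: label_below at or under the threshold, the opposite above."""
    feature, threshold, label_below = stump
    return label_below if x[feature] <= threshold else 1 - label_below

def fit_bagged_stumps(rows, n_stumps, rng):
    """Train each stump on a bootstrap resample of the training rows."""
    return [fit_stump(rng.choices(rows, k=len(rows))) for _ in range(n_stumps)]

def predict(ensemble, x):
    """Majority vote across stumps (use an odd n_stumps to avoid ties)."""
    votes = sum(predict_stump(stump, x) for stump in ensemble)
    return 1 if 2 * votes > len(ensemble) else 0

def accuracy(ensemble, rows):
    """Fraction of rows whose predicted label matches the recorded label."""
    return sum(predict(ensemble, x) == y for x, y in rows) / len(rows)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
train = make_synthetic_students(200, rng)
held_out = make_synthetic_students(100, rng)
ensemble = fit_bagged_stumps(train, n_stumps=15, rng=rng)
print("held-out accuracy:", accuracy(ensemble, held_out))
```

With real school records, one would likely replace the stumps with deeper trees and per-split feature sampling (the ingredients of an actual random forest) and evaluate on students from a later cohort rather than a random held-out sample.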

Looking forward, there are a number of areas where the combination of data science and education is particularly promising.

Psychology, education research, and the learning sciences often have nuanced theories of human cognition and learning that yield general guidelines on the types of pedagogical activities that are effective. However, such guidelines do not always translate into concrete, implementable strategies. Data science applied to prior data (e.g., using counterfactual reasoning methods) can be used to parameterize generic guidelines into grounded strategies (Chi et al. 2011; Khajah et al. 2016). As a hypothetical but illustrative example, if blocked practice on a particular skill is thought to be beneficial, data science can help determine, for a particular school's seventh grade population, how many practice opportunities on this skill are most effective.
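The seventh grade example can be sketched in a few lines. The code below takes prior logs of how many blocked practices each student completed and their post-test score, then picks the practice count with the highest average score. Everything here is an assumption for illustration: the function name `best_practice_count`, the `min_students` cutoff, and the log values are hypothetical. It is also a naive descriptive comparison, not the counterfactual or reinforcement learning methods used in the cited work.

```python
from collections import defaultdict
from statistics import mean

def best_practice_count(logs, min_students=3):
    """Return (practice_count, mean_score) with the highest mean post-test score.

    Practice counts observed for fewer than `min_students` students are ignored,
    since an average over one or two students is too noisy to act on.
    """
    scores_by_count = defaultdict(list)
    for n_practices, score in logs:
        scores_by_count[n_practices].append(score)
    candidates = {
        n: mean(scores)
        for n, scores in scores_by_count.items()
        if len(scores) >= min_students
    }
    if not candidates:
        raise ValueError("no practice count has enough students")
    best = max(candidates, key=candidates.get)
    return best, candidates[best]

# Hypothetical logs: (number of blocked practices, post-test score out of 100).
logs = [
    (2, 55), (2, 60), (2, 65),
    (4, 70), (4, 75), (4, 80),
    (6, 72), (6, 70), (6, 74),
    (8, 90),  # only one student, so excluded by min_students
]
print(best_practice_count(logs))
```

On these made-up logs the sketch selects four practices, because the single high score at eight practices is filtered out. In observational data, students who practice more may differ systematically from those who practice less, which is why the counterfactual reasoning methods mentioned above matter for a real analysis.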

Rich contextualization of education (what works for whom, when, and where) is also an exciting possibility. Much educational and learning sciences research takes place in laboratory settings or is conducted in a limited set of schools, and findings from such studies sometimes fail to replicate in other environments. Large-scale data collection across MOOCs and classrooms could help us uncover the key features that correlate with, and those that cause, the effectiveness of an educational intervention. As has long been recognized, online platforms also make it easier to offer personalized interactions to students as we identify which factors matter in pedagogical activities.

An interesting challenge on the data science methodology side is that many new educational platforms offer mixed autonomy. The learner may have a large amount of flexibility in how she proceeds through the course, and yet the potential for the teaching system to provide recommendations or guide the student's learning path is considerable. Creating systems that effectively balance learner choice with system guidance, a balance that may also significantly affect student motivation, is an important open issue.

Chi, M., VanLehn, K., Litman, D., & Jordan, P. (2011). Empirically evaluating the application of reinforcement learning to the induction of effective and adaptive pedagogical tactics. User Modeling and User-Adapted Interaction (UMUAI), 21(1-2), pp. 137-180.

Khajah, M., Roads, B. D., Lindsey, R. V., Liu, Y.-E., & Mozer, M. C. (2016). Designing engaging games using Bayesian optimization. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (pp. 5571-5582). New York: ACM.


Emma Brunskill
Computer Science
Dan McFarland