Flagship Projects

Privacy Preserving Internet of Things - Analytics for Human Behavior Interventions

Philip Levis, Noah Diffenbaugh, Dan Boneh, Mark Horowitz

The high-level, long-term goal is to research how to use the Internet of Things to collect data on human behavior in a manner that preserves privacy but provides sufficient information to allow interventions which modify that behavior.

Mapping the “Social Genome”

Jure Leskovec, Michael Bernstein, Amir Goldberg, Dan Jurafsky, Dan McFarland, Christopher Potts

The initial research plan is built around three interrelated levels of analysis: individual, group, and society. At each level, we are investigating the interplay between static and dynamic properties, and paying special attention to the ethical and economic issues that arise when confronting major scientific challenges like this one.

Data Science for Personalized Medicine

Michael Snyder, David Tse, Euan Ashley, Mohsen Bayati, Dan Boneh, Andrea Montanari, Ayfer Ozgur, Tsachy Weissman

Recent technological advances have enabled the collection of diverse health data at an unprecedented level. Omics information from genomes, transcriptomes, proteomes, metabolomes, DNA methylomes, and microbiomes, as well as electronic medical records and data from sensors and wearable devices, provide detailed...

DeepDive—a High-Performance Inference and Learning Engine

Christopher Ré, Michael Cafarella

DeepDive is a system to extract value from dark data. Like dark matter, dark data is the great mass of data buried in text, tables, figures, and images, which lacks structure and so is essentially unprocessable by existing software. DeepDive helps bring dark data to light by creating structured data (SQL tables) from unstructured information...

Small Projects

Food Prices and Mortality

Sanjay Basu, Eran Bendavid, Sze-chuan Suen

In the late 2000s, the prices of many staple crops sold in markets in low- and middle-income African countries tripled. Higher prices may compromise households’ ability to purchase enough food or, alternatively, increase incomes for food-producing households.

Use of Electronic Phenotyping and Machine Learning Algorithms to Identify Familial Hypercholesterolemia Patients in Electronic Health Records

Joshua Knowles, Nigam Shah

FIND FH (Flag, Identify, Network and Deliver for Familial Hypercholesterolemia) aims to pioneer new techniques for the identification of individuals with Familial Hypercholesterolemia (FH) within electronic health records (EHRs).

Real-Time Large-Scale Neural Identification

E.J. Chichilnisky, Andrea Montanari

Electronic interfaces to the brain are increasingly being used to treat incurable disease, and eventually may be used to augment human function. An important requirement to improve the performance of such devices is that they be able to recognize and effectively interact with the neural circuitry to which they are connected. 

Physics Event Reconstruction at the Large Hadron Collider

Lester Mackey, Ariel Schwartzman

The aim of this proposal is to develop and apply advanced data science techniques to address fundamental challenges of physics event reconstruction and classification at the Large Hadron Collider (LHC). The LHC is exploring physics at the energy frontier, probing some of the most fundamental questions about the nature of our universe. 

Inferring the Mass Map of the Observable Universe from 10 Billion Galaxies

Risa Wechsler, Phil Marshall

Mapping the Universe is an activity of fundamental interest, linking as it does some of the biggest questions in modern astrophysics and cosmology: What is the Universe made of, and why is it accelerating? How do the initial seeds of structure form and grow to produce our own Galaxy? Wide field astronomical surveys, such as that planned with...

AMELIE: Making genetic diagnostics accessible, reproducible, ubiquitous

Gill Bejerano, Christopher Ré

Mendelian diseases are caused by single gene mutations. In aggregate, they affect 3% (~250M) of the world’s population. The diagnosis of thousands of Mendelian disorders has been radically transformed by genome sequencing. 

Big Data for Agricultural Risk Management in the United States

David Lobell, Stefano Ermon

This project aims to improve in-season predictions of yields for major crops in the United States, as well as a related goal of mapping soil properties across major agricultural states. The project uses a combination of graphical models, approximate Bayesian computation, and crop simulation models to make predictions based on weather and satellite data.

Algorithms and Foundations for Valid Data Exploration

James Zou

Modern data science is highly exploratory in nature. A typical data analyst does not sit before a computer with a fixed set of hypotheses, but rather arrives at the most interesting questions and patterns after getting their hands dirty exploring the data. This exploration process creates complex selection biases in the reported findings and violates the standard independence assumption of statistics and machine learning.
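
As a rough illustration of why exploration inflates false findings, the sketch below computes the chance that at least one of several null hypotheses looks significant by luck alone. It assumes the tests are independent and each is run at a fixed threshold `alpha`, which is a simplification of the selection effects described above and not the project's own method; the function name is hypothetical.

```python
def chance_of_false_discovery(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` independent tests of a
    true null hypothesis yields p < alpha (assumed, simplified setting)."""
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    # Each test avoids a false positive with probability (1 - alpha);
    # all of them avoid one with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # An analyst who tries more patterns is more likely to "find" one.
    for m in (1, 5, 20, 100):
        print(m, round(chance_of_false_discovery(m), 3))
```

Under these assumptions, an analyst who informally examines 20 unrelated patterns has roughly a 64% chance of seeing at least one spurious "significant" result.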

MyHeart Counts

Euan Ashley

The MyHeart Counts study – launched in the spring of 2015 on Apple’s ResearchKit platform – seeks to mine the treasure trove of heart health and activity data that can be gathered in a population through mobile phone apps. Because the average adult in the U.S. checks his/her phone dozens of times each day, phone apps that target cardiovascular health are a promising tool to quickly gather large amounts of data about a population's health and fitness, and ultimately to influence people to make healthier choices.


Dan Boneh

Boneh’s lab is working on an efficient mechanism for confidential transactions on the blockchain (joint work with Benedikt Buenz). Confidential transactions (CT) are a way for two parties to transact on the blockchain without revealing the amount of money that one party is paying the other.
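
One common building block for hiding amounts is an additively homomorphic commitment. The sketch below uses Pedersen-style commitments to show how a verifier could check that inputs equal outputs without seeing either amount. It is a toy under stated assumptions and is not the lab's mechanism: the parameters `P`, `Q`, `G`, `H` and all function names are hypothetical, and the group is far too small to be secure.

```python
# Toy parameters (assumptions for illustration only; NOT secure).
P = 23  # prime modulus
Q = 11  # prime order of the subgroup of squares modulo P
G = 4   # generator of the order-Q subgroup
H = 9   # second generator; in a real system nobody may know log_G(H)


def commit(value: int, blinding: int) -> int:
    """Pedersen-style commitment G^value * H^blinding mod P."""
    return (pow(G, value % Q, P) * pow(H, blinding % Q, P)) % P


def product(commitments: list[int]) -> int:
    """Multiply commitments, which adds the hidden values and blindings."""
    result = 1
    for c in commitments:
        result = (result * c) % P
    return result


if __name__ == "__main__":
    # The payer spends one hidden input of 5 and creates hidden outputs 2 and 3.
    # Blinding factors are chosen so they also balance: 3 == (7 + 7) mod Q.
    tx_input = commit(5, 3)
    tx_outputs = [commit(2, 7), commit(3, 7)]
    # The verifier compares commitments only; the amounts are never revealed.
    print("balanced:", tx_input == product(tx_outputs))
```

Because amounts wrap around modulo the group order, a real scheme also has to prove that each hidden amount lies in a valid range; that step is omitted here.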

Stanford Distributed Clinical Data Project and MS Azure

Philip Lavori, Balasubramanian Narasimhan, Daniel Rubin

A system developed at Stanford lets distant hospitals and clinics use confidential healthcare data to build decision support applications without sharing any patient data among those institutions, thus facilitating multi-institution research studies on massive datasets. This collaboration between Microsoft and Stanford will develop an MS Azure application based on this system, providing a solution that is robust, usable, and widely deployable at many healthcare institutions.
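
The general idea of analyzing data across sites without moving patient records can be sketched as follows: each site reports only aggregate summaries, and a coordinator combines them. This is a minimal illustration under assumed names (`SiteSummary`, `summarize`, `pooled_mean_and_variance`) and made-up numbers; it does not describe the Stanford system's actual protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteSummary:
    """Aggregates a site is willing to share (hypothetical structure)."""
    count: int
    total: float
    total_sq: float


def summarize(values: list[float]) -> SiteSummary:
    """Runs locally at a site; raw values never leave this function."""
    return SiteSummary(len(values), sum(values), sum(v * v for v in values))


def pooled_mean_and_variance(summaries: list[SiteSummary]) -> tuple[float, float]:
    """Runs at the coordinator, using only the shared aggregates."""
    n = sum(s.count for s in summaries)
    if n < 2:
        raise ValueError("need at least two observations in total")
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    # Sample variance from sums: (sum of x^2 - n * mean^2) / (n - 1).
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


if __name__ == "__main__":
    # Made-up measurements held privately by two sites.
    site_a = summarize([1.0, 2.0, 3.0])
    site_b = summarize([4.0, 5.0])
    print(pooled_mean_and_variance([site_a, site_b]))
```

The pooled result matches what a single analyst would compute on the combined data, even though each site contributes only three numbers.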
