Updated for 2017-18
The high-level, long-term goal is to research how to use the Internet of Things to collect data on human behavior in a manner that preserves privacy but provides sufficient information to allow interventions which modify that behavior. We are exploring this research question in the context of water conservation at Stanford: how can smart water fixtures collect data on how students use water, such that dormitories can make interventions to reduce water use, while keeping detailed water use data private?
Towards this end, we have deployed a water use sensing network in Stanford dormitories. Our pilot deployment in the winter of 2017 produced several interesting results, such as the finding that average and median shower lengths are the same for men and women. More importantly, using this network we have been able to determine that placards placed inside the showers suggesting less water use are correlated with 10% shorter showers. Furthermore, while the average shower length is 8.8 minutes, there is an extremely long tail, with 20% of showers lasting longer than 15 minutes.
We have been asked to deploy the network again in order to measure water use within a larger dormitory. Currently, the network is purely observational: our next goal is to augment the network with real-time feedback to users, such as blinking a red light when a shower is running long. This will allow us to explore the relative efficacy of delayed (signs on doors, messages to dormitories), immediate (placards in showers) and real-time (indicator lights) interventions. Can the system do this in a way that provides the aggregate results without revealing the behavior of individuals or individual water use events?
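As a minimal illustration of the kind of summary statistics the network reports, the sketch below computes the mean, median, and long-tail share from a set of shower durations. The numbers are invented, chosen only so the summary echoes the figures above; they are not actual deployment data.

```python
import statistics

# Hypothetical shower durations in minutes, invented so that the summary
# echoes the figures in the text; this is NOT actual deployment data.
durations = [4.2, 6.1, 8.8, 5.0, 16.5, 7.3, 9.9, 18.0, 3.5, 8.7]

mean_len = statistics.mean(durations)
median_len = statistics.median(durations)
long_tail = sum(1 for d in durations if d > 15) / len(durations)

print(f"mean={mean_len:.1f} min, median={median_len:.1f} min, "
      f"share over 15 min={long_tail:.0%}")
```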
The initial research plan is built around three interrelated levels of analysis: individual, group, and society. At each level, we are investigating the interplay between static and dynamic properties, and paying special attention to the ethical and economic issues that arise when confronting major scientific challenges like this one. Our ultimate goal is to identify ways in which scientists, engineers, community builders, and community leaders can contribute to the development of more productive, vibrant, and informed teams, online and offline communities, and societies.
The goal of this project is to develop data science tools and statistical models that bring networks and language together in order to make more and better predictions about both. Our focus is on joint models of language and network structure. This brings natural language processing and social network analysis together to provide a detailed picture not only of what is being said in a community, but also who is saying it, how the information is being transmitted through the network, how that transmission affects network structure, and, coming full circle, how those evolving structures affect linguistic expression. We plan to develop statistical models using diverse data sets, including not only online social networks (Twitter, Reddit, Facebook), but also hyperlink networks of news outlets (using massive corpora we collect on an ongoing basis) and networks of political groups, labs, and corporations.
Leskovec maintains a large collection of network and language data sets at the website of the Stanford Network Analysis Project, SNAP (http://snap.stanford.edu). The pilot work described in general terms here relies mainly on resources that have been posted on SNAP for public use. (In some cases, privacy or business concerns preclude such distribution.) Moreover, we have access to several powerful, comprehensive data sets: (i) cell phone call traces of entire countries; (ii) complete article commenting and voting data from sites such as CNN, NPR, and FOX; (iii) a near-complete U.S. media picture: 10 billion blog posts and news articles (5 million per day over the last six years); (iv) complete Twitter, LinkedIn, and Facebook data (through direct collaboration with these companies); (v) five years of email logs from a medium-sized company.
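For readers who want to experiment with the public SNAP collections, the sketch below parses a SNAP-style edge list (optional `#` comment headers, then one source-destination pair per line) and computes node degrees. The edges shown are made up for illustration.

```python
from collections import defaultdict

# Parse a SNAP-style edge list: '#' comment headers, then one
# "src dst" pair per line. The edges here are made up for illustration.
edge_list = """\
# Toy graph
0 1
0 2
1 2
2 3
"""

adj = defaultdict(set)
for line in edge_list.splitlines():
    if not line or line.startswith("#"):
        continue
    u, v = line.split()
    adj[u].add(v)
    adj[v].add(u)  # treat the graph as undirected

degrees = {node: len(neighbors) for node, neighbors in adj.items()}
print(degrees)  # node "2" touches three others
```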
Recent technological advances have enabled the collection of diverse health data at an unprecedented level. Omics information on genomes, transcriptomes, proteomes, metabolomes, DNA methylomes, and microbiomes, as well as electronic medical records and data from sensors and wearable devices, provides a detailed view of disease state and of physiological and behavioral parameters at the individual level. The availability of such a massive-scale digital footprint of an individual’s health opens the door to numerous opportunities for monitoring and accurately predicting the individual’s health outcomes, as well as customizing treatments at the individual level, hence realizing the goal of personalized medicine. A major challenge is how to efficiently collect, store, secure, and, most importantly, analyze such massive-scale and highly private data so that the accuracy of outcome predictions and treatment analysis is not impacted. The “Data Science for Personalized Health” flagship project will design a system that addresses this challenge and validate it on several personalized medicine tasks. Specifically, we will 1) devise new algorithms for sampling, for imputation of missing data, and for joint processing of multiple measurements; 2) build novel frameworks to house and manage complex data in a useful and secure fashion; and 3) devise new tools for the analysis of, and prediction from, high-dimensional, complex, longitudinal data. Using a unique dataset on 70 pre-diabetic participants, we will devise a personalized and highly accurate early detection method for diabetes, analyze the consequences of weight change, physical activity, stress, and respiratory viral infection on individuals’ digital health footprints, and ultimately predict the effect of such perturbations on individuals’ health outcomes.
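As a toy illustration of the missing-data imputation task in item 1), the sketch below fills gaps in longitudinal measurements with per-feature means. The feature names and values are invented, and the project's actual methods would be far more sophisticated than this baseline.

```python
# Toy longitudinal measurements with gaps (None marks a missing value);
# the feature names and numbers are invented for illustration.
visits = [
    {"glucose": 95.0, "weight": 70.0},
    {"glucose": None, "weight": 71.0},
    {"glucose": 101.0, "weight": None},
]

# Baseline per-feature mean imputation: fill each gap with the mean of
# the observed values for that feature.
for feature in {k for visit in visits for k in visit}:
    observed = [v[feature] for v in visits if v[feature] is not None]
    mean = sum(observed) / len(observed)
    for v in visits:
        if v[feature] is None:
            v[feature] = mean

print(visits)
```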
The research is led by an interdisciplinary team of faculty with expertise in medicine, genetics, machine learning, security and information theory, and the tools developed will be of broad interest to other data science problems as well.
DeepDive is a system to extract value from dark data. Like dark matter, dark data is the great mass of data buried in text, tables, figures, and images, which lacks structure and so is essentially unprocessable by existing software. DeepDive helps bring dark data to light by creating structured data (SQL tables) from unstructured information (text documents) and integrating such data with an existing structured database.
DeepDive is a new type of data management system that enables users to tackle extraction, integration, and prediction problems in a single system, allowing them to rapidly construct sophisticated end-to-end data pipelines. Because users build their system end-to-end, they can focus on the portion of the system that most directly improves the quality of their application. In contrast, previous pipeline-based systems required developers to build extractors, integration code, and other components without any clear idea of how their changes improved the quality of their data product. This simple insight is the key to how DeepDive systems produce higher-quality data in less time. DeepDive-based systems are used by users without machine learning expertise in a number of domains, from paleobiology to genomics to human trafficking.
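The extraction-to-SQL step at the heart of DeepDive can be caricatured in a few lines: a simple extractor pulls structured rows out of free text and lands them in a database table. The regex extractor and sentences below are toy stand-ins; DeepDive's actual extractors are statistical, not hand-written rules.

```python
import re
import sqlite3

# Toy "dark data": free-text sentences. The regex extractor below is a
# hand-written stand-in for DeepDive's statistical extractors.
docs = [
    "Fossil A was found in the Morrison Formation in 1877.",
    "Fossil B was found in the Hell Creek Formation in 1902.",
]
pattern = re.compile(r"(Fossil \w+) was found in the (.+?) in (\d{4})")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE finds (fossil TEXT, formation TEXT, year INT)")
for doc in docs:
    match = pattern.search(doc)
    if match:
        conn.execute("INSERT INTO finds VALUES (?, ?, ?)", match.groups())

rows = conn.execute("SELECT fossil, year FROM finds ORDER BY year").fetchall()
print(rows)
```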
In the late 2000’s, the prices of many staple crops sold on markets in low- and middle-income African countries tripled. Higher prices may compromise households’ ability to purchase enough food, or alternatively increase incomes for food-producing households. Despite these different potential effects, the net impact of this “food crisis” on the health of vulnerable populations remains unknown.
We extracted data on the local prices of four major staple crops – maize, rice, sorghum, and wheat – from 98 markets in 12 African countries (2002-2013), and studied their relationship to under-five mortality from Demographic and Health Surveys. Using within-country fixed effects models, distributed lag models, and instrumental variable approaches, we used the dramatic price increases in 2007-2008 to test the relationship between food prices and under-five mortality, controlling for secular trends, gross domestic product per capita, urban residence, and seasonality.
The prices of all four commodities tripled, on average, between 2006 and 2008. We did not find any model specification in which increased prices of maize, sorghum, or wheat were consistently associated with increased under-five mortality. Indeed, price increases for these commodities were more commonly associated with (statistically insignificant) lower mortality in our data. A $1 increase in the price per kg of sorghum, a common African staple, was associated with 0.07-4.50 fewer child deaths per 10,000 child-months, depending on the specification (p=0.25-0.98). In rural areas, where higher food prices may benefit households that are net food producers, increasing maize prices were associated with lower child mortality than in urban areas (12.4 fewer child deaths per 10,000 child-months with each $1 increase in the price of maize; p=0.04).
We did not detect a significant overall relationship between increased prices of maize, rice, sorghum, or wheat and increased under-five mortality. There is some suggestion that food-producing areas may benefit from higher prices, while urban areas may be harmed.
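The within-country fixed-effects idea used above can be sketched in a few lines: demean price and mortality within each country (which absorbs country fixed effects), then regress the demeaned outcome on the demeaned price. The toy panel below is invented for illustration and is not study data.

```python
import numpy as np

# Toy (country, price, child mortality) panel, invented for illustration.
rows = [
    ("A", 1.0, 50.0), ("A", 2.0, 49.0), ("A", 3.0, 48.0),
    ("B", 2.0, 80.0), ("B", 3.0, 79.0), ("B", 4.0, 78.0),
]
countries = np.array([r[0] for r in rows])
price = np.array([r[1] for r in rows])
mortality = np.array([r[2] for r in rows])

# Within transformation: demean each variable inside its country, which
# absorbs country fixed effects; then fit OLS on the demeaned data.
for c in np.unique(countries):
    m = countries == c
    price[m] -= price[m].mean()
    mortality[m] -= mortality[m].mean()

beta = np.linalg.lstsq(price.reshape(-1, 1), mortality, rcond=None)[0][0]
print(f"within-country price coefficient: {beta:.2f}")
```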
Use of Electronic Phenotyping and Machine Learning Algorithms to Identify Familial Hypercholesterolemia Patients in Electronic Health Records
FIND FH (Flag, Identify, Network and Deliver for Familial Hypercholesterolemia) aims to pioneer new techniques for the identification of individuals with Familial Hypercholesterolemia (FH) within electronic health records (EHRs). FH is a common but vastly underdiagnosed, inherited form of high cholesterol and coronary heart disease that is potentially devastating if undiagnosed but can be ameliorated with early identification and proactive treatment. Traditionally, patients with a phenotype (such as FH) are identified through rule-based definitions whose creation and validation are time consuming. Machine learning approaches to phenotyping are limited by the paucity of labeled training datasets. In this project, we demonstrate the feasibility of utilizing noisily labeled training sets to learn phenotype models from the patient's clinical record. We will search both structured and unstructured data within EHRs to identify possible FH patients. Individuals with possible FH will be “flagged” and contacted in a HIPAA-compliant manner to encourage guideline-based screening and therapy. The algorithms developed will be broadly applicable to several different EHR platforms, and the principles can be applied to multiple conditions, thereby extending the utility of this approach. The project is in partnership with the FH Foundation (www.thefhfoundation.org), a non-profit organization founded and led by FH patients that is dedicated to improving the awareness and treatment of FH.
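The noisy-labeling idea can be sketched as follows: a cheap rule labels records, and those labels stand in for expert chart review when training a richer model. The record fields, thresholds, and keyword rule below are hypothetical, not the project's actual phenotype criteria.

```python
# Toy EHR records with a gold label "fh" standing in for expert chart
# review; field names, thresholds, and notes are all hypothetical.
records = [
    {"ldl": 210, "note": "strong family history of early MI", "fh": True},
    {"ldl": 150, "note": "family history of diabetes", "fh": False},
    {"ldl": 220, "note": "statin started", "fh": True},
    {"ldl": 120, "note": "routine visit", "fh": False},
]

def noisy_label(rec):
    # A cheap rule standing in for a hand-built phenotype definition:
    # flag very high LDL or family-history language in the note text.
    return rec["ldl"] >= 190 or "family history" in rec["note"]

labels = [noisy_label(r) for r in records]
agreement = sum(l == r["fh"] for l, r in zip(labels, records)) / len(records)
print(f"noisy rule agrees with chart review on {agreement:.0%} of records")
```

The disagreement on the second record is the point: the rule is noisy, and downstream learning must tolerate that noise.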
Electronic interfaces to the brain are increasingly being used to treat incurable disease, and eventually may be used to augment human function. An important requirement to improve the performance of such devices is that they be able to recognize and effectively interact with the neural circuitry to which they are connected. An example is retinal prostheses for treating incurable blindness. Early devices of this form exist now, but only deliver limited visual function, in part because they do not recognize the diverse cell types in the retina to which they connect. We have developed automated classifiers for functional identification of retinal ganglion cells, the output neurons of the retina, based solely on recorded voltage patterns on an electrode array similar to the ones used in retinal prostheses. Our large collection of data—hundreds of recordings from primate retina over 18 years—made an exploration of automated methods for cell type identification possible for the first time. We trained classifiers based on features extracted from electrophysiological images (spatiotemporal voltage waveforms), inter-spike intervals (auto-correlations), and functional coupling between cells (cross-correlations), and were able to routinely identify cell types with high accuracy. Based on this work, we are now developing the techniques necessary for a retinal prosthesis to exploit this information by encoding the visual signal in a way that optimizes artificial vision.
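As a schematic of feature-based cell-type identification, the sketch below classifies cells by nearest centroid in a two-dimensional feature space. The feature values are invented, and the real classifiers use far richer features (electrophysiological images, auto- and cross-correlations).

```python
import math

# Nearest-centroid cell-type classification from two electrophysiological
# features; the feature values below are invented for illustration.
train = {
    "ON parasol": [(0.9, 1.2), (1.0, 1.1)],
    "OFF midget": [(0.2, 0.4), (0.3, 0.5)],
}
centroids = {
    cell_type: tuple(sum(vals) / len(vals) for vals in zip(*examples))
    for cell_type, examples in train.items()
}

def classify(features):
    # Assign the cell to the type whose centroid is closest.
    return min(centroids, key=lambda t: math.dist(features, centroids[t]))

print(classify((0.95, 1.15)))
```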
The aim of this proposal is to develop and apply advanced data science techniques to address fundamental challenges of physics event reconstruction and classification at the Large Hadron Collider (LHC). The LHC is exploring physics at the energy frontier, probing some of the most fundamental questions about the nature of our universe. The datasets of the LHC experiments are among the largest in all of science. Each particle collision event at the LHC is rich in information, particularly in the detail and complexity of each event picture (a 100-million-pixel image taken 40 million times per second), making it ideal for the application of modern machine learning techniques to extract the maximum amount of physics information. Up until now, most of the methods used to extract useful information from the large datasets of the LHC have been based on physics intuition built from existing models. During the last several years, spectacular advances in the fields of artificial intelligence, computer vision, and deep learning have resulted in remarkable performance improvements in image classification and vision tasks, in particular through the use of deep convolutional neural networks (CNNs). Representing LHC collision events as images, a novel concept developed by SDSI, has enabled, for the first time, the application of computer vision and deep learning methods for event classification and reconstruction, resulting in impressive gains in the discovery potential of the LHC. We plan to continue to improve physics event interpretation at the LHC through the development and application of advanced machine learning algorithms to some of the most difficult and exciting challenges in physics event reconstruction at hadron colliders, such as the identification of Higgs bosons and the mitigation of pileup (many overlapping collisions in a single event). These developments will have important implications for extracting knowledge in high energy physics.
The problem also provides a setting for more general exploration of tools to find subtle correlations embedded in a large dataset. Publications funded by SDSI include: Jet-Images: Computer Vision Inspired Techniques for Jet Tagging, JHEP 02 (2015) 118, https://arxiv.org/abs/1407.5675
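To make the events-as-images idea concrete, the sketch below applies a single convolutional filter, the basic building block of a CNN, to a tiny synthetic "event image". The image and filter weights are invented; real models learn many filters across many layers, over far larger images.

```python
import numpy as np

# A synthetic 6x6 "event image" and one 3x3 filter. Real event images are
# far larger, and filter weights are learned rather than fixed.
rng = np.random.default_rng(0)
event_image = rng.random((6, 6))

kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

out = event_image.shape[0] - kernel.shape[0] + 1  # valid-convolution size
feature_map = np.array([
    [(event_image[i:i + 3, j:j + 3] * kernel).sum() for j in range(out)]
    for i in range(out)
])
print(feature_map.shape)  # one activation per filter position
```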
Mapping the Universe is an activity of fundamental interest, linking as it does some of the biggest questions in modern astrophysics and cosmology: What is the Universe made of, and why is it accelerating? How do the initial seeds of structure form and grow to produce our own Galaxy? Wide-field astronomical surveys, such as that planned with the Large Synoptic Survey Telescope (LSST), will provide measurements of billions of galaxies over half of the sky; we want to analyze these datasets with sophisticated statistical methods that allow us to create the most accurate map of the distribution of mass in the Universe to date. The sky locations, colors, and brightnesses of the galaxies allow us to infer (approximately) their positions in 3D and their stellar masses; the distorted apparent shapes of galaxies contain information about the gravitational effects of mass in other galaxies along the line of sight. Our proposed work is to take the first step in using all of this information in a giant hierarchical inference of our Universe's cosmological and galaxy population model hyper-parameters, after explicit marginalization of the parameters describing millions, and perhaps billions, of individual galaxies. We will need to develop the statistical machinery to perform this inference and implement it at the appropriate computational scale. Training and testing will require large cosmological simulations generating plausible mock galaxy catalogs; we plan to make all of our data public to enable further investigations of this type.
Mendelian diseases are caused by single-gene mutations. In aggregate, they affect 3% (~250M) of the world’s population. The diagnosis of thousands of Mendelian disorders has been radically transformed by genome sequencing. The potential to change so many lives for the better is held back by the associated human labor costs. Genome sequencing is a simple, fast procedure costing hundreds of dollars. The mostly manual process of finding which, if any, of the patient’s sequenced variants is responsible for their phenotypes, against an exploding body of literature, makes genetic diagnosis 10x more expensive, unsustainably slow, and incompletely reproducible.
Our project, a unique collaboration between Stanford’s Computer Science department and the Medical Genetics Division of Stanford’s children’s hospital, aims to develop and deploy a first-of-its-kind computer system to greatly accelerate the clinical diagnosis workflow and, additionally, to derive novel disease-gene hypotheses from it. This effort will produce a proof-of-principle workflow and worldwide-deployable tools to significantly improve diagnostic throughput and to greatly reduce both the time spent by expert clinicians to reach a diagnosis and the associated costs, thereby making genomic testing accessible, reproducible, and ubiquitous.
A first flagship analysis web portal for the project has launched at https://AMELIE.stanford.edu
This project aims to improve in-season predictions of yields for major crops in the United States, as well as a related goal of mapping soil properties across major agricultural states. The project uses a combination of graphical models, approximate Bayesian computation, and crop simulation models to make predictions based on weather and satellite data.
This work has been published at AAAI, where it received a best student paper award, and also won an award in the World Bank Big Data Challenge competition.
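One of the ingredients named above, approximate Bayesian computation, can be sketched with rejection sampling: draw a parameter from the prior, run a simulator, and keep the draw if the simulated yield lands close to the observed one. The simulator and all numbers below are invented stand-ins for a real crop model.

```python
import random

# Rejection-ABC toy: a crude "crop model" maps a parameter theta to a
# yield; we keep prior draws whose simulated yield is near the observed
# one. The simulator and every number here are invented.
random.seed(1)
observed_yield = 10.0

def simulate(theta):
    return 2 * theta + random.gauss(0, 0.5)

accepted = []
for _ in range(20000):
    theta = random.uniform(0, 10)  # flat prior over the parameter
    if abs(simulate(theta) - observed_yield) < 0.2:
        accepted.append(theta)

posterior_mean = sum(accepted) / len(accepted)
print(f"accepted {len(accepted)} draws; posterior mean ~ {posterior_mean:.2f}")
```

Since the simulator maps theta to roughly 2·theta, the accepted draws concentrate near theta = 5.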
Modern data science is highly exploratory in nature. A typical data analyst does not sit before a computer with a fixed set of hypotheses, but rather arrives at the most interesting questions and patterns after getting their hands dirty exploring the data. This exploration process creates complex selection biases in the reported findings and violates the standard independence assumptions of statistics and machine learning. One consequence, for example, is that reported p-values are no longer valid. Biases due to data exploration therefore present a major challenge for reproducible data science. This project aims to develop a rigorous mathematical foundation for quantifying and reducing exploration bias in general settings. In parallel, we will investigate new algorithms and platforms that allow for flexible data exploration while maintaining the validity of the results.
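The selection-bias problem can be seen in a small simulation: test many effects that are all truly zero, report only the most extreme estimate, and the reported effect looks large. The sketch below uses synthetic data and arbitrary parameters.

```python
import random

# Simulate an analyst who tests 50 effects that are all truly zero and
# reports only the most extreme estimate; all parameters are arbitrary.
random.seed(0)
n_hypotheses, n_obs = 50, 30
selected = []
for _ in range(500):
    estimates = [
        sum(random.gauss(0, 1) for _ in range(n_obs)) / n_obs
        for _ in range(n_hypotheses)
    ]
    selected.append(max(estimates, key=abs))  # the "finding" that gets reported

avg_magnitude = sum(abs(e) for e in selected) / len(selected)
print(f"avg |reported effect| under the null: {avg_magnitude:.2f}")
```

Even though every true effect is zero, the reported estimate is systematically far from zero: exploration without correction manufactures findings.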
Adaptively collected data
From scientific experiments to online A/B testing, adaptive procedures for data collection are ubiquitous in practice. Previously observed data often affects how future experiments are performed, which in turn affects which data will be collected. Such adaptivity introduces complex correlations between the data and the collection procedure, correlations that have been largely ignored by data scientists.
We prove that, under very general conditions, any adaptively collected data has negative bias, meaning that the observed effects in the data systematically underestimate the true effect sizes. As an example, consider an adaptive clinical trial in which additional data points are more likely to be collected for treatments that show initial promise. Our result implies that the average observed treatment effects will be smaller than the true effects of each treatment. This is quite surprising, because folklore suggests that, if anything, we might overestimate the true effect due to the winner’s curse. We prove that the opposite is true. Moreover, we develop algorithms that effectively reduce this bias and improve the usefulness of adaptively collected data.
Paper: Nie, Tian, Taylor, and Zou. https://arxiv.org/abs/1708.01977
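The negative-bias phenomenon is easy to reproduce in simulation. In the sketch below, a greedy rule keeps sampling whichever of two arms currently looks better; although both arms have true mean zero, the average of the arms' final sample means comes out negative, in line with the result above. The simulation parameters are arbitrary.

```python
import random

# Both arms have true mean 0. A greedy rule keeps sampling whichever arm
# currently looks better; the arms' final sample means are negatively
# biased on average. All parameters here are arbitrary.
random.seed(0)
biases = []
for _ in range(2000):
    samples = {0: [random.gauss(0, 1)], 1: [random.gauss(0, 1)]}
    for _ in range(20):
        means = {arm: sum(s) / len(s) for arm, s in samples.items()}
        leader = max(means, key=means.get)  # adaptively favor the leader
        samples[leader].append(random.gauss(0, 1))
    # Average the two arms' observed means; the true effect is 0 for both.
    biases.append(sum(sum(s) / len(s) for s in samples.values()) / 2)

avg_bias = sum(biases) / len(biases)
print(f"average observed mean across arms: {avg_bias:.3f}")
```

The intuition matches the paper's: an arm is resampled (and regresses toward its true mean) when it looks good, but is frozen at a pessimistic estimate when it looks bad.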
Large-scale experiments to measure the science of data science.
Despite the tremendous growth of data science, we lack a systematic and quantitative understanding of how data science is done in practice. For example, if we asked 10,000 data scientists to independently explore the same dataset, how different would their sequences of analysis steps and their findings be? Are certain paths of analysis more likely to lead to biases and false discoveries? What resources and training can we provide to analysts to improve the accuracy of their analyses? We are answering these and many other fundamental questions with the largest controlled experiments yet conducted to get at the heart of the science behind data science. In collaboration with several of the most popular data science MOOCs, we are recruiting thousands of analysts to an online platform where we provide them with specific datasets and track each step of their analysis. The results of these experiments could lead to insights that improve the robustness and reproducibility of data science.
The MyHeart Counts study – launched in the spring of 2015 on Apple’s ResearchKit platform – seeks to mine the treasure trove of heart health and activity data that can be gathered in a population through mobile phone apps. Because the average adult in the U.S. checks their phone dozens of times each day, phone apps that target cardiovascular health are a promising tool to quickly gather large amounts of data about a population's health and fitness, and ultimately to influence people to make healthier choices. In the first 8 months, over 40,000 users downloaded the app. Participants recorded physical activity, filled out health questionnaires, and completed a 6-minute walk test. We applied unsupervised machine learning techniques to cluster subjects into activity groups, such as those more active on weekends. We then developed algorithms to uncover associations between the clusters of accelerometry data and subjects' self-reported health and well-being outcomes. Our results, published in JAMA Cardiology in December 2016, are in line with the accepted medical wisdom that more active people are at lower risk for diabetes, heart disease, and other health problems. However, there is more to the story, as we learned that certain activity patterns are healthier than others. For example, subjects who were active throughout the day in brief intervals had a lower incidence of heart disease compared to those who were active for the same total amount of time but got all their activity in a single longer session. In the second iteration of our study, we aim to answer more complex research questions, focusing on gene-environment interactions as well as discovering the mechanisms that are most effective in encouraging people to lead more active lifestyles. The app now provides users with different forms of coaching as well as graphical feedback about their performance throughout the duration of the study.
Additionally, we know it’s not just environmental factors that affect heart health, so MyHeart Counts, in collaboration with 23andMe, has added a function to the app that allows participants who already have a 23andMe account to seamlessly upload their genome data to our servers. These data are coded and available to approved researchers. Combining data collected through the app with genetic data allows us to pursue promising and exciting research in heart health. Download the MyHeart Counts app from the iTunes store and come join our research!
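As a much cruder stand-in for the unsupervised activity clustering described above, the sketch below splits users into "weekend-active" and "weekday-active" groups from a week of synthetic step counts; the numbers and the two-group rule are invented for illustration.

```python
# A week of synthetic step counts per user (Mon..Sun); the two-group rule
# below is a crude stand-in for the study's unsupervised clustering.
users = {
    "u1": [2000, 2500, 2200, 2100, 2400, 9000, 8500],
    "u2": [7000, 7500, 7200, 6800, 7100, 2000, 1800],
}

groups = {}
for user, steps in users.items():
    weekday_mean = sum(steps[:5]) / 5
    weekend_mean = sum(steps[5:]) / 2
    groups[user] = ("weekend-active" if weekend_mean > weekday_mean
                    else "weekday-active")

print(groups)
```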
Boneh’s lab is working on an efficient mechanism for confidential transactions on the blockchain (joint work with Benedikt Buenz). Confidential transactions (CT) are a way for two parties to transact on the blockchain without revealing the amount of money that one party is paying the other. This capability is absolutely necessary if the blockchain is ever going to be used for business. Current CT mechanisms have a number of drawbacks; most notably, CT transactions are much larger than non-CT transactions. Our construction greatly shrinks the overhead for CT transactions.
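The additive homomorphism that makes confidential transactions possible can be sketched with Pedersen-style commitments over a toy group. The parameters below are far too small to be secure; real CT systems work over elliptic-curve groups and add range proofs (the component whose size the construction above shrinks).

```python
# Pedersen-style commitment over a toy prime field; the modulus and
# generators below are NOT secure and are for illustration only.
p = 2**61 - 1  # a Mersenne prime, far too small for real cryptography
g, h = 3, 7    # toy generators; real systems derive h so log_g(h) is unknown

def commit(amount, blinding):
    # Commit to `amount`, hidden by a random-looking `blinding` factor.
    return pow(g, amount, p) * pow(h, blinding, p) % p

# Additive homomorphism: the product of two commitments is a commitment
# to the sum of the amounts, so a validator can check that inputs equal
# outputs without ever seeing the individual amounts.
c1 = commit(25, 1111)
c2 = commit(17, 2222)
assert c1 * c2 % p == commit(25 + 17, 1111 + 2222)
print("commitments add: hidden amounts balance")
```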
A system developed at Stanford enables distant hospitals and clinics to use confidential healthcare data to build decision-support applications without sharing any patient data among those institutions, thus facilitating multi-institution research studies on massive datasets. This collaboration between Microsoft and Stanford will develop a Microsoft Azure application based on this system, providing a solution that is robust, usable, and widely deployable at many healthcare institutions.