Cloudera Powers Opt-In Machine Learning Project for Real-Time Identification of Suicide Risk Factors in Military Veterans
Patterns and Predictions’ Durkheim Project Uses Predictive Analytics Across Data Sources
PALO ALTO, CA – September 25, 2013 – Cloudera, the leader in enterprise analytic data management powered by Apache Hadoop™, today announced that Patterns and Predictions, a predictive analytics company, partnered with Cloudera for an ongoing initiative applying machine learning to the identification of key correlations between military veterans’ communications and suicide risk. The Durkheim Project, as it is called, entails opt-in monitoring across a variety of online and mobile data channels to predict which military veterans are at the highest risk of suicide. It is powered by a real-time risk detection framework co-developed with Cloudera and built on CDH (Cloudera’s Distribution Including Apache Hadoop), Cloudera Impala and Cloudera Search.
“The promise of the Durkheim Project is expressed in its ability to collect, monitor and deliver insights from a diverse repository of complex data, including mobile and social media signals, with the hope of eventually providing real-time triage of interventional actions upon detection of a critical event,” said Patterns and Predictions founder Chris Poulin. “Cloudera's unique software and expertise enable us to make risk assessments faster and across larger data sets, resulting in better clinical outcomes.”
Applied Machine Learning Identifies and Predicts Mental Health Risk Factors
Patterns and Predictions' founder Chris Poulin began working with Dartmouth researchers in 2010 to address the problem of high suicide rates among veterans. The suicide rate among U.S. veterans is approximately twice that of the general population, a challenging phenomenon facing the U.S. Department of Veterans Affairs (VA).
With support from the Defense Advanced Research Projects Agency (DARPA), a research arm of the Department of Defense (DoD), and Dartmouth College, the suicide risk prediction project includes a database of more than 100,000 U.S. veterans, all of whom have volunteered their participation. By mining these veterans’ social media posts and other indicators, Patterns and Predictions – together with a team of experts in artificial intelligence, medical professionals from private companies, and the VA – developed a set of predictive indicators of suicide risk for military veterans.
The tightly integrated machine learning system was trained by feeding in isolated statistical indicators – keywords, word patterns and other linguistic clues known to be associated with people who needed help – from a variety of veterans’ data sources. Words and linguistic patterns that veterans post online are data-mined for indicators of suicidal behavior and the system identifies useful clues in real data to establish a risk “score.”
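As a rough illustration of how weighted linguistic indicators can be combined into a single risk "score," the following Python sketch sums per-term weights and maps the total to a value between 0 and 1. It is a minimal sketch under stated assumptions, not the Durkheim Project's model: the indicator names, weights, bias and the logistic mapping are all hypothetical placeholders chosen for illustration.

```python
import math
import re

# Hypothetical indicator weights (placeholders, not the project's real indicators).
# Positive weights raise the score; negative weights lower it.
INDICATOR_WEIGHTS = {
    "indicator_a": 1.2,
    "indicator_b": 0.8,
    "indicator_c": -0.5,
}
BIAS = -1.0  # hypothetical baseline offset applied to every text


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z_]+", text.lower())


def risk_score(text):
    """Sum the weights of known indicators and squash the total into (0, 1)."""
    total = BIAS + sum(INDICATOR_WEIGHTS.get(tok, 0.0) for tok in tokenize(text))
    return 1.0 / (1.0 + math.exp(-total))


if __name__ == "__main__":
    print(risk_score("a post with no known indicators"))
    print(risk_score("a post with indicator_a and indicator_b"))
```

In a trained system the weights would be learned from labeled data rather than set by hand; the sketch only shows how individual indicators add up to one number.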
With so many veteran participants, the data sets are very large. The veterans who opt into the project receive a unique Facebook app and a mobile app for either the iOS or Android operating system; these are designed to capture posts, Tweets, mobile uploads and geographic location. Additional profile data is captured as well, including physician information and clinical notes. To ensure compliance with various privacy and HIPAA regulations, all captured data is stored in a secure environment behind a medical firewall.
Open Source Hadoop Infrastructure Delivers Operational Efficiency for Critical Research
The Durkheim Project has a highly complex workflow, requiring foundational infrastructure and predictive modeling that supports big data collection and analysis at scale. Moreover, the team wanted to access all of the machine learning through search interfaces, which can get expensive since all of the machine learning is indexed.
The technical objective for building the machine learning data fabric underpinning the initiative was maximum speed at minimum cost. Poulin found most big data solutions to be low performance in terms of accuracy, or highly complex in implementation and/or in integration with Patterns and Predictions’ existing IT environment. Poulin chose to build on Apache Hadoop for its abstraction of underlying data set complexity and selected Cloudera for its category leadership and subject matter expertise in the Hadoop framework, open source and big data infrastructure. CDH, the market-leading, 100% open source distribution of Hadoop and related projects, serves as the cornerstone technology of the Durkheim Project. Using Cloudera Impala and Cloudera Search, the ingestion of data on Hadoop is markedly more efficient, delivering lower costs, better computational throughput and reduced complexity of IT support.
Patterns and Predictions engaged Cloudera Professional Services to co-develop code in the area of real-time prediction on CDH, called Bayesian Counters. The use of text analytics against the continuously fed large data pool delivers an exponential number of variables which can then be compared and analyzed, resulting in a real-time assessment of the participant’s mental health. The computational processing to analyze that data requires a big data fabric, and the benefit is that the output is much more informative.
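The general idea of prediction from continuously updated counts can be sketched in a few lines of Python. The example below is not the Bayesian Counters code and makes no claim about its design; it is a minimal, in-memory naive Bayes classifier, assumed here purely for illustration, in which each new message only increments counters and a posterior can be read off at any time. The class name, labels and smoothing value are hypothetical.

```python
import math
from collections import defaultdict


class CountingNaiveBayes:
    """Naive Bayes kept as running counts, so it can be updated per message."""

    def __init__(self, labels, alpha=1.0):
        self.labels = list(labels)
        self.alpha = alpha                    # Laplace smoothing constant
        self.doc_counts = defaultdict(int)    # label -> messages seen
        self.token_counts = defaultdict(int)  # (label, token) -> occurrences
        self.total_tokens = defaultdict(int)  # label -> tokens seen
        self.vocab = set()

    def update(self, tokens, label):
        """Fold one labeled message into the counters."""
        self.doc_counts[label] += 1
        for tok in tokens:
            self.token_counts[(label, tok)] += 1
            self.total_tokens[label] += 1
            self.vocab.add(tok)

    def posterior(self, tokens):
        """Return P(label | tokens) for every label, computed from the counts."""
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab) or 1
        log_scores = {}
        for label in self.labels:
            prior = (self.doc_counts.get(label, 0) + self.alpha) / (
                n_docs + self.alpha * len(self.labels)
            )
            log_p = math.log(prior)
            denom = self.total_tokens.get(label, 0) + self.alpha * vocab_size
            for tok in tokens:
                count = self.token_counts.get((label, tok), 0)
                log_p += math.log((count + self.alpha) / denom)
            log_scores[label] = log_p
        # Normalize in log space to avoid underflow on long messages.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        norm = sum(weights.values())
        return {k: w / norm for k, w in weights.items()}
```

Because the model state is nothing more than counters, an update is cheap and order-independent, which is one reason count-based approaches suit continuously fed data; a production system would keep the counters in a distributed store rather than in process memory.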
In the Future, Data Could Help Veterans in Crisis
In February 2013, an investigation conducted by Patterns and Predictions, Dartmouth and the VA determined that the accuracy of this risk-prediction data model was statistically significant, with "consistent accuracies" of 65 percent or higher in predicting suicide risk in a veteran control group.
Still in its initial phases, the Durkheim Project is authorized only to monitor and analyze data. While the project has delivered statistically valid results that accurately predict suicide risk in a control group of veterans, its critical research is restricted, at least for the time being, to a non-interventional protocol. Using Cloudera, the project’s continued scaling of risk classifiers will help to establish the necessary confidence in the project’s ability to assess risk in real time, as the project team currently applies for an interventional study.
About Patterns and Predictions
Patterns and Predictions is a predictive analytics firm. Its core Centiment® technology provides unstructured and linguistics driven prediction. It is the technology powering the Durkheim Project’s ‘big data’ analytics network for the assessment of mental health risks. Partners include Bloomberg, The Geisel School of Medicine at Dartmouth, Cloudera, and Attivio. Funding sources include the U.S. Government (DARPA), and customers include Global 100 companies.
About Cloudera
Cloudera is revolutionizing enterprise data management by offering the first unified Platform for big data, an enterprise data hub built on Apache Hadoop. Cloudera offers enterprises one place to store, access, process, secure, and analyze all their data, empowering them to extend the value of existing investments while enabling fundamental new ways to derive value from their data. Cloudera's open source big data platform is the most widely adopted in the world, and Cloudera is the most prolific contributor to the open source Hadoop ecosystem. As the leading educator of Hadoop professionals, Cloudera has trained over 40,000 individuals worldwide. Over 1,700 partners and a seasoned professional services team help deliver faster time to value. Leading organizations in every industry plus top public sector organizations globally run Cloudera in production.
Connect With Cloudera
Follow us on Twitter: http://twitter.com/cloudera
Visit us on Facebook: http://www.facebook.com/cloudera
Join the Cloudera Community: http://cloudera.com/community
Cloudera, Cloudera's Platform for Big Data, Cloudera Enterprise Data Hub Edition, Cloudera Enterprise Flex Edition, Cloudera Enterprise Basic Edition and CDH are trademarks or registered trademarks of Cloudera Inc. in the United States, and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners.