Big Data @ Microsoft

Raghu Ramakrishnan, Microsoft



Until recently, data was gathered for well-defined objectives such as auditing, forensics, reporting and line-of-business operations; now, exploratory and predictive analysis is becoming ubiquitous, and the default increasingly is to capture and store any and all data, in anticipation of potential future strategic value. These differences in data heterogeneity, scale and usage are leading to a new generation of data management and analytic systems, where the emphasis is on supporting a wide range of very large datasets that are stored uniformly and analyzed seamlessly using whatever techniques are most appropriate, including traditional tools such as SQL and BI as well as newer tools for, e.g., machine learning and stream analytics. These new systems are necessarily based on scale-out architectures for both storage and computation.

Hadoop has become a key building block in the new generation of scale-out systems. On the storage side, HDFS has provided a cost-effective and scalable substrate for storing large heterogeneous datasets. However, as key customer and systems touch points are instrumented to log data, and Internet of Things applications become common, data in the enterprise is growing at a staggering pace. The need to leverage different storage tiers (ranging from tape to main memory) is posing new challenges, and has led to caching technologies such as Spark. On the analytics side, the emergence of resource managers such as YARN has opened the door for analytics tools to bypass the Map-Reduce layer and directly exploit shared system resources while computing close to data copies. This trend is especially significant for iterative computations such as graph analytics and machine learning, for which Map-Reduce is widely recognized to be a poor fit.

I will examine these trends, and ground the talk by discussing the Microsoft Big Data stack.


Raghu Ramakrishnan is a Technical Fellow in the Cloud and Enterprise (C&E) group at Microsoft Corp. He focuses his work on big data and integration between C&E’s cloud offerings and the Online Services Division’s platform assets. He has more than 15 years of experience in the fields of database systems, data mining, search and cloud computing.
Ramakrishnan has been chief scientist for three divisions at Yahoo! Inc. over the past five years (Audience, Cloud Platforms, Search), as well as a Yahoo! Fellow leading applied science and research teams in Yahoo! Labs. He led the science teams for major Yahoo! initiatives, including the CORE personalization project, the PNUTS geo-replicated cloud service platform and the creation of Yahoo!’s Web of Objects through Web-scale information extraction. Before joining Yahoo! in 2006, he had been a member of the computer science faculty at the University of Wisconsin-Madison since 1987, and was the founder and chief technical officer of QUIQ, a company that pioneered crowd-sourced question-answering communities.
His work in database systems has influenced query optimization in commercial database systems and the design of window functions in SQL:1999. He has written the widely used text “Database Management Systems” (with Johannes Gehrke). Ramakrishnan has received several awards, including the ACM SIGKDD Innovations Award, the ACM SIGMOD Contributions Award and the 10-Year Test-of-Time Award, a Distinguished Alumnus Award from IIT Madras, and a Packard Foundation Fellowship. He is a Fellow of the ACM and IEEE, and serves on the steering committee of the ACM Symposium on Cloud Computing and the board of directors of ACM SIGKDD. He is a past chair of ACM SIGMOD and a past member of the board of trustees of the Very Large Data Base Endowment.
He earned a bachelor’s degree in electrical engineering from the Indian Institute of Technology Madras and a doctorate in computer science from the University of Texas at Austin. Ramakrishnan and his wife, Apu, brought up their two sons in Madison, Wis., where he taught for many years. He can attest that while you might freeze there, you would do so with a smile on your face. He likes to play tennis (both the lawn and table varieties) and read fiction, and tries to stay fit, with middling success.