Exploiting the Data Code Duality: Applying Modern Software Development Practices to Data with Dali

apply() 2021 - 10 minutes

Most large software projects in existence today are the result of the collaborative efforts of hundreds or even thousands of developers. These projects consist of millions of lines of code and leverage a plethora of reusable libraries and services provided by third parties. Projects of this scale would not be possible without the tools and processes that now define the practice of modern software development: language support for decoupling the interface from the implementation, version control, semantic versioning of artifacts, dependency management, issue tracking, peer review of code, integration testing, and the ability to tie all of these things together with comprehensive code search and dependency tracking mechanisms.

We have observed similar forces at play in the world of big data. At LinkedIn, the number of people who produce and consume data, the number of datasets they need to manage, and the rate at which these datasets change are all growing at an exponential rate. This has resulted in a host of problems: rampant duplication of business logic and data, increasingly fragile and hard-to-maintain data pipelines, and schemas littered with deprecated fields because backward-incompatible changes are prohibitively expensive to make. To cope with these challenges, the team built Dali, a unified data abstraction layer for offline (Hadoop, Spark, Presto, etc.) and nearline (Kafka, Samza) systems that enables data engineers to benefit from the same processes and infrastructure that are already used by LinkedIn’s software engineers.

In this talk, Carl explains how Dali employs virtual SQL views to decouple the API of a dataset from the details of its implementation, describes how view versioning and dependency tracking make it possible to ship backward-incompatible changes without breaking downstream consumers, and reviews the ways Dali has been integrated with the rest of LinkedIn’s software development ecosystem. Finally, he discusses how Dali has been leveraged in several company-wide initiatives, including the redesign of the LinkedIn mobile app and GDPR.
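
The idea of versioned views can be illustrated with a small, generic sketch. It is not Dali's API: it uses Python's built-in sqlite3 module, and the table, view, and column names (page_views_raw, page_views_v1, page_views_v2, page_key, event_time) are hypothetical, chosen only for illustration. Consumers query a view rather than the physical table, and a backward-incompatible rename is published as a new view version while the old version stays in place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Physical table: the "implementation" that consumers never query directly.
# All names here are hypothetical and not taken from Dali.
conn.execute(
    "CREATE TABLE page_views_raw (member_id INTEGER, page TEXT, ts INTEGER)"
)
conn.executemany(
    "INSERT INTO page_views_raw VALUES (?, ?, ?)",
    [(1, "feed", 100), (2, "profile", 200)],
)

# Version 1 of the view: the public "API" of the dataset.
conn.execute(
    "CREATE VIEW page_views_v1 AS "
    "SELECT member_id, page, ts FROM page_views_raw"
)

# A backward-incompatible change (renamed columns) is published as a
# new version. Version 1 is left untouched, so existing consumers keep
# working and can migrate on their own schedule.
conn.execute(
    "CREATE VIEW page_views_v2 AS "
    "SELECT member_id, page AS page_key, ts AS event_time "
    "FROM page_views_raw"
)


def column_names(view):
    """Return the column names a view exposes, without reading any rows."""
    cur = conn.execute(f"SELECT * FROM {view} LIMIT 0")
    return [d[0] for d in cur.description]


# An existing consumer pinned to v1 and a migrated consumer on v2.
v1_rows = conn.execute(
    "SELECT member_id, page FROM page_views_v1 ORDER BY member_id"
).fetchall()
v2_rows = conn.execute(
    "SELECT member_id, page_key FROM page_views_v2 ORDER BY member_id"
).fetchall()

print(column_names("page_views_v1"))
print(column_names("page_views_v2"))
```

Both consumers read the same underlying rows; only the contract they depend on differs. This mirrors how a library can ship a new major version without forcing every caller to upgrade at once.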


Carl Steinbach

Senior Staff Software Engineer

LinkedIn

Carl Steinbach is a software engineer and member of the Big Data Platform group at LinkedIn. He is the tech lead for the Grid Platform team and the architect of Dali, LinkedIn’s unified, virtualized data access layer for batch analytics. Before joining LinkedIn, Carl was an early employee at Cloudera. He is an ASF member and former PMC chair of the Apache Hive project.