Join us for a live tech talk and learn about architecting for data quality in the lakehouse with Delta Lake and PySpark. After the presentation, we’ll have time for questions. We're excited to have you join us!
From null values and duplicate rows to modeling errors and schema changes, data can break for millions of reasons. To combat this, teams are increasingly adopting best practices from DevOps and software engineering to identify, resolve, and even prevent this "data downtime" from happening in the first place. Join Prateek Chawla and Ryan Kearns as they walk through how data and ML engineers can solve for data quality across the data lakehouse by applying data observability techniques. Topics to be discussed include how to optimize for data reliability across your lakehouse's metadata, storage, and query engine tiers; how to build your own data observability monitors with PySpark; and the role of tools like Delta Lake in scaling this design.
Founding Engineer and Technical Lead
Founding Data Scientist
Prateek Chawla is a founding engineer and technical lead at Monte Carlo, where he drives the technical strategy for the company's data observability platform. Previously, he served as a technical lead at Barracuda, working on email fraud prevention technologies. He graduated summa cum laude with a B.S. in Computer Science and Engineering from the University of California, Santa Cruz. In his free time, Prateek enjoys watching Broadway shows, flying airplanes, reading, and exploring new places.
Ryan Kearns is a founding data scientist at Monte Carlo, where he develops machine learning algorithms for the company’s data observability platform. Together with CEO and co-founder Barr Moses, he taught the first-ever course on data observability with O'Reilly Media, the first tutorial on the subject using out-of-the-box SQL. He's currently a student at Stanford University studying computer science and philosophy. In his spare time, Ryan loves traveling, trying new restaurants, and skiing.
Community Program Manager
Denny Lee is a Developer Advocate at Databricks. He is a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments. He also has a Master's in Biomedical Informatics from Oregon Health & Science University and has architected and implemented powerful data solutions for enterprise healthcare customers. His current technical focuses include distributed systems, Apache Spark, deep learning, machine learning, and genomics.