Data Council Blog

Halo Tech - Featured Startup SF '18

In this blog series leading up to our SF18 conference, we invite our featured startups to tell us more about their data engineering challenges. Today, we speak with Halo Tech, an early-stage startup that analyzes complex data to accelerate medical advancements.

PipelineAI - Featured Startup SF '18

In this blog series leading up to our SF18 conference, we invite our featured startups to tell us more about their data engineering challenges. Today, we speak with PipelineAI, a startup helping you to continuously train, optimize and host deep learning models at scale.

Instrumental - Featured Startup SF '18

In this blog series leading up to our SF18 conference, we invite our featured startups to tell us more about their data engineering challenges. Today, we speak with Instrumental, an early-stage company building data systems to monitor and improve manufacturing line performance.

Pachyderm - Featured Startup SF '18

In this blog series leading up to our SF18 conference, we invite our featured startups to tell us more about their data engineering challenges. Today, we speak with Pachyderm, an early-stage company building a data platform for data science.

Redshift versus Snowflake versus BigQuery / Part 1: Performance

Fivetran is a data pipeline that syncs data from apps, databases and file stores into our customers’ data warehouses. The question we get asked most often is “what data warehouse should I choose?” In order to better answer this question, we’ve performed a benchmark comparing the speed and cost of three of the most popular data warehouses — Amazon Redshift, Google BigQuery, and Snowflake.

Functional Data Engineering — a modern paradigm for batch data processing


Batch data processing — historically known as ETL — is extremely challenging. It’s time-consuming, brittle, and often unrewarding. Not only that, it’s hard to operate, evolve, and troubleshoot.

In this post, we’ll explore how applying the functional programming paradigm to data engineering can bring a lot of clarity to the process. This post distills fragments of wisdom accumulated while working at Yahoo, Facebook, Airbnb and Lyft, with the perspective of well over a decade of data warehousing and data engineering experience.
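As a rough, hypothetical illustration (not taken from the post itself) of what that functional framing tends to mean in practice, a batch task can be written as a pure transform plus an idempotent partition overwrite, so reruns and backfills are reproducible:

```python
# Hypothetical sketch of the kind of pattern usually meant by "functional"
# batch processing: each task is a pure function of its inputs and a date
# partition, and it idempotently overwrites that partition on every run.
from pathlib import Path
import json


def build_daily_summary(events: list, ds: str) -> dict:
    """Pure transform: output depends only on the inputs, never on current state."""
    return {
        "ds": ds,
        "events": len(events),
        "revenue": sum(e["amount"] for e in events),
    }


def write_partition(summary: dict, warehouse: Path, ds: str) -> None:
    """Idempotent load: re-running for the same date replaces the partition wholesale."""
    partition = warehouse / f"ds={ds}"
    partition.mkdir(parents=True, exist_ok=True)
    (partition / "summary.json").write_text(json.dumps(summary))


if __name__ == "__main__":
    events = [{"amount": 9.99}, {"amount": 4.50}]
    summary = build_daily_summary(events, "2018-04-17")
    write_partition(summary, Path("/tmp/warehouse"), "2018-04-17")
```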

ETL and the Question of Happiness

No one is happy with fragile ETL pipelines. But it doesn't need to be that way.

One might surmise that data "analysis" is, first and foremost, about data "access." It goes without saying that someone in the analyst's role must first obtain access to the data they wish to analyze. And with data spread all over the inside, and now the outside, of the enterprise (think of your on-premises data stores plus all the cloud and SaaS vendors you're currently using), modern-day analysts face deeper challenges than ever before in obtaining access to the data they need.

And of course, techno-philosophical concepts like "democratizing access to data" do nothing at all to help one overcome any of the actual technical integration challenges required to practically enable such unfettered access to one's data.

Data Science in the Media

The past few years have been an interesting time for data science everywhere, and the media in particular! We’ve seen some incredible new technologies emerge, like open-source machine learning platforms, as well as machine learning services. These developments have opened the door for new consumer products, like conversational AIs, and new technologies in the media and advertising industries.

How Data Has Evolved at The New York Times

Whether you love or hate their paywall, the Times successfully balances competing business pressures using a deep view of its data.

Since our initial DataEngConf in 2015, The New York Times has been a key supporter of the conference. The very first DataEngConf talk was a keynote given by Chris Wiggins, the Times' Chief Data Scientist, who presented a broad yet fascinating perspective on "Data Science at The New York Times" (video here).

In the years since, we've had deeply technical talks from both data engineers and data scientists at the Times, and I'm excited that their involvement in DataEngConf this year is as large as it's ever been.

How Dremio Uses Apache Arrow to Increase Performance

(Image source: http://arrow.apache.org/)

What if all the best open-source data platforms could easily share ("ahem") data with each other?

As data has proliferated and open-source software (OSS) has continued to dominate both the stacks and the business models of the top tech companies in the world, the number of different types of data platforms and tools we've seen emerge has accelerated.

Having a hard time keeping up with the differences between Kudu, Parquet, Cassandra, HBase, Spark, Drill and Impala? You're not alone, and this is one of the reasons we bring top OSS contributors to these platforms together to share their work at DataEngConf.

But there's one new innovation that attempts to bind all the above projects together by enabling them to share a common memory format. It's a new top-level Apache project called Arrow that aims to dramatically decrease the amount of wasted computation that occurs when serializing and deserializing in-memory objects. That serialization pattern is common in analytics applications that pass data between systems, each of which has its own internal memory representation.
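To make the idea concrete, here's a minimal, hypothetical sketch (not from the Dremio post itself) using Python's pyarrow library: a table is built in Arrow's columnar format, written to the Arrow IPC stream format, and read back without rebuilding the rows in a different in-memory representation.

```python
# Minimal sketch (assumes the pyarrow package is installed; pandas is needed
# only for the final to_pandas() call). Illustrates Arrow's core idea: data in
# a common columnar layout can be handed between systems without a costly
# row-by-row serialize/deserialize step.
import pyarrow as pa

# Build a table directly in Arrow's columnar memory format.
table = pa.table({
    "user_id": [1, 2, 3],
    "score": [0.9, 0.4, 0.7],
})

# Write it to the Arrow IPC stream format -- the same byte layout that other
# Arrow-aware engines can consume directly.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Reading back is largely a matter of reinterpreting those same buffers,
# not reparsing the data into a different internal structure.
reader = pa.ipc.open_stream(buf)
roundtripped = reader.read_all()
print(roundtripped.to_pandas())
```

The payoff grows with data volume: because the IPC payload is essentially the columnar buffers themselves, an Arrow-aware reader can map them directly instead of paying a conversion cost on every exchange.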