Data Council Blog

Introducing our Data Startups Track

Machine Learning, Neural Nets, "AI" and Computer Vision are changing the world. Discover the data startups that matter.

As an engineer turned founder, I've been passionate for years about helping other technical founders succeed. Founders face a unique set of challenges, and building support communities that help them overcome those obstacles moves innovation forward.

More broadly, I'm also a proponent of bringing engineers together - hence our efforts in the data community through meetups, our conference series and other, smaller events for engineers, data scientists and CTOs, organized through Hakka Labs over the past 5 years.

This is why I'm so excited to bring these two efforts - supporting startups and supporting the data community - together at our upcoming DataEngConf NYC.

To Shard or Not to Shard (PostgreSQL)

Wouldn't the world be a simpler place if we could easily scale our RDBMS? (gasp!)

What do you do when you need to scale out your RDBMS to support greater data volumes than you originally anticipated? Traditionally, you would either scale vertically, by putting the database on more powerful (and costlier) machines, or shard the data across multiple workers.
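
To make the sharding option concrete, here is a minimal sketch of hash-based row routing in Python. It is an illustration only, not how PostgreSQL or any particular extension does it: the `customer_id` shard key, the `shard_for` and `route` helpers and the shard count are all assumptions made up for this example.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard key to a shard index with a stable hash.

    hashlib is used instead of the built-in hash() because Python
    randomizes string hashes per process, which would send the same
    key to different shards on different runs.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(rows, num_shards):
    """Group rows by the shard that owns their (hypothetical) customer_id."""
    shards = {i: [] for i in range(num_shards)}
    for row in rows:
        shards[shard_for(row["customer_id"], num_shards)].append(row)
    return shards


if __name__ == "__main__":
    rows = [{"customer_id": f"cust-{i}", "amount": i} for i in range(10)]
    for shard_id, owned in route(rows, num_shards=4).items():
        print(shard_id, [r["customer_id"] for r in owned])
```

Note the trade-off in this simple modulo scheme: changing the number of shards remaps most keys, which is one reason real systems tend to reach for more elaborate placement strategies.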

Rolling Your Own Distributed Column Store

When solving your customers' technical challenges pushes you to break the rules

A re-wording of one of the key maxims for startup success could be "KISS" - "keep it simple, stupid." If you've ever run your own startup, you also know the mantras of "focus" and "fail fast," and the critical reminder of how your product should be a "pain-killer not a vitamin."

How Big Data Can Help Improve the Meteorological Risk Models That Are Out of Date

According to a recent article published in The New York Times, water damage from Hurricane Harvey extended far beyond flood zones. Now that the rescue efforts are underway, it’s clear that much of the damage occurred outside of the typical boundaries drawn on official FEMA flood maps.

A Day in the Life: What's it like Being an Engineer at Stripe?

Alyssa Frazee tells us about the unicorn data skills she's honed on the job.

One thing that Alyssa Frazee loves about her work at Stripe is that, like someone with traditional data science skills, she gets to build machine learning models. "Oh, the rapture," cries Alyssa the data scientist!

Rebuilding Open Source Analytics @ Airbnb

How open source allowed Airbnb to rebuild their expensive BI tool in less than one developer year

Granted, Maxime Beauchemin isn't your average data engineer. As any Bay Area engineer worth their salt knows, anyone who worked on data at Facebook receives (deserves) a certain outsized respect from their peers.

Pushing Kafka to the Limit at Heroku

How Everyone's Favorite PaaS Operates Kafka at Scale

Scale presents unique challenges for engineers, particularly those at companies that have the largest number of users throwing off the most data exhaust, resulting in the fattest data pipelines with the gnarliest problems. Take Heroku, arguably the most popular platform as a service (PaaS). Last year it decided to offer Apache Kafka to its customers as a hosted service, and quickly realized it would need to support a large number of distinct users, each with varying use cases. This put the team on a challenging path: minimizing the operational headaches that inherently come with running this kind of infrastructure at scale.

Fighting Fraud in Cryptocurrency using Machine Learning

Coinbase is on the front lines of discovering advanced cryptocurrency and payment fraud techniques. Hear about how they use machine learning to help them fight the war.

Building a Column-Oriented, Distributed Data Store for Analytics - The Story of Druid

Druid is a modern data store built for analytics use cases. As the volume of data has exploded and companies have sought deeper insights from it, ad-hoc analytics have become difficult because more data is buried in distributed systems like Hadoop & Spark. The query model for these systems can result in long latencies, making them sub-optimal for interactive analytics applications.

How to Build a Data Pipeline That Handles Hundreds of Different Inputs

How many different file formats does your ETL system need to parse? For many data pipelines, several well-defined formats will suffice. Things break, and at times require manual intervention, but not so often that a couple of engineers can't keep tabs on the system and keep things running relatively smoothly.