You have your Hadoop cluster, and you are ready to fill it up with data, but wait: Which format should you use to store your data? Should you store it in Plain Text, Sequence File, Avro, or Parquet? (And should you compress it?) HDFS or Block/Object Store? Which query engine? This talk will take a closer look at some of the trade-offs, and will cover the How, Why, and When of choosing one format over another.
Picking your distribution and platform is just the first of many decisions you need to make in order to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance. Each of the data formats has different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, we have observed performance differences on the order of 25x between Parquet and Plain Text files for certain workloads. However, it isn’t the case that one is always better than the others. Adding to the choice of data format is the question of which query engine works best for a given format and workload. And let’s not forget the question: “Do I store that in HDFS or a block/object store?”
This talk will take a closer look at some of these trade-offs. Attendees will learn, based on a few real-world use cases, the How, Why, and When of choosing one format over another (and whether your choice of query engine affects that decision). Covering the four major data formats (Plain Text, Sequence Files, Avro, and Parquet), we will provide insight into what they are and how best to use and store them in HDFS or a block/object store.
A leading expert on big data architecture and Hadoop, Stephen brings over 20 years of experience creating scalable, high-availability data and application solutions. A veteran of WalmartLabs, Sun, and Yahoo!, Stephen leads data architecture and infrastructure.
With a background in computer engineering and visual analytics, Silvia has worked on several projects helping clients explore and analyze their data. She is interested in building and optimizing the infrastructure and data pipelines used to gather insights from various datasets.