The emergence of ‘data lakes’ has been all over the tech press over the last six months. As interest in big data methods gains traction, it’s a concept data and analytics professionals need to be able to explain accurately to customers, since many non-IT specialists are unsure what the term really means. The man who coined it, Pentaho CTO James Dixon, compares a data warehouse to a prepackaged bottle of water and a data lake to a natural body of water that users can dip or dive into at will.
We weigh up the more established data warehouse against the shiny new data lake to help you make an informed choice about your or your clients’ big data management:
A data warehouse excludes data it doesn’t consider relevant, spending considerable time analysing sources and profiling data in the process. This saves expensive disk storage space and keeps the warehouse performing as efficiently as possible, but it is far from perfect – it can easily exclude data that might be useful in the future. A data lake, on the other hand, retains ALL data – structured, semi-structured and unstructured – indefinitely, enabling professionals to go back to any point in time to perform an analysis. It can do this because it runs on commodity, off-the-shelf servers and economical storage, so expanding a data lake to petabytes becomes much more affordable.
While data warehouses can only store traditional data types that have already been processed (known as a ‘schema on write’ approach), data lakes are much more flexible. A lake can store data such as sensor readings, social network activity and web server logs – information that data warehouses have to ignore. It only gives the data its ‘shape’ when you’re ready to use it (a ‘schema on read’ approach), a huge draw for those of you replacing an existing system or establishing a new one.
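The schema-on-read idea can be sketched in a few lines of Python – a minimal, hypothetical illustration (the field names and records are invented, not from any particular product) in which raw records are stored untouched and a ‘shape’ is imposed only at the moment of reading:

```python
import json
import io

# Raw events land in the 'lake' exactly as produced – no upfront schema.
# Note the second record has an extra field and a missing one; a lake
# accepts it anyway, where a schema-on-write store would reject it.
raw_events = io.StringIO(
    '{"user": "ann", "action": "click", "ts": 1700000000}\n'
    '{"user": "bob", "ts": 1700000005, "device": "mobile"}\n'
)

def read_with_schema(stream, fields):
    """Schema on read: project each raw record onto the requested
    fields at query time, defaulting to None where a field is absent."""
    for line in stream:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# The 'schema' exists only in this call – change it freely next time.
rows = list(read_with_schema(raw_events, ["user", "action"]))
```

Each row now conforms to whatever schema the analyst chose at read time, even though the underlying store never enforced one – which is precisely why the same raw data can later be re-read with a completely different shape.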
Although data warehouse technology is advancing rapidly, warehouses remain difficult and time-consuming to change – munching through developer hours and leading to waiting times many businesses can ill afford. Because a data lake stores everything in its raw form and is accessible to all users, an analytics professional can explore the data in new ways relatively easily and, if necessary, develop a more user-friendly schema – one that can be discarded with no wasted resources if an exploration turns out to be a dead end.
While a data warehouse is usually ideal for the majority of ‘operational’ users within an organisation, there often remains a considerable pool of more advanced users who want to go beyond its capabilities. Data scientists and analysts might want to use data that’s not in the source systems, or conduct deeper analyses that mix different types of data to answer new questions, and a warehouse simply isn’t up to scratch. A lake, though, can be used by all – those who want to access existing, more structured data and those who want to work more experimentally with much larger data sets.