
Key aspects of Data Lakes

Data has become a vital element for digital companies and a key competitive advantage. However, the volume of data that organizations must now manage is highly heterogeneous and growing exponentially. This creates a need for storage and analysis solutions that offer scalability, speed, and flexibility to handle these massive data volumes. How can you store and access data quickly while remaining cost-effective? A Data Lake is a modern answer to this problem.

This series of articles will look into the concept of Data Lakes, the benefits they provide, and how we can implement them through Amazon Web Services (AWS).

What is a Data Lake?

A Data Lake is a centralized storage repository that can hold all types of structured or unstructured data at any scale, in raw format, until it is needed. When a business question arises, the relevant information can be retrieved and different types of analysis can be carried out through dashboards, visualizations, Big Data processing, and machine learning to guide better decision-making.

A Data Lake can store data as is, without having to structure it first, with little or no processing, in its native formats such as JSON, XML, CSV, or plain text. It can store any file type: images, audio, video, weblogs, data from sensors, IoT devices, social networks, and so on. Some file formats are better suited than others; Apache Parquet, for example, is a compressed columnar format that provides very efficient storage. Compression saves disk space and I/O, while the columnar layout allows the query engine to scan only the relevant columns, reducing scan time and costs.
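To make the column-pruning idea concrete, here is a minimal sketch in Python using pandas with a Parquet engine such as pyarrow (our choice of tooling, not one prescribed here):

```python
# A minimal sketch of the columnar advantage; assumes pandas plus a
# Parquet engine (e.g. pyarrow) are installed.
import pandas as pd

# Sample event data as it might arrive in a lake in raw form.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "event": ["click", "view", "click"],
    "payload": ['{"x": 1}', '{"x": 2}', '{"x": 3}'],
})

# Write it as compressed, columnar Parquet.
df.to_parquet("events.parquet", compression="snappy")

# A query engine (or pandas here) can read only the columns it needs,
# skipping the rest of the file entirely.
clicks = pd.read_parquet("events.parquet", columns=["user_id", "event"])
print(clicks)
```

Because only two of the three columns are requested on read, the bytes of the "payload" column are never touched, which is where the I/O savings come from.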

Using a distributed file system (DFS), such as AWS S3, lets you store more data at a lower cost and provides multiple benefits:

  • Data replication
  • Very high availability
  • Low costs across different price ranges and multiple storage classes, depending on the recovery time (from immediate access to several hours)
  • Retention policies, which let you specify how long to keep data before it is automatically deleted (see the sketch after this list)
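As a hedged illustration of the last two points, the following boto3 sketch attaches a lifecycle policy to a hypothetical S3 bucket; the bucket name and "raw/" prefix are placeholders:

```python
# Move objects under a hypothetical "raw/" prefix to a cheaper storage
# class after 30 days, and delete them automatically after ~5 years.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                # Slower recovery time, much lower cost per GB.
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                # Retention policy: delete automatically after ~5 years.
                "Expiration": {"Days": 1825},
            }
        ]
    },
)
```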

 


Data Lake versus Data Warehouse

Data Lakes and Data Warehouses are two different strategies for storing Big Data, and neither is tied to a specific technology. The main difference between them is that, in a Data Warehouse, the schema is predefined: you must design the schema and plan your queries in advance. Fed by multiple online transactional applications, data has to be converted via ETL (extract, transform, and load) to conform to the predefined schema of the warehouse. In contrast, a Data Lake can host structured, semi-structured, and unstructured data and has no default schema. Data is collected in its natural state, requires little or no processing when saved, and the schema is created at read time to meet the processing needs of the organization.
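As a toy illustration of schema-on-read, the following Python sketch (using pandas, one option among many) keeps records exactly as they arrived and only derives a schema when they are read for analysis:

```python
# A minimal "schema on read" illustration; assumes pandas is available.
# The raw records are stored exactly as received, with uneven fields.
import io
import pandas as pd

raw = io.StringIO(
    '{"user": "ana", "clicks": 3}\n'
    '{"user": "ben", "clicks": 5, "country": "AR"}\n'
)

# No schema was declared when the data was saved; pandas derives one
# only now, at read time, to fit this particular analysis.
df = pd.read_json(raw, lines=True)
print(df.dtypes)  # the schema, created during reading
```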

Data Lakes are a more flexible solution suited to more technical users with advanced analytical needs, such as Data Scientists, since some skill is needed to sift through the large amount of raw data and extract its meaning. A Data Warehouse focuses more on Business Analytics users, supporting business inquiries from specific internal groups (Sales, Marketing, etc.), since its data is already curated and comes from the company's operational systems. Data Lakes, in turn, often receive both relational and non-relational data from IoT devices, social media, mobile apps, and corporate applications.

When it comes to data quality, Data Warehouses are highly curated, reliable, and treated as the central version of the truth. Data Lakes, on the other hand, are less reliable, since data can arrive from any source in any condition, curated or not.

A Data Warehouse is a database optimized to analyze relational data coming from transactional systems and line-of-business applications. Warehouses are usually very expensive for large data volumes, although they offer faster query times and higher performance. Data Lakes, by contrast, are designed with low storage costs in mind.

Some of the legitimate criticisms Data Lakes have received are:

  • Data Lakes are still an emerging technology compared to the mature Data Warehouse model, which has been on the market for many years.
  • A Data Lake can become a “swamp”. If an organization has poor management and governance practices, it can lose track of what lies at the “bottom” of the lake, letting it deteriorate until it is uncontrolled and inaccessible.

Due to these differences, organizations often choose to use both a Data Warehouse and a Data Lake in a hybrid deployment, for example to add new sources or to use the Data Lake as a repository for everything that is no longer needed in the main warehouse. Data Lakes are usually an addition to, or evolution of, an organization’s current data management structure rather than a replacement. Data Analysts can use more structured views of the data to get the answers they need while, at the same time, Data Scientists can “go to the lake” and work with all the raw information as needed.

Data Lake Architecture

The physical architecture of a Data Lake may vary, since it is a strategy that can be implemented with multiple technologies and providers (Hadoop, Amazon, Microsoft Azure, Google Cloud). However, three principles set it apart from other Big Data storage methods, and together they make up its basic architecture:

  • No data is rejected. Data is loaded from multiple source systems and preserved.
  • Data is stored in an untransformed or nearly untransformed state, as received from the source.
  • Data is transformed and a schema is applied at analysis time.

While the information is largely unstructured or geared to answering specific questions, it must be organized so as to ensure that the Data Lake remains functional and healthy. Useful practices include (illustrated in the sketch after this list):

  • Tags and/or metadata for classification, which can include type, content, usage scenarios, and groups of potential users.
  • A hierarchy of files with naming conventions.
  • An indexed and searchable Data Catalog.
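As a hedged sketch of the first two practices, the following boto3 snippet writes an object under a partition-style key hierarchy and attaches classification tags; every name in it is a placeholder, not an AWS-mandated convention. A catalog service such as AWS Glue could then index such objects to cover the third point.

```python
# Upload an object using a partition-style naming convention in the key,
# plus tags for classification. All names are placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-data-lake",  # hypothetical bucket
    # File hierarchy and naming convention: source / year / month / day.
    Key="raw/weblogs/year=2020/month=06/day=15/part-000.json",
    Body=b'{"path": "/home", "status": 200}',
    # Classification metadata: type, sensitivity, and owning group.
    Tagging="type=weblog&sensitivity=low&owner=marketing",
)
```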

Conclusions

Data Lakes are becoming increasingly important to business data strategies. They respond much better to today’s reality: far larger volumes and varieties of data, higher user expectations, and a wider range of analytics, both business and predictive. Data Warehouses and Data Lakes are meant to coexist in companies that want to base their decisions on data. They are complementary, not substitutes, and can help any business better understand both its markets and its customers, as well as advance its digital transformation efforts.

Our next article will delve into how we can use Amazon Web Services and its open, secure, scalable, and cost-effective infrastructure to build Data Lakes and analytics on top of them.

Benefits of Cloud vs On Premise

More and more businesses are turning to cloud computing for their IT needs. The question is why they are opting for cloud services instead of the traditional on-premise approach. Below are the main benefits of cloud over on-premise solutions.

Flexibility and Scalability

Most businesses that use traditional on-premise solutions find it challenging and time-consuming to adopt new software or hardware projects, both in implementation and in user adoption. Scaling the IT solution up or down to match the number of users is taxing as well. With cloud solutions, setting up new software and hardware is a relatively simple process, and scaling the solution to absorb sudden changes in the number of users is straightforward and rapid to implement. This gives businesses the flexibility to seamlessly adapt their cloud solution to their current needs.

CapEx and OpEx

On-premise IT solutions are tied to capital expenditure (CapEx), while cloud services usually run on operational expenditure (OpEx). CapEx means purchasing an asset outright, whereas OpEx means incurring an ongoing, regular cost, usually tied to a contract. Operational expenses are highly transparent, since the return on investment is easy to determine, which makes managing the firm’s IT expenses more convenient.

Backup, Recovery, and Security

The cloud service provider is usually responsible for the backup, updates, and recovery of software, hardware, and data. On-premise solutions involve storing data onsite, which is a risk, since it is susceptible to physical and digital attack or damage. Cloud service providers are vulnerable to the same risks, but they have put specific measures in place to keep data highly secure. A good cloud provider also keeps multiple backups of hardware, software, and client data, so that if something were to happen to one of its storage locations or to a customer’s data, the lost or damaged elements could be recovered.

Infrastructure as a Service

Businesses with on-premise solutions typically install their own IT infrastructure on their premises. Managing onsite infrastructure is time-consuming for the IT department, which must dedicate part of its resources to it. Furthermore, planning for the future is difficult: if a firm has 300 employees, should the server capacity cater for 300 users, or for more or fewer in case the company restructures or takes on a new project six months from now? Firms with cloud-based solutions can instead use Infrastructure as a Service (IaaS), which allows them to lease IT resources such as storage space and processing capacity.
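As a hedged sketch of that idea, the following boto3 snippet leases two virtual servers and hands them back when they are no longer needed; the image ID and instance type are placeholders:

```python
# Lease compute capacity on demand instead of sizing physical servers
# around a guess about future headcount. All IDs are placeholders.
import boto3

ec2 = boto3.client("ec2")

# Lease two virtual servers now...
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical machine image
    InstanceType="t3.medium",
    MinCount=2,
    MaxCount=2,
)
instance_ids = [i["InstanceId"] for i in resp["Instances"]]

# ...and hand them back when the project ends, stopping the cost with them.
ec2.terminate_instances(InstanceIds=instance_ids)
```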

IaaS also frees up time for the IT department, which can focus on meeting the firm’s business objectives. And because IaaS is scalable, the company can increase or decrease the leased infrastructure resources to meet its current and future needs in an easy, quick, and convenient way.