The article Key Concepts about Data Lakes delved into the importance of Data Lakes, their architecture and how they compare to Data Warehouses. This article will focus on deployment using Amazon Web Services (AWS), Amazon’s cloud platform. We will look into the overall flow, the different services available and, finally, AWS Lake Formation, a tool specially designed to facilitate this task.
Data Lakes support the needs of our applications and analytics, without the need to constantly worry about increasing storage and computing resources as the business grows and the data volume increases. However, there is no magic formula for creating them. Generally, they involve dozens of technologies, tools and environments. The diagram below shows the overall flow of data, from collection, storage and processing, to the use of analytics via Machine Learning and Business Intelligence techniques.
Services supported by AWS
AWS provides a comprehensive set of managed services that help build Data Lakes. Proper planning and design are necessary to migrate a data ecosystem to the Cloud, and understanding Amazon’s offerings is critical. Below are only a few of the most important tools at each stage of the flow.
The first step is to analyze the goals and benefits you want to achieve with the implementation of an AWS-based Data Lake. Once the plan is designed, data must be migrated to the Cloud, taking into account its volume. You can accelerate this migration with services such as Snowball and Snowcone (edge devices for storage and computing), or simplify and automate transfers with DataSync and Transfer Family.
In this step, you can operate in 2 modes: Batch or Streaming.
In Batch Loading, AWS Glue is used to extract information from different sources at periodic intervals and move it into the Data Lake. This usually involves only minimal transformation (ELT), such as compression or data aggregation.
For Streaming, data generated continuously from multiple sources, such as log files, telemetry, mobile applications, IoT sensors and social networks, is collected. It can be processed within a rolling time window and channeled into the Data Lake.
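The idea of processing a stream in time windows can be illustrated with a minimal sketch. It assumes fixed, non-overlapping 60-second windows for simplicity, and the event tuples and the `window_counts` helper are hypothetical names used only for illustration, not part of any AWS service.

```python
from collections import defaultdict


def window_counts(events, window_seconds=60):
    """Group (timestamp, source) events into fixed windows and count per source."""
    counts = defaultdict(lambda: defaultdict(int))
    for timestamp, source in events:
        # Align each event to the start of the window it falls into.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[window_start][source] += 1
    # Convert nested defaultdicts to plain dicts, ordered by window start.
    return {start: dict(per_source) for start, per_source in sorted(counts.items())}


# Hypothetical events: (seconds since start of stream, source name).
events = [(0, "iot"), (15, "mobile"), (59, "iot"), (60, "iot"), (125, "logs")]
batches = window_counts(events, window_seconds=60)
```

Each entry in `batches` represents one window that could then be written to the Data Lake as a single unit.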
Real-time analytics provides useful information for critical business processes that rely on streaming data analysis, such as Machine Learning algorithms for anomaly detection. Amazon Kinesis Data Firehose helps perform this process from hundreds of thousands of sources in real time, rather than uploading data for hours and processing it at a later stage.
Storage and Processing
The core service in any AWS Data Lake is Amazon S3, which provides highly scalable storage at low cost with strong security, offering a comprehensive solution for different processing models. It can store virtually unlimited data and any type of file as an object. It allows you to create logical tables and hierarchies from folders (for example, by year, month, and day), so that large volumes of data can be partitioned. It also offers a wide set of security functions, such as access controls and policies, encryption at rest, logging and monitoring, among others. Once the data is uploaded, it can be used anytime, anywhere, to address any need. The service supports a wide range of storage classes (such as Standard, Intelligent-Tiering and Infrequent Access), each with different characteristics in terms of retrieval time and cost.
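As a rough illustration of the year/month/day hierarchy mentioned above, the sketch below builds an object key for a given date. The `key=value` folder naming is an assumption (a common convention for partitioned data), and `partitioned_key`, the table name and the file name are hypothetical.

```python
from datetime import date


def partitioned_key(table, day, filename):
    """Build an object key with one folder level per date part."""
    return (
        f"{table}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


# Hypothetical table and file names, used only for illustration.
key = partitioned_key("sales", date(2021, 3, 7), "orders.parquet")
```

With a layout like this, a query restricted to one day only needs to read the objects under that day's prefix.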
Amazon S3 Glacier is a service for secure archiving and backup at a fraction of the cost of standard S3 storage. Retrievals can take from a few minutes to 12 hours, depending on the storage class selected.
AWS Glue is a managed ETL and Data Catalog service that helps find and catalog metadata for faster queries and searches. Once Glue points to the data stored in S3, it analyzes it using automatic crawlers and records its schemas. Glue is designed to perform transformations (ETL/ELT) using Apache Spark, with scripts written in Python or Scala. Glue is serverless, so there is no infrastructure to configure or manage.
If the contents of the Data Lake need to be indexed, Amazon DynamoDB (NoSQL database) and Amazon Elasticsearch Service (text search engine) can be used. In addition, AWS Lambda functions, triggered directly by S3 in response to events such as the upload of new files, can start processes that keep your Catalog up to date.
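A minimal sketch of such an event-triggered function is shown below. It only parses the S3 event notification (bucket name and URL-encoded object key) and records the object in an in-memory dictionary; `CATALOG` is a hypothetical stand-in for a real index or catalog, and the bucket and key in the sample event are invented.

```python
from urllib.parse import unquote_plus

# Hypothetical in-memory stand-in for a real catalog or index.
CATALOG = {}


def handler(event, context=None):
    """Register every object mentioned in an S3 event notification."""
    added = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)
        uri = f"s3://{bucket}/{key}"
        CATALOG[uri] = {"size_bytes": size}
        added.append(uri)
    return added


# Invented sample event with the fields the handler reads.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-data-lake"},
                "object": {"key": "raw/2021/03/07/events+1.json", "size": 2048},
            }
        }
    ]
}
result = handler(sample_event)
```

In a real deployment the dictionary update would be replaced by a call to whichever index or catalog the Data Lake uses.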
Analytics for Machine Learning and Business Intelligence
There are several options for analyzing the massive volume of information held in a Data Lake.
Once data has been catalogued by Glue, different services can be used in the client layer for analytics, visualizations, dashboards, etc. Some of these are Amazon Athena, an interactive serverless service for ad hoc exploratory queries using standard SQL; Amazon Redshift, a Data Warehouse service for more structured queries and reports; Amazon EMR (Amazon Elastic MapReduce), a managed system for Big Data processing tools such as Apache Hadoop, Spark, Flink, among others; and Amazon SageMaker, a Machine Learning platform that allows developers to create, train and deploy Machine Learning models in the cloud.
With Athena and Redshift Spectrum, you can query the Data Lake in S3 directly using SQL, relying on the AWS Glue Data Catalog, which contains the metadata (logical tables, schemas, versions, etc.). The most important aspect is that you only pay for the queries executed, depending on the volume of data scanned. Therefore, you can achieve significant performance and cost improvements by compressing, partitioning, or converting data into a columnar format (such as Apache Parquet), as each of those operations reduces the amount of data Athena or Redshift Spectrum must read.
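The effect of paying per volume scanned can be sketched with simple arithmetic. The price per terabyte, the 4:1 compression ratio and the column counts below are illustrative assumptions, not quoted AWS prices or measured figures.

```python
def scan_cost(bytes_scanned, price_per_tb=5.0):
    """Cost of a query billed by bytes scanned (price_per_tb is an assumed rate)."""
    return bytes_scanned / 1024**4 * price_per_tb


# Assumed scenario: 1 TB of raw CSV, fully scanned by the query.
raw_bytes = 1024**4
cost_full = scan_cost(raw_bytes)

# Same data in a compressed columnar format: the query reads 2 of 10
# equally sized columns, and compression shrinks them 4:1 (both assumed).
columnar_bytes = raw_bytes * (2 / 10) / 4
cost_columnar = scan_cost(columnar_bytes)
```

Under these assumptions the columnar query scans one twentieth of the data, and its cost falls in the same proportion.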
AWS Lake Formation
Building a Data Lake is a complex, multi-step task, including:
Create the necessary buckets in S3 to store data with the applicable policies.
Create the ETLs that will carry out the necessary transformations, and administer the corresponding audit policies and permissions.
Allow Analytics services to access Data Lake information.
AWS Lake Formation is an attractive option that allows users (both beginners and experts) to immediately start with a basic Data Lake, eliminating complex technical details. It allows real-time monitoring from a single point, without having to go through multiple services. One strong aspect is cost: AWS Lake Formation is free. You will only be charged for the services you invoke from it.
It allows loading from various sources, monitoring flows, configuring partitions, enabling encryption and key management, defining and monitoring transformation jobs, reorganizing data in columnar format, configuring access control, eliminating redundant data, matching linked records, and granting and auditing access.
These 2 articles looked into the definition of Data Lakes, what makes them different from Data Warehouses and how they can be deployed on the Amazon platform. Total cost of ownership (TCO) can be significantly reduced by moving your data ecosystem to the cloud. Providers such as AWS add new services continuously, while improving existing ones and reducing costs.
Huenei can help you plan and execute your Data Lake initiative in AWS, in the process of migrating your data to the cloud and implementing the analytics tools necessary for your organization.
Data has become a vital element for digital companies, and a key competitive advantage. However, the volume of data that organizations currently have to manage is very heterogeneous and its growth rate is exponential. This creates a need for storage and analysis solutions that offer scalability, speed and flexibility to help manage these massive data volumes. How can you store and access data quickly while maintaining cost effectiveness? A Data Lake is a modern answer to this problem.
This series of articles will look into the concept of Data Lakes, the benefits they provide, and how we can implement them through Amazon Web Services (AWS).
What is a Data Lake?
A Data Lake is a centralized storage repository that can store all types of structured or unstructured data at any scale in raw format until needed. When a business question arises, the relevant information can be obtained and different types of analysis can be carried out through dashboards, visualizations, Big Data processing and machine learning to guide better decision-making.
A Data Lake can store data as is, without having to structure it first, with little or no processing, in its native formats, such as JSON, XML, CSV, or text. It can also store other types of content: images, audio, video, weblogs, data from sensors, IoT devices, social networks, etc. Some file formats are better suited than others, such as Apache Parquet, a compressed columnar format that provides very efficient storage. Compression saves disk space and I/O, while the columnar layout allows the query engine to scan only the relevant columns, reducing query time and costs.
Using a distributed storage service, such as Amazon S3, makes it possible to store more data at a lower cost, providing multiple benefits:
Very high availability
Low costs at different price ranges and multiple types of storage depending on the recovery time (from immediate access to several hours)
Retention policies, which let you specify how long to keep data before it is automatically deleted
Data Lake versus Data Warehouse
Data Lakes and Data Warehouses are two different strategies for storing Big Data, in both cases without being tied to a specific technology. The main difference between them is that, in a Data Warehouse, the data schema is pre-established; you must create a schema and plan your queries in advance. Fed by multiple online transactional applications, data has to be converted via ETL (extract, transform and load) to conform to the predefined schema in the data warehouse. In contrast, a Data Lake can host structured, semi-structured, and unstructured data and has no predefined schema. Data is collected in its natural state, requires little or no processing when saved, and the schema is created at read time to meet the processing needs of the organization.
Data Lakes are a more flexible solution suited to users with more technical profiles and advanced analytical needs, such as Data Scientists, since a certain level of skill is needed to sort through the large amount of raw data and extract its meaning. A data warehouse focuses more on Business Analytics users, supporting business queries from specific internal groups (Sales, Marketing, etc.), because its data is already curated and comes from the company’s operational systems. In turn, Data Lakes often receive both relational and non-relational data from IoT devices, social media, mobile apps, and corporate apps.
When it comes to data quality, Data Warehouses are highly curated, reliable, and considered the single source of truth. On the other hand, Data Lakes are less reliable since data could come from any source in any condition, be it curated or not.
A Data Warehouse is a database optimized to analyze relational data coming from transactional systems and line-of-business applications. Data Warehouses are usually very expensive for large volumes of data, although they offer faster query times and higher performance. Data Lakes, by contrast, are designed with a low storage cost in mind.
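The difference can be made concrete with a minimal sketch of applying structure only when data is read. The raw records, the field names and the `read_with_schema` helper are hypothetical; the point is only that the stored lines are left untouched and each consumer chooses the fields and types it needs.

```python
import json

# Raw records stored as-is; note that they do not share the same fields.
RAW_LINES = [
    '{"user": "ana", "amount": "19.90", "ts": "2021-03-07"}',
    '{"user": "luis", "amount": "5", "extra": {"coupon": "X1"}}',
]


def read_with_schema(lines, schema):
    """Parse raw JSON lines, keeping only the fields in `schema` and casting them."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field_name, cast in schema.items():
            value = record.get(field_name)
            # Missing fields become None instead of failing the whole read.
            row[field_name] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# One consumer's view of the data: only user and amount, amount as a number.
sales_view = read_with_schema(RAW_LINES, {"user": str, "amount": float})
```

A different consumer could read the same lines with another schema, for example one that only extracts the timestamp, without any change to what is stored.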
Some of the legitimate criticism Data Lakes have received is:
It is still an emerging technology compared to the strong maturity model of a Data Warehouse, which has been in the market for several years.
Data Lakes could become a “swamp”. If an organization has poor management and governance practices, it can lose track of what exists at the “bottom” of the lake, causing it to deteriorate and making it uncontrolled and inaccessible.
Due to these differences, organizations can choose to use both a Data Warehouse and a Data Lake in a hybrid deployment. One possible reason would be adding new sources or using the Data Lake as a repository for everything that is no longer needed in the main data warehouse. Data Lakes are often an addition to, or an evolution of, an organization’s current data management structure rather than a replacement. Data Analysts can use more structured views of the data to get the answers they need and, at the same time, Data Scientists can “go to the lake” and work with all the raw information as necessary.
Data Lake Architecture
The physical architecture of a Data Lake may vary, since it is a strategy that can be implemented with multiple technologies and providers (Hadoop, Amazon, Microsoft Azure, Google Cloud). However, there are 3 principles that make it stand out from other Big Data storage methods, and they make up its basic architecture:
No data is rejected. It is loaded from multiple source systems and preserved.
Data is stored in an untransformed or nearly untransformed condition, as received from the source.
Data is transformed and a schema is applied during analysis.
While information is largely unstructured and not geared to answering specific questions, it must be organized so as to ensure that the Data Lake is functional and healthy. Some of the features that help with this include:
Tags and/or metadata for classification, which can include type, content, usage scenarios, and groups of potential users.
A hierarchy of files with naming conventions.
An indexed and searchable Data Catalog.
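The three features above can be sketched together as a small in-memory catalog. `CatalogEntry`, `DataCatalog`, the paths and the tags are all hypothetical names chosen for illustration; a real Data Lake would use a dedicated catalog service rather than a Python object.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    path: str                 # location following the lake's naming convention
    data_type: str            # e.g. "parquet" or "json"
    tags: set = field(default_factory=set)


class DataCatalog:
    """Minimal searchable catalog that indexes entries by tag."""

    def __init__(self):
        self._entries = []
        self._by_tag = {}

    def register(self, entry):
        self._entries.append(entry)
        for tag in entry.tags:
            self._by_tag.setdefault(tag, []).append(entry)

    def search(self, tag):
        """Return the paths of all entries carrying the given tag."""
        return [entry.path for entry in self._by_tag.get(tag, [])]


catalog = DataCatalog()
catalog.register(CatalogEntry("sales/2021/03/orders.parquet", "parquet", {"sales", "finance"}))
catalog.register(CatalogEntry("iot/2021/03/sensors.json", "json", {"iot"}))
```

Calling `catalog.search("sales")` then returns the matching paths without scanning the stored files themselves.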
Data Lakes are becoming increasingly important to business data strategies. They respond much better to today’s reality: much larger volumes and types of data, higher user expectations and a greater variety of analytics, both business and predictive. Data Warehouses and Data Lakes are intended to coexist in companies that want to base their decisions on data. They are complementary, not substitutes, and can help any business better understand both markets and customers, as well as promote digital transformation efforts.
Our next article will delve into how we can use Amazon Web Services and its open, secure, scalable, and cost-effective infrastructure to build Data Lakes and analytics on top of them.
The revolution of Serverless Computing is here to stay, and this is because this new technology enables application development without having to go through the management and administration of a server. Under this model, applications can be grouped and loaded onto a platform and then run and scaled as demand for them increases.
Although “Serverless Computing” does not eliminate the use of servers when executing code, it does eliminate all activities related to their maintenance and updating. This creates an efficient model where developers can disengage from those routine tasks to focus on more productive activities, thus increasing the company’s operational efficiency.
What is Function as a Service (FaaS)?
Function as a Service (FaaS) is a model that allows for the execution of several computing actions based on events, and thanks to it, developers can manage applications, “bypassing” the need for servers during their management.
In this model, application logic is written as individual functions that the provider executes in containers located in the cloud, instead of running inside a server process that the developer has to manage.
In general terms, FaaS allows us to design applications in a new architecture where the server works in the background and the execution of code based on events becomes the fundamental pillar of the model. This means that the underlying processes that normally occur on a server do not run continuously, but are available when needed.
This becomes a clear advantage of the FaaS model, allowing developers to scale dynamically, that is, to have the application’s capacity decrease or increase automatically based on actual demand.
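The event-driven execution described above can be sketched as a small dispatcher that runs a function only when a matching event arrives. The event types, handler names and the `on` and `dispatch` helpers are hypothetical; real FaaS platforms provide their own registration and triggering mechanisms.

```python
# Registry mapping an event type to the function that handles it.
HANDLERS = {}


def on(event_type):
    """Decorator that registers a function as the handler for an event type."""
    def decorator(func):
        HANDLERS[event_type] = func
        return func
    return decorator


@on("file_uploaded")
def make_thumbnail(payload):
    return f"thumbnail created for {payload['name']}"


@on("user_signed_up")
def send_welcome(payload):
    return f"welcome email sent to {payload['email']}"


def dispatch(event):
    """Run the matching handler, or do nothing if no function is registered."""
    handler = HANDLERS.get(event["type"])
    if handler is None:
        return None
    return handler(event["payload"])


outcome = dispatch({"type": "file_uploaded", "payload": {"name": "cat.png"}})
```

Nothing runs between events: each function exists only as registered code until `dispatch` receives an event of its type.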
In addition to the above, FaaS increases the efficiency and profitability of operations, since providers will not bill the company when no activity is detected.
All this makes the FaaS model an innovative element within the recent field of serverless architecture by minimizing investment in infrastructure, and leveraging the competitive advantages of Cloud Computing.
The evolution of Serverless Computing
With the advent of the cloud in the first decade of the 2000s, people had the opportunity to store and transfer data online, which reduced the need for local hard drives.
This undoubtedly created important advantages for users, who had the opportunity to immediately access their information online from any device.
However, developers were missing an element in this equation, i.e., the place where applications or software were deployed. In this sense, a “Virtual Machine” model was introduced, which made it possible to target a “Simulated Server”, creating significant flexibility in updates and migrations and leaving behind the problems associated with hardware variations.
Despite this progress, “virtual machines” had some limitations in terms of operation, and this led to the creation of containers, a new technology that allowed administrators to partition the operating system in order to keep several applications active simultaneously, without one interfering with the other.
Considering this reality, we can see that all these technologies maintain the paradigm of “where an application runs” as their fundamental structure. Under this scenario, Serverless Computing emerged, promising a new level of abstraction focused on the code itself that diminished the importance of the place where code was stored.
With the advent of Amazon’s AWS Lambda service at the end of 2014, a milestone in serverless architecture was achieved, as developers could finally focus their efforts on creating software without having to worry about hardware, OS maintenance, the location of the application, or its scalability.
Use Cases for Serverless Computing
Below are some successful cases of companies that applied serverless technology, or Serverless Computing, within their organizations:
Case 1. Major League Baseball Advanced Media (MLBAM)
Major League Baseball has used serverless computing technology to provide all its fans with real-time baseball game data through its “Statcast” product. Adopting it has increased MLBAM’s processing speed, as well as its ability to handle more data.
Case 2. T-Mobile US
T-Mobile US is a mobile phone company with a strong presence in the North American market. The company decided to bet on serverless technology, achieving significant benefits in terms of resource optimization, simpler scaling, and less patching work, thus increasing its capacity to respond to all its customers more efficiently.
Case 3. Autodesk
Autodesk is a company that develops software for the architecture, construction and engineering industries. Recently this organization decided to apply serverless technology in order to manage its development, as well as the time-to-market of all its products. In keeping with this policy, Autodesk created the “Tailor” application as an efficient response for managing its clients’ accounts.
Case 4. iRobot
iRobot is a company that designs and manufactures robotic devices intended for use within the home and in industrial settings. Since the organization decided to get involved with Serverless Computing technology, the data processing capacity of its robots has increased substantially, also allowing the capture of data streams in real time. The new serverless architecture allows them to focus on their customers and not on operations.
Case 5. Netflix
Netflix has become one of the world’s largest online media on-demand content providers. In line with its innovative spirit, this company has decided to use Serverless Computing to generate an architecture that helps optimize the encoding processes of its audiovisual files, as well as the monitoring of its resources.
When we look at the evolution of Serverless Computing and how it has managed to significantly impact computing processes in general, we understand that this new system will quickly become the next step in the world of cloud computing, fostering a promising future focused on adopting a multimodal operational approach.
Innovation in the world of computing occurs at a startling pace in each and every area, generating important progress in the processes related to “Serverless Computing”, also known as “Serverless Architecture”.
In this context, an increasing number of companies are turning to the “Cloud” as a way to optimize the creation and execution of applications and processes, minimizing the use of servers. This is where Serverless Computing comes in as a key element for the proper development of internal software architecture.
Although Serverless Computing reduces the use of a server, the server does not disappear in its entirety; it is simply optimized and reassigned by the cloud provider, who will ultimately be responsible for all the routine activities associated with the servers’ maintenance.
In the beginning, creating a web application required the use of hardware that would allow the execution of a server, sometimes resulting in a complicated and expensive process. Later on, when the cloud came along, companies and developers had the possibility to rent spaces on remote servers to carry out their activities.
However, this process was not entirely efficient either, since companies ended up buying more space than necessary in order to ensure the system would remain stable in case of very high demand peaks, thus incurring additional expenses. This is why developers began to see the need for a platform that would allow them to pay only for the space used.
In this sense, the story of Serverless Computing is recent, the first reports of this technology being found in an article by the specialist in decentralized applications and serverless development, Ken Fromm, published in October 2012, titled “Why the Future of Software and Apps is Serverless.”
In November 2014, Amazon launched its “AWS Lambda” service, which allows developers to execute code and automatically provision resources without the need to manage the underlying infrastructure.
A year later, in July 2015, Amazon created “API Gateway”, a service for the creation and maintenance of REST, HTTP and WebSocket APIs, where developers can create Application Programming Interfaces that access Amazon or other Web Services, as well as data stored in the cloud. Finally, in October 2015, “Serverless Framework” was born as the first framework developed for creating applications on AWS Lambda.
Serverless architecture overview
Serverless Computing, or serverless architecture, does not imply the total absence of a server as such; what this system actually seeks is for the cloud provider to adequately and efficiently manage all processes related to the server.
In this sense, one of the outstanding features of Serverless Computing is the ability to let go of the traditional way of managing servers in a company, replacing it with automated management by the cloud provider.
This means that the cloud provider is responsible for managing all organizational resources during the execution of a particular activity, leaving behind the old administrative action carried out by users within the organization.
Under this new scheme, a company’s IT activities are billed according to the resources each particular task needs, in clear contrast with the old model, where capacity was often paid for and left unused. This allows for major capital savings, since the company only pays for what is actually used.
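The contrast between the two billing models can be sketched with simple arithmetic. Every rate and workload figure below is a hypothetical value chosen for illustration, not a price quoted by any provider.

```python
def reserved_cost(hours, hourly_rate):
    """Cost of capacity that is paid for whether or not it is used."""
    return hours * hourly_rate


def pay_per_use_cost(invocations, seconds_per_invocation, rate_per_second):
    """Cost when only actual execution time is billed."""
    return invocations * seconds_per_invocation * rate_per_second


# Assumed month: a server reserved for 720 hours at a hypothetical hourly rate.
always_on = reserved_cost(hours=720, hourly_rate=0.10)

# Same month served on demand: 100,000 short executions (all values assumed).
on_demand = pay_per_use_cost(
    invocations=100_000, seconds_per_invocation=0.2, rate_per_second=0.00002
)
```

With these assumed figures the workload only runs for about five and a half hours in total, so paying per use costs far less than keeping a server reserved for the whole month. A workload that is busy most of the time could shift the comparison the other way.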
In addition to the above, the Serverless Computing model eliminates the need to make server reservations. As a result, developers no longer need to access the server through an Application Programming Interface (API) to add resources, since the cloud provider is now responsible for doing this automatically.
Serverless Computing has a number of advantages when compared to the traditional model, including the following:
It significantly reduces operating costs by allowing developers to pay only for the resources used.
Higher productivity for companies, with the possibility to assign tasks related to the administration of servers to third parties, and thus focus directly on application development.
Serverless Computing platforms reduce time to market, since developers have the option of modifying or adding code gradually.
Providers of this new service can manage everything related to scaling the code according to real demand.
Ability to focus on unifying software development and its operational capacities, that is, adopting “DevOps” system engineering practices.
Optimized application development incorporating essential components of the BaaS model offered by other providers.
Regarding the disadvantages or downsides of Serverless Computing, the following may be mentioned:
Cloud providers place significant restrictions on how their components can be used, which directly affects system customization and flexibility.
Dependence on service providers.
It could cause some problems associated with the lack of control of the company’s own servers.
Access to virtual machines and operating systems is limited.
Implementing a serverless architecture implies an economic effort, since it typically requires updating the systems to meet the provider’s demands.
What role does the cloud provider play in Serverless Computing?
Cloud providers play a fundamental role in serverless architecture, since they are in charge of running the servers and allocating resources for developers at the same time.
In this sense, cloud providers offer two main methods within the Serverless Computing scheme, called “Function as a Service” (FaaS) and “Backend as a Service” (BaaS).
The first method, “Function as a Service” (FaaS), allows developers to write and update small pieces of code, in the style of microservices, that are deployed to the cloud and executed in response to events, which simplifies the incorporation of data, reduces execution times and leaves resource management to the provider.
On the other hand, the “Backend as a Service” (BaaS) method is based on third-party services exposed through an Application Programming Interface (API) established by the provider, such as databases, authentication services, and encryption processes.
Finally, it is worth noting that the large cloud providers offer FaaS products, such as AWS Lambda from Amazon, Azure Functions from Microsoft, IBM Cloud Functions and Google Cloud Functions.
Serverless Computing has certainly had a significant impact in the world of computing, allowing developers to focus on creating software without having to worry about managing the infrastructure that runs their production code, since the cloud provider is in charge of efficiently managing the resources necessary for this important activity.
Would you like to learn more about this subject? Please visit our IT Continuity page to learn more about the services we offer related to infrastructure and custom Software Development.
More and more businesses are turning to cloud computing for their IT needs. The question is why they are opting for cloud services instead of the traditional on-premise solution. Below are the benefits of using cloud over on-premise solutions.
Flexibility and Scalability
The majority of businesses that use traditional on-premise solutions find it challenging and time-consuming to adopt new software or hardware projects, both in implementation and in user adoption. Additionally, scaling the IT solution up or down to cater for the number of users is taxing as well. For cloud solutions, setting up new software and hardware is a relatively easy and simple process. Increasing and decreasing the scale of the cloud solution to cater for sudden changes in the number of users is also straightforward and quick to implement. This gives businesses the flexibility to seamlessly adapt their cloud solution to meet their current needs.
CapEx and OpEx
On-Premise IT solutions are linked to capital expenses, or CapEx, while cloud services usually run on operational expenses, or OpEx. CapEx involves purchasing an asset, whereas OpEx entails incurring an ongoing regular cost, usually linked to a contract. Operational expenses are highly transparent because the return on investment is easy to determine, which makes managing the firm’s IT expenses more convenient.
Backup, Recovery, and Security
The cloud service provider is usually responsible for the backup, updates, and recovery of software, hardware, and data. On-premise solutions involve onsite storage of data, which is a risk since it is susceptible to physical and digital attacks or damage. The cloud service provider is also vulnerable to the same risks; however, providers put specific measures in place to ensure that data is highly secure. A good cloud service provider also has multiple backups of hardware, software and client data. This means that if something were to happen to one of their storage systems or to the customer’s data, it would be possible to recover the lost or damaged elements.
Infrastructure as a Service
Businesses with on-premise solutions typically install their own IT infrastructure around their premises. Managing onsite infrastructure is time-consuming for the IT department, which must dedicate a portion of its resources to it. Furthermore, planning for the future is difficult. For example, assuming a firm has 300 employees, does that mean that the server capacity should cater for 300 users, or should it be more or less in case the company decides to restructure or take on a new project 6 months from now? Firms with cloud-based solutions may use Infrastructure as a Service (IaaS), which allows them to lease various IT resources such as storage space and processing capacity.
IaaS also frees up time for the IT department, which can focus on meeting the business objectives of the firm. IaaS is scalable, which enables the company to increase or decrease the leased infrastructure resources to meet its current and future needs in an easy, quick, and convenient way.