Definition

data engineer

By

Alexander S. Gillis, Technical Writer and Editor
Ben Lutkevich, Site Editor
Jack Vaughan

What is a data engineer?

A data engineer is an IT professional whose primary job is to prepare data for analytical or operational uses. This occupation includes duties such as designing and building systems for collecting, storing and analyzing data.

Data engineers are typically responsible for building data pipelines to bring together information from different source systems. These software engineers integrate, consolidate and cleanse data and structure it for use in analytics applications. They strive to make data easily accessible and to optimize their organization's big data ecosystem.

The amount of data an engineer works with varies by organization, particularly with respect to its size. The bigger the company, the more complex the analytics architecture and the more data the engineer is responsible for maintaining. Certain industries are more data-intensive, including healthcare, retail and financial services.

Data engineers work in conjunction with data science teams, improving data transparency and enabling businesses to make more trustworthy business decisions.

The data engineer role

Data engineers focus on collecting and preparing data for use by data scientists and analysts. They take on the following three main roles:

Generalists. Data engineers with a general focus typically work on small teams, doing end-to-end data collection, intake and processing. They might have more skills than most data engineers, but less knowledge of systems architecture. A data scientist who wants to become a data engineer would fit well into the generalist role.

A generalist data engineer might undertake a project to create a dashboard for a small, metro-area food delivery service that displays the number of deliveries made each day for the past month and forecasts the delivery volume for the following month.

Pipeline-centric engineers. These data engineers typically work on a data analytics team with more complicated data science projects across distributed systems. Midsize and large companies are more likely to need this role.

A regional food delivery company might undertake a pipeline-centric project to create a tool for data scientists and analysts to search metadata for information about deliveries. They might look at distance driven and drive time required for deliveries in the past month, then use that data in a predictive algorithm to see what it means for the company's future business.

Database-centric engineers. These data engineers implement, maintain and populate analytics databases. This role typically exists at larger companies where data is distributed across several databases. These engineers work with pipelines, tune databases for efficient analysis and create table schemas using extract, transform and load (ETL) methods. The ETL process copies data from several sources into a single destination system.

A database-centric project at a large, national food delivery service would be to design an analytics database. In addition to creating the database, the data engineer would write the code to get data from where it's collected in the main application database into the analytics database.

Data engineer responsibilities

Data engineers often work as part of an analytics team alongside data scientists. Data engineers provide data in usable formats to the data scientists who run queries and algorithms against the information for predictive analytics, machine learning and data mining applications. Data engineers also deliver aggregated data to business executives, analysts and other end users so they can analyze it and apply the results to improve business operations.

Data engineers work with both structured and unstructured data. Structured data is information that can be organized into a formatted repository like a database. Unstructured data -- such as text, images, audio and video files -- doesn't conform to conventional data models. Data engineers must understand different approaches to data architecture and applications to handle both data types. A variety of big data technologies, such as open source data ingestion and processing frameworks, are also part of the data engineer's toolkit.

Although exact responsibilities for data engineers differ by organization, other typical responsibilities include the following:

Build, test and maintain database pipeline architectures.
Create methods for data validation.
Acquire data.
Clean data.
Develop data set processes.
Improve data reliability and quality.
Develop algorithms to make data usable.
Prepare data for prescriptive and predictive modeling.

Data engineer skill set

Data engineers are skilled in programming languages such as C#, Java, Python, R, Ruby, Scala and SQL. Python, R and SQL are the three primary languages data engineers use.

Engineers need to have a good understanding of ETL tools and representational state transfer-oriented APIs for creating and managing data integration jobs. These skills also help provide data analysts and business users with simplified access to prepared data sets.

Data engineers must understand data warehouses and data lakes and how they work. For instance, Hadoop data lakes that offload the processing and data storage work of established enterprise data warehouses support the big data analytics efforts of data engineers.

Data engineers must also understand NoSQL databases and Apache Spark systems, which are becoming common components of data workflows. Data engineers should have a knowledge of relational database systems as well, such as MySQL and PostgreSQL. Another focus is Lambda architecture, which supports unified data pipelines for batch and real-time processing.

Business intelligence (BI) platforms and the ability to configure them are another important focus for data engineers. They must know how to work with interactive BI platform dashboards to establish connections with data warehouses, data lakes and other data sources.

Although machine learning is more part of the data scientist's or the machine learning engineer's skill set, data engineers must also understand it to be able to prepare data for machine learning platforms. They should know how to deploy machine learning techniques and gain insights from them.

Lastly, knowledge of Unix-based operating systems (OSes) is important. Unix, Solaris and Linux provide functionality and root access that other OSes -- such as macOS and Windows -- don't. They give the user more control over the OS, which is useful for data engineers.

As the data engineer job has gained more traction, companies such as IBM and Hadoop vendor Cloudera Inc. have begun offering certifications for data engineering professionals. Some popular data engineer certifications include the following:

The Certified Analytics Professional (CAP). This vendor-neutral certification from Informs focuses on data analytics and a candidate's ability to transform complex data into valuable insights. The exam has educational and experience requirements, requires a fee to take and is self-paced.
Cloudera CCP Data Engineer. This certification verifies a candidate's ability to ingest, transform, store and analyze data in Cloudera's data tool environment. Cloudera charges a fee for its four-hour test. It consists of five to 10 hands-on tasks, and candidates must get a minimum score of 70% to pass. There are no prerequisites, but candidates should have extensive data engineering experience.
Google Cloud Professional Data Engineer. This certification tests a candidate's ability to use machine learning models, ensure data quality and build and design data processing systems. Google charges a registration fee for the two-hour multiple-choice exam. There are no prerequisites, but Google recommends the candidate have some experience with Google Cloud Platform.
AWS Certified Data Engineer - Associate. This certification offered by Amazon Web Services focuses on such topics as validating skills in data-related AWS services, the implementation of data pipelines and troubleshooting issues. The 130-minute exam requires a fee to take.

As with many IT certifications, those in data engineering are often based on a specific vendor's product, and the training and exams focus on teaching candidates how to use their software.

How to become a data engineer

Certifications alone aren't enough to land a data engineering job. Experience is also necessary to be considered for a position. Other ways to break into data engineering include the following:

Build a portfolio. Create a portfolio filled with data engineering-related projects. A portfolio shows recruiters and hiring managers the type of work an individual can produce. Independent or coursework projects have the potential to show what knowledge on the subject an individual has, as well as their problem-solving skills.
University degrees. Useful degrees for aspiring data engineers include bachelor's degrees in applied mathematics, computer science, physics or engineering. Master's degrees in computer science or computer engineering can help candidates set themselves apart from other job seekers.
Online courses. Inexpensive and free online courses are good ways to learn data engineering skills. There are many useful videos on YouTube, as well as free online courses and resources, such as the following four options:

Codecademy's Learn Python 2. Knowledge of Python is essential for data engineers. This free course requires no prior knowledge. A certificate of completion is included with paid plans.
Linux Server Management and Security. This course provided through Coursera and the University of Colorado covers Linux basics. The course includes five modules and is part of Coursera's Computer Security and Systems Management Specialization.
GitHub Quick SQL Cheatsheet. This GitHub repository is consistently updated with SQL query examples.
O'Reilly data engineering digital courses. Titles in the big data architecture section cover data engineering topics. O'Reilly offers a free trial that provides unlimited learning for 10 days.

Project-based learning. With this more practical approach to learning data engineering skills, the first step is to set a project goal and then determine which skills are necessary to reach it. The project-based approach is a good way to maintain motivation and structure learning.

Data engineer vs. data scientist

Data engineers and data scientists work together. Data engineers prepare and organize the data that companies have in databases and other formats. They also build data pipelines that make data available to data scientists. Data scientists, however, use all that data for analytics and other projects that improve business operations and outcomes.

Data scientists and data engineers differ in their skill sets and focus. Data engineers don't necessarily have a specific focus; they tend to be competent in several areas and well-rounded in their knowledge and skills. By contrast, data scientists often have specialized areas of focus. They're concerned with more exploratory data analysis. Data scientists tackle new, big-picture problems, while data engineers put the pieces in place to make that possible.

A chart comparing data scientist vs. data engineer. — Data engineers and data scientists have overlapping but different skills and responsibilities on the data management team.

Data engineer vs. data architect

The roles of data engineer and data architect are closely related. A data architect is an IT professional who is responsible for defining the policies, procedures, models and technologies to be used in collecting, organizing, storing and accessing company information. Data architects act primarily as planners and designers with a typically broad, company-wide scope of operations.

While data architect and data engineer positions are relatable, they have a few distinct and important differences. Data architects are responsible for high-level data policies and data management practices, while data engineers are responsible for applying these policies and practices. Data engineers build and maintain the blueprint for a data framework -- defining how data is stored, accessed and managed -- and data engineers build and maintain those systems.

Data architects require both business and technical skills. For example, data architects must have skills related to translating business requirements and operations into data policies, as well as expertise in databases, data modeling, data architecture and OSes.

Data engineer vs. big data engineer

Data engineers have a variety of options when applying for related positions. A big data engineer, for example, is a data engineer who works with very large data sets.

Big data engineers are responsible for building and maintaining an organization's big data environment. This includes working on the big data architecture and technology, as well as data preparation and data management processes.

Among other skills, big data engineers require an understanding of data archetypes, coding functions, object-oriented programming languages, structured data relational and NoSQL databases, abstraction tools and data warehousing techniques.

Data engineer salary and outlook

The average salary for a data engineer varies, depending on factors such as location, industry and years of experience. According to Indeed, the average base salary for a data engineer in Boston is $129,913 with a range of $86,072 to $196,085 per year. By comparison, Salary.com notes the pay range for a data engineer in Boston as $130,971 to $160,912.

In addition to data engineers and data scientists, data management and analytics teams contain a variety of roles and specialties. Learn about the skill sets and personnel required to create a strong enterprise data science team.

This was last updated in March 2024

Continue Reading About data engineer

Top data architect and data engineer certifications

Key differences of a data scientist vs. data engineer

Top big data interview questions with answers

Top cloud computing careers and how to get started

How to design a data architecture for business success

Dig Deeper on Data management strategies

Business Analytics

AI assistant from Tableau targets efficiency, deep analysis
The analytics vendor's copilot can recommend questions that might lead to otherwise undiscovered insights as well as understand ...
New Databricks open source LLM targets custom development
The data platform vendor's new language model was designed to provide open source users with AI development capabilities similar ...
Domo to add prebuilt models, chat capabilities to AI suite
The BI vendor's latest innovations include conversational analytics capabilities and prebuilt models designed to help customers ...

AWS Control Tower aims to simplify multi-account management
Many organizations struggle to manage their vast collection of AWS accounts, but Control Tower can help. The service automates ...
Break down the Amazon EKS pricing model
There are several important variables within the Amazon EKS pricing model. Dig into the numbers to ensure you deploy the service ...
Compare EKS vs. self-managed Kubernetes on AWS
AWS users face a choice when deploying Kubernetes: run it themselves on EC2 or let Amazon do the heavy lifting with EKS. See ...

Content Management

Adobe intros new GenAI tools and apps for marketers
The new applications and tools are for marketers looking to connect the different processes of the content supply chain in a safe...
8 top enterprise search engines
Whether an organization wants to improve its internal or external search strategies, an enterprise search engine can help users ...
The importance of ethics in information management
Advancements in data collection and processing may tempt information management professionals to use as much customer data as ...

Oracle sets lofty national EHR goal with Cerner acquisition
With its Cerner acquisition, Oracle sets its sights on creating a national, anonymized patient database -- a road filled with ...
With Cerner, Oracle Cloud Infrastructure gets a boost
Oracle plans to acquire Cerner in a deal valued at about $30B. The second-largest EHR vendor in the U.S. could inject new life ...
Supreme Court sides with Google in Oracle API copyright suit
The Supreme Court ruled 6-2 that Java APIs used in Android phones are not subject to American copyright law, ending a ...

SAP chief AI officer: Waiting on AI is the wrong strategy
SAP's first chief AI officer, Philipp Herzig, outlines the company's new AI-focused organization and underscores why companies ...
SAP, Nvidia partner to boost Business AI development
SAP and Nvidia are working together to combine platforms and services that help customers build business-specific generative AI ...
SAP Datasphere adds data governance, GenAI for analytics
SAP introduced new functionality in SAP Datasphere to help customers better manage their data environments with governance, ...

Close