Big Data Engineering with Embedded Systems — Internet of Things (IoT) Devices

Ainomugisha Solomon
Published in Stackademic
12 min read · Jan 9, 2024

I recently got curious about how the different devices in our homes, automobiles, and industries coordinate their operation and collect operational data, which aids in further optimisation and improvement of those processes. All of these devices have embedded sensors and actuators that can be connected to via specific SDKs, which Amazon Web Services supports.

I decided to put out a technical piece about how this can be achieved using different Amazon Web Services (AWS) offerings such as AWS IoT Core, S3, Glue, Lambda, Athena, Redshift and QuickSight. Enjoy the read as I explain how all these services interlink to achieve big data engineering with embedded systems.

Below is a quick-look explanation of the services; we shall deep dive into each and how it works.

IoT Core: This service will act as the central hub for all connected devices. It will receive and process messages from these devices and then store them in an S3 bucket.

S3: The data received from IoT Core will be stored in an S3 bucket. This bucket will serve as the initial data lake where raw data resides.

Glue: AWS Glue will be used to catalog metadata from the data stored in S3. This includes table definitions and other metadata. AWS Glue also runs ETL jobs to transform and cleanse the data.

Lambda: AWS Lambda functions can be triggered based on events in S3 or IoT Core. For example, a Lambda function could be triggered whenever new data arrives in S3, allowing for real-time processing.

Athena: Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. You can use Athena to run complex queries on your data without having to set up any infrastructure.

Redshift: Amazon Redshift is a data warehouse service designed for online analytic processing (OLAP). It can be used to perform complex queries on large datasets stored in S3.

QuickSight: Amazon QuickSight is a fast, cloud-powered business intelligence service that makes it easy to deliver insights to everyone in your organization. You can create and publish interactive dashboards that include machine learning-powered insights.

Now let's dive deeper into each service and how it actually works.

AWS IoT Core:
AWS IoT Core acts as a cloud-based service that lets you connect, manage, and ingest data from IoT devices. Here’s a detailed explanation of how IoT Core receives and processes messages from connected devices:

Device Registration: Before a device can communicate with IoT Core, it needs to be registered. This involves creating a Thing, which is an entity in the AWS IoT system that represents a device. Each Thing has a unique identifier and associated metadata, such as name and description. The device can also be assigned a certificate, which is used to authenticate the device when it attempts to connect to IoT Core.

Device Connection: Once registered, the device can connect to IoT Core using MQTT or HTTP/HTTPS protocols. These protocols allow devices to publish messages to specific topics and subscribe to receive messages from those topics. The device sends messages to IoT Core, which acts as a message broker.
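As a concrete sketch of what a device publishes, the snippet below builds an MQTT topic and a JSON payload locally. The topic layout and field names are illustrative assumptions, not mandated by IoT Core; any topic hierarchy and payload shape will do.

```python
import json
import time

def build_telemetry_message(device_id: str, temperature_c: float):
    """Build the MQTT topic and JSON payload a device might publish to IoT Core."""
    # Topic layout and field names are illustrative assumptions.
    topic = f"devices/{device_id}/telemetry"
    payload = json.dumps({
        "device_id": device_id,
        "temperature_c": temperature_c,
        "timestamp_ms": int(time.time() * 1000),  # epoch milliseconds
    })
    return topic, payload

topic, payload = build_telemetry_message("thermostat-01", 22.5)
```

An actual device would hand this topic and payload to an MQTT client (for example, one from the AWS IoT Device SDK) over a TLS connection.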

Message Processing: IoT Core uses a feature called the Rules Engine to process incoming messages. The Rules Engine connects data from the message broker to other AWS IoT services for storage and additional processing. For instance, you can insert, update, or query a DynamoDB table or invoke a Lambda function based on an expression that you define in the Rules Engine. You can use an SQL-based language to select data from message payloads, and then process and send the data to other services.
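To make the SQL-based selection concrete, the sketch below mimics locally what a hypothetical rule like `SELECT temperature, device_id FROM 'sensors/+/data' WHERE temperature > 30` does to a message payload (the topic, fields, and threshold are illustrative, not from the article):

```python
def apply_rule(payload: dict):
    """Return the SELECTed fields when the WHERE clause matches, else None."""
    if payload.get("temperature", float("-inf")) > 30:
        return {"temperature": payload["temperature"],
                "device_id": payload.get("device_id")}
    return None  # filtered out: no action fires for this message

hot = apply_rule({"device_id": "s1", "temperature": 35})   # matches the rule
cold = apply_rule({"device_id": "s2", "temperature": 20})  # filtered out
```

In the real Rules Engine the matching payload is not returned to you; it is forwarded to the rule's configured actions (DynamoDB, Lambda, S3, and so on).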

Data Transformation: The Rules Engine can also transform incoming messages before they are sent to other services. For example, a rule might filter out unnecessary data or add additional context to the message. This allows you to tailor the data to the needs of your application.
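The filter-and-enrich idea looks like this in plain Python (field names such as `debug_counter` and the `site` attribute are made up for illustration; in a real rule you would express the same thing in the rule's SQL statement):

```python
def transform_message(msg: dict, site: str) -> dict:
    """Drop fields downstream consumers don't need and add context."""
    keep = {"device_id", "temperature_c", "timestamp"}
    out = {k: v for k, v in msg.items() if k in keep}  # filter out noise
    out["site"] = site  # enrich with deployment context
    return out

raw = {"device_id": "d1", "temperature_c": 21.0, "timestamp": 1704758400000,
       "debug_counter": 9, "fw_build": "abc"}
clean = transform_message(raw, site="plant-a")
```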

Republishing Messages: In addition to processing messages, the Rules Engine can also republish messages to other subscribers. This means that a single message can be sent to multiple destinations, allowing for flexible data routing.

Remember that the security of your IoT devices and the data they generate is paramount. Therefore, it’s crucial to properly configure security settings in IoT Core, including the use of SSL/TLS for encrypted communication and the use of IAM roles and policies to control who can access your devices and data.

AWS S3:
Amazon S3 (Simple Storage Service) is a scalable object storage service designed to store and retrieve any amount of data from anywhere. In the context of IoT, S3 can be used to store raw data from IoT devices that is received through AWS IoT Core.

Here’s how it works:

IoT Rule Creation: When you create an AWS IoT rule with an S3 action, you specify the details of the S3 bucket where the data should be stored. This includes the bucket name, the key (which is usually a combination of the topic and timestamp to ensure uniqueness), and the IAM role that allows access to the S3 bucket. The rule also specifies the SQL statement that selects the data to be written to the S3 bucket.

Data Writing: When a message matching the SQL statement in the IoT rule arrives from an IoT device, AWS IoT Core triggers the S3 action. The data from the MQTT (Message Queuing Telemetry Transport) message is written to the specified S3 bucket. The key is dynamically generated based on the topic and timestamp of the incoming message. This ensures that each message is stored in a separate file within the S3 bucket.
Here’s an example of an IoT rule with an S3 action:

{
  "topicRulePayload": {
    "sql": "SELECT * FROM 'some/topic'",
    "ruleDisabled": false,
    "awsIotSqlVersion": "2016-03-23",
    "actions": [
      {
        "s3": {
          "bucketName": "my-bucket",
          "cannedAcl": "private",
          "key": "${topic()}/${timestamp()}",
          "roleArn": "arn:aws:iam::123456789012:role/aws_iot_s3"
        }
      }
    ]
  }
}

This rule selects all messages from the 'some/topic' topic and writes them to the 'my-bucket' S3 bucket. The key for each file is a combination of the topic and timestamp of the incoming message. The roleArn is the ARN of the IAM role that grants AWS IoT permission to write to the S3 bucket.

Remember that the security of your S3 buckets is crucial. Therefore, it’s recommended to restrict access to your S3 buckets as much as possible and use IAM roles and policies to control who can access your data.

AWS Glue:
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. It includes the AWS Glue Data Catalog, a persistent metadata store that contains metadata tables, job definitions, and other control information to manage your AWS Glue environment.

Here’s how AWS Glue works with S3 and ETL jobs:

Cataloging Data: AWS Glue uses a component called a Crawler to discover and catalog metadata from your data stores. You can point your Crawler at your S3 bucket, and it will create table definitions in the AWS Glue Data Catalog. These table definitions include the location, schema, and other metadata of your data. This information is stored as metadata tables, where each table specifies a single data store.

Running ETL Jobs: After cataloging your data, you can define ETL jobs in AWS Glue to extract, transform, and load your data. An ETL job consists of a script that extracts data from your data source (in this case, your S3 bucket), transforms the data, and loads it to your data target (such as another S3 bucket or a database). The script runs in an Apache Spark environment in AWS Glue.

Transforming Data: AWS Glue can generate a script to transform your data, or you can provide the script yourself. This script can be written in Python or Scala, and it defines the transformations to apply to your data. AWS Glue also supports various built-in transformations that you can use to clean and transform your data.
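The kind of cleansing such a script applies can be sketched in plain Python. Note this is only the logic, not a real Glue job: actual Glue scripts run on Apache Spark and use the awsglue library (DynamicFrames, built-in transforms), and the field names here are illustrative.

```python
def cleanse_records(records: list[dict]) -> list[dict]:
    """Drop incomplete rows and normalise field types, Glue-transform style."""
    out = []
    for rec in records:
        if rec.get("temperature_c") is None:
            continue  # drop rows with missing readings
        out.append({
            "device_id": str(rec["device_id"]),           # enforce string IDs
            "temperature_c": float(rec["temperature_c"]),  # enforce numeric type
        })
    return out

rows = cleanse_records([
    {"device_id": 7, "temperature_c": "21.5"},
    {"device_id": 8, "temperature_c": None},  # discarded
])
```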

Loading Data: After transforming the data, the ETL job loads the transformed data to your data target. The target could be another S3 bucket, a database like Amazon RDS, or even a data warehouse like Amazon Redshift.

Remember that AWS Glue bills for the compute resources (DPU-hours) that your crawlers and ETL jobs consume. Therefore, it's important to optimize your ETL jobs to minimize processing costs.

Lambda:
AWS Lambda is a serverless compute service that lets you run your code without provisioning or managing servers. With Lambda, you can run your code for virtually any type of application or backend service. Lambda can be triggered by various AWS services, including S3 and IoT Core.

Here’s how Lambda functions can be triggered by events in S3 and IoT Core:

S3 Triggers: You can configure an Amazon S3 bucket to send an event to a Lambda function when an object is created or deleted. To do this, you need to set up a notification configuration on the S3 bucket that points to the Lambda function. When an object is created or deleted in the bucket, S3 sends an event to the Lambda function, which is then triggered to execute. The event contains details about the object, such as its name and size.

Here’s an example of how to add a trigger to a Lambda function for an S3 bucket:

aws lambda add-permission --function-name my-function \
  --statement-id s3-events --action "lambda:InvokeFunction" \
  --principal s3.amazonaws.com --source-arn arn:aws:s3:::my-bucket

This command adds a permission statement to the Lambda function’s resource-based policy, allowing S3 to invoke the function.
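On the function side, a minimal handler for such an S3 trigger just walks the `Records` array in the standard S3 notification event. This is a sketch; what you do with each object (read it, transform it, re-store it) depends on your pipeline.

```python
def handler(event, context):
    """Collect bucket/key/size for each object in an S3 notification event."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"].get("size", 0)
        processed.append({"bucket": bucket, "key": key, "size": size})
    return processed

# Invoke locally with a sample event shaped like a real S3 notification:
sample_event = {"Records": [{"s3": {
    "bucket": {"name": "my-bucket"},
    "object": {"key": "some/topic/1704758400000", "size": 128},
}}]}
result = handler(sample_event, None)
```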

IoT Core Triggers: Similarly, AWS IoT Core can trigger a Lambda function when a certain condition is met, such as when a new message arrives. You can set up rules in IoT Core to match the conditions under which the rule should trigger the Lambda function. When the rule matches an incoming message, IoT Core sends an event to the Lambda function, which is then triggered to execute.

Here’s an example of how to add a trigger to a Lambda function for IoT Core:

aws lambda add-permission --function-name my-function \
  --statement-id iot-events --action "lambda:InvokeFunction" \
  --principal iot.amazonaws.com

This command adds a permission statement to the Lambda function’s resource-based policy, allowing IoT Core to invoke the function.
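With a Lambda rule action, the event the function receives is the message payload selected by the rule's SQL statement. A handler sketch (the field names and the 80-degree threshold are illustrative assumptions):

```python
def handler(event, context):
    """React to an IoT message payload forwarded by a rule's Lambda action."""
    temperature = event.get("temperature_c")
    if temperature is not None and temperature > 80:
        return {"alert": True, "temperature_c": temperature}
    return {"alert": False, "temperature_c": temperature}

# Invoke locally with a payload shaped like a device message:
result = handler({"device_id": "boiler-3", "temperature_c": 95.0}, None)
```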

Remember that AWS Lambda functions are stateless by nature, which means they don’t inherently support long-running operations. However, you can design your Lambda functions to handle long-running tasks by invoking themselves recursively or using AWS Step Functions.

Amazon Athena:
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. It’s a serverless service, meaning that you don’t need to set up or manage any infrastructure to start running queries.

Here’s how Athena interacts with S3 and SQL:

Querying S3 Data: Athena can read data directly from files stored in S3. It supports various file formats such as CSV, JSON, and Parquet. When you run a query, Athena reads the relevant files from S3, executes the query, and returns the results. There’s no need to move data around or load it into a database before you can query it.

SQL Support: Athena uses Presto, a distributed SQL query engine developed by Facebook. This means that you can use standard SQL syntax to query your data. Athena also supports complex queries and joins across multiple files and directories in your S3 bucket.

Performance: Athena automatically parallelizes queries, so you can run them faster than if you were running them on a single machine. You only pay for the amount of data scanned by each query, so you can control your costs by filtering unnecessary data from your queries.
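A back-of-envelope cost estimate makes the pay-per-scan model concrete. The $5-per-TB figure is the commonly cited Athena rate in many regions at the time of writing; check the current price list before relying on it.

```python
def athena_query_cost(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Rough Athena cost estimate: billed per TB of data scanned."""
    tb = bytes_scanned / 1_000_000_000_000
    return round(tb * usd_per_tb, 6)

# Scanning 200 GB costs about $1 at $5/TB. Columnar formats like Parquet
# and partition pruning can cut bytes_scanned, and therefore cost, sharply.
cost = athena_query_cost(200_000_000_000)
# cost == 1.0
```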

Integration with Other Services: Athena integrates well with other AWS services. For example, you can use AWS Glue to catalog your data in S3 and create a table definition for it in the AWS Glue Data Catalog. Then, you can use Athena to query the data using the table definition. Athena can also integrate with Amazon QuickSight, a business intelligence service, to visualize your query results.

Remember that while Athena is serverless, you still need to consider security when querying sensitive data. Always encrypt your data in S3 and use IAM roles and policies to control who can access your data.

Redshift:
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data efficiently using your existing business intelligence tools. It is optimized for data sets ranging from a few hundred gigabytes to a petabyte or more.

Here’s how Redshift interacts with S3 and performs complex queries:

Querying S3 Data: Redshift uses a feature called Redshift Spectrum to query data directly in S3. When you issue a query, it goes to the Redshift SQL endpoint, which generates and optimizes a query plan. Redshift determines what data is local and what is in S3, generates a plan to minimize the amount of S3 data that needs to be read, and then requests Redshift Spectrum workers out of a shared resource pool to read and process the data from S3.

Parallel Processing and Distribution: Redshift uses a cluster-based architecture consisting of a set of nodes, each with CPU, storage, and RAM. A leader node ingests queries and delegates them to compute nodes, which process the data and return results. This parallel, distributed design lets analysts process massive datasets.

Complex Queries: Because Redshift is based on PostgreSQL, it can run any type of data model, from a production transaction system third-normal-form model to star and snowflake schemas, data vault, or simple flat tables. This means you can run complex analytical queries across all of them without redesigning your data.

Machine Learning and Integration: Redshift enables business analysis using built-in machine learning capabilities and other Amazon tools like QuickSight. It also integrates with Apache Spark, enabling data teams to run more analysis applications on their data warehouse.

Remember that while Redshift is fully managed, you still need to consider security when querying sensitive data. Always encrypt your data in S3 and use IAM roles and policies to control who can access your data.

QuickSight:
Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy to deliver insights to everyone in your organization. It provides an easy way to create and publish interactive dashboards that include machine learning-powered insights.

Here’s how QuickSight works:

Creating Dashboards: With QuickSight, you can create and share dashboards that scale to hundreds of thousands of users on any device. QuickSight offers a range of chart types and allows you to customize the layout of your dashboard. You can also add filters and parameters to your dashboards to provide different views of your data.

Connecting to Data Sources: QuickSight allows you to directly connect to and import data from various cloud and on-premises data sources. These include SaaS applications, third-party databases, native AWS services like Amazon Redshift, Amazon Athena, Amazon S3, Amazon RDS, and Amazon Aurora, and even file types like Excel, CSV, and JSON.

Machine Learning Insights: QuickSight uses machine learning algorithms to identify patterns and trends in your data. This can help you gain deeper insights into your data and make more informed decisions. For example, QuickSight can predict future sales based on historical data, or identify anomalies in your data that might indicate fraudulent activity.

Embedding Analytics: QuickSight allows you to embed interactive dashboards and visualizations into your applications. This means you can blend analytics seamlessly into your application without needing to build your own analytics capabilities. You can also personalize the look and feel of your reports and dashboards using QuickSight Embedded Analytics themes.

Mobile App Support: QuickSight Mobile for iOS and Android helps you securely get insights from your data from anywhere. You can favorite, browse, and interact with all your dashboards in a straightforward mobile-optimized experience.

Remember that while QuickSight is serverless, you still need to consider security when sharing insights with others. Always encrypt your data and use IAM roles and policies to control who can access your data.

Sample Architecture for Building Automated Pipeline Corrosion Monitoring with AWS IoT Core

The components of this solution are:

  • Ultrasonic sensors — These sensors are mounted across the pipeline at periodic intervals to capture pipeline thickness. Also, these sensors publish data to the IoT gateway.
  • IoT Gateway — This is used to ingest data from the individual ultrasonic sensors and in turn publish sensor data under a topic via MQTT protocol to be consumed by AWS IoT Core.
  • AWS IoT Core — AWS IoT Core subscribes to the IoT topics published by the IoT gateway and ingests data into the AWS Cloud for analysis and storage.
  • AWS IoT rule — Rules give your devices the ability to interact with AWS services. Rules are analyzed and actions are performed based on the MQTT topic stream. Here, an Amazon SNS action is used to trigger an email notification when pipe thickness goes below the configured value.
  • AWS Lambda — Used to load data to S3.
  • Amazon SNS — Sends notifications to the operations team as necessary.
  • Amazon S3 — Stores sensor data, mapping information and corresponding metadata like timestamps, sensor region and measurement location.
  • Amazon Athena — Queries Amazon S3 using standard SQL to analyze data.
  • Amazon QuickSight — Helps visualize and provide insights on the sensor data.
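The thickness check behind the SNS notification in this architecture boils down to a simple threshold comparison, which the rule's SQL WHERE clause would express. A sketch (the 8.0 mm minimum and the field names are illustrative values, not from a real deployment):

```python
MIN_THICKNESS_MM = 8.0  # illustrative configured minimum

def should_alert(reading: dict, threshold_mm: float = MIN_THICKNESS_MM) -> bool:
    """True when a pipeline wall reading falls below the configured minimum."""
    return reading["thickness_mm"] < threshold_mm

ok = should_alert({"sensor_id": "px-14", "thickness_mm": 9.2})    # healthy
worn = should_alert({"sensor_id": "px-15", "thickness_mm": 6.7})  # alert
```

In the AWS IoT rule this would be the WHERE clause, with the SNS action publishing the matching reading to a topic that emails the operations team.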

In conclusion, Amazon Web Services (AWS) provides a suite of powerful services that allow you to store, analyze, and visualize data in a scalable and efficient manner. Services like Amazon S3, AWS Glue, AWS Lambda, Amazon Athena, Amazon Redshift, and Amazon QuickSight work together to form a comprehensive data analytics pipeline. They enable you to collect and store data from various devices and embedded systems, catalog and transform that data, run complex queries, and deliver valuable insights.

Stackademic 🎓

Thank you for reading until the end.