
This workshop uses log data to help practitioners build an understanding of Cloudera Data Flow (CDF) and to show the value that data-in-motion use cases can bring to the enterprise.


Cloudera Data Flow Hands-On-Lab


Version: 1.0.1, 25th April 2023


HOL1
This document guides students through the hands-on lab for Cloudera Data Flow. It walks you step by step through completing the prerequisites and delivering this demo.

▶ Recording

You can find the recording here - CDF-Recording.

Pre-Requisites (Taken Care of by the Instructor)

To make the workshop easier to run in the time available, some of the setup steps have already been completed for you. The prerequisites that need to be in place are:

(1) Streams Messaging Data Hub Cluster should be created and running.
(2) Stream analytics Data Hub cluster should be created and running.
(3) Data provider should be configured in SQL Stream Builder.
(4) Have access to the file syslog-to-kafka.json.
(5) Environment should be enabled as part of the CDF Data Service.

Step 0: Instructor Explanation

Step 0 covers verifying access and connections before beginning the actual steps.
Your instructor will guide you through this.
(1) Credentials: Participants must enter their First Name, Last Name & Company details and make a note of the corresponding Workshop Login Username, Workshop Login Password and CDP Workload User to be used in this workshop.

Step 1: Introduction & Set Up

Step 1.1: Define Workload Password

If you are not already logged in, use the login URL: Workshop login.
Enter the Workshop Login Username and Workshop Login Password that you obtained as part of Step 0.
(Note: your Workshop Login Username will be something like wuser00@workshop.com and not just wuser00.)

1 1

You should land on the CDP Public Cloud home page shown below.

1 2

You will need to define your workload password that will be used to access non-SSO interfaces. Please keep a note of this workload password. If you have forgotten it, you will be able to repeat this process and define another one.
You may read more about workload passwords here.

  1. Click on your user name (Ex: wuser00@workshop.com) at the lower left corner.

  2. Click on the Profile option.

1 3

  3. Click the Set Workload Password option.

  4. Enter a suitable Password and Confirm Password.

  5. Click the Set Workload Password button.

1 4

1 5


Check that you get the message Workload password is currently set, or alternatively look for the text (Workload password is currently set) next to Workload Password.

1 6

Step 1.2: Verify permissions in Apache Ranger

NOTE: THESE STEPS HAVE ALREADY BEEN DONE FOR YOU, THIS SECTION WILL WALK YOU THROUGH HOW PERMISSIONS/POLICIES ARE MANAGED IN RANGER. PLEASE DO NOT EXECUTE THE STEPS IN THIS SECTION OR CHANGE ANYTHING.

Step 1.2.1: Accessing Apache Ranger

Click on the Management Console icon.
1 2 1 1

Click on the Environments tab on the left pane. Select the environment that is shared by the instructor (Ex: emeaworkshop-environ).
1 2 1 2

Click on the Ranger quick link to access the Ranger UI.
1 2 1 3
1 2 1 4

Step 1.2.2: Kafka Permissions

In the Ranger UI, select the Kafka repository that's associated with the Streams Messaging Data Hub.
1 2 2 1

Verify that the user group workshop-users (all workshop participants are part of this group) is present in both the all-consumergroup and all-topic policies.

The below image reflects the group workshop-users being part of all-consumergroup.
1 2 2 2

The below image reflects the group workshop-users being part of all-topic.
1 2 2 3

Step 1.2.3: Schema Registry Permissions

In Ranger, select the SCHEMA-REGISTRY repository that's associated with the Streams Messaging Data Hub.
1 2 3 1

Verify that the user group workshop-users (all workshop participants are part of this group) is present in the policy all - schema-group, schema-metadata, schema-branch, schema-version.
1 2 3 2

You can scroll down this screen to review the policy in more detail.
1 2 3 3

Click the close icon at the top of the Ranger tab to exit this page.
1 2 3 4

Click on Leave option in the pop-up.
1 2 3 5

Step 1.3: Download Resources from GitHub

Scroll to the top of this repository (https://github.com/DashDipti/cdf-workshop), click <> Code, and then choose the Download ZIP option.
1 4 1

Use any unzip utility to extract the contents of the downloaded partner-summit-2023-main.zip file.
1 4 2
1 4 3

In the extracted content, make sure there is a file named syslog-to-kafka.json, roughly 24 KB in size. You will need this file in a later step.
1 4 4

Step 2: Create a Flow using Flow Designer

Creating a data flow for CDF-PC follows the same process as creating any data flow within NiFi, with three very important requirements.
(a) The data flow that would be used for CDF-PC must be self-contained within a process group.
(b) Data flows for CDF-PC must use parameters for any property on a processor that is modifiable, e.g. user names, Kafka topics, etc.
(c) All queues need to have meaningful names (instead of Success, Fail, and Retry). These names will be used to define Key Performance Indicators in CDF-PC.

Step 2.1: Building the Data Flow using Flow Designer

Step 2.1.1: Create the canvas to design your flow

Access the DataFlow data service from the Management Console.
2 1 1 1
2 1 1 1a

Go to the Flow Design.
2 1 1 2

Click on Create Draft (This will be the main process group for the flow that you’ll create).
2 1 1 3

Select the appropriate environment as part of the Workspace name (Ex: emeaworkshop-environ).
Give your flow a name with your username as prefix (Ex: wuser00_datadump_flow).
Click on CREATE.
2 1 1 4

On successful creation of the Draft, you should now be redirected to the canvas on which you can design your flow.
2 1 1 5

Step 2.1.2: Adding new parameters

Click on the Flow Options on the top right corner of your canvas and then select Parameters.
2 1 2 1

Configure Parameters: Parameters are reused within the flow multiple times and will also be configurable at the time of deployment.
There are two options available: Add Parameter, which is used for specifying non-sensitive values, and Add Sensitive Parameter, which is used for specifying sensitive values such as passwords.

  • Click on Add Parameter.
    2 1 2 2

Add the following parameter.
Name: S3 Directory.
Value: LabData.
Click on Apply.
2 1 2 3

  • Click on Add Parameter.
    2 1 2 4

Add the following parameter.
Name: CDP Workload User.
Value: The username assigned to you. Ex: wuser00.
Click on Apply.
2 1 2 5

  • Click on Add Sensitive Parameter.
    2 1 2 6

Add the following parameter.
Name: CDP Workload User Password.
Value: Workload User password set by you in 'Step 1.1: Define Workload Password'.
Click on Apply.
2 1 2 7

Click on Apply Changes, and then click OK.
2 1 2 8
2 1 2 9

Click on Back to Flow Designer
2 1 2 10

Now that we have created these parameters, we can easily search and reuse them within our dataflow. This is useful for CDP Workload User and CDP Workload User Password.
NOTE: To search for existing parameters -

  1. Open a processor’s configuration and proceed to the properties tab.

  2. Enter: #{

  3. Hit Ctrl+Spacebar.

This will bring up a list of existing parameters that are not tagged as sensitive.

Step 2.1.3: Create the flow

Let’s go back to the canvas to start designing our flow. This flow will contain two processors:
GenerateFlowFile: Generates random data.
PutCDPObjectStore: Loads data into the environment's object store (S3).

Add GenerateFlowFile processor: Drag a Processor onto the canvas, type GenerateFlowFile in the text box, and once the processor appears, click Add.
2 1 3 1a
2 1 3 2
2 1 3 3

Configure GenerateFlowFile processor: The GenerateFlowFile Processor will now be on your canvas and you can configure it by right clicking on it and selecting Configuration.
2 1 3 4

Fill in the values in the right window pane to configure the processor in the following way.
Processor Name: DataGenerator
Scheduling Strategy: Timer Driven
Run Duration: 0ms
Run Schedule: 30 sec
Execution: All Nodes
Properties: Custom Text

<26>1 2021-09-21T21:32:43.967Z host1.example.com application4 3064 ID42 [exampleSDID@873 iut="4" eventSource="application" eventId="58"] application4 has stopped unexpectedly

The above represents a syslog message in RFC 5424 format. Subsequent portions of this workshop will leverage this same syslog format.
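
As a side note (not a lab step), the sketch below shows how the fields of such a message line up with the schema used later in this workshop. It is a minimal, hypothetical Python example with a simplified RFC 5424 regex (single structured-data element, no NIL values); the printed field names are illustrative only.

import re

# The sample syslog line from the Custom Text property above.
sample = ('<26>1 2021-09-21T21:32:43.967Z host1.example.com application4 3064 ID42 '
          '[exampleSDID@873 iut="4" eventSource="application" eventId="58"] '
          'application4 has stopped unexpectedly')

# Simplified RFC 5424 layout: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID [SD] MSG
pattern = re.compile(
    r'<(?P<pri>\d+)>(?P<version>\d+) (?P<timestamp>\S+) (?P<hostname>\S+) '
    r'(?P<appname>\S+) (?P<procid>\S+) (?P<msgid>\S+) \[(?P<sd>[^\]]*)\] (?P<msg>.*)')

m = pattern.match(sample)
pri = int(m.group('pri'))

# RFC 5424 defines PRI = facility * 8 + severity.
print('facility =', pri // 8)           # 3
print('severity =', pri % 8)            # 2
print('hostname =', m.group('hostname'))
print('appName  =', m.group('appname'))
print('body     =', m.group('msg'))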
2 1 3 5
2 1 3 6

Click on Apply.
2 1 3 7

Add PutCDPObjectStore processor: Pull a new Processor onto the canvas and type PutCDPObjectStore in the text box, and once the processor appears click on Add.
2 1 3 8
2 1 3 9

Configure PutCDPObjectStore processor: The PutCDPObjectStore Processor will now be on your canvas and you can configure it by right clicking on it and selecting Configuration.
2 1 3 10

Configure the processor in the following way.
Processor Name : Move2S3
Scheduling Strategy : Timer Driven
Run Duration : 0ms
Run Schedule : 0 sec
Execution : All Nodes
Properties
Directory : #{S3 Directory}
CDP Username : #{CDP Workload User}
CDP Password : #{CDP Workload User Password}
Relationships: Check the Terminate box under success.

2 1 3 11
2 1 3 12

Click on Apply.
2 1 3 13

Create connection between processors: Connect the two processors by dragging the arrow from the DataGenerator processor to the Move2S3 processor and selecting the success relationship. Then click Add.
2 1 3 14
2 1 3 15

Your flow should look something like below.
2 1 3 16

The Move2S3 processor does not yet know what to do in case of a failure. Let's handle this by looping the failure relationship back to the processor so failed files are retried. This can be done by dragging the arrow on the Move2S3 processor outwards and then back to itself, as shown below.
2 1 3 17

Then select the option failure. Click on Add.
2 1 3 18
2 1 3 19

Step 2.1.4: Renaming the queues

Naming the queues: Providing unique names to all queues is very important, as they are used to define Key Performance Indicators (KPIs) upon which CDF-PC will auto-scale. To name a queue, double-click the queue and give it a unique name. A best practice is to start with the existing relationship name (i.e. success, failure, retry, etc.) and add the source or destination processor information.

For example, the success queue between DataGenerator and Move2S3 is named success_Move2S3. Click Apply.
2 1 4 1

The failure queue for Move2S3 is named failure_Move2S3. Click Apply.
2 1 4 2

Step 2.2: Testing the flow

Testing the Data Flow: To test the flow we need to first start the test session. Click on Flow Options on the top right corner and then click Start under Test Session section.
2 2 1

In the next window, click Start Test Session.
2 2 2

The activation should take about a couple of minutes. While this happens, you will see this at the top right corner of your screen.
2 2 3

Once the Test Session is ready you will see the following message on the top right corner of your screen.
2 2 4

Run the flow by right clicking the empty part of the canvas and selecting Start.
2 2 5

Both processors should now be in the Started state. This can be confirmed by the green play icon shown against each processor.
2 2 6

You will now see files arriving in the folder specified as the Directory, inside the S3 bucket that is the base data store for this environment.
2 2 7

Step 2.3: Moving the flow to the flow catalog

After the flow has been created and tested, we can now Publish the flow to the Flow Catalog.
Stop the current test session by clicking on the green tab on top right corner indicating Active Test Session. Click on End.
2 3 1
2 3 2
2 3 3

Once the session stops click on Flow Options on the top right corner of your screen and click on Publish.
2 3 4

Give your flow a unique name and click on Publish.
Flow Name: {user_id}_datadump_flow (Ex: wuser00_datadump_flow).
2 3 5

The flow will now be visible on the Flow Catalog and is ready to be deployed.
2 3 6

Step 2.4: Deploying the flow

Go to the Catalog and search the Flow Catalog by typing the name of the flow that you just published.
2 4 1

Click on the flow and you should see the option to Deploy. Click on Version 1 and then Deploy.
2 4 2

Select the CDP Target Environment from the drop-down. Make sure you select the environment given by the instructor (Ex: emeaworkshop-environ). Click Continue.
2 4 3

Deployment Name: Give a unique name to the deployment. Click Next →.
Deployment Name: {user_id}_flow_prod (Ex: wuser00_flow_prod).
2 4 4

Set the NiFi Configuration. In this step, leave everything at the default and click Next →.
2 4 5

Set the Parameters and click Next.
CDP Workload User: The username assigned to you. Ex: wuser00.
CDP Workload User Password: Workload User password set by you in 'Step 1.1: Define Workload Password'.
CDP Environment : DummyParameter
S3 Directory: LabData
2 4 6

Set the cluster size.
Select the Extra Small size and click Next. In this step you can configure how your flow will auto scale, but keep it disabled for this lab.
2 4 7

Add Key Performance indicators: Set up KPIs to track specific performance metrics of a deployed flow. Click on Add New KPI.
2 4 8

In the Add New KPI window, fill in the details as below.
KPI Scope: Connection.
Connection Name: failure_Move2S3.
Metric to Track: Percent Full.
Check box against Trigger alert when metric is greater than: 50 Percent.
Alert will be triggered when metric is outside the boundary(s) for: 2 Minutes.
Click on Add.
2 4 9

Click Next.
2 4 10

Click Deploy.
The Deployment Initiated message will be displayed. Wait until the flow deployment is completed, which might take a few minutes.
2 4 11

When deployed, the flow will show up on the Data flow dashboard, as below.
2 4 12
2 4 13

Data is now being written to the S3 bucket. Your instructor will be able to see this; as a participant, you don't have access to the bucket.
2 4 14

Also, after a while, the flow you just deployed will look something like the image below.
2 4 15

Step 2.5: Verifying flow deployment

Click on the flow in the Dashboard and select Manage Deployment.
2 5 1
2 5 2

Manage KPI and Alerts: Click on the KPI and Alerts tab under Deployment Settings to get the list of KPIs that have been set. You also have an option to modify or add more KPIs to your flow here.
2 5 3

Manage Sizing and Scaling: Click on the Sizing and Scaling tab to get detailed information.
2 5 4

Manage Parameters: The parameters that we earlier created can be managed from the Parameters tab. Click on Parameters.
2 5 5

NiFi Configurations: Any NiFi-specific configuration you have set will show up on the NiFi Configuration tab.
2 5 6

Click on Actions and then click on View in NiFi. This will open the flow in the NiFi UI.
2 5 7
2 5 8

We will now suspend this flow. To do so click on Dashboard and look for the flow that you deployed a while ago (Ex: wuser00_flow_prod).
2 5 9

Click on Actions and then Suspend Flow.
2 5 10

Click Suspend Flow in the confirmation pop-up.
2 5 11

Observe the change in the status of the flow.
2 5 12
2 5 13

Step 3: Migrating Existing Data Flows to CDF-PC

The purpose of this part of the workshop is to demonstrate how externally developed NiFi flows (e.g. built on developers' laptops or pushed from a code repository) can be migrated to CDF-PC. It leverages an existing NiFi flow template that has been designed with the best practices for CDF-PC flow deployment.

The existing NiFi flow will perform the following actions:
- Generate random syslog messages in RFC 5424 format.
- Convert the incoming data to JSON using record writers.
- Apply a SQL filter to the JSON records.
- Send the transformed syslog messages to Kafka.

Note that a parameter context has already been defined in the flow and the queues have been uniquely named.

For this, we will leverage the Data Hubs that have already been created: ssb-analytics-cluster-emea and kafka-smm-cluster-emea.
Note that the above names might be different depending upon your environment.

Step 3.1: Create a Kafka Topic

Go to the Environments tab as shown in the screenshot. Click on your environment (Ex: emeaworkshop-environ).
3 1 1

Click on the Data Hub for Stream Messaging (Ex: kafka-smm-cluster-emea).
3 1 2

Login to Streams Messaging Manager by clicking the appropriate hyperlink in the Streams Messaging Datahub (Ex: kafka-smm-cluster-emea).
3 1 3

Click on Topics in the left tab.
3 1 4

Click on Add New.
3 1 5

Create a Topic with the following parameters and then click Save.
Name: <username>_syslog. Ex: wuser00_syslog.
Partitions: 1
Availability: MODERATE
CLEANUP POLICY: delete
Note: The Flow will not work if you set the Cleanup Policy to anything other than Delete. This is because we are not specifying keys when writing to Kafka.
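
For reference only, the topic could also be created programmatically. The sketch below uses the kafka-python package and is a hypothetical example, not a lab step: the broker address, credentials, replication factor, and any TLS trust-store settings are assumptions that must match your environment, and the SMM UI steps above remain the intended path.

from kafka.admin import KafkaAdminClient, NewTopic

# Assumed values: a broker from the "Obtain the Kafka Broker List" step below and
# your CDP workload credentials. An ssl_cafile argument may also be needed if the
# cluster CA certificate is not already trusted on your machine.
admin = KafkaAdminClient(
    bootstrap_servers='kafka-smm-cluster-emea-corebroker0.emeawork.dp5i-5vkq.cloudera.site:9093',
    security_protocol='SASL_SSL',
    sasl_mechanism='PLAIN',
    sasl_plain_username='wuser00',                 # your CDP Workload User
    sasl_plain_password='<workload-password>',     # set in Step 1.1
)

topic = NewTopic(
    name='wuser00_syslog',                         # <username>_syslog
    num_partitions=1,
    replication_factor=3,                          # assumption; match what SMM would create
    topic_configs={'cleanup.policy': 'delete'},    # must stay 'delete' for this lab
)
admin.create_topics([topic])
admin.close()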

3 1 6
3 1 7

You can now search for the topic that you created, as shown here.
3 1 8

Obtain the Kafka Broker List

We will need the broker list to configure our processors to connect to the Kafka brokers, which allows clients to connect and fetch messages by partition, topic, or offset. This information can be found in the Data Hub cluster associated with Streams Messaging Manager. Later in the lab, we will need the list of Kafka brokers (already configured in this environment) at hand so that our data flow can publish to our Kafka topics.

Select Brokers from the left tab.
1 3 4

Save the broker list in a notepad.
1 3 5

Example
kafka-smm-cluster-emea-corebroker2.emeawork.dp5i-5vkq.cloudera.site:9093
kafka-smm-cluster-emea-corebroker0.emeawork.dp5i-5vkq.cloudera.site:9093
kafka-smm-cluster-emea-corebroker1.emeawork.dp5i-5vkq.cloudera.site:9093
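
To illustrate how clients use this broker list, here is a minimal, hypothetical Python connectivity check using the kafka-python package (not a lab step). The broker hostnames, credentials, and any TLS trust-store settings are assumptions that must match your environment.

from kafka import KafkaConsumer

# The broker list noted above, passed as a Python list instead of a comma-separated string.
brokers = [
    'kafka-smm-cluster-emea-corebroker0.emeawork.dp5i-5vkq.cloudera.site:9093',
    'kafka-smm-cluster-emea-corebroker1.emeawork.dp5i-5vkq.cloudera.site:9093',
    'kafka-smm-cluster-emea-corebroker2.emeawork.dp5i-5vkq.cloudera.site:9093',
]

consumer = KafkaConsumer(
    bootstrap_servers=brokers,
    security_protocol='SASL_SSL',                  # the brokers listen with SASL/SSL on port 9093
    sasl_mechanism='PLAIN',
    sasl_plain_username='wuser00',                 # your CDP Workload User
    sasl_plain_password='<workload-password>',     # set in Step 1.1
)

# If authentication and connectivity work, the topic created earlier should be listed.
print(sorted(t for t in consumer.topics() if t.endswith('_syslog')))
consumer.close()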

Step 3.2: Create a Schema in Schema Registry

You now need to work in Schema Registry. Log in to Schema Registry by clicking the appropriate hyperlink in the Streams Messaging Datahub (Ex: kafka-smm-cluster-emea).
3 2 1a

Click on the + button on the top right to create a new schema.
3 2 1c

Create a new schema with the following information.
Name: <username>_syslog. (Ex: wuser00_syslog)
Description: syslog schema for dataflow workshop
Type: Avro schema provider
Schema Group: Kafka
Compatibility: Backward
Evolve: True
Schema Text: Copy and paste the schema text below into the Schema Text field.

{
  "name": "syslog",
  "type": "record",
  "namespace": "com.cloudera",
  "fields": [
    {
      "name": "priority",
      "type": "int"
    },
    {
      "name": "severity",
      "type": "int"
    },
    {
      "name": "facility",
      "type": "int"
    },
    {
      "name": "version",
      "type": "int"
    },
    {
      "name": "timestamp",
      "type": "long"
    },
    {
      "name": "hostname",
      "type": "string"
    },
    {
      "name": "body",
      "type": "string"
    },
    {
      "name": "appName",
      "type": "string"
    },
    {
      "name": "procid",
      "type": "string"
    },
    {
      "name": "messageid",
      "type": "string"
    },
    {
      "name": "structuredData",
      "type": {
        "name": "structuredData",
        "type": "record",
        "fields": [
          {
            "name": "SDID",
            "type": {
              "name": "SDID",
              "type": "record",
              "fields": [
                {
                  "name": "eventId",
                  "type": "string"
                },
                {
                  "name": "eventSource",
                  "type": "string"
                },
                {
                  "name": "iut",
                  "type": "string"
                }
              ]
            }
          }
        ]
      }
    }
  ]
}

Note: The name of the Kafka Topic (Ex: wuser00_syslog) you previously created and the Schema Name must be the same.
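
If you want to sanity-check the schema locally, the optional sketch below parses it with the fastavro Python package and validates a record built from the sample syslog message used earlier. The file name syslog.avsc, the fastavro dependency, and the record values (for example, the timestamp expressed in epoch milliseconds) are assumptions for illustration only.

import json
from fastavro import parse_schema
from fastavro.validation import validate

# The schema text above, saved locally as syslog.avsc (hypothetical file name).
with open('syslog.avsc') as f:
    schema = parse_schema(json.load(f))

# A record derived from the sample message <26>1 2021-09-21T21:32:43.967Z ...
record = {
    'priority': 26, 'severity': 2, 'facility': 3, 'version': 1,
    'timestamp': 1632259963967,                    # 2021-09-21T21:32:43.967Z as epoch millis (assumed unit)
    'hostname': 'host1.example.com',
    'body': 'application4 has stopped unexpectedly',
    'appName': 'application4', 'procid': '3064', 'messageid': 'ID42',
    'structuredData': {'SDID': {'eventId': '58', 'eventSource': 'application', 'iut': '4'}},
}

# Raises an exception if the record does not conform to the schema; prints True otherwise.
print(validate(record, schema))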

Click Save.
3 2 1d
3 2 1e

Operationalizing Externally Developed Data Flows with CDF-PC

Step 3.3 : Import the Flow into the CDF-PC Catalog

Open the CDF-PC data service and click on Catalog in the left tab. Select Import Flow Definition on the Top Right.
4 1 0

Add the following information.
Flow Name: <username>_syslog_to_kafka. (Ex: wuser00_syslog_to_kafka)
Flow Description: Reads Syslog in RFC 5424 format, applies a SQL filter, transforms the data into JSON records, and publishes to Kafka.
NiFi Flow Configuration: syslog-to-kafka.json (From the resources downloaded earlier).
Version Comments: Initial Version.

4 1 1
4 1 2

Click Import.
4 1 3
4 1 4

Step 3.4: Deploy the Flow in CDF-PC

Search for the flow in the Flow Catalog by typing the flow name that you created in the previous step.
4 2 1

Click on the flow and you should see the following. A Deploy option should appear shortly; click on Deploy.
4 2 2

Select the CDP Target Environment (Ex: emeaworkshop-environ) where this flow will be deployed, then click Continue.
4 2 3

Give the deployment a unique name (Ex: {user_id}_syslog_to_kafka), then click Next.
4 2 4

In the NiFi Configuration screen, click Next → to take the default parameters.
4 2 5

Add the Flow Parameters as below. Note that you might have to navigate through multiple screens to fill them in. Then click Next.

CDP Workload User: The workload username for the current user. (Ex: wuser00)
CDP Workload Password: The workload password for the current user (This password was set by you earlier).
Filter Rule: SELECT * FROM FLOWFILE.
Kafka Broker Endpoint: The list of Kafka Brokers previously noted, which is comma separated as shown below.
Example: kafka-smm-cluster-emea-corebroker2.emeawork.dp5i-5vkq.cloudera.site:9093,kafka-smm-cluster-emea-corebroker0.emeawork.dp5i-5vkq.cloudera.site:9093,kafka-smm-cluster-emea-corebroker1.emeawork.dp5i-5vkq.cloudera.site:9093
Kafka Destination Topic: <username>_syslog (Ex: wuser00_syslog)
Kafka Producer ID: nifi_dfx_p1
Schema Name: <username>_syslog (Ex: wuser00_syslog)
Schema Registry Hostname: The hostname of the master server in the Kafka Datahub (Ex: kafka-smm-cluster-emea) (Refer screenshot below).
Example: kafka-smm-cluster-emea-master0.emeawork.dp5i-5vkq.cloudera.site
4 2 6a
4 2 6b

Click Next.
4 2 7

On the next page, define sizing and scaling details and then click Next.
Size: Extra Small
Auto Scaling: Enabled
Min Nodes: 1
Max Nodes: 3
4 2 8

Skip the KPI page by clicking Next and Review your deployment. Then Click Deploy.
4 2 9
4 2 10

Proceed to the CDF-PC Dashboard and wait for your flow deployment to complete. A Green Check Mark will appear once complete, which might take a few minutes.
4 2 11
4 2 12

Click into your deployment and then Click Manage Deployment on the top right to view System Metrics.
4 2 13

Step 4: SQL Stream Builder (SSB)

The purpose of this part of the workshop is to demonstrate streaming analytics capabilities using SQL Stream Builder. We will leverage the NiFi flow deployed in CDF-PC in the previous step and demonstrate how to query live data and subsequently sink it to another location. The SQL query will leverage the existing syslog schema in Schema Registry.

Step 4.1: KeyTab

To run queries in SQL Stream Builder, you need to have your keytab unlocked. This is mainly for authentication purposes. Because the credential you are using is sometimes reused by other people doing the same lab, it is possible that your keytab is already unlocked. We have shared the steps for both scenarios.

Step 4.1 (a): Unlock your KeyTab

Click on Environments in the left pane. Click on the environment assigned to you (Ex: emeaworkshop-environ).
1 5 1

Click on the Data Hub cluster for stream analytics (Ex: ssb-analytics-cluster-emea).
1 5 2

Open the SSB UI by clicking on Streaming SQL Console.
1 5 3

Click on the User name (Ex: wuser00) at the bottom left of the screen and select Manage Keytab. Make sure you are logged in as the username that was assigned to you.
1 5 4

Enter your Workload Username under Principal Name * and workload password that you had set earlier (In Step 1.1: Define Workload Password) in the Password * field.
1 5 5

Click on Unlock Keytab and look for the message 'Success KeyTab has been unlocked'.
1 5 6
1 5 7

Step 4.1 (b): Reset your KeyTab

Click on Environments in the left pane. Click on the environment assigned to you (Ex: emeaworkshop-environ).
1 5 1

Click on the Data Hub cluster for stream analytics (Ex: ssb-analytics-cluster-emea).
1 5 2

Open the SSB UI by clicking on Streaming SQL Console.
1 5 3

Click on the User name (Ex: wuser00) at the bottom left of the screen and select Manage Keytab. Make sure you are logged in as the username that was assigned to you.
1 5 4

If you get the following dialog box, it means that your Keytab is already UNLOCKED.
1 6 1

In that case, you need to reset it by locking it and then unlocking it again using your newly set workload password. Enter your CDP Workload Username in Principal Name (Ex: wuser00) and click on Lock Keytab.
1 6 2

You will get the following message Success KeyTab has been locked.
1 6 3

Now do the following.
Click on the User name (Ex: wuser00) at the bottom left of the screen and select Manage Keytab. Make sure you are logged in as the username that was assigned to you.
1 5 4

Enter your Workload Username under Principal Name * and workload password that you had set earlier (In Step 1.1: Define Workload Password) in the Password * field.
1 5 5

Click on Unlock Keytab and look for the message Success KeyTab has been unlocked.
1 5 6
1 5 7

Step 4.2 : Working on SQL Stream Builder Project

If you are not already on the SQL Stream Builder interface, you can reach it by following the next two screenshots; otherwise, continue from the third screenshot.

Go to the SQL Stream Builder UI: the SSB interface can be reached from the Data Hub that is running Streaming Analytics, in our case ssb-analytics-cluster-emea.
Within the Data Hub, click on Streaming SQL Console.
5 1a
5 1b

Create a new project: Create a SQL Stream Builder (SSB) Project by clicking New Project using the following details.
Name: {user_id}_hol_workshop. (Ex: wuser00_hol_workshop).
Description: SSB Project to analyze streaming data.
5 1c

Click Create.
5 1d

Switch to the created project (Ex: wuser00_hol_workshop). Click on Switch.
5 1e

If a pop-up appears, select Switch Project.
5 1f

You will see a screen something like the one below.
5 1g

Create Kafka Data Store: Create Kafka Data Store by selecting Data Sources in the left pane, clicking on the three-dotted icon next to Kafka, then selecting New Kafka Data Source.
5 1h

Name: {user-id}_cdp_kafka. (Ex: wuser00_cdp_kafka)
Brokers: (Comma-separated List as shown below)
kafka-smm-cluster-emea-corebroker2.emeawork.dp5i-5vkq.cloudera.site:9093,kafka-smm-cluster-emea-corebroker0.emeawork.dp5i-5vkq.cloudera.site:9093,kafka-smm-cluster-emea-corebroker1.emeawork.dp5i-5vkq.cloudera.site:9093

Protocol: SASL/SSL
SASL Username: <workload-username>. (Ex: wuser00).
SASL Mechanism: PLAIN.
SASL Password: Workload User password set by you in Step 1.1: Define Workload Password.
5 1i
5 1j

Click on Validate to test the connection. Once it is successful, click on Create.
5 1k

Create Kafka Table: In the left pane, click the three-dotted icon next to Virtual Tables, then click on New Kafka Table.
5 2a

Configure the Kafka Table using the details below.
Table Name: {user-id}_syslog_data. (Ex: wuser00_syslog_data)
Kafka Cluster: <select the Kafka data source you created previously>. (Ex: wuser00_cdp_kafka)
Data Format: JSON.
Topic Name: <select the topic you created earlier in Streams Messaging Manager>. (Ex: wuser00_syslog)
5 2b

When you select Data Format as AVRO, you must provide the correct Schema Definition when creating the table for SSB to be able to successfully process the topic data. For JSON tables, though, SSB can look at the data flowing through the topic and try to infer the schema automatically, which is quite handy at times. Obviously, there must be data in the topic already for this feature to work correctly.

Note: SSB tries its best to infer the schema correctly, but this is not always possible and sometimes data types are inferred incorrectly. You should always review the inferred schema to check that it is correct and make any necessary adjustments.

Since you are reading data from a JSON topic, go ahead and click on Detect Schema to get the schema inferred. You should see the schema be updated in the Schema Definition tab.
5 2c

You will also notice that a "Schema is invalid" message appears upon the schema detection. If you hover the mouse over the message, it shows the reason.
5 3
You will fix this in the next step.

Each record read from Kafka by SSB has an associated timestamp column of data type TIMESTAMP ROWTIME. By default, this timestamp is sourced from the internal timestamp of the Kafka message and is exposed through a column called eventTimestamp. However, if your message payload already contains a timestamp associated with the event (event time), you may want to use that instead of the Kafka internal timestamp.

In this case, the syslog message has a field called timestamp that contains the timestamp you should use. You want to expose this field as the table’s event_time column. To do this, click on the Event Time tab and enter the following properties.
Use Kafka Timestamps: Disable.
Input Timestamp Column: timestamp.
Event Time Column: event_time.
Watermark Seconds: 3.
5 4

Now that you have configured the event time column, click on Detect Schema again. You should see the schema turn valid.
5 5

Click the Create and Review button to create the table.
5 6

Review the table’s DDL and click Close.
5 7

Create a Flink Job, by selecting Jobs in the left pane, clicking on the three-dotted icon next to it, then clicking on New Job.
5 8

Give a unique job name (Ex: wuser00_flink_job) and click Create.
5 9
5 10

Add the following SQL Statement in the Editor.

SELECT * FROM {user-id}_syslog_data WHERE severity <=3

Run the Streaming SQL Job by clicking Execute. Also, ensure your {user_id}_syslog_to_kafka flow is running in CDF-PC.
5 11

In the Results tab, you should see syslog messages with severity levels <= 3.
5 12
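
If you would like to cross-check the SSB results outside the console, the optional sketch below consumes the same topic with the kafka-python package and applies the equivalent severity filter on the client side. The topic name, broker address, credentials, and TLS trust-store settings are assumptions that must match your environment.

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'wuser00_syslog',                              # <username>_syslog
    bootstrap_servers='kafka-smm-cluster-emea-corebroker0.emeawork.dp5i-5vkq.cloudera.site:9093',
    security_protocol='SASL_SSL',
    sasl_mechanism='PLAIN',
    sasl_plain_username='wuser00',                 # your CDP Workload User
    sasl_plain_password='<workload-password>',     # set in Step 1.1
    auto_offset_reset='earliest',
    consumer_timeout_ms=10000,                     # stop iterating after 10s with no new messages
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

# Same condition as the SSB job: keep only records with severity <= 3.
for msg in consumer:
    if msg.value.get('severity', 99) <= 3:
        print(msg.value['hostname'], msg.value['severity'], msg.value['body'])

consumer.close()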
