A Simple Guide to Monitoring Your Applications in a Service Fabric Cluster

In this blog post, we will look at how to use Log Analytics to effectively monitor and manage your applications in a Service Fabric cluster.

Here are the key takeaways of this blog post –

  • Using the OMS extension to collect counters related to memory utilization, CPU utilization, etc. from the nodes and applications in a Service Fabric cluster.
  • Configuring Kusto queries to process data from Azure Monitor and Azure Application Insights.

Scenario –

A number of .NET Service Fabric applications are deployed in a Service Fabric cluster, and a monitoring solution needs to be configured so that an alert is raised whenever the memory or CPU utilization of any application exceeds a threshold.

Analysis –

OMS extension to collect different counters from nodes and applications in Service fabric cluster related to memory and CPU utilization

When it comes to monitoring Azure virtual machines (VMs), it is useful to use Log Analytics, also known as OMS (Operations Management Suite). Its wide range of solutions can monitor various services in Azure and also allows us to respond to events using Azure Monitor alerts. With OMS dashboards, we can control events, visualize log searches, and share custom logs with others.

You can configure one in your Azure subscription by following the official documentation, which has three key steps –

  • Deploy a Log Analytics workspace
  • Connect the Log Analytics workspace to your Service Fabric cluster
  • Deploy Azure Monitor logs

Note – Once you have created the Log Analytics workspace, you will need to install the OMS extension on your virtual machine scale set (VMSS) like this –

{
    "name": "OMSExtension",
    "properties": {
        "autoUpgradeMinorVersion": true,
        "publisher": "Microsoft.EnterpriseCloud.Monitoring",
        "type": "MicrosoftMonitoringAgent",
        "typeHandlerVersion": "1.0",
        "settings": {
            "workspaceId": "enter workspace id here",
            "stopOnMultipleConnections": "true"
        },
        "protectedSettings": {
            "workspaceKey": "enter Key value"
        }
    }
}
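
If you generate your deployment template from a script, the fragment above can be built programmatically. The following is only a minimal sketch: the helper name `build_oms_extension` is hypothetical (it is not part of any Azure SDK), and it simply reproduces the JSON shown above with the workspace ID and key passed in as parameters.

```python
import json


def build_oms_extension(workspace_id: str, workspace_key: str) -> dict:
    """Return the OMS extension fragment shown above as a Python dict."""
    return {
        "name": "OMSExtension",
        "properties": {
            "autoUpgradeMinorVersion": True,
            "publisher": "Microsoft.EnterpriseCloud.Monitoring",
            "type": "MicrosoftMonitoringAgent",
            "typeHandlerVersion": "1.0",
            "settings": {
                "workspaceId": workspace_id,
                "stopOnMultipleConnections": "true",
            },
            # The workspace key is a secret, so it stays under protectedSettings.
            "protectedSettings": {"workspaceKey": workspace_key},
        },
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own workspace ID and key.
    fragment = build_oms_extension("<workspace id>", "<workspace key>")
    print(json.dumps(fragment, indent=4))
```

Keeping the key as a function argument (rather than hard-coding it) makes it easier to read it from a secret store at deployment time.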

Once this is done, you should start seeing the VMs in the Connected Agents blade of the Log Analytics workspace.

Then go to the Agents Configuration blade of the Log Analytics workspace to specify which counters you would like the OMS agent to collect, as below:

The counters above are samples; you can add counters based on your requirements.

For memory and CPU utilization, the ones we are interested in are marked in the image with a tick mark, i.e. Working Set (for memory utilization) and % Processor Time (for CPU utilization).

After the counters are added, wait a few minutes for them to be collected. You can then go to the Logs blade of the Log Analytics workspace, where you will see the Perf table containing the values of the configured counters.

Configuring Kusto queries to process data from Azure Monitor and Azure Application Insights

Azure defines a Kusto query as a read-only request to process data and return results. Truth be told, it is really easy to write, which makes it all the more powerful.

My Kusto query below (for memory utilization) works on data collected over the last 5 minutes and returns a summary of results based on the total memory consumed, the name of the node where the application runs, and the name of the application.

The threshold I have set is 7 GB, so any time an application in my Service Fabric cluster consumes more than 7 GB of memory, I will get alerted.

Perf
| where TimeGenerated > ago(5m)
| where CounterName == "Working Set"
| where InstanceName has "<WildCard for your list of application names>"
| project TimeGenerated, CounterName, CounterValue, Computer, InstanceName
| summarize UsedMemory = avg(CounterValue) by CounterName, bin(TimeGenerated, 5m), Computer, InstanceName
// Threshold: 7 GB, i.e. 7000000000 bytes
| summarize by UsedMemory, Computer, InstanceName
| where UsedMemory > 7000000000

The above query works perfectly well if you are okay with being alerted whenever the average (aggregated) memory utilization over the last 5 minutes exceeds 7 GB.

But in some scenarios it is essential that alerts go out whenever the real-time value of memory utilization crosses the threshold at any point in time. Below is a query that looks at the individual real-time values rather than aggregated ones.
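
The Working Set counter is reported in bytes, so the threshold has to be written out as a byte count. As a quick sanity check on the number of zeros, here is a small sketch; the helper `gb_to_bytes` is a hypothetical name, and whether you want decimal GB or binary GiB is your call (the query above uses the decimal value).

```python
GB = 10**9    # decimal gigabyte
GIB = 2**30   # binary gibibyte


def gb_to_bytes(gigabytes: float, binary: bool = False) -> int:
    """Convert a size in GB (or GiB when binary=True) to bytes."""
    return int(gigabytes * (GIB if binary else GB))


print(gb_to_bytes(7))               # 7000000000
print(gb_to_bytes(7, binary=True))  # 7516192768
```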

Perf
| where TimeGenerated > ago(1m)
| where CounterName == "Working Set"
| where InstanceName has "Apttus"
| project TimeGenerated, CounterName, CounterValue, Computer, InstanceName
| where CounterValue > 7000000000
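
To see why the two approaches can disagree, consider the following sketch. The sample values are made up for illustration: one short spike crosses the 7 GB threshold, but the 5-minute average does not, so only the per-sample check would raise an alert.

```python
from statistics import mean

THRESHOLD_BYTES = 7_000_000_000

# Hypothetical Working Set samples (bytes) for one process over 5 minutes.
samples = [
    5_200_000_000,
    5_400_000_000,
    7_600_000_000,  # short spike above the threshold
    5_300_000_000,
    5_100_000_000,
]

# Aggregated check: compare the average of the window with the threshold.
average_breach = mean(samples) > THRESHOLD_BYTES
# Real-time check: compare every individual sample with the threshold.
spike_breach = any(value > THRESHOLD_BYTES for value in samples)

print(average_breach)  # False
print(spike_breach)    # True
```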

Similarly, you can write a Kusto query to calculate CPU utilization and get alerted whenever a particular application uses more than 30% of the CPU.

METHOD 1 -
Perf
| where TimeGenerated > ago(5m)
| where CounterName == "% Processor Time"
| where InstanceName has "<Wildcard for your list of application names>"
| project TimeGenerated, CounterName, CounterValue, Computer, InstanceName
| summarize PercentCPU = avg(CounterValue) by bin(TimeGenerated, 1m), Computer, InstanceName
// Threshold: 30
| summarize by PercentCPU, Computer, InstanceName
| where PercentCPU > 30


METHOD 2 -
Perf
| where TimeGenerated > ago(5m) 
| where ( ObjectName == "Process" ) and CounterName == "% Processor Time" and InstanceName has "<Wildcard for your list of application names>" 
| summarize AggregatedValue = avg(CounterValue) / 4 by Computer, bin(TimeGenerated, 5m), InstanceName 
| where AggregatedValue >30
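
One detail worth knowing when combining `ago(5m)` with `bin(TimeGenerated, 5m)`: bins are aligned to clock boundaries (10:00, 10:05, ...), not to the start of your window, so a 5-minute window usually spans two bins and each series can return two averaged rows. The sketch below imitates the rounding in Python; `bin_floor` is a hypothetical helper and the timestamps are invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def bin_floor(ts: datetime, width: timedelta) -> datetime:
    """Round ts down to a multiple of width, counted from the Unix epoch."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + ((ts - epoch) // width) * width


# Hypothetical query time; ago(5m) then starts at 10:02:30.
now = datetime(2024, 1, 1, 10, 7, 30, tzinfo=timezone.utc)
window_start = now - timedelta(minutes=5)

# One sample per minute inside the window: 10:03, 10:04, ..., 10:07.
samples = [window_start + timedelta(minutes=i, seconds=30) for i in range(5)]

bins = sorted({bin_floor(s, timedelta(minutes=5)) for s in samples})
for b in bins:
    print(b.strftime("%H:%M"))  # prints 10:00, then 10:05
```

If you need exactly one row per series, one option is to drop the `bin(...)` term from the `by` clause so that the whole window is averaged together.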

Again, the above queries give you aggregated values. Below is the query for real-time values:

Perf
| where TimeGenerated > ago(5m)
| where ObjectName == "Process" and CounterName == "% Processor Time" and InstanceName has "<Wildcard for your list of application names>"
| project PercentCPU = CounterValue / 4, Computer, InstanceName, bin(TimeGenerated, 5m)
| where PercentCPU > 70

You will need to adjust the Kusto query above because the performance counter "% Processor Time" reports the percentage of elapsed time that the processor spends executing non-idle threads, which is different from the % CPU utilization value.

You can read more about this in the articles below –

https://social.technet.microsoft.com/wiki/contents/articles/12984.understanding-processor-processor-time-and-process-processor-time.aspx

https://stackoverflow.com/questions/28240978/how-to-interpret-cpu-time-vs-cpu-percentage
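
For the Process object, "% Processor Time" is summed across logical processors, so on a multi-core machine a single process can report more than 100. Dividing by the core count brings the value back to a 0-100 scale. The sketch below assumes that the `/ 4` in the queries above reflects nodes with 4 logical cores; that is an assumption on my part, so substitute the core count of your own VM size. The function name is hypothetical.

```python
def normalize_process_cpu(counter_value: float, logical_cores: int) -> float:
    """Scale a Process "% Processor Time" reading to a 0-100 range."""
    if logical_cores < 1:
        raise ValueError("logical_cores must be at least 1")
    return counter_value / logical_cores


# A raw reading of 120 on a 4-core node corresponds to 30% of total CPU.
print(normalize_process_cpu(120.0, 4))  # 30.0
```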

The above queries alert you when applications in a Service Fabric cluster cross the defined thresholds for CPU and memory utilization.

I am adding two more queries below to get alerted whenever CPU or memory utilization crosses the defined threshold for the NODES in a scale set.

Query to get alerted when any node in a scale set crosses threshold for Memory utilization

Perf 
| where ObjectName == "Memory" and CounterName == "% Committed Bytes In Use" and TimeGenerated > ago(5m) 
| summarize MaxValue = max(CounterValue) by Computer 
| where MaxValue > 70

Query to get alerted when any node in a scale set crosses threshold for CPU Utilization

// enter a % value to check as threshold
let setpctValue = 10;
// enter how many days/hours to look back on
let startDate = ago(5m);
Perf
| where TimeGenerated > startDate
| where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total"
    and Computer in ((Heartbeat
        | where OSType == "Windows"
        | distinct Computer))
| summarize PCT95CPUPercentTime = percentile(CounterValue, 95) by Computer
| where PCT95CPUPercentTime > setpctValue
| summarize max(PCT95CPUPercentTime) by Computer
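
Using the 95th percentile instead of the maximum means the top 5% of samples are ignored, so a single brief spike does not trigger the alert on its own. The sketch below illustrates the idea with a simple nearest-rank percentile over made-up samples; Kusto's `percentile()` uses its own (approximate) algorithm, so treat this as an illustration of the concept rather than a reimplementation.

```python
import math
from collections import defaultdict


def percentile_nearest_rank(values, pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical (Computer, CounterValue) rows: node0 is mostly idle with two spikes.
rows = (
    [("node0", 6.0)] * 18
    + [("node0", 40.0), ("node0", 90.0)]
    + [("node1", 4.0)] * 20
)

by_computer = defaultdict(list)
for computer, value in rows:
    by_computer[computer].append(value)

threshold = 10  # plays the role of setpctValue
p95 = {c: percentile_nearest_rank(v, 95) for c, v in by_computer.items()}
breaching = {c: p for c, p in p95.items() if p > threshold}

print(breaching)  # {'node0': 40.0}
```

Note that the 90 spike on node0 is discarded by the percentile; the reported value is 40.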

You can now configure alert rules to get notified whenever the number of results for the above Kusto queries is greater than 0, that is, whenever at least one row is returned.

It’s really as simple as that. I hope this blog post helped clear things up a little. If you have any questions or doubts about this, you are always welcome to comment, and I would love to have a conversation about it.
