Azure Monitor ile Altyapı ve Uygulama İzleme Rehberi

person using black laptop computer — Azure İzleyici ile Altyapı ve Uygulama İzleme Kılavuzu.

27 Şub 2026

Malatya’da 220 çalışanlı bir tarımsal makine üreticisi, Azure’da 38 VM + 12 App Service + 4 AKS cluster çalıştırıyordu ama observability “VM CPU yüksek olunca biri farkına varır mı acaba” düzeyindeydi. 4 ay sonunda Azure Monitor + Log Analytics + App Insights + 3 Workbook + 27 alarm + on-call rotasyon kuruldu. MTTD (Mean Time to Detect) saatlerden 6 dakikaya indi.

Workspace Tasarımı

Tek Log Analytics workspace’e tüm log’lar gönderildi (ayrı workspace = ayrı maliyet + cross-query zorlaşır):

VM performance + heartbeat (Azure Monitor Agent)
App Service log + App Insights
AKS container log + Prometheus metrics (Container Insights)
Activity log (subscription-level events)
Diagnostic settings (PaaS resources)
Sentinel security log

Aylık ingestion: ~~85 GB. Commitment tier 100 GB/gün almak gerekmiyor, pay-as-you-go yeterli.

Diagnostic Settings: Azure Policy ile Zorunlu

// Tüm VM'lerin diagnostic settings'i workspace'e gönderiyor mu?
policy: "DeployIfNotExists"
condition: 
  - field: "type"
    equals: "Microsoft.Compute/virtualMachines"
deploymentTemplate: "deploy-azuremonitor-extension"

Yeni VM oluşturulduğunda otomatik AMA install + workspace’e bağlanma.

App Insights Entegrasyonu

12 App Service ve 4 AKS app App Insights instrument edildi:

// .NET 8 App Service
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString = builder.Configuration["AppInsights:ConnectionString"];
    options.EnableAdaptiveSampling = true;  // %5'e kadar düşürebilir, kostu azaltır
});

// Custom telemetry
builder.Services.AddSingleton<ITelemetryInitializer, OperationIdInitializer>();

KQL Alarm Kütüphanesi

Alarm 1: VM CPU > %85, 10 dk sürekli

Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where InstanceName == "_Total"
| summarize AvgCpu = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
| where AvgCpu > 85
| summarize ConsecutiveCount = count() by Computer
| where ConsecutiveCount >= 2

Alarm 2: App Service P95 latency > 2 sn

requests
| where timestamp > ago(15m)
| where cloud_RoleName startswith "app-"
| summarize P95 = percentile(duration, 95) by cloud_RoleName, bin(timestamp, 5m)
| where P95 > 2000

Alarm 3: AKS Pod restart count > 5/saat

KubePodInventory
| where TimeGenerated > ago(1h)
| where ContainerStatus == "Running"
| summarize MaxRestarts = max(ContainerRestartCount) by Name, Namespace
| where MaxRestarts > 5

Alarm 4: SQL DB DTU > %80

AzureMetrics
| where ResourceProvider == "MICROSOFT.SQL"
| where MetricName == "dtu_consumption_percent"
| where Average > 80
| summarize MaxDtu = max(Average) by Resource, bin(TimeGenerated, 5m)

Alarm 5: Failed sign-in spike (Identity)

SigninLogs
| where ResultType != "0"
| summarize FailedCount = count() by IPAddress, bin(TimeGenerated, 10m)
| where FailedCount > 50

Action Group: Severity-Based Dispatch

Severity	Aksiyon	SLA
Sev 0 (Critical)	SMS + Phone + Teams + Email + ITSM ticket	15 dk
Sev 1 (Error)	SMS + Teams + Email	1 saat
Sev 2 (Warning)	Teams + Email	4 saat
Sev 3 (Info)	Email	1 iş günü

az monitor action-group create -g rg-monitoring -n ag-critical 
  --action sms onCallRotation +905555555555 
  --action voice onCallRotation +905555555555 
  --action webhook teamsWebhook https://xxx.webhook.office.com/... 
  --action email onCallEmail oncall@example.com

Workbook: Tek Pane Dashboard

3 workbook hazırlandı:

Operations Dashboard: Tüm resource’ların health summary, top 10 issue
Cost Dashboard: Resource bazlı aylık trend, anomaly detection
Security Dashboard: Failed sign-in, NSG block, suspicious activity

Application Map (App Insights)

Otomatik dependency graph: web app → API → SQL → Cosmos. Latency spike olduğunda hangi component sorumlu görünüyor.

Maliyet

Kalem	Aylık (USD)
Log Analytics ingestion (~~85 GB)	~~$190
App Insights (sampling enabled)	~~$120
Container Insights (4 cluster)	~~$110
Sentinel (subset of logs)	~~$220
Toplam	~~$640

Sonuçlar

Metrik	Önce	Sonra
MTTD	~~saatler	~~6 dk
MTTR (Mean Time to Resolve)	~~3 saat	~~32 dk
Müşteri kaynaklı incident bildirimi	~~%70 incidents	~~%12
Aylık downtime	~~6 saat	~~45 dk

Sahada Düşülen Üç Tuzak

Tüm log’u sample etmeden toplayıp ingestion patlaması: AppInsights adaptive sampling şart, yoksa log maliyeti hizmetten yüksek olur.
Alarm spam: Threshold çok düşükse 50 alarm/gün, kimse bakmıyor. İlk hafta tuning şart.
Action group telefon listesini güncel tutmamak: 6 ay sonra ayrılan kişi alarm SMS alıyor. Quarterly review.

CloudSpark olarak Azure Monitor + Log Analytics + App Insights kurulumu, KQL alert library, on-call rotasyon ve incident response programları için danışmanlık veriyoruz.

Etiketler: