從502到排障:Nginx常見故障分析案例
作為一名運(yùn)維工程師,你是否曾在深夜被502錯誤的報(bào)警電話驚醒?是否因?yàn)樯衩氐腘ginx故障而焦頭爛額?本文將通過真實(shí)案例,帶你深入Nginx故障排查的精髓,讓你從運(yùn)維小白進(jìn)階為故障排查專家。
引言:那些年我們踩過的Nginx坑
在互聯(lián)網(wǎng)公司的運(yùn)維生涯中,Nginx故障可以說是最常見也最讓人頭疼的問題之一。從簡單的配置錯誤到復(fù)雜的性能瓶頸,從偶發(fā)的502到持續(xù)的高延遲,每一個故障背后都有其獨(dú)特的原因和解決方案。
作為擁有8年運(yùn)維經(jīng)驗(yàn)的工程師,我見證了無數(shù)次午夜故障處理,也總結(jié)出了一套行之有效的故障排查方法論。今天,我將通過10個真實(shí)案例,手把手教你如何快速定位和解決Nginx常見故障。
案例一:經(jīng)典502錯誤 - 上游服務(wù)不可達(dá)
故障現(xiàn)象
某電商網(wǎng)站在促銷活動期間突然出現(xiàn)大量502錯誤,用戶無法正常下單,業(yè)務(wù)損失嚴(yán)重。
故障排查過程
第一步:查看Nginx錯誤日志
# 查看最新的錯誤日志 tail-f /var/log/nginx/error.log # 典型502錯誤日志 2024/09/15 1425 [error] 12345#0: *67890 connect() failed (111: Connection refused)whileconnecting to upstream, client: 192.168.1.100, server: shop.example.com, request:"POST /api/order HTTP/1.1", upstream:"http://192.168.1.200:8080/api/order", host:"shop.example.com"
第二步:檢查上游服務(wù)狀態(tài)
# 檢查后端服務(wù)是否正常運(yùn)行 netstat -tulpn | grep 8080 ps aux | grep java # 測試上游服務(wù)連通性 curl -I http://192.168.1.200:8080/health telnet 192.168.1.200 8080
第三步:分析Nginx配置
upstreambackend_servers { server192.168.1.200:8080weight=1max_fails=3fail_timeout=30s; server192.168.1.201:8080weight=1max_fails=3fail_timeout=30sbackup; } server{ listen80; server_nameshop.example.com; location/api/ { proxy_passhttp://backend_servers; proxy_connect_timeout5s; proxy_read_timeout60s; proxy_send_timeout60s; } }
根因分析
通過排查發(fā)現(xiàn),主服務(wù)器192.168.1.200由于負(fù)載過高導(dǎo)致Java應(yīng)用崩潰,而備份服務(wù)器配置有誤未能及時接管流量。
解決方案
# 1. 重啟故障服務(wù)器的應(yīng)用
systemctl restart tomcat
# 2. 修復(fù)備份服務(wù)器配置
# 將backup參數(shù)移除,讓兩臺服務(wù)器同時處理請求
upstream backend_servers {
server 192.168.1.200:8080 weight=1 max_fails=2 fail_timeout=10s;
server 192.168.1.201:8080 weight=1 max_fails=2 fail_timeout=10s;
}
# 3. 重載Nginx配置
nginx -t && nginx -s reload
預(yù)防措施
? 配置健康檢查機(jī)制
? 設(shè)置合理的負(fù)載均衡策略
? 建立完善的監(jiān)控告警體系
案例二:SSL證書過期導(dǎo)致的服務(wù)中斷
故障現(xiàn)象
某金融網(wǎng)站客戶反饋無法訪問,瀏覽器顯示"您的連接不是私密連接"錯誤。
故障排查過程
檢查SSL證書狀態(tài)
# 查看證書到期時間 openssl x509 -in/etc/nginx/ssl/domain.crt -noout -dates # 使用openssl檢查在線證書 echo| openssl s_client -connect example.com:443 2>/dev/null | openssl x509 -noout -dates # 查看Nginx SSL配置 nginx -T | grep -A 10 -B 5 ssl_certificate
Nginx SSL配置示例
server{
listen443ssl http2;
server_namefinance.example.com;
ssl_certificate/etc/nginx/ssl/domain.crt;
ssl_certificate_key/etc/nginx/ssl/domain.key;
ssl_protocolsTLSv1.2TLSv1.3;
ssl_ciphersECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384;
ssl_prefer_server_ciphersoff;
# HSTS設(shè)置
add_headerStrict-Transport-Security"max-age=31536000"always;
}
解決方案
# 1. 生成新的SSL證書(以Let's Encrypt為例) certbot --nginx -d finance.example.com # 2. 手動更新證書配置 ssl_certificate /etc/letsencrypt/live/finance.example.com/fullchain.pem; ssl_certificate_key /etc/letsencrypt/live/finance.example.com/privkey.pem; # 3. 測試并重載配置 nginx -t && nginx -s reload # 4. 驗(yàn)證SSL證書 curl -I https://finance.example.com
自動化解決方案
# 創(chuàng)建證書更新腳本 cat> /etc/cron.d/certbot <'EOF' 0 12 * * * /usr/bin/certbot renew --quiet --post-hook?"nginx -s reload" EOF # 添加證書監(jiān)控腳本 cat?> /usr/local/bin/ssl_check.sh <'EOF' #!/bin/bash DOMAIN="finance.example.com" DAYS=30 EXPIRY_DATE=$(echo?| openssl s_client -connect?$DOMAIN:443 2>/dev/null | openssl x509 -noout -enddate |cut-d= -f2) EXPIRY_EPOCH=$(date-d"$EXPIRY_DATE"+%s) CURRENT_EPOCH=$(date+%s) DAYS_LEFT=$(( ($EXPIRY_EPOCH-$CURRENT_EPOCH) /86400)) if[$DAYS_LEFT-lt$DAYS];then echo"SSL certificate for$DOMAINexpires in$DAYS_LEFTdays!" # 發(fā)送告警 fi EOF
案例三:高并發(fā)下的性能瓶頸
故障現(xiàn)象
某視頻網(wǎng)站在晚高峰期間響應(yīng)緩慢,部分用戶反饋視頻加載失敗。
性能分析工具
# 查看Nginx連接狀態(tài) curl http://localhost/nginx_status # 使用htop查看系統(tǒng)負(fù)載 htop # 檢查網(wǎng)絡(luò)連接數(shù) ss -tuln |wc-l netstat -an | grep :80 |wc-l
Nginx狀態(tài)頁配置
server{
listen80;
server_namelocalhost;
location/nginx_status {
stub_statuson;
access_logoff;
allow127.0.0.1;
denyall;
}
}
性能優(yōu)化配置
# 主配置優(yōu)化
worker_processesauto;
worker_connections65535;
worker_rlimit_nofile65535;
events{
useepoll;
multi_accepton;
worker_connections65535;
}
http{
# 開啟gzip壓縮
gzipon;
gzip_varyon;
gzip_min_length1000;
gzip_typestext/plain text/css application/json application/javascript;
# 緩存優(yōu)化
open_file_cachemax=100000inactive=20s;
open_file_cache_valid30s;
open_file_cache_min_uses2;
open_file_cache_errorson;
# 連接優(yōu)化
keepalive_timeout65;
keepalive_requests100;
# 緩沖區(qū)優(yōu)化
client_body_buffer_size128k;
client_max_body_size50m;
client_header_buffer_size1k;
large_client_header_buffers44k;
}
系統(tǒng)層面優(yōu)化
# 優(yōu)化系統(tǒng)參數(shù) cat>> /etc/sysctl.conf <'EOF' # 網(wǎng)絡(luò)優(yōu)化 net.core.somaxconn = 65535 net.core.netdev_max_backlog = 5000 net.ipv4.tcp_max_syn_backlog = 65535 net.ipv4.tcp_fin_timeout = 30 net.ipv4.tcp_keepalive_time = 1200 net.ipv4.tcp_max_tw_buckets = 5000 # 文件描述符優(yōu)化 fs.file-max = 1000000 EOF # 應(yīng)用配置 sysctl -p
案例四:緩存配置錯誤導(dǎo)致的問題
故障現(xiàn)象
某新聞網(wǎng)站更新內(nèi)容后,用戶仍然看到舊內(nèi)容,清除瀏覽器緩存后問題依然存在。
緩存配置分析
server{
listen80;
server_namenews.example.com;
# 靜態(tài)資源緩存
location~* .(jpg|jpeg|png|gif|ico|css|js)${
expires1y;
add_headerCache-Control"public, immutable";
add_headerPragma public;
}
# 動態(tài)內(nèi)容
location/ {
proxy_passhttp://backend;
# 錯誤的緩存配置
proxy_cache_valid20030210m;
proxy_cache_valid4041m;
add_headerX-Cache-Status$upstream_cache_status;
}
}
問題排查
# 檢查緩存目錄 ls-la /var/cache/nginx/ # 查看緩存配置 nginx -T | grep -A 20 proxy_cache # 測試緩存狀態(tài) curl -I http://news.example.com/article/123 | grep X-Cache-Status
正確的緩存配置
http{
# 緩存路徑配置
proxy_cache_path/var/cache/nginx levels=1:2keys_zone=my_cache:10mmax_size=10ginactive=60muse_temp_path=off;
server{
listen80;
server_namenews.example.com;
# API接口不緩存
location/api/ {
proxy_passhttp://backend;
proxy_cacheoff;
add_headerCache-Control"no-cache, no-store, must-revalidate";
}
# 新聞內(nèi)容緩存
location/article/ {
proxy_passhttp://backend;
proxy_cachemy_cache;
proxy_cache_valid2005m;
proxy_cache_use_staleerrortimeout updating;
add_headerX-Cache-Status$upstream_cache_status;
}
# 靜態(tài)資源長期緩存
location~* .(jpg|jpeg|png|gif|ico)${
expires1y;
add_headerCache-Control"public, immutable";
}
location~* .(css|js)${
expires1d;
add_headerCache-Control"public";
}
}
}
緩存管理工具
# 清除特定URL緩存 curl -X PURGE http://news.example.com/article/123 # 批量清除緩存 find /var/cache/nginx -typef -name"*.cache"-mtime +7 -delete # 緩存統(tǒng)計(jì)腳本 cat> /usr/local/bin/cache_stats.sh <'EOF' #!/bin/bash CACHE_DIR="/var/cache/nginx" echo"Cache directory size:?$(du -sh $CACHE_DIR)" echo"Cache files count:?$(find $CACHE_DIR -type f | wc -l)" echo"Cache hit rate:?$(grep -c HIT /var/log/nginx/access.log)" EOF
案例五:日志輪轉(zhuǎn)異常導(dǎo)致磁盤空間耗盡
故障現(xiàn)象
服務(wù)器突然無法響應(yīng),檢查發(fā)現(xiàn)磁盤空間100%占用,主要是Nginx日志文件過大。
問題診斷
# 檢查磁盤空間 df-h # 找出大文件 du-h /var/log/nginx/ |sort-hr # 檢查日志輪轉(zhuǎn)配置 cat/etc/logrotate.d/nginx
修復(fù)和優(yōu)化
# 緊急處理:截?cái)喈?dāng)前日志 > /var/log/nginx/access.log > /var/log/nginx/error.log # 重啟nginx以重新打開日志文件 nginx -s reopen
優(yōu)化的日志輪轉(zhuǎn)配置
# /etc/logrotate.d/nginx
/var/log/nginx/*.log{
daily
missingok
rotate 14
compress
delaycompress
notifempty
create 640 nginx nginx
sharedscripts
postrotate
if[ -f /var/run/nginx.pid ];then
kill-USR1 `cat/var/run/nginx.pid`
fi
endscript
}
日志配置優(yōu)化
http{
# 自定義日志格式
log_formatmain'$remote_addr-$remote_user[$time_local] "$request" '
'$status$body_bytes_sent"$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for" '
'rt=$request_timeuct="$upstream_connect_time" '
'uht="$upstream_header_time" urt="$upstream_response_time"';
# 條件日志記錄
map$status$loggable{
~^[23] 0;
default1;
}
server{
# 只記錄錯誤請求
access_log/var/log/nginx/access.log main if=$loggable;
# 靜態(tài)資源不記錄日志
location~* .(jpg|jpeg|png|gif|ico|css|js)${
access_logoff;
expires1y;
}
}
}
監(jiān)控腳本
# 磁盤空間監(jiān)控
cat> /usr/local/bin/disk_monitor.sh <'EOF'
#!/bin/bash
THRESHOLD=80
USAGE=$(df?/ | awk?'NR==2 {print $5}'?| sed?'s/%//')
if?[?$USAGE?-gt?$THRESHOLD?];?then
echo"Disk usage is?${USAGE}%, exceeding threshold of?${THRESHOLD}%"
# 自動清理老日志
? ? find /var/log/nginx -name?"*.log.*"?-mtime +7 -delete
# 發(fā)送告警
fi
EOF
案例六:負(fù)載均衡配置錯誤
故障現(xiàn)象
某服務(wù)采用多臺后端服務(wù)器,但發(fā)現(xiàn)流量分配不均,部分服務(wù)器負(fù)載過高而其他服務(wù)器閑置。
負(fù)載均衡策略對比
# 輪詢(默認(rèn))
upstreambackend_round_robin {
server192.168.1.10:8080;
server192.168.1.11:8080;
server192.168.1.12:8080;
}
# 加權(quán)輪詢
upstreambackend_weighted {
server192.168.1.10:8080weight=3;
server192.168.1.11:8080weight=2;
server192.168.1.12:8080weight=1;
}
# IP哈希
upstreambackend_ip_hash {
ip_hash;
server192.168.1.10:8080;
server192.168.1.11:8080;
server192.168.1.12:8080;
}
# 最少連接
upstreambackend_least_conn {
least_conn;
server192.168.1.10:8080;
server192.168.1.11:8080;
server192.168.1.12:8080;
}
健康檢查配置
upstreambackend_with_health {
server192.168.1.10:8080max_fails=3fail_timeout=30s;
server192.168.1.11:8080max_fails=3fail_timeout=30s;
server192.168.1.12:8080max_fails=3fail_timeout=30sbackup;
# keepalive連接池
keepalive32;
}
server{
location/ {
proxy_passhttp://backend_with_health;
# 健康檢查相關(guān)
proxy_next_upstreamerrortimeout invalid_header http_500 http_502 http_503;
proxy_next_upstream_tries2;
proxy_next_upstream_timeout5s;
# 連接復(fù)用
proxy_http_version1.1;
proxy_set_headerConnection"";
}
}
監(jiān)控腳本
# 后端服務(wù)器健康檢查腳本
cat> /usr/local/bin/backend_health_check.sh <'EOF'
#!/bin/bash
SERVERS=("192.168.1.10:8080""192.168.1.11:8080""192.168.1.12:8080")
for?server?in"${SERVERS[@]}";?do
if?curl -sf?"http://$server/health"?> /dev/null;then
echo"$server: OK"
else
echo"$server: FAILED"
# 發(fā)送告警
fi
done
EOF
案例七:安全配置漏洞
故障現(xiàn)象
網(wǎng)站被惡意掃描,發(fā)現(xiàn)存在多個安全漏洞,需要加強(qiáng)Nginx安全配置。
安全加固配置
server{
listen80;
server_namesecure.example.com;
# 隱藏版本信息
server_tokensoff;
more_set_headers"Server: WebServer";
# 安全頭設(shè)置
add_headerX-Frame-Options"SAMEORIGIN"always;
add_headerX-XSS-Protection"1; mode=block"always;
add_headerX-Content-Type-Options"nosniff"always;
add_headerReferrer-Policy"no-referrer-when-downgrade"always;
add_headerContent-Security-Policy"default-src 'self' http: https: data: blob: 'unsafe-inline'"always;
# 限制請求方法
if($request_method!~ ^(GET|HEAD|POST)$) {
return405;
}
# 防止目錄遍歷
location~ /.{
denyall;
access_logoff;
log_not_foundoff;
}
# 限制文件上傳大小
client_max_body_size10M;
# 限制請求頻率
limit_req_zone$binary_remote_addrzone=api:10mrate=10r/s;
limit_req_zone$binary_remote_addrzone=login:10mrate=1r/s;
location/api/ {
limit_reqzone=api burst=20nodelay;
proxy_passhttp://backend;
}
location/login {
limit_reqzone=login burst=5nodelay;
proxy_passhttp://backend;
}
}
防護(hù)腳本
# fail2ban配置示例 cat> /etc/fail2ban/filter.d/nginx-4xx.conf <'EOF' [Definition] failregex = ^-.*"(GET|POST).*"(404|403|400) .*$ ignoreregex = EOF cat> /etc/fail2ban/jail.local <'EOF' [nginx-4xx] enabled =?true port = http,https filter = nginx-4xx logpath = /var/log/nginx/access.log maxretry = 10 bantime = 3600 findtime = 60 EOF
案例八:反向代理配置問題
故障現(xiàn)象
使用Nginx作為反向代理時,客戶端真實(shí)IP丟失,后端服務(wù)無法獲取正確的客戶端信息。
問題分析和解決
server{
listen80;
server_nameapi.example.com;
location/ {
proxy_passhttp://backend;
# 正確傳遞客戶端IP
proxy_set_headerHost$host;
proxy_set_headerX-Real-IP$remote_addr;
proxy_set_headerX-Forwarded-For$proxy_add_x_forwarded_for;
proxy_set_headerX-Forwarded-Proto$scheme;
# 處理重定向
proxy_redirectoff;
# 超時設(shè)置
proxy_connect_timeout30s;
proxy_send_timeout30s;
proxy_read_timeout30s;
# 緩沖設(shè)置
proxy_bufferingon;
proxy_buffer_size4k;
proxy_buffers84k;
proxy_busy_buffers_size8k;
}
}
WebSocket支持
map$http_upgrade$connection_upgrade{
defaultupgrade;
'' close;
}
server{
listen80;
server_namews.example.com;
location/websocket {
proxy_passhttp://backend;
proxy_http_version1.1;
proxy_set_headerUpgrade$http_upgrade;
proxy_set_headerConnection$connection_upgrade;
proxy_set_headerHost$host;
proxy_cache_bypass$http_upgrade;
# WebSocket特殊配置
proxy_read_timeout86400;
}
}
案例九:URL重寫規(guī)則沖突
故障現(xiàn)象
網(wǎng)站URL重寫規(guī)則復(fù)雜,出現(xiàn)重定向循環(huán)和404錯誤。
重寫規(guī)則優(yōu)化
server{
listen80;
server_nameexample.com www.example.com;
# 強(qiáng)制跳轉(zhuǎn)到主域名
if($host!='example.com') {
return301https://example.com$request_uri;
}
# SEO友好的URL重寫
location/ {
try_files$uri$uri/@rewrites;
}
location@rewrites{
rewrite^/product/([0-9]+)$/product.php?id=$1last;
rewrite^/category/([a-zA-Z0-9-]+)$/category.php?name=$1last;
rewrite^/user/([a-zA-Z0-9]+)$/profile.php?username=$1last;
return404;
}
# 防止重定向循環(huán)
location~ .php${
try_files$uri=404;
fastcgi_pass127.0.0.1:9000;
fastcgi_indexindex.php;
includefastcgi_params;
}
}
調(diào)試重寫規(guī)則
# 開啟重寫日志
error_log/var/log/nginx/rewrite.lognotice;
rewrite_logon;
# 測試重寫規(guī)則
location/test {
rewrite^/test/(.*)$/debug?param=$1break;
return200"Rewrite test:$args
";
}
案例十:性能監(jiān)控與調(diào)優(yōu)
故障現(xiàn)象
需要建立完善的Nginx性能監(jiān)控體系,及時發(fā)現(xiàn)和解決性能問題。
監(jiān)控腳本
# Nginx性能監(jiān)控腳本
cat> /usr/local/bin/nginx_monitor.sh <'EOF'
#!/bin/bash
NGINX_STATUS_URL="http://localhost/nginx_status"
LOG_FILE="/var/log/nginx_monitor.log"
# 獲取狀態(tài)信息
STATUS=$(curl -s?$NGINX_STATUS_URL)
ACTIVE_CONN=$(echo"$STATUS"?| grep?"Active connections"?| awk?'{print $3}')
ACCEPTS=$(echo"$STATUS"?| awk?'NR==2 {print $1}')
HANDLED=$(echo"$STATUS"?| awk?'NR==2 {print $2}')
REQUESTS=$(echo"$STATUS"?| awk?'NR==2 {print $3}')
READING=$(echo"$STATUS"?| awk?'NR==3 {print $2}')
WRITING=$(echo"$STATUS"?| awk?'NR==3 {print $4}')
WAITING=$(echo"$STATUS"?| awk?'NR==3 {print $6}')
# 記錄到日志
echo"$(date): Active:$ACTIVE_CONN, Reading:$READING, Writing:$WRITING, Waiting:$WAITING"?>>$LOG_FILE
# 告警邏輯
if[$ACTIVE_CONN-gt 1000 ];then
echo"High connection count:$ACTIVE_CONN"| logger -t nginx_monitor
fi
EOF
綜合調(diào)優(yōu)配置
# 終極優(yōu)化配置 worker_processesauto; worker_cpu_affinityauto; worker_rlimit_nofile100000; error_log/var/log/nginx/error.logwarn; pid/var/run/nginx.pid; events{ useepoll; worker_connections10240; multi_accepton; accept_mutexoff; } http{ include/etc/nginx/mime.types; default_typeapplication/octet-stream; # 日志格式 log_formatmain'$remote_addr-$remote_user[$time_local] "$request" ' '$status$body_bytes_sent"$http_referer" ' '"$http_user_agent"$request_time$upstream_response_time'; # 性能優(yōu)化 sendfileon; tcp_nopushon; tcp_nodelayon; keepalive_timeout65; keepalive_requests1000; # 壓縮優(yōu)化 gzipon; gzip_varyon; gzip_min_length1000; gzip_comp_level6; gzip_typestext/plain text/css application/json application/javascript text/xml application/xml; # 緩存優(yōu)化 open_file_cachemax=100000inactive=20s; open_file_cache_valid30s; open_file_cache_min_uses2; open_file_cache_errorson; # 安全優(yōu)化 server_tokensoff; client_header_timeout10; client_body_timeout10; reset_timedout_connectionon; send_timeout10; # 限流配置 limit_req_zone$binary_remote_addrzone=global:10mrate=100r/s; limit_conn_zone$binary_remote_addrzone=addr:10m; include/etc/nginx/conf.d/*.conf; }
故障排查方法論總結(jié)
1. 標(biāo)準(zhǔn)化排查流程
1.收集故障信息:確認(rèn)故障現(xiàn)象、影響范圍、發(fā)生時間
2.查看日志文件:error.log、access.log、系統(tǒng)日志
3.檢查配置文件:語法檢查、邏輯檢查
4.驗(yàn)證網(wǎng)絡(luò)連通:端口狀態(tài)、連通性測試
5.分析性能指標(biāo):CPU、內(nèi)存、網(wǎng)絡(luò)、磁盤
6.確定根本原因:深入分析,找出真正原因
7.實(shí)施解決方案:臨時修復(fù)、永久解決
8.驗(yàn)證修復(fù)效果:功能測試、性能測試
9.總結(jié)經(jīng)驗(yàn)教訓(xùn):文檔記錄、流程優(yōu)化
2. 常用排查工具
?日志分析:tail、grep、awk、sed
?網(wǎng)絡(luò)工具:curl、wget、telnet、netstat、ss
?性能監(jiān)控:htop、iotop、iftop、nginx-status
?系統(tǒng)診斷:strace、lsof、tcpdump
3. 預(yù)防性措施
? 建立完善的監(jiān)控告警體系
? 定期進(jìn)行配置文件備份
? 實(shí)施自動化運(yùn)維工具
? 制定標(biāo)準(zhǔn)化操作流程
? 定期進(jìn)行故障演練
結(jié)語
Nginx故障排查是運(yùn)維工程師必備的核心技能,需要扎實(shí)的理論基礎(chǔ)和豐富的經(jīng)驗(yàn)
-
nginx
+關(guān)注
關(guān)注
0文章
184瀏覽量
13081
原文標(biāo)題:從502到排障:Nginx常見故障分析案例
文章出處:【微信號:magedu-Linux,微信公眾號:馬哥Linux運(yùn)維】歡迎添加關(guān)注!文章轉(zhuǎn)載請注明出處。
發(fā)布評論請先 登錄
Nginx常見故障案例總結(jié)
評論