Monitoring V2Ray subscription URLs without getting paged for every rotation
Subscription URLs are the V2Ray world's version of auto-updating package lists. Hand a client one URL and it pulls back a bundle of configs. The server rotates keys or swaps out a dead box, the bundle changes, the client picks it up on the next poll. That is the happy path. I have spent a lot of evenings debugging the unhappy paths.
This post is for anyone running more than about ten servers behind a subscription URL and trying to figure out how to monitor the whole thing without either pager fatigue or blind spots the size of a truck. It covers what goes wrong in practice, what a decent checklist looks like, and one story from the trenches.
Refresher: what a subscription URL actually is
This section is for readers who Googled an error and landed somewhere unfamiliar. Skip it if you already live in this world.
A V2Ray or Xray subscription URL is an HTTPS endpoint that returns a blob. The blob is usually base64. Decoded, it is a newline separated list of VPN config URIs: vless://..., vmess://..., trojan://..., ss://..., occasionally hy2://... or tuic://.... The client fetches the URL on a schedule, decodes the blob, and presents the list of servers to the user. Connecting to any one of them routes traffic through that server.
Clash and sing-box add a wrinkle by accepting YAML or JSON instead of base64, but the pattern is the same. One URL, many configs inside, periodic refresh.
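To make that concrete, here is a sketch of the round trip: build a bundle the way an origin would, then decode it the way a client does. All three URIs and addresses are made up for illustration.

```shell
# Build a tiny bundle the way an origin would. These URIs are
# hypothetical examples, not real servers.
printf '%s\n' \
  'vless://11111111-2222-3333-4444-555555555555@203.0.113.10:443?security=reality#us-east' \
  'trojan://s3cretpw@203.0.113.11:443#eu-west' \
  'ss://YWVzLTI1Ni1nY206cGFzc3dvcmQ@203.0.113.12:8388#ap-south' \
  | base64 > /tmp/sub.txt

# What a client does on each poll: fetch, decode, split into config lines.
base64 -d < /tmp/sub.txt
```

One URL on the wire, three servers inside. Everything below is about what happens when one of those inner lines goes bad while the outer URL stays healthy.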
What actually goes wrong in production
We start with the most obvious failure and work down into the subtler ones.
A server inside the bundle is dead but the bundle is healthy
The bundle endpoint returns 200. The base64 decodes. There are 12 configs inside. Three of them point to a box that has been down for a week. Clients that happen to pick one of those three fail. The other nine still route fine, so your "is my subscription URL up" check stays green. Most of your users are happy; the unlucky quarter quietly churns.
Any monitor that only checks the URL itself is useless here. You have to pull the bundle apart and test each config individually. More on that in a minute.
Drift between clients
Client A fetched the bundle at 02:00 local time. Client B fetched at 08:00. Between those pings you rotated a VLESS UUID. Client A has the new one. Client B has the old one, which no longer matches anything on the server side, so Client B's connections silently fail auth and get dropped. No error surfaces at the TCP or TLS layer. The user sees "connected" but nothing works.
This is the worst category because it only hurts a subset of users, the symptoms are vague, and the support requests sound like "the VPN is slow" or "sometimes it doesn't work" instead of anything actionable.
CDN caching your bundle when you did not want it to
If you front your subscription endpoint with Cloudflare or any other CDN, the edge will happily cache the bundle. You push a new one to origin. Cloudflare serves the old one for up to an hour. Clients fetching during that window get stale data.
The fix is to set explicit Cache-Control headers on the bundle response: Cache-Control: no-store if you rotate often, or max-age=60 if you are okay with short caching. The default behavior of most CDNs is "cache because you did not tell me not to".
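If nginx fronts your origin, the header is one directive. A minimal sketch, with a hypothetical location path:

```nginx
location /my-subscription {
    # Rotating often: tell the CDN and clients not to cache at all.
    add_header Cache-Control "no-store" always;

    # Or, if a minute of staleness is acceptable:
    # add_header Cache-Control "max-age=60" always;
}
```

The `always` flag matters: without it, nginx only adds the header on 2xx/3xx responses, so error responses could still get cached by an eager edge.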
Truncated responses
If your origin is briefly overloaded and the bundle response gets truncated, some clients will parse a partial bundle and silently operate with fewer configs. The base64 decoder does not always complain about a truncated payload if the truncation happens to land on a valid boundary. You end up with clients running half a fleet without knowing.
Catch this by tracking the parsed config count per poll. If today's count is meaningfully lower than yesterday's, that is either a truncation or a real rotation. Both are worth a glance.
Protocol mix changes
You pulled Shadowsocks from the bundle because you rolled VLESS-REALITY out everywhere and wanted to deprecate the old stuff. Clients that only speak SS, because they are old, or because someone configured them that way a year ago and nobody touched them since, stop working. You did not break them exactly. You changed the contract.
A monitor that tracks the distribution of protocols inside the bundle catches this. "Protocol mix went from 40% SS / 60% VLESS to 0% SS / 100% VLESS." That is worth a Slack message before you ship it to everyone, not after.
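Computing the distribution is a one-liner over the decoded config list. The sample list below is hypothetical; in practice you would pipe in the decoded blob from your poll:

```shell
# Strip everything from "://" onward, leaving just the scheme,
# then count occurrences per protocol.
MIX=$(printf '%s\n' \
  'vless://a@203.0.113.1:443' \
  'vless://b@203.0.113.2:443' \
  'ss://c@203.0.113.3:8388' \
  | sed -E 's#://.*##' | sort | uniq -c | sort -rn)
echo "$MIX"
```

Store the result per poll and diff it; a protocol dropping to zero is exactly the kind of contract change worth a Slack message.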
A checklist that catches most of the above
Nothing exotic. Each of these is a one-off check or a small loop. The skill is stringing them together and running them on a schedule.
1. Structural sanity on the bundle itself
curl -sSL -o /tmp/sub.txt "https://sub.example.com/my-subscription"
BYTES=$(wc -c < /tmp/sub.txt)
[ "$BYTES" -lt 500 ] && echo "bundle suspiciously small: $BYTES bytes"
# base64 decode, count how many URI lines came out
DECODED=$(base64 -d < /tmp/sub.txt 2>/dev/null)
COUNT=$(echo "$DECODED" | grep -cE '^(vless|vmess|trojan|ss|hy2|tuic)://')
echo "bundle contains $COUNT configs"
Alert if the byte count drops below your usual range or the config count changes by more than, say, 20% from the previous poll.
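A minimal sketch of that comparison. The paths and the stand-in counts are made up; COUNT would come from the parsing step above:

```shell
# Stand-in values: 12 configs on the previous poll, 9 on this one.
echo 12 > /tmp/sub.count
COUNT=9

PREV=$(cat /tmp/sub.count)
DELTA=$(( (COUNT - PREV) * 100 / PREV ))   # percent change; negative on drops
if [ "${DELTA#-}" -gt 20 ]; then
  echo "config count moved ${DELTA}% ($PREV -> $COUNT), worth a look"
fi
echo "$COUNT" > /tmp/sub.count             # persist for the next poll
```

The `${DELTA#-}` strips a leading minus sign so the 20% threshold fires on drops and jumps alike.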
2. Per-config handshake
This is the one generic uptime monitors cannot do. For each parsed URI, open the transport, perform the protocol handshake, note the result. If any one fails, record which one and include it in the daily or hourly summary.
You can script this by pointing a real Xray or sing-box instance at each config in turn and doing a probe, or use a dedicated check library. We do this at TunnelHQ by running the Hiddify checker on each config. A real handshake takes a few hundred milliseconds per config, so a bundle of a hundred configs finishes a full sweep in well under a minute.
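If you do not want to stand up a full client just to get a first signal, a TCP reachability probe is a rough lower bound: it catches boxes that are down, but not auth failures or protocol-level breakage. The sketch below is not the TunnelHQ implementation. It assumes user@host:port style URIs (vless, trojan, ss); vmess:// links carry base64-encoded JSON and need decoding first.

```shell
#!/usr/bin/env bash

# Extract "host port" from a user@host:port style URI (vless, trojan, ss).
uri_host_port() {
  echo "$1" | sed -E 's#^[a-z0-9]+://[^@]*@([^:/?]+):([0-9]+).*#\1 \2#'
}

# TCP-level probe only: confirms the port answers, says nothing about
# whether the protocol handshake or the credentials would succeed.
probe() {
  read -r host port <<< "$(uri_host_port "$1")"
  if timeout 3 bash -c ">/dev/tcp/$host/$port" 2>/dev/null; then
    echo "OK   $host:$port"
  else
    echo "DEAD $host:$port"
  fi
}

probe 'trojan://pw@127.0.0.1:9#local-test'   # port 9 is usually closed
```

Loop `probe` over every parsed URI and you have a crude fleet sweep; graduate to a real Xray or sing-box probe once the TCP layer stops being where your failures live.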
3. Diff against the previous poll
Store the parsed list and the hash of the bundle. On each new poll, compare. You care about three kinds of change:
- Configs added. Someone shipped a new server, good.
- Configs removed. Usually intentional, but worth logging in case it was not.
- Existing configs modified. Key rotation, host change, transport swap. These are the ones most likely to cause the silent drift problem.
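The added/removed half of that diff is a couple of comm invocations. The two lists below stand in for the previous and current polls; note that a modified config shows up under this scheme as one removal plus one addition, so key on host:port if you want to classify modifications separately:

```shell
# Previous and current parsed lists (hypothetical URIs), sorted for comm.
printf '%s\n' 'vless://a@203.0.113.1:443' 'vless://b@203.0.113.2:443' \
  | sort > /tmp/prev.sorted
printf '%s\n' 'vless://b@203.0.113.2:443' 'trojan://c@203.0.113.3:443' \
  | sort > /tmp/curr.sorted

comm -13 /tmp/prev.sorted /tmp/curr.sorted   # added: in current only
comm -23 /tmp/prev.sorted /tmp/curr.sorted   # removed: in previous only
```

comm requires both inputs sorted, which is why the sort is baked into the write step rather than left to chance.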
Do not alert on every change. Rotations are normal. Alert on changes that look wrong: the bundle suddenly has zero VLESS configs when it had sixty yesterday, the total count dropped by half, a protocol appeared that you never ship.
4. Hash should change when you expect
Flip side of the same coin. If you rotated a key an hour ago and the bundle hash is identical to this morning's, something did not propagate. Either your bundle generator did not pick up the change, or your CDN is caching too aggressively, or your origin is serving a stale copy. All three happen more often than you would hope.
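The check itself is a few lines: persist the hash on every poll and flag the polls where it did not move. The paths and the stand-in bundle are made up:

```shell
# Stand-in bundle; in practice this is the freshly fetched subscription.
echo 'stale-bundle-contents' > /tmp/sub.txt

NEW_HASH=$(sha256sum /tmp/sub.txt | cut -d' ' -f1)
OLD_HASH=$(cat /tmp/sub.hash 2>/dev/null || true)
if [ "$NEW_HASH" = "$OLD_HASH" ]; then
  echo "hash unchanged since last poll; did the rotation propagate?"
fi
echo "$NEW_HASH" > /tmp/sub.hash   # persist for the next comparison
```

On its own this only tells you "no change"; cross it against your deploy or rotation events to turn that into "no change when there should have been one".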
5. Fetch from multiple regions
A bundle can be healthy globally but blocked from one region. If your subscription URL is on a domain that occasionally gets blocked in Iran or mainland China, a single region fetch will not tell you. You need multiple vantage points. Cheap VPSes in three or four regions, or a monitoring service that already has the node network. Either works.
A story about not checking individual configs
We onboarded an operator running around 200 servers behind a single subscription URL. Their existing monitoring was a UptimeRobot ping on the bundle endpoint. Green for months. They were proud of their uptime.
Their first sweep on TunnelHQ came back with 47 of the 200 configs dead. All of them had been dead for weeks, some for months. The bundle was returning all 200 configs, so the bundle endpoint itself was genuinely healthy. Clients were silently failing over to the 153 that still worked. Users adapted. Support saw occasional complaints and assumed they were user error.
They thought their infra was in much better shape than it was. The UptimeRobot monitor was not lying. It was answering a different question than the one the operator needed answered. "Is the bundle endpoint responding with HTTP 200" is an easy question. "Is every server inside the bundle actually healthy" is the hard one.
That 47-out-of-200 number was embarrassing for about a day, then it was useful. They rotated the dead ones, cleaned up their config generator to deprovision better, and now their sweep catches new dead boxes within minutes.
How often to poll
Rough guidance, because it depends on your rotation frequency and user tolerance:
- If you rotate rarely (less than once a week), polling the bundle every 10 to 30 minutes is fine. Per-config checks every 1 to 5 minutes depending on how quickly you want to catch dead ones.
- If you rotate often (multiple times a day), poll the bundle every 1 to 2 minutes so clients get fresh data, and run per-config checks on the 5-minute mark.
- Respect your own rate limits. Clients polling every minute from thousands of devices can saturate a small origin.
Closing thought
Subscription URLs are a great pattern. They solve a real distribution problem, and the client ecosystem around them is mature. What they also do is create an abstraction that hides individual server health inside a bundle. Generic monitors check the bundle. Good monitors check everything inside the bundle, too. The gap between those two is where most silent failures live.
If you want this handled for you
Paste your subscription URL into TunnelHQ. We poll it on your interval, parse every config inside, run a real handshake on each, and alert when any one fails or when the bundle structure changes unexpectedly. Free for 5 monitors, no card.
Start free or read the VLESS, VMess, Trojan monitoring pages for the per-protocol details.