<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
    which is available here: http://xml.resource.org. -->

<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
        <!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
        <!ENTITY RFC2629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2629.xml">
        <!ENTITY RFC3032 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3032.xml">
        <!ENTITY RFC3277 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3277.xml">
        <!ENTITY RFC3719 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3719.xml">
        <!ENTITY RFC4271 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4271.xml">
        <!ENTITY RFC5120 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5120.xml">
        <!ENTITY RFC5301 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5301.xml">
        <!ENTITY RFC5303 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5303.xml">
        <!ENTITY RFC5305 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5305.xml">
        <!ENTITY RFC5308 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5308.xml">
        <!ENTITY RFC5309 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5309.xml">
        <!ENTITY RFC5311 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5311.xml">
        <!ENTITY RFC5316 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5316.xml">
        <!ENTITY RFC5440 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5440.xml">
        <!ENTITY RFC5449 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5449.xml">
        <!ENTITY RFC5614 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5614.xml">
        <!ENTITY RFC5837 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5837.xml">
        <!ENTITY RFC5820 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5820.xml">
        <!ENTITY RFC6232 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6232.xml">
        <!ENTITY RFC7356 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7356.xml">
        <!ENTITY RFC7921 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7921.xml">
        <!ENTITY RFC8174 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8174.xml">
        <!ENTITY RFC8126 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8126.xml">
        <!ENTITY RFC8296 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8296.xml">
        ]>

<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc strict="yes" ?>
<?rfc toc="yes"?>
<?rfc tocdepth="4"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>
<rfc category="exp" docName="draft-prz-lsr-hierarchical-snps-02" ipr="trust200902">

    <!-- ***** FRONT MATTER ***** -->

    <front>

        <title>IS-IS Hierarchical SNPs</title>

        <author initials='T.' surname='Przygienda' fullname='Tony Przygienda'>
            <organization>Juniper Networks</organization>
            <address>
                <email>prz@juniper.net</email>
            </address>
        </author>

        <author initials='T.' surname='Li' fullname='Tony Li'>
            <organization>Juniper Networks</organization>
            <address>
                <email>tli@juniper.net</email>
            </address>
        </author>


        <date/>

        <abstract>
            <t>
                The document presents an optional new type of SNP called a Hierarchical SNP (HSNP). When feasible,
                it compresses
                traditional CSNP exchanges into a Merkle tree-like structure, which speeds up synchronization of large
                databases across large numbers of adjacencies while reducing the load from regular CSNP exchanges during normal
                operation.
            </t>


        </abstract>

    </front>

    <middle>

        <section title="Introduction">

<!--            <t>-->
<!--                The document introduces an optional, new type of SNPs called Hierarchical SNP (HSNP) that can-->
<!--                compress the traditional CSNPs exchange into a variant of merkle tree, hence-->
<!--                allowing to support fast synchronization of large databases and adjacency numbers while-->
<!--                lessening the load of periodic CSNP exchanges during steady state operation.-->

<!--                Mixture of parallel flooding, exchanging of CSNPs and a fitting HSNP compression strategy-->
<!--                should lead to faster database re-synchronization-->
<!--                since only a subset of packets as compared to full scale CSNP exchange-->
<!--                is necessary to reconcile any entropy present. When using HSNPs more of unnecessary flooding can be suppressed-->
<!--                while the mismatched fragments will be detected with less communication overhead as compared to-->
<!--                full CSNP exchanges.-->
<!--            </t>-->

            <t>
                The document introduces an optional new type of SNP called the Hierarchical SNP (HSNP). It compresses
                traditional CSNP exchanges into a Merkle tree-like structure <xref target="MERKLE"/>,
                enabling faster synchronization of large
                databases and adjacency information while reducing the overhead of regular CSNP exchanges during
                steady state
                operation.

                By combining parallel flooding, CSNP exchanges, and HSNP-based compression, database resynchronization
                can be accelerated because fewer packets (not the entire CSNP set) need to be exchanged to fix
                inconsistencies. Using HSNPs also reduces unnecessary flooding and communication overhead, while still
                detecting mismatched fragments more efficiently than full CSNP exchanges.
            </t>

<!--            <t>-->
<!--                To keep the framework uniform we consider-->
<!--                the usual CSNP entries of LSPs simply a "as good as a perfect merkle hash" where we basically use lsp id, seq# and checksum-->
<!--                as "hash" and assume the 16 bits IS-IS fletcher checksum of the content never collides with other fragments. We add the PDU length-->
<!--                of the fragment for further entropy. Subsequently, we build a fletcher checksum on that data as-->
<!--                the fingerprint of the fragment and consider it its hash.-->
<!--                The hash summarizing hashes of all fragments of a node we consider as "node merkle hash". Range of such node merkle-->
<!--                hashes can be again summarized by a hash built over the according hashes.-->
<!--                We will call such hash a "node range merkle hash". For such hashes we do however-->
<!--                move away from fletcher checksums for reasons explained further in <xref target="hashfn"/>.-->

<!--            </t>-->

            <t>
                To maintain a consistent framework, we initially treat
                each CSNP entry for an LSP as equivalent to an "as good as perfect",
                though rather long,
                Merkle hash. In other words, in the first step, the LSP ID, sequence number,
                and checksum act as a "perfect fragment hash". Conceptually speaking, the CSNPs then serve as the
                "bottom" of the Merkle tree. We include the fragment's PDU length to add
                more entropy in the next step.

                In this step we compute a <xref target="SIPHASH"/> over the fragment's description to create a shorter fragment-level Merkle
                hash. Those hashes are never transmitted in packets, since doing so would effectively
                duplicate CSNP functionality.
                However, the hash that combines all fragment hashes of a node becomes a "node Merkle hash". Groups
                of such node hashes can then be summarized again into a "node range Merkle hash", created by hashing the
                individual node hashes together. For the node and node range hashes,
                we switch from <xref target="SIPHASH"/> to a different hashing
                method, as explained in <xref target="hashfn"/>.

                The resulting hierarchy of hashes enables validation of large LSDB synchronizations using far fewer
                packets than relying solely on CSNPs. Although these hashes summarize ranges recursively, the resulting
                exchange, depending on range mismatches, resembles a <xref target="SKIP"/>-like process rather than
                maintaining a fixed
                tree structure in every exchange.
            </t>

            <t>

                For practical purposes, this document limits itself to LSDB sizes
                on the order of 1E6 fragments and to the further considerations necessary to prevent overly
                garrulous exchanges
                of hashes covering smaller and smaller sets of fragments. More details on the
                targeted IS-IS envelope can be found in <xref target="envelope"/> and further
                considerations revolving around these assumptions
                are summarized in
                <xref target="further"/>.
            </t>
        </section>


        <section title="Dynamic Partitioning" anchor="partition">

            <t>
                In practical terms, the most interesting problem
                is figuring out how to divide the database into groups of leaf
                nodes (each representing a set of fragments) that both compress the data effectively compared to
                standard CSNPs and remain as stable as possible. If the fragment boundaries within each hash keep
                changing, neighboring systems will need to recalculate their hashes instead of reusing cached ones from
                their own Merkle tree, which adds unnecessary computation.

                The subdivision should also create enough "bins" to handle any distribution of fragment IDs across the
                network. This helps avoid pathological cases where all fragments might end up in a single hash, such as
                when a hashing function degrades. Ideally, when a hash mismatch occurs, it should take only one or
                two packets, with hashes for smaller fragment groups or regular CSNPs, to fix the difference.
                Reducing I/O and/or computation directly improves how quickly the systems synchronize.
            </t>

<!--            <t>-->
<!--                Practically speaking, the most interesting problem we encounter is the correct subdivision of the database into-->
<!--                collections of leaf nodes (set of fragments contained by such nodes)-->
<!--                that on one hand will provide a good compression vs. standard CSNPs,-->
<!--                on the other hand it changes as little as possible. In case the boundaries of fragments covered by-->
<!--                a hash change all the time, the neighbors receiving the-->
<!--                information-->
<!--                may have to recompute the hashes rather than relying on a cache representing its own merkle tree of such hashes.-->
<!--                The generated subdivision should also produce enough "bins" no matter the distribution of the fragment IDs-->
<!--                in the network. This is important to prevent such things as packing of all the fragments into a single hash when-->
<!--                e.g. a particular hash function degenerates. And ideally, a hash mismatch should produce not more than a-->
<!--                single packet or two with hashes covering less fragments or ultimately, CSNPs.-->
<!--                Rather obviously, minimizing necessary I/O and computation will improve re-convergence performance.-->
<!--            </t>-->


            <t>
                To begin, in IS-IS networks we can fit just under 100 CSNP entries into the typical 1500-byte MTU frame.
                This is the consequence of each CSNP entry including the Node ID, Fragment number, Sequence number, Checksum, and
                Lifetime fields -- totaling 7 + 1 + 4 + 2 + 2 = 16 bytes per entry.
            </t>

<!--            <t>-->
<!--                To start our considerations, in IS-IS networks-->
<!--                we can fit into the prevailing 1500 bytes somewhat less than 100 of-->
<!--                PSNP entries for the foreseeable future. This is the consequence of-->
<!--                CSNP entries consuming Node ID + Fragment# + Seq# + CSUM + Lifetime-->
<!--                length which-->
<!--                amounts to 7 + 1 + 4 + 2 + 2 = 16 bytes each.-->
<!--            </t>-->


<!--            <t> Subsequently, fletcher hashes will occupy (as shown in <xref target="first-order"/>) the length-->
<!--                of 6 + 6 + 6 = 18 bytes-->
<!--                per hash-->
<!--                and hence around 70 of those hashes fit into a packet. Deeper considerations how the-->
<!--                48-bits fletcher checksum and subsequent hashes size has been chosen can be found in-->
<!--                <xref target="collisions"/>.-->
<!--            </t>-->

            <t>
                <xref target="SIPHASH"/> hashes, as shown in <xref target="first-order"/>, take up
                6 + 6 + 8 = 20 bytes each, which means that
                about 70 such hashes fit into a single packet. A more detailed explanation of why a 64-bit
                <xref target="SIPHASH"/> and the resulting hash sizes were chosen is provided in <xref target="collisions"/>.
            </t>

            <t>These limits form the basis for the recommended partitioning and packing strategies discussed later.</t>

            <t>
                At the lowest compression level, it is optimal to generate a single CSNP packet on a mismatch in
                a hash. To achieve this, the first-level hashes should initially group about 80 LSP fragments
                together, with exceptions handled later. There is no need to maximize this initial packing.
                </t>

            <t>

                As the LSDB grows, it is better to leave some flexibility ("slack") in how fragments are grouped. This
                increases the likelihood that both sides of an adjacency will maintain the same leaf-level packing, even
                during flooding transitions, and prevents the ranges from shifting constantly.


<!--                At "lowest" compression level, it is desirable to produce one CSNP packet on a miss on the merkle hash-->
<!--                and hence such "first level" hashes should  pack initially 80-->
<!--                LSP fragments with exceptions following later. There is no need to maximize such "initial packing".-->
<!--                LSDB may grow and to-->
<!--                maximize the chances of the same "leaf packing" ranges on both sides-->
<!--                of an adjacency, even during flooding transitions, some-->
<!--                "slack" is advisable so the ranges don't move all the time.-->
            </t>

            <t>
<!--                The packing will always put all fragments of a system and its pseudonodes-->
<!--                into the same node range (which of course can exceed-->
<!--                the advisable 80 fragments sometimes) and a first order leaf will be considered "full" if addition-->
<!--                of next System ID fragments would exceed this size (except obviously, when the leaf is empty).-->
<!--                At such point the range may be included into a hash of a less specific range which however on-->
<!--                mismatch will force disaggregation and sending of less specific ranges.-->

                The packing process always places all fragments belonging to the same system and its pseudonodes within
                a single node Merkle hash. This hash may occasionally exceed the recommended size of 80 fragments. When
                it comes to node range hashes, a
                hash is considered "full" when adding the next system's fragments would go beyond this
                limit - unless, of course, the leaf is still empty.
                </t>
            <t>

                At that point, any range hash can be folded into a higher-level, less specific range hash. However,
                whenever a mismatch
                occurs at any level, the process must disaggregate the hash and send the corresponding smaller, more
                specific range or node hash instead.
            </t>


            <section title="Denser Packing and Repacking">

                <t>To talk meaningfully about node range hashes, we will refer to hashes that cover a wider range of nodes as
                    less specific, and to those covering only a subset of that range as more specific.

                </t>

                <t>
                    Instead of, or in addition to, the "first level" packing a system can decide to pack "more densely".
                    In such a mode, the HSNP may include hashes that cover a much larger range than the
                    first-level hashes. How this denser packing is implemented is left up to the specific implementation.
                    </t>

                <t>
                    A good general approach is to increase packing density in parts of the database that have not changed,
                    where no hash mismatches have been observed, or when it is reasonable to assume that the
                    neighbor already holds a mostly synchronized database.
                </t>

                <t>



                    As a secondary consideration, it would be useful for efficient, cache-based implementations for both
                    sides to agree on the ranges of Merkle hashes advertised. This would make caching of Merkle nodes much more
                    effective. However, while this idea seems viable in theory, implementing it across a large number of
                    interfaces would effectively require a global synchronization protocol - something impractical in a
                    network where nodes are constantly adding, removing, and updating fragments asynchronously. These
                    ongoing changes continually affect what the "optimal" hash ranges are. And with enough churn, such
                    a range negotiation protocol might never converge at all.

                    </t>



                <t>
                    Alternatively, providing a fast way to reconstruct the internal Merkle hash for a mismatched range
                    could reduce the need for perfect range alignment. For example, in the proposed packing scheme, nodes
                    always agree on system ID boundaries. By maintaining a Merkle hash per system ID, a node can quickly
                    recompute the required hashes whenever received ranges differ from its cached ones - even in
                    networks with a large number of nodes.

                </t>

                <t>
                    It is still highly preferable for advertised Merkle hash ranges to align on system ID boundaries as
                    much as possible - especially at the top level. Under stable conditions, these top-level Merkle hashes
                    significantly reduce the amount of CSNP exchange required, minimizing both packet volume and
                    processing overhead.
                </t>
                <t>Even though a fully stable network could, in theory, be represented by a single hash covering the
                    entire LSDB, doing so is neither desirable nor beneficial. Since an HSNP packet must be sent anyway,
                    it is much better to fill it with around 70 node range hashes. This approach limits the amount of
                    decompression required if a mismatch occurs within one of those ranges and also reduces the risk of
                    hash collisions, as discussed in <xref target="collisions"/>.

                    </t>


                <t>
                    In summary, a node should avoid compressing beyond the point where a single HSNP covers the entire
                    database. Ideally, one HSNP should contain at most about (MTU / Node Range Hash Entry Size)
                    hashes - or fewer - to keep collision probabilities low, as described in detail in <xref target="collisions"/>.

                </t>


                <!--
                <t>
                    To attempt to settle on the same ranges in HSNPs an implementation of the suggested packing should let a
                    leaf that
                    drops under 50% occupancy "start robbing" system IDs from "left" of the next leaf until the current
                    leaf meets the "full condition". This is of course a recursive action that may ultimately generate
                    less leaves, remove some and in a recursive fashion lead to the same "greedy robbery from the left"
                    in the next level up. The "left" is colloquial here for starting with lowest System IDs under
                    normal sorting criteria.
                    On the other end of the spectrum a leaf that holds more than 150% of usual capacity (i.e. 80 * 1.5 LSPs or 60 * 1.5 Hashes)
                    should be preferably split into two leaves unless it holds a single System ID with more than 80 fragments.
                    Splitting the leaves may cause a repacking at a higher
                    level again in a recursive fashion.

                </t>

                <t>
                    The rebalancing of ranges to agree across all nodes and hence reduce hashing load
                    is a trade-off in terms of possibly large recomputation vs. suffering a penalty of
                    recomputing some hashes on disagreeing ranges on every exchange. Other solutions are
                    of course possible such as internal caches that keep the recomputed hashes for the
                    neighbor's ranges.
                </t>

                <t>
                    Precise splitting/merging algorithms agreed upon increase the likelihood of nodes ending up on
                    precisely same ranges. A possibly
                    simpler idea to discuss is to simply "repack" the whole thing on some balance violations or
                    periodically. As example, the optimal dynamic programming algorithm for the problem of
                    optimal partitioning into capacity limited segments is well known.
                    Another idea is to simply use ISIS fragment sliding but this may lead in worst
                    case to first level checksumming a single fragment over time.
</t>
                <t>
                    Overall, different partitioning and packing approaches are possible but if system ID
                    as natural partition is not used, this will likely change the packet
                    format since the partition boundaries will necessarily reflect which of the fragments are
                    covered by the hashes. Although, given that ordering of fragments has to be preserved it is hard to
                    imagine anything else but start and end consisting of fragment IDs.
                </t>

                -->

            </section>



        </section>

        <section title="Hash Function for a Fragment">
            <t>
                Each fragment generates a 64-bit <xref target="SIPHASH">siphash-1-3</xref>. The salt key
                is given in <xref target="refcode"/>.

            </t>

            <t>
                In case the fragment hash generates a zero value, the value MUST be replaced with
                the constant value 1.
            </t>

            <t>
                To validate correctness of an implementation a reference hash is given in
                <xref target="reference-hash"/>.
            </t>

            <figure anchor="reference-hash">
                <artwork align="left" name="" type="" alt=""><![CDATA[
Hash Variant: Siphash-1-3:64Bits
Fragment ID: 0101:0101:0000.01-01
Seq# $0001
Csum: $0001
Len: 0512
Hash $6EB348F808C9AE4E
]]></artwork>
            </figure>

        </section>

        <section title="Fast, Incremental, Self-Inverse Hashing Function for Fragment Ranges" anchor="hashfn">
            <t>
                Since large-scale deployments must compute significant numbers of hashes over sets of frequently changing
                fragments, it is highly desirable to use a specialized hash function that supports fast incremental
                updates when fragments are added, removed, or when their checksums change.
                </t>
            <t>
                Once hashes are built over sets of fragments, it is desirable to support very fast splitting and merging
                of such sets, especially when two hashes differ in which fragments they contain.
            </t>

            <t>
            Deeper considerations on such hashes can be found in
            <xref target="HASHES"/>
            but our design space
            is simplified since the security aspects of such hashes are not relevant here.
            </t>


            <t>
                The hash for a set of fragments is computed using a very fast XOR over their fragment hashes. This makes
                it straightforward to update the hash when a leaf is added, removed, or its checksum changes. As a
                result, less specific ranges can quickly derive their hash by XOR'ing the hashes of all included, more
                specific ranges, when those are available.

<!--                The hash function for sets of fragments is a very fast XOR operation of their fletcher hashes.-->
<!--                This allows to update the hash when adding, removing a leaf or-->
<!--                changing its checksum-->
<!--                in a very fast and simple manner. This also implies that less specific ranges can very quickly-->
<!--                compute their hash by XOR'ing all included more specific ranges if such hashes are available.-->
            </t>

            <!--
            <t>
                Ultimately, and fairly obviously, third order hash uses second order hash logic to keep its hash.
                This all means that every time a first order leaf changes the
                contained system IDs for
                some reason the merkle hashes will have to be readjusted recursively in according 2nd and third
                order leaves. This is in itself nothing
                particular since
                <em>any</em>
                change on first order leaf hash forces change on second order and consequently third order leaf hash.
                This is how Merkle
                trees work after all.
            </t>
            -->

        </section>

        <section title="Procedures">
            <section title="HSNP Support Negotiation">
                <t>
                    IIHs of nodes supporting this extension MUST include a new TLV that indicates
                    support for reception of HSNPs.
                    All nodes on the adjacency MUST advertise
                    the TLV on their IIHs, otherwise HSNPs are not used.
                    Observe that a node may very well just receive and process the HSNPs and always answer
                    them with the necessary CSNPs, although this is obviously less beneficial than fully supporting
                    sending and receiving of HSNPs.
                </t>
            </section>

            <section title="Advertising and Receiving HSNPs">
                <t>
                    Advertising standard CSNPs is extended with HSNP advertisements when this feature is supported. Since
                    both CSNPs and HSNPs carry range information in their headers, they can be freely mixed, depending
                    on which level of fragment "compression" best fits the situation. In practice, sufficiently specific
                    range mismatches will naturally fall back to CSNP exchanges or flooding to resolve remaining
                    differences.
<!--                    -->
<!--                    Advertising normal CSNPs is augmented with advertisement of HSNPs when this feature is supported.-->
<!--                    Since both CSNP and HSNPs advertise "range" in their headers they can even be-->
<!--                    randomly mixed depending what level of "compression" of fragments is best suited to the situation-->
<!--                    and range mismatches specific enough will end up to CSNP exchanges or flooding.-->
                </t>

                <t>
                    The ranges MUST be sorted based on Range Start System ID. The ranges MAY overlap, albeit this is
                    highly undesirable.
                </t>

                <t>
                    Any node IDs that are not covered by the ranges in a packet - either because there are gaps between
                    the advertised ranges, or between those ranges and the HSNP's Start and End System IDs - MUST be
                    treated as missing. Consequently, if a node detects that it holds Merkle hashes for LSPs that are
                    not covered by a received HSNP, it MUST behave as it would in the same situation with a CSNP, namely
                    by flooding the missing LSPs.

                    <!-- Alternately, instead of leaving a "hole" in the packet range, a
                    HSNP can be included with Merkle hash being set to 0 (which will generate a guaranteed miss) and
                    interpreted as nodes in the range being present but not covered by HSNP compression anymore.
                    The node sending such an entry SHOULD send according CSNPs before the according HSNP packet.
                    Observe that in the case the receiving node computed the "real" XOR'ed HSNP hash of the fragments of such
                    range as resulting in a 0 value, it MUST still treat the received HSNP hash as "mismatch" to prevent
                    a stable "hole" in the database on the peer.

                    -->


<!--                    Any node IDs not covered by ranges in the packet either due to "holes" between the advertised-->
<!--                    ranges or holes between contained ranges and the Start and End System ID of the HSNP MUST be considered-->
<!--                    as missing.-->
<!--                    Consequently, in case a node detects that it holds Merkle hashes for LSPs that are not covered by the-->
<!--                    received HSNP, it MUST trigger the same behavior as triggered by CSNP with this condition,-->
<!--                    i.e. flood the missing LSPs.-->
                </t>


                <t>
                    As with CSNPs, an HSNP whose first range covers the first node in the database MUST use
                    0000.0000.0000 as the start system ID in its packet range so that missing nodes can be detected. The
                    same rule applies at the other end of the database: an HSNP whose range covers the last node MUST
                    indicate this in a way that allows detection of any missing trailing nodes.


<!--                    -->
<!--                    As is the case with CSNPs the HSNP containing in its first range the first node of the-->
<!--                    database MUST indicate in its packet range 0000.0000.0000 as start system ID to-->
<!--                    allow detection of missing nodes. Analogous behavior applies to HSNP carrying hash-->
<!--                    covering the last node of the database.-->
                </t>

<!--                <t>-->
<!--                    When a node receives HSNPs with Merkle hash ranges, it MUST either compute-->
<!--                    and verify the hashes for the indicated ranges and in case of mismatch -->
<!--                    send more specific hashes for that range - in other-->
<!--                    words, "disaggregate". Disaggregation is less preferable, however, because if both sides repeatedly-->
<!--                    do this for wide mismatched ranges, it can lead to a ping-pong effect that ultimately falls back to-->
<!--                    full CSNP exchanges.-->
<!--                    -->
<!--                    In case a node receives HSNPs where the merkle hash ranges are not the same, the node MUST-->
<!--                    either compute and verify the hashes over the ranges indicated or-->
<!--                    send more specific hashes for the range or in other words "disaggregate". The disaggregation is-->
<!--                    less preferred since in case of  mismatches over wide ranges both sides-->
<!--                    using this strategy end up ultimately in a 'ping-pong' ending with CSNPs.-->
<!--                </t>-->

                <t>
                    When a node receives an HSNP where any of the contained hashes
                    does not match after recomputation or comparison, it MUST
                    immediately send HSNPs with Merkle hashes covering the mismatched ranges. These new hashes MUST be
                    more specific than the range where the mismatch occurred.
                    </t>
                <t>

                    Alternatively, instead of more specific HSNP hashes, 
                    a node MAY choose to send corresponding CSNPs, PSNPs, or flood the mismatched LSPs
                    directly. Sending CSNPs or flooding immediately may be preferable when the mismatch affects only a
                    small number of LSPs.
                    </t>

<!--                <t>-->
<!--                    -->
<!--                    A node receiving an HSNP where any of the hashes received does not match on recomputation or-->
<!--                    comparison the result of its own hash-->
<!--                    MUST send immediately HSNPs with Merkle hashes covering the-->
<!--                    ranges where the hash mismatch was detected. The sent hashes MUST be more specific than-->
<!--                    the range where the collison occurred.-->
<!--                    Alternately, a node MAY choose to-->
<!--                    immediately send according CSNPs, PSNPs or flood the LSPs that have been detected-->
<!--                    as not matching the merkle hashes. Sending CSNPs or immediate flooding-->
<!--                    may be preferable if the mismatch-->
<!--                    covers relatively few LSPs. </t>-->

                    <t>
                        If there is a mismatch - or no computation available - for a hash covering just a single node (with
                        its pseudonodes), or for a hash spanning fewer fragments than a full CSNP PDU, then CSNPs covering
                        all fragments of the node and its pseudonodes MUST
                        be sent. Alternatively, the node MAY choose to flood those specific fragments directly instead.

<!--                        In case of mismatch or no computation available for a hash covering only a single node-->
<!--                        (with its pnodes) or mismatch on a hash covering less than a CSNP PDU full of fragments,-->
<!--                        a CSNP MUST be sent or alternately the node MAY choose to flood those fragments.-->
                    </t>


            </section>

        </section>

        <section title="HSNP PDU Format" anchor="format">
            <t>
<!--                HSNP PDU Format follows closely CSNP format where instead of CSNP entries the according merkle-->
<!--                hashes are propagated, i.e. the hashes incorporate strictly the same-->
<!--                fragments that are being incorporated in-->
<!--                CSNP packets. Start and End System IDs do not include pseudo node bytes as those are subsumed-->
<!--                implicitly.-->

                The HSNP PDU format closely follows the CSNP format. Instead of CSNP entries, it carries the
                corresponding Merkle hashes - which cover exactly the same fragments that would appear in CSNP packets.
                The Start and End System IDs exclude pseudonode bytes, as those are implicitly included within the
                ranges.
            </t>

            <figure>
            <artwork align="center" name="" type="" alt=""><![CDATA[
        ...

+--------------------------------------------+
|              PDU Length                    |
+--------------------------------------------+
|              Source ID                     |
+--------------------------------------------+
|              Start System ID               |
+--------------------------------------------+
|              End System ID                 |
+--------------------------------------------+
|           Variable Length Fields           |
+--------------------------------------------+

]]></artwork>
            </figure>

            <t>
                The Start and End System IDs use the standard ID length and indicate the range of fragments covered by
                the HSNP, just like CSNPs do. The key difference is that all pseudonodes of the systems within this
                range are implicitly included (implying as well that all fragments of the range are included).
                Both the Start and End System IDs are inclusive, meaning fragments from
                both endpoints are part of the range.


<!--                Start and End System IDs are or the usual ID Length and indicate, just like CSNP do,-->
<!--                the range of fragments that the HSNP covers with the difference that all pnodes of-->
<!--                the systems are implied in the range. Start and End System ID are inclusive, i.e.-->
<!--                the fragments of both the start and the end system ID are included.-->
            </t>

            <t>
                The variable length fields are a sorted sequence of Node Range Hash Entries
                in the following format.
            </t>

        <figure>
            <artwork align="center" name="" type="" alt=""><![CDATA[

+--------------------------------------------+
|              Range Start System ID         |
+--------------------------------------------+
|              Range End System ID           |
+--------------------------------------------+
|              Merkle Hash                   |
+--------------------------------------------+
]]></artwork>
        </figure>

            <t>The Range Start and Range End System IDs delimit the range of LSPs included in the hash. Both are
                inclusive, i.e. the fragments of both the start and the end system ID are contained within the range.</t>
            <t>
                Merkle hash consists of 8 bytes of the 64-bit computed hash of all fragments covered
                by the range.
            </t>

            <t>
                This makes an entry in typical deployment scenarios 6 + 6 + 8 = 20 bytes long, and hence about 70
                hashes fit into a typical MTU.
            </t>

            <t>Ranges MUST be sorted by Range Start System ID.</t>

        </section>

        <section title="Example">


            <t>An example will clarify things further. Consider an LSDB with 512 nodes, each having a system ID of
                1000.0000.00&lt;2 digits node-id&gt; and holding 32 fragments numbered 0-31. We skip odd node
                identifiers to create intentional "holes" in the numbering.
                The pseudonode byte is treated simply as part of the system ID, since it doesn't affect the
                scheme itself.

<!--                An example will serve well here. We limit ourselves in the examples to consideration of a-->
<!--                LSDB with 512 nodes with system identifiers of 1000.0000.00 &lt;2 digits node-id&gt; each holding-->
<!--                32 fragments-->
<!--                numbered 0 to 31.-->
<!--                We leave the uneven node identifiers out to have some "holes" in the numbering to hit some corner-->
<!--                cases in further examples.-->
<!--                We disregard the pseudo node byte as simply another byte of system identifier since it does not-->
<!--                contribute-->
<!--                further details to the scheme.-->

            </t>

            <t>
                In a stable state, reasonable compression can deliver 128 "first-order" leaves - each containing fragments
                from 2 systems (64 fragments total) - requiring roughly 512 / (2 * 70) ~ 4 packets. The first of these
                "first-order" packets would look approximately like this:


<!--                In a stable state we could expect a reasonable compression-->
<!--                with the following 128 "first order" leaves (each holding 2 systems worth-->
<!--                of fragments, hence 64 fragments) and thus generating 512 / (2 * 70) ~ 4 packets.-->
<!--                First of the  "first order" packets-->
<!--                will look roughly like this-->
            </t>

            <figure anchor="first-order">
                <artwork align="center" name="" type="" alt=""><![CDATA[
        ...

+--------------------------------------------+
|  Start System ID: 0000.0000.0000           |
+--------------------------------------------+
|  End System ID:   0000.0000.00A0           | // 80 ranges covering 160 nodes
+--------------------------------------------+
+--------------------------------------------+
|  Start System ID: 1000.0000.0000           |
+--------------------------------------------+
|  End System ID:   1000.0000.0002           | // 64 fragments over 2 systems
+--------------------------------------------+
|              Merkle Hash                   |
+--------------------------------------------+
..
|  Start System ID: 1000.0000.008E           |
+--------------------------------------------+
|  End System ID:   1000.0000.00A0           |
+--------------------------------------------+
|              Merkle Hash                   | // 64 fragments over 2 systems
+--------------------------------------------+
]]></artwork>
            </figure>

            <t>
<!--                Based on a local decision a node can start to further compress the HSNPs until-->
<!--                in most extreme case it will send just one packet full of hashes. This will divide-->
<!--                in our case 512 nodes across 70 hashes (since all of those have same amount of fragments-->
<!--                for simplicity reason). This ends up being about 8 nodes per hash (equal to 8 * 32 fragments)-->
<!--                and the packet will look as following.-->

                Based on local decisions, a node can further compress HSNPs until - in the most extreme case - it sends just
                one packet full of hashes. In our example with 512 nodes, this divides them across 70 hashes (assuming
                equal fragment counts for simplicity), resulting in about 8 nodes per hash (equivalent to 8 * 32
                fragments). The resulting packet would look like this:
            </t>

            <figure anchor="second-order">
                <artwork align="center" name="" type="" alt=""><![CDATA[
        ...

+--------------------------------------------+
|  Start System ID: 0000.0000.0000           |
+--------------------------------------------+
|  End System ID:   FFFF.FFFF.FFFF           |
+--------------------------------------------+
|  Start System ID: 1000.0000.0000           |
+--------------------------------------------+
|  End System ID:   1000.0000.0010           |
+--------------------------------------------+
|              Merkle Hash                   |
...
+--------------------------------------------+
|  Start System ID: 1000.0000.01F0           |
+--------------------------------------------+
|  End System ID:   1000.0000.0200           |
+--------------------------------------------+
|              Merkle Hash                   |
+--------------------------------------------+
]]></artwork>
            </figure>

        </section>


        <section title="IS-IS Scale Envelope Considerations" toc="default" anchor="envelope">
<!--            <t>-->
<!--                As first, obvious observation, HSNPs are of negligible value on small networks of tens of nodes-->
<!--                and hundreds of fragments. Arguably, perfectly correctly implemented flooding is enough at any size (which is-->
<!--                a very optimistic assumption historically speaking) and CSNPs are unnecessary overhead built into-->
<!--                the protocol although practically speaking they contributed largely to the stability of the protocol-->
<!--                deployments over years.-->
<!--                The larger the network becomes the higher the cost of link flap or node restart and the higher-->
<!--                the steady state cost of issuing periodic CSNPs and hence HSNP feature can stretch the scale of IS-IS-->
<!--                by a significant degree.-->
<!--                </t>-->

            <t>
                HSNPs provide negligible benefit in small networks with only tens of nodes and hundreds of fragments.
                Perfect flooding would theoretically suffice at any scale (though history shows this is too
                optimistic), and under such an assumption even CSNPs would represent mere protocol overhead - yet in
                practice they have contributed significantly to the stability of IS-IS deployments.
                </t>
            <t>
                As networks grow larger, the costs of link flaps, node restarts, and periodic CSNP exchanges increase
                substantially. HSNPs can significantly extend IS-IS scalability in these scenarios.
            </t>



<!--            <t>-->
<!--                If we start from the assumption that we want to push IS-IS to its practical limit we realize-->
<!--                that in deployment we have to contend with the limit of what comprises practically viable-->
<!--                flooding rate generated by fragment refreshes and database synchronization.-->
<!--                Path computations still play a role but-->
<!--                they can be much better deferred and take advantage of techniques like dampening and parallelization so-->
<!--                we disregard their impact here unless we deal with pathological cases of 1E6 nodes in a chain-->
<!--                with 1 fragment-->
<!--                each. -->
<!--                Hence, with more realistic assumption of something like 50E3 nodes and 1E6 fragments-->
<!--                we are at maximum configured lifetime in the realms of-->
<!--                1E6/2^16 fragments being refreshed per second and thus generating about 15 packets per-->
<!--                second load per interface (disregarding any flood reduction attempts using techniques like-->
<!--                <xref target="ID.draft-ietf-lsr-distoptflood-11"/>). Additionally, 1E6 fragments generate-->
<!--                about 10E3 CSNP packets on prevalent interface MTUs and thus during LSDB synchronization, to-->
<!--                achieve something in the order of relatively lame 120 seconds LSDB sync up we would have to contend with a-->
<!--                flooding rate of roughly 80 CSNPs per seconds. Multiplied that by an envelope of 16E3 desirable interfaces-->
<!--                we look at a sustained peak flooding rate to the control plane of somewhere around 1.5E6 packets-->
<!--                per second in the system, something that is at least an order of magnitude outside the envelope of any fast flooding-->
<!--                technique and power available for a on-system distributed control plane. Assuming maximum compression-->
<!--                using HSNP we are however in the order of one additional packet on top of the flooding caused by-->
<!--                refreshes. This leaves the 15 * 16E3 ~ 2.4E5 flooding rate to be shouldered by reduction and-->
<!--                fast flooding techniques. Not an easy task but within the realm of the possible.-->

            <t>
                To push IS-IS to its practical limits, we must account for flooding rates driven by fragment refreshes
                and LSDB synchronization. Path computation impacts can be deferred - barring pathological scenarios -
                by using dampening and parallelization,
                and thus we focus on realistic scenarios in the order of 50,000 nodes and 1 million fragments.
            </t>
            <t>
                At maximum configured lifetime, this generates ~15 packets/second per interface from refreshes (1M/65K
                fragments/second), plus ~10,000 CSNP packets for full LSDB sync.
            </t>
            <t>
                Achieving a modest 120-second sync
                requires ~80 CSNPs/second, and across 16,000 interfaces, that represents a peak of 1.5 million
                packets/second - far
                beyond current fast-flooding capabilities.
                We disregard here further techniques like <xref target="ID.draft-ietf-lsr-distoptflood-11"/>, 
                especially since they do not improve CSNP scale.
            </t>

            <t>

                With maximum HSNP compression, however, sync overhead drops to roughly one additional packet beyond
                refresh flooding, leaving ~250,000 packets/second (15 * 16K) to be handled by fast flooding and
                flood reduction
                techniques - a challenging but feasible target.
            </t>


            <t>
                The considerations above make it clear that combining fast flooding, flood reduction, and HSNP features
                will be essential in extending IS-IS scalability as deployments continue to grow larger.

<!--                With the above considerations it should become clear that a mixture of fast flooding, flood reduction-->
<!--                and HSNP features will be critical to extend the deployable envelope of IS-IS given the ever-increasing-->
<!--                deployment of the protocol.-->
            </t>

        </section>

        <section title="Further Considerations" toc="default" anchor="further">

            <section title="Maximum Advisable Hash Coverage">
                <t>
<!--                Although the mechanism can be applied to the point where a single merkle hash represents-->
<!--                an arbitrarily large database, such a single hash is not advisable. Limiting hash coverage-->
<!--                to the point where at maximum a single full HSNP packet is far more preferable.-->

                    Although the mechanism can theoretically use a single Merkle hash to represent an arbitrarily large
                    database, such an approach is not advisable. Instead, it is far preferable to limit hash coverage so
                    that at minimum one full HSNP packet is required.
                </t>

                <t>
                    In practice, limiting compression so that a maximum of about a dozen HSNP packets covers the entire
                    database is usually sufficient. For example, a single maximally compressed HSNP packet for a
                    10,000-fragment database covers ~140 fragments per hash. Allowing for more HSNP packets (e.g., 10
                    HSNP packets instead of 100 CSNPs) still provides a 10x compression factor, reduces disaggregation
                    needs during LSDB changes, and further lowers the already negligible collision risk (which in a
                    10K-sized LSDB is vanishingly small even with a single hash).


<!--                    Generally, limiting compression to the point where maximum of a dozen HSNP packets cover the database-->
<!--                    can be practically speaking better than enough. As practical example a single HSNP packet on-->
<!--                    a 1E3 fragments database compressed to maximum extent-->
<!--                    covers with each hash ~140 fragments. Allowing more HSNP packets will lead to less disaggregation-->
<!--                    on LSDB changes obviously, lower collision possibility further (which in 1E3 sized LSDB is however-->
<!--                    vanishingly small with a single hash already) and e.g.-->
<!--                    10 HSNP packets compared to 100 CSNP packets covering the LSDB normally-->
<!--                    is a compression factor of 10x already.-->
                </t>
            </section>

            <section title="Hash Collision Probabilities" anchor="collisions">
<!--                <t>-->
<!--                    As first observation it is worth noting that even with CSNPs or PSNPs IS-IS does-->
<!--                    harbor a corner case where LSDB may not end up synchronized, especially on a node restart.-->
<!--                    In simple terms, issuing a fragment with the same sequence number and checksum'ed content that-->
<!--                    collides with a previously sent fragment with a dissimilar content that generates however-->
<!--                    the same 16 bit fletcher checksum will go undetected until the fragment lifetime expires.-->
<!--                    The likelihood of such an event is largely determined by the assumptions as to likelihood-->
<!--                    of a fletcher checksum colliding. A simple assumption we will work from is that values of-->
<!--                    fletcher-->
<!--                    checksums are uniformly likely, even if the content varies by relatively few bytes.-->
<!--                    With that assumption, the-->
<!--                    likelihood of such event is simply 1/2^16 or 0.001%. This will allow for further-->
<!--                    comparisons with HSNP fletcher collision likelihood.-->
<!--                    </t>-->
                <t>


                    Even with CSNPs or PSNPs, IS-IS has a corner case where LSDB synchronization can fail - particularly
                    during node restarts. In simple terms, if a new fragment has the same sequence number and
                    different content but an identical 16-bit Fletcher
                    checksum, the collision goes undetected until the fragment expires.
                </t>
                <t>
                    Assuming Fletcher checksums are uniformly distributed (even with minor content changes), the
                    collision probability for that case is 1/2^16 ~ 0.0015%. This baseline enables meaningful
                    comparisons with HSNP hash collision probabilities.
                </t>
                <t>
<!--                    HSNP generates 48 bit fletcher checksums over what is basically PSNP data of a fragment-->
<!--                    and ISIS length of the fragment.-->
<!--                    We will be concerned about the likelihood of two fragments being at the same time generating-->
<!--                    the same fletcher checksum while they are covered by the same HSNP since such a constellation-->
<!--                    will make both fragments "disappear" due to the nature of the XOR checksum. Any collision-->
<!--                    that is not the in the node range hash is irrelevant.-->

                    HSNP uses 64-bit <xref target="SIPHASH"/> over what is essentially PSNP data for a fragment plus the
                    fragment's IS-IS length. The key concern is the probability that two fragments - covered by the same
                    hash - generate the same hash simultaneously. This would cause both fragments to
                    "disappear" due to the XOR checksum nature. Collisions occurring for fragments in 
                    different node range hashes are irrelevant.

                </t>
                <t>
                    One might argue that XORing different sets of hashes could produce the same result, but the
                    probability of two distinct sets having identical modulo-2 sums across all 64 bits is vanishingly
                    small. This scenario is not considered further.
                    </t>
<!--                <t>-->
<!--                    One could further argue that XOR'ing several hashes can produce the same value as another-->
<!--                set of XOR'ed hashes but the likelihood of the sets having the same combination of modulo-->
<!--                1 sums on all of the 48 bits is so small (somewhere in the order of 1E-15)-->
<!--                    that we don't consider it further.-->
<!--                </t>-->

                <t>
                    Collision probability analysis is complex for the general case, though the birthday paradox gives a
                    rough estimate of 0.18% collision likelihood for 48-bit hashes in a 1M-sized set. For 64-bit
                    hashes this intuitively reduces to ~0.000,002,7%. To reflect reality better than
                    simply relying on statistical assumptions, we instead rely on extensive simulations that
                    mirror real-world conditions.
                </t>
                <t>
                    These simulations model 50,000-node networks with 1M fragments, assuming node IDs differ by only 3
                    bytes, maximum fragment lifetimes, random protocol checksums on fragment refresh, and packet length changes
                    in just 5% of refreshes to reflect network stability. Results are derived from 32 networks running
                    for 2 years each.
                   </t>

<!--                <t>-->
<!--                    Probability statistics in not trivial to deal with the generic case of collisions (though it delivers based on-->
<!--                    birthday paradox first ballpark number of 0.18% collision likelihood in-->
<!--                    48 bit numbers colliding in 1E6 sized set, i.e. a single hash). Intuitively we would expect that using-->
<!--                    64 bit numbers with resulting birthday paradox probability of 0.0000027% would be necessary.-->
<!--                    But with intention of mirroring operational realities as close as possible, we fall on extensive-->
<!--                    simulations of refreshes on a vast network of 50E3 nodes with 1E6 fragments under assumptions-->
<!--                    of node IDs differing in 3 bytes only, maximum fragment lifetime and protocol checksums-->
<!--                    on each sequence number refresh being a random value. We change the length of the packet in-->
<!--                    5% of all refreshes only to mirror a rather stable network. 32 networks, each running for 2 years-->
<!--                    are used to derive the numbers.-->
<!--                </t>-->
                <t>
                    Across 64 years of simulation and resulting
                    36E9 refreshes of all fragments, we observed a total of 142 collisions for 48 bits variant of
                    <xref target="SIPHASH"/> on the whole set or roughly ~0.000,000,4%, much lower than birthday paradox
                    prediction.
                    However, assuming a single HSNP packet covering the whole database only 3 collisions of those matter
                    - a probability of ~0.000,000,008%,
                    or roughly 1 occurrence per 20 years. These collisions have an average lifetime of about 10 hours. These
                    rates are orders of magnitude lower than the birthday paradox predictions, likely because node
                    IDs act as a consistent "salt," effectively pre-partitioning the probability space. This would be
                    arguably "good enough" by a long stretch.

<!--                    The absolute number of collisions in 36E9 refreshes over set of all fragments-->
<!--                    is in the order of 200-300 collisions or-->
<!--                    6E-9 probability or expressed otherwise about 3 occurrences a year. Lifetime of such a-->
<!--                    collision shows on simulations as average of about 10 hours. The 6E-9 being significantly-->
<!--                    lower than the 1E-3 predicted by birthday paradox is likely to be attributed to the fact-->
<!--                    that node id can be seen as an always differing "salt" to the checksums and hence "pre-partitioning"-->
<!--                    the space of probabilities.-->
                </t>

                <t>
                    Nevertheless, on further investigation of <xref target="SIPHASH"/> using the standard 64-bit 1-3
                    variant, simulations over the same scenario generate *no* detectable collisions. Measuring the CPU
                    cost of the 48-bit variant vs. the 64-bit variant - or even a 64-bit variant of the traditional
                    Fletcher checksum - shows negligible differences of low single-digit percent when using modern
                    implementation techniques. Thus, the 64-bit <xref target="SIPHASH"/> 1-3 variant has been chosen
                    and should provide a very safe margin even for much larger databases.
                </t>

<!--                <t>-->
<!--                    The situation becomes more practically relevant when considering collisions within a single-->
<!--                    highest-compression packet containing 70 hashes. This roughly halves the collision rate, leading to-->
<!--                    about 1 collision per year in such a large network. If compression is instead limited to 70 HSNP-->
<!--                    packets (rather than maximum compression), the rate drops further to approximately 1 collision per-->
<!--                    10 years.-->
<!--                    </t>-->

<!--                <t>-->
<!--                    The situation becomes more interesting and practically relevant-->
<!--                    once we start to consider how many collision occur in-->
<!--                    the highest compression packet with 70 hashes. This roughly halves the amount of collisions-->
<!--                    so we can expect in such a large network about 1 collision per year. Once we go to the assumption-->
<!--                    that we don't use maximum compression but limit it to 70 HSNP packets we end up with about 1 collision-->
<!--                    per 10 years.-->
<!--                </t>-->

<!--                <t>-->
<!--                    Surprisingly, switching to 64-bit hashes reduces total collisions by only about 50%, and under-->
<!--                    maximum compression (single full HSNP packet), the results are actually measurably worse. This-->
<!--                    appears to be because 64-bit collisions tend to cluster more closely together in the database - a-->
<!--                    phenomenon we currently lack an explanation for.-->

<!--&lt;!&ndash;                Surprisingly enough, going to 64 bit drops the total number of collisions by about 50% but under the&ndash;&gt;-->
<!--&lt;!&ndash;                assumptions of a single full HSNP packet the outcome is measurably worse. This is based on the fact&ndash;&gt;-->
<!--&lt;!&ndash;                taht 64 bit collision hashes occur much "closer together" on the database, something for which we lack an explanation.&ndash;&gt;-->
<!--                </t>-->

                <t>
                    Ultimately, a highly conservative (not to say paranoid) implementation can simply monitor the LSDB for
                    fragments whose hashes collide and prevent such fragments from being covered by the same HSNP hash.
                    This forces receiving nodes to use separate, collision-free hashes instead. Such an approach
                    completely eliminates any risk of synchronization misses when using HSNPs.
                </t>

<!--                <t>Obviously, a very, very conservative, not to say paranoid implementation can easily-->
<!--                monitor the LSDB for presence of colliding fletcher hashes on the fragments and simply prevent-->
<!--                such fragments being included in a single HSNP hash which will force the receiving node-->
<!--                to consider separate hashes without a collision in them. This is sufficient to prevent-->
<!--                any kind of "misses" when using HSNP to synchronize the databases.</t>-->
                <t>
                    Other techniques are possible, such as slowly walking the database and sending CSNPs. However, for a
                    1M-fragment database that generates 10,000 such CSNP packets, the chance of this detecting a collision
                    during its 10-hour window is likely extremely small.
<!--                    -->
<!--                    Other techniques are obviously possible like walking the database at slow speed and sending-->
<!--                    CSNPs. Given a 1E6 fragments database generates 10E3 such packets the chance of this preventing a-->
<!--                    collision during its 10 hours time is probably extremely small.-->
                </t>
            </section>

            <section title="Impact of Packet Losses">
                <t>
                    Hashes covering large numbers of fragments are more vulnerable to packet losses, as each lost packet
                    affects a much larger portion of the LSDB during synchronization. Implementations can choose HSNP
                    node ranges freely, but should balance maximum compression against "good enough" compression that
                    reduces both collision risk and vulnerability to unavoidable packet drops.

<!--                    Hashes covering large numbers of fragments will be more susceptible to packet losses since each loss-->
<!--                    covers a much larger part of the LSDB during synchronization. An implementation can choose the-->
<!--                    node ranges covered by HSNPs in any way it desires but a consideration should be a balance-->
<!--                    between maximum reduction and a compression "good enough" that is less prone to collisions and-->
<!--                    unavoidable packet losses.-->
                </t>
            </section>

            <section title="Decompression and Caching/Comparison Optimizations">
                <t>
<!--                    As mentioned above a node may apply many strategies to speed up decompression.-->
<!--                    E.g. LSPs missing in HSNPs as not covered by ranges are clearly "missing in action" and can-->
<!--                    be reflooded. As another example,-->
<!--                    small ranges where merkle mismatched can generate CSNPs, PSNPs or lead to flooding immediately.-->

                    As mentioned earlier, nodes can use various strategies to accelerate decompression. For example,
                    LSPs missing from HSNPs (those not covered by any ranges) are clearly absent and can be immediately
                    reflooded. Similarly, small mismatched Merkle ranges can trigger immediate CSNPs, PSNPs, or direct
                    flooding.

                </t>
                <t>
<!--                    Caching of hashes can be applied at many levels since the merkle hashes suggested here are-->
<!--                    easily computed. Obviously keeping a hash on all fragments of a node and its pnodes is the-->
<!--                    simplest and most relevant candidate but other resolutions are easily achievable.-->
<!--                    Even if certain elements must be removed e.g. on receiving a-->
<!--                    range &lt;A &#45;&#45; B&gt; while the node already holds &lt; A &#45;&#45; B &amp; next-after-B&gt; the hash can be simply-->
<!--                    adjusted by removing 'next-after-B' node merkle hash from the cached result.-->

                    Caching of hashes can be applied at many levels. The
                    simplest and most useful approach is maintaining a hash for all fragments of a node and its
                    pseudonodes, though other granularities work equally well. Even when adjusting for changes - such as
                    receiving a range &lt; A - B &gt; while having cached &lt; A - B &amp; next-after-B &gt;
                    - the cached hash can be quickly updated by
                    simply XORing out the next-after-B node Merkle hash.

                </t>
            </section>

        </section> <!-- end of section -->

        <section title="Security Considerations" toc="default">

            <t>TBD
            </t>

        </section> <!-- end of security considerations -->

        <section anchor="IGP_IANA" title="IANA Section">
            <t>TBD
            </t>
        </section>

        <!-- 2 -->
        <section title="Contributors" toc="default">

            <t>TBD</t>

        </section> <!-- end of contributors -->

        <!-- 2 -->
        <section title="Acknowledgement" toc="default">
            <t>

                People have been talking about "compressing CSNPs" for a very long time, reportedly going back to when
                Radia Perlman and an insomniac Dave Katz were walking the halls discussing it. Recent attempts to scale
                the protocol much further have made it worthwhile to turn this idea into a standardized, practical
                engineering solution.
            </t>
            <t>
                Les Ginsberg identified several unresolved issues and contributed alternative ideas to the draft.
            </t>
            <t>
                Job Snijders initiated discussion of <xref target="SIPHASH"/> being likely a better solution than
                traditional Fletcher checksumming of the fragments.
            </t>

        </section> <!-- end of acknowledgement -->

    </middle>

    <back>


        <references title="Normative References">

            <!-- &RFC3032;
            &RFC8296;
            -->

            <reference anchor="MERKLE">
                <front>
                    <title>A Digital Signature Based on a Conventional Encryption Function</title>
                    <author initials="R.C." surname="Merkle">
                    </author>
                    <date year="1988"/>
                    <keyword>Advances in Cryptology &#8211; CRYPTO '87</keyword>
                </front>
            </reference>

            <reference anchor="SIPHASH" target="https://131002.net/siphash/">
                <front>
                    <title>SipHash: A Fast Short-Input PRF</title>
                    <author initials="J.-P." surname="Aumasson" fullname="Jean-Philippe Aumasson"/>
                    <author initials="D. J." surname="Bernstein" fullname="Daniel J. Bernstein"/>
                    <date year="2012"/>
                </front>
                <seriesInfo name="Lecture Notes in Computer Science" value="Vol. 7668, INDOCRYPT 2012, pp. 489-508"/>
            </reference>

            <reference anchor="HASHES">
                <front>
                    <title>Security considerations for incremental hash functions based on pair block chaining</title>
                    <author initials="C.-W." surname="Phan">
                    </author>
                    <date year="2006"/>
                    <keyword>Computers and Security 25</keyword>

                </front>
            </reference>

            <reference anchor="SKIP">
                <front>
                    <title>Skip lists: A probabilistic alternative to balanced trees</title>
                    <author initials="W." surname="Pugh">
                    </author>
                    <date year="1990"/>
                    <keyword>Communications of the ACM</keyword>

                </front>
            </reference>

        </references> <!-- end of normative references -->

        <references title="Informative References">

            <reference anchor="ID.draft-ietf-lsr-distoptflood-11">
                <front>
                    <title>IS-IS Distributed Flooding Reduction</title>
                    <author initials="R." surname="White et al.">
                        <organization/>
                    </author>
                    <date month="Oct" year="2025"/>
                </front>
                <format target="https://www.ietf.org/id/draft-ietf-lsr-distoptflood-11.txt"
                        type="TXT"/>
            </reference>


        </references> <!-- end of informative references -->

        <section title="Reference Implementation of SIP Fragment Hashing" anchor="refcode">

            <figure>
            <artwork align="center" name="" type="" alt=""><![CDATA[
<CODE BEGINS>
pub fn fragment_hash(
    fragmentid: &SharedFragmentID,
    fragmentcontent: &FragmentContent,
    variant: Option<HSNPFragmentHashVariant>,
    size: Option<HSNPSize>,
) -> HSNPHash {
    let nid = fragmentid.node.node_id().0;
    let pnodebe = fragmentid.pnode.0.to_be_bytes();
    let seqnrbe = fragmentcontent.seqnr.0.to_be_bytes();
    let fragmentnrbe = fragmentid.fragmentnr.0.to_be_bytes();
    let csumbe = fragmentcontent.isis_checksum.0.to_be_bytes();
    let lenbe = fragmentcontent.isis_pdu_length.0.to_be_bytes();

    let mut rotate_in_primary = nid.iter().chain(
        csumbe.iter().chain(
            seqnrbe.iter().chain(
                fragmentnrbe
                    .iter()
                    .chain(lenbe.iter().chain(pnodebe.iter())),
            ),
        ),
    );

    let size = size.unwrap_or(HSNPSize::LIBRARY_HSNP_SIZE);
    let variant = variant.unwrap_or(HSNPFragmentHashVariant::LIBRARY_HSNP_FRAGMENT_HASH);

    match variant {
        HSNPFragmentHashVariant::Siphash => {

            let key = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16u8];
            let mut hasher = SipHasher13::new_with_key(&key);
            let mut sl = [0u8; 16];
            rotate_in_primary
                .map(|v| *v)
                .collect_slice_checked(&mut sl[..]);
            hasher.write(&sl);
            let r = hasher.finish();
            match size {
                HSNPSize::_64Bits => {
                    r.into()
                },
                HSNPSize::_48Bits => {
                    let hin = r ^ (r >> 48);
                    (hin & 0xffff_ffff_ffff).into()
                }
<CODE ENDS>
]]></artwork>
            </figure>
        </section>

    </back>

</rfc>
