0. Introduction

This is the design specification for the next hop tracking feature in
Quagga.

1. Background

Recursive routes are of the form:

    p/m --> n
    [Ex: 1.1.0.0/16 --> 2.2.2.2]

where 'n' itself is resolved through another route as follows:

    p2/m --> h, interface
    [Ex: 2.2.2.0/24 --> 3.3.3.3, eth0]

Usually, BGP routes are recursive in nature and BGP nexthops get
resolved through an IGP route. IGP usually adds its routes pointing to
an interface (these are called non-recursive routes).

When BGP receives a recursive route from a peer, it needs to validate
the nexthop. The path is marked valid or invalid based on the
reachability status of the nexthop. Nexthop validation is also
important for the BGP decision process, as the metric to reach the
nexthop is a parameter in best path selection.

As it goes with routing, this is a dynamic process. The route to the
nexthop can change, and the nexthop can become unreachable or
reachable. In the current BGP implementation, nexthop validation is
done periodically in the scanner run. The default scanner run interval
is one minute. Every minute, the scanner task walks the entire BGP
table. It checks the validity of each nexthop with Zebra (the routing
table manager) through a request and response message exchange between
the BGP and Zebra processes. The BGP process is blocked for that
duration. The mechanism has two major drawbacks:

(1) The scanner task runs to completion. That can potentially starve
    the other tasks for long periods of time, depending on the BGP
    table size and the number of nexthops.

(2) Convergence around routing changes that affect the nexthops can be
    long (around a minute with the default intervals). The interval
    can be shortened to achieve faster reaction time, but that makes
    the first problem worse, with the scanner task consuming most of
    the CPU resources.

The "next hop tracking" feature makes this process event-driven. It
eliminates periodic nexthop validation and introduces an asynchronous
communication path between BGP and Zebra for route change
notifications that can then be acted upon.

2. Goal

Stating the obvious, the main goal is to remove the two limitations we
discussed in the previous section. The goals, in a constructive tone,
are the following:

- fairness: the scanner run should not consume an unjustly high amount
  of CPU time. This should give an overall good performance and
  response time to other events (route changes, session events,
  IO/user interface).

- convergence: BGP must react to nexthop changes instantly and provide
  sub-second convergence. This may involve diverting the routes from
  one nexthop to another.

3. Overview of the changes

The changes are in both the BGP and Zebra modules. The short summary
is the following:

- Zebra implements a registration mechanism by which clients can
  register for next hop notification. Consequently, it maintains a
  separate table, per (VRF, AF) pair, of next hops and an interested
  client list per next hop.

- When the main routing table changes in Zebra, it evaluates the next
  hop table: for each next hop, it checks if the route table
  modifications have changed its state. If so, it notifies the
  interested clients.

- BGP is one such client. It registers the next hops corresponding to
  all of its received routes/paths. It also threads the paths against
  each nexthop structure.

- When BGP receives a next hop notification from Zebra, it walks the
  corresponding path list. It makes the paths valid or invalid
  depending on the next hop notification. It then re-computes the best
  path for the corresponding destination. This may result in
  re-announcing those destinations to peers.

4. Design

4.1. Modules

The core design introduces an "nht" (next hop tracking) module in BGP
and an "rnh" (recursive nexthop) module in Zebra. The "nht" module
provides the following APIs:

bgp_find_or_add_nexthop()  : find or add a nexthop in the BGP nexthop
                             table
bgp_find_nexthop()         : find a nexthop in the BGP nexthop table
bgp_parse_nexthop_update() : parse a nexthop update message coming
                             from zebra

The "rnh" module provides the following APIs:

zebra_add_rnh()            : add a recursive nexthop
zebra_delete_rnh()         : delete a recursive nexthop
zebra_lookup_rnh()         : lookup a recursive nexthop
zebra_add_rnh_client()     : register a client for nexthop
                             notifications against a recursive nexthop
zebra_remove_rnh_client()  : remove the client registration for a
                             recursive nexthop
zebra_evaluate_rnh_table() : (re)evaluate the recursive nexthop table
                             (most probably because the main routing
                             table has changed).
zebra_cleanup_rnh_client() : clean up a client from the "rnh" module
                             data structures (most probably because
                             the client is going away).
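
For illustration only, the declarations below sketch one plausible
shape for these APIs. The parameter lists, return types and typedefs
(vrf_id_t, struct prefix, struct zserv, and so on) are assumptions
made for this sketch and may not match the actual Quagga prototypes.

/* Illustrative prototypes only -- parameter lists and return types are
 * assumptions for this sketch, not the actual Quagga declarations. */

struct prefix;                  /* generic prefix (the nexthop address) */
struct zserv;                   /* zebra's per-client connection state  */
struct bgp_info;                /* a BGP path                           */
struct bgp_nexthop_cache;       /* a BGP nexthop entry                  */
struct rnh;                     /* a recursive nexthop entry in zebra   */
typedef unsigned int vrf_id_t;  /* stand-in for the real VRF id type    */

/* BGP "nht" module */
struct bgp_nexthop_cache *bgp_find_or_add_nexthop (int afi,
                                                   struct bgp_info *path);
struct bgp_nexthop_cache *bgp_find_nexthop (struct bgp_info *path);
int bgp_parse_nexthop_update (void);  /* consumes one ZEBRA_NEXTHOP_UPDATE */

/* Zebra "rnh" module */
struct rnh *zebra_add_rnh (struct prefix *p, vrf_id_t vrfid);
struct rnh *zebra_lookup_rnh (struct prefix *p, vrf_id_t vrfid);
void zebra_delete_rnh (struct rnh *rnh);
void zebra_add_rnh_client (struct rnh *rnh, struct zserv *client);
void zebra_remove_rnh_client (struct rnh *rnh, struct zserv *client);
int zebra_evaluate_rnh_table (vrf_id_t vrfid, int family);
int zebra_cleanup_rnh_client (vrf_id_t vrfid, int family,
                              struct zserv *client);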

4.2. Control flow

The next hop registration control flow is the following:

        <==== BGP Process ====>        |<==== Zebra Process ====>
                                       |
receive module      nht module         |  zserv module       rnh module
------------------------------------------------------------------------
               |                       |                    |
bgp_update_    |                       |                    |
     main()    | bgp_find_or_add_      |                    |
               |      nexthop()        |                    |
               |                       |                    |
               |                       | zserv_nexthop_     |
               |                       |      register()    |
               |                       |                    | zebra_add_rnh()
               |                       |                    |

The next hop notification control flow is the following:

       <==== Zebra Process ====>       |<==== BGP Process ====>
                                       |
rib module          rnh module         |  zebra module       nht module
------------------------------------------------------------------------
               |                       |                    |
meta_queue_    |                       |                    |
   process()   | zebra_evaluate_       |                    |
               |      rnh_table()      |                    |
               |                       |                    |
               |                       | bgp_read_nexthop_  |
               |                       |      update()      |
               |                       |                    | bgp_parse_
               |                       |                    |   nexthop_update()
               |                       |                    |

4.3. zclient message format

ZEBRA_NEXTHOP_REGISTER and ZEBRA_NEXTHOP_UNREGISTER messages are
encoded in the following way:

/*
 *  0                   1                   2                   3
 *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |               AF              |  prefix len   |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                        Nexthop prefix                         .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                                                               .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |               AF              |  prefix len   |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                        Nexthop prefix                         .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 */
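
To make the layout concrete, the helper below serializes one
registration entry (AF, prefix length, nexthop prefix) into a raw
buffer. It is a minimal sketch assuming the prefix is carried in the
minimal number of bytes implied by the prefix length; the function
name and buffer handling are hypothetical, and the common zclient
header that precedes the body is omitted.

/* Hypothetical encoder for one (AF, prefix len, nexthop prefix) entry of a
 * ZEBRA_NEXTHOP_REGISTER / ZEBRA_NEXTHOP_UNREGISTER body.  Not Quagga code;
 * the zclient header that precedes the body is not written here. */

#include <arpa/inet.h>   /* htons() */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the number of bytes written into 'buf', or 0 if it does not fit.
 * 'af' is AF_INET or AF_INET6, 'prefixlen' is in bits, 'addr' points to the
 * nexthop address in network byte order. */
size_t
nexthop_register_encode (uint8_t *buf, size_t buflen,
                         uint16_t af, uint8_t prefixlen, const void *addr)
{
  size_t addrlen = (prefixlen + 7) / 8;  /* prefix bytes actually sent */
  size_t need = 2 + 1 + addrlen;         /* AF (2) + prefix len (1) + prefix */
  uint16_t af_net;

  if (buflen < need)
    return 0;

  af_net = htons (af);
  memcpy (buf, &af_net, 2);              /* AF, 16 bits, network order */
  buf[2] = prefixlen;                    /* prefix length, 8 bits */
  memcpy (buf + 3, addr, addrlen);       /* nexthop prefix bytes */

  return need;
}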

ZEBRA_NEXTHOP_UPDATE message is encoded as follows:

/*
 *  0                   1                   2                   3
 *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |               AF              |  prefix len   |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                Nexthop prefix getting resolved                .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |                             metric                            |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |                           #nexthops                           |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |                          nexthop type                         |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                   resolving Nexthop details                   .
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                                                               .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * |                          nexthop type                         |
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 * .                   resolving Nexthop details                   .
 * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 */
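
Conversely, a client reading this message would peel off the fixed
leading fields before looking at the per-nexthop entries. The sketch
below decodes only the AF, prefix length, resolved prefix and metric;
the nexthop count and the type-dependent nexthop details that follow
are not reproduced here, and every name in the sketch is hypothetical
rather than the actual BGP-side parsing code.

/* Hypothetical decoder for the leading fields of a ZEBRA_NEXTHOP_UPDATE
 * body (AF, prefix len, resolved prefix, metric).  The nexthop count and
 * the type-dependent nexthop details that follow are not decoded here. */

#include <arpa/inet.h>   /* ntohs(), ntohl() */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct nexthop_update_head {
  uint16_t af;            /* address family of the prefix being resolved */
  uint8_t  prefixlen;     /* prefix length in bits */
  uint8_t  prefix[16];    /* room for up to an IPv6 prefix */
  uint32_t metric;        /* IGP metric to reach the nexthop */
};

/* Returns the number of bytes consumed, or 0 on a short/malformed buffer. */
size_t
nexthop_update_decode (const uint8_t *buf, size_t len,
                       struct nexthop_update_head *out)
{
  size_t off, addrlen;
  uint16_t af_net;
  uint32_t metric_net;

  if (len < 3)
    return 0;
  memcpy (&af_net, buf, 2);
  out->af = ntohs (af_net);              /* AF, 16 bits */
  out->prefixlen = buf[2];               /* prefix length, 8 bits */
  off = 3;

  addrlen = (out->prefixlen + 7) / 8;    /* prefix bytes on the wire */
  if (addrlen > sizeof (out->prefix) || len < off + addrlen + 4)
    return 0;
  memset (out->prefix, 0, sizeof (out->prefix));
  memcpy (out->prefix, buf + off, addrlen);
  off += addrlen;

  memcpy (&metric_net, buf + off, 4);    /* metric, 32 bits */
  out->metric = ntohl (metric_net);
  off += 4;

  return off;                            /* #nexthops and details follow */
}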

4.4. BGP data structure

Legend:

/\   struct bgp_node: a BGP destination/route/prefix
\/

[ ]  struct bgp_info: a BGP path (e.g. route received from a peer)

 _
(_)  struct bgp_nexthop_cache: a BGP nexthop

/\           NULL
\/--+          ^
    |          :
    +--[ ]--[ ]--[ ]--> NULL
/\             :
\/--+          :
    |          :
    +--[ ]--[ ]--> NULL
               :
 _             :
(_)............:
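
Expressed as (deliberately simplified, hypothetical) C, the picture
corresponds to something like the structures below: each nexthop cache
entry keeps a threaded list of the paths that resolve through it,
while every path also remains on its own destination's path list. The
type and field names here are illustrative, not the actual Quagga
definitions.

/* Simplified stand-ins for the structures in the diagram above; the names
 * and fields are illustrative, not the actual Quagga definitions. */

struct nexthop_cache;                /* (_) in the diagram */

struct path {                        /* [ ] in the diagram (struct bgp_info) */
  struct path *next;                 /* next path of the same destination */
  struct path *nh_thread;            /* next path using the same nexthop
                                        (the dotted line in the diagram) */
  struct nexthop_cache *nhc;         /* nexthop this path resolves through */
  int valid;                         /* reachability as last reported */
};

struct destination {                 /* /\ in the diagram (struct bgp_node) */
  struct path *paths;                /* head of this destination's path list */
};

struct nexthop_cache {               /* (_) in the diagram */
  struct path *threaded_paths;       /* paths threaded against this nexthop */
  unsigned int path_count;
  unsigned int metric;               /* IGP metric reported by zebra */
  int valid;
};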

4.5. Zebra data structure

rnh table:

          O
         / \
        O   O
       / \
      O   O

struct rnh
{
  u_char flags;
  struct rib *state;
  struct list *client_list;
  struct route_node *node;
};
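
As a rough illustration of how this table is used, the sketch below
shows the shape of the evaluation step: after a routing table change,
each tracked nexthop is re-resolved against the RIB and, only if its
resolving route changed, the clients on its client_list are sent a
ZEBRA_NEXTHOP_UPDATE. All helper functions and the simplified client
array below are hypothetical stand-ins, not the Quagga implementation.

/* Hedged sketch of what zebra_evaluate_rnh_table() conceptually does.
 * rib_resolve(), rib_changed() and send_nexthop_update() are hypothetical
 * helpers, and the fixed-size client array is a simplification of the
 * client_list shown in struct rnh above. */

struct rib;          /* the resolving route, as in 'struct rib *state' */
struct client;       /* stand-in for struct zserv (a registered client) */

struct tracked_nexthop {             /* stand-in for struct rnh */
  struct rib *state;                 /* route currently resolving the nexthop */
  struct client *clients[8];         /* simplified client_list */
  int nclients;
};

/* hypothetical helpers */
struct rib *rib_resolve (const struct tracked_nexthop *tn); /* longest match */
int rib_changed (const struct rib *before, const struct rib *after);
void send_nexthop_update (struct client *c, const struct tracked_nexthop *tn);

void
evaluate_tracked_nexthops (struct tracked_nexthop **table, int count)
{
  int i, j;

  for (i = 0; i < count; i++)
    {
      struct tracked_nexthop *tn = table[i];
      struct rib *resolved = rib_resolve (tn);  /* what resolves it now? */

      if (!rib_changed (tn->state, resolved))
        continue;                               /* nothing to report */

      tn->state = resolved;                     /* remember the new state */
      for (j = 0; j < tn->nclients; j++)
        send_nexthop_update (tn->clients[j], tn);  /* ZEBRA_NEXTHOP_UPDATE */
    }
}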

5. User interface changes

quagga# show ip nht
3.3.3.3
 resolved via kernel
 via 11.0.0.6, swp1
 Client list: bgp(fd 12)
11.0.0.10
 resolved via connected
 is directly connected, swp2
 Client list: bgp(fd 12)
11.0.0.18
 resolved via connected
 is directly connected, swp4
 Client list: bgp(fd 12)
11.11.11.11
 resolved via kernel
 via 10.0.1.2, eth0
 Client list: bgp(fd 12)

quagga# show ip bgp nexthop
Current BGP nexthop cache:
 3.3.3.3 valid [IGP metric 0], #paths 3
  Last update: Wed Oct 16 04:43:49 2013
 11.0.0.10 valid [IGP metric 1], #paths 1
  Last update: Wed Oct 16 04:43:51 2013
 11.0.0.18 valid [IGP metric 1], #paths 2
  Last update: Wed Oct 16 04:43:47 2013
 11.11.11.11 valid [IGP metric 0], #paths 1
  Last update: Wed Oct 16 04:43:47 2013

quagga# show ipv6 nht
quagga# show ip bgp nexthop detail

quagga# debug bgp nht
quagga# debug zebra nht

6. Sample test cases

      r2----r3
     /  \  /
    r1----r4

- Verify that a change in IGP cost triggers NHT
  + shutdown the r1-r4 and r2-r4 links
  + no shut the r1-r4 and r2-r4 links and wait for OSPF to come back
    up
  + We should be back to the original nexthop via r4 now
- Verify that a NH becoming unreachable triggers NHT
  + Shutdown all links to r4
- Verify that a NH becoming reachable triggers NHT
  + no shut all links to r4

7. Future work

- route-policy for next hop validation (e.g. ignore default route)
- damping for rapid next hop changes
- prioritized handling of nexthop changes ((un)reachability vs. metric
  changes)
- handling recursion loops, e.g.

     11.11.11.11/32 -> 12.12.12.12
     12.12.12.12/32 -> 11.11.11.11
     11.0.0.0/8     -> <interface>

- better statistics