Kraken POC Overview and Summary
EDMXX-XX – Version 1.6
Author: Kraken POC Team / Endace
©2013 Endace Technology Ltd. All Rights Reserved – Confidential – External distribution prohibited – Internal distribution restricted.
Table of Contents

1 Introduction:
2 Customer User Stories:
  2.1 Rob and the Audit Dept:
  2.2 Rob and the EU:
  2.3 Steve and Diagnosing a Network Problem
  2.4 Rob and Adding Storage:
  2.5 Rob Again and US Legal Hassles:
  2.6 A Friendly Government Agency
3 Three Possible Competitive Advantages
4 Design Analysis:
  4.1 Competitive Technologies:
    4.1.1 Probe:
    4.1.2 High End SAN:
    4.1.3 Low End SAN or NAS:
  4.2 Hard Drive Reliability
  4.3 Flash
  4.4 Load Balancing / Overall Balance
  4.5 Intelligent Retention
  4.6 Fault Tolerance
5 Identified Key Technical Challenges:
  5.1 Cost
  5.2 Mechanical Disk Packing
  5.3 Cooling
  5.4 Power Distribution
  5.5 Probe Interface (Packet and Query)
  5.6 Scalable Load Balancing
  5.7 Node Write/Read CPU and Disk Performance
  5.8 Robust Failure and Transient Handling
  5.9 Overprovision / Specified End of Life
6 POC Detailed Design Goals to Meet the Key Technical Challenges
  6.1 Form Factor / Heat / Vibration:
  6.2 Packet Store
  6.3 Packet Query Performance
  6.4 Queries Supported in POC
  6.5 Probe CPU Loading
  6.6 Node CPU Loading
  6.7 Resilient Failure Mode
  6.8 Even Packet Store
  6.9 Packet Availability
7 Kraken POC Detailed Straw-man Design
  7.1 Disk Drive Choice:
  7.2 Form Factor Choice:
  7.3 The Ring / Adding and Subtracting Components
  7.4 Interconnect Topology
  7.5 Q2 FPGA Architecture – Packet Storage
  7.6 Tentacle FPGA Architecture – Packet Storage
  7.7 Tentacle FPGA Architecture – Query Return
  7.8 Q2 FPGA Architecture – Query Return
  7.9 Next Generation Probe Architecture:
  7.10 Intelligent Load Balancing
  7.11 Packet Storage Flow
  7.12 Queries and Query Response
  7.13 Management processes
    7.13.1 Startup / Boot
    7.13.2 Disk failures
    7.13.3 Tentacle failure
  7.14 Other considerations
8 Kraken POC Phased Development Plan
  8.1 Cooling
  8.2 Chassis
  8.3 Initial Development Platform (IDP)
    8.3.1 IDP Introduction
    8.3.2 IDP Plan
  8.4 Ethernet Switch Options
  8.5 Final POC Platform
9 Appendix A: Kraken Rough Cost Breakdown – note – now out of date
10 Appendix B: Ethernet Rings
11 Open Questions
  11.1 Query Size
  11.2 Packet Sorting
  11.3 Arista Switch
  11.4 Text Search
12 Bibliography

Table of Figures:

Figure 1: Overall Kraken Topology
Figure 2: Q2/Probe Architecture
Figure 3: Packet Processing on a Kraken Tentacle
Figure 4: Query Return Process
Figure 5: Next generation Kraken-enabled Probe Architecture. Please note: this is a wild guess to provide a discussion framework. Beyond the "Generic Packet Storage Interface" nothing here is required for Kraken operation.
Figure 6: Initial POC Development Platform
Figure 7: Final POC Test Setup

Table of Tables:

Table 1: Basic Storage Numbers – Packet Capture Rate and History Length versus Storage Required
Table 2: Time to Query Completion as a function of Query Size
Table 3: Example queries for POC
Table 4: Possible Disk Drives
Table 5: Expected cost of Initial POC Development Platform
Revision History:

| Revision | Date      | Changes |
|----------|-----------|---------|
| 0.1      | 13-5-2013 | First Draft |
| 0.2      | 14-5-2013 | Filling in basic headings and ideas from |
| 0.3      | 16-5-2013 | Added some comments from |
| 0.4      | 21-5-2013 | Architecture ideas from |
| 0.5      | 24-5-2013 | Added input from Kraken meeting May 23rd, 2013 |
| 0.6      | 24-5-2013 | Added FPGA Architecture text+diagrams |
| 0.7      | 24-5-2013 | Added ring stuff |
| 0.8      | 24-5-2013 | Costing stuff added |
| 0.9      | 4-6-2013  | New requirements based on discussions with |
| 1.0      | 5-6-2013  | New architecture added |
| 1.1      | 10-6-2013 | More details of new architecture |
| 1.2      | 11-6-2013 | Lots more from |
| 1.3      | 12-6-2013 | Some fixes |
| 1.4      | 19-6-2013 | Updates based on events since last Thursday |
| 1.5      | 25-6-2013 | Updates based on feedback from |
| 1.6      | 1-7-2013  | Split document into two and renamed this piece. |
1 Introduction:

Kraken is a product aimed at solving the deep-storage problem faced by network analytics users. Like a SAN, Kraken includes large amounts of disk or other storage; unlike a SAN, Kraken works directly with packets and lets you search and retrieve them quickly. Some basic numbers on storage:

Table 1: Basic Storage Numbers – Packet Capture Rate and History Length versus Storage Required

| Rate (Gbps) | 1 second | 1 hour | 1 day  | 7 days | 1 month | 1 year |
|-------------|----------|--------|--------|--------|---------|--------|
| 0.1         | 12 MB    | 45 GB  | 1 TB   | 7 TB   | 29 TB   | 354 TB |
| 1           | 128 MB   | 450 GB | 10 TB  | 73 TB  | 295 TB  | 3 PB   |
| 4.8         | 614 MB   | 2 TB   | 50 TB  | 354 TB | 1 PB    | 16 PB  |
| 10          | 1 GB     | 4 TB   | 105 TB | 738 TB | 2 PB    | 34 PB  |
| 24          | 3 GB     | 10 TB  | 253 TB | 1 PB   | 6 PB    | 83 PB  |
| 40          | 5 GB     | 17 TB  | 421 TB | 2 PB   | 11 PB   | 138 PB |

Three key goals for Kraken identified by CTO/Marketing:

1) Performance: always capable of X (20 Gbps or whatever) with no "it depends".
2) Query time: depends linearly on the amount of data returned, not the amount of data searched.
3) "Zero Touch Maintenance": just works, without maintenance, for a specified period.

Notes:

a) Query time: the above statement is for queries based on pre-indexed fields (e.g. the IP 5-tuple). Queries based on non-indexed fields, i.e. searching for a text string, may depend on the amount of data searched.
b) "Lights-out management" here refers to the ability to leave the Kraken unit alone without any maintenance for a specified period, for example in a darkened datacentre. It does not refer to basic "lights out" functionality as provided by IPMI.
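The numbers in Table 1 are simple rate-times-time arithmetic, with some rounding and mixed unit conventions in the original figures. A minimal sketch of the calculation (Python, purely for illustration):

```python
def storage_bytes(rate_gbps: float, seconds: float) -> float:
    """Bytes needed to hold `seconds` of capture at a sustained `rate_gbps`."""
    return rate_gbps * 1e9 / 8 * seconds  # line rate in bytes/s, times duration

DAY = 86_400
print(f"{storage_bytes(1, 3600) / 1e9:.0f} GB")   # 450 GB: one hour at 1 Gbps
print(f"{storage_bytes(24, DAY) / 1e12:.0f} TB")  # ~259 TB: one day at 24 Gbps
                                                  # (Table 1 rounds this to 253 TB)
```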
2 Customer User Stories:

2.1 Rob and the Audit Dept:

Rob has been informed by his audit department that he needs to store 3 months' worth of packets. His probes capture 1 Gbps worth of traffic. He has a budget of $800k. As long as he can retrieve any block of packets from a certain time up to a month ago, he's happy.

2.2 Rob and the EU:

Rob has been told he needs to store packets on a time basis in order to meet EU data retention requirements. He needs to guarantee that all packets stored are stored for precisely the same length of time.

2.3 Steve and Diagnosing a Network Problem

Steve has our latest probe connected to a Kraken. He's monitoring four 10Ge links, but his overall traffic is bursty and averages around 24 Gbps total. He's a Tier 3 guy trying to figure out why application A running on Server S is slow every Thursday at 4:33pm. It's Monday morning.

2.4 Rob and Adding Storage:

Rob has been told his Tier 3 guys need longer packet storage in order to diagnose network issues more effectively. He currently has a probe with 24 Gbps capture and 24 TB of storage, which gives him roughly 2.5 hours of storage. He would like to have three days' worth of storage. He would like to just plug something in and have it work, as he has no resources for any kind of software development.

2.5 Rob Again and US Legal Hassles:

The FBI has decreed that unencrypted data is the same as public data. So, if they seize a server in a cloud datafarm as a result of an ongoing legal case, any data on that server, even if it belongs to another virtual customer, is subject to search and possible litigation. This violates Rob's company's contract with its cloud customers, so all server storage will in future be encrypted. Rob wants to know if Kraken and the Probe encrypt all storage too.

2.6 A Friendly Government Agency

An FGA has the encryption keys for a well-known chat program. They wish to decrypt all packets sent by this program on a large network in the last 24 hours and look for the text string "Domino's Pizza", as they have information suggesting this is the favourite pizza of international terrorists.
3 Three Possible Competitive Advantages

What is Kraken's likely competition? We can identify several competitors based on what we see our customers doing.

One competitor is ourselves: if buying another Probe with its storage is cheaper or easier than buying Kraken, then that is what customers will do. Likewise, if buying a SAN or NAS and connecting it to the probe is easier and cheaper, that is what they will do. Probe is already making this path easier than it is currently, because Probe needs to compete with vendors who have SAN integration. We presumably also have other vendors out there seeing the need that we see and developing something Kraken-like, which implies that we need to put some strong IP into Kraken or we will be fighting a production battle (which isn't our forte) rather than a technical battle (which is).

So, cost is clearly going to be an important factor for customers – not just capital expenditure (CAPEX) but also operational cost (OPEX): cost of power, cost of AC, cost of renting rack space, and the continuing cost of upgrade/maintenance.

Performance is likely to be a big differentiator too: can the solution scale to 24 Gbps? 40 Gbps? A high performance SAN capable of sinking 40 Gbps is much more expensive than a low performance device. Where does Kraken sit here?

One architecture that has been discussed for Kraken is similar to RAID 0: a bunch of cheap commercial disks with no redundancy. This model is cheap and can be robust, but it will have the issue that drive failures will occur and packets will be lost. That can be addressed by utilizing some form of redundancy on Kraken, but at the expense of more cost and presumably more complexity. So packet availability – in other words, the probability that when a customer asks for a certain set of packets he gets them – will be another major differentiator between Kraken and its competitors.

Summary -> Kraken Key Differentiators:

- COST
- PERFORMANCE
- PACKET AVAILABILITY
4 Design Analysis:

4.1 Competitive Technologies:

4.1.1 Probe:

48 TB at $60k, 3U. Query performance with POI?

4.1.2 High End SAN:

4.1.3 Low End SAN or NAS:

NetApp E5460: list price $200k, ex-demo $70k, 60x3TB drives, 4U, dual controller with dual 8Gbps FC. No performance numbers yet. Probe software has single-LUN SAN support, but without POI download – that presumably goes in EP5.2.

4.2 Hard Drive Reliability

Hard drives generally fail in two ways: they gradually develop sectors that lose the ability to store data, and eventually they suffer a complete catastrophic failure (Pinheiro, 2007). We are looking for an architecture that is low cost and high performance while having an acceptable packet availability metric. Low cost implies that we use commercial high density drives as opposed to enterprise rated drives. High performance implies that we don't employ RAID strategies as such (verify?). So we're considering a large array of cheap perishable drives with little or no redundancy. We can obviously expect a number of drives to fail completely within the expected lifetime of a shipped unit. We need to answer the following questions:

1) How many extra drives do we need to include in a Kraken so that we drop to the expected storage capacity only at end of life?
2) How many extra drives do we need to include in a Kraken so that we drop to the expected performance only at end of life?
3) What is the probability that a customer will not find a packet when they request it? (Need assumptions here – something like 24 Gbps traffic, 500 customers, each looking for a 500 Mbyte packet trace once a week.) How does that compare with a low-end SAN? A high-end SAN? For SANs we need to include expected downtime in the equation. A rough model is sketched after section 4.3.

4.3 Flash

It looks like either cheap flash disk or even all-flash storage will be too late for POC but may be there for productization.
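Returning to question 3 in section 4.2: a crude first-pass model treats complete drive failures as independent with a constant annualised failure rate (AFR) and asks how often a requested trace touches a dead, un-replicated drive. A minimal sketch (Python; the 5% AFR, 400-drive count and 8-drives-per-trace striping are illustrative assumptions, not figures from this document):

```python
import math

def expected_dead_drives(n_drives: int, afr: float, years: float) -> float:
    """Expected complete failures, assuming independent drives with constant AFR."""
    return n_drives * (1 - math.exp(-afr * years))

def p_trace_lost(n_drives: int, dead_drives: float, drives_per_trace: int) -> float:
    """P(a requested trace touches >= 1 dead drive): no redundancy, uniform striping."""
    p_dead = dead_drives / n_drives
    return 1 - (1 - p_dead) ** drives_per_trace

dead = expected_dead_drives(400, afr=0.05, years=1.0)        # ~19.5 drives in year one
print(f"{p_trace_lost(400, dead, drives_per_trace=8):.0%}")  # ~33% of traces hit a dead drive
```

Even this toy model makes the design pressure clear: with no redundancy and traces striped over several drives, availability decays quickly, which is why flow-coherent placement (section 7.10) and the spare-drive scheme (section 7.13.2) matter.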
4.4 Load Balancing / Overall Balance

A single Probe / multiple Kraken configuration for storage capacity increase is required for POC. The multiple Kraken / multiple Probe case has a complicated load balancing scenario and will not be considered for POC.

4.5 Intelligent Retention

We may wish to store different packets for different amounts of time. Intelligent retention will increase the IO load on the disks and the CPU load running the retention policy. Numerous questions exist for this option. We will not consider Intelligent Retention for POC.

4.6 Fault Tolerance

We are looking for slow degradation, not outright failure. We need to survive 40G links being disconnected, hard drives failing, and Krakens being removed, added and upgraded. When a probe is capturing continuously, the packet data stored is perishable to some extent, as it will in general only be needed or useful for a fixed amount of time. Thus our fault tolerance may be higher in some areas than a system which stores data for long periods of time and which has a more absolute requirement.
5 Identified Key Technical Challenges:

Based on the above discussion, we identify the following key technical challenges:

5.1 Cost

End user cost must be equal to or cheaper than a SAN/NAS solution of similar performance and functionality. This includes initial purchase cost (CAPEX) and operational cost (OPEX), including rack space rental, air conditioning and power.

5.2 Mechanical Disk Packing

Specifically, how do you pack a huge number of commercial spinning drives into as small a space as possible? How do you deal with the vibration issues involved?

5.3 Cooling

Each node drive will generate roughly 3W of heat. Each tentacle will generate an additional 35W of heat. The entire Kraken is expected to generate 2.4kW. Removing this heat will be a serious challenge.

5.4 Power Distribution

The power consumed in the chassis will be approximately 2400W. While this is not enormous for a 6U chassis, it is still significant.

5.5 Probe Interface (Packet and Query)

How does the probe send packets and queries to the Kraken? How does Kraken reply? How does the interface deal with packets lost due to congestion (should not happen), disk drive bad sectors (will happen) or disk drive failure (will happen)? How do we deal with Krakens added to the array? Krakens removed? Krakens starting up or shutting down? How can we guarantee a response time that depends on query size but not storage size? Flow control in both directions? Can the Probe re-sort a bunch of unordered packets arriving from multiple Krakens in a memory hole? Does the Q2 card need to help with this sorting? How do we deal with requests that are really large (bigger than a memory hole)? Does our API split requests?

5.6 Scalable Load Balancing

We need to be able to scale up in a) storage depth and b) packet rate, while holding the growth in query reply times to at most ln(n). This implies the ability to add nodes to a Kraken, add more Krakens to an existing Kraken and add more probes to a Kraken system.
5.7 Node Write/Read CPU and Disk Performance

The node needs to a) write packets to disk, b) keep time and flow-hash indexes for packets, c) parse incoming requests, d) identify packets, based on the time and flow-hash indexes, that might match incoming requests, e) read those packets off disk, f) parse those packets to verify which ones actually do match the incoming request, and g) return those packets somehow to the 40G ring. We need to verify that any particular architecture provides sufficient processing power for this.

5.8 Robust Failure and Transient Handling

Hard drive failures, both gradual and sudden. Addition of probes or Krakens, and removal of same. We want a lights-out replacement method rather than an urgent maintenance method – i.e. Kraken should keep running, with redundant fail-over, for a specified amount of time: six months or a year.

5.9 Overprovision / Specified End of Life

We need to overprovision the box, both in terms of storage space and performance, in such a way that we reach specified minimums at the defined end of life.
6 POC Detailed Design Goals to Meet the Key Technical Challenges

The purpose here is to identify what exact goals the POC (proof-of-concept) must meet in order to demonstrate that we have solved the key technical challenges.

6.1 Form Factor / Heat / Vibration:

We need to show that we can pack >300 TB of 2.5" drives in ≤3U of vertical height, power the drives while they run a representative read/write pattern, and keep the drives cool enough (within the allowed operating temperatures for that specific drive). The entire Kraken will take up 7U of rack space (3U for drives, 3U for compute and 1U for a large Ethernet switch).

6.2 Packet Store

We need to store generic Ethernet packets at 24 Gbps continuously without dropping any.

6.3 Packet Query Performance

We must finish returning packets to probe main memory after a single query in a time given by:

    T = αn + L

where T is the total time in seconds from query sent to last packet returned, α is a constant which must be less than 2.0e-9 seconds per byte, n is the number of bytes returned by the query, and L is a time constant less than 500 ms. Running multiple queries simultaneously will reduce individual query performance. We will allow for at least 32 simultaneous queries. Example times are given in the table below:

Table 2: Time to Query Completion as a function of Query Size

| Query Size | Time to Finish |
|------------|----------------|
| 1 B        | 0.5 s          |
| 1 kB       | 0.5 s          |
| 1 MB       | 0.5 s          |
| 1 GB       | 2.5 s          |
| 1 TB       | 36 min         |

6.4 Queries Supported in POC

All queries supported in POC will be of the form n{fk,t1,t2}, where n is a number of triplets with 1 <= n <= 256, fk is a flow key included in each packet in the flow hash location of a standard 0xe extension header, t1 is a time, t2 is a time, and any of fk, t1 or t2 may be replaced by *. Note that much more complicated queries (such as text string search) are possible in the Kraken architecture, but they will not be demonstrated in the POC. The following table contains examples of queries supported for POC.
Note also that there will be three basic forms of query message in POC: the first will return the number of packets and number of bytes that match the query, the second will return the actual packets that match the query, and the third will cancel/clean up the query.

Table 3: Example queries for POC.

| Example | Translation | Supported in POC? |
|---------|-------------|-------------------|
| {flowhash=0xABCDEF,t1=0s,t2=1s} | Find all packets with flowhash equal to 0xABCDEF between times 0 seconds and 1 second inclusive. | Yes |
| {flowhash=*,t1=1s,t2=1.5s} or {flowhash=0x123456,t1=*,t2=*} or {flowhash=0x345678,t1=2s,t2=*} | Find all packets with any flowhash between times 1 s and 1.5 s, all packets with flowhash equal to 0x123456 at any time, and all packets with flowhash equal to 0x345678 between time 2 s and the current time. | Yes |
| {flowhash=*,t1=*,t2=10s} | Find all packets with any flowhash between the earliest time for which you have packets and the time of 10 s. | Yes |
| Text_string="Domino's Pizza" | Find any packets containing this string. | No |
| Application=FaceBook | Find any packets created by the Facebook application. | No |

A sketch of how these query triplets could be represented follows after section 6.6.

6.5 Probe CPU Loading

In the eventual product, Kraken-related processes must not consume large amounts of probe processor capability. However, for the POC we will be developing both the ILB and the Query algorithms in software, so the POC may use as much as all of the processor resources available in an 8000-type probe box (the Centos probe). Once an understanding of the ILB and Query algorithms is achieved, we will investigate ways of offloading processing so as to reduce reliance on probe resources as much as possible. So, for example, query response sorting will be done on the Probe for POC.

6.6 Node CPU Loading

We need to show that the node CPU chosen has at least 20% overprovision to handle a representative access pattern of packet reception and queries. The worst case representative pattern would be 24 Gbps of 64 byte packets received and one 100 MB query every second.
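The n{fk,t1,t2} form maps naturally onto a small matching structure. A minimal sketch (Python; the class and field names are mine, and the interpretation of * for t1/t2 as "earliest stored" / "current time" is taken from the Table 3 translations):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryTriplet:
    flowhash: Optional[int]  # None encodes '*': any flowhash
    t1: Optional[float]      # seconds; None encodes '*': earliest stored time
    t2: Optional[float]      # seconds; None encodes '*': the current time

    def matches(self, pkt_flowhash: int, pkt_time: float, now: float) -> bool:
        lo = self.t1 if self.t1 is not None else float("-inf")
        hi = self.t2 if self.t2 is not None else now
        if not (lo <= pkt_time <= hi):
            return False
        return self.flowhash is None or self.flowhash == pkt_flowhash

# A query is 1 <= n <= 256 triplets; a packet matches if any triplet matches.
query = [QueryTriplet(0xABCDEF, 0.0, 1.0), QueryTriplet(0x345678, 2.0, None)]
assert any(t.matches(0x345678, 5.0, now=10.0) for t in query)
```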
6.7 Resilient Failure Mode

We need to show that we can handle 1) a complete drive failure, 2) the addition of a Kraken, and 3) the removal of a Kraken. We do not need to show Kraken interoperating in a standard network fabric.

6.8 Even Packet Store

The ILB (Intelligent Load Balancing) algorithm must guarantee that the oldest packet on any disk is no more than 10% older than the oldest packet on any other disk; i.e. the disks need to be equally utilized. This does not apply to disks being used for drive failure resilience.

6.9 Packet Availability

We need to demonstrate that packet availability will meet a customer-acceptable level. Define customer-acceptable? Presumably this is based on flow as well – probability of flow corruption?
7 Kraken POC Detailed Straw-man Design

Given the POC design goals from the previous section, we wish to narrow down and eventually choose a POC architecture. What we describe below follows from the above design goals.

7.1 Disk Drive Choice:

Disk drive choice is still in flux. We prefer the Western Digital WD10JPVT at the moment due to its high density and low cost. We assume for the POC that all drives will be identical in all Krakens, although for the production Kraken we will probably need to support whatever is cheapest at the time. Currently we intend to have roughly 400 of these drives per Kraken.

Table 4: Possible Disk Drives

| Manufacturer    | Model          | Noteworthiness   | Capacity (GB) | Cost    | Cost/GB | Power (W) | W/GB   | Z (mm) |
|-----------------|----------------|------------------|---------------|---------|---------|-----------|--------|--------|
| Western Digital | WD20NPVT       | Green (Power)    | 2000          | $157.00 | $0.08   | 1.7       | 0.0009 | 15     |
| Western Digital | WD7500BPKT     | Black (Quality)  | 750           | $78.13  | $0.10   | 1.75      | 0.0023 | 9.5    |
| Western Digital | WD10JPVT       | Blue             | 1000          | $77.00  | $0.08   | 1.4       | 0.0014 | 9.5    |
| Hitachi         | HTS72010A9E630 | Vanilla          | 1000          | $77.00  | $0.08   | 2.1       | 0.0021 | 9.5    |
| Seagate         | ST1000LM014    | Hybrid 8GB Flash | 1000          | $119.00 | $0.12   | 2.7       | 0.0027 | 9.5    |

Per-system costing (166 disks per system, 1 system):

| Model          | Cost Each | System Cost |
|----------------|-----------|-------------|
| WD20NPVT       | $157.00   | $26,062     |
| WD7500BPKT     | $78.13    | $12,970     |
| WD10JPVT       | $77.00    | $12,782     |
| HTS72010A9E630 | $77.00    | $12,782     |
| ST1000LM014    | $119.00   | $19,754     |

7.2 Form Factor Choice:

We'd like to hit the 3U disk drive form factor. For POC we will have 3U for disk drives, 3U for compute and 1U for a large Ethernet switch. Power supplies will be included in the compute 3U.
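The derived columns in Table 4 are simple ratios. A quick sketch that reproduces them, and adds the aggregate disk power that the cooling work in sections 5.3 and 8.1 has to absorb (Python; the 166-disk system size comes from the costing rows above, not from the "roughly 400 drives" figure in the prose):

```python
DRIVES = {
    # model: (capacity_gb, unit_cost_usd, power_w) -- copied from Table 4
    "WD20NPVT":    (2000, 157.00, 1.70),
    "WD10JPVT":    (1000,  77.00, 1.40),
    "ST1000LM014": (1000, 119.00, 2.70),
}
DISKS_PER_SYSTEM = 166

for model, (gb, cost, watts) in DRIVES.items():
    print(f"{model}: ${cost / gb:.2f}/GB, {watts / gb:.4f} W/GB, "
          f"system cost ${cost * DISKS_PER_SYSTEM:,.0f}, "
          f"system disk power {watts * DISKS_PER_SYSTEM:.0f} W")
# e.g. WD10JPVT: $0.08/GB, 0.0014 W/GB, system cost $12,782, disk power 232 W
```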
7.3 The Ring / Adding and Subtracting Components

We have a requirement that we can add Krakens, subtract Krakens and potentially add or subtract probes with an absolute minimum of disturbance to our packet capture solution. We also have the requirement that we maximize bandwidth into and out of the Kraken. These constraints are satisfied by a) building a ring architecture with data flowing in both directions through the ring – that way, if the ring is broken during the addition or subtraction of a Kraken, data flow continues – and b) doubling the number of rings to double our bandwidth.

We have decided that ring architectures are a distraction at this point: they are obviously doable and require no technology invention, and the choice is mostly a market/customer driven decision. So, for the POC, we will only demonstrate a point-to-point 40G link between a probe and one or two Krakens.
7.4 Interconnect Topology

The proposed topology of the Kraken POC is based on 10G KR (backplane) Ethernet. Each Tentacle looks like a mini-probe: it consists of a standard DAG FPGA looking a great deal like a dag10sx2, an embedded ComExpress processor (probably based on the Intel quad-core i7-3612QE or i7-4700EQ), up to 16 GByte of plug-in DRAM memory, and four onboard SATA connections plus 16 extension SATA connections, for a total of 20 disks (probably the 1TB 2.5" form factor spinning disks). We have around twenty tentacles, each connected to a main central Intel Fulcrum Ethernet switch by two 10G KR links and one 1G KR link. The Intel Fulcrum Ethernet switch is connected to the outside world by four 40G QSFP-type links. We are considering additional 10G KR connections between FPGAs for additional inter-tentacle bandwidth.

There is also a "twenty-first" ComExpress processor which acts as the central startup/PXE boot manager for the Kraken box, the source of environmental information and the management for the Fulcrum switch.

Figure 1: Overall Kraken Topology
7.5 Q2 FPGA Architecture – Packet Storage

In describing the architecture, it makes most sense to consider the two directions separately: first packet storage (probe to Kraken), followed by the search query return (Kraken to probe). The anticipated packet storage part of the probe architecture is shown in the figure below:

Figure 2: Q2/Probe Architecture

The intended operation is described in the following sections. On the probe/Centos box, we have a pair of D9.2X (or other DAG) cards receiving traffic. Received packets are hash load balanced (Flow HLB) into four streams each, such that the streams are flow-safe and the HLB settings of both cards are identical (i.e. the flows contained in Stream #0 are the same flows as are in Stream #1). A software process (the "packet processor") merges corresponding streams from each card so that we have four flow-safe streams of packets for all 4 incoming links (i.e. Stream #0 and Stream #1 are merged in time order).

The packet processor then encapsulates these packets in multi-E3 format, using MAC addresses chosen according to its own intelligent load balancing (ILB) algorithm. These MAC addresses determine the final destination within the Kraken, so the load balancing algorithm becomes a software-based solution (i.e. bandwidth-based / flow-based / "intelligent" etc.). A process within the probe listens for per-node "keep-alive" packets sent by each node in each Kraken to determine the list of active MAC addresses to send packets to or to expect queries from. The load balancing algorithm is split into multiple streams, such that multiple cores can implement it, thus providing scope to try different things.

Once we reach this point, the Q2 simply becomes a transmit source, similar to any other DAG card, with the difference that we implement multiple burst managers in order to be capable of saturating the output 40G link. Each burst manager would be capable of transmitting ~27Gbps.
Note that the packets stored in the TX MEM Stream are ERF format with multi-E3 packets inside. There is no load balancing between the Q2 transmit output and the Kraken receiving tentacle: it is the responsibility of the ILB algorithm to prevent overloading a Kraken tentacle. The E3 packets have sequence numbers on a per-node basis so that receiving tentacles can detect dropped packets.

7.6 Tentacle FPGA Architecture – Packet Storage

The Ethernet switch within the Kraken distributes incoming packets based on L2 routing (i.e. destination MAC addresses). Each node (one disk per node) within a Tentacle will have been assigned an individual MAC address. Each of the two 10G KR ports on the tentacle within Kraken will occasionally send a set of broadcast "keep-alive" packets to the main switch, using the MAC addresses of the nodes behind it as the source MAC addresses. These packets serve two purposes: one, they allow the switch to "learn" which MAC addresses are down each 10G KR port; two, they allow the ILB and Query processes running on the probe to identify and catalogue Kraken tentacle nodes. A diagram of packet processing on the Kraken Tentacle is shown below:

Figure 3: Packet Processing on a Kraken Tentacle (2x10Ge KR into a 10sx2-like Tentacle FPGA, steering to MEM Streams #0–#19, each backed by a packet processor with time index, flowhash index and disk)

Each Tentacle FPGA is intended to be a simple mechanism for handling large bandwidth links into the Intel CPUs, in much the same way as a traditional DAG card. As such, we intend to leverage existing DAG firmware and software IP as much as possible. The FPGA is required to do the following:

Dual-port 10G Receive: incoming packets will be multi-E3 encapsulated, with a destination MAC address which corresponds to a particular tentacle node. We will not de-encapsulate the multi-E3. Instead, we will add an additional ERF header which will include a standard flowhash extension header.
However, the flowhash in this extension header will simply contain the lower bits of the destination MAC address. This will let us steer individual multi-E3 packets based upon the E3 destination MAC address. We will steer packets intended for each node to a different memory stream. As the current HSBM supports 32 streams, we will be able to handle at least that many nodes per tentacle (currently 20). Each stream will be processed by a separate process to strip the external ERF and E3 headers, write the packet to disk, and insert references into the time index and the flowhash index.

7.7 Tentacle FPGA Architecture – Query Return

An overall diagram of the query return process is shown below:

Figure 4: Query Return Process (a broadcast query reaches Query Processors #0–#19, each with its time index, flowhash index and disk; a Merge Process feeds a single TX MEM Stream over 2x10Ge KR, with pause-frame flow control through the Kraken Ethernet switch to the Q2 FPGA, where per-query sort processes use timestamp/pointer indexes for Queries #1–#16)

Query returns are initiated by the probe/Centos box. The queries are broadcast via 1GE management links to each of the tentacles of the Kraken, where the CPUs start looking up the packets in timestamp order (making use of the time index). Packets are then encapsulated in E3 format, with the query ID embedded in the E3 encapsulation and the source MAC address set to the address of the particular node responding. The Merge process writes packets to a single TX memory stream.
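Per node, the query processor is essentially "walk the time index in timestamp order, filter, emit E3-tagged results, then signal completion". A minimal sketch of that loop (Python; the index layout and the read_packet/send_e3 helpers are assumptions for illustration, not real interfaces):

```python
def answer_query(query_id, t1, t2, flowhash, time_index, read_packet, send_e3):
    """Per-node query processor sketch: emit matching packets in timestamp order.

    time_index: list of (timestamp, disk_offset, pkt_flowhash), sorted by timestamp.
    read_packet(offset): assumed disk-read helper returning the stored packet.
    send_e3(query_id, pkt): assumed helper that E3-encapsulates and transmits,
    with the source MAC set to this node's address.
    """
    for ts, offset, fh in time_index:
        if ts < t1:
            continue
        if ts > t2:
            break  # index is time-ordered, so nothing later can match
        if flowhash is not None and fh != flowhash:
            continue  # a real node would consult the flowhash index instead
        send_e3(query_id, read_packet(offset))
    send_e3(query_id, None)  # query-completion marker for the merge on the probe
```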
The tentacle CPU transmits the packets via the Tentacle FPGA in much the same way as a standard DAG card transmits via TERF. We can support any number of queries at this level: provided software specifies the ID correctly, it doesn't matter that queries are interleaved in the transmit stream.

All of the tentacles may transmit large amounts of data at the same time – indeed, this is the situation we would like to see. As such, the outgoing 40G link will be overloaded. In order to ensure that we do not lose packets, the switch should be set up to utilise standard Ethernet pause frames in the case of congestion. This would have the effect of throttling the aggregate bandwidth of the tentacles to the outgoing bandwidth of the switch. We currently lack any support for pause frames in firmware, so we will be required to implement support for this. This mechanism of flow control back to the tentacle also means that we do not need significant amounts of buffering within the firmware in this path. Flow control on a per-query basis is discussed in the next section, as this is part of the Q2's role. Once a query is complete (whether packets have been found or not), the tentacle CPU will send a packet indicating query completion.

7.8 Q2 FPGA Architecture – Query Return

As part of initiating the query, the probe sets up a pair of memory holes for the returning packets. The Q2 card receives the packets and de-encapsulates the E3. The packets are then steered to a suitable memory hole based on the query ID. At the same time, we write a pointer, a timestamp and a tentacle identifier (the abbreviated source MAC address) for each packet into the second memory hole. The packet data from each of the tentacles will be in order, but the ordering between the tentacles is not guaranteed. Software is therefore required to sort the pointers into timestamp order, such that downstream applications can retrieve an ordered set of packets. The query completion packets allow the software processing the packets to complete the merge process. Because we use two memory holes for each query return, we could implement up to 16 concurrent queries per burst manager.

Existing firmware IP limits the receive performance to ~27Gbps of bandwidth to host memory (HSBM is only capable of gen2 x8 speeds). This means that, in the case of data being returned quickly, we need a flow control mechanism back to the Kraken on a per-query basis. We need to ensure that the flow control only affects the query that requires it, not all outstanding queries. As such, we propose a credit-based system: each tentacle is given a set number of credits for each query return. As it sends packets, that number is reduced.
As space on the probe is made available for that query (i.e. the packets are consumed), we return credits to the tentacles, thus allowing them to send more responses for that query. This credit return path will need to be designed carefully, so as to ensure we are not going to either exhaust buffering resources or artificially limit the data rates.

Alternatives to software-based sorting have been explored but are not described here (for example, merging between adjacent tentacles and daisy-chaining results such that we end up with a single in-order stream of packets). All of the options would require a firmware component that we don't see value in for the POC phase of the project.
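The bookkeeping for the credit scheme described above is simple; the design care goes into the return path. A minimal single-threaded sketch (Python; the initial credit count and the per-packet granularity are assumptions consistent with the text, not decided values):

```python
class QueryCredits:
    """Per-(query, tentacle) credit accounting for the return path (sketch)."""

    def __init__(self, initial_credits: int = 64):
        self.credits = initial_credits

    def can_send(self) -> bool:
        return self.credits > 0

    def on_send(self) -> None:
        self.credits -= 1          # one credit consumed per response packet

    def on_credit_return(self, n: int) -> None:
        self.credits += n          # probe consumed packets, freeing buffer space

def send_responses(pending, credits, tx):
    """Tentacle send loop: a starved query stalls without blocking the others.

    pending: {query_id: [packets...]}; credits: {query_id: QueryCredits};
    tx(query_id, packet): assumed E3 transmit helper.
    """
    for qid, packets in pending.items():
        while packets and credits[qid].can_send():
            tx(qid, packets.pop(0))
            credits[qid].on_send()
```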
7.9 Next Generation Probe Architecture:

A basic diagram of a possible next generation Kraken-enabled Probe architecture is shown in the figure below:

Figure 5: Next generation Kraken-enabled Probe Architecture. Please note: this is a wild guess to provide a discussion framework. Beyond the "Generic Packet Storage Interface" nothing here is required for Kraken operation.

Packets are captured by a DAG card and written to multiple memory streams.
A multithreaded version of Capture Daemon (CPD) reads packets, processes them to determine the protocol stack, hands them off to a DPI engine, gets them back from the DPI engine, adds the packet to any existing flow record updates for the Probe Meta-Database (PMDB) and passes the packet on to a Generic Packet Storage Interface. CPD also performs flow tracking. The DPI engine updates records in the PMDB upon identification of the application.

Note that in the current Probe the packets and their indexes form a unit which is separate from, but closely linked with, the PMDB. In the next generation probe we are assuming that the meta-database and the actual packet store will be entirely separate. A user running the VISION UI will determine what packets he/she would like to download and then ask VISION for those packets. Our intention is that VISION's interface to the Kraken will be a generic "Packet Query API", probably REST-compliant, which abstracts the query interface.

7.10 Intelligent Load Balancing

Software Intelligent Load Balancing (ILB) is required to distribute traffic evenly amongst Kraken nodes to maintain performance. It is to be performed in one place only, to allow maximum flexibility and abstraction from particular hardware, and it runs in parallel to the packet processors in Figure 2, selecting flows to associate with MAC addresses. The exact algorithm has not yet been determined, but it is intended that such load balancing is flow-coherent wherever possible. In the case of very large flows that exceed the capacity of a single disk (determined by the number of packets from that flow over a short time period), the load balancing model may transition from flow-coherent to evenly distributed (and back) moment to moment. It is intended to experiment with this algorithm to determine whether it is more efficient to distribute these to nodes on the same tentacle, randomly across all logical nodes, or striped across tentacles (possibly taking into account the grouping of additional interconnect). Considerations here include balancing tentacle merge and link bandwidth against the higher inter-logical-node communication bandwidth available within a CPU.

In general, for each of the 4 pairs of memory holes from the 9.2x2s (see section 7.5), a thread does a 2-way merge (with a small buffer to handle timestamp jitter between the two streams) and encapsulates a group of packets in E3 with a destination node MAC assigned by the load balancing algorithm. It then forwards this to one of two central threads that manage the two transmit memory holes on the Q2 (and transmit on a first-come-first-served basis). Load balancing between the two transmit streams may be required to mitigate congestion due to varying flow sizes, but there should be sufficient HSBM bandwidth to maintain the minimum capture rate (24G) in pathological cases. Parts of existing Probe software can likely be re-used for the merge section.
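Since the exact ILB algorithm is deliberately left open above, the following is only one plausible shape for it: flows stick to the node they were first assigned to, and a flow hot enough to exceed a single disk's capacity temporarily falls back to least-loaded spreading. A minimal sketch (Python; the threshold, the load metric and all names are assumptions):

```python
class IntelligentLoadBalancer:
    """Flow-coherent node-MAC selection with spill-over for very large flows."""

    def __init__(self, node_macs, hot_flow_pps=100_000):
        self.flow_to_mac = {}                       # flowhash -> assigned node MAC
        self.node_load = {m: 0 for m in node_macs}  # packets sent this interval
        self.flow_rate = {}                         # flowhash -> packets this interval
        self.hot_flow_pps = hot_flow_pps            # assumed per-disk capacity proxy

    def select_mac(self, flowhash: int) -> str:
        self.flow_rate[flowhash] = self.flow_rate.get(flowhash, 0) + 1
        least_loaded = min(self.node_load, key=self.node_load.get)
        if self.flow_rate[flowhash] > self.hot_flow_pps:
            mac = least_loaded   # flow too big for one disk: abandon coherency
        else:
            mac = self.flow_to_mac.setdefault(flowhash, least_loaded)
        self.node_load[mac] += 1
        return mac

    def end_interval(self):
        """Call periodically (say once a second) so rates reflect recent traffic."""
        self.flow_rate.clear()
        for mac in self.node_load:
            self.node_load[mac] = 0
```

Keep-alive handling would layer on top of this: MACs that stop advertising are removed from node_load, which is how the "ILB will notice eventually" behaviour in section 7.13.2 would fall out.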
7.11 Packet Storage Flow

When packets arrive at a tentacle CPU, they are load balanced into one stream per destination MAC address. There is one thread, associated with one (possibly two?) storage disk, listening on each memory hole; these threads run in parallel. Each such system is considered a 'logical node' and (as far as is possible) acts completely independently from the others. The logical node writes flow and time index metadata (to RAM, then to either its disk or fast tentacle-local storage such as an SSD) and stores the packet to its disk in one large rotating file (which may simply mean writing to the raw disk).

7.12 Queries and Query Response

Kraken will support queries on stored data using a Packet Query interface, abstracted through a generic (most likely REST) API for use by Probe. Queries consist of a function with a time range, a filter (such as pcap/tcpdump) and an optional string to search for. For the POC, the functions will most likely be a fixed set, such as packets matching, all packets in flow, and bandwidth. A query result may consist of packets matching the query or aggregate metadata, possibly returned as part of the "query done" mechanism or potentially encapsulated in E3 using the Meta ERF format. The query mechanism should be flexible enough to allow composing functions and adding simple user-defined functions that may involve custom packet offsets or distributed attribution table lookup (such as a NAT translation or GeoIP database), but these may not be implemented for POC.

Each logical node (disk) receives a copy of the same query and processes it simultaneously (operating as independently as possible), returning results in timestamp order, encapsulated in E3 with the MAC address assigned to that logical node. Each disk will have time and flow hash indexes to accelerate this process. Depending on Q2-end POC box merge performance constraints, these results may also be coalesced into timestamp order per physical CPU before they leave the tentacle (in which case they would use the MAC address of one of the 10G links). See Figure 4: Query Return Process for an overview of the proposed architecture. It is expected there will be one software thread instance per 'software process' in that diagram, with minimal inter-thread communication or locking required.

As an end goal, queries similar to the following should be possible in production, and performed in parallel for each logical node (pseudo syntax):

BW(t1, t2, map(username, map(nat1, flows(t1, t2, ip.addr == 1.2.3.* && ip.port == 80, "foo"))))

In HTTP traffic from addresses in 1.2.3.0/24, find each instance of "foo" between t1 and t2 and return the entire flow on a match.
Map the IP addresses of these flows to the internal address assigned to that external address tuple at the time of the packets, and map that to the user assigned that internal address at that time. Finally, calculate the bandwidth of these flows by user and display as a chart. This requires a reasonably flexible, generic mechanism for defining what is examined in a mapping and how it is looked up; mapping tables may be large, distributed and time-varying. At least the inner function will be possible in the POC. Worst case, if a query requires searching entire disks (such as a very broad text search), it may take up to 8 hours for all sub-queries to complete, due to disk read performance. Is this OK?

7.13 Management processes

There are a number of processes that need to be thought through for the Kraken to function effectively. Some of these include:

7.13.1 Startup / Boot

On startup, there are a number of things that need to happen:

- FPGAs on each tentacle need to be loaded. For the POC this will take the form of a CPLD and ROM for each tentacle, as for current DAG cards. Alternative image loading may be explored at a later date (i.e. beyond POC).
- Each tentacle CPU needs to boot and initialise the mini-DAG. The intention is that we use PXE to boot all of the tentacles from a common software image.
- The probe needs to know that the Kraken is available. It needs to know the capabilities of the Kraken, including how many hard drives are available (this number will not be constant, given that some hard drives will fail). This can be advertised via keep-alive packets from the tentacle FPGA.

Adding an additional Kraken to the probe becomes an extension of this: the Kraken boots and advertises its capabilities, leaving the probe to adjust the load balancing to split the outgoing traffic between the available Krakens.

7.13.2 Disk failures

The operational model for Kraken implies that disk failures will occur and need to be handled without user intervention. It is the responsibility of the tentacle CPUs to monitor the status of each of the disks. Once a failure has been detected, the following needs to happen:

- If a spare unused drive exists on this tentacle, the failed drive will be unmapped from its MAC address and the new unused drive will take its place. Packets destined for that disk will automatically be steered to the alternative disk. The tentacle needs to inform the management CPU that the disk has failed.
- If a spare unused drive does not exist, the tentacle CPU will remove that MAC address from the keep-alive table in the FPGA. This will halt the keep-alives for that MAC address.
The ILB algorithm on the probe will notice this eventually and stop steering packets towards this MAC address. Empty query replies for that MAC address may be necessary for some time to prevent lock issues.

In terms of query processing, the tentacle CPU needs to be aware that the packets may have been lost, and be capable of handling that fact. Detecting partial disk failures (i.e. where some data becomes unreadable rather than the entire disk failing) may be a difficult problem. (A sketch of the failure procedure follows after section 7.14.)

7.13.3 Tentacle failure

Tentacle CPUs are required to remain up for months at a time. As such, there is always the possibility that the software will fail at some point. At this stage, we do not have a mechanism planned for how to deal with this (ideally we would have a hardware reset mechanism, such that the management CPU can reset a tentacle).

7.14 Other considerations

Some things that have perhaps not been adequately covered as yet:

- Buffering: we have DDR memory available on both the Q2 and the mini-DAG. At present, we haven't identified definite use-cases for this. We would like it included in the POC, so that if we do require buffering at some point, we can utilise it.
- Packet types: we intend to use E3 as a transport mechanism. This is a standard Ethernet frame with our own Ethertype and subtypes. We will need to carefully define suitable subtypes for the appropriate uses.
- Inter-node communication: through additional interconnect links we should have plenty of east-west bandwidth available (at least within a Kraken box), but how exactly this would work needs to be thought through carefully. It should not be necessary where flow coherency is maintained, but that will not be possible for very large flows. A simple alternative option for string search would be to perform the search after the merge, as the data comes in through the Q2 query receive side.
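The disk-failure procedure in section 7.13.2 is mechanical enough to sketch. A minimal version (Python; the remap/notify interfaces and the keep-alive table shape are assumptions about code that would live on the tentacle CPU):

```python
def handle_disk_failure(failed_mac: str, spares: list,
                        keepalive_table: set, remap, notify_mgmt):
    """Tentacle-CPU reaction to a complete drive failure (sketch).

    remap(mac, spare_disk): assumed helper pointing the node MAC at a spare drive.
    keepalive_table: MAC addresses the FPGA advertises; dropping one halts its
    keep-alives, so the probe's ILB eventually stops steering packets to it.
    """
    if spares:
        spare = spares.pop(0)
        remap(failed_mac, spare)              # same node MAC, new physical disk
        notify_mgmt(f"disk behind {failed_mac} failed; spare mapped in")
    else:
        keepalive_table.discard(failed_mac)   # stop advertising this node
        notify_mgmt(f"disk behind {failed_mac} failed; no spares remain")
    # Either way, query processing must tolerate the packets lost with the disk,
    # and empty query replies may be needed for a while to prevent lock issues.
```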
8 Kraken POC Phased Development Plan

This section lists the technical challenges to be addressed and how we wish to de-risk them.

8.1 Cooling

In the final system we will have a stack of drives in a chassis, and many questions arise as to how we can cool such a system. Actually purchasing drives and SATA controllers and hacking together a test chassis, while accurate, is going to be expensive (~$35k NZD) and time consuming (2 man-weeks). Instead, we would like to hack together a representative chassis: buy a bunch of aluminium wide bar from Ulrich, chop the bar into drive-like pieces, add power resistors to simulate the drive heating, and push them into the chassis in different configurations with different fan formations. We should be able to rapidly test a large number of configurations for less cost (~$7k NZD) and in less time (1 man-week). This has been agreed to (30-5-2013) and the work has been started.

8.2 Chassis

The eventual Kraken product will have a proper "bent-tin" chassis. However, this will take a long time to become available, so we intend to build two initial prototype chassis in a simple, quick manner. We expect the two prototype chassis to cost ~$1.5k NZD each. This cost was included in the cooling estimate above. This has been agreed to (30-5-2013) and the work started.
8.3 Initial Development Platform (IDP)

8.3.1 IDP Introduction

The final POC hardware platform will not be ready for several months. In the meantime we need to start developing and de-risking software and firmware. We propose to build the following development platform to let us study/develop packet capture, load balancing, MAC steering, disk write with indexing, query broadcast, query response and aggregation, and query flow control. This has been agreed to (6-6-2013) and the work started.

Figure 6: Initial POC Development Platform

Table 5: Expected cost of Initial POC Development Platform

Item                        Qty   Cost Each   Ext. Cost   Notes
dag92x2                       2   $ -         $ -         Stolen from lab
10sx4                         3   $ -         $ -         Stolen from firmware team
8000-like test probe          1   $ -         $ -         We can take one of Daniel's castoffs
40Ge/10Ge Switch (Arista)     1   $ 3,000     $ 3,000     Perhaps free? (May just steal from Systems)
Haswell Motherboard           2   $ 332       $ 664
CPU                           2   $ 477       $ 955
16-port SATA card             2   $ 1,432     $ 2,864
Disks                        40   $ 157       $ 6,280
Small 1Ge Switch              1   $ 200       $ 200
Power supplies                2   $ 100       $ 200
Total:                                        $ 14,163
8.3.2 IDP Plan

First plans: what we can do to move towards getting this thing working:

- Packet source: a 3000 box transmitting out of one port.
- Packet capture probe lookalike (in a 3000 box): we use a 92x to receive packets into two memory streams (HLB). Software does the multi-E3 encapsulation (is there some software to do that already?), using a random MAC address from a given set of 20, and we encapsulate in ERF again. We transmit the resulting packets out of two 92x cards (which strip the outer ERF off). A sketch of this "splat" step follows below.
  o Work required: software encapsulation / ILB algorithm (splat for the first pass). Standard firmware for now.
  o Questions to answer: can we do encapsulation/ILB in software? What rate is achievable? Software can achieve 10Gb+ per thread when selecting a random address.
- Tentacle lookalike (Haswell MB, LSI card): the 92x receives the E3 (we have two tentacles receiving half the traffic each). DO NOT de-encapsulate. Once through BFS, the outer layer of ERF will have a non-IP extension header. We replace the hash in this with the dst MAC address. We can then steer based on this to one of 20 streams. Software receives in 20 memory holes, each of which writes to disk via the LSI RAID card thingie.
  o Work required: firmware needs to implement the steering based on MAC. DONE.
  o Software needs to capture and write to disk. Initially this could just be a dagsnap, although getting the next-gen CPD (i.e. indexing etc.) working would be the next step.
  o Questions to answer: can we write to disk at the desired rate? How much headroom do we have? Does the CPU have issues with 20 threads accessing disks at once? 20 threads writing to disk can apparently each sustain roughly 50MB/s write performance.
- Check how pause frames will work... does the switch support them? What granularity do we have? How long will it take us to resume transmit? The switch almost certainly does support them – we need to test how well this works though. So we need some way of flooding from multiple tentacles into the switch and checking that we can saturate a given output link. Simple to test once we have a switch and some firmware which respects pause...
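A minimal sketch of the software splat step referenced above: pick a uniformly random MAC from the set of 20 and wrap each ERF record in an outer E3 Ethernet frame. The MAC pool, Ethertype and subtype values are assumptions carried over from the hypothetical framing sketch in section 7.14, not the real datapath code.

```python
import random
import struct

# Illustrative splat sketch. The MAC pool, Ethertype and subtype are
# assumptions (see the E3 framing sketch in section 7.14).
E3_ETHERTYPE = 0x88B5
SUBTYPE_PACKET_STORE = 0x01
TENTACLE_MACS = [bytes([0x02, 0, 0, 0, 0, i]) for i in range(20)]
SRC_MAC = bytes(6)

def splat(erf_records):
    """First-pass ILB: wrap each ERF record in an outer E3 frame addressed
    to a uniformly random tentacle MAC ("splat")."""
    for record in erf_records:
        dst = random.choice(TENTACLE_MACS)
        hdr = struct.pack("!6s6sHB", dst, SRC_MAC, E3_ETHERTYPE,
                          SUBTYPE_PACKET_STORE)
        yield hdr + record
```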
- Add firmware support for pause frames: on receive, we need to extract pause frames from the incoming packets and pass them to the transmit side, where we initialise a timer and push pressure back to TERF until the timer expires. DONE, but not tested at all!
- Define a query interface: understand how the REST API works and which bits we need.
- Q2 datapath, query return path:
  o We need to implement a multi-HSBM 40G rx datapath.
  o We also need to be capable of steering based on query ID, which is embedded in the E3 format.
  o We also want to de-encapsulate the packets (maybe?).
- Q2 datapath, packet capture side:
  o Two or four BM read stream modules, as Gerard has apparently demonstrated (two streams).
  o Multi-stream TERF: 4 streams, best-effort merging into a 40G core. We want this to be a 40G core rather than mapping streams to 4x10G because it gives more flexibility if one of the streams is larger than the others, without having to switch.
- Flow control options:
  o Packet storage: we have 400 receive streams within the Kraken. Each of these should be receiving packets from 1 of N processing threads (where N is somewhere around 6, perhaps – however many are needed to implement the ILB algorithm). Each receive stream can send stream status to the processing thread via statistics packets – perhaps if the buffer level hits half full we send something, otherwise we just send once a second. These are generated by the FPGA monitoring the pointers, not by software. The ILB processing thread on the probe can then use this information to adjust the load balancing distribution. Is this enough to handle hotspots / large flows? How long will it take to adjust the load balancing, and is that enough to divert traffic and avoid drop? (A probe-side sketch of this follows the list.)
  o Query return: we have one receive stream per query on the Q2. The tentacles respond at whatever rate they're capable of. If the switch is overloaded (which we would expect it to be if all tentacles are reading from all drives), we use 802.3x pause to stop the tentacles. At this point we're saturating the link, so head-of-line blocking is not a concern (provided queries aren't prioritised). Once at the Q2, we are limited by the HSBM internal bandwidth of 32Gb/s. This implies that we would need to push pause back to the switch on the 40G port. Provided we are using the full 32Gb/s bandwidth to the memory hole(s), again, this is OK. As a memory hole fills, we need to send stream status updates to each of the tentacles, so they are aware of whether to throttle back a particular query. This does imply that we require a separate transmit stream for each query on each of the tentacles. (A tentacle-side sketch follows the figure below.)
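A minimal sketch of the probe-side, status-driven rebalancing idea, under stated assumptions: the status packet carries a buffer-fill fraction per stream, and the ILB thread simply down-weights streams that report back pressure. The real ILB algorithm is not defined in this document; this only illustrates the feedback loop.

```python
import random

# Hypothetical ILB feedback sketch: weight each receive stream by how empty
# its buffer is, so fuller streams receive proportionally less new traffic.
# The status format and the weighting rule are assumptions for illustration.

class IlbBalancer:
    def __init__(self, stream_ids):
        self.weights = {s: 1.0 for s in stream_ids}

    def on_stream_status(self, stream_id, fill_fraction):
        # fill_fraction: 0.0 = empty, 1.0 = full, as reported by the FPGA
        # (at half-full, or once a second otherwise).
        self.weights[stream_id] = max(0.0, 1.0 - fill_fraction)

    def pick_stream(self):
        streams = list(self.weights)
        w = [self.weights[s] + 1e-6 for s in streams]   # avoid all-zero weights
        return random.choices(streams, weights=w, k=1)[0]
```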
[Figure: probe and Kraken data paths for packet storage and query operation – probe-side ILB / receive processors (R = receive processors / ILB, 6?) and query handlers (N = query handlers, 32? 32-R?), the switch, the Q2, and tentacles A and B each with a DAG and 20 SATA disks; the legend distinguishes receive streams, transmit streams, packet flow, and flow control / status flow, including pause on the 40G link (query throttle) and the 32G datapath throttle.]
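On the tentacle side of the same loop, a small sketch of the per-query throttle under stated assumptions: each query has its own transmit stream, and a stream status update from the Q2 gates whether that stream may send. Names and the status format are illustrative only.

```python
# Hypothetical per-query throttle on a tentacle: one transmit stream per
# query, gated by stream-status updates from the Q2 memory holes.

class QueryTxStreams:
    def __init__(self):
        self.paused = {}    # query_id -> True if the Q2 asked us to hold off
        self.pending = {}   # query_id -> reply frames not yet sent

    def on_stream_status(self, query_id, fill_fraction, threshold=0.5):
        # Throttle a query's stream when its Q2 memory hole is filling.
        self.paused[query_id] = fill_fraction >= threshold

    def enqueue_reply(self, query_id, frame):
        self.pending.setdefault(query_id, []).append(frame)

    def transmit_round(self, send):
        # Send one frame per unpaused query stream per round (simple fairness).
        for qid, frames in self.pending.items():
            if frames and not self.paused.get(qid, False):
                send(qid, frames.pop(0))
```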
Firmware jobs:
  o 10G pause end-to-end – i.e. we transmit pause.
  o 40G pause support?
  o Stream status automatic transmission.
  o Multiple TX streams (up to 32).

8.4 Ethernet Switch Options

A fundamental component of this design is the Ethernet switch at the entrance to the Kraken box. The functionality that we need in this switch is fairly simple (L2 learning and non-blocking forwarding) but the performance of the switch will directly impact the Kraken POC performance.

Our current first choice for the POC is to buy an Arista 1U switch. Another choice for this switch is the Intel Fulcrum FM6764. We are making contact with Intel and ADI Engineering (the maker of the reference design). However, as this is the hottest switch chip from Intel at present, we may have trouble getting samples and reference designs. A much more difficult option would be to use an FPGA, but this would require developing our own 10Ge switching fabric, as none is available as IP at present. Given the non-blocking bandwidth required by the design (at least 450 Gbps in the POC), such a switch would require multiple FPGAs and would be a serious design problem.
8.5 Final POC Platform

In the final Proof-Of-Concept (POC) demonstration we would like to have a full probe with Vision and the Meta-Database available. Whether the probe will be running EP5.1.1 or EP5.2 or something entirely new is unknown at this point. However, the probe will allow a view into flows and thereby allow the user to specify more interesting queries. The diagram below shows how the POC is intended to be connected.

Figure 7: Final POC Test Setup

Note that the packets are duplicated to both the standard probe box and the test probe box (running CentOS). This is done to simplify integration of Kraken-related software running on the test probe. It may turn out to be easier to integrate the Kraken-related software directly into the standard probe, in which case only one probe will be required. This latter case has the added advantage that the necessity of carefully time-syncing the two probes is removed.

It is intended that this architecture will demonstrate all requirements listed in the chapter "POC Detailed Design Goals To Meet The Key Technical Challenges".
9 Appendix A: Kraken Rough Cost Breakdown – note: now out of date

SUCKERS
  True Capacity             380 TB
  Advertised Capacity       350 TB
  Total # of drives         380
  Spare drives              30
  Drive cost                $ 77
  CPU cost                  $ 70
  Sucker PCB ?              $ 50
  Other sucker ?            $ 10
  Total Nodes:              $ 47,642

TENTACLES
  # of tentacles            10
  Tentacle PCB              $ 100
  Tentacle components       $ 500
  Total Tentacles:          $ 6,000

MANTLE
  # of Mantles              1
  Mantle PCB                $ 1,500
  Mantle SBC                $ 300
  Mantle FPGA               $ 1,200
  Mantle DDR3               $ 200
  Mantle other components   $ 800
  Total Mantles:            $ 4,000

POWER SUPPLIES
  # of Power Supplies       3
  Power Supply Cost         $ 350
  Cable cost                $ 20
  Total Power Supply:       $ 1,110

COOLING
  Fans                      $ 400
  Boards                    $ 300
  Components                $ 100
  Total Cooling:            $ 800

CHASSIS
  Chassis Cost              $ 1,200
  Optics                    $ 600
  Total Chassis Cost:       $ 1,800

Total Kraken Components:    $ 61,352
Production Cost:            $ 2,000
Production Test:            $ 1,000
Total Product Cost:         $ 64,352
Gross Margin:               70%
MSRP:                       $ 214,507
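The MSRP line appears to follow the usual cost-over-(1 - margin) convention; a quick check, assuming that is the intended formula:

```python
# Sanity check of the MSRP figure, assuming the usual convention that a 70%
# gross margin means cost is 30% of the selling price.
total_product_cost = 64_352
gross_margin = 0.70

msrp = total_product_cost / (1 - gross_margin)
print(round(msrp))   # 214507, matching the table
```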
10 Appendix B: Ethernet Rings

If we build a double-ring architecture, that would seem to guarantee continuous connection between a Probe and its Krakens and constant high bandwidth. Clearly we must avoid a true Ethernet ring, as that would lead to a storm of duplicated packets. So each ring must be broken intentionally at some point, but fixed rapidly when another part of the ring fails. This is identical to how STP (Spanning Tree Protocol) works. Keeping a standard Ethernet architecture would allow extending the Kraken ring using existing switches. However, the existing solution of STP can require 50 seconds to "heal" a modified network – that results in far too many lost packets for us. Unfortunately, while there are a large number of potential solutions out there, none are truly "standard" as yet.

RSTP apparently has a convergence time of around 2-5ms per hop (towards the upper end of that, usually), worst case around 100ms (root bridge failure). It is well standardised in 802.1D-2004, which also obsoletes STP. 802.1D-2004 apparently made some optimisations to the original version of RSTP to better support low convergence times, and I'm not sure how well supported that area is (I believe it used to be in the region of a second). RSTP wasn't really designed with rings in mind, so there is some unnecessary traffic that goes on, and half the ring is momentarily isolated. Essentially the problem is that when a break occurs the 'downstream' switch/box thinks it is the root and advertises this down; it isn't until the message reaches the real root (through the blocked port) that the correct information propagates back up. This means the backup ports with fast proposal-agreement handshaking don't work too well in a ring. A good description of what happens in a ring is from page 20 of http://blog.ine.com/wp-content/uploads/2010/04/understanding-stp-rstp-convergence.pdf. There is a paper on the performance at http://www.odva.org/Portals/0/Library/CIPConf_AGM2009/2009_CIP_Networks_Conference_Technical_Track_RSTP.pdf, as well as a calculation mechanism in IEC 62439-1, but I can't access that (also see link at bottom).

ERPS (ITU-T Rec. G.8032) seems like quite a nice protocol, though it is fairly new, so we might need our own implementation (which I'm guessing we'd probably do anyway, for anything except RSTP). It seems to be supported by a number of common vendors (Cisco, Juniper etc., though I can't find any Arista support), though presumably only in rather new switches and/or software releases, and I am not familiar with which product ranges. Essentially it works like the diagram in our document: one port is chosen to be blocked, known as the Ring Protection Link (RPL), owned by the RPL Owner (also known as the master bridge). If a node sees a link go down, it immediately blocks the port and sends a message (the standard says 'as quickly as possible', within 3.3ms) to the RPL owner, who unblocks the port and sends a message around the ring to flush MACs.
When the link comes back up, those nodes send a message to the RPL owner, which blocks its RPL port and sends a flush message, after which the affected nodes put the link that just came up into the forwarding state. v2 of the standard optimises the flush messages in some way, and adds administration commands for specifically bringing down a link, as well as support for multiple rings. There is also a polling hello message sent by the RPL owner as a fallback, just in case. The 50ms recovery time in the standard is for a fibre ring of <1200km and <16 nodes, so we can probably do better than that. There are a couple of potential issues, though: there is not yet a mechanism for electing the RPL owner, nor for detecting erroneous multiple RPL owners. Because of this, I'm not sure what would happen if the RPL owner failed. Separate data (RPL blocked) and control (RPL open) VLANs are also needed (there can be more than one of these pairs, and there can be multiple per domain).

802.17 is probably a dead end: it is a different MAC layer (i.e. not at all a normal Ethernet packet, though it does interoperate by using the same addresses and including the same fields somewhere), is quite complicated, appears not to have really been implemented by anyone much, and seems to essentially be being replaced by G.8032 in the minds of carriers. It has some nice features that are probably unnecessary, like class of service.

Then there is PRP and HSR, defined by IEC 62439-3. They were originally intended for industrial control (the standards relate to use in power substations) and have almost zero recovery time, as they send duplicates of packets at both ends and de-dupe using a counter attached to each packet. PRP seems to need a redundant link (not entirely clear why, as the standard isn't freely available, but I believe it is due to counters only being per network), and appends a trailer just before the Ethernet FCS. HSR is designed for use in a ring (send packets down both ends) and has a counter for each source(-destination pair?). HSR puts its header after the Ethernet header (possibly as an Ethertype?) so that, unlike PRP, the entire frame does not need to be read to decide if the packet is a duplicate or not (which could take a long time with jumbo frames). This would affect compatibility, though, as it no longer appears as padding. (A sketch of the duplicate-discard idea follows below.)

The same standard family also defines MRP (IEC 62439-2), a simple ring protection protocol (one ring only) that has one device send a control packet in both directions and check that it receives them. That standard guarantees <10ms for up to 14 nodes (depending on the size of the loop), but I don't see any advantage over G.8032, which is much more widely supported, other than it being slightly simpler.
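A minimal sketch of the PRP/HSR-style duplicate discard referenced above: tag each frame with (source, sequence number) and drop the second copy. This only illustrates the mechanism; the field layouts and drop windows in the real IEC 62439-3 standards differ.

```python
# Illustrative duplicate-discard for a PRP/HSR-style scheme: frames are sent
# down both sides of the ring, tagged (source MAC, sequence number), and the
# receiver drops whichever copy arrives second. Real drop-window handling is
# more involved than this sketch.

class DuplicateDiscard:
    def __init__(self, window=1024):
        self.window = window
        self.seen = {}   # source MAC -> recently seen sequence numbers

    def accept(self, src_mac, seq):
        """Return True for the first copy of a frame, False for duplicates."""
        seqs = self.seen.setdefault(src_mac, set())
        if seq in seqs:
            return False
        seqs.add(seq)
        if len(seqs) > self.window:      # crude bound on state per source
            seqs.discard(min(seqs))
        return True
```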
All three of these only seem to be supported by more specialist switch vendors.

Personally I rather like G.8032, due to its simplicity and its closeness to what we would probably end up doing anyway, though I worry a little about what happens if someone takes out the RPL owner. RSTP should be fine as a management VLAN or similar, and might be OK for packets, depending on how many packets we want to lose, if we want the simplest solution. HSR might also be worth considering if we want zero failover time, but we would almost certainly need to implement it ourselves, and the standard isn't freely available. I imagine that with all these protocols, since we control the hardware, we could also have extremely short hello times. If we were to have our own protocol (such as simple very fast polling) we would need to be careful not to have a momentary loop when the link comes back up (as avoided by G.8032). We may also like to keep in mind that a Kraken may not be in the same rack – how might that work with a ring?

In summary, the ring architecture in Ethernet is problematic – doable, but it adds work – and we need to feed in some actual customer requirements here.
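To make the G.8032-style failover mechanics concrete, here is a minimal simulation sketch under stated assumptions: one RPL owner keeps its RPL port blocked until a signal-fail message arrives, then unblocks it and flushes. R-APS message formats, timers and the revertive behaviour of the real standard are not modelled.

```python
# Minimal G.8032-flavoured ring failover simulation. Only the core idea is
# modelled: the RPL stays blocked until some other link fails, at which point
# the RPL owner unblocks it and MAC tables are flushed.

class RingNode:
    def __init__(self, name, is_rpl_owner=False):
        self.name = name
        self.is_rpl_owner = is_rpl_owner
        self.rpl_blocked = is_rpl_owner   # owner keeps the RPL blocked when idle

    def on_local_link_failure(self, ring):
        # Block the failed port immediately and tell the RPL owner.
        print(f"{self.name}: link down, blocking port, sending signal-fail")
        ring.broadcast_signal_fail()

    def on_signal_fail(self):
        if self.is_rpl_owner and self.rpl_blocked:
            self.rpl_blocked = False
            print(f"{self.name}: unblocking RPL, flushing MAC tables")

class Ring:
    def __init__(self, nodes):
        self.nodes = nodes

    def broadcast_signal_fail(self):
        for n in self.nodes:
            n.on_signal_fail()

# Example: a four-node ring where node A owns the RPL and the link at C fails.
nodes = [RingNode("A", is_rpl_owner=True)] + [RingNode(x) for x in "BCD"]
ring = Ring(nodes)
nodes[2].on_local_link_failure(ring)
```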
11 Open Questions

11.1 Query Size

Can we break big queries into multiple 1GB queries? Can we assume we know the amount of data that will be returned? Answer from Stuart was yes.

11.2 Packet Sorting

Do the packets in the returned query need to be sorted (in timestamp order) before reaching the memory hole on the probe, or can we use CPU on the probe to perform the sorting? (A sketch of a probe-side merge follows below.)

11.3 Arista Switch

Our development test platform will need a 10Ge-capable switch. Will we be able to get such a beast from Arista?

11.4 Text Search

Is text string search required for the POC? Does it need to be distributed? If so, does it need to handle non-flow-coherent (i.e. large) flows that require east-west communication?
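If the sorting is done on the probe CPU (the second option in 11.2), it amounts to a k-way merge of per-tentacle reply streams that are each already in timestamp order. A minimal sketch, assuming each stream yields (timestamp, record) pairs – the stream interface is a hypothetical stand-in for the real query-return format:

```python
import heapq

def merge_by_timestamp(streams):
    """k-way merge of per-tentacle reply streams, each assumed to already be
    sorted by timestamp, into one globally time-ordered stream. The
    (timestamp, record) tuple interface is illustrative only."""
    yield from heapq.merge(*streams, key=lambda rec: rec[0])
```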