An Integrated Approach from Model Introspection to Architecture Transformation for CSPs
As organizations embrace digitization to meet the requirements set by the Industry 4.0 evolution, every business, small or big, wants to achieve the same targets: to be nimble, efficient, open, scalable, and high performance (HPC). Since the word Cloud sounds like a one-stop solution to all of these problems, every organization wants to evolve to the cloud whether we like it or not, although the targets vary by enterprise. For a bank it can be centralized control and operations agility; for a CSP it can be a Vision 2020 to compete with OTTs and nimble players. This makes the situation more complex for vendors like Huawei, whose future is driven by customer requirements and who must update their product, service, and solution offerings to meet them.
Believe it or not, every business and every segment is evolving this way, and Cloud is central to it. As an industry reference we can look at Red Hat's official annual reports of recent years, the blogs from James Whitehurst, and the accounts of the rise of Cloud companies in this segment; Mirantis is not far from the same. On the services side we can still see big turmoil: for your information, AWS in 2016-2017 in the Australian market captured 75% of revenue from niche consultancy services (internet source). One can safely say that Cloud companies are likewise in a state of turmoil: the Cloud company wants to wear a Telco vendor hat, and the Telco vendors want to wear the hat of IT companies. Everybody wants to give one message to the customer's CXOs and strategists, that they know better than the others, while clearly missing one notion: how best to create value for customers?
I think answering this without prejudice and with balance is as important as developing a solution, because if the evolution partner is promoting its own solution there is no way it can support the organization in reaching its target architecture and meeting the minimum criteria of the enterprise continuum, as we used to do when building the future Network2020 for a CSP.
So in this paper I want to bring out some dimensions of how to make a robust design for a Telco Cloud and how to integrate it smoothly into the live environment. I will cover the domains of NFV, Telco Cloud, SDN, their integration, and Agile delivery based on a Telco DevOps approach. This has been my focus area for many years, and I want to share my own ideas as a network architect with other network architects. The purpose is also to indirectly propose an approach that any open system integrator can use to execute such a transformation in the most open and transparent manner.
So first, as architects we must list the different dimensions and how an IT Cloud differs from a Telco Cloud, to benchmark a transparent index for selection, evaluation, and optimization.
1. Solution Selection
How do you select the best products for your network? I think there is no single answer; the clearest answer is driven by requirements, not only technical ones (SoC, proposals, KPIs) but, most importantly, commercial ones and the commitment of partners to strive to meet customer requirements and deliver smooth migrations. Not only delivery is important; the support service SLA is equally important. We all know that the traditional RPO/RTO model we used to see in IT services cannot handle all CSP requirements. Believe it or not, the IT companies' reliance on so many partners and the total separation of hardware selection from application services is the biggest concern for a CSP, and as a matter of principle the partner who can combine the different layers will deliver the maximum advantage for the enterprise.
2. Maturity of Solution
In the legacy world all vendors used to develop in one direction: lock the standard first, then do a pilot project for solution hardening, then go for mass rollout. The only issue is that in the NFV world, with standards being refreshed every 6 months, what matters is not the technology but the speed with which we harness it. This means the main factor must be who will fence off a bug, when it arrives with such fast-changing standards, and block it from the CSP environment. Partner capability in ecosystem development, Open Labs, EANTC labs (my favorite initiative), and the Plugtests (now at the 4th one, in which we are participating directly and evolving through it) will define the value chain, ensuring that openness does not mean non-standard and solving the one main issue with innovation in all those years!
3. Use Case development
Rome was not built in a day, as the adage goes, and the same holds in the open world: the solution must be use case based, and Cloud is no exception. I think for CSPs, especially Tier 1, the key use cases of an NFV Telco Cloud are as follows:
- Lock the interface specifications, especially Ve-Vnfm (VNF/EM)
- Standardize onboarding APIs and image parameters
- Simulate a value chain end to end through NFVO/Tacker, as done in the OSM community, for example vCPE, vIMS, vBNG
I think in the IT Cloud too much of the work talks to the code, which as a matter of principle cannot address how the Cloud must integrate with the Telco environment.
So as we see it, our targets in the Cloud journey over the next 1-2 years must be:
Target 1: any new VNF or application coming from any vendor, even in-house, can be onboarded on any Cloud in 1 month maximum
Target 2: any copy, or clone as we used to say, must not take more than 2 days: 1 day for onboarding, followed by another day for tuning and validation through Tempest, Rally, cloud performance, cloud availability, etc., to mention a few
In the re-architected world the capability continuum must come to something like 60-90 minutes for any application, but this target needs standardization of interfaces and obviously needs the re-architecting process to be completed first.
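As a minimal sketch, the targets above can be written down as data and checked against measured durations. The names `TARGETS`, `clone_duration`, and `meets_target` are hypothetical, and "1 month" is assumed here to mean 30 days.

```python
from datetime import timedelta

# Targets taken from the text above; "1 month" is ASSUMED to mean 30 days.
TARGETS = {
    "new_vnf_onboarding": timedelta(days=30),   # Target 1
    "clone": timedelta(days=2),                 # Target 2
    "rearchitected": timedelta(minutes=90),     # upper end of the 60-90 min goal
}


def clone_duration(onboarding: timedelta, validation: timedelta) -> timedelta:
    """A clone is onboarding followed by tuning and validation."""
    return onboarding + validation


def meets_target(kind: str, measured: timedelta) -> bool:
    """Return True when the measured duration is within the target."""
    return measured <= TARGETS[kind]


if __name__ == "__main__":
    measured = clone_duration(timedelta(days=1), timedelta(days=1))
    print("clone within target:", meets_target("clone", measured))
```

Keeping the targets as plain data makes it easy to tighten them as the re-architecting work progresses.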
4. Represent a Telco Device as Code
I think for CSPs, especially Tier 1, the key is to know the steps of modeling a Telco PNF as code. It seems simple, but the involvement of so many communities and standards bodies makes it complex. Refer to point 6 for more on this part.
5. How to Develop a Networking Solution for a Telco Cloud
For an IT Cloud this seems like a standard, simple exercise: design the underlay and overlay the traffic using BGP/MPLS in an agile way. Contrail, which is a market leader, gives a good overview of this in the Contrail community updates. However, as we move to the Telco Cloud things no longer remain the same, and the whole story must start from the POD, or what my transport friends call fabric design. Sorry for mixing POD with fabric, but the fact is that this is an important idea in a Telco Cloud. For a Telco Cloud based on OpenStack, a POD should be self-contained, with APIs (controllers), compute nodes, and storage, as well as any network nodes including the infrastructure. The whole idea is that east/west and north/south traffic is segmented and controlled in such a way that E-W traffic does not leave the POD. It also helps if PODs are set up as availability zones; this way workloads that are east/west intensive can be assigned to PODs where their east/west data never has to leave the POD. POD sizes should be small and there should be a lot of horizontal scaling. A hefty, good paper from Big Switch explains this part: http://go.bigswitch.com/rs/974-WXR-561/images/Core-and-Pod%20Overview.pdf
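The idea of keeping east/west traffic inside a POD can be sketched as a placement rule: all members of an E-W intensive group land in the same POD, or the group is not placed at all. This is only an illustration under simple assumptions (capacity counted in free vCPUs, first-fit by POD name); `place_group` and the POD names are hypothetical and this is not an OpenStack scheduler API.

```python
from typing import Dict, List, Optional


def place_group(pods: Dict[str, int], group_vcpus: List[int]) -> Optional[str]:
    """Place a whole east/west group in one POD (first fit by POD name).

    pods maps a POD name to its free vCPUs and is updated in place.
    Returns the chosen POD name, or None if no single POD can hold the
    group, in which case nothing is placed so E-W traffic never spans PODs.
    """
    needed = sum(group_vcpus)
    for name in sorted(pods):
        if pods[name] >= needed:
            pods[name] -= needed
            return name
    return None


if __name__ == "__main__":
    pods = {"pod-a": 16, "pod-b": 64}
    # Three chatty VNF components of 8 vCPUs each must stay together.
    print(place_group(pods, [8, 8, 8]), pods)
```

In a real deployment the same intent would typically be expressed through availability zones, as the text suggests, rather than custom code.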
Moving up, the next things that come are the bridges br-int, br-ex, and br-tun; the bridges are deployed by the Neutron ML2 driver as a plug-in for OVS.
To keep the design simple, the management zone entities such as Mirantis MCP, the Red Hat overcloud, and Ceph can use the standard ML2 bridges with OVS, which is also the default configuration.
For the VNFs, the mature VNF vendors propose to the industry that the bridge not be used, because most HPC VNFs will use provider networks and not tenant networks for their workloads. Since this requires the VNF to access the fabric network directly, there is no need to present the vNIC to the bridge; this is why in the OVS-DPDK scenario there is no need to deploy a bridge on the computes, and for SR-IOV it is simpler still. Also, per community recommendations, it is suggested not to deploy standard OVS together with OVS+DPDK on the same host, because the kernel OVS bridge based on the ML2 plug-in will make br-int a single transfer point and can impede the performance of a Telco Cloud. From OpenStack Pike onward the community can work out a standard configuration and architecture design for the whole Cloud. For now this issue is left to the community.
In a later blog I will talk about how to optimally plan your NFV DC for any VNF and any Cloud, deciding on optimum flavors, OVS, and bridge modes. You may come to architect a solution where, for customer 1, deploying every VNF with SR-IOV saves more NIC/CPU resources, while in scenario 2 the deployment should follow the VNF's own principle, such as OVS/EVS for IMS and SR-IOV for BNG. This is a detailed topic in itself; the author believes it needs to be covered in detail, and it will be discussed separately.
For networking, BGP will be the key, with SDN as well as without it. One of the most important points for the architect is that SDN, and especially Contrail https://www.juniper.net/us/en/products-services/sdn/contrail/contrail-cloud/ , relies on BGP to integrate both the future Cloud and the non-Cloud parts of the network; OpenStack considers BGPaaS, so the DC design should account for this part.
6. Telco DevOps is not IT DevOps
In 2011 a webinar introduced me to the strange term DevOps. Over the years I have come to the conclusion that NetOps is NE focused while DevOps is IaC (Infrastructure as Code) focused, meaning IT only wants code abstraction and automation, so the two cannot be the same. In my recent talks with leading community architects and mentors I have come to the following conclusions:
a. Linux scripting is the key: you cannot do DevOps unless the CSP frees its Ops from Infra Ops and goes the scripting way.
b. The Cloud must support multi-cloud. I mean Telco DevOps must watch for the future deployment of OpenStack + containers; using open source tools like Spinnaker and Rally will be a nice way to go.
c. Link to the community. The crazy Mirantis MCP is a great example of this: using Salt formulas, meta model standardization, and artifacts pulled through Gerrit and Jenkins is a nice, crazy way to automate your Telco Cloud.
d. All of these are IT terms, so how about Telco? Telco DevOps must consider automation of TOSCA, HOT, YANG, and YAML scripts and, obviously, their abstraction. You can follow my little crazy profile on Jenkins and my Code Club talk about this. But code is useless and logic is the key, so how to successfully model your Telco NEs in software and scripts is the real challenge. Second comes how Telco DevOps will join hands with the IT Cloud to automate the whole ecosystem through a single point of access, which is the orchestrator.
e. Finally, to guide the transformation you need to move from PMI-PMP to PMI-ACP, as I did a year ago, with teams, processes, and tools that support an Agile process. The target is that a new requirement can be met in 2 weeks maximum from idea to offering. However, it seems like a long journey to reach it.
7. Automated testing
For an IT Cloud it is a good idea to automate everything, but for a Telco Cloud the nice idea is to start with test automation. If an NFV project can automate 50% of its tests and, on top of that, validate the whole environment automatically after any infrastructure change, it will be ideal for building a unified Telco Cloud. A nice benchmark to start the automation journey in the hill-climb phase would be:
- New DC Clone (75min)
- Cloud verification (65min)
- VNF deploy (55min)
- FR/NFR test (20min)
- Continuous monitoring using Aodh, Gnocchi, ELK, and Nagios, which is a nice starting point for the Telco Cloud journey
Finally, the Cloud transformation partner must have a solution continuum to support such a service, for example 50+ scenarios and 1000+ cases from which a project customer can select and optimize.
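The benchmark stages listed in this section add up to 215 minutes end to end. A small sketch like the one below can track a pipeline run against those budgets; the stage keys and the `overruns` helper are hypothetical names chosen for illustration.

```python
from typing import Dict

# Per-stage budgets in minutes, taken from the benchmark list above.
BENCHMARK_MIN: Dict[str, int] = {
    "dc_clone": 75,
    "cloud_verification": 65,
    "vnf_deploy": 55,
    "fr_nfr_test": 20,
}


def total_budget() -> int:
    """End-to-end budget in minutes for one automated run."""
    return sum(BENCHMARK_MIN.values())


def overruns(measured: Dict[str, int]) -> Dict[str, int]:
    """Return minutes over budget for each stage that exceeded its benchmark."""
    return {
        stage: measured[stage] - budget
        for stage, budget in BENCHMARK_MIN.items()
        if measured.get(stage, 0) > budget
    }


if __name__ == "__main__":
    print("total budget (min):", total_budget())
    print("overruns:", overruns({"dc_clone": 80, "vnf_deploy": 50}))
```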
8. Multi-Site Cloud Design
What makes the Telco Cloud different from the IT Cloud is the way the service is orchestrated and healed, for example across DCs, in backups, and in network scaling, to mention a few. In Telco Cloud design, Tricircle and its shared network design are key to planning a Telco Cloud; Trio2o, a newer community project which covers Nova and Cinder extension across sites, is another nice idea. One important concept is SAN or Ceph, as it is the data container for all instances, and for solution selection any SDS will work as long as it can handle both structured and unstructured data.
9. SLA Multi-Site Cloud Design
It is very common to describe the Telco Cloud as 5 9's versus the IT Cloud as 3 9's, but how this is realized is the key to success. The guidelines below are a nice place to start:
a. In OpenStack, use HAProxy, memcached, and Redis to offer HA
b. VNF blueprints that support HA by domain and by service
c. Service pool or load-sharing design to control HA per Telco service
d. VM and aggregate design
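To make the 5 9's versus 3 9's comparison concrete, the sketch below converts an availability figure into allowed downtime per year, and shows how a pool of redundant instances raises availability. It assumes a 365-day year and, for the pool, independent failures; both are simplifying assumptions, and the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600, leap years ignored


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines (e.g. 5 -> 99.999%)."""
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


def pool_availability(instance_availability: float, instances: int) -> float:
    """Availability of a pool that is up while at least one instance is up.

    ASSUMES instance failures are independent, which real deployments
    only approximate.
    """
    return 1 - (1 - instance_availability) ** instances


if __name__ == "__main__":
    print("3 nines:", downtime_minutes_per_year(3), "min/year")  # about 8.76 hours
    print("5 nines:", downtime_minutes_per_year(5), "min/year")  # about 5.3 minutes
    print("two 99.9% instances:", pool_availability(0.999, 2))
```

The second function is the arithmetic behind the service pool guideline: two 3 9's instances in a pool can, under the independence assumption, reach roughly 6 9's.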
10. HPC Cloud
HPC, or High Performance Clouds, are the way NFV should be built. The NUMA design is the key on the NFVI: whether NUMA boundaries will be crossed, how to match the CPU cores adequately, and PCI pass-through support, for example VNF1 using DPDK and VNF2 using SR-IOV. Similarly, dynamic huge pages design is a key. Although planning uniform huge pages looks like a good idea, in our experience it results in a huge compromise in optimum resource selection, which an architect must take into account.
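The question of whether NUMA will be crossed can be sketched as a simple fit check: a VNF flavor fits without crossing only if one node has both enough free cores and enough free huge pages. The classes and field names below are hypothetical, and the sketch assumes 1 GiB huge pages; it is an illustration, not how the Nova scheduler is implemented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NumaNode:
    free_cpus: int
    free_hugepages_1g: int  # ASSUMED 1 GiB huge pages


@dataclass
class Flavor:
    vcpus: int
    mem_gib: int  # backed entirely by 1 GiB huge pages in this sketch


def fits_single_node(flavor: Flavor, nodes: List[NumaNode]) -> Optional[int]:
    """Return the index of the first NUMA node that can host the flavor
    without crossing NUMA boundaries, or None if it would have to span nodes."""
    for index, node in enumerate(nodes):
        if node.free_cpus >= flavor.vcpus and node.free_hugepages_1g >= flavor.mem_gib:
            return index
    return None


if __name__ == "__main__":
    host = [NumaNode(free_cpus=4, free_hugepages_1g=8),
            NumaNode(free_cpus=12, free_hugepages_1g=32)]
    print(fits_single_node(Flavor(vcpus=8, mem_gib=16), host))
```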
11. Integration tool set
IT applications are, as a principle, standalone applications, and the Vn-Nf interface alone is enough to make them work, while the NFV Cloud has 10+ interfaces such as Nf-Vi, Ve-Vnfm, Or-Vi, Or-Vnfm, Os-Ma, Se-Ma, etc.
How to standardize these interfaces and APIs for smooth onboarding and resource orchestration, and how to develop the corresponding tool chain, is the key for transformation.
12. IT Cloud Migration vs Telco Cloud Migration
In a typical IT Cloud migration you only have to normalize and study DB replication and instance states, but for a Telco migration the list can be exhaustive, including meeting customer KQIs, PNF to VNF migration, and managing the evolution of operations. Customer experience and smooth evolution are the key to success.
13. Multiple Hypervisor Support
In an IT Cloud we primarily talk about Xen, while for the NFV Cloud ETSI talks about KVM in its reference architecture. But the VMware stack is just as important, and the design must consider how to incorporate KVM with VMware. Driver testing, validation, and pooling of both are key to expanding the cloud. In this phase of the industry it may be difficult, but for Industry 4.0 this will be the target reference architecture.
14. Open Platform is the real problem in Telco Cloud
The IT Cloud is disruptive and very open; it is open based on APIs, while a Telco Cloud requires a lot of standardization, such as the ETSI RA, Plugtests, EANTC, community work, etc.
15. EPA and Building High-Performance Clouds
I remember the inclusion of EPA in OpenStack and its use case for 4K video, the famous one in the community, which prepares different hardware for different use cases to make the best use of memory, disk, RAM, smart NICs, etc. A company named Mellanox has done a hell of a job on this work. To sum up, in the future converged Cloud the following will be key to deploying a true NFV cloud:
a. NUMA and CPU affinity need focus on how to ensure workloads are placed in the right NUMA node; you may need to align the design across VNF, NUMA, OVS/bridge, and CPU/NIC
b. TLB buffer size and the associated huge pages normalization for the target Cloud
c. CPU pinning: the main focus must be the PMD threads, their dimensioning criteria, and their allocation across different VMs/vNICs
I personally think there is an inherent Pandora's box in the community, especially for OVS workloads, that needs to be addressed. For SR-IOV it seems OK, as NIC pass-through can be used to achieve line speed.
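For the PMD thread dimensioning mentioned in this section, OVS-DPDK takes the set of PMD cores as a hexadecimal bitmask (the `other_config:pmd-cpu-mask` setting). The helper below is a small sketch that derives that mask from a list of core IDs; the function name and the chosen core IDs are hypothetical, and which cores to dedicate remains a design decision per host and NUMA layout.

```python
from typing import Iterable


def pmd_cpu_mask(cores: Iterable[int]) -> str:
    """Build a hex bitmask with one bit set per core ID (bit N = core N)."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core IDs must be non-negative")
        mask |= 1 << core
    return hex(mask)


if __name__ == "__main__":
    # Example: dedicate cores 2 and 3 to PMD threads (an arbitrary choice).
    print(pmd_cpu_mask([2, 3]))
```

The resulting value is what would be passed to OVS, for example with `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=<mask>`; cores chosen for PMD threads should sit on the same NUMA node as the NIC they poll.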
16. Architecture transformation
For the architecture you must consider migration, as the ultimate target is to move a service from the as-is to the to-be architecture. The key principle of the IT Cloud is that it is app focused: all customization is done on the app, not the Cloud. This makes a public Cloud very simple to manage and automate. The Telco Cloud, conversely, is a different world: PNF migration requires a detailed analysis of service requirements and then customization of the Cloud, such as EPA, HPC, THP, etc., to meet the Telco service needs. This is a very important point for the architecture.
Conclusion for the CXO Office and Chief Architects
Finally, I want to wrap up this paper with the summary that the Telco Cloud is not the IT Cloud; as a network architect you must quantify the real requirements and match them against both the solution maturity and the service capability of the partner. To summarize, the dimensions which make the Telco Cloud different from the IT Cloud are MVI IoT, KPI, performance, elasticity, security, HA, service SLA/KQI, the NFV assurance part, and smooth operations. For Tier 1 operators that want to transform there are some more areas to look at, such as how to transform through DevOps, process, tools, and skills. All of this needs to be evaluated from the Telco service point of view. I think the reason migration has not happened so far, despite many commercial Telco Clouds, is not technical: it is because the architecture does not consider the fact that this change is not about the technology, and that the enterprise architecture and operations will feel the biggest impact. This is why a strong partner with IT skills and a Telco mind is the key to removing the impediments and achieving quick wins for the business.
Closing the discussion, what we infer from this paper is that IT Cloud concepts no longer apply linearly to the Telco Cloud, especially if you are architecting a cloud that will in future host vApps seamlessly on a common DC. Similarly, concepts such as Python and NetFlow no longer suffice where you need service abstraction and automation from a business and CRM perspective. Especially for the future OSS, the business use case needs to be modeled into a service catalog with one-click instantiation, so you can see the concept is not the same at all. Similarly, a scalable VNF is not the same idea as the scaling we know in the IT Cloud, which is mere VM and resource expansion. Those who are new to this part may need to learn statistics first, especially how LMA and DLS algorithms really support network automation. Seeing it all together, NFV is a system while Cloud is a platform, so the concepts of E2E QoS, HA, SLA, security, FCAPS, and building a flatter architecture will be the key objectives of the CXO office.
Nevertheless, no architecture is complete unless security is built in. We all know Cloud took longer to be accepted by end users because of this, and as a matter of principle security needs to be designed in instead of designed out, because open interfaces also mean that everyone can learn the language for talking to a component, which means the advantage can also become an exposure. Inherently, API security through HAProxy and image compression can address the high-level issues, but control of tenants will be key, especially for the future converged clouds. The biggest challenge as CSPs open up the Cloud will be IP related, because the self-service case means many malicious attacks from unknowns. Hypervisor security and, in future, per-domain separation of the host OS in Kubernetes will be the main concerns for CSPs. The network is just opening up: imagine the case where a future software developer from a university is invited to access your cloud to write an application for you. How will the Chief Security Architect accept it? It is not acceptable now, but it will become acceptable as we evolve the architecture. For this, the key points Cyber teams need to remember are division of domains and duplication of analytics. Confused? I mean iptables + security groups + ACLs, but this will impact performance for now, so to support such an architecture the NFVI needs to evolve. The Pike release addresses just 80% of such scenarios, and we do write to the community to ramp up this part. As you may know, this topic is now given the highest importance in the community, and in fact you can join this work too.
The market has just opened and there are many partners around. We even know cases where a company presents Cloud expertise as a sign that it knows the evolution part. This is a trap: knowing the service and its assurance is the key, and the one best placed to survive is the one who is more open. For me, open means a tireless effort to bring value to customers and to make the community feel: hey guys, we are together, we are open, and we will solve it together!
Finally, every business has the right to separate the wheat from the chaff, and we should accept that all that glitters is not gold. Best of luck, and see you later in the new edition.
The paper cannot be considered complete unless I thank the following:
- Ben Silverman, Principal Architect at OnX Cloud: a teacher, a friend, my mentor
- Jaakko Vuorensola from Red Hat, a longtime friend and influencer
- Ajay Simha, Red Hat Chief Architect, for the crazy work on the reference architectures
- OPNFV, ETSI, and OSM, which are obviously the bible for all this work; solving issues the open source way is obviously a nice habit to have while evolving from NetOps
- Customers and partners: your questions and problems are what define my writings
And obviously the crazy consulting team at Huawei: together we believe that understanding the customer's real requirements and building a solution around them is the best way to transform business and to bring long-term value to customers, partners, and industry.
Sheikh is Huawei Middle East Senior Architect for NFV, SDN, and Telco Cloud, with a focus on ICT service delivery through Telco DevOps, and is focused on defining the roads to the future 5G Core Network. Always interested in the disruptive technologies driving the industry transformation, the author hails from a Telco CSP background and since 2013 has been working in the Telco Cloud domain, including Amazon, Huawei, Mirantis, VMware, Red Hat, etc. The comments in my writings are my own and shall not be considered as having any relation to, or binding with, those of my employer.