System Compleat.

Things about OpenStack Swift implementation


(Younjin Jeong)

This is my first time writing a blog post in English. Some expressions may be off, so please read it with a generous eye.

Swift, from the OpenStack project, is one of the simplest storage clusters in the world. Like many other cloud solutions today, it is quite a young project, and it tries to follow the Amazon service called "S3". The project is supported by vendors who want to take the lead in the cloud market. Its architecture is simple to understand, and it is easy to install and use. There are tons of blog posts about its benefits on the internet, so I won't explain them here.

What I want to say about Swift is that it looks easy to build, but in my personal opinion it isn't when it has to serve an enterprise service. I have been working on this storage cloud since early last year for KT and some other companies in Korea. The Swift system for KT was actually built by Cloudscaling, and I first worked on hardware issue support for it. After that, I wrote some automation code for it with Chef, and that code is running on some services. So I think I can say something about its infrastructure, and it may help if you are wondering how to build a Swift infrastructure.

From an infrastructure perspective, it has to scale out as far as possible. That means the network has to be designed for scale-out. Beyond the network, we also have to think about commodity hardware. More importantly, scaling out does not always mean high performance. Here are the key factors to consider when building a Swift cluster.

a. Low SSL performance of Swift proxy. 

b. Network design - Management, Service, and Storage. 

c. Number of disks per processor.  

d. Spec of each node: CPU, memory, network.

e. How to balance the load.

f. Number of zones (this differs from a compute cluster).

g. Account and container service load.

h. Automation

...and many more.

To be honest, it is not easy to achieve all of those things yet, because many of them are not proven yet. Those factors can also differ by service use case. In other words, most of the factors listed depend on the type of service, because the most basic performance is bound by disk I/O. So if we are building a Swift cluster for one specific service, not for the public like Amazon's S3, we can aim for the architecture that performs best for that type of service. I would say it can be tuned by the number of disks per processor. If the service stores larger objects, we can put more disks per processor. If not, we can reduce the number of disks per processor and see how that affects performance.

Another way to improve disk performance in Swift is to use a RAID controller that has cache memory for write-back and a BBU to preserve the cache contents. Swift basically recommends not using RAID volumes, so you may wonder about this suggestion. The way to implement it is to make every single disk its own RAID 0 and create a volume on each. If you have 12 disks in a storage node, you build each one as RAID 0 and end up with 12 RAID 0 volumes that are not striped. The reason for doing this is to use the "write-back" function. With this setup, a storage node writes to the cache memory of the RAID controller instead of the actual disk (or the buffer on the disk). This gives a noticeable improvement in disk I/O performance. But remember that putting a RAID controller in every single node raises costs. Here's a good article on this kind of issue, but it's based on S3.
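To make the disks-per-processor idea a bit more concrete, here is a minimal sketch in Python. It only compares candidate node layouts by that ratio. The node names and numbers are made up for illustration, and the right ratio for a real service still has to come from your own tests.

```python
from dataclasses import dataclass


@dataclass
class NodeSpec:
    """Hypothetical storage node layout; every number here is illustrative."""
    name: str
    processors: int
    disks: int


def disks_per_processor(node: NodeSpec) -> float:
    """Return how many data disks each processor has to serve."""
    return node.disks / node.processors


def rank_by_disks_per_processor(nodes):
    """Sort candidate layouts from fewest to most disks per processor."""
    return sorted(nodes, key=disks_per_processor)


# Assumed candidates: a denser node for large objects, a lighter one for small objects.
candidates = [
    NodeSpec("large-object-node", processors=8, disks=24),
    NodeSpec("small-object-node", processors=8, disks=12),
]

for node in rank_by_disks_per_processor(candidates):
    print(f"{node.name}: {disks_per_processor(node):.1f} disks per processor")
```

The ratio by itself says nothing about performance. It is just a handle to change between test runs, so that each run can be compared against the previous one.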

Beyond the number of disks per node, the network is a great pain point when we build this. Currently, most cloud services use MLAG for a non-blocking network. That could be OK for Swift clusters. However, as we all know, MLAG is vendor specific, and it needs a layered architecture when we want the cluster to scale out massively. It also needs more expensive, chassis-based switches for the top layer (which can be called the "backbone"). What's more, a status change on a single node also affects the core network, because everything is based on L2. For example, with a huge number of nodes, the core network has to hold the MAC-to-IP mapping of every node. Many MLAG architectures do not handle such status changes flexibly, which sometimes causes serious problems. As you can easily imagine, the chance of failure grows as the number of servers increases, and the network can then get slower because of the updates and lookups on that table. Some MLAG implementations also do not support active-active, which means bandwidth utilization cannot be maximized.

So I would suggest using L3 from the ToR up, such as iBGP or OSPF with ECMP. Dan Mihai Dumitriu from Midokura and I figured out how this works for a cloud network, and the effect is exactly what we wanted: faster fault tolerance, a single node's status change does not affect the core network, it is not hard to expand, and it can scale out massively. It is explained in an old post on this blog, which unfortunately is written in Korean. But you can follow it easily because it has a diagram and the actual switch configurations. In any case, it solves most problems of MLAG or L2-based architectures, and we can achieve a better model for enterprise services.
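For readers who have not used ECMP, the rough idea can be sketched in a few lines of Python. This is only a toy model: real switches use their own, vendor-specific hash functions in hardware, and the switch names, addresses, and the SHA-256 hash below are assumptions made for illustration.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Pick one of several equal-cost next hops for a flow.

    `flow` is a 5-tuple (src_ip, dst_ip, protocol, src_port, dst_port).
    All packets of one flow hash to the same next hop, so they stay in order,
    while different flows spread across all available links.
    """
    if not next_hops:
        raise ValueError("no next hop available")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical uplinks from one ToR switch, all with equal routing cost.
uplinks = ["spine-1", "spine-2", "spine-3", "spine-4"]
flow = ("10.0.1.11", "10.0.9.21", "tcp", 40312, 8080)

print(pick_next_hop(flow, uplinks))

# If one uplink fails, the routing protocol withdraws it and flows are
# hashed over the remaining ones; nothing above the ToR has to relearn MACs.
remaining = [u for u in uplinks if u != "spine-2"]
print(pick_next_hop(flow, remaining))
```

Because every uplink carries traffic at the same time, this is the active-active behavior that some MLAG setups cannot give.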

One more important thing is the problem of SSL. Swift is based on a REST API, which means the authentication and authorization information flows through HTTP headers. So we need to secure every single HTTP request with SSL, and HTTPS is the baseline for making the service secure. However, the Swift proxy is built with Python, so its SSL performance is terrible. Therefore we need to put something else in front to make it work faster: process SSL with a reverse proxy or a software-based load balancer, then pass the requests to a Swift proxy that runs without SSL. If you are an infrastructure engineer, you will have some ideas about how to solve this. There are several ways, but one thing I want to point out is: do not use a hardware-based load balancer with an OpenSSL hardware integration feature. It could perform better than a single machine, but it is more expensive than software. Don't forget you are building a cloud service, which means every component of the service may need to be added or expanded.
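To show why every request needs HTTPS, here is a small Python sketch that builds (but does not send) a Swift object request. The token travels in the `X-Auth-Token` header of each call. The endpoint, account, token, and object names are hypothetical, and the helper `build_object_request` is my own illustration, not part of any Swift client.

```python
import urllib.request


def build_object_request(storage_url, token, container, obj):
    """Build (but do not send) a GET request for one object.

    The token is carried in the X-Auth-Token header, so anyone who can read
    the request on the wire can reuse it unless the connection is encrypted.
    """
    url = f"{storage_url}/{container}/{obj}"
    return urllib.request.Request(url, headers={"X-Auth-Token": token}, method="GET")


# Hypothetical HTTPS endpoint; in the setup described above this would be the
# reverse proxy or software load balancer, not the Swift proxy itself.
req = build_object_request(
    "https://swift.example.com/v1/AUTH_demo", "AUTH_tk_example", "photos", "cat.jpg"
)

print(req.get_method(), req.full_url)
print(req.header_items())
```

Since the header is attached to every single request, the SSL work is per request too, which is why it pays to move it off the Python proxy.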

The last thing (even though there is much more) is about the zone. All of the Swift documents say that "5" zones are needed by default. If you have a good understanding of Swift, you already know what a zone is and that it is strongly related to cluster expansion. It is also related to the design of every single rack (cabinet). This is about how to build the expansion model and also the fault tolerance model. Since I have raised the issue, you can think about why "5" is a sensible number and how the cluster is affected when we change it to another.
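One possible way to start thinking about the zone count is the toy check below. It assumes 3 replicas per object (the commonly used Swift setting, but an assumption here) and asks only one question: after some zones fail, can every replica still sit in a different healthy zone? It ignores capacity, weights, and rack layout, so treat it as a thinking aid and not as how the Swift ring actually places data.

```python
REPLICAS = 3  # assumption: three copies of every object


def can_keep_replicas_distinct(total_zones, replicas, failed_zones=0):
    """True if every replica can still land in a different healthy zone."""
    healthy = total_zones - failed_zones
    return healthy >= replicas


for zones in (3, 4, 5):
    for failed in (0, 1, 2):
        ok = can_keep_replicas_distinct(zones, REPLICAS, failed)
        print(f"zones={zones} failed={failed} distinct placement possible: {ok}")
```

Under these assumptions, a cluster with exactly as many zones as replicas has no room left once a single zone is down, while extra zones give space both for failures and for adding racks later.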

Unlike many other open source projects, this solution cannot be used with "default" options. If you remember how Linux worked in the late '90s, you may understand what this means. There were "no" defaults, and sometimes we had to change some source code before compiling and using it. Back to Swift: I haven't talked about DB updates when massive numbers of requests come into a Swift cluster, because that needs to be figured out by your own hands, for your own services. There are many decision points, which means it is sometimes very hard to move forward. You really need to run many tests to build this cluster for the right service; then you can solve many of these issues. In my experience, if you build this cluster at production level without a PoC or massive tests, you will fall into deep trouble that you cannot escape.

Many contributors are working hard on OpenStack, and I thank them. It gets better and better as time goes by, and I believe all of its components will stabilize and gain features in the not too distant future. Swift is a solution worth considering when someone wants to build a huge storage cluster that will be used through the Web. But there are always tuning points, so our major job is to figure out the "butterfly effect" in the cloud. A small change can make a huge difference in performance and costs.

Thanks for reading this post in my ugly English, and please feel free to share your opinion about it.
