Cornelia's Weblog

my sporadically shared thoughts on, well, whatever is capturing my attention at the moment.

Diversity in Tech, Openspace at Devops Days

I was at a conference last week, Devops Days Pittsburgh, that had lots of Openspace (aka unconference) time built in, and one of the sessions I attended was on diversity in tech, or shall I say, the lack thereof. It’s a tough topic; the problem is so vast and multi-faceted that it’s hard to even know where to start, so I’m afraid I cannot report any major breakthroughs or deep insights. I will share some observations, opinions and emotions though – sorry, it is an emotional topic.

Where we decided to start was by going around the room and having each person introduce themselves and say a bit about why they chose this session. Some people, men and women, reported personal interest, others, aspirations to improve the situation at their workplaces, and some brought a concern on behalf of a loved one, usually a wife or daughter. Though we acknowledged that there are many underrepresented groups in technology fields, we spent most of the time talking about gender. The discussion was quite active, jumping around a whole lot, but there are three things that are really sticking with me.

First, during our round of intros one gentleman said he was there because he was looking for help on how to have the conversation without sounding bad. This resonated with several others in the group and many agreed this was a laudable goal. But then another gentleman spoke up, expressing his dismay that there was such interest in (paraphrasing) “white men wanting to have people help them not sound like bigots.” OMG. Okay, I love you. I learned a little bit about microaggressions from an extraordinary woman I recently met (Jane, lunch some time soon? I promise to eat this time ;-) ) – and have a look at this link, it’s really interesting. I’m dense – most of the time I don’t see them until they are pointed out, just as in this case; I didn’t see the microaggression until it was pointed out. Women, and other underrepresented individuals, are the ones with the problems that need solving, not the white guy. Let me emphatically point out that the guys who were looking for this help are good people; I don’t slight them personally at all. We all have biases that have been ingrained into us from a very young age and, in a way, I see their question as a way of expressing that they want to eliminate them. Maybe that’s the lesson here – that we should all work to get a handle on our biases so we can then work on them.

The second moment came when a woman in the group spoke up, said that she had never experienced any discrimination and that women had only themselves to blame for being underrepresented in technology fields. Are you freakin kidding me?! May I present Computer Engineer Barbie and Bad Ass Programmers? I have another link for you, lady: read this article on ambient belonging. It’s real. It discourages not only adult women from going into tech, but even more alarmingly, filters out girls so that by the time they get to high school they couldn’t be more averse to going into these male-dominated fields. Point is, the fields are male-dominated even at the primary school level and there are very real reasons for that. It’s not the girls’ fault! Argh!

And then, at some point we talked about maternity leave policies in the US, and, of course, it was pointed out how anemic they are here, relative to other countries. [Hmm, all of a sudden I want to see statistics on representation of women in tech in different countries. If you’ve got some links, please comment.] This then led to a conversation about how difficult it is for a mom to take some years off for family, and return to the work force. One man told us that, in fact, he discouraged his daughter from going into computing for exactly that reason. NO WAY! I was, and am, devastated by this! I cannot believe that even well-meaning, technology-savvy parents are doing their daughters such a monumental disservice. I personally [and this is a very personal subject for me] cannot imagine having found a career that I love more than the one I have, and if I had gone into something else, just to plan for an eventuality that might or might not pose a challenge… well, “catastrophe” is the only word I can think of.

This really hurt me in a deep way. I spent a good bit of the weekend thinking about it. The challenge is real, no question; in certain, significant swaths of the computing industry, the technology changes so rapidly, and so significantly, that taking a few years off can leave someone very far behind. Take enough time off for the kids to be fully grown, and whoa. When my son, who is just finishing his first year in college, was born, very few people even had cell phones. So how could someone who had taken a young lifetime off possibly have the skills needed to build mobile apps? But this problem IS SOLVABLE. First, there are ample opportunities to stay relatively current in your skills. There’s Coursera and many other online learning sites, and programs that can help jumpstart (or restart) people, particularly those with the background and drive. Never before have I seen so many easily accessible communities – Women Who Code, Black Girls Code, Data Driven Women and lots of other groups focused on just about any technology. I’m not saying a mom (or dad) needs to work full time on this while raising kids, but there are lots of opportunities to make things work. What we’re talking about here is not a “job,” it’s a career and needs some care and feeding.

And when I was talking to my husband about it, we also realized that there are some technology industries that don’t change as rapidly as others. He works in aerospace, for example, and feels that someone could take years off and reenter that space with little difficulty.

So there are some pragmatic things a woman can do that would allow her to be a mom AND have a rewarding career. Please, for god’s sake, do NOT discourage young women from going into tech because they might become pregnant some day! Talk about micro, no, macroaggressions.

This is a test (to be followed by some Cloud Foundry love)

Oh my, is it possible that it’s been more than three months since I posted here? In my defense, I have another blog that I now post on – well, a few actually, including blog.gopivotal.com and blog.cloudfoundry.org. I’ll post a summary of my recent submissions with links soon.

I am, in fact, about to go live with a new post over there and I’m experimenting with embedding a Vine. So this is a test.

And not to leave you hanging, this vine highlights a feature of the Cloud Foundry PaaS – one (spoiler alert: of 4!!) way that the platform keeps your apps running. When configured to use two “availability zones”, the elastic runtime part of the PaaS will evenly distribute your application instances over all AZs so that if you lose one AZ you are guaranteed that application instances are still available and serving traffic. This is one of the things that beautifully demonstrates the benefit of PaaS over IaaS.

I’m sure you’ve all heard the stories about an AWS AZ going down and taking with it a bunch of web sites/apps. You know what? That’s not Amazon’s fault; there is no way that Amazon can or would guarantee constant uptime of all AZs – that’s why they offer you more than one. As an application devops leveraging AWS you have the option of deploying your applications across AZs so that if one fails your app is still running. But you are responsible for planning this and executing on it. When an AWS AZ goes down and takes an app with it, you know the option was not exercised. With Cloud Foundry we take care of that for you. You just push your app with multiple instances and we’ll take care of the distribution.
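To make that concrete, here is a minimal sketch of what “push your app with multiple instances” looks like from the developer’s side. The app name is hypothetical and I’m using the newer cli syntax, where -i sets the instance count; the platform, not you, decides which AZ each instance lands in:

cf push myapp -i 4

and if you later decide you want more headroom, scaling is just as simple:

cf scale myapp -i 6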

Oh, and I should note that Cloud Foundry availability zones are supported even when running on IaaS other than AWS – in case you didn’t know, Cloud Foundry runs over vSphere, OpenStack, AWS, CloudStack and more.

There’s more to come but that’s enough for this simple test.

Changing quota definitions

Recently I’ve had several customers ask about whether quota definitions can be changed, both in the OSS product and in PCF.  The short answer is “yes”. Here’s the longer one:

First, for those of you who might be unfamiliar with quotas, they are assigned to organizations and basically limit consumption within that organization.  A quota definition sets limits in certain areas, such as memory, number of service instances and so on.  A Cloud Foundry deployment can have any number of named quota definitions, and each organization is assigned a single quota_definition.

Viewing the defined quota_definitions

Currently the quota definitions for your elastic runtime deployment are only viewable through the command line client.  As you surely know, we currently have two CLIs, the old (v5), which I will henceforth refer to as “cf” or v5, and the new (v6), which I will henceforth refer to as “gcf” or v6.  To see the list of quota_definitions you can use the following commands:

cf -t set-quota

That is, execute the command with no arguments and you will first be shown a list of the available quota definitions.

Or with the new cli, first set the env variable CF_TRACE=true and execute:

gcf quotas

In both of the above cases we’ve run with tracing turned on so that you can see the details of the quota definitions. This will be key as you will need to know the format and key names when issuing curl commands described below.
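Put together, the two variants look something like this (I won’t reproduce the trace output here since it differs between versions, but it is the JSON in that output that you will crib from later):

cf -t set-quota

CF_TRACE=true gcf quotas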

Setting org quotas

Given existing quota definitions, assigning one to an organization is done very simply with the following (v5 and v6) CLI commands:

cf set-quota --organization myorg --quota-definition trial

or

gcf set-quota myorg trial

The final argument in each of these is the name of an existing quota definition.

Creating, deleting or updating quota definitions

Quota definitions themselves are created in one of two ways: either at the time you are deploying the elastic runtime, or post deployment, using the API.

At deployment time

This is done by simply including quota definitions in your deployment manifest properties, for example:

properties:
  quota_definitions:
    free:
      memory_limit: 0
      total_services: 0
    paid:
      memory_limit: 10240
      total_services: -1
    runaway:
      memory_limit: 102400
      total_services: -1

PCF does not yet expose quota definitions in the Operations Manager (deployer); however, the good news is that regardless of how you deploy the Elastic Runtime, you may edit the quota definitions to your heart’s content after deployment.

Post deployment

To create a new quota definition using the API you simply need to issue a POST to the quota_definitions resource.  The simplest way to do this is with the (v5) cf curl command, as this will include your current token (obtained with a cf login) in the request; if you wish to use your favorite REST client you will have to futz with the auth token yourself.  Note that v6 does not yet (!) support gcf curl.

To craft the appropriate request you will need:

  • The method: POST
  • The relative path of the resource: /v2/quota_definitions
  • The body which will contain the details of the new quota definition:
    {"name": "default2",
     "non_basic_services_allowed": true,
     "total_services": 100,
     "total_routes": 1000,
     "memory_limit": 10240,
     "trial_db_allowed": true}

Putting it all together, the request looks like this:

> cf curl POST /v2/quota_definitions -b '{"name": "default2",
                                          "non_basic_services_allowed": true,
                                          "total_services": 100,
                                          "total_routes": 1000,
                                          "memory_limit": 10240,
                                          "trial_db_allowed": true}'

And, voila!, you have your new quota definition.  Go ahead and check.
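One way to check, leaning on cf curl again so the auth token is handled for you (this follows the same pattern as the commands above, so treat it as a sketch), is to GET the collection and look for your new definition in the response:

cf curl GET /v2/quota_definitions

or simply re-run the gcf quotas command from earlier with tracing turned on.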

Updating an existing quota definition is just as easy.  Recall from above that you can use v5 or v6 commands to get a list of the current quota definitions, with the details by having tracing turned on.  You need to find the relative path of the specific quota definition resource by looking at those details.  Then craft the appropriate request:

  • The method: PUT
  • That relative path: /v2/quota_definitions/286b4253-95de-442e-a2e8-d111c2adb2e2
  • The body; simply include the properties you wish to change:
    {"memory_limit": 20480}

Putting it all together, the request looks like this:

> cf curl PUT /v2/quota_definitions/286b4253-95de-442e-a2e8-d111c2adb2e2 -b '{"memory_limit": 20480}'

In response you will get the updated quota definition:

{
  "metadata": {
    "guid": "286b4253-95de-442e-a2e8-d111c2adb2e2",
    "url": "/v2/quota_definitions/286b4253-95de-442e-a2e8-d111c2adb2e2",
    "created_at": "2014-01-14T19:26:43+00:00",
    "updated_at": "2014-01-14T19:28:43+00:00"
  },
  "entity": {
    "name": "default",
    "non_basic_services_allowed": true,
    "total_services": 100,
    "total_routes": 1000,
    "memory_limit": 20480,
    "trial_db_allowed": true
  }
}

We are planning to add cli commands to facilitate this (stories: here, here, here and here), and are adding these capabilities to PCF Operations Manager, but in the meantime you have at least this level of control.

And one more note – we are just about to officially release the v6 cli (it’s been in beta), at which point we will rename it to “cf” – so just to keep you on your toes, in my next post on the subject, the new cli will be v6 and called “cf”. ;-)

Running a Local (as in, on your laptop) Cloud Foundry Instance

More than two years ago on the CF blog the availability of Micro Cloud Foundry was announced – this was a version of Cloud Foundry that would run on a laptop.  It came in the form of a vmdk that you could run either using VMWare Workstation or VMWare Fusion.  Later posts talked about a release lifecycle that would provide frequent updates to the Micro release, so as to keep it in synch with the constantly-being-updated public offering running on cloudfoundry.com. It was an interesting experiment that had legs for more than a year, but didn’t end up panning out all that well; I can share a personal anecdote that illustrates one of the reasons. More than a year ago when I was with EMC, I was working with some product groups that wanted to use CF, not via cloudfoundry.com, but rather with a customized version of CF.  They deployed a private cloud in their data center but then wanted to enable their developers with a matching “micro” instance running on their personal workstations.  We looked into the process of getting a customized version of Micro CF but in the end the answer was convoluted enough to keep us from making any progress. As it happens, we weren’t alone.

And now, here we are on the verge of our first supported release and I’m happy to report two way cool things.  First, there are now numerous, very viable options for getting a local CF instance up and running, and second, the CF community has become so vibrant that two of the three options I’ll describe have come from outside of Pivotal. In this post I’ll share with you my personal experiences with and opinions on these three options.

I’ll go through these in the order in which they appeared on the scene, which is, incidentally, the order in which I began using them.

CF Vagrant Installer from Altoros

In April of this year the good folks at Altoros released the cf-vagrant-installer.  This project required you to have Vagrant and a hypervisor (no license fees if you use VirtualBox), as well as Ruby, installed on your host machine (laptop); you could then clone the git repo and run the “vagrant up” command.  It worked really well if you timed it just right, but because the CF components were stood up via Chef scripts that are part of the project, keeping the installer up to date with the CF releases has proven challenging.
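For the record, the whole dance looked roughly like this (the repository URL is from memory, so double-check it against the Altoros project page):

git clone https://github.com/Altoros/cf-vagrant-installer.git
cd cf-vagrant-installer
vagrant up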

This was the first of the three options that I used earlier this year and I was incredibly grateful to have it.  That said, because of the aforementioned challenges in keeping this in synch with the evolving CF releases, and dwindling activity in the project, I no longer recommend it.

CF Nise Installer from NTT Labs

Just a month after the release from Altoros, NTT Labs released the cf_nise_installer.  The install can be run with the help of vagrant, or you can create an Ubuntu 10.04 machine yourself (virtual or physical) and run two commands WITHIN that VM.  You don’t need to install ruby on your host (really nice if your laptop is a Windows box) – in fact, if you run the installer in your own Ubuntu machine then there are absolutely NO dependencies on software running natively on your laptop.  The basic mechanics of how the Nise installer does its thing are that it leverages a BOSH emulator, also created by NTT Labs, and the install then 1) clones the cf-release repository (which is the official release vehicle for the OSS Cloud Foundry), 2) creates a release using the BOSH cli and 3) deploys that release onto the VM via this BOSH emulator, called Nise BOSH.  Because the deployment leverages cf-release directly, it doesn’t get out of synch with the CF release – that is, it IS the CF release.  By default the cf_nise_installer will install the latest “final build” as found in the releases directory.  It’s true that the BOSH emulator has to keep up with changes to BOSH and the deployment manifest must also be updated, but yudai is remarkably responsive and it’s a rare day where someone reports an issue on one of the mailing lists and doesn’t have a resolution from him in just a couple of hours (I’m honestly not sure that the guy ever sleeps ;-) ).
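If you go the vagrant route, the shape of it is much like the Altoros installer; treat the following as a sketch only, since the repository location and any bootstrap scripts are spelled out in the project README, which is the authority here:

git clone https://github.com/yudai/cf_nise_installer.git
cd cf_nise_installer
vagrant up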

In spite of the third tool I am just about to describe, the cf_nise_installer remains the option I most often recommend for standing up a local cloud foundry quickly, because, well, it just works – and in about 30 minutes.

BOSH-lite

While the commit history dates back to June of this year, it was more like August when bosh-lite matured enough for consumption.  This project was initiated from within Pivotal and, most simply put, it is an implementation of a Warden CPI for BOSH. If you’re not already thinking this, let me seed it in your consciousness…

A warden CPI. Oh, now, That. Is. Cool!

BOSH is a product that allows you to deploy and manage complex distributed applications, like the Cloud Foundry Runtime and Services, over the top of any IaaS.  BOSH just needs to be able to provision virtual machines and then it lays things down onto those machines and monitors them too.  We’ve had support for VMWare, Openstack (provided in partnership with Piston Cloud) and AWS for some time, and NTT recently announced one for CloudStack (and there are more in the works too). Warden is a system that serves up lightweight Linux containers – think of them as super-lean virtual machines.  The CF runtime provides application isolation by hosting deployed applications inside of warden containers, and now bosh-lite uses that same technology for the VMs into which CF components are deployed.

With a Warden CPI for BOSH, you now have the ability to run full BOSH on your laptop, and hence you can do a full BOSH deployment of CF on your laptop.  I know I said it already but, whoa, that is way cool.

That said, there are a few gotchas.  First, if you clone the repo today you will find that by default, the vagrant config uses 6GB of RAM – pretty hefty (I have been able to do it with 3 or 4 Gig).  You will also need to install a number of things on your laptop to be able to do the install.  All told, I’ll confess that it took me the better part of a day to sort through all of the dependencies, clean up my laptop sufficiently and get it all up and running.  And I recently witnessed a partner have roughly the same experience.

One of the best things, however, is that once you have bosh-lite installed, then the deployment of CF proceeds in the same manner as it would if you were deploying CF into the cloud. This has the obvious advantage of allowing you to gain necessary experience with an important toolset, BOSH, but that is, admittedly, a fairly significant burden for the newbie.  Another advantage that the bosh-lite approach has over CF Nise is that because the CF components are isolated from one another in the Warden containers, you won’t have issues with various CF pieces bumping into each other – CF Nise cannot, for example, deploy the CF runtime and most of the services on the same box.
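To give you a feel for what “the same manner as deploying into the cloud” means, here is roughly the sequence once the bosh-lite vagrant VM is up. The director address is the bosh-lite default as I remember it, and the file names are placeholders, so verify everything against the bosh-lite README:

bosh target 192.168.50.4 lite                 # the director running inside the vagrant VM
bosh login admin admin
bosh upload stemcell bosh-stemcell-warden.tgz
bosh upload release cf-release.tgz
bosh deployment cf-manifest.yml               # your edited CF deployment manifest
bosh deploy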

To summarize:

  1. I no longer recommend the vagrant installer from Altoros.
  2. If you are looking to get the simplest CF instance stood up on your laptop, don’t need CF services right away (but you might be working on creating your own service broker – that’s okay), then I highly recommend cf_nise_installer.
  3. If you have enough RAM and are looking for a full-blown deployment of CF with services, and/or you are primarily interested in learning BOSH, then go with bosh-lite.

I have options 2 and 3 operational on my laptop and depending on the task before me I choose one over the other.

Canaries are Great!

(cross-posted from the Cloud Foundry blog)

First a little background, and then a story. As Matt described here, Cloud Foundry BOSH has a great capability to perform rolling updates automatically to an entire set of servers in a cluster, and there is a defensive aspect to this feature called a “canary” that is at the center of this tale. When a whole lot of servers are going to be upgraded, BOSH will first try to upgrade a small number of them (usually 1), the “canary”, and only if that is successful will the remaining servers in the cluster be upgraded. If the canary upgrade succeeds, then BOSH will parallelize up to a “max in flight” number of remaining server upgrades until all are completed.
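For context, the knobs that drive this behavior live in the update block of a BOSH deployment manifest. Here is a minimal sketch – the field names are the real ones, but the watch times are illustrative values, not recommendations:

update:
  canaries: 1                       # upgrade this many instances first
  canary_watch_time: 30000-600000   # how long (in ms) to wait for the canary to report healthy
  max_in_flight: 4                  # how many instances to upgrade in parallel once the canary succeeds
  update_watch_time: 30000-600000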

And now the story.

For the last few weeks I’ve been pairing on the Cloud Foundry development team here at Pivotal. I’ve had a chance to work on lots of cool things, I’ve seen the continuous integration (CI) pipeline in action, and today I got to be one half of the pair that did a deploy to production – that is, to run.pivotal.io. As a brief aside, the CI pipeline is way cool, with a number of different systems automatically running test suites and passing tests automatically promoting things. But when it comes to production deploys there is still a person that pushes the “go” button. Of course, just as any other thing we do here at Pivotal, that production process has been tooled so that it really is usually a matter of pushing a metaphorical button. Today, however, we did do a bit of an upgrade to the tooling before we used it.

And we goofed.

Yes, we had tests for the tooling code. Yes, the tests were passing. But when we ran things with the production manifests as input, an authorization token was wrong. BOSH did tell us of the change, but we got a bit overzealous and didn’t catch that change until after we said “yes” to the “are you sure you want to deploy this” question. Once we realized our problem, BOSH was already upgrading our marketplace services broker. Doh.

Could have been a very bad day. But, thanks to canaries, my heart didn’t even skip a beat.

Our production deployment runs multiple instances of the marketplace services broker. Enter the sacrificial canary. When BOSH started the upgrade it only took down one of the brokers, upgraded the bits and tried to restart it. In this case the air in the coal mine was toxic and our little bird did not survive :-( . As a result, we sent no additional souls in, leaving the other service brokers fully functional :-) .

We fixed our problem, pushed the “go” button again and this time the canary came up singing, BOSH upgraded the remaining service brokers and our production deploy was complete.

It was a good day. A really, really good day.

Intel Developer Forum Hackathon

During a conference-rich week two weeks ago, Pivotal was invited to participate in a coding contest that was a part of the Intel Developer Forum (IDF) held at Moscone.  The event hosted 15-20 college students, forming two teams that would compete against an Intel team; the task: build an innovative application for junior high math students. The main thing the applications would be judged on was cloud-readiness, and this is where Pivotal came in – the applications would be deployed to Cloud Foundry.  I had the great pleasure of representing Pivotal, Pivotal Labs and Cloud Foundry at this event and it was a blast!

The event spanned two half days with official coding hours from 11-3 each day, though I heard that an all-nighter might have transpired between the two. The kids (and given that most are roughly my son’s age, I will call them kids ;-) ) almost all came from Contra Costa College, where they have had the great fortune of being taught by a fantastic professor, Tom Murphy, who was also in attendance. These budding programmers had prior experience with HTML 5 and javascript, and the teams included designers as well. Intel, who have been engaged with Cloud Foundry for some time, brought with them some brand-new hardware, just announced at the IDF, onto which they deployed Cloud Foundry – that is, they stood up a private cloud for the event.

The teams’ submissions would be judged on certain criteria including:

  • Design for failure
  • Stateless computing
  • Scale out (not up)
  • Event driven
  • Web services
  • Security
  • Prefer eventual consistency
  • DevOps/NoOps

I’d like to share several reflections:

One of the main things that I learned from the experience is that thinking in terms of distributed systems is not immediate.  Well, duh – of course it is not.  It made me think back to my days in school where the initial classes were introductory programming, data structures and algorithms, and computer architecture.  In chatting with these students, I learned those are the same courses they have been taking.  I think it’s inarguable that a course in distributed computing MUST be a part of any undergraduate computer science curriculum today (but I’m not sure that it is??).  And it needs to come early. When I was in graduate school at IU I often TA’d the first course that computer science majors took, an Intro to Programming course taught in Scheme. This was brilliant because, in part due to the nature of the language, they are taught recursion in week 3 (as opposed to week 13 when I took it in my intro Pascal programming course) and it turns out not to be hard (in week 13 it was hard for a lot of my fellow students). Recursion became a foundation for them, not an add-on. I’d like to see distributed computing be a part of that foundation today.

Because of prior training and experience, the students all built their applications to run HTML 5 and javascript in the browser. The UIs these kids built were way cool, with spaceships shooting answers at meteors that carried math problems, for example, and because they are standards-compliant, they will work in pretty much any browser.  Major kudos to Tom Murphy for laying this incredibly valuable foundation.

In the bucket of “stateless computing” we detailed that state should not be stored within the compute node. While the game play ran entirely in the browser, things like user profiles and high scores would be stored in some database.  Intel had stood up their private PaaS with several choices for persistence, including both relational and NoSQL databases.  It probably won’t surprise you to hear that the Intel team stored their data in a relational database and the student teams both chose MongoDB. I wish I had been able to dig in a bit more on why they made this choice, but by the time the question came up we were singularly focused on getting things working. (If you were on one of the student teams, please post a comment and tell us why you made that choice.)

In the end, neither student team was able to connect their browser-based application to cloud-based persistence (entirely our fault for not anticipating what would be needed here), though they did get the applications pushed into the cloud, which on its own is pretty cool.  The first team to get their app pushed into Cloud Foundry, using the cf cli, was literally jumping up and down and whooping and hollering.  They didn’t have to stand up a VM, load the OS, install an app server and then deploy their app – they just wrote their app and pushed it to the cloud.  That is PaaS.

Finally, I want to extend congratulations to Cathy Spence and the whole Intel team for the event. While I was there I came to understand that Intel does these types of events quite regularly and while it’s probably a bit tricky to quantify the ROI, there is no question that Intel benefits in terms of positive PR and recruiting.  And the benefit to these students is great!  I’d really love to see my parent companies, EMC and VMWare, take a lesson here, and while Pivotal is much smaller and may lack some resources, I’m certain there are ways we can creatively engage in a similar manner.

Tools for Troubleshooting Application Deployment Issues in Cloud Foundry

Our standard demo for Cloud Foundry has us in a directory where either some source code or an application package (war, zip, etc.) is sitting and then we do a

cf push

A handful of messages will appear on the screen, like:

Uploading hello... OK
Preparing to start hello... OK
-----> Downloaded app package (4.0K)
-----> Using Ruby version: ruby-1.9.3
-----> Installing dependencies using Bundler version 1.3.2
Running: bundle install --without development:test --path vendor/bundle --binstubs vendor/bundle/bin --deployment
Fetching gem metadata from http://rubygems.org/..........
Fetching gem metadata from http://rubygems.org/..
Installing rack (1.5.2)
Installing rack-protection (1.5.0)
Installing tilt (1.4.1)
Installing sinatra (1.4.3)
Using bundler (1.3.2)
Your bundle is complete! It was installed into ./vendor/bundle
Cleaning up the bundler cache.
-----> Uploading droplet (23M)
Checking status of app 'hello'....
1 of 1 instances running (1 running)
Push successful! App 'hello' available at http://hello.cdavisafc.cf-app.com

This shows that the application was uploaded, dependencies were downloaded, a droplet was uploaded and the application was started.  And that is all fine and good, but what happens when something goes wrong? How can the application developer troubleshoot this?

The answer is multi-faceted and in this note I will try to organize things a bit.

First, let me list the different tools someone might have at their disposal, and briefly what each offers for app troubleshooting (a few example invocations follow the list):

  • the cf cli
    • the cf apps command – This should be very familiar, it simply shows you the apps you have deployed and an indication of their health
    • the cf logs command – This will show you the contents of the files found in the logs directory of the warden container – these contents will vary depending on where in the app deployment process you are when investigating
    • the cf files command – This will show you the filesystem contents of the warden container – these contents will vary depending on where in the app deployment process you are when investigating
  • the bosh cli
    • the bosh logs command – This will tar up and download the files found in the /var/vcap/sys/logs directory on the targeted VM.  In general, the logs from the dea will probably be the most helpful (dea logs and warden logs), with perhaps something of note in the cloud controller logs.
  • ssh into CF VMs
  • wsh (warden shell) into the warden container for the application
    • this is only possible if the application was entirely staged and is up and running. In the event that the application is “flapping,” the warden containers are likely getting killed and recreated on some pretty short interval and it will be hard to get much from wsh-ing in.
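To make that list a bit more concrete, here are some example invocations. The app name, the file path and the bosh job name are all hypothetical and will vary with your deployment and cli version:

cf apps                 # is the app up, and how many instances are running?
cf logs hello           # contents of the logs directory in the app's warden container
cf files hello logs/    # poke around the container filesystem
bosh logs dea 0         # tar up and download the logs from the dea VM (job name/index depend on your deployment)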

Here’s the thing… ultimately your application developer will only have access to the first of these things (the cf cli) and once your cloud is stable, this should be sufficient.  While you are getting the kinks worked out of your PaaS deployment, however, the other tools can be very helpful.  One other thing to note is that if your developers are enabled with some type of micro-cloud foundry on their workstations, then while they may not have bosh, they would be able to ssh into that machine and poke around, for example, getting to the dea logs directly.  I do this all the time on my laptop.

Okay, so now with this list of tools, I’ve crafted the following diagram to give some guidance on what tools will help when investigating things during different stages of the application deployment process.  There is definitely a bit of a trick to figuring out where in the lifecycle something went wrong, but even trying to use a prescribed tool for something will give you a hint.

App Restarting in Cloud Foundry

A couple of weeks ago, right before going on what turned out to be a glorious vacation in the sun, I stood up a local Cloud Foundry on my laptop using the cf-vagrant-installer from Altoros.  Turns out there was a bug in a couple of the configuration files (pull request has already been merged) which offered a beautiful learning opportunity for me and I want to share.

Here’s what kicked it all off.  I went through the cf-vagrant-installer install and pushed an app.  Sure enough, it all worked great.  Then I shut down my vagrant machine, started it back up and expected my app would similarly restart, but it didn’t.  Instead, when I ran a cf apps it showed my app with 0% of its instances started.

cdavis@ubuntu:$ cf apps
Getting applications in myspace... OK
name    status   usage      url
hello   0%       1 x 256M   hello.vcap.me

Hmm.  Okay, so let me try something simpler – who knows what the cf-vagrant-installer startup scripts are doing, maybe the left hand can no longer see the right after a restart.  So I cleaned everything up, pushed the app and it was running fine.  I then went and killed the warden container that was running the app (a separate post on how I did that coming soon) – and again, the app didn’t restart. And it stayed that way. It never restarted. Yes, it’s supposed to. So I dug in and figured out how this is supposed to work:

There are four Cloud Foundry components involved in the process, the Cloud Controller, the DEA, the Health Manager and NATS.  The Cloud Controller (CC) knows everything about the apps that are supposed to be running – it knows this because every app is pushed through the CC and it hangs on to that information. To put it simply, the CC knows the desired state of the apps running in Cloud Foundry. The DEA is, of course, running the application. The Health Manager (HM) does three things – it keeps an up to date picture of the apps that are actually running in Cloud Foundry, it compares that to the desired state (which it gets from CC via an HTTP request) and if there is a discrepancy, it asks the CC to fix things. And finally NATS facilitates all of the communication between these components. (BTW, Matthew Kocher posted a nice list of the responsibilities of the Cloud Foundry components on the vcap-dev mailing list).

Here is what happens.

The DEAs are constantly monitoring what is happening on them – they do this in a variety of ways, checking if process IDs still exist, pinging URLs, etc. If the DEA realizes that an app has become unavailable, it sends a message out onto NATS, on the droplet.exited channel with details.  The HM subscribes to that channel and when it gets that message does the comparison to the desired state. Note that an app instance could have become unavailable because the CC asked for it to be shut down – in which case the desired state would match the actual state after the app instance became unavailable. Right? Assuming, however, the app crashed, there would be a discrepancy and the HM would tell the CC that another instance of the app needed to be started.  The CC would then decide which DEA the app should start on (that is (part of) its job) and let that DEA know. The DEA starts the app and all is good.

That’s a bit confusing so here’s a picture that roughly shows this flow – you shouldn’t take this too literally, especially the sequencing; for example, the HM asking the CC for the desired state is something that happens asynchronously, not as a result of the DEA reporting a crashed app. This picture is just intended to clarify the responsibilities of the components.

Another place to see this in action is by watching the NATS traffic for a given app.  (I’m writing another post to talk about this and other tools I used in my investigations, but for now, just enjoy what you see.) What this shows are the heartbeat messages sent out by a dea showing the apps that are running. Then we get a droplet.exited message that starts the whole thing going. Eventually you see the heartbeat messages again showing the app as running.

06:03:53 PM router.register           app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007
06:03:58 PM router.register           app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007
06:03:58 PM dea.heartbeat            dea: 0, crashed: 0, running: 1
06:04:03 PM router.register           app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007
06:04:08 PM router.register           app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007
06:04:08 PM dea.heartbeat            dea: 0, crashed: 0, running: 1
06:04:13 PM router.register           app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007
06:04:13 PM router.unregister      app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007
06:04:13 PM droplet.exited            app: hello, reason: CRASHED, index: 0, version: b48c1871
06:04:14 PM health.start                app: hello, version: b48c1871, indices: 0, running: 0 x b48c1871
06:04:14 PM dea.0.start                  app: hello, dea: 0, index: 0, version: b48c1871, uris: hello.172.16.106.130.xip.io
06:04:16 PM dea.heartbeat            dea: 0, running: 1
06:04:16 PM router.register           app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61011
06:04:18 PM router.register           app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61011
06:04:18 PM dea.heartbeat            dea: 0, crashed: 1, running: 1
06:04:23 PM router.register           app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61011
06:04:28 PM router.register           app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61011
06:04:28 PM dea.heartbeat            dea: 0, crashed: 1, running: 1 

One last thing you might ask is this. What if somehow the message sent by the DEA that an app has crashed goes missing? We are NOT depending on durable subscriptions (which would be a grind on performance) so what is our mechanism for ensuring eventual consistency?  Remember that I said that the HM does three things, including keeping track of the actual state of the system. It can do this because every 10 seconds each DEA sends a heartbeat message (as you can see above) out onto NATS reporting how the apps that are running on it are doing.  If the HM doesn’t get the direct message that an app has crashed, from the heartbeat messages it will eventually see an actual state that doesn’t match the desired state. At that point it will contact the CC just the same as described above.
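If you want to eavesdrop on this conversation yourself, the Ruby nats gem ships with a little subscriber utility that will print every message it sees. A sketch – the credentials, host and port below are placeholders for whatever your deployment’s NATS is configured with:

gem install nats
nats-sub ">" -s nats://nats:PASSWORD@10.0.0.10:4222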

I’ve not yet grown tired of killing apps in every which way, destroying the warden container, going into the warden container and killing the app process, restarting the DEA and so on, and watching the state of the system eventually (within a few seconds) come back into equilibrium. Way cool. So very way cool!

The Intuition of Installing BOSH

We have Cloud Foundry running in the lab, BOSH deployed, in a vSphere environment.  While I worked with a colleague to deploy that Cloud Foundry, and I focused on understanding what was going on with services deployments, I was perfectly content to have him get us to the point where we could do that Cloud Foundry install.  In other words, he installed BOSH itself.  Recently, however, when he and I were trying to track down an issue we were having with BOSH, I found I really wanted to understand how BOSH is installed.  The short answer is, as reported on many blogs, that it’s just like installing Cloud Foundry – that you install BOSH using BOSH.  But what does that really mean? What are the details?  In this blog I’m aiming to explain WHAT happens when you go through the install steps. You can find those steps here or here, and my goal is not to repeat those instructions, but rather to explain a bit further what’s going on.

The first thing you do is some setup on the IaaS, in our case vSphere: networks, folders, etc.  The folders will be used by various installation processes to hold files used during the install. That is, the IaaS itself is used to facilitate the installation, not just host the installed system.  This is a key point that I haven’t found stated elsewhere, but if you watch what is going on in your IaaS (i.e. watching your vSphere console) during a “bosh deploy” you’ll see machines getting created and ultimately deleted – they are in service of the deploy itself.

Okay, so now you have an IaaS environment ready that you can install BOSH onto.  BOSH is itself a distributed application, running on a multitude of machines working in cooperation.  In theory you could create the VMs you need (here it says you’ll need 6) and install the appropriate bits onto each of those machines – put the bosh director on one, the health monitor on another, the message bus on another, and so on. But wait, isn’t that what BOSH is meant to do, provision VMs and install a distributed application onto them?  It is, and that is why the BOSH team has provided something that allows you to install BOSH using BOSH.  The trick then, is how to bootstrap the thing.

[Image: check out the video behind this image - very cute]

This is done with two things.  First, the BOSH team has essentially created a virtual machine that has all of BOSH already preinstalled on it. That’s right, BOSH director, the health manager, workers, message bus, etc – all of it on a single VM.  And second, they have given you a tool to install that virtual machine into the IaaS environment that you set up in the first step.  So the next few steps in the instructions have you download the micro-BOSH stemcell (essentially the VM) and install a bosh cli & deployer. The cli has the protocol for interacting with the IaaS built into it, so you can just issue a few cli commands and it does the micro-bosh deployment for you. Some instructions have you create a virtual machine from which you do these steps, but this isn’t strictly necessary; you just need an Ubuntu machine with ruby, git and a few other things on it.  I personally always have Ubuntu VMs I am using for various development activities, and they are perfectly suited to this set of steps.  So after you’ve installed micro-BOSH following the instructions here or here, you actually have a BOSH up and running.  You can now take “releases” which contain things like packages, jobs and deployment manifests and deploy them with BOSH. And that install was really easy!
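In terms of actual commands, the bootstrap boils down to something like the following, run from that Ubuntu machine. The gem names are the ones I remember from this era and the paths and stemcell file name are placeholders – the official install docs linked above are the authority:

gem install bosh_cli bosh_deployer
bosh micro deployment path/to/micro_bosh/     # directory containing your micro_bosh.yml manifest
bosh micro deploy micro-bosh-stemcell.tgz
bosh target https://<micro-bosh-ip>:25555     # and you have a working director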

But, hang on, the instructions go on – they have me install BOSH.  But, didn’t I just install BOSH? Yes, you just installed (micro) BOSH. And yes, you now need to install (full-blown) BOSH.

Remember that micro-BOSH is deployed on a single VM.  If you are deploying (and monitoring!! – but I’ll talk about that more some other time) a distributed application that isn’t too big or complicated, you could use micro-BOSH and it would work reasonably well.  But ultimately our goal is to deploy more significant applications, like Cloud Foundry, and for that a single VM BOSH just couldn’t cut it.  But the good news is that we now have micro-BOSH to help us install the more sophisticated BOSH deployment.

One of the components in that micro-BOSH deployment is something called the BOSH director, which, as the name implies, orchestrates all of the things that BOSH does for you. So to deploy anything with BOSH you basically tell the BOSH director what you want to do, where it can find the bits it needs for the job, and what environment you want to install those bits into. Let’s take those things in reverse.

Remember that BOSH will stand up VMs onto which it then installs things.  What does that VM look like?  This is where stemcells come in again.  The instructions have you download, and then upload via micro-BOSH, the bosh stemcell.  The BOSH stemcell is different from the micro-BOSH stemcell in that it contains only a base operating system plus a BOSH agent.  Recall that the micro-BOSH stemcell has all of BOSH on it.  I find it a bit ironic that the micro-BOSH stemcell is actually much “bigger” than the (full) BOSH stemcell.  Really! Have a look at the file sizes for the two stemcells you’ve downloaded.

The bits that are needed for the application installation are bundled in a release.  You find the BOSH release in the bosh repository on GitHub.  The structure of a BOSH release has been covered in numerous blogs and videos, and I won’t cover the details here. The instructions will have you git clone this repository to get all the bits you need; you’ll have to modify a few settings for your vSphere environment.

Finally, you need to tell the BOSH director what you want to do, and this is accomplished with a series of bosh cli commands, as covered in the installation documents.  What is important to note in the instructions is that when you are installing BOSH (using micro-BOSH) you target the BOSH director that is a part of micro-BOSH.  Once the (full) BOSH install is done you need to change which BOSH director you are targeting.  If you don’t and you try to do something like a Cloud Foundry deployment you’ll be attempting that with the micro-BOSH and that is not likely to go well.
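Put in terms of cli commands, the flow looks roughly like this (file names are placeholders and the precise sequence is in the install docs):

bosh target https://<micro-bosh-ip>:25555     # the micro-BOSH director does the work
bosh upload stemcell bosh-stemcell.tgz
bosh upload release bosh-release.tgz          # or bosh create release from the cloned repo, then upload
bosh deployment bosh.yml
bosh deploy
bosh target https://<bosh-director-ip>:25555  # don't forget to retarget the full BOSH director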

Bonus track: Deploying the Echo Service via the service_broker

Part IV in a three part series ;-)

In Part II I went into a little bit of detail on how the echo service gateway and node work, in collaboration with the actual echo server. Or rather, in this case, how the node does not work with the echo server at all.  The echo_node implementation simply looks up some metadata values from its config file and returns them, via the gateway, to the requesting party (the dea).  You might question why the node was in there at all.  The answer is that in this case you don’t really need it.

Included in the vcap-services repository is an implementation for a service broker which can be used to act as a gateway to services that are running outside of cloud foundry.  Recall from part I in this series that the echo server, the java application that listens on a port and parrots back what it hears there, really doesn’t have anything to do with cloud foundry; again, not even the echo_node communicates with it.  So really, you can think of it as a server running external to cloud foundry.

The service_broker is a service_gateway implementation that stands up a RESTful web service implementing all of the resources that any service gateway does; resources that allow you to create or delete a service instance, bind or unbind it from applications, etc.  In this case, the gateway implementation simply offers a metadata registry with entries for each of the brokered services. In a “normal” cloud foundry service implementation, in response to these web service requests the gateway will dispatch a message to NATS, a node will pick it up, fulfill the request and communicate back to the gateway, which in turn responds to its client.  In the case of the service_broker there is no node and hence no need to dispatch messages onto NATS.  The web services of the gateway simply look up the values in its registry and return them.  This raises the question of how things get into that registry; the service_broker gateway offers a RESTful web service resource for the registry itself, supporting POST to add things and DELETE to remove them.  You can either issue those service invocations yourself or you can use the service broker cli.

So here’s what I did to go through the exercise of deploying the echo server as a brokered service:

Step 1: Start with a vcap devbox, following the instructions you find in the vcap repository, EXCEPT that you also need to start up the service broker; see this stackoverflow thread for how to do that.

Step 2: Register the echo service with the service_broker.  I used the service_broker_cli, which will already be on the devbox you set up in step 1.  Running the cli with no arguments will register what is found in the services.yml file in the config directory, so instead I pointed it at my own config file and ran it as follows:

bin/service_broker_cli -c config/echobrokeredservice.yml

with the contents of the echobrokeredservice.yml as follows:

---
service_broker: http://service-broker.vcap.me
token: "changebrokertoken"
service:
  name: echo
  description: cloud foundry sample service
  version: "1.0"
  options:
    - name: service
      acls:
        users: []
        wildcards: [*@emc.com]
      credentials:
        host: 192.168.1.150
        port: 5002

Notice the “credentials” section – these are the same values that were in the echo_node.yml file in part I, but now, instead of the echo_gateway dispatching to NATS and the echo_node simply returning the values from the echo_node.yml file, the service_broker gateway just looks those values up in its registry and returns them.

When you now run a

vmc info --services

you should see the echo service listed.

Step 3: Deploy your client app and bind to the echo_service.

Seriously.  That’s it.

But before you think, “why did I go through all of those other hairy steps with the node and gateway and BOSH, etc.?,” remember that this echo server is just a sample, and a very, very simplified one. Servers that run within cloud foundry (and are hopefully deployed with BOSH) benefit from management, monitoring, logging and other cloud foundry capabilities, and the gateway/node combination is a loosely coupled mechanism for offering lifecycle operations on those services.  A very valuable part of the cloud foundry services story.  That said, there is clearly value in brokering external services and that’s why you have this bonus track. :-)

The Echo Client

So just a bit more on step 3 from above. The Echo client application that was posted all the way back on the original support forum article is a bit rigid.  It accesses the details of the bound services through the VCAP_SERVICES environment variable, and that’s fine, but it then looks for a service type named (exactly) “echo” and an instance name of (exactly) “myecho.”  This hard coding always bugged me, and because of the way that brokered services are named as a part of their configuration, I finally did something about it.  It required some changes to JsonParseUtil.java, where I now take in a string and look for a service type name and instance name that contains that string.  So in an updated version of the echo client source code that you can find here, the client will look for a bound service with the type and the instance name containing the string “echo”; so the service type of “echo_service” with a name like “echo_service-a302” fits the bill.  Oh, and one other slight addition to the app – this one now prints the contents of the VCAP_SERVICES environment variable – helps when you are first getting started with such things.