Concept, Goal, Development and Support

If you are looking for how to get and use this, start here.

Skip to Development | Support

What the device is.

This is a device created as a cost-effective way of diagnosing issues with networks.

The device is essentially a small unit built on commodity hardware, running a custom-built Debian Linux operating system
and some specially built software for monitoring the network traffic of applications.

The devices are designed to be quick to assemble and set up, simple to deploy, and as maintenance-free as possible.
At their core is an open source base, on which are built some very specific tools for rapid analysis of feed captures.

The software on the device includes a proprietary microburst analyser written in C++ (so it's really fast)
which is designed to perform rapid and accurate microburst measurement on large PCAP format capture files
over a wide range of window sizes down to 1 ms.
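
To illustrate what measuring over a range of window sizes means, here is a minimal sketch in Python. It is not the proprietary C++ analyser. It assumes the packets have already been read out of a PCAP file as (timestamp in microseconds, length in bytes) pairs, and the function names and the sample burst are made up for the example.

```python
from collections import deque


def peak_bits_in_window(packets, window_us):
    """Largest number of bits seen in any sliding window of window_us microseconds.

    packets: list of (timestamp_us, length_bytes) tuples sorted by timestamp.
    """
    window = deque()
    bits = 0
    peak = 0
    for ts_us, length in packets:
        window.append((ts_us, length))
        bits += length * 8
        # Expire packets older than the window that ends at this packet.
        while window[0][0] <= ts_us - window_us:
            _, old_length = window.popleft()
            bits -= old_length * 8
        peak = max(peak, bits)
    return peak


def peak_rates_bps(packets, windows_us=(1_000_000, 100_000, 10_000, 1_000)):
    """Peak rate in bits per second for each window size (1 s down to 1 ms)."""
    return {w: peak_bits_in_window(packets, w) * 1_000_000 / w for w in windows_us}


# Hypothetical burst: ten 1250-byte packets 100 us apart, then one packet a second later.
burst = [(i * 100, 1250) for i in range(10)] + [(1_000_000, 1250)]
rates = peak_rates_bps(burst, windows_us=(1_000_000, 1_000))
```

For this made-up burst the peak over a 1 s window is 100 kbit/s, while over a 1 ms window it is 100 Mbit/s. That gap between the average and the short-window peak is the kind of information needed when tuning a policer.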

What the device is not.

The device is not something which can currently be purchased from any other supplier.
These devices do not exist as a product anywhere else for any price.

The devices are not intended to compete with any existing product in any way.
They were created with a very specific purpose in mind and are not intended to replace any existing solution.

The aim of the microburst analyser is to extract, from the application traffic in the PCAP files stored on the device,
information suitable for tuning policers.

Why does this or similar not exist?

I get asked this a lot. This is my explanation:

  1. It was developed in close interaction with a high-profile customer's staff, building a requirement for a solution at their time of need,
  2. It was developed from a good understanding of what is practical and robust on an enterprise platform,
  3. Very experienced software developers created and tested it on live ticker feeds,
  4. This gave us the ability to create and test solutions which had never been developed previously and to compare the results with other products on the market.

This was developed from within the support team on a premier account with a financial customer.
The team brought previous experience of embedded software engineering in the mobile market and of developing many enterprise products.

Companies like Corvil, Netscout and Accedian are focussed on maximising profit (just as all companies are),
which means they may well not be attempting to build something which is low cost.
A low cost device is not as profitable as something with a premium price that other people are willing to pay.

Also, the people at these companies may have the understanding of what can be done,
but they do not have the ear of the customer at the time it counts: i.e. when they want something solved quickly.

We, on the other hand, had the ability to bounce ideas off the customer with a view to getting their problem solved more quickly.
This meant they were willing to brainstorm a bit and thrash out a requirement for a product which would provide what they asked for.

Concept

The original point of this was to fill a void in diagnostic abilities simply and with the minimum of resources.
The void is the ability to capture customer traffic for long periods.

This void still exists and this is the only product which fills this void.

In order to do this it needs to be in place and capturing before the problem is seen.
This means we need to have them in all locations.
This is the main reason it has to be cheap.

The point of the project is to have a large number of these monitoring everywhere for long periods.
Keeping the unit cost down means we can absorb the cost through the reduction in time spent solving problems alone.

Goal

The goal of this project is to capture every packet going across a network and store it for a month.
Thus when customers ask what happened to packet X at time Y you can provide an answer.
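
As a rough idea of what storing a month of traffic implies, the sketch below sizes the disk from an average traffic rate. The 10 Mbit/s figure is purely an assumption for illustration, not a measurement from any site, and the function name is made up. The estimate ignores per-packet PCAP record headers and any compression.

```python
SECONDS_PER_DAY = 86_400


def capture_storage_bytes(avg_rate_bps, retention_days=30):
    """Bytes of disk needed to keep every packet for retention_days.

    avg_rate_bps: average traffic rate in bits per second (an assumed input).
    """
    return avg_rate_bps * SECONDS_PER_DAY * retention_days // 8


# Hypothetical site averaging 10 Mbit/s, kept for 30 days.
needed = capture_storage_bytes(10_000_000)
needed_tb = needed / 1e12  # decimal terabytes
```

On that assumed average the requirement is 3.24 TB for 30 days, which is within reach of a single commodity disc drive.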

The goal of this project is NOT to have equipment available to be deployed for troubleshooting.
Others have equipment for this already.

Why is deploying equipment per incident not enough?

Deploying equipment after the event means we have already missed the boat.
The event the customer is concerned about has already passed and cannot be captured again.

This has a number of direct effects which mean you end up spending more time (and therefore money) dealing with the customer's problem:

  1. Deploying equipment itself takes time and involves data-centre technicians,
  2. You need to witness a repeat event, which takes time,
  3. More people (e.g. Managers/Directors) become involved in the customer's problem as they escalate it.

Reducing this would cover the cost of this product.
There are also secondary effects which would result in covering the cost:


The goal is to raise the bar of the services we provide, which would put us in a position to charge a premium for them.

Development

Initial stages

Development started when the first units were purchased and assembled for use on a specific customer problem.
These units had a relatively green install of Xubuntu (a Debian Linux variant) with some open source tools for capturing.

At the time it was not known how these could be deployed by datacentre technicians,
so Eddie (network engineer) and I went down to UK1 to install the first one.
We worked with Bryan Franke in the US to get the machine set up.

The second unit was set up at Deutsche Bank, Croydon by me (Graham North).
I was working with an on-site technician and gauging his reaction to the deployment.
He seemed quite happy with the installation of the device.

The third device was installed in SG8 (Singapore) by the technicians and was the first device to be deployed this way.
There was a problem with a missing converter, but once this was dispatched, that device was online too.

Second Phase

All 3 of these required some manual intervention on-site in order to get the unit online.
So, using that knowledge, I took some time to develop an image for the device kit
with the aim of having a unit which can be set up without any remote hands,
other than physically cabling up the kit.

The next deployments (TK1 and TK2) used this new image, and it worked well.
Although the kit was now accessible, it still required someone to log on to the kit to start captures,
so I adapted the image to configure the device to start capturing as soon as it was deployed.

I have also created a PXE image deployment system (which includes release notes) so the devices can be prepared quickly.
This is currently on my desktop as development is still continuing.

I have refined the process so that it now takes less than 30 minutes from receipt of the parts
to being in the mailroom ready to go onsite.

Third Phase

A central access point is planned to make it quicker to download captures.
This also means we can add network intelligence to spot patterns of problems.

Support

A key issue raised about this project, and seemingly the main problem currently, is support.
The concerns seem to fall into 2 key areas:
  1. device failure
  2. volume management

To expand on these: the concern is that we would be relying on a tool which may have a limited future.
These concerns are actually unfounded, as is explained below.

Device Failure

First: the devices are built from mature off-the-shelf PC-based parts, so parts becoming obsolete is not an issue.
If individual parts do become obsolete, they can be replaced with newer-generation ones as these become available.

Secondly: the units themselves are low cost.
I am currently dispatching these in pairs to the same sites, so if one fails the other can be used.

In the event of failure, a new device is deployed to replace it and the failed device is returned so its parts can be salvaged.
Alternatively the failed device can simply be disposed of locally.

So far, with many devices actually in the field, some deployed up to a year ago, we have had no failures.
So the MTTF is currently over 1 year; the estimate is 8 years, based on the MTTF of the single moving part: the disc drive.

Volume Management

This brings us neatly to the volume management problem.

The concern here is that these devices may require many resources to maintain them and the service they provide.
I developed these devices to be independent units that do a single task: capture traffic over long periods.

The current image does this as soon as it is powered up.
If the device seems to be in a failed state, rebooting would put the unit back into service.

These devices are accessible principally using ssh, as with all our network and server kit.
Accessing the device for maintenance is no different from accessing any other piece of network kit.

There are currently 15 devices in the field and we have spent no time maintaining the equipment beyond initial deployment.
Maintenance time is therefore currently zero, although if this were scaled to, say, 200 devices, one would expect some failures.

If we say MTTF is 1 year: for 200 devices this is about 1 failure every 2 days. The estimated MTTF of 8 years puts this at about 25 failures a year, or roughly 1 every 2 weeks; 1 per week is used below as a conservative figure.

As stated above, the time to deploy is currently 30 minutes, so this is 30 minutes per week of maintenance time for 200 devices.
This would be offset against the time saved closing cases.
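
The failure arithmetic above can be checked with a short sketch. It assumes failures are spread evenly over time (fleet failure rate = devices / MTTF) and that each failure costs one 30-minute replacement; the function names are made up for the example.

```python
def expected_failures_per_year(devices, mttf_years):
    """Expected fleet failures per year, assuming a constant failure rate."""
    return devices / mttf_years


def days_between_failures(devices, mttf_years):
    """Average number of days between failures somewhere in the fleet."""
    return 365 / expected_failures_per_year(devices, mttf_years)


def maintenance_minutes_per_week(devices, mttf_years, minutes_per_replacement=30):
    """Average weekly time spent preparing replacement devices."""
    failures_per_week = expected_failures_per_year(devices, mttf_years) / 52
    return failures_per_week * minutes_per_replacement


pessimistic_gap = days_between_failures(200, 1)   # MTTF of 1 year
estimated_gap = days_between_failures(200, 8)     # estimated MTTF of 8 years
weekly_minutes = maintenance_minutes_per_week(200, 8)
```

On these assumptions, 200 devices with a 1-year MTTF fail about once every 1.8 days. With the 8-year estimate it is about once every 14.6 days, or roughly 14 minutes of replacement work per week on average, so allowing 30 minutes per week is on the safe side.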

Premium service charging

At an operational cost of $1000/hour to the customer, this would be $500 per week in charges for the premium service.

As a test case we proposed $1,500 deployment costs and $150 monthly cost per unit.
The customer was happy with those charges.

This means that offering this service for just 5 customer products would cover all the maintenance costs worldwide.
The charge to add the service for 40 customer deployments would cover all the costs of rolling this out worldwide.

Beyond that, the reduced time spent closing cases would be pure profit for us.
We would also probably see a decrease in churn and an increase in new business.