All posts by Jeff

Running Graphite and Grafana in Docker

I’ve been playing with Docker quite a bit lately and think it is some really cool stuff.  Check out the blog of one of my colleagues for more information on using it.  In short, entire application environments can be pulled from a catalog and started on a host with a single command.

I decided to find some Graphite and Grafana images and modify them for use specifically with Harvest and OPM.  The Graphite image has custom retention periods in the storage-schemas.conf for Harvest and OPM, and the Grafana image is streamlined to work with this particular Graphite image.  You can now have Graphite and Grafana up and running with just a few commands.  All the hard work is out of the way!

  1. Get a few utilities to make things a little easier and install Perl and some libraries that we’ll need later.


    yum install epel-release nfs-utils nano wget net-tools perl-Time-HiRes perl-XML-Parser perl-Net-SSLeay perl-libwww-perl device-mapper-event-libs -y


    sudo apt-get install nfs-common wget libwww-perl liblwp-protocol-https-perl libxml-parser-perl libmath-round-perl
  2. Install Docker and start the service.  You may need to disable the firewall on CentOS/RHEL.


    yum install docker-io
    service docker start


    wget -qO- | sh
    service docker start
  3. Create /opt/graphite/storage/whisper directory so that the Docker Graphite container can store data in the host file system rather than within the container itself.  This allows a lot more flexibility for later.
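A minimal sketch of that step on the Docker host (prefix with sudo if you are not running as root):

```shell
# Create the whisper data directory on the host so the container can
# bind-mount it and the data survives container rebuilds
mkdir -p /opt/graphite/storage/whisper
```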
  4. Pull and run the Docker image for Graphite.  This command will pull and launch the Graphite image using TCP/8080 as the port to access graphite-web and expose TCP/2003 to send data to carbon-cache.  The reason we use TCP/8080 for graphite-web is so that we can use TCP/80 for Grafana.  There are several Graphite images on the Docker registry, but I modified this one to set the retention periods for Harvest.
    docker run -d \
    --name=graphite \
    -v /opt/graphite/storage/whisper:/opt/graphite/storage/whisper \
    -p 8080:80 \
    -p 2003:2003 \
    -p 8125:8125/udp \

    Verify it is running by browsing to http://<dockerhost>:8080.  You should see the default Graphite browser.
  5. Pull and run the Grafana image.  This image uses the Graphite container in the previous step as the data source and InfluxDB for dashboard storage.
    docker run -d \
    --name=grafana \
    -p 80:80 \
    -p 8083:8083 \
    -p 8086:8086 \

Browse to http://<dockerhost> and you should see the Grafana default screen.

OPM External Data Provider Update

For those that are having issues with OPM forwarding incomplete datasets to Graphite, or if it has just stopped sending data altogether, here is an update.  Set the “Transmit Interval” parameter to 10 minutes instead of 5.  I’m not sure it is necessary, but restarting the OPM services or simply rebooting the OPM server is probably a good idea.

Don’t forget to modify your storage-schemas.conf file for the 10 minute interval.  The steps I went through to make sure that everything is right going forward:

  1. Stop the carbon-cache service.  It may be called “Graphite” on some distributions.
  2. Modify storage-schemas.conf to reflect 10 minute intervals.
  3. Delete all of the previously generated whisper files in the “netapp-performance” directory.  This is usually at /opt/graphite/storage/whisper.  If you have changed the “Vendor Tag” parameter in OPM you should obviously delete the files that match that directory instead.  If you really want to save this data, look at the whisper-resize.py script that ships with Whisper; it can change the data layout of existing whisper files.
  4. Restart the carbon-cache service.
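For reference, a storage-schemas.conf stanza for the 10 minute interval might look like the following; the section name and pattern here are illustrative, so match them to your own vendor tag:

```ini
[netapp-performance]
pattern = ^netapp-performance\.
retentions = 10m:90d
```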

In my lab where I have 6 nodes and a handful of SVMs and volumes, full population took about an hour.  You can speed this up if you increase the value of the MAX_CREATES_PER_MINUTE parameter in the carbon.conf file.  I would just wait though.  If you do change that parameter, remember to restart carbon-cache.

What’s In a NAME?

NetApp customers that heavily utilized MultiStore (vFilers) are aware that naming conventions are important, but with clustered Data ONTAP, naming conventions are critical.  Many administrators name objects, such as volumes, after the application or purpose of the data being stored.  Great!  Keep doing that, but we need to do a little more. 

One of the things that I always do is prepend the name of the SVM that owns the volume to the volume name.  When cluster admins do a volume show from the ONTAP CLI, they get the SVM name in the results table for the volumes.  Similarly, SVM admins only connect to the particular SVM they manage so the SVM name isn’t really necessary.  Everything is great as long as we stay in the clustershell CLI or System Manager.  We simply don’t do that though.  OnCommand Unified Manager, OnCommand Performance Manager, and the nodeshell CLI make matching a volume to an SVM a little more difficult. 

Let’s take a look at what I’m talking about.  I fired up my trusty simulator and created 3 SVMs named SVM1, SVM2, and SVM3.  Each of these SVMs has a volume named vol1_NFS_Volume.  From clustershell everything is pretty easy to keep straight.  While the volume names are the same, the name of the SVM that owns the volume is in the field beside it.


What happens when we drop into nodeshell and try a vol status command?


Since nodeshell doesn’t understand the SVM concept and volumes can’t have identical names, numbers are appended to the volume names.  Granted, there aren’t a lot of reasons to need to do this, but it is still nice to be able to instantly know what you are looking at without needing a secret decoder ring or spreadsheet.

The only real downside to this is that the mount points and exports are a little long and complicated by default.  Of course those can be changed if you don’t mind the mount points and junction paths not matching the volume names. 

Similarly, LIF and aggregate names should also have some sort of naming convention that allows identification on sight.  For example, in my lab I’ve named the aggregates after the node that owns the aggregate and the size and type of disk in the aggregate.  (FP means FlashPool; I could’ve gone further and added the disk type, disk size, and SSD size, but that is overkill for my environment.)  If I rehome an aggregate permanently using ARL, I simply rename the aggregate to reflect the new node name.


Also remember the things that shouldn’t be in the naming convention.  For example, it may seem like a good idea to include the name of the aggregate in the volume name to help identify that too.  This isn’t a good idea because of NDO.  Volumes are meant to be portable, so that name can easily be misleading after a volume is moved to another node or aggregate. 

Please use some naming convention; I really don’t care which one.  As I recently told a colleague, as long as you are at least thinking about it and work through some of the possibilities, you are most of the way there and are going to be in good shape – as long as you stick to it.  Follow these guidelines and come up with a naming convention that works for you.  The second part of that is the most important.  The rule that I follow is modeled after a quote from Albert Einstein – “Everything should be made as simple as possible, but not simpler.”

Performance Dashboards Step 2 – Installing and Configuring Graphite

Now that OnCommand Performance Manager (OPM) is ready to send data, we need to configure something to catch that data. Graphite was originally written way back in 2006 or so. Since then it has become the most widespread graphing solution of its type and is in use at many enterprises. The Graphite data format has become as close to a standard as there is in this space, so many things work with it. One of the reasons for the data format popularity is its simplicity. Timestamp, metric, and value are the only things in the datastream.
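To make that simplicity concrete, here is a sketch of Graphite's plaintext protocol in Python; the metric path and the carbon host in the comment are made-up examples, not anything exported by OPM:

```python
import time

def format_metric(path, value, timestamp=None):
    # Graphite's plaintext protocol: one line per datapoint, in the form
    # "<metric path> <value> <unix timestamp>\n", sent to carbon on TCP/2003.
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

line = format_metric("netapp-performance.cluster1.vol1.read_latency", 0.42, 1420070400)
print(line, end="")
# To actually send it, open a TCP connection to the carbon host, e.g.:
#   import socket
#   with socket.create_connection(("graphite.local", 2003)) as s:
#       s.sendall(line.encode())
```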

Graphite runs on most flavors of Unix or Linux, but the most common platforms are Ubuntu and CentOS or RHEL. There are many good articles out there on how to install Graphite on these platforms. Your search engine of choice can find some for you, but I’ll put links at the end of this post pointing to the ones I’ve found to be the best or most helpful. There are also Docker images to download and use if you like. It also helps to read up on the architecture of Graphite, because the Carbon and Whisper components are something you’ll need a high-level understanding of; the project’s overview documentation is a good place to start.

Things to think about before installation

What I really want to talk about are some of the configuration choices that you’ll need to make. One of the first things to think about is your retention policy. If you choose incorrectly, you run the risk of exhausting all of the disk space on the server within minutes of receiving your first measurements. So before we make our choice on retention policy, let’s talk about how Graphite stores data.

Important Configuration Choice #1 – Configure your retention periods for collected measurements before you start to receive data, but do so wisely.

Whisper is the database component of Graphite that stores our measurements. It is similar to RRD (used by MRTG) but overcomes some of the limitations of RRD; see the Whisper documentation for more information. Whisper stores each metric gathered in its own file. When Whisper receives the first measurements, it looks for matches in the retention policy file, storage-schemas.conf. The matches are found using regex, so you can specify different retention periods for different metrics. In this file you specify how often datapoints should be stored and for how long. Whisper then pre-allocates all of the space for the database file that matches that particular metric. Each datapoint stored is 12 bytes. The default policy is to store a measurement every 10 seconds and keep those measurements for 90 days. Simple math (90 days is 7,776,000 seconds, so one datapoint per 10 seconds gives 777,600 datapoints) means roughly 9.3 MB of disk space per metric is used the first time Graphite receives a measurement of that metric.
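The arithmetic is easy to sanity-check; this sketch assumes 12 bytes per datapoint and ignores whisper's small per-file and per-archive headers:

```python
def whisper_archive_bytes(retention_seconds, interval_seconds, bytes_per_point=12):
    # One datapoint is stored per interval across the whole retention window,
    # and the file is fully pre-allocated up front
    points = retention_seconds // interval_seconds
    return points * bytes_per_point

ninety_days = 90 * 24 * 60 * 60                  # 7,776,000 seconds
print(whisper_archive_bytes(ninety_days, 10))    # 777,600 points * 12 bytes
```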

9.3 MB doesn’t seem like much, but consider how many metrics you are collecting. At the highest detail level that OPM exports, you are getting 8 metrics per volume, 11 metrics per LUN, 6 metrics per NAS LIF, etc. It is easy to get to over 500 metrics gathered per cluster. At 500 metrics, that is over 4.5 GB used in the first hour that this is running. Add a volume, about 75 more MB used. Add a LUN, over 100 MB used the first time it is measured.

Further, the way Whisper works is that changes to the retention period don’t take effect until you run a script that resizes the database files. This is why it is wise to set your retention policies before you receive any data. If you charge into this half-wise (as I did initially) and decide to change the default from 10 seconds:90 days to 10 seconds:5 years, you get about 190 MB per metric. Tracking about 1,000 metrics for that period means over 185 GB of data, and I didn’t even factor in leap years. That doesn’t work too well on a VM that has an 80 GB disk.

And honestly, trying to keep 5 years of data at a 10 second granularity was massive overkill. Are most of us going to go back 4 and a half years to see what read latency on a particular LUN was between 9:36 AM and 9:47 AM? Probably not. Luckily, Whisper can set multiple retention times per measurement. This means we can keep data at different granularity levels for different periods. I would’ve been much smarter to create a retention policy that says something like 10 seconds:90 days, 5 minutes:1 year, 1 hour:5 years. That gives me 5 years of data (at a much lower granularity level) for just over 11 MB of space per metric. Whisper doesn’t calculate values on retrieval; it actually stores one archive for each retention period, which is why more space is used than a single archive would need. A simpler way to think of it: the retention periods are additive per metric. See the Graphite documentation on storage-schemas.conf for details.
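Such a tiered policy could be expressed in storage-schemas.conf like this; the section name and pattern are illustrative and should match your vendor tag:

```ini
[netapp-performance]
pattern = ^netapp-performance\.
retentions = 10s:90d,5m:1y,1h:5y
```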

Of course if you wanted to actually retain 10 second measurements for 5 years, NetApp storage is the place to do it. In the beginning our compression and deduplication rates would be phenomenal!

Important Configuration Choice #2 – Plan your Graphite architecture before receiving data in large environments.

Remember that OPM will forward data to exactly one external server. Carbon, the data receiver component of Graphite, can forward metrics that match a particular regex pattern to another or even multiple servers. This is done by configuring the Carbon Relay service. This can all be done later, but without a lot of planning, you may end up with gaps in the collected data.
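As a sketch, relay forwarding is driven by carbon's relay-rules.conf; the destination hostnames here are examples, not real servers:

```ini
# relay-rules.conf sketch: forward matching metrics to two carbon caches
[netapp]
pattern = ^netapp-performance\.
destinations = 127.0.0.1:2004, graphite2.example.com:2004

# Everything else goes to the local cache only
[default]
default = true
destinations = 127.0.0.1:2004
```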


Actually Installing Graphite on Ubuntu

These are the steps that I use to install Graphite. Anything that is non-default I’ll give a reason for. Don’t feel that I’ve necessarily made the best choices, although I did research things before I did them. If using a VM, plan the size of your virtual disk using the factors mentioned above. I did this on Ubuntu Server 14.04.1 without incident, and I only know enough Linux to play an admin on TV.  The indented lines are console commands or text from files that you’ll be editing; for text file edits, the indented text is what you add or change.

  1. Install Ubuntu Server and choose only OpenSSH and LAMP Server as the components installed. Also, set the hostname and configure DNS so that resolution is working both on the server and from the clients.
  2. After installation at the first boot of the server, run
    sudo aptitude update
    sudo aptitude safe-upgrade OR sudo aptitude full-upgrade
    Note that if you are like me and don’t like typing “sudo” before the real command, just run “sudo -s” after you login and you’ll be root and not need to use sudo for the rest of the session.
  3. Install Graphite and the required Apache modules. Also install wget because we’ll use that later to install some other software.
    sudo apt-get install graphite-web graphite-carbon libapache2-mod-wsgi wget
    On my server it pulled in a lot of dependency and suggested packages: 24 new packages using 45.9 MB of additional disk space.
    It will ask if you want to keep the whisper files should you ever remove Graphite, in case you reinstall. I chose yes, but this isn’t really that important either way.
  4. Initialize the Graphite database.
    sudo graphite-manage syncdb
    It will ask to create a superuser for the Django subsystem. Answer yes and keep track of the username and password that you use because you need it for the next step.
  5. Edit /etc/graphite/ to reflect the credentials created in the last step. I used “root” for the username and “graphite” for the password. It should look something like this:
            'default': {
            'NAME': '/var/lib/graphite/graphite.db',
            'ENGINE': 'django.db.backends.sqlite3',
            'USER': 'root',
            'PASSWORD': 'graphite',
            'HOST': '',
            'PORT': ''
  6. Synchronize the database again and make sure there are no errors.
    sudo graphite-manage syncdb
  7. There are a few permissions that need to be changed for Graphite to run properly.
    sudo chmod 666 /var/lib/graphite/graphite.db
    sudo chmod 755 /usr/share/graphite-web/graphite.wsgi
  8. Edit the file “/etc/default/graphite-carbon” to enable the Carbon component of Graphite to start automatically when the server boots. Simply change the word “false” to “true” so that it looks like this –
    # Change to true, to enable carbon-cache on boot
    CARBON_CACHE_ENABLED=true
  9. Now start the carbon service.
    service carbon-cache restart
  10. Now we need to make a couple of changes to Apache to serve Graphite. We are just going to link a file included with Graphite into some Apache configuration directories.
    ln -s /usr/share/graphite-web/apache2-graphite.conf /etc/apache2/sites-available/

    ln -s /usr/share/graphite-web/apache2-graphite.conf /etc/apache2/sites-enabled/

  11. Since we are running Graphite and Grafana on the same server, and Grafana is the interface we’ll be in most of the time after setup, we should change Graphite to run on some port other than 80. I typically use 8080. That way the end users of Grafana won’t have to worry about browsing to a subdirectory. To do that we’ll change the /etc/apache2/ports.conf file and the /usr/share/graphite-web/apache2-graphite.conf to use a different port for Graphite.
    In /etc/apache2/ports.conf add a line just under the line that says “Listen 80” that says “Listen 8080” so that it looks like this –

    # If you just change the port or add more ports here, you will likely also

    # have to change the VirtualHost statement in
    # /etc/apache2/sites-enabled/000-default.conf

    Listen 80
    Listen 8080

    <IfModule ssl_module>

    Now update /usr/share/graphite-web/apache2-graphite.conf so that the first line looks like this-
    <VirtualHost *:8080>

  12. Now restart Apache to activate the changes.
    sudo service apache2 restart
    Now if you browse to the server on port 8080 you should see the Graphite interface. Browse the tree under the Carbon leaf to make sure you are getting data. Browse to the end of the tree and pick “committedPoints” and make sure the value is non-zero.
  13. The next few steps will be to install and configure InfluxDB. InfluxDB is used to store the dashboards we create with Grafana. None of our metric data is stored in InfluxDB so it won’t require much space or memory. Use wget to download InfluxDB and then install it using dpkg.

    sudo dpkg -i influxdb_latest_amd64.deb
    service influxdb start

  14. Create a user and database in InfluxDB to store the dashboards for Grafana. InfluxDB has an easy-to-use GUI that you access by browsing to the server on port 8083. The default username and password in InfluxDB is “root” and “root” – you probably want to change that.

    Next create a database and user for the dashboards. I’m very creative and original, so my database is named “dashboards” and I’m using “graphite” for both the username and password for this database.

    After you create the database, click on the database name and create a user. I made this user an admin user also. I’m not sure if that is required. When you create the user it will say admin is false, but wait a few seconds and refresh the page. Now we are finished with configuring InfluxDB, but remember the database name, username, and password that you just created for the next step.

  15. Download and install Grafana. Let’s back up the old index.html that came with Apache as the first step, then we download and install Grafana.
    cd /var/www/html
    sudo mv index.html index.html.old

    Then browse to the Grafana download page to see the latest version and download link. Copy the link address for the tar file and paste it into your terminal window and download it with wget. I’m getting version 1.9.0.
    sudo wget
    Extract the contents of the tar file.
    sudo tar xfzv grafana-1.9.0.tar.gz -C /var/www/html/ --strip-components 1
  16. Modify Apache to allow Grafana to connect to Graphite. We need to modify the Apache conf file, enable header, and then restart Apache. Modify /etc/apache2/apache2.conf to add the following lines just before the end so it looks like this –

    # Include the virtual host configurations:

    IncludeOptional sites-enabled/*.conf

    Header set Access-Control-Allow-Origin "*"
    Header set Access-Control-Allow-Methods "GET, OPTIONS"
    Header set Access-Control-Allow-Headers "origin, authorization, accept"

    # vim: syntax=apache ts=4 sw=4 sts=4 sr noet

    Enable header –
    sudo a2enmod headers

    Restart Apache once again
    sudo service apache2 restart


  17. Modify the Grafana configuration to connect to the local Graphite installation. Start by copying the sample Grafana configuration file in the /var/www/html directory to the name of the file that Grafana actually uses.
    sudo cp config.sample.js config.js
    Now we need to modify the configuration to use the InfluxDB database we set up earlier. The sample configuration that we copied has blocks of examples that are commented out. We are going to use Graphite as the datasource and InfluxDB for dashboards, so we’ll be mashing together the first two example sections. The easiest way to do this is to copy the InfluxDB example and the Graphite/Elasticsearch example to a text editor, make the edits, and paste that section back into the config.js file. Make sure that the section you paste back in is not commented out with /* and */ before and after the block you edited. It should look like this –
    return new Settings({
    /* Data sources
    * ========================================================
    * Datasources are used to fetch metrics, annotations, and serve as dashboard storage
    * - You can have multiple of the same type.
    * - grafanaDB: true marks it for use for dashboard storage
    * - default: true marks the datasource as the default metric source (if you have multiple)
    * - basic authentication: use url syntax http://username:password@domain:port
    */
    datasources: {
      graphite: {
        type: 'graphite',
        url: "http://graphite.local:8080",
      },
      grafana: {
        type: 'influxdb',
        url: "http://graphite.local:8086/db/dashboards",
        username: 'graphite',
        password: 'graphite',
        grafanaDB: true
      }
    },

  18. Test that Grafana is working. Just browse to your server’s address on the standard http port.

You now have a fully functional Graphite/InfluxDB/Grafana server ready to catch data. In my next post we’ll talk about what to do next and how to build a simple dashboard showing data exported from OnCommand Performance Manager.

Performance Dashboards Step 1 – Configuring NetApp OnCommand Performance Manager to Forward Data to an External Data Provider

The first step in getting your data into Graphite or any other application is to configure OnCommand Performance Manager 1.1 (currently in Release Candidate 1) to forward the data it gathers.  OPM 1.1 added a new option in the maintenance console – option 6) External Data Provider.

OPM external data provider step 1

If you take option 6 you will get another menu.  This is the External Server Connection Menu.  From here you can display, add, delete or modify the connection information for the external server.  You can also modify the configuration settings for the server, which I’ll cover in a moment.  Note the difference between “Server Connection” and “Server Configuration.”

OPM external data provider step 2

The Server Connection options allow you to configure where you are going to send the data.  To do that you need the target server IP address or hostname, target port, and level of detail to send.  Of course DNS has to be working properly on the OPM server, and the target server must be in DNS to use the hostname.  If you are using Graphite or a compatible listener, the default port is TCP/2003.  That is all that OPM needs to start forwarding, but that isn’t all there is to think about or configure.

Option 3) Modify Server Configuration lets you change what data is sent to the external server that you just set up.  There are 4 things to configure under this menu.  Here you choose what to send, how to tag it, when to send it, and whether or not to send it.

OPM external data provider step 3

Let’s go through each one of these in a little detail.

Statistics Group – This is the set of counters to send.  There are 3 options, numbered 0, 1, and 2.  Option 0 is Performance Indicators, which is a high level set of counters aimed at the total operations and latency of logical objects.  Resource Utilization, Option 1, deals with utilization of the hardware in the cluster at the node, port, aggregate, and disk levels.  It also includes all of the counters from the previous level.  The last option is Drill Down and that looks at the logical level – Volumes, LUNs, and LIFs – in addition to the previous two levels.  Another way to look at the levels is low, medium, and high with respect to the amount of data sent.

See the product documentation page on this setting for more detail.

Vendor Tag – This is the top level heading for data sent to Graphite.  In my example above you see the vendor tag is “netapp-performance” and you can verify that by looking at the Graphite tree.

OPM external data provider step 4

In that screen cap you can see that the “netapp-performance” heading matches the vendor tag used in OPM.  (You can also see that the statistics group used is “Drill Down” because I can see counters all the way down to the volume and LIF levels.)

IMPORTANT – If you keep the default vendor tag you will be able to use shared dashboards more easily.  The dashboards will be covered in another post coming soon.

Transmit Interval – This is how often to send data to Graphite.  OPM polls at 5 minute intervals so the values available are multiples of 5.  Keep it at 5 and you’ll send your data as often as it is gathered.  More detail, but more storage space required.  If you are deploying this for troubleshooting, keep it at 5.  If you are deploying this for long-term trending, then 10 or 15 minute transmit intervals will be better and reduce the storage space consumed by a factor of 2 or 3.
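The space savings follow directly from the datapoint count; a quick sketch of the arithmetic, assuming one datapoint stored per transmitted measurement:

```python
def points_per_day(transmit_interval_minutes):
    # One datapoint stored for every transmit interval in a day
    return 24 * 60 // transmit_interval_minutes

for minutes in (5, 10, 15):
    print(minutes, points_per_day(minutes))
# 5-minute transmits store 288 points per day; 10 and 15 minutes store
# 144 and 96, i.e. half and a third of the space.
```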

Current Transmit Status – Enabled/Disabled.  Pretty self-explanatory.  I don’t know what effect a non-responsive listener has on OPM when it tries to send data, so it might be best to disable sending data if you know the Graphite listener is offline.

And that is it.  That is all there is to configure on OPM to send data to Graphite or a compatible listener.  Our next topic will be configuring Graphite to accept and store the data.  This topic is important because of the way that Graphite stores the data; you can easily get into trouble if you don’t understand how things behave.

Performance Dashboards for Data ONTAP

As many NetApp customers have told us, the lack of performance monitoring and performance charts has been a real barrier to adoption of clustered Data ONTAP.  We actually do have a performance monitoring application in OnCommand Performance Manager (OPM), but it doesn’t let you see most of the information it collects and it lacks some other basic features.  We recognize some of these shortcomings and have done some interesting things to overcome them.

Starting in OPM 1.1, we can now forward collected data to an external system.  The format that we use is compatible with Graphite, an open source graphing application.  The Graphite data format seems to be the de facto standard for this type of use.  Graphite is easy to install and configure and available for most *nix environments.  You might get lucky and a customer is already running Graphite – I’ve only discussed this with a handful of my customers so far, but a sizable fraction is already using it.  For those that are, just configure OPM to forward the data (you only need the IP address and TCP port Graphite listens on) and a few minutes later data starts to appear in Graphite.  Configuring OPM to do this is about a 60 second process.

There are lots of documents available on the Internet for installing and configuring Graphite.  Installing on Ubuntu is pretty simple and straightforward.  CentOS/RHEL 6.5 is also pretty simple, but CentOS/RHEL 7.x is a little more difficult.  I’ll be adding my own instructions for this purpose in a later post.  Grafana installation and configuration is also pretty simple.

Setting this up and making it work isn’t that difficult, but setting it up correctly for long term use requires some work.  Some of the things that I’m working on are storage requirements for the collected data at various retention periods and polling intervals.  More detail on this is coming later this week, along with step by step instructions for upgrading and configuring OPM to forward to Graphite.

The Transition Tactician Kickoff

This site is focused on transitions to enterprise infrastructure technologies for those that already have them.  I’ll mostly be writing about the tactical decisions and activities required to successfully implement new infrastructures.  From time-to-time I may comment on strategies for adopting new infrastructure or whether or not you even should, but I have friends and colleagues that handle that much better than me and I’ll link to those instead of rehashing their posts.

I’m always open to suggestions about articles and simply suggestions in general.  While I can rarely definitively declare I know the best way to do something complex with many different paths to a strategic goal, with great confidence I know good tactics for technology transitions.  Continual improvement is always necessary, so please jump in if you see something I missed or have questions.  With even greater confidence I speak authoritatively about transition methods that are what we’ll just say are “less than optimal.”  Some of these tactics I’ve been personally involved in, but most of the bad experiences I know about were ones I’ve been called in to either save or simply determine where it all went wrong.

I chose the name of this site for three reasons.

  1. “Transition” was easy because of my work – I’m a Principal Architect at NetApp focused on clustered Data ONTAP transitions.
  2. “Tactician” was also pretty easy to choose.  I am a tactical thinker.  If you are interested in strategies for how new technology platforms and infrastructures will improve your business, I’m not the best consultant to work with.  However, if you are interested in how to adopt these technologies in the most efficient way, I will definitely have some valuable things to offer.  A totally different but just as important reason I chose “tactician” is that I love competitive sailing, and working on ways to get the boat to the finish line ahead of the others is challenging.  I love challenges.
  3. The domain name was available.

So while this post doesn’t actually have any content, you now have a clear understanding of what will follow.  I’ll be writing about practices and tactics that are both good and bad.  Sometimes I’ll even be correct about which are good and which are bad.