It's time I built a framework.
AppEngine is a good choice for hosting infrastructure as it's fairly inexpensive, supports Python, includes methods for user authentication, and provides simple database storage. It's also exceedingly handy to have integration with a few other Google products including BigQuery and Compute Engine.
On the other hand, AppEngine requires a lot of work to implement federated logins, secure form handling, extensible templating, and tight integration with other services, like Github and Twitter.
In the past, I've used GAE-Boilerplate to implement several sites, including StackGeek, TinyProbe and Utter.io. Each time I build a new site, I think about how much easier it would be if I had a slightly different framework at hand. I started contributing a little to GAE-Boilerplate last year, but quickly became disappointed with its lack of focus on the core features a startup site requires, and with how hard it was to set up.
I'm writing a framework on AppEngine primarily for personal reasons. Doing this will help me learn more about programming and hopefully give me the tools I need to quickly develop and deploy new ideas. A side effect is that it may also help other startups implement the first version of their product in a short amount of time.
All of this will be Open Source and available under the Pink Panthers project on Github. I'm naming it after the Serbian crime syndicate, The Pink Panthers, because a) they are fearless badasses and b) they are clearly opinionated. Like me!
So, what does a startup need from a framework?
Let's take a look at a few features a software startup may need from a framework. BTW, when I use the term 'startup', I mean an early stage company, pre-product, with a few founders hammering out code. I'm definitely not referring to a later stage startup with B round funding, 1000 customers, and 120 employees! Those are not startups in my opinion.
An opinionated view of what features an early stage startup will need consists of the following:
- a domain serving content
- simple design for displaying content
- simple way for non-technical people to create content
- simple deployment methodologies, including testing
- email registration for interested users
- user registration for early beta testers
- mailing list to email users periodically
- forums to hold discussions
- blogging framework for writing posts and making comments
- social networking features
- issue tracking via Github
- source code versioning via Github
- simple analytics dashboards
- simple monitoring of the service
- an A/B testing framework
- an integrated payment system via Stripe
Installation, initial configuration, and creating a project on AppEngine will be done using a build script. A configuration webpage may be used to speed up the installation, similar to the way WordPress does it.
Installation documentation will be provided where scripts or GUIs cannot provide assistance in setting up the framework.
Someone with a marginal amount of technical expertise should be able to do the installation. It should be approachable by end developers who can do simple HTML editing and basic command line actions like checking out repositories via Git and editing content online at Github.
Features for managing users will be a priority. A startup depends on users for its eventual success, and those users need ways of communicating with the startup and promoting the startup's offerings to other users. This means features such as blog posts, mailing lists, social media integration, commenting, analytics and A/B testing should be available on the site from day one.
The layout and content of the framework should be made obvious with a sample site repo which others can clone and easily edit or repurpose within the limits of the underlying frameworks.
The content of the sample repo will drive the homepage for Pink Panthers, and the page it builds will contain blog posts and documentation for using the framework. The code for Pink Panthers will live inside the startup project's directory as a git module.
Dependencies which are being actively developed in their own git repos will live in git modules under the Pink Panthers repo. Dependencies which aren't updated regularly, or require custom modifications to work in the framework, will be cloned and added to the Pink Panthers repo.
Storage and Security
Content should be stored independently of any other service, with the exception of the startup's own Google account services and Stripe. Company critical data like user account information should be kept secure. Credentials or sensitive data like credit cards should have secure and simple methods for storage and processing.
Methods for backing up or exporting data should be readily available within the framework.
The code for the framework is Open Source, which allows it to be independently audited.
If you have any suggestions, it's time to make them in the comments below. I'll update this post as needed over the next few weeks.
I've been off consulting in the OpenStack Outlands for the past 6 months. In my absence, the Logging River Valley has apparently been busy: Splunk Storm went free, Logentries and Loggly both launched a new look and feel using their respective hauls, and Google's BigQuery changed their pricing structure and added a slew of new features.
Now that Google has changed its pricing on BigQuery and has added features I need to complete TinyProbe, I decided to jump back in and see where we were at with pricing.
WTF is up with all the knobs?
I worked at Splunk and founded Loggly (obviously I'm no longer there), so I know my way around the various terms used in the industry when discussing log management offerings. Fancy phrases like retention times and indexing volumes rolled off my tongue on a daily basis, mostly due to the long hours I spent with the best of the best discussing pricing models.
Today, it seems only one company really gets it when it comes to pricing event indexing and storage as a service: Google.
Google is big on transparency and flexibility nowadays, so much so in fact that they have a section titled pricing philosophy on their pricing page for BigQuery. Their pricing strategy is simple: charge once a month for storing the data and charge again when you search it.
In comparison, here's a compiled list of all the other companies' features/knobs/limits/crap you have to wade through to figure out how much sending them your logs is going to cost you and how long you'll have access to your data:
- GB/day pricing
- GB/month pricing
- flexible pricing
- contract length
- GB storage pricing (less the full text searchable index)
- GB sent pricing
- variable retention time limits (7, 15, 30, infinity days)
- max storage size retention limits
- daily volume limits
- monthly volume limits
- indexing hard limits
- overage fees (daily/monthly)
- extended support options
- free offerings
It's no wonder they need all those fancy plan names like Development, Free, Gold, Most Popular, Platinum, and 'Pay for what you need' - it's fucking confusing figuring out how much it's going to cost you!
Leveling the Field
Given I'm going to eventually charge something for TinyProbe, I decided to do a cost comparison across services, based on a logging volume of 20GB/day and a desired retention of 30 days. At that volume, you should have a maximum of 600GB of data available and searchable in your account at the end of each monthly billing period.
Here are the results of my research:
I should explain a few terms I use here before diving into each service's pricing. First, the term retention hit refers to the amount of time before you hit storage OR time retention limits if you tried to shove in 20GB a day or 600GB a month to the service. For example, Logentries has a 7 day retention hit because accounts are limited to 150GB/month total.
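To make the retention hit idea concrete, here's a minimal sketch of how I work it out. The function name and parameters are my own invention for illustration, and it assumes a service trims data as soon as either a storage cap or a time limit is reached.

```python
def retention_hit_days(daily_gb, storage_cap_gb=None, time_limit_days=None):
    """Days of data you can keep before a storage or time limit trims it.

    Returns None when the service advertises neither kind of limit.
    """
    limits = []
    if storage_cap_gb is not None:
        # Whole days of logs that fit under the storage cap.
        limits.append(storage_cap_gb // daily_gb)
    if time_limit_days is not None:
        limits.append(time_limit_days)
    return min(limits) if limits else None


# Logentries example from the text: 150GB/month cap at 20GB/day.
print(retention_hit_days(daily_gb=20, storage_cap_gb=150))  # 7
```

A service that can hold the full 600GB and keeps data for 30 days comes out at 30, which is the target I'm comparing everyone against.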
The term max retention is used for indicating a service's apparent hard limit on retaining searchable data.
The term daily limits refers to whether or not the pricing model shows GB/day rates, which could imply to the end user that there are daily limits on sending in data. In reality, I don't think any of the services above marked 'Yes' in the daily limit column rate-limit their inbound data, due to data loss concerns.
Starting out, we have Splunk Storm. I always figured Splunk would use Storm as lead generation, and here we are and I was right: it's now completely free. You can send in up to 20GB/month of data and indexes and storage are trimmed monthly. Indexing stops after 20GB though, so it's tough titties if you go over. That's why there's an N/A in the projected cost column. Hey, it's free. Sell your first born, buy their software and run it yourself if you don't like it.
Logentries has two types of pricing: plans and metered. I used the Plus plan rate above to calculate the cost per month because I couldn't figure out what the hell they were talking about on the metered page and what the limits were. They charge a combination of GB sent per month and GB stored per month for metered. I suppose it might be $1.99 + $0.69 = $2.68/GB month sent and stored, but that seems more expensive than the $1.66 for the Plus plan. I gave up on that tactic and multiplied their top plan price by 4 yielding about $1K a month for 600GB of data. Imagine you have four accounts with them, paying about $250/month each to get around the hard limits.
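Here's the arithmetic behind that paragraph as a rough sketch. The rates are the ones quoted above, the metered formula is only my guess at how their page works, and the helper names are made up for this example. I keep everything in cents so the sums stay exact.

```python
# All prices in cents so the arithmetic stays exact.
PLUS_RATE_CENTS_PER_GB = 166      # Plus plan rate quoted in the text
PLAN_CAP_GB = 150                 # per-account monthly cap
METERED_SENT_CENTS_PER_GB = 199   # my reading of the metered page
METERED_STORED_CENTS_PER_GB = 69  # likewise, a guess


def plan_cost_cents(monthly_gb):
    """Cost when you stack enough capped accounts to cover monthly_gb."""
    accounts = -(-monthly_gb // PLAN_CAP_GB)  # ceiling division
    return accounts * PLAN_CAP_GB * PLUS_RATE_CENTS_PER_GB


def metered_cost_cents(monthly_gb):
    """Cost if metered really is sent plus stored, per GB per month."""
    per_gb = METERED_SENT_CENTS_PER_GB + METERED_STORED_CENTS_PER_GB
    return monthly_gb * per_gb


print(plan_cost_cents(600) / 100)     # 996.0
print(metered_cost_cents(600) / 100)  # 1608.0
```

Four stacked plan accounts land at about $1K, which is why I went with the plan rate instead of my guess at metered.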
Loggly's monthly cost projection is confusing in a similar way because the site shows they only retain data for 15 days. Like Papertrail's primary pricing page, theirs charges you a monthly rate for one thing they do on a daily basis (indexing your data) and another they do on a semi-monthly basis (storing your data). One way to think about it is that you pay them monthly for storing half the data you've sent them during the month.
*Note: Shortly after posting this, Loggly contacted me saying a) they didn't have limits on retention, b) their pricing was $1,350 for 30 days retention of 20GB/day and 600GB/month storage, and c) they didn't have daily GB limits. I've since changed the table to reflect the pricing in the email I received from them, but have left the references to the 15 day retention times as that's what their site says. I added a paragraph above explaining my reference to daily limits.
And, it's still confusing!
Papertrail wins the most expensive service award, which isn't surprising given they have a jumbled set of pricing pages with massive numbers of buttons on them. The first pricing page takes a similar tactic to Loggly's where they only keep the index for half the month, but then 'store' it for you for a year. It took me a while to find their pricing slider which seems to indicate they have monthly volume limits with daily or weekly based retention times. You can send in up to 500GB/month (a real month) and store and search logs for up to 4 weeks (a fake month). As an aside, I actually invented the idea for the pricing slider in a frustrated fit of pricing creativity one day at Loggly. Good to see it still in use somewhere.
Ah, domo SumoLogic. It took me a good 5 minutes wading through their pages to find their pricing page. Like Papertrail, Sumo's pricing is done on a sliding scale and I like how they seem ready and willing to provide larger retention times - up to and over a year. This makes sense as the founders are from ArcSight and compliance use-cases are worth a lot of money and require hella long retention times. Once you find the page, cost calculation is simple. There's no talk of index or search limits, and they are nearly as cheap as Logentries while supporting higher volumes. Good stuff.
Google's simple philosophy shows well here price-wise, but the lack of a UI is certainly a big barrier to entry. It's also a business opportunity for some given how ridiculously cheap it is compared to the other offerings. Pricing is simple and stupid cheap at just under $50 for 600GB/month stored and searchable. The result is a price that is an order of magnitude cheaper than other offerings.
Google recently added streaming and table decorators to BigQuery, which makes things a little more approachable, logging-wise. Of all the offerings, only Google chooses to charge extra for searching the data, which raises the interesting question of how much it's going to run me to use it. I honestly don't know the answer to this question, but I can speculate a bit.
Speculating on Charging for Search Usage
There are literally hundreds of event/log management use cases. Analytics. Monitoring. Alerting. Troubleshooting. Compliance. Most of the more common use cases like monitoring require a regular timed job run on all the data that gets indexed in the system. For example, if you must alert on the term error then you must search all 20GB/day of data for that term if that's how much data you are indexing.
Google does several interesting things with its indexes in BigQuery to help with this. First, when you do a query on BigQuery, only the fields specified are searched, which means fields not used in the query are not included in the quota. Second, you can search for things like "error AND failure" in a single query, which means you can lump together certain monitoring queries.
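A quick sketch shows why both of those matter for cost. The column names and per-column sizes below are made up for illustration (they just add up to 20GB/day), the price is the $35 per TB figure I use later in this post, and I'm treating 1TB as 1000GB to keep the math simple.

```python
PRICE_PER_TB = 35.0  # dollars per TB processed, the figure used in this post

# Hypothetical daily column sizes in GB for a 20GB/day log table.
COLUMN_GB = {"timestamp": 1, "host": 1, "severity": 1, "message": 17}


def query_cost(columns, column_gb=COLUMN_GB, price_per_tb=PRICE_PER_TB):
    """Dollar cost of one scan that touches only the named columns."""
    scanned_gb = sum(column_gb[name] for name in set(columns))
    return scanned_gb * price_per_tb / 1000


# Only the referenced fields count against you.
print(f"{query_cost(['severity']):.3f}")       # 0.035
print(f"{query_cost(list(COLUMN_GB)):.3f}")    # 0.700

# Two separate monitoring searches versus one lumped-together search.
print(f"{2 * query_cost(['message']):.3f}")    # 1.190
print(f"{query_cost(['message']):.3f}")        # 0.595
```

In this toy setup, lumping two monitoring searches into one query halves the bill, and a search that only touches a small field costs a fraction of a full scan.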
As for user interactions, consider that when you manually search using these services, you are searching a bounded time range. That means you are necessarily limiting your query to a smaller data set. This would translate to smaller costs for searches on BigQuery.
It's hard to tell, but my gut says that for any given data set a user sends into a log management system, they end up searching through all that data roughly 10 times on average. Keep in mind you could search subsets of this data hundreds of times over as needed and still fall under the 10 time average estimate, especially with the tricks BigQuery plays with data.
If we use Google's cost of $35 per TB of data processed as a guide, that means our 600GB/month of data would cost us about 0.6 x 10 x $35 = $210 to search for the month. Add that to the measly 48 bucks above, and you get a total cost of about $258 - nearly a tenth the cost of the competitors. Google gives breaks for doing batch queries as well, so it's probably cheaper still than I outline here.
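If you want to plug in your own numbers, the estimate is easy to sketch. The function name is my own, the 10 passes figure is just my gut feel from above, and I'm treating 1TB as 1000GB.

```python
def monthly_search_cost(stored_gb, passes, price_per_tb):
    """Dollars to scan stored_gb of data `passes` times at price_per_tb."""
    return stored_gb * passes * price_per_tb / 1000


STORAGE_COST = 48  # dollars/month for 600GB, from the comparison above

search = monthly_search_cost(stored_gb=600, passes=10, price_per_tb=35)
print(search)                 # 210.0
print(search + STORAGE_COST)  # 258.0
```

Halve the number of passes and the search bill drops to $105, so the total is very sensitive to how often you actually search your data.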
After all this analysis I've decided to stick to Google's new pricing model and charge customers based on the amount of data sent in and, separately, searched on TinyProbe. I'll probably create a couple of dirt cheap accounts which have hard limits and a single reasonably priced metered plan for larger data sets.
And if it weren't obvious by now, I'm using BigQuery as the storage engine for TinyProbe. Thanks to Google I get a twofer on my offering - fantastic search capabilities coupled with simple, non-confusing pricing.
I've been noticing the term data scientist being bandied about and loosely coupled to the term analyst. I think it brings a little zing to older dog-eared terms like big data and cloud computing! Watch...
Friend: What do you do for a living?
Me: I'm a frickin data scientist, yo!
Actually, I'm Just a Programmer
If I remember correctly, my college degree's title was Computer Science/Mathematics. I suppose it's fair to state my degree technically makes me a computer scientist. To ride that logic train a bit further, as a programmer I can't really get away from dealing with data, so all of us developers just become data scientists don't we?
Analysts? Well, I suppose they know what questions to ask, but it's unlikely they know how to code. That begs the question: Which is harder, teaching an analyst to code, or teaching a coder to analyst? Analyze? Whatever.
Frankly, seeing how a lot of developers end up being tasked with writing dashboards and admin panels for their company's web applications, I think it's OK we start calling out these developers as the real scientific heroes. After all, physicists regularly take up cellular biology as a career, because they are uniquely qualified to become experts with specific bio-technologies. Why can't computer scientists do the same with business logic?
They can and they do.
Application Monitoring vs. Application Intelligence
Over the last 5 years of my foray into operation management software, I noticed a trend while talking to customers about monitoring their applications: They wanted to know more about how users used their software using the same data they were using for monitoring. I know, earth shattering stuff, right? It seems obvious, but why exactly were they having that conversation with me when we started out talking about monitoring log files?
The answer lies in what role I was in at the time and to whom I was talking. As a director of the developers program at Splunk and more recently as a founder/CEO of Loggly, the people I talked to first were usually product/project managers. As the conversations progressed, they invariably fell to application intelligence, because these managers also worried about whether or not people used their software and how.
If you are collecting a ton of data for monitoring your app, it's not a stretch to extend it to collecting data about signups, feature usage patterns or any other random metric you might dream up while sitting awake at 2AM worrying about your conversions. Why the hell wouldn't you track this stuff, given you had a tool that would do it?
The answer is, you would. All you need is a developer.
Apps to the Rescue! Not.
SalesForce was one of our first really BIG customers at Splunk. As many of you are probably aware, the awesome thing Splunk has going is its backend. It's essentially a non-relational key-value store shoehorned into a massively scalable search engine. This ability to scale and search a massive amount of logs is what originally got SalesForce's attention. What kept them interested and eventually made them purchase Splunk, was the ability to write apps to query that engine and display graphs to their customers.
I ended up working with a handful of developers assigned to the Splunk project at SalesForce. Over the course of a month or so, I helped those developers write a few of the early prototypes for showing mail campaign reports/graphs to customers using SalesForce. Basically the idea was you'd email a bunch of your company contacts with some offer and then SalesForce would show you how many of those emails bounced. You could then search (using Splunk) for the records that bounced and update your contact entries as needed.
It was a simple idea, but one that got us thinking: Was it possible to write a bunch of useful apps like this and sell them to others? At the time, a lot of us at Splunk sure thought so - that's basically what SplunkBase is all about, after all.
After doing Loggly and thinking about it a bit more with a ton of other customers, I'm not so sure it makes a ton of sense.
Keep IT Simple
The primary problem with a one-size-fits-all application intelligence app is the data ingested needs to all be uniform. And when I say uniform, I mean uniform data coming out of your webapp and my webapp. The data flows differ.
If I sign up customers using a simplified flow and you don't, how do I write an app that tracks conversions accurately for both of us? The short answer is you'll just end up using an app to track whether someone signs up or not. And to hell with any data gleaned in between the first page load and the signup.
And there's the problem. It's just stupid: throwing away all that valuable user data.
But we do it every day. With Google Analytics, MixPanel, ChartBeat, Geckoboard, and countless other application tracking sites that claim to offer detailed analysis of your users' activities. In reality all these applications offer is a low friction way for marketing peeps to glean some data from the torrent of user data your site gets every day. Those people aren't data scientists. Or programmers.
And they are leaving a ton of questions about your business on the table.
With TinyProbe, I aim to change the way we approach data science by empowering the people who implement analytics solutions: developers. If you are a developer hacking on analytics for your applications, stick around. I think TinyProbe is going to blow you away.