Monday 21 October 2013

Using Enterprise Architecture at a Media Company -- Part three: TOGAF

What is TOGAF?

TOGAF stands for "The Open Group Architecture Framework". But, I suppose that really doesn't tell you much about what it is. I like to think of TOGAF as analogous to a big box home improvement store, such as Home Depot, Lowe's or Rona (Canada only). You go to a home improvement store when you need tools or materials for building something or for a home improvement project. You might not get everything you need there and you may need to adapt some of the materials and tools to work properly in your home.

Similarly, if you are trying to construct or improve an Enterprise Architecture, you can go to the TOGAF "store"  and get at least some of the materials and especially the guidance that you need. You don't need to get everything there, as you may be using other tools such as the Zachman framework. You will also likely need to adapt what you get from TOGAF to your needs, just like you adapt the materials you buy at a home improvement store to your home.  

Big box home improvement stores can be somewhat daunting places.  It is sometimes hard to find what you need, because the stores contain so much merchandise. In my opinion (and the opinion of others that I have talked to) TOGAF is similar.  It contains megabytes of text that makes suggestions and recommendations about how to do an Enterprise Architecture. In my opinion, there are a lot of good ideas/guidance there, but it can be a bit overwhelming, especially at first. From talking to others, some people give up on it because TOGAF seems like far too much effort, especially for a small to medium-sized business. Also, many Enterprise Architecture groups simply cannot get the necessary corporate buy-in to implement a large part of TOGAF as a single "big bang" type project.

Luckily, just like you don't need to buy everything in your local Home Depot in order to fix your house, you also don't need to use every word of TOGAF to build your enterprise architecture.  TOGAF actually mentions this and spends quite a bit of time talking about how it must be integrated into existing initiatives (like project and portfolio management). 

In this blog entry, I will give a really high level view of TOGAF, and then, in the next entry, discuss how it was used in "the Sphere", a fictitious medium-sized media company that has a small architecture team: one enterprise architect and five solutions/information architects (some of whom are only part-time architects).

A Fairly Quick, High Level View of TOGAF

I like to think of TOGAF as having two main parts: the Architecture Development Method (ADM) and the enterprise continuum.  The ADM is a recommended set of phases for the development process of an Enterprise Architecture.  The enterprise continuum is divided into two smaller parts: the architecture continuum and the solutions continuum. Each contains building blocks (architecture and solution building blocks, respectively), ordered from the most generic to the most specific (the four categories, from most generic to most specific, are: foundation, common, industry and organization). You can populate the architecture and solutions continuum with your own architecture and solution building blocks, essentially creating a customized library of re-usable components.  This is a great way to encourage consistency between projects. You can also get building blocks from industry reference models, and you can use the TOGAF Technical Reference Model as an extremely generic starting point. TOGAF recommends that you use some sort of repository tool to store all the artifacts/documents/models that you produce.
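
To make the continuum idea a bit more concrete, here is a minimal sketch in Python (my own illustration, not something from the TOGAF documents; the building block names are invented):

```python
from dataclasses import dataclass
from enum import Enum

# A sketch of the enterprise continuum as a library of re-usable building
# blocks. The four category names come from TOGAF; the blocks are invented.
class Category(Enum):
    FOUNDATION = 1    # most generic
    COMMON = 2
    INDUSTRY = 3
    ORGANIZATION = 4  # most specific

class Kind(Enum):
    ARCHITECTURE = "architecture"  # lives in the architecture continuum
    SOLUTION = "solution"          # lives in the solutions continuum

@dataclass
class BuildingBlock:
    name: str
    kind: Kind
    category: Category

repository = [
    BuildingBlock("TOGAF TRM application platform service", Kind.ARCHITECTURE, Category.FOUNDATION),
    BuildingBlock("media industry content model", Kind.ARCHITECTURE, Category.INDUSTRY),
    BuildingBlock("the Sphere's paywall service", Kind.SOLUTION, Category.ORGANIZATION),
]

# Browse the architecture continuum from most generic to most specific.
for bb in sorted((b for b in repository if b.kind is Kind.ARCHITECTURE),
                 key=lambda b: b.category.value):
    print(f"{bb.category.name}: {bb.name}")
```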

Keep in mind that there are megabytes and megabytes of documents describing TOGAF, so we really are just scratching the surface in this blog entry. Much of the TOGAF documentation contains suggested ways to use the components of TOGAF. The user of TOGAF is free to use as much of this advice as he or she wishes.

The Architecture Development Method (ADM)

The TOGAF phases (source: Stephen Marley/NASA/Sci)


The ADM is divided into ten phases, each of which is further divided into steps and sub-steps. With the possible exception of the Preliminary phase (at the top of the diagram), TOGAF assumes that we constantly iterate through the phases in order to produce architectures. Although an iteration looks like it starts in phase A and continues sequentially to phase H, new requirements may force us to go back one or more phases and revise our work (which is why there is a "requirements" circle in the middle of the diagram). Realistically, we may also need to revisit a phase because we forgot something. After revisiting a phase, we may need to revisit the subsequent phases to deal with the effects of the changes we made.
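
As a toy illustration of this iterate-with-backtracking idea (my own sketch; the revisit decision is just a placeholder for what requirements management would actually tell us):

```python
from typing import Optional

# A toy sketch of ADM iteration. Phases A-H follow the diagram; the
# Preliminary phase and Requirements Management are omitted for brevity.
PHASES = ["A: Architecture Vision", "B: Business Architecture",
          "C: Information Systems Architectures", "D: Technology Architecture",
          "E: Opportunities and Solutions", "F: Migration Planning",
          "G: Implementation Governance", "H: Architecture Change Management"]

def must_revisit(phase: str) -> Optional[int]:
    """Return the index of an earlier phase to revisit, or None to proceed.
    In reality this decision comes from requirements management."""
    return None  # placeholder: no new requirements in this toy run

i = 0
while i < len(PHASES):
    print("Working on", PHASES[i])
    back = must_revisit(PHASES[i])
    # New requirements (the circle in the middle of the diagram) can send
    # us back one or more phases; otherwise we proceed sequentially.
    i = back if back is not None else i + 1
```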

Preliminary and Architecture Vision Phases

The preliminary phase is essentially the initiation of the Enterprise Architecture effort within an organization. The organization needs to decide to what extent it is going to use Enterprise Architecture, how it impacts the org chart and how it impacts business planning, operations management (technical systems and development) and project management. The organization needs to create and agree on basic architecture principles (such as "only one authoritative source for each data element", "avoid overlap and duplication of applications", and "systems should be designed to operate in spite of component failures") as well as a customized version of TOGAF, which will likely evolve over time.  Identifying a place to store architecture documents (such as SharePoint, Confluence, wikis, etc.) is also desirable, and some modelling tools/notations (ERDs, UML, etc.) may be chosen.

The Architecture Vision phase is the start of an iteration of the architecture process. Depending on how one sets up TOGAF, it may be triggered by a "request for architecture", which can be part of the project management/project initiation process in an organization. The architecture vision phase needs to first identify the stakeholders and should ensure that their concerns and requirements will be addressed. With this information, and the architecture principles defined in the preliminary phase (realistically, you might need to go back and add some principles), TOGAF recommends that you attempt to define the scope of the required architecture activities and create rough business, data, application and technology architectures. Although this may seem obvious to some, you should also make sure that the project isn't beyond the capabilities of the architecture team and that any necessary business transformations are possible. TOGAF recommends that you attempt to achieve consensus among the various stakeholders before proceeding, in order to avoid the (not unknown) scenario where you finish the architecture project and it is never accepted.  Finally, you should produce an architecture vision document which contains at least the preliminary architectures and requirements, and may contain a resource plan, KPIs, milestones, etc.  This should be approved by a sponsor and/or architecture board before proceeding to the next phase.

Architecture Phases

Common steps for all architecture phases

In TOGAF, the three architecture phases are broken down into the same series of steps (which are actually done somewhat iteratively):

  1. Select reference models, viewpoints, and tools
  2. Develop baseline architecture
  3. Develop target architecture
  4. Perform gap analysis
  5. Define candidate roadmap components
  6. Resolve impacts across the architecture
  7. Conduct formal stakeholder review
  8. Formalize the architecture
  9. Create architecture definition document
The first step is basically the preparation step for the architecture phase. It asks us to consider whether there are any well-known generalized models for what we are doing. For example, if you are architecting a call centre, you may wish to use the One-VA technical reference model. You should also determine the relevant stakeholders' viewpoints for the architecture you are doing, the models you need to cover them and whether any of the architectural principles apply.  You also need to decide what tools you want to use (e.g. Visio, Rational Rose, etc.).

The first step is usually broken into the following sub-steps, some of which you may not need to do in detail or may wish to skip:

  1. Determine the overall modelling process, selecting the models needed to support all the views/viewpoints
  2. Identify possible building blocks from the architecture repository
  3. Identify the matrices you want to build (or are required)
  4. Identify the diagrams that you will need
  5. Identify the types of requirements that you will need to collect
  6. Select appropriate services, ideally using combinations of services from the TOGAF TRM



The second step is to describe the existing architecture, at least to the extent necessary to show how the target architecture will change things. Where possible, existing building blocks should be used for this.  It's possible that all or part of the existing architecture was done as part of a previous architectural iteration.

The third step is to build enough of a target architecture to show how things will change in order to accomplish the architecture vision. Sometimes the third step is done before the second. Where possible, existing building blocks should be used for this step, but it is quite possible that you may need to define new building blocks, which will need to be added to the architecture repository and placed in the architecture continuum.

The fourth step is to perform a gap analysis.  This means that you look for missing parts in the baseline and target architectures (and presumably fill them in) as well as look for the gaps between the baseline and target. You also look for conflicts between views of the system and resolve them by examining tradeoffs.
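
If it helps to think about it in code, gap analysis is conceptually a set comparison between the building blocks of the baseline and target architectures. A minimal sketch (the building block names are invented for illustration):

```python
# A minimal sketch of gap analysis as a set comparison between the
# building blocks of the baseline and target architectures.
baseline = {"print subscriptions", "call centre", "web CMS"}
target = {"print subscriptions", "call centre", "web CMS",
          "digital subscriptions", "metered paywall"}

to_build = target - baseline      # gaps: new capabilities to create
to_retire = baseline - target     # eliminated: candidates for phase-out
carried_over = baseline & target  # unchanged: potentially re-usable as-is

print("Build:", sorted(to_build))
print("Retire:", sorted(to_retire))
print("Carry over:", sorted(carried_over))
```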

The fifth step is to create a roadmap that takes you from the baseline to the target architecture. The roadmaps built in the architecture phases can often be used in the migration planning phase.

The sixth step is to consider whether the target architecture will impact any other architectures. You may be removing building blocks that are used in other architectures, or perhaps adding capabilities that will be useful elsewhere.

The seventh step is to review the architecture with the stakeholders (possibly the ones whose viewpoints you considered in step 1) and make sure they agree you are accomplishing the architecture vision and they approve of how you are doing it.

The eighth step is to finalize the architecture, essentially filling in any necessary details in architectural deliverables (such as building blocks) and ensuring requirements traceability for what you've proposed.

The ninth and last step is to create the architecture definition document using the deliverables, and to make sure it goes into the architecture repository.

Business Architecture Phase

Like all architecture phases, the business architecture phase goes through the steps given above. Given that you are working on a business architecture, it is important to keep the business drivers in mind. As with all architectures, you also need to make sure you are considering all stakeholders, including business owners and users.

A key activity in the business architecture phase is modelling the baseline and target architectures.  You can use a lot of different models for this, including:

  • data flow diagrams (where the processes are business processes)
  • entity-relationship diagrams (or class diagrams in UML -- the entities or classes would represent high level business entities)
  • information exchange matrices (and node connectivity diagrams) showing what information is exchanged between business groups
  • use cases
  • structured analysis (which breaks down key business functions and assigns them to organizational units)

It's also possible to use UML sequence diagrams to document business processes, so you might want to use them as one of your business architecture models.

One of the activities in the first step of any architecture phase is investigating whether there are any industry reference models for what you are doing.  The U.S. Veterans Administration has many such models available, as do industry groups like the TeleManagement Forum.  You should also look internally for re-usable building blocks (essentially little bits of business architecture). For example, you might be able to re-use architectures for things like invoice processing and approval.

Information Systems Architecture Phase

The Information Systems Architecture phase includes data architecture and applications architecture. Data architecture is often done first, but not always.  In some situations, the two are done almost concurrently, because changes in one tend to result in changes in the other.  As with the business architecture phase, the basic architecture steps (given above) are followed.

Data Architecture 


As with all architecture phases, we need to start by considering the business drivers for architectural change and determining who the stakeholders are (don't forget the operations team or the internal/external auditors). We also need to consider how any new data we introduce will be created, distributed, migrated, secured, and archived. If the gap between the baseline and target data architectures is very large, we will need to make sure we plan a data migration as part of the implementation effort.

As with all architecture phases, we will use models to represent the target and baseline architectures. These models should represent the viewpoints of all the stakeholders. The changes in the target models should ideally be traceable back to the business architecture and/or the business requirements. Where possible, we should start by considering any industry reference architectures that may exist (such as ARTS for retail and Energistics for the petroleum industry). Probably the most basic data model is the list of data elements. From there, we can model the data using an ERD or UML class diagram. ERDs and class diagrams can be done at various levels of detail, including conceptual, logical and physical.  It may make sense to do all three in this phase, although a really high level ERD or class diagram (which may essentially count as the conceptual model) may have been done in the business architecture phase. There are a number of other models which may be useful to cover viewpoints, including data flow diagrams (logical and physical). You may also wish to write documents relating data elements to business entities, business functions and access restrictions, and covering any interoperability requirements for data (such as data formats for interchange between components).

Application Architecture

The application architecture follows the steps given above for all architecture phases, and considers the baseline applications and how they must change, based on the business (and usually the data) target architectures, in order to arrive at a target application architecture.


You can use a UML component diagram to model all applications and their flows (for both baseline and target architectures). You may also wish to create application portfolio catalogs and application migration documentation if the target architecture requires large changes to the set of applications.

Technology Architecture Phase

Technology architecture covers the hardware and physical infrastructure and may also deal with "software infrastructure" such as databases and software which implements services from the TOGAF Technical Reference Model (TRM).  Like the other architecture phases, the technology architecture phase follows the steps common to all architecture phases, which are outlined above. The starting reference architecture for this phase is generally the aforementioned TRM.

It's possible to use a UML component diagram or a Visio network diagram to draw the hardware/software infrastructure for the baseline and target technology architectures, being sure to document locations and the communications/network requirements between components.  It may also be useful to produce documentation that ties the abstract services of the physical and technical infrastructure to those of the TOGAF TRM (see above). When you need to add new technology, you should ideally follow existing technical standards, where they are applicable. For any new technology that you wish to add, quite a few factors need to be considered (and, of course, should ideally be put into the target architecture), including performance, maintainability, location (and the resulting network latency) and availability requirements; you will also need to do sizing, costing, capacity planning, and migration planning. New technology might also be subject to your organization's technical governance processes (for example, PCI or audit compliance).

Opportunities and Solutions Phase

In this phase, we go through the baseline and target architectures that we previously created, decide what needs to be done to bridge the gaps (and if it can in fact be done), and then create a preliminary plan which is finished in the next phase.

When we examine the gaps in the various architectures, we will need to look at whether business priorities will impose constraints on what we would like to do. For example, if we would like to replace a call centre, but opening new stores has a higher business priority, we may need to come up with an alternative to replacing the call centre. We will take this sort of thing into account as we consolidate the gaps from the architecture phases and then examine each one in order to make decisions on how the gap should be addressed. These decisions will then be documented (possibly in an "Implementation Factor Assessment and Deduction matrix") as we go.

As we are examining the gaps, we may notice common requirements that span several business functions. These should be consolidated (or "factored out") to make sure we only address them once. As we determine solutions, we may end up considering applications that will need to interoperate. We need to make sure that we keep track of these interoperability requirements and deal with them by changing input/output specifications or introducing adaptor components.
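
Here is a minimal sketch of the adaptor idea (all of the class and field names are my own inventions for illustration): an adaptor component lets two applications with mismatched input/output specifications interoperate without changing either one.

```python
# A sketch of an adaptor component bridging two applications whose
# input/output specifications don't match. All names here are invented.
class SubscriptionSystem:
    """Produces subscriber records in its own format."""
    def export_subscriber(self) -> dict:
        return {"subscriberId": 42, "plan": "digital-monthly"}

class BillingSystem:
    """Expects records keyed the way the billing team defined them."""
    def bill(self, record: dict) -> None:
        print(f"Billing account {record['account_no']} for {record['product']}")

class SubscriptionToBillingAdaptor:
    """Translates between the two specifications so neither system changes."""
    def __init__(self, billing: BillingSystem) -> None:
        self.billing = billing

    def forward(self, subscriber: dict) -> None:
        self.billing.bill({
            "account_no": subscriber["subscriberId"],
            "product": subscriber["plan"],
        })

adaptor = SubscriptionToBillingAdaptor(BillingSystem())
adaptor.forward(SubscriptionSystem().export_subscriber())
```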

We next need to look for dependencies between our gap-bridging solutions so that we can determine the order in which the solutions can be built, determine possible delivery dates, and start grouping the solutions into work packages; a sketch of this ordering appears below. As we continue to look more closely at the solutions, we always need to ask whether it is within the organization's capabilities to implement them.  If not, we will need to find other solutions. We also need to decide if a solution is completely new ("greenfield"), directly obsoletes existing systems ("revolutionary"), or gradually changes existing systems ("evolutionary"). It's sometimes also valuable to identify the "quick win" solutions and distinguish them from the ones achievable in the middle and longer term. Quick win solutions can be helpful to show that you are making progress in an implementation, although not everything can be a quick win. Besides classifying solutions, we also need to determine whether the components of our baseline architecture will continue to exist in the new architecture, will be gradually phased out or will be replaced as a result of the current effort.
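
Determining a build order from dependencies is essentially a topological sort. A minimal sketch (the work packages and dependencies are invented; graphlib requires Python 3.9+):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# A sketch: derive a build order for gap-bridging solutions from their
# dependencies. Work package names and dependencies are invented.
dependencies = {
    "metered paywall": {"subscriber database", "CDN integration"},
    "SAP billing changes": {"subscriber database"},
    "CDN integration": set(),
    "subscriber database": set(),
}

# static_order() raises CycleError if the dependencies are circular,
# which would also be worth knowing at this point in the phase.
build_order = list(TopologicalSorter(dependencies).static_order())
print(build_order)
```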

Once we know what we wish to do (aka the work packages), we need to determine whether we will move from the baseline to the target architecture in a single step or whether we will need to plan intermediate steps, or transition architectures.  Ideally, transition architectures should deliver some (business) value, or else they may be hard to justify.

Once we know what our transition architectures will be (or if we are going to use them at all), then we create the initial versions of the migration documents that we will refine in the next phase: the architecture roadmap, and the migration and implementation plan. We also update the architecture vision document, the architecture definition document and the architecture requirements as necessary.

Migration Planning Phase

In the migration planning phase, we finalize the migration documents we started in the previous phase. The migration planning phase essentially completes the architecture activities for the current iteration of the ADM.

The first step of the migration planning phase is to determine how (or if) the changes necessary to implement the target architectures will affect project/portfolio management, business planning and operations management. It may make sense to have one of these three management frameworks deal with some of the changes rather than the enterprise architecture process. It's also possible that the people who govern these frameworks may want modifications to the work packages, so it is best to find out.

Next, we try to assign a business value to each work package, considering ROI, strategic fit, or ultimate value to the business (value chain analysis). This analysis should ideally help get the work packages approved and is a good double check on whether the architecture is aligned with business objectives. Critical success factors can also be defined so that the success in implementing the work package can be measured.

The third step is to figure out what resources (such as people) are required to do each work package, how long it will take and determine whether or not the resources required can be made available for the required time.

We then try to prioritize the work packages based on cost-benefit analysis and risk, and get the stakeholders to agree to the prioritization. A rough sketch of this kind of scoring follows.
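
As a rough illustration only (the weights and values are invented; in practice these numbers come from stakeholders and business cases), prioritization might boil down to a simple score:

```python
# A sketch of work package prioritization by cost-benefit and risk.
# Values and weights are invented; real assessments come from stakeholders.
work_packages = [
    {"name": "metered paywall", "value": 9, "cost": 6, "risk": 4},
    {"name": "CDN integration", "value": 6, "cost": 3, "risk": 2},
    {"name": "SAP billing changes", "value": 7, "cost": 5, "risk": 6},
]

def score(wp, value_w=1.0, cost_w=0.5, risk_w=0.5):
    # Higher business value raises priority; cost and risk lower it.
    return value_w * wp["value"] - cost_w * wp["cost"] - risk_w * wp["risk"]

for wp in sorted(work_packages, key=score, reverse=True):
    print(f"{wp['name']}: score {score(wp):.1f}")
```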

With all of the above information, we can finalize the architecture roadmap, update the architecture definition document (basically the baseline and target architectures) and generate the implementation and migration plan.

At this point, we are done with the architecture activities of the current iteration of the ADM.

Implementation Governance Phase

Often, we move from the baseline to target architectures in a series of intermediate steps, called transition architectures. In this phase, we monitor the implementation work that takes us from baselines to target architectures, possibly monitoring each transition architecture implementation as a separate step. Generally, there is some sort of formal or informal review process (such as a steering committee that meets periodically) to make sure the implementation is proceeding as planned and conforms to the transition or target architectures. During the implementation it is important to prepare any necessary changes to business processes or operations processes and make sure these are in place when the implementation is complete. Once the implementation is complete, it makes sense to do a "lessons learned" session.

Architecture Change Management Phase

Architectures will need to change -- that is a given. Technology is always changing and so are business priorities. The former means that sometimes a better solution comes along, while the latter can mean that the solution you architected is no longer needed or needs to change significantly. Architecture change requests may be received through the architecture governance process and may originate from operations management (possibly because a solution is not performing as it should) or from business process management. We might need to make changes because of the need to reduce costs, because a certain technology is becoming unsupported or because we have decided to standardize. Sometimes the required change may be small and doable very quickly without significant (or perhaps any) re-architecture work.  It's also possible that the requested change is already accounted for in a transition architecture that hasn't yet been implemented. Other times, it may trigger a "request for architecture" and we may need to iterate through the ADM.

Assuming that an enterprise architecture is being used to try and realize value in an organization, we need to monitor how well it is meeting business and operational objectives and make changes where there are problems or gaps between what is desired and what is being delivered. We also need to consider the effects of new technologies, which may make it possible to better meet requirements or to meet them more cheaply. We should watch carefully for changes in business strategies to make sure that the enterprise architecture continues to meet the needs of the organization; otherwise, we risk having an Enterprise Architecture that was perfect for the business strategy of two years ago, but not for today's.

When we or someone else discovers a problem or gap, an architectural change request needs to be prepared. This request needs to be analyzed to determine the risks of implementing it and to determine how well a solution will fit with the current enterprise architecture. It's also important to determine whether any service level agreements or the business value currently being delivered by the existing system will be affected. We need to propose changes to the change request if necessary in order to mitigate risks, ensure that SLAs are met and ensure that systems continue to provide business value.

Once all of this is done, we need to hold a meeting with the architectural council (or the appropriate governing body, depending on how the change needs to be handled) to get their feedback, buy-in, and hopefully approval. Assuming we get approval, then we need to initiate the process to implement the change, possibly starting another ADM iteration, if the change warrants it.


Requirements Management 

Requirements management sits in the centre of the TOGAF ADM diagram because it operates continuously during all the phases. Essentially, it contains all the activities that collect, organize and get approval for requirements. The changes that result from the requirements are done in the other (outer) phases. Because requirements management runs continuously, it is hard to describe it as a phase.  Therefore we will refer to its steps as belonging to the "requirements management activity".  We will refer to the outer circles in the ADM diagram as the "outer ADM phases".

The outer phases in the ADM (especially the architecture phases) identify new requirements and these are conceptually passed to the requirements management activity, where they are prioritized, approved by the necessary stakeholders and then put into a repository. At the same time, the ADM phase that identified the requirement(s) will modify or add to the requirements it is considering, noting the priorities determined by the requirements management activity and possibly changing them again. The requirements management activity then updates its repository as necessary, communicates the changes to stakeholders and architects, gets their buy-in, deals with conflicts with other requirements and prepares a requirements impact assessment. The current outer ADM phase is then responsible for determining the impact of the requirements on its activities and deciding whether to revisit earlier outer phases or to defer the requirements to a later outer ADM phase. As always, if the requirements change, the requirements management activity needs to update the requirements repository and try to get stakeholder and architect buy-in.
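
As a sketch of the kind of repository involved (the statuses, priorities and methods are my inventions, not TOGAF's):

```python
from dataclasses import dataclass

# A sketch of a requirements repository for the requirements management
# activity. Statuses, priorities and methods are invented for illustration.
@dataclass
class Requirement:
    identifier: str
    description: str
    originating_phase: str    # the outer ADM phase that raised it
    priority: int = 3         # 1 = highest, set during prioritization
    status: str = "proposed"  # proposed -> approved -> addressed

class RequirementsRepository:
    def __init__(self) -> None:
        self._requirements: dict[str, Requirement] = {}

    def add(self, req: Requirement) -> None:
        self._requirements[req.identifier] = req

    def approve(self, identifier: str) -> None:
        # In reality, approval means stakeholder buy-in, not a flag flip.
        self._requirements[identifier].status = "approved"

    def impact_assessment(self, identifier: str) -> str:
        req = self._requirements[identifier]
        return (f"{req.identifier} (priority {req.priority}) was raised in "
                f"phase {req.originating_phase}; earlier phases may need revisiting.")

repo = RequirementsRepository()
repo.add(Requirement("REQ-1", "Support bundled print/digital subscriptions", "B"))
repo.approve("REQ-1")
print(repo.impact_assessment("REQ-1"))
```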

The requirements management activity also needs to note any requirements changes that occur during the architectural change management phase. These requirements will often (if significant enough) be resolved in a subsequent iteration of the ADM (i.e. by starting at the preliminary phase and going through all the phases again).






Tuesday 8 October 2013

Enterprise Architecture and Conflict Resolution

It happens to all Enterprise Architects sooner or later.  Someone doesn't agree with your architecture and they would rather implement a solution in a way that doesn't match the current technical road map.  Sometimes the resulting discussions can be quite civilized and profitable for both parties.  However, to make that more likely, and to prevent a disagreement from escalating beyond where it should, here are some ideas that I have found useful.

When you are first told that someone wants to deviate from your carefully thought out architecture, don't get defensive. Getting defensive is quite natural, especially when the objection to your ideas comes from someone more senior than you in the company, but you don't need to immediately defend your work.  It's best to say something like, "That's an interesting idea, let me think about it a bit" and then go somewhere else and do just that.

Likewise, back away if you feel yourself getting angry.  Showing anger at someone else's ideas when they happen to conflict with your own is not going to help you.

In some organizations, an Enterprise Architect can "pull rank" and simply inform people who disagree with them that they will follow the Enterprise Architecture. Giving people orders is probably not the best way of dealing with the situation.  You may get public compliance, but they could be resentful and may look for a way to circumvent you at the first opportunity.

Let the people who disagree with you know that you value their input and that you are glad that they feel strongly enough about enterprise architecture to come talk to you. Set up a time (usually not right away -- the next day or later would be good unless the matter is of the utmost urgency) to discuss their concerns in more detail.

When the meeting starts, let them do most of the talking.  You should stick to asking questions, pointing out areas of actual agreement (sometimes this works really well) and admitting to any mistakes that you might have made that actually led to the disagreement.  I know the last of these is hard to do, but sometimes when one side in a disagreement admits mistakes, the other side makes concessions as well.  The key thing is that you are not trying to "win" the argument.  You are trying to bring the sides together.

When they are done presenting their points, don't argue with them.  Thank them again for caring enough to come to you and tell them that you will get back to them as soon as you can.  Then, go away and think carefully about what they said.

You need to think carefully about the degree of disagreement.  Is this an issue where you need to go to the wall (i.e. the CIO/CTO) and insist on getting your way?  Keep in mind that could be damaging in the long run, but it is sometimes necessary. Is there any middle ground that is reasonable? Is it possible to use some sort of objective test to decide between the alternatives? Is this an issue that should go to the Architecture Review Board?  Normally, if you don't have enough support at the Architecture Review Board on the issue in question, then perhaps you should consider accommodating the request.

If you do have to oppose the request, try to refer to the organization's architecture principles when doing so. You might also want to give the person who disagrees with you a chance to address the architecture review board and make their case. You should find out if it is possible for them to do a proof of concept for their idea. The key thing is that you need to make sure that they do not suffer anything resembling humiliation as a result of disagreeing with your architecture.  Many times, the person opposing you will be a valued member of the organization.  Keep in mind that they may be more valued than you are :-) and treat them the way you would expect to be treated in their shoes.


Sunday 29 September 2013

My Definition of Enterprise Architecture

Someone recently asked me to define what an Enterprise Architect is.  I'm usually pretty good with definitions, but this time I was stuck.  I went to Google and found out I wasn't the only one.

After some thought, here is my attempt at a definition:

Enterprise architecture is a practice concerned with:

  1. Business-technology alignment
  2. Disciplined innovation (innovation where it is needed)
  3. Disciplined delivery (not re-inventing the wheel; consistency with past efforts; repeatability)
  4. Proactive solutions (proposing solutions when systems no longer support needed business capabilities)
I hope to have time to elaborate more on this a bit later.


Thursday 19 September 2013

Using Enterprise Architecture at a Media Company (part two, Zachman framework)

As mentioned in part one, the Zachman framework is a taxonomy for organizing architecture artifacts.  In this blog post, we will discuss how we can use the Zachman framework to guide our thinking about what needs to change in the Sphere's (our imaginary media company) architecture to accommodate a metered pay wall.

If you are not familiar with the Zachman framework, the Wikipedia entry is a good place to start. This article also provides an interesting take on Zachman.  Please note that all the information about the Zachman framework used below was either taken from publicly available sources or from discussions with other enterprise architects. The information about the Sphere's metered paywall system is based on an actual implemented system, with some simplifications.

Please note that this blog entry is a bit of a work in progress.  I'm hoping to improve it a bit and perhaps add some diagrams.  Hopefully it is not too long for people to read.

Perspectives, Fundamental Questions and Paywalls

The Zachman framework is conceptually a grid, whose cells represent types of architectural artifacts (e.g. written documentation, diagrams, models, etc). You don't need to create artifacts to fill in all the cells if it is not useful. However, I find it is helpful to think about all the cells to try and figure out if we are forgetting some implication of a business requirement on our architecture. 

The rows in the grid are the perspectives from which the architecture is viewed, or alternatively the stakeholders involved in getting something planned and built.  The generic names for the perspectives are: planner, owner, designer, builder, subcontractor, and enterprise. It is Zachman's assertion that these perspectives/stakeholders exist whether you are architecting a company, a product, a building or a software system.  Typically, for software systems, we use the following more specific terms: scope, business model, system model, technology model, detailed implementation, and functioning enterprise.  It's a bit counter-intuitive, but the final row ("functioning enterprise") represents the completed product or software system and therefore does not contain any architectural artifacts.

The columns in the grid represent fundamental questions that need to be answered for each perspective/stakeholder. The columns are often labelled as what, how, where, who, when, why.  Again, for enterprise architecture projects there is a more specific labelling (which, in my opinion, doesn't completely make sense for all perspectives): data, function, network, people, time and motivation.  I think it is useful to remember both sets of labels for the fundamental questions, as sometimes the label from one set is more intuitive than the label from the other for a given perspective.

In the sections below, we will run through each of the cells, starting with the top row and considering the columns in the order: what, how, where, who, when, why.  Ordering the columns this way is done only to make this blog entry easier to understand.  I am not trying to break the Zachman rule that "columns have no order".

The Planner or Scope Perspective

Let's start by considering how our paywall requirement affects the architectural artifacts in the first (top) row of the Zachman framework. The top row represents the planner or scope perspective and answers the fundamental questions from the point of view of a business planner or a project manager during project pre-planning. This row might also be useful in agile methodologies to orient the agile team prior to the first sprint.  I guess it could also be argued that, in agile, this knowledge would be held by the product owner, or perhaps the leader of all the product owners.

The "What" or "Data" fundamental question

The first cell in the top row is the "what" or "data" cell. Let's assume that the artifact in this cell is the list of entities known to the company. The Sphere has not previously had digital subscriptions to its website.  Therefore, we need to add "digital subscriber" to the list of entities known to the company.

The "How" or "Function" fundamental question

The second cell in the top row is the "how" or "function" cell.  Let's assume that the artifact in this cell is a list of business processes. We are going to need new business processes to manage the digital subscriptions and do billing. We also need some sort of business rule that determines when non-digital subscribers will be blocked from seeing an article.

The "Where" or "Network" fundamental question

The third cell in the top row is the "where" or "network" cell.  Let's assume that the artifact in this cell is a list of places that the Sphere does business.  The new paywall probably doesn't change this.

The "Who" or "People" fundamental question

The fourth cell in the top row is the "who" or "people" cell. Let's assume the artifact in this cell is a list of the organization units of the company and the company's business partners. The Sphere has an internal development team to write the paywall software. The Sphere's existing credit card billing service can handle billing the new digital subscriptions, although this should be confirmed. The Sphere's SAP Team within the IT department has also confirmed they can handle the billing through SAP.  However, it appears that the Sphere's call centre is not adequately staffed to handle complaints from digital subscribers. Therefore, we must consider whether we are going to expand the call centre or outsource customer care for the new digital subscribers.  After the business planner consults the executive sponsor for the paywall project, he/she informs the enterprise architect that a new organization needs to be added to the artifact in this cell:  a call centre outsourcer.

The "When" or "Time" fundamental question

The fifth cell in the top row is the "when" or "time" fundamental question. Let's assume that the artifact in this cell is a list of all the cycles (or repeated processes and recurring deadlines) in the Sphere's business. There are two basic cycles implied by the business processes defined in the "how" cell:  the cycle that controls how frequently non-digital subscribers who get blocked from seeing articles are unblocked, and the cycle that controls how often digital subscribers will be billed. Let's discuss the second cycle a bit further. The business planner consults with the SAP architect (because SAP is used for billing at the Sphere) and discovers that the existing print subscriber billing cycle could be used for digital subscriptions, but that a print subscriber would need a separate digital subscription if subscribed to both print and digital products.  The planner consults the executive sponsor, and the executive sponsor is concerned that this will create problems offering bundled subscriptions (i.e. paying a single, discounted price for both digital and print subscriptions).  After much discussion with the SAP architect, they decide a trade-off can be made that combines both subscriptions into one, but which slightly reduces the amount of revenue that the Sphere will collect from bundled subscriptions.  The executive sponsor indicates this is ok in the short term, and so the enterprise architect notes in the artifact that the existing print billing cycle will be used initially, but that another billing cycle may need to be added in the future. Although the actual architectural artifact didn't change much, the process of considering the "time" fundamental question during planning raised an important issue that was resolved before the project got underway.

The "Why" or "How" fundamental question

The sixth cell in the top row is the "why" or "motivation" fundamental question.  Let's assume that the artifact in this cell is some sort of list of general business strategies.  Adding a metered paywall is a substantial change in business strategy.  Something along the lines of the initial few paragraphs of the first part of this series should be added to the list of general business strategies to explain why the Sphere is building a metered paywall.

The Owner or Business Model Perspective

Let's now move on to the second row in the Zachman framework, which represents the "owner", "business owner" or "business model" perspective. Zachman and others have used a construction analogy for this perspective, comparing it to that of the owner of a building being designed.  The building owner cares about things such as which way the windows are facing, how the building is partitioned into rooms, etc., but does not necessarily care about where the support columns are or where the water pipes are run. In the same way, a (business or product) owner of a software project cares about what the software does, but not necessarily about whether it uses an Oracle or SQL Server database. In my opinion, people who discuss the Zachman framework often think of a business analyst in the owner role, because the architecture artifacts tend to be things analogous to high level entity-relationship diagrams or data flow diagrams. Also in my opinion (which may not be the opinion of many other practitioners, to be fair), the owner is often a business or product owner who wants to see mockups and sometimes market research rather than data flow diagrams. However, if the owner is, in fact, a business analyst, then high level data-flow diagrams and entity-relationship diagrams may be the correct approach.

The "What" or "Data" fundamental question

We return to the first column in the Zachman framework, but this time from the "owner" or "business model" perspective. In the planner perspective, we dealt with this same fundamental question by adding a new type of entity called a digital subscriber.  Assuming the owner is a business analyst, the artifact in this cell may be a document describing entities and some of their high level attributes and perhaps their relationships to other entities. In this case, we probably want to add a digital subscriber entity to the document, as well as some of the attributes (information) that the owner has decided should be collected when a digital subscriber signs up. In the modelling exercise that this document is based on, using the Zachman framework resulted in a spirited discussion of how much information should be gathered for a digital subscriber. It was useful to have this discussion early in the design process. The business owner also had a marketing research firm produce profiles of imaginary digital subscribers.  In my opinion, these could also be considered architectural artifacts that would fit into the "what" or "data" cell in the owner perspective.

The "How" or "Function" fundamental question

The second cell in the owner or business model perspective's row is for artifacts which define business processes from a high level business perspective. Sometimes the artifacts are something very similar to high level data flow diagrams, showing business processes and how they accept inputs from and pass outputs to each other. If we use this approach, we will need to add details about the processes we defined for the cell above this one in the planner row. These processes would probably include signing up a subscriber, modifying the information for an existing subscriber, cancelling a subscription, performing periodic billing for a subscriber and cancelling a billing.  In my opinion, an artifact consisting of UML use cases or agile user stories might work just as well and might be easier for some business owners to deal with. 

The "Where" or "Network" fundamental question

The third cell in the owner or business model perspective answers the fundamental question "where" and is usually concerned with the locations a company operates from and the logistics between those locations. Because the business owner has decided that the customer care team for digital subscribers will be outsourced, it is probably wise to add the outsourced customer care team to this artifact.  We will assume for now that the logistics will consist of a dedicated network connection between the headquarters of the Sphere and the outsourcer.

The "Who" or "People" fundamental question

The fourth cell in the owner or business model perspective answers the fundamental question "who" and has artifacts which show the interactions between the people involved in the system. The people are usually grouped somehow, possibly into departments or other organizational units, and workflows are shown between the groups. For our metered paywall, digital subscribers will interact with customer care agents, so digital subscribers and customer care agents will need to be added to the artifact, as will the basic workflows that occur between them (creating a subscription, modifying a subscription, stopping a subscription).

The "When" or "Time" fundamental question

The fifth cell in the owner or business model perspective answers the question "when" and has artifacts which show the cycles and (therefore implicitly) the critical recurring deadlines for the company. When we looked at the "when" fundamental question in the planner perspective (the cell immediately on top of the one we are dealing with now), we determined that we would use the print subscriber billing cycle for digital subscribers. The other cycle that needs to be considered has to do with non-digital subscribers viewing digital content.  The metered paywall will block them (and ask them to subscribe) after they have viewed a certain number of articles in a given time period.  The business owner needed to decide at this point what this time period would be. The business owner actually threw us a curve ball and said there should be different types of articles, some of which could be viewed without restriction by non-digital subscribers, others that would be subject to a limit and still others that would never be viewable by non-digital subscribers. It was good we discovered this!  We had to go back to the "what" cell and add some attributes for articles, and go back to the "how" cell and modify our article viewing process a bit.  Obviously, it would have been better to catch these when we were considering the "how" and "what" fundamental questions, but we still caught them relatively early.  Realistically, I think that modifying or creating the artifacts for a single perspective is a bit of an iterative process in which, like in this case, a discovery while working on one cell may affect a cell that you have already been working on.

The "Why" or "Motivation" fundamental question

The final cell in the owner or business model perspective answers the question "why" or explains the motivation from the business owner's perspective. The artifact for this cell is often a business plan. Business plans vary in their content, but usually describe what is to be done, why it needs to be done (usually to reduce costs or drive revenue or both) as well as some basic targets, along with any necessary strategies to meet these targets. In the case of the metered paywall, most of this information had been prepared prior to us starting the Zachman process, and we were mostly able to use existing documents/e-mails to create a business plan, which can be added to the set of business plans that we maintain for this artifact.

The Designer or Logical Model Perspective

If you've done any data modelling or solutions architecture, the perspectives we've just covered may cover what you consider to be "requirements", which may have been gathered by a business analyst or been part of the knowledge of the product owner.  The designer or logical model perspective is typically the point at which architects and data modellers get involved with a project, and many (possibly all) of the artifacts in this perspective will be familiar to them.

The "What" or "Data" fundamental question

The "what" or "data" cell in the logical model perspective contains at least one artifact which describes entities and their relationships.  Not surprisingly, an entity-relationship diagram might very well be used here. I like to use class diagrams (from UML), but mostly leave out the methods in this cell.  This is because we have a tool for easily drawing class diagrams and we hoped it would save us some time when we got to the next cell and the next perspective (but I'm getting ahead of myself here and sort of breaking the Zachman rule that an artifact can only go into one cell). We created classes for digital subscriber and article group (which would define both a set of article groups, as well as the maximum number of articles in that group that a non-digital subscriber could access). We added an attribute to the article class which defined which article group an article was in. We added a class that counts the number of articles seen by a subscriber in a particular article group. We also created an sap billing document class to hold any billing information that might need to be passed to SAP.  We finished by adding a few easy relationships:  an article has a many to one relationship with an article group, a digital subscriber has a one to many relationship with an SAP billing document, etc.

The "How" or "Function" fundamental question

The "how" or "function" cell in the logical model perspective contains at least one artifact which describes the user visible functionality of our systems. I've seen people use essentially block diagrams of the components of an application with flows indicating the information that a user can access in each component, as well as sometimes the information that the components pass between themselves. Because we have already started to think in terms of classes, a UML sequence diagram can, in my opinion, be an acceptable artifact for this cell.  Sequence diagrams in UML show how user tasks or business processes are executed by classes (which correspond somewhat to Zachman's idea of application components). The sequence diagram ideally can borrow from the cell above it, which contains business processes and use these as the processes it is illustrating.  In our case, this meant that we had to do a sequence diagram for the subscription sign up process and the process that happens when a user (either digital subscriber or not) views an article. The sequence diagram can also borrow from the "what" cell in the same row and use the classes defined there as the things in the sequence diagram that make method calls to perform a business process. Realistically, when you start building the sequence diagrams, you may find that you are missing classes and that is exactly what happened to me (We realized we needed a class that counts the number of articles accessed in an article group by a particular user, and then we added it to the class diagram in the "what" cell).

The "Where" or "Network" fundamental question

The "where" or "network" cell in the logical model perspective contains at least one diagram that shows how distributed system components (if any) communicate by drawing lines between them, for example, a web server will often communicate with an application server, which may in turn communicate with a database.  In our case, we had to add a link to SAP to retrieve billing information. This raised PCI concerns (because our SAP system stores credit card information and must follow PCI requirements) and so we ended up moving some of the billing functionality inside our PCI web server environment and including it on the diagram. We also realized that our existing web cache servers were not going to allow us to block non-digital subscribers from seeing articles, so we had to modify our distributed system diagram to show the use of a third party CDN (Akamai) that had the required capability.

The "Who" or "People" fundamental question

The "who" or "people" cell in the logical model perspective contains at least one artifact that gives a high level (or architectural) view of the user interface. Zachman states that this artifact should model roles which are connected to deliverables. In my experience most user interface/usability professionals don't really know what this means.  For the metered paywall project, we used the wireframes (simple UI mockups with minimal design detail) and basic representations or mockups of the interactions that our two types of users (digital subscriber and non-digital subscriber) would have with the system (which can be done as a series of powerpoint slides if you want). For example, at one point we (or more correctly the product manager and ui team) developed a very simple set of powerpoint slides showing the rough series of web pages that a user sees to sign up as a subscriber. Similarly, we used the initial mockups of the web page that a non-digital subscriber gets when they try to access an article that would cause them to exceed their free article quota. These artifacts are a lot more visual (and perhaps more concrete) than what Zachman seems to have intended, but they are easier for usability professionals to create and work with. 

The "When" or "Time" fundamental question

The "When" or "Time" cell in the logical model perspective contains artifacts that describe the way the business cycles uncovered in the corresponding cell in the business owner perspective will be mapped to the cycles of the Sphere's systems. We first determined that, because of the requirements in the previous perspective, we would need to use the print subscriber billing cycle for digital subscribers. This cycle is largely within an SAP ERP system and therefore we engaged the SAP architect to help us with the logical model. We discovered that we would need to define the typical billing frequency, that is, the time that elapses between billings if a subscriber does not temporarily suspend their newspaper. We also discovered that, because of the requirement to bundle digital and print subscriptions, subscribers with bundled subscriptions would essentially extend their billing date if they suspended their newspaper. Many people thought this was not completely desirable, however, after much discussion, we decided that the alternatives were too costly. Therefore, we added this cycle to our logical cycles artifact (a spreadsheet), noting that it was tied to the print subscription cycle and that it would be affected by newspaper suspensions when a customer had a bundled subscription (in Zachman terms the cycle is controlled by the receipt of recurring payment event and the event generated when the subscribers total payment is amortized).  We also flagged this as something that should be revisited later.  The other cycle that was uncovered in the previous perspective was the period for which a non-digital subscriber would be blocked from seeing some articles after their they exceeded their allowance of free articles. As discovered in the previous perspective, the allowance should reset back to the maximum at the beginning of each month. Therefore, we added a cycle to our logical cycles artifact to reset the number of free articles at the beginning of each month (we can also say that the reset is triggered by a "beginning of month" event).

The "Why" or "Motivation" fundamental question

The "Why" or "Motivation" cell in the logical model contains business rules which can be implemented in a rules engine or possibly code (but we are getting ahead of ourselves). Formal specifications for business rules do exist, but we generally just try to use concise, english statements in the artifact (generally a spreadsheet) for this cell. 

Here is a sample of some of the business rules that we discovered by reviewing the business case and getting the necessary clarifications from the product owner (a code sketch of these rules follows the list):
  • Each article in the system will be assigned one of three colours: green, yellow or red.
  • Only digital subscribers can view green, yellow and red articles without restriction.
  • Non-digital subscribers cannot view red articles.
  • Non-digital subscribers can view green articles without restriction.
  • Non-digital subscribers can view only n unique yellow articles per month, where n should be a configurable threshold.
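
These rules are simple enough to sketch directly in code. Here is a minimal illustration (the function, the way view counts are stored, and the value of n are all my own assumptions); note that keying the counts by month also gives us the beginning-of-month reset discussed earlier:

```python
# A sketch of the paywall business rules above. The storage of view counts
# and the value of n (the yellow article limit) are invented; only the
# colour rules themselves come from the business case.
YELLOW_MONTHLY_LIMIT = 10  # "n": a configurable threshold

# (viewer_id, month) -> unique yellow article ids viewed so far that month;
# keying by month means the allowance resets at the beginning of each month.
yellow_views: dict[tuple[str, str], set[str]] = {}

def may_view(viewer_id: str, is_digital_subscriber: bool,
             article_id: str, colour: str, month: str) -> bool:
    if is_digital_subscriber:
        return True   # digital subscribers view all colours without restriction
    if colour == "green":
        return True   # unrestricted for everyone
    if colour == "red":
        return False  # never viewable by non-digital subscribers
    # colour == "yellow": at most n unique articles per month
    seen = yellow_views.setdefault((viewer_id, month), set())
    if article_id in seen or len(seen) < YELLOW_MONTHLY_LIMIT:
        seen.add(article_id)
        return True
    return False

print(may_view("visitor-7", False, "a-123", "yellow", "2013-10"))  # True
print(may_view("visitor-7", False, "a-456", "red", "2013-10"))     # False
```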

The Builder or Technology Model Perspective

I've always had some trouble distinguishing this perspective from the logical model perspective above it and the detailed implementation model perspective beneath it. I think part of the problem is that we often skip parts of the technology model perspective when we do actual projects, because it is easier to think about something either as part of the logical model or the detailed implementation. The trick I use to try and figure out this perspective is to go back to Zachman's construction (of a building) analogy. The technology perspective corresponds to the builder's perspective. A builder needs to know the materials that will be used, but not necessarily the exact details of how they will be fitted together to make a building. Extending this to technology projects, the technical leads need to know what technologies will be used, for example, Java classes (possibly with methods defined) and Oracle tables, but not the exact algorithms or the data types and indexes of the Oracle database.

The "What" or "Data" fundamental question

I have a bit of a bias towards UML, so I like to use a class diagram as the artifact for this, but a detailed ERD might be more appropriate if you are a purist. When I use a class diagram, I generally only include the classes that are actually persisted (that is, saved into some sort of database or file), as well as how they will be persisted (e.g. file, Oracle, Solr, Cassandra, HBase, MongoDB, etc.). Since the metered paywall is part of a larger web system, the classes or entities for it can be added to the class diagram or ERD for the whole web system (assuming it exists!).

The "How" or "Function" fundamental question

Again, I like to use a class diagram as the artifact for this. However, it should include all known classes and their relationships (cardinality and inheritance), and the classes should be annotated with the language in which they will be written, a description of what the class does, and the methods and attributes of the class to the extent that these can be known at this point. Again, since the metered paywall is part of a larger system, its classes would normally be added to the class diagram for that larger system.

The "Where" or "Network" fundamental question

The existing system that we are adding the metered paywall to has a somewhat complex distributed architecture, for which there is an existing network diagram. As previously discussed, the metered paywall will require a new content distribution network (Akamai), which we will need to add to the network diagram. We will need to pass information to Akamai about whether the user is a digital subscriber and, if not, how many yellow and red articles they have viewed so far this month. We will also need to tell Akamai what type of article (green, yellow, red) the user is viewing. These flows were noted on the network diagram (for the larger web system of which the metered paywall is a part), with additional annotations that encrypted cookies will be used to communicate whether the user is a digital subscriber as well as the number of yellow and red articles that they have viewed so far this month. The encryption algorithm will be triple DES, and another flow will be added to the network diagram to indicate how the key is exchanged. The colour or group to which an article belongs will be sent using Akamai's edge side includes. This will also be indicated on the network diagram.
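To illustrate, here is a rough sketch of cookie encryption with triple DES using the standard javax.crypto API. This is not the Sphere's actual code, the cookie payload format shown is made up, and a real deployment would want CBC with an IV rather than the ECB mode used here for brevity:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.xml.bind.DatatypeConverter;

// Hypothetical codec for the counter cookie. The key would come from the
// key exchange flow noted on the network diagram.
public class CounterCookieCodec {

    private final SecretKey key;

    public CounterCookieCodec(SecretKey key) {
        this.key = key;
    }

    public String encrypt(String plaintext) throws Exception {
        Cipher cipher = Cipher.getInstance("DESede/ECB/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] encrypted = cipher.doFinal(plaintext.getBytes("UTF-8"));
        return DatatypeConverter.printBase64Binary(encrypted); // cookie-safe text
    }

    public String decrypt(String cookieValue) throws Exception {
        Cipher cipher = Cipher.getInstance("DESede/ECB/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, key);
        byte[] decrypted = cipher.doFinal(DatatypeConverter.parseBase64Binary(cookieValue));
        return new String(decrypted, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = KeyGenerator.getInstance("DESede").generateKey();
        CounterCookieCodec codec = new CounterCookieCodec(key);
        String cookie = codec.encrypt("subscriber=false;yellow=3;red=0");
        System.out.println(codec.decrypt(cookie)); // subscriber=false;yellow=3;red=0
    }
}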

We also need to add a network flow between our SAP environment and the PCI environment that will be used to serve the web pages that allow a user to subscribe. We augmented this link with a note that communication will be by sending encrypted files, in order to make PCI compliance easier.

The "Who" or "People" fundamental question

We used the detailed screen mockups done by the design group for this artifact. For convenience, these can be sequenced in PowerPoint presentations to show how the system works for various types of users in various scenarios. For example, PowerPoint presentations were built to show the subscription process as well as the user experience when a non-digital subscriber exceeds their monthly quota of yellow articles. I also like to include some web architecture guidelines (such as how to use jQuery and JavaScript, the need for CSS, etc.) along with the mockups to round out the set of artifacts for this cell.

The "When" or "Time" fundamental question

The artifact for this cell is a spreadsheet listing the cycles of the larger content delivery system, with some information on what components will be used to implement them. After consulting with the SAP functional analysts, we decided that the digital subscriber billing cycle will be done as part of the subscription monitoring process in SAP. After consulting with the web architects, we decided that resetting the monthly quota of yellow articles for non-digital subscribers will be done as part of the Akamai edge side include logic. These decisions were noted in the spreadsheet.

The "Why" or "Motivation" fundamental question

The artifact for this cell is a spreadsheet that contains business rules along with information on how they will be implemented. We decided not to embed a rules engine in the code for the metered paywall. Once this decision was made, the only way to implement the rules is to modify code in one of four places: JSP or Java on the application server, JavaScript on the front end, or the Akamai edge side includes. We decided to implement the rules that I discussed in the system model perspective by using edge side includes and Java in the application server. It needs to be done in both places because articles that expire from the Akamai cache (where the edge side includes run) sometimes get fetched from the application server.

The Subcontractor or Implementation Model Perspective

The artifacts in this perspective are essentially very detailed models or descriptions of what is to be implemented. In my experience, many of these models may not actually be stored anywhere -- they might exist on a whiteboard for a short period of time, they may be rough drawings on scraps of paper, or they may just be conversations between members of an (often agile) team. In some cases, I think that some of the artifacts in the implementation model perspective never leave a technologist's mind. In the metered paywall project, these models were mostly done on whiteboards or in conversations, as I mentioned above. However, I will discuss ways in which they could have been recorded (and may have been in some cases).

The "What" or "Data" fundamental question

Some of the data that was persisted in the metered paywall project was stored in Oracle, while other data elements were in Cassandra (a NoSQL database). The creation scripts for the Cassandra keyspaces and column families and the Oracle tables are actually a good artifact for this cell, in my opinion.  They get preserved as part of the code deployment process, so they remain available after the project is done.

The "How" or "Function" fundamental question

As an architect, I would have preferred if the UML models developed in previous perspectives for this fundamental question had been augmented with comments and possibly pseudo code for the methods and used to produce code. In reality, there were no artifacts produced for this question as the agile methodology used by the development teams favours working code over documentation. I don't think that Zachman envisioned the effects of agile development on his framework :-).

The "Where" or "Network" fundamental question

The primary architectural issue for this question is the connection to Akamai (our content delivery network) and the information that needs to be passed via cookies and contained in Akamai's edge side includes and configuration so that only digital subscribers have unrestricted access to yellow or red articles.  Because some of this work required collaboration between the Sphere's developers and Akamai, network diagrams and detailed written descriptions of cookie formats had to be produced, even though the in-house teams were using an agile methodology. In some ways, our relationship with Akamai was essentially what I believe Zachman envisioned when he created the subcontractor perspective.

The "Who" or "People" fundamental question

In the implementation model perspective, this fundamental question intuitively should be something like a specification for the user interface widgets, as this follows nicely from the previous perspective (at least in my opinion). Many user interface technologies are built by responding to user events (this includes web technologies and, perhaps surprisingly, SAP screens), so a diagram or document that shows user interface widgets and how user interface events are processed would seem to be a rational choice of artifact for this fundamental question. In an agile project, like the metered paywall described in this blog entry, it is very likely that these things are discussed but not written down (due to the agile preference for working code and face-to-face discussion over documentation).

I have read discussions of the Zachman framework that argue that security architecture artifacts should be placed in this cell. To me, this doesn't quite seem intuitively correct; however, security is important and it definitely needs to go somewhere. In the metered paywall project, we dealt with security in the logical and implementation perspectives and in the "network" fundamental question in this perspective. In systems like SAP ERP and NetWeaver (which have extensive role-based security), it is possible to separate out the security configuration and include security artifacts as the answer to this fundamental question. In SAP, for example, we typically have a spreadsheet that lists users and the roles that are assigned to them, and then another spreadsheet that lists roles and describes the authorization objects. In SAP this is definitely an implementation level document that is often completed only a short time before a system goes live (and sometimes not until after go-live, unfortunately). Therefore, I think that using this cell for security artifacts can make sense, but it requires a fairly sophisticated security subsystem that you can configure separately from everything else. This is not always present in web projects and wasn't part of the metered paywall system that this document describes.

The "When" or "Time" fundamental question

The artifact for this fundamental question in most representations of the Zachman framework that I have seen is a fairly low level (almost assembly language) specification of how events/periodic processing should be implemented.  The metered paywall project is a combined web and SAP project and so it does need to deal with concepts at that low a level.  Earlier in our analysis, we decided that the billing cycle for digital subscribers would be tied to that of print subscribers and implemented on our SAP system. The implementation model artifacts for doing this in SAP are very well defined:  an initiated change control ticket with the necessary SAP configuration objects defined. The required code changes will later be attached to this ticket and the ticket will be retained indefinitely, making it a very suitable artifact. 

At the start of every month, we also need to reset to zero the count of yellow articles that non-digital subscribers have viewed, allowing them to view as many yellow articles as the threshold permits in the new month. We decided to do this by modifying Java code and having Akamai modify configuration and edge side include code. Again, because we involved Akamai, we produced some documents that outlined, in a sort of pseudo code, how code should be written to detect when a user's counter cookie was from a previous month and then reset the count in the cookie. This pseudo code could serve as an artifact for this cell.
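For what it's worth, the equivalent check on the application server side might look something like this minimal Java sketch (the real logic lived partly in Akamai's edge side includes, and the names here are hypothetical):

import java.util.Calendar;

// Hypothetical sketch of the month-rollover check from the pseudo code: if the
// counter cookie was written in a previous month, the count starts over at zero.
public class MonthlyQuotaReset {

    // Returns the yellow article count to use now, given the count and the
    // time at which the cookie was last written.
    public static int currentCount(int countFromCookie, long cookieWrittenMillis, long nowMillis) {
        Calendar written = Calendar.getInstance();
        written.setTimeInMillis(cookieWrittenMillis);
        Calendar now = Calendar.getInstance();
        now.setTimeInMillis(nowMillis);
        boolean sameMonth = written.get(Calendar.YEAR) == now.get(Calendar.YEAR)
                && written.get(Calendar.MONTH) == now.get(Calendar.MONTH);
        return sameMonth ? countFromCookie : 0; // new month, so the quota resets
    }
}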


The "Why" or "Motivation" fundamental question

As with the "How" fundamental question, the agile development process that we use means that there were very few recorded artifacts for this fundamental question.  The business rules that we identified in the previous perspectives were simply implemented by developers by modifying the necessary java, javascript, or edge side include code.  It would be possible to create some pseudo code showing how the classes, javascript and edge side include code was modified and this could be the artifact for this fundamental question.

Some closing thoughts

This post was a lot longer than I thought it would be when I started. The Zachman framework can produce a lot of documentation, and I guess even trying to describe it at a high level (as I have done above) can take quite a few words.  Overall, it probably does save some time by forcing you to think of things up front.  However, it can be hard to justify using it, because it adds overhead to the beginning of  a project and produces little in terms of demonstrable results.  When starting a project, I like to at least mentally run through the perspectives and fundamental questions. Even if there is not time to properly produce each model, I find that it is a useful tool for thinking about a project.

The Zachman framework has been criticized for not really defining a process for enterprise architecture.  In the next post in this series, I will talk about my attempts to take ideas from TOGAF to create a process.  I also experimented a bit with lean Enterprise Architecture methods, and hope to produce a blog post on that as well.

Tuesday 17 September 2013

Using Enterprise Architecture at a Media Company (part one)

This post gives a fairly small example of how Enterprise Architecture can be used in a media company. This is based on actual work, but, as they say on TV, "the names have been changed to protect the innocent". In order to keep the size of this post (and the other parts) manageable, I'm only going to take a single aspect of the Enterprise Architecture for a company which I will refer to as "the Sphere". I'm also only going to deal with it at a fairly high level, in order to make the example clearer, although hopefully it will be straightforward to see how it could be made more detailed.

The Problem

The Sphere runs a newspaper and a website, which has content roughly similar to the newspaper. People are no longer paying for the newspaper because they can read all articles on the website, which generates revenue from digital advertising. Digital advertising, however, brings in only a small percentage (15%) of the revenue necessary to support the company's costs; ads in the newspaper have traditionally brought in about 70% of the Sphere's revenue, with newspaper subscription fees bringing in the remaining 15%. The Sphere's print advertisers have realized that people aren't reading newspapers as much as they once did, and are therefore shifting their advertising dollars elsewhere, seriously impacting the largest source of revenue at the Sphere.

A metered paywall is software that allows website visitors to see a certain number of articles without paying, but requires visitors to pay to see additional articles.

The Sphere hopes that, by introducing a metered paywall to their website, they will encourage people to continue to buy the newspaper (thus making it more appealing to print advertisers), and get revenue from a new source: subscriptions to the website that allow visitors to see as many articles as they want.

What's an Enterprise Architect to do?

Enterprise architects need to make sure that a company's technology is aligned with its business strategy. Once the decision makers at the Sphere decide to add a metered paywall to the website, the enterprise architect must take this information and determine how to adapt the company's technology to it. It is possible that existing technical architectures may need to change and/or new components or technologies may have to be architected and built. Ideally, we do this in a systematic and disciplined way.

Usually, this involves examining and manipulating (that is, changing or adding to) existing architecture "artifacts", which are usually diagrams, written documents, or models constructed using UML or some other methodology. Since the artifacts represent the company's technology, this gives the enterprise architect a way to think about what needs to be done and hopefully not forget anything. The enterprise architect can then begin discussions with business stakeholders, other enterprise architects, solution architects, development managers and developers to decide what must be done. In my opinion, an enterprise architect does not necessarily solve problems, but instead uncovers and frames them (and possibly has a recommended solution or two in mind).

It often helps if there is a sort of architectural change control process that specifies how the above happens, so that it doesn't happen in an ad hoc way every time business strategy changes.

It can be helpful to use two fairly well known approaches to deal with the artifacts and create a sort of architectural change process. The somewhat inappropriately named "Zachman Framework" is actually a taxonomy for organizing architectural artifacts, making sure they are well defined (that is, don't overlap), complete and that business requirements align with the resulting architectural designs (and the technologies that get built).  The TOGAF framework is a sort of strategy for building an architectural change process, which might or might not use the Zachman Framework.

We will look at how we can use the Zachman Framework to handle the Sphere's need for a metered paywall in the next part of this series.  We will look at how we can add the TOGAF framework to provide a sort of architectural change process in the third part.

Continue on to part two.



Sunday 15 September 2013

Cassandra as distributed cache?

NoSQL was born as a sort of reaction to the architectural design pattern in which you put a cache (such as Ehcache Enterprise, Redis, or memcached) in front of a relational database in order to better scale the database. One of the basic rationales for NoSQL is that, if a cache is sufficient to handle most of your database queries, then you don't really need a relational database. NoSQL then goes one step further and says that, if you can live without some of the relational database features, then you can trade them off for other useful capabilities like replication.

At the moment, the company which I work for is having trouble with the caching solution we are using in front of our relational database. I don't want to name the solution we are using, because we are not using it properly and the problems we are having are therefore more of our own making. However, we are looking at moving much of our infrastructure into an IaaS cloud solution (possibly Amazon AWS, Google Compute Engine or Rackspace). Our existing caching solution is not well suited to multi-datacentre deployment (which is probably one of the big advantages of using IaaS), so we need to look for something else.

Cassandra is really well suited to this type of cloud deployment for a number of reasons. The Cassandra data model can easily support a key-value store (we will talk more about the Cassandra data model later) and it is possible to put time-to-live (ttl) values on Cassandra columns, which means we can have cached values automatically expire. One big advantage of Cassandra over some key-value stores is that it can flexibly shard and replicate the data to multiple nodes and multiple data centres.
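As a quick illustration of the TTL idea, here is a hypothetical sketch of a key-value cache on top of a Cassandra table, written against a 2.x-era DataStax Java driver. The keyspace, table and column names are made-up assumptions:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// Hypothetical cache on top of a Cassandra table, assuming this schema:
//   CREATE TABLE cache.entries (cache_key text PRIMARY KEY, cached_value text);
// The per-write TTL means expired entries simply disappear -- no eviction code.
public class CassandraCache {

    private final Session session;
    private final int ttlSeconds;

    public CassandraCache(Session session, int ttlSeconds) {
        this.session = session;
        this.ttlSeconds = ttlSeconds;
    }

    public void put(String key, String value) {
        session.execute(
            "INSERT INTO cache.entries (cache_key, cached_value) VALUES (?, ?) USING TTL " + ttlSeconds,
            key, value);
    }

    public String get(String key) {
        Row row = session.execute(
            "SELECT cached_value FROM cache.entries WHERE cache_key = ?", key).one();
        return row == null ? null : row.getString("cached_value");
    }

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        CassandraCache cache = new CassandraCache(cluster.connect(), 300); // five minute TTL
        cache.put("session:42", "some serialized session state");
        System.out.println(cache.get("session:42"));
        cluster.close();
    }
}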

The multi data centre support is very useful. Cloud providers generally allow you to deploy to n data centres, where n is larger than two. You can get really good fault tolerance by dividing your infrastructure into n separate and autonomous units (that I like to call "pods"), putting each one into a separate data centre and then doing load balancing between them (most IaaS providers give you a way to do the load balancing fairly painlessly). This is a pretty powerful idea because you can potentially run on cheaper, smaller cloud instances and you don't need to effectively double your infrastructure like you often do when you deploy to two data centres. Assuming you have n pods, you can probably size your instances so that your applications can run using (n-2) pods. For example, with n = 6 pods sized so that any four can carry the full load, you provision 6/4 = 1.5 times your peak capacity, versus the 2 times needed when each of two data centres must be able to run everything alone. Assuming you can get n > 6, you will likely spend less than you would by spreading your infrastructure over two data centres, which requires that you have enough infrastructure in each data centre to run in the absence of the other data centre.

As hinted earlier, Cassandra has the concept of data centres, and makes it easy to put at least one complete copy of your data in each. My thinking is that each pod should be configured as a single Cassandra data centre.  I'm not sure whether it makes sense to have more than one copy of the cached data in each Cassandra pod, because if you have six pods, you will potentially have six copies of your data, which is plenty.  Assuming there is reasonable connectivity between the pods, a Cassandra node failure will cause at least some of the data to be fetched from a different pod, which may be ok.

When cached data is updated in Cassandra, it will be replicated within a few milliseconds to the other pods. There is a risk of nodes in other pods getting stale cached data, which needs to be considered. Typically, I suspect that we will want to make user sessions somewhat sticky to the pod that they initially connect to, which should lower the risk a bit.

Another issue I can see, based on my organization's use of distributed database caches, is that we will sometimes need to invalidate a cache (remove all its entries). I can think of quite a few Cassandra data models that would allow you to invalidate a particular cache, but perhaps it is simpler if we keep each cache in its own column family. We could then drop and recreate the column family to clear or invalidate the cache. I guess we could also just truncate the column family, but my experience with the truncate operation is that it does not work really well on multi-node clusters (it works pretty well on single-node clusters, though I am sure most people don't have those in production).

Most distributed caches also allow you to place an upper limit on the number of items in a cache.  This is generally done to conserve memory.  In Cassandra, the cache can spill to disk, so memory is less of a concern. However, it might still be desirable to have a limit on the cache size. One way to do this is to have a row (called an "all_keys" row, probably using the row key "all_keys") in each cache's column family whose column keys are a time stamp (representing cache insertion time) concatenated with the cache key for each entry in the cache. These columns would have the same time to live (ttl) as the cached data.  We could also define a counter column in each cache's column family which would keep track of the current number of elements in the cache. When this counter exceeds a certain value, we could have a daemon delete the oldest entries from the cache's column family.  These could be determined by doing a column slice on the all_keys row. Having the "all_keys" row would allow us to invalidate the cache by doing a column slice to get all the cache keys and then deleting all the rows, instead of dropping and recreating the column family.
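In CQL3 terms, the "all_keys" row might be rendered something like the sketch below, where one partition per cache keeps insertion times and keys in sorted order. The schema and names are illustrative assumptions, and the counter column and the deletion logic are left out:

import java.util.Date;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

// Hypothetical CQL3 rendering of the "all_keys" idea, assuming this schema:
//   CREATE TABLE cache.all_keys (
//     cache_name  text,
//     inserted_at timestamp,
//     cache_key   text,
//     PRIMARY KEY (cache_name, inserted_at, cache_key));
// One partition per cache, clustered by insertion time, so the oldest entries
// come back first from a simple limited select (the equivalent of a column slice).
public class CacheIndex {

    private final Session session;

    public CacheIndex(Session session) {
        this.session = session;
    }

    // Record an insertion, using the same TTL as the cached data itself.
    public void recordInsert(String cacheName, String cacheKey, int ttlSeconds) {
        session.execute(
            "INSERT INTO cache.all_keys (cache_name, inserted_at, cache_key) " +
            "VALUES (?, ?, ?) USING TTL " + ttlSeconds,
            cacheName, new Date(), cacheKey);
    }

    // The trimming daemon reads the oldest entries and can then delete both the
    // index entries and the corresponding cached values.
    public ResultSet oldestEntries(String cacheName, int limit) {
        return session.execute(
            "SELECT inserted_at, cache_key FROM cache.all_keys " +
            "WHERE cache_name = ? LIMIT " + limit,
            cacheName);
    }
}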

Tuesday 25 June 2013

Cassandra Part two: Stock Quotes Data Architecture

Having had some success with using Cassandra for our election night postal code lookup, it was time to try something more demanding: stock quotes.

Background


Our website has stock quotes.  We have two fast connections to the Toronto Stock Exchange (TMX) which we use to get stock data for the TMX, TMX Venture, Nasdaq, and NYSE (as well as several affiliate stock exchanges, such as Nasdaq OMX/PHILX, NYSE Arca, etc). We also get index data from Dow Jones and Standard and Poors. The data feed we receive gives us data in real time (with incredibly low latency).  However, for most of our customers, we need to delay the data by varying numbers of seconds.

Because of the need to delay the data, we need to store our incoming stock information somewhere. We've used a certain major relational database, and we had a third party company build us a really nifty in-memory distributed system which processes the data, optionally delays it and makes it available through a REST-like web service. The third party in-memory solution was necessary because we found that the relational database simply couldn't keep up with the stock feed. The problem with the third party in-memory solution is that it was written in 32-bit Visual C++, and we are mostly a Java shop. When the solution was originally done (over 10 years ago), we felt that Java simply couldn't deliver the speed necessary to process incoming stock data.

A Possible Solution with Java and Cassandra


As Enterprise Architect, I was able to completely re-think the way we process stock data. I met with some of the people who were responsible for providing operations support, and they pointed out to me that our stock data vendor actually had a Java API, which suggested that perhaps Java was now fast enough. There was still the matter of how to store the stock data so that we could delay it. Obviously, a custom in-memory solution was a possibility, which might have essentially become a Java rewrite of the C++ system. There are several Java in-memory libraries/databases that we might have used (or we could have just used standard Java data types); however, given our recent experience with Cassandra, I wondered if it would work. Cassandra was attractive in that it was fairly easy to build a cluster that actually works across data centres and which can survive the total failure of one or more servers. This was important to us because our stock data system needed to have no single points of failure and, traditionally in our organization, our database had been a single point of failure.

Cassandra's Weird Data Model

We started this project back in 2011, and there wasn't a whole lot written about how to do data modelling in Cassandra. We needed to figure out how to put stock data into Cassandra, but it soon became clear that Cassandra's data model was a bit difficult. I found "WTF is a supercolumn" to be a really helpful starting point, even though it is a bit outdated now. If you are just starting with Cassandra, you should make sure you install the latest version and read something on the DataStax site, such as this documentation.

You might wonder why Cassandra's data model is so weird (as have many of the developers who work at my company).  The most important thing to understand is that Cassandra really is a distributed database.  This means that:
  1. Your data set can be spread over multiple servers (also known as sharding)
  2. Any given piece of data can be on more than one server, which gives Cassandra the ability to load balance requests and recover from server failures.

Distributed databases need to make certain trade offs. Cassandra, like many of them, does not support joins. It also has to make trade offs between consistency, availability and partition tolerance, as predicted by Brewer's CAP theorem. Indexing is also somewhat limited.

To make matters more challenging, Cassandra has its own terminology:

What relational databases call a "schema", Cassandra calls a "keyspace".  What relational databases like to call a table, Cassandra calls a "column family".  It's probably fair to give these things different names in Cassandra, given that they are not really exactly the same as what you get in a relational database.  However, it does make for a steeper learning curve.

The first point that I learned about Cassandra data modelling is that because there are no joins, you have to de-normalize everything.  That means that all the information for a given query pretty much needs to go into one column family. As a result, you really need to think backwards from your queries and make sure that there is a single column family that satisfies them (or, as a last resort, do joins in your code). 

The next thing I learned about Cassandra is that you really need to think of columns differently than you do in a relational database.  In a relational database, every row generally has the same columns (although people have found various ways to get around this, often by using varchar fields to hold string values of arbitrary data types).  In Cassandra:
  1. every row can have completely different columns,
  2. every row can have a lot of columns (up to two billion), and
  3. the columns are kept sorted in order by their name, so it is possible to (quite efficiently) retrieve ranges of columns in a specific row using a single query.
The last point is really important and effectively determines the data structures that a Cassandra column family can represent.  It also means that, if you want to do any kind of range queries, the data for the range query needs to fit into a single row.  Because Cassandra rows are not necessarily stored on the same node, you can't effectively do a range query over several rows.

So, what data structure does a Cassandra column family represent?  I tend to think of it as follows:

A Cassandra column family is analogous to a distributed Java hashtable, keyed by row (or primary) key. Each row can be stored on a completely different server in the Cassandra cluster and may in fact be replicated on several servers (depending on how you configure your replication settings). The "value" in this distributed hashtable is essentially another hashtable, whose keys are the names of the columns in a row. These columns can be individually retrieved or can be retrieved by specifying the start and end column name for the range of columns that you want to fetch. As you may have just guessed, columns in each row are kept sorted by their column name (you can define the sort order by specifying a Java comparator). The hardest thing about Cassandra data modelling, for those of us that came of age on relational databases, is that you can (and usually should) have a large number of columns in a single Cassandra row. Cassandra rows can contain up to about 2 billion columns, so there is no need to try and keep the number of columns in a row small.

A Cassandra column family is conceptually a distributed hashmap which maps row keys to rows, which are themselves hashmaps in which a column name is mapped to a value. Each row can be on a different server and may in fact be on more than one server (not represented in the above diagram), depending on your replication settings. 
I think that thinking of Cassandra as a distributed hashtable with values that are hashmaps is useful for understanding the capabilities of the storage engine. However, this article makes a very good argument that Cassandra should normally be conceptualized differently.
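For readers who prefer code to prose, here is a deliberately simplified, single-JVM sketch of that nested map view in Java. It is only an analogy for thinking about the data model, not how Cassandra is implemented:

import java.util.SortedMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

// A simplified, single-JVM analogy of the storage model described above:
// a hashtable of rows, where each row keeps its columns sorted by name so
// that ranges ("slices") of columns can be read efficiently. It ignores
// distribution, replication and persistence, which are the whole point of Cassandra.
public class ColumnFamilyAnalogy {

    // row key -> (column name -> column value), with columns sorted by name
    private final ConcurrentHashMap<String, ConcurrentSkipListMap<String, byte[]>> rows =
            new ConcurrentHashMap<String, ConcurrentSkipListMap<String, byte[]>>();

    public void insert(String rowKey, String columnName, byte[] value) {
        rows.putIfAbsent(rowKey, new ConcurrentSkipListMap<String, byte[]>());
        rows.get(rowKey).put(columnName, value);
    }

    // The analogue of a column slice: every column whose name falls between
    // start and end, which is cheap because the columns are already sorted.
    public SortedMap<String, byte[]> slice(String rowKey, String start, String end) {
        ConcurrentSkipListMap<String, byte[]> row = rows.get(rowKey);
        return row == null ? new ConcurrentSkipListMap<String, byte[]>()
                           : row.subMap(start, true, end, true);
    }
}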

On a conceptual level (for the architects out there that don't want to wade into the weeds), Cassandra rows can contain at least the following:


  1. Single values, that are accessed by individual column keys.  This is a lot like a relational database.  For example, a column family that contains addresses might have one row with the columns: name, house number, street name, street direction, city, state, zip code, country (for a U.S. address, for example) and another row with the columns name, house number, street name, street direction, city, province, country, postal_code (for a Canadian address).  Let's assume that this column family is used inside some sort of customer management application, and therefore, in addition to the columns above, each row would have an additional column, customer_number, which would be used as a row key.  Don't forget the row (aka primary) key when you design Cassandra column families.  Unlike relational database tables, every row must have a (primary) key in a Cassandra column family.
    Two rows in a customer column family. Notice that the columns are slightly different in each.
  2. Lists or sets of values. For example, you may wish to store a list of phone numbers for each customer in the customer column family we talked about above. Unlike a relational database, you'd probably store the list of phone numbers in the same row as the rest of the address information, instead of normalizing it and putting it somewhere else. Sets are really just lists in which every element can occur at most once (and in which the order of the elements is not important, so technically the "list" of phone numbers just mentioned is more properly called a "set" of phone numbers). Recent versions of Cassandra (see the DataStax documentation for more details) support high level abstractions for storing lists and sets inside rows. If you aren't using a recent version, or like to do everything yourself, you can create lists and sets fairly easily by using composite columns, or by just creating a column name that is the name of the list or set concatenated to a fixed width number representing the element's position in the list (in the case of a list) or to a string representation of the set element (in the case of a set).
    Our customer data with a list of phone numbers added. The phone numbers are in columns named phone_xx, where
    xx is a two digit number. There could be some concurrency issues with adding phone numbers using this technique if two different threads/servers try to append a phone number at the same time with the same index. This could be avoided by appending a globally unique identifier (possibly server name + pid) for the thread to the column name. If we wanted to store a set of phone numbers, instead of a list, the column name could be phone_<phone number>.
  3. Time series. A time series is just a special type of list in which elements are always kept in order by time stamp. If you create the column names using time stamps and make sure that your column comparator will sort the time stamps in chronological order, you have a very useful time series capability. You can use the Cassandra column slice feature (which takes a start and an end column name and retrieves all the columns whose names fall into the resulting range) to retrieve all the columns that occur in a specified time interval. You can use a reverse column slice to retrieve the columns in reverse chronological order (most recent first). This is often very useful. In the customer column family we talked about in the above two cases, we could add a time series to keep track of customer contacts. The column names could be something like "contact" concatenated to the date/time of the contact, and the contents of the column could be a piece of text describing the customer contact.
    Adding a time series of customer contacts to our customer records by using a column name that is contact_<timestamp>. The most important thing for time series columns is that they sort properly in (reverse) chronological order so that column slices are useful. Notice that we can easily put basic address information, a telephone number list and a customer contact time series in the same column family.
  4. Combinations. As implied in the above cases and diagrams, it is possible to combine any or all of the above into a single row, if you are careful about defining the column comparator, which determines how the columns are sorted. You must define a column comparator that works in all the rows of a given column family, and therefore sorts all the columns the way you want them sorted, so that any column slice queries will retrieve the correct columns in the correct order. Often, it is simplest to use a very basic comparator for your columns, which will sort them in String or byte array order, and then build your column names so that the sorting will be correct (see the sketch below).
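Here is a small, hypothetical Java sketch of the column name construction used in the list, set and time series cases above. The exact formats are made up; the important property is that the names sort correctly under a simple String comparator:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Illustrative helpers for building column names that sort correctly under a
// String comparator, following the list, set and time series patterns above.
public class ColumnNames {

    // List element: a fixed width index so that "phone_02" sorts before "phone_10"
    public static String listColumn(String listName, int position) {
        return String.format(Locale.US, "%s_%02d", listName, position);
    }

    // Set element: the element's string form becomes part of the column name
    public static String setColumn(String setName, String element) {
        return setName + "_" + element;
    }

    // Time series element: a fixed width timestamp keeps slices chronological
    public static String timeSeriesColumn(String seriesName, Date when) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmssSSS");
        return seriesName + "_" + fmt.format(when);
    }

    public static void main(String[] args) {
        System.out.println(listColumn("phone", 1));           // phone_01
        System.out.println(setColumn("phone", "5551234"));    // phone_5551234
        System.out.println(timeSeriesColumn("contact", new Date()));
    }
}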

CQL3 In Cassandra 1.2+: Radical Change comes to Cassandra Data Modeling

Cassandra 1.2+ has many improvements over Cassandra 1.1, including virtual nodes and more off-heap data structures (which allow more data to be stored per node). CQL3 is a really important change in Cassandra 1.2+. CQL stands for "Cassandra Query Language" and it is essentially an SQL-like language for Cassandra; it has been available since Cassandra 0.8. CQL provides another way (besides the Thrift API) to query and/or store data in Cassandra which is easier for many developers to learn.

Up until CQL3, it was a bit awkward using CQL for Cassandra column families that had rows containing many columns (that is, the list, set and time series cases described above). CQL3 puts a much more high level interface on the Cassandra storage engine that many developers will likely find easier to understand.

CQL3 adopts a different terminology than was previously used to describe data modelling in Cassandra. CQL3 now talks about "partitions" instead of "rows" and "cells" instead of "columns". CQL3 also allows special types of columns that store maps, lists and sets, which takes care of the most common reasons for wide rows. I still think an explicit time series column type would have been nice, but you can essentially get that by using a time stamp as part of a composite primary key (just don't use it as the first component!).

CQL3 introduces the concept of composite primary keys, which is a primary key that consists of several fields, such as "newspaper_issue_date, edition, page_number".  From reading the (highly recommended) DataStax documentation, it appears that the first field in the composite key is used as what we used to call a "row key" and the remaining fields' values are used to build names for composite columns.

It was not immediately clear to me how fields that are not part of the composite key are actually stored when there is a composite key, given that these fields can be indexed using secondary indexes.  I investigated by setting up a table in Cassandra 1.2.5 as follows (using cqlsh):

create table test1 ( listing_id_and_date TEXT, trade_time int, price float, low float, high float, close float, PRIMARY KEY( listing_id_and_date, trade_time )  );

I then added three rows (again using cqlsh):
INSERT INTO test1(listing_id_and_date, trade_time, price, low, high, close) values ('21221.20131202', 123, 12.50, 12.00, 13.00, 12.75);
INSERT INTO test1(listing_id_and_date, trade_time, price, low, high, close) values ('21221.20131202', 124, 22.50, 22.00, 23.00, 22.75);
INSERT INTO test1(listing_id_and_date, trade_time, price, low, high, close) values ('21221.20131202', 125, 32.50, 32.00, 33.00, 32.75);

I then used cassandra-cli to see how the storage engine is actually storing the above three rows:
RowKey: 21221.20131202
=> (column=123:, value=, timestamp=1372439472065000)
=> (column=123:close, value=414c0000, timestamp=1372439472065000)
=> (column=123:high, value=41500000, timestamp=1372439472065000)
=> (column=123:low, value=41400000, timestamp=1372439472065000)
=> (column=123:price, value=41480000, timestamp=1372439472065000)
=> (column=124:, value=, timestamp=1372439524262000)
=> (column=124:close, value=41b60000, timestamp=1372439524262000)
=> (column=124:high, value=41b80000, timestamp=1372439524262000)
=> (column=124:low, value=41b00000, timestamp=1372439524262000)
=> (column=124:price, value=41b40000, timestamp=1372439524262000)
=> (column=125:, value=, timestamp=1372439550103000)
=> (column=125:close, value=42030000, timestamp=1372439550103000)
=> (column=125:high, value=42040000, timestamp=1372439550103000)
=> (column=125:low, value=42000000, timestamp=1372439550103000)
=> (column=125:price, value=42020000, timestamp=1372439550103000)

As expected, the first component of the composite key is used as the row key. The value of the second component (the 123, 124 or 125 before the colon in the output above) is used as the first part of the column name for the remainder of the columns, and is followed by a colon to separate it from the second part of the column name. The cell name specified in the create table statement (close, high, low and price above) is used as the second part of the column name for fields that are not part of the primary key. There is also a column whose name is just the second component of the primary key followed by a colon, and which has no column value.

It turns out that there is an option "WITH COMPACT STORAGE" that you can use when creating a table in CQL3 when the table has only one cell that is not in the primary key.  When you use this option, the first component of the primary key is used as a row key, the second and subsequent components are concatenated together and used as the column name and the column value is the single cell not in the primary key.

In a sense, CQL3 flips things around a bit conceptually to hide the fact that range queries can only be done on columns within a row (using the terms in the old sense).  It does probably make Cassandra more accessible to people who are used to relational databases.  There is a slight danger in CQL3 in that not everything that someone who has an SQL background intuitively thinks should work will actually work.

The remainder of this blog entry uses the old terminology to describe Cassandra column families that contain stock data. However, it is possible to rather easily map this to the new CQL3 way of thinking used in Cassandra 1.2+. All of the stock data column families described below have a row key which identifies the row. Each row has a large number of columns whose column name is generally a timestamp or a timestamp concatenated to something (like volume).

So, it is possible to use CQL3 syntax to describe any of the stock data column families below.  The general pattern would be something like this:

create table <column_family_name> ( <row_key>  text, <column_name> text, price_information text, PRIMARY KEY (<row_key>, <column_name>) ) WITH COMPACT STORAGE;


Representing Stock Data In Cassandra

The Requirements

Our stock feed sends us events, which are essentially notifications that something about a stock, index or currency has changed, often the price.  We have a java daemon that receives these events and decides what to store into Cassandra. Recall that you need to consider your queries when you build Cassandra column families.  In our case, we need to support three main types of queries:

  1. A query to get the most recent price for a particular stock, index or currency for users that have access to real time quotes.  Similarly, we need to be able to retrieve the latest price that is at least n seconds old, for users that are only allowed to see delayed quotes. We designed a column family called Quotes to satisfy these two queries.
  2. A query to return all the one minute intervals for a given stock, index or currency that are between two specific times in a given day. An interval is a tuple that has the opening price, closing price, high, low and volume for a given one minute period. These are used to construct various types of charts. We designed a column family called IntervalQuotes to satisfy queries of this type.
  3. A query to return all the price changes between a certain start and end time for all stocks, indexes and currencies. We use this query to get data to update a solr core that contains current prices. The program that does the updating wakes up periodically, requests all price changes since it last ran, and then updates only the stocks, indexes or currencies that have changed. Since many stocks trade very infrequently, updating only what has changed is a very effective optimization. We built a column family called TimeSortedQuotes to satisfy these queries.

The Column Families

Background -- listing_ids, trades and ticks

We have created a special number, called a listing_id for each stock, index or currency for which we receive data. When a user types in a stock symbol, listing symbol or currency pair, we map it to the appropriate listing_id and then use it to do any queries. The advantage of this is that stocks, indexes and currencies all look identical after you get past the upper levels of our code.  Doing this also allows us to transparently handle providing historical data when a company changes stock symbols -- the price data continues to be stored under the same listing_id and no special logic is required.

Our stock feed tends to send us updates in two possible circumstances. The first circumstance is a trade, in other words when a stock is sold to a buyer. The second circumstance, which generally applies to indexes, is when a tick occurs. Ticks are just the price of a security (often a security that doesn't trade, such as an index) and associated information such as volume, open, high, low and close, at a given moment in time. Index ticks often come at regular intervals. For example, we receive ticks for the S&P 500 index every five seconds from market open until about an hour after market close.

Quotes

The Quotes column family is intended to support queries for delayed and real time stock quotes, as described in #1 in the above list. For our purposes, a stock quote is the price of the stock, along with some other information, such as bid, ask, open, high, low, close, and cumulative volume.  We can put all of this information into a single Cassandra column if we want by serializing it using a method such as JSON.  If you are used to relational databases, having multiple values in a single column probably seems to be a bit weird.  Remember though that we cannot do a range query over multiple rows and that time series data needs to be stored as columns in a single row.  This means that we cannot break out the various fields in a stock quote into single columns, but instead we need to have all the fields for a single quote in a single column.  We will construct the column names using a time stamp concatenated to the current cumulative volume and we will generally only write new columns when our stock feed tells us that there is a new trade or a new tick (for indexes).  You might wonder why we use the volume as part of the column name, when we are constructing a time series that should really only need a time stamp.  The reason is that sometimes two stock trades have precisely the same time stamp, so we use cumulative volume to break the tie.

If we restrict a given row to only contain the information for a single listing_id, we can satisfy a stock quote query for a given listing_id by going to its row and then doing a range (column) slice, specifically a reverse slice, which returns the columns in reverse chronological order.  To do a real time query, we can get the most recent column in the row by using a column slice from the end of the current day until the start of time, specifying that we want only the first column in the slice.  To do a delayed query, we change the start of the slice to (now - delay_value + 1ms)  (and specify a cumulative volume of zero) and again only request the first column in the slice. We use a cumulative volume of zero because it is normally impossible and therefore guarantees that the column returned can be no more recent than now - delay_value (think of the way that the columns are sorted -- first by time stamp and then by cumulative volume -- and you will see how this works).

We ended up making a tweak to the above design: instead of using listing_id as the row key, we used listing_id concatenated to the date as the row key. We originally did this because we were concerned about having really wide rows (that is, rows with huge numbers of columns). We were also concerned that stocks that trade frequently would always end up on the same servers. By putting the date in the row key, the quotes for a given stock end up in a different row each day, and therefore potentially on a different server each day. This tweak makes retrieving quotes a little less straightforward, because we need to check previous dates if we don't find a quote for a given stock for the current day. In practice, we find that this isn't a huge problem -- it doesn't make our code much more complex and it doesn't seem to create any performance issues.
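Pulling the pieces together (the reverse slice, the impossible volume-of-zero tie-breaker and the listing_id-plus-date row key), a delayed quote lookup might be sketched like this in Java against the CQL3 compact storage rendering shown earlier. The names and the column name format here are illustrative, not our production code:

import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// Hypothetical delayed quote lookup, assuming the CQL3 rendering shown earlier:
//   create table quotes ( listing_id_and_date text, time_and_volume text,
//     price_information text,
//     PRIMARY KEY (listing_id_and_date, time_and_volume) ) WITH COMPACT STORAGE;
// It also assumes fixed width timestamps in the column names, so that String
// ordering matches chronological ordering.
public class QuoteDao {

    private final Session session;

    public QuoteDao(Session session) {
        this.session = session;
    }

    // Latest quote at least delayMillis old, or null if there is none in this row
    // (in which case previous days' rows would be checked, as described above).
    public String delayedQuote(String listingId, String date, long nowMillis, long delayMillis) {
        // A volume of zero ("A0") is impossible for a real trade, so this upper
        // bound can never match a column newer than (now - delay).
        String upperBound = (nowMillis - delayMillis + 1) + "_A0";
        Row row = session.execute(
                "SELECT price_information FROM quotes " +
                "WHERE listing_id_and_date = ? AND time_and_volume < ? " +
                "ORDER BY time_and_volume DESC LIMIT 1",
                listingId + "." + date, upperBound).one();
        return row == null ? null : row.getString("price_information");
    }
}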

One thing we added to the above design is time to live values for stock quote columns. We currently use a TTL of five days, which keeps the Quotes column family at a manageable size. The only drawback with setting a TTL on a stock quote is that some stocks trade less frequently than every five days, and consequently we won't be able to display a quote for them. We solved this problem by creating another column family that contains the last known quote for each listing_id. If we can't find a quote in the Quotes CF, we use the last known quote.

The Quotes family for storing stock quotes. The row key is the listing_id concatenated with the date, meaning that each row stores the quotes for a particular stock for a particular day. The column names are timestamps concatenated with the volume, in order to deal with the situation in which the price changes for a stock several times within a millisecond. The volume is preceded by a letter of the alphabet which essentially gives the number of digits in the volume (A means one digit, B means two, C indicates three digits, etc). This is done so that the volume will sort in the correct order using a String comparator. The actual stock price information in each column is just a big blob of text that is constructed to be easily parseable.

Interval Quotes 

The interval quote column family stores the open, high, low, close and volume for a given stock or index over a period of one minute. The information is normally used to create intra-day stock charts, namely charts of intervals smaller than one day -- in our case, one minute, five minutes, 15 minutes and one hour. Intervals longer than one minute are produced by combining the necessary number of one minute intervals together (a sketch of this follows).
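The combining logic is simple; a hypothetical Java sketch (the Interval class here is illustrative) might look like this:

import java.util.List;

// Illustrative sketch of combining consecutive one minute intervals into a
// longer interval, e.g. five one minute intervals into a five minute interval.
public class Intervals {

    public static class Interval {
        public final double open, high, low, close;
        public final long volume;

        public Interval(double open, double high, double low, double close, long volume) {
            this.open = open; this.high = high; this.low = low; this.close = close;
            this.volume = volume;
        }
    }

    // Combine one minute intervals (in chronological order) into one interval.
    public static Interval combine(List<Interval> minutes) {
        Interval first = minutes.get(0);
        Interval last = minutes.get(minutes.size() - 1);
        double high = first.high, low = first.low;
        long volume = 0;
        for (Interval m : minutes) {
            high = Math.max(high, m.high);   // highest high of the period
            low = Math.min(low, m.low);      // lowest low of the period
            volume += m.volume;              // total volume over the period
        }
        return new Interval(first.open, high, low, last.close, volume);
    }
}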

Each row in the column family stores the one minute intervals for a given stock/index for a given day. The row key is just the date concatenated to the listing_id for the index or stock. The column keys are the start time of the interval. Each column is essentially a composite data structure which stores the open, high, low, close and volume values. We keep interval data "forever", so there is no time to live set on the rows or columns. The interval data is sparse, in that no interval is stored for a stock if there is no trading for that stock in a given minute. This makes it a bit more complicated to process, as the non-existent intervals need to be reconstructed in the software that does the querying (we have a DAO which takes care of this).

We need to be able to read the interval data as quickly as possible in order to accommodate web pages with several hundred charts. Cassandra stores the columns belonging to the same row in sorted order on disk (and therefore in the OS disk cache).  We have found that the most important factor in determining how fast the data is read is the size of the data.  Therefore, having sparse intervals is a big win.  Keeping the column values as compact as possible also helps and Cassandra's ability to do compression makes a big difference.  Because interval quote information does not change after the interval has finished, caching is very effective.  We cache frequently accessed stocks in application server memory and only use Cassandra to retrieve the intervals not already in the cache.

The interval quotes column family, which stores open, high, low, close and volume for one minute intervals for a given stock or index. As in the Quotes column family, the row key is the listing_id concatenated with the date, meaning that each row stores intervals for a given stock or index for a given day. The column names are time stamps which are the start of the interval that the column represents. The column is a text blob that is constructed to be easily parseable. If there are no trades during a one minute interval for a given stock, then there is no column for that interval. Keeping the column family sparse is a performance enhancement -- essentially smaller rows can be accessed more quickly.

Time Sorted Quotes 

We have a stock screener application that allows users to determine which stocks meet a set of chosen criteria. I won't go into the details of how this works (perhaps in another blog entry!); however, it uses a solr core and needs to store current stock and index prices in that solr core. The latest version of solr (solr 4, aka "solr cloud") is quite good at real time updates, so it is possible to frequently populate recent stock prices in a solr core.

The problem is that we have about 15,000 stocks whose price we potentially need to update.  Even if we can query Cassandra for each stock price in about 2-3 ms (which is possible), the querying alone would take about 30-45 seconds, which is far too long when you need to provide real time stock prices. It would be possible to use multiple threads to query Cassandra and shrink this time considerably (parallel queries can work very well in Cassandra when you have more than one node in your cluster as your parallel queries potentially go to separate servers).  However, we'd still have to update solr with 15,000 prices and, even with multiple threads, we would likely need more time than is ideal.

Fortunately, there is no real need to query Cassandra for 15,000 stock prices for each solr core update. Although a few stocks trade hundreds of times per second and therefore have frequent price changes, most stock prices do not change often. We only need to update the prices of stocks that have changed since the last update, which is generally not a large number when the updates occur frequently.

The Time Sorted Quotes column family is used to query the stock prices that have changed since the last update.  We use the date concatenated with the current time, modulo 15 minutes (i.e. 9:30, 9:45, 10:00, 10:15, etc) as the row key, and each row stores all the price changes that occurred in the fifteen minute interval that starts at the time in the row key. Each column name is the time of a price change concatenated with the listing id of the price that changed.  The contents of the column is a composite data structure with basic quote information (price, volume, open, high, low, close for the day).  The columns have a time to live which is configurable (currently set to 30 minutes).
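The row key construction is just modulo arithmetic on the time stamp; a hypothetical Java version might look like this (the key format itself is made up):

import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical sketch of building the Time Sorted Quotes row key: the time is
// rounded down to the nearest fifteen minutes (9:30, 9:45, 10:00, ...) and
// formatted together with the date.
public class TimeSortedQuoteKeys {

    private static final long FIFTEEN_MINUTES_MILLIS = 15L * 60L * 1000L;

    public static String rowKey(long epochMillis) {
        long bucketStart = (epochMillis / FIFTEEN_MINUTES_MILLIS) * FIFTEEN_MINUTES_MILLIS;
        return new SimpleDateFormat("yyyyMMdd.HHmm").format(new Date(bucketStart));
    }

    public static void main(String[] args) {
        System.out.println(rowKey(System.currentTimeMillis())); // e.g. 20130625.0945
    }
}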

The daemon that updates solr calls a DAO with a start time and an end time. The DAO queries a range of columns in (generally at most two) rows to get all the price changes that occurred in the time interval. We have found this to be very fast, returning in 1-2 seconds or less, depending on the time range being queried.

The Time Sorted Quotes column family is a good example of the need in Cassandra to create new column families to support new (types of) queries. Although the Quotes column family could have been used to query all the prices for all stocks, it was not very efficient to use it, so it made sense to create a new column family that was better suited. The result is very efficient queries, at the cost of more development effort. In a relational database, we probably would have at least tried to index the time stamp column (it's a fair assumption that one exists) in a Quotes table and then performed a range query on the index to fetch the rows that were updated in a certain time range. In some ways, the Time Sorted Quotes column family is the Cassandra equivalent of adding an index on the timestamp. It is highly efficient, as quotes will be ordered on disk or in memory by time, whereas in the RDBMS solution the quotes would not necessarily be arranged this way. The Cassandra solution is better in terms of efficiency (although only slightly if all data is kept in memory), scalability and redundancy. The RDBMS solution would be simpler to implement and perhaps more easily maintained.

The Time Sorted Quotes column family. The row key is a time stamp for the beginning of a fifteen minute interval. The columns in the row are the price changes for all stocks during that fifteen minute interval. The column names are the time of day concatenated to the listing_id for the stock, which, using a String comparator, causes the columns to be sorted in order by time of day. The data in each column is just an easily parseable text blob with stock quote information.

Closing Thoughts

Cassandra works extremely well for stock data, which is a type of time series data. The ability to have large numbers of dynamically created columns in a row which are kept in sorted order on disk and in memory is a really good fit.  We have a "legacy" implementation that stores stock quotes in a relational database.  Cassandra performs much better, especially on writes (more than 100x faster), which is important because stock data sometimes arrives quickly and needs to be persisted quickly if we are to present it in real time. Cassandra's ability to scale horizontally and its fault tolerance are also attractive because users of stock data expect current quotes and are intolerant of down time.