In the continuing Baby Steps to SOA series, we follow Doug and the IT team behind BuyMyWidget.com as they take steps to renovate their digital asset architecture. Previously, we introduced the problem and the team, started planning and analysis, decided on some metrics, and refactored the website applications. Most recently, the team tackled identity management and introduced a CMS, and now continues with the migration of the data layer into web services.
The Evolutionary Roadmap
- Step One: Analyze and Plan
- Step Two: Measure It
- Step Three: Three Tiers for the Website
- Step Four: Single Sign-On
- Step Five: The Move to a CMS
- Step Six: Data Services
- Step Seven: Centralizing eCommerce
- Step Eight: Sharing the Business Tier
- Step Nine: Moving Beyond the Website
- Step Ten: Riding the ESB
Step Six: Data Services
Much smarter people than myself, like Jeff Atwood (in 2004!), Goodson and Bloomberg (2008), and others, have long held that data layers belong behind web services. Of all the things we build in our applications, the data layer is the one that translates most readily in our minds into something that can be separated out from the rest. I would venture to suggest that this happens because the data store we usually query directly is already separate from our software, giving us a nice easy breadcrumb trail from “data storage is separate” to “data layer is separate”.
In my mind, the primary advantage of moving your data access layer to a web service is the distribution of the ability to retrieve data to all of your applications. In many situations, we have SQL that is complex and extracts important data, but nobody can re-use it because it only exists in that hidden stand-alone app running on a dev VM on Jim’s machine. Why do we do this to ourselves?
Simple, really. It’s easier. We spent 4 days (and probably some all-nighters) trying to figure out that sequence of nested queries and GROUP BY clauses to get that MUST HAVE report done on time for the quarterly meeting, and there’s no way we’re spending another 4 days to make it generic for any possible input. Don’t even ask about flexing the output format, because if you do we’re probably going to bash our heads against our keyboards until the letter Q dislodges and sticks to our foreheads.
It doesn’t have to be this way. Moving to a service-oriented architecture doesn’t mean the service has to be generic and do anything anybody might want. At its most basic, all you are saying is “this service will do exactly what is written in this contract given this input”. The app with the single call to a method that has the ridiculous-and-complicated-mother-of-all-queries (RACMOAQ for short) just becomes a single call to a web service that houses the RACMOAQ. Once that service layer sits in between, ANYBODY can get the benefit of that query’s output. Was that so hard?
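As a rough illustration of that contract idea, here is a minimal Python sketch. The `orders` table, its sample rows, and the function name `quarterly_sales_by_region` are all invented for this example, and a plain function with an in-memory SQLite database stands in for the real web service and data store. The point is only that the query stays as specific as it always was, while callers depend on nothing but the input and output described in the docstring.

```python
import sqlite3
from typing import List


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, quarter INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [("East", 1, 100.0), ("East", 1, 50.0), ("West", 1, 75.0), ("West", 2, 20.0)],
    )
    return conn


def quarterly_sales_by_region(conn: sqlite3.Connection, quarter: int) -> List[dict]:
    """Contract: given a quarter number, return one {"region", "total"} dict
    per region, ordered by region name. Nothing more generic is promised."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders "
        "WHERE quarter = ? GROUP BY region ORDER BY region",
        (quarter,),
    ).fetchall()
    return [{"region": region, "total": total} for region, total in rows]
```

In a real deployment this function would sit behind an HTTP endpoint; that plumbing is left out here because it does not change the contract.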
Performance (the elephant in the room)
The biggest issue folks will raise about this is performance. By introducing another layer, we are inevitably adding a communication hop that will slow things down. There is no way to get around this: at some point, you need to call a web service, and it will call the database. That is one additional step you didn’t have before. The question is: does it matter?
In most cases, the communication to the web service will be a negligible delay in comparison to the time taken by the database to execute the query and retrieve the data. However, there are some cases where this may not be true:
- Simple cached query with a lot of data being returned: Large amounts of data in XML transfer much more slowly than the same large amounts of data that are being retrieved from the database directly. If the query itself is quick, or the results are cached on the database side, you will need to introduce some sort of caching on the application side to prevent the unnecessary transfer of all that bloated XML across the wire.
- High latency to the web service: In some architectures, the data web services get hosted inside the firewall as close to the database as possible to improve performance of the code in the web service communicating to the database. However, if the network latency between the application and the web service is high, it doesn’t matter that the web service communicates quickly with the database. The application still has to wait on the web service.
- Unneeded data being returned: A common implementation of data services is to return all data related to the object being requested, because the context of its use is unknown. Compared to a direct query that only retrieves the information absolutely needed at that time, this can be quite a bit slower as the object being returned becomes more and more complex. Application-side caching can reduce the number of times the data is requested from the service in this case. Instead of three queries asking for small parts of the same object, you can have one call to the web service to get the data and cache it, and then three extractions of portions of the cached data.
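The “one call, three extractions” idea from the last point can be sketched roughly as follows. This is an illustration only: `CachedUserClient` and `fake_user_service` are hypothetical names, the fake function stands in for a real call to the data service, and the cache here never expires, which a real application would not want.

```python
from typing import Callable, Dict, List


class CachedUserClient:
    """Fetches the whole object from the service once, then serves it from memory."""

    def __init__(self, fetch_user: Callable[[str], dict]) -> None:
        self._fetch_user = fetch_user
        self._cache: Dict[str, dict] = {}

    def get_user(self, email: str) -> dict:
        # Only go over the wire when we have not seen this user yet.
        if email not in self._cache:
            self._cache[email] = self._fetch_user(email)
        return self._cache[email]


service_calls: List[str] = []  # records every trip to the (fake) service


def fake_user_service(email: str) -> dict:
    """Hypothetical stand-in for the web service call; returns the full object."""
    service_calls.append(email)
    return {"id": 1, "name": "Pat Example", "email": email, "date_created": "2010-06-01"}


client = CachedUserClient(fake_user_service)
# Three extractions of different parts of the same object, one service call.
user_id = client.get_user("pat@example.com")["id"]
name = client.get_user("pat@example.com")["name"]
created = client.get_user("pat@example.com")["date_created"]
```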
Making the Move
Migrating the data access layer into a web service (or multiple services) is one of the easiest portions of the application to migrate in an iterative fashion. Because we are not changing the actual data storage, we can take portions of the application and move them piece by piece into a series of web services that will serve up the data. For each piece that is moved, we should do the following:
- Identify the business object: By now, we should be moving towards a RESTful service, and usually these types of data services are built around the concept of a single business object (such as “User” or “Address”). For any given data access query, we need to identify what business object we are dealing with. If we are dealing with a combination, is there a way to break up the query, or is the query mostly about one object and related data from another?
- Construct the service endpoint: Build a service endpoint that takes in the same input as your data access layer but returns a business object.
- Move the query: Migrate the query to a method behind the service endpoint. Update the query to take the arguments from the service input, and massage the output to populate the business object that is to be returned.
- Move the data: If you have multiple data stores for similar data, you may need to migrate the data from one system into the new centralized data store. Common pitfalls here are mismatched schemas (especially uniqueness constraints), duplicate users with different information between systems, and missing data required in the new schema.
- Refactor: Has a similar query that returns the same business object already been moved? Can two service methods be merged into a single call that returns the data? A common example would be as follows:
SELECT ID, Name FROM User WHERE Email = 'firstname.lastname@example.org';
SELECT DateCreated FROM User WHERE Email = 'email@example.com';
The above two queries could become a single web service call returning all 3 fields from the User. The application then only uses the portion of the data it requires.
- Update application: The application now requires an update to no longer directly query the database, but instead invoke the web service. This may require configuration settings for the location of the service, as well as a proxy layer for consuming the output from the service.
- Cache early, cache often: Many entities returned by the service should probably be placed into a temporary cache, even if only for a few seconds. For high-use applications where many concurrent requests are occurring, we need to minimize the number of trips to the back-end. This caching layer may already exist if your application was previously converting data query results into business objects and caching the business object for performance.
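To make a few of the steps above more concrete, here is a hedged Python sketch of a merged service method plus a short-lived cache in front of it. It assumes a `User` table with `ID`, `Name`, `Email`, and `DateCreated` columns (taken from the example queries), uses SQLite purely as a stand-in for the real data store, and invents the names `get_user_by_email` and `TtlCache`. The HTTP layer is omitted.

```python
import sqlite3
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class User:
    """The business object returned by the service (fields from the example queries)."""
    id: int
    name: str
    date_created: str


def get_user_by_email(conn: sqlite3.Connection, email: str) -> Optional[User]:
    """One service method replacing both narrow queries: it returns all three fields."""
    row = conn.execute(
        'SELECT ID, Name, DateCreated FROM "User" WHERE Email = ?', (email,)
    ).fetchone()
    return User(*row) if row else None


class TtlCache:
    """Keeps each loaded value for ttl_seconds, then reloads it on the next request."""

    def __init__(
        self,
        loader: Callable[[str], Optional[User]],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Optional[User]]] = {}

    def get(self, key: str) -> Optional[User]:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no trip to the back-end
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

An application would wrap its service proxy once, for example `cache = TtlCache(lambda email: get_user_by_email(conn, email))`, and then read whichever fields it needs from `cache.get(email)`. The five-second default is an arbitrary choice for illustration.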
In Our Scenario
Lynn’s team definitely has their work cut out for them. All of their systems have their own data stores, and none of the queries are common between their applications. Doing a straight move from the application to a web service won’t be efficient. Instead, the team decides to analyze all of the applications in their domain and look for common requests. Lynn’s team identifies that many of their applications deal with Address information in their own way, and that centralizing this into a single service and data store would simplify the architecture greatly.
Having identified this common structure, the team starts with the primary eCommerce website. The eCommerce site has the richest address data, so the team decides to use that schema as the base for all applications and builds a service endpoint around the existing data store. The eCommerce queries are migrated into the data service incrementally, with no need for data migration because the data storage stays in the same place.
Having completed the eCommerce website, the team moves on to the Events website. The data structure for addresses is slightly different in this application, so the migration of the data requires a script to massage the information into the new format. During a migration analysis, Ted (one of the team’s DBAs) notices that uniqueness constraints are being violated on the email address field. Some rows cannot be migrated. Further analysis determines that many users have saved address information in both databases, yielding duplicate records that do not necessarily match. The team decides to use the email address as the identifier for finding duplicate users that the migration script needs to merge, and to keep whichever address has the most recent UpdatedDate. Because getting the migration right takes so long, the team either needs to move everything at once, or build complicated logic to keep some data in one data store and some in another. The team opts for the big-bang approach on the Events site, as it has far fewer queries for address information than some of the other applications.
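The merge rule the team settles on could look something like the sketch below. The row layout (dicts with `email`, `city`, and `updated_date` keys) and the function name `merge_addresses` are assumptions made for this example, as is the choice to compare email addresses case-insensitively; the article only specifies matching on email and keeping the row with the latest UpdatedDate.

```python
from datetime import date
from typing import Dict, Iterable


def merge_addresses(*sources: Iterable[dict]) -> Dict[str, dict]:
    """Return one address row per email, keeping the most recently updated one.

    On a tie, the row from the earlier source wins (an arbitrary choice here).
    """
    merged: Dict[str, dict] = {}
    for rows in sources:
        for row in rows:
            # Assumption: emails differing only in case or padding are the same user.
            key = row["email"].strip().lower()
            current = merged.get(key)
            if current is None or row["updated_date"] > current["updated_date"]:
                merged[key] = row
    return merged


# Hypothetical sample rows from the two data stores.
ecommerce_rows = [
    {"email": "pat@example.com", "city": "Boston", "updated_date": date(2011, 1, 5)},
]
events_rows = [
    {"email": "Pat@example.com", "city": "Denver", "updated_date": date(2011, 3, 1)},
    {"email": "lee@example.com", "city": "Austin", "updated_date": date(2010, 9, 9)},
]
merged = merge_addresses(ecommerce_rows, events_rows)
```

Here the Events row for Pat is newer, so it replaces the eCommerce row, while Lee exists only in the Events data and is carried over unchanged.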
This process continues throughout all the applications, adapting the Address structure to accommodate new fields or linked tables, and enhancing the service to return more data about the user to all applications. Eventually, the team has built their first data service that stores Address information about any of their users. Given the recent move to a centralized identity management system, it may be desirable in the future to move this type of information into the identity management system so that all information about a user is in one place. For now, however, the Address data service will suffice.
The team then returns to the data access layers, identifies additional business objects, and continues creating data services for each one. Eventually, no application will query the data stores directly, and all data access will be done via web services!
How much will this step cost?
The total cost here will entirely depend on the complexity of your data layers, and the number of services that need to be created. You will need to consider the following:
- Application Architect to identify business objects and possible refactoring points in software applications.
- Database Administrator or Application Architect experienced with data structures to identify data schema similarities and support data migration efforts to join multiple applications together.
- Web Service developer to build RESTful endpoints for the data.
- Application developer to update the existing applications to use the data services and implement caching.
- Testing resources for regression testing of application functionality and data migration accuracy.
- Support developer to resolve production issues as the system is rolled out incrementally. Data migration undoubtedly causes problems for end users and these need to be resolved ASAP, so you will need a dedicated resource that is not already tied down to the application overhaul.
This series continues with the team centralizing their eCommerce processes.