 
The Challenge
Structured health records (medical and patient) may contain Protected Health Information (PHI), and this data format represents the bulk of RosettaHealth’s total data stream. These records must be maintained, both at rest and in transit, in a HIPAA-compliant way yet remain both searchable and rapidly retrievable. The standards-based format for said documents is XML and the original source documents must be stored as received alongside transformed versions created to support search indices.
In addition to health records, the RosettaHealth solution stored application support data and logs, in the form of transactional event data, in the same datastore as the health records. The datastore was a third-party hosted MongoDB solution, providing both storage and search options. Log data is not typically used after 3 months but must be archived and retained.
RosettaHealth needed to balance features against storage costs in a scalable way, while also providing archival processes (including removal) that met operational and regulatory requirements. To stay focused on providing customer value, RosettaHealth preferred to use a hosted MongoDB database solution instead of architecting the infrastructure for, and administering, the database.
Finally, two distinct data storage models existed, which for simplicity we refer to as the “bus” and “repository” models. The bus model describes short-term (on the order of 30 days), intermediary storage purely for health record exchange. The evolving repository model describes long-term storage that supports enhanced search and analytics capabilities. Any proposed solution should support both models, with the repository model driving this case study.
The Solution
RosettaHealth performs custom data transformations upon ingestion of clinical data records, including conversion from a complex standards-based XML format to JSON, as well as the creation of custom indices to support probabilistic searches for clinical data in MongoDB. To reduce infrastructure cost and complexity, the combined use of Elasticsearch for enhanced search and Amazon DynamoDB for persistence was discussed as a replacement for MongoDB. However, since Amazon Elasticsearch is not HIPAA-compliant in AWS at this time, that combination was not viable for PHI. Instead, we recommended staying with MongoDB and utilizing the AWS MongoDB reference architecture.
Given the logical separation of the original XML documents from the searchable data in MongoDB, we recommended storing all such documents of origin in Amazon S3 buckets instead of MongoDB. The proposed mechanism enforces encryption at rest via Amazon S3 bucket policies, and stores each object’s Amazon S3 URL in the corresponding MongoDB record as a point of reference for rapid retrieval. An AWS CloudFormation template was provided demonstrating the general principles for both bus and repository storage models, but Stelligent recommended that RosettaHealth’s production implementation include the following:
Document storage solutions were recommended to match both bus and repository models. With the clinical data bus model, documents are stored in a bucket such that when business logic dictates that RosettaHealth delete its copies of the data, the corresponding documents stored in the bus bucket are deleted in parallel via the AWS SDK. Additionally, by leveraging Amazon S3 lifecycle policies, bus bucket objects can expire after 30 days or another predefined period.
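The expiration rule for the bus bucket can be sketched as a lifecycle configuration document. This is a minimal illustration, not the configuration delivered to RosettaHealth: the function name, rule ID, and prefix are hypothetical, and the resulting dictionary is shaped as it could be passed to boto3's `put_bucket_lifecycle_configuration` as the `LifecycleConfiguration` argument.

```python
import json


def bus_expiration_lifecycle(prefix: str = "", days: int = 30) -> dict:
    """Build a lifecycle configuration that expires bus objects after `days`.

    The rule ID and prefix are illustrative assumptions, not values from
    the case study.
    """
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": f"expire-bus-documents-after-{days}-days",
                # An empty prefix applies the rule to every object in the bucket.
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(bus_expiration_lifecycle(), indent=2))
```

Expiration of this kind acts as a safety net behind the SDK-driven deletes: objects that business logic never explicitly removes still age out after the configured period.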
With the clinical data repository model, documents of origin are stored in warm storage via Amazon S3 for as long as necessary, with the option to configure Amazon S3 lifecycle policies for both Infrequent Access and Amazon Glacier storage, and to configure aging policies independently for different clients in a manner consistent with HIPAA regulations.
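Per-client aging for the repository model could be expressed as one lifecycle rule per client prefix. The sketch below assumes, hypothetically, that each client's documents live under a `clients/<client_id>/` key prefix and that the aging periods shown are placeholders; neither detail comes from the case study. The output is shaped for boto3's `put_bucket_lifecycle_configuration`.

```python
def repository_lifecycle(client_policies: dict) -> dict:
    """Build one transition rule per client.

    `client_policies` maps a client ID to a dict with `ia_days` and
    `glacier_days`. The `clients/<id>/` prefix layout is an assumption.
    """
    rules = []
    for client_id, policy in sorted(client_policies.items()):
        ia_days = policy["ia_days"]
        glacier_days = policy["glacier_days"]
        # S3 documents a 30-day minimum before a transition to STANDARD_IA.
        if ia_days < 30:
            raise ValueError(f"{client_id}: ia_days must be at least 30")
        # Sanity check only: cold storage should come after warm storage.
        if glacier_days <= ia_days:
            raise ValueError(f"{client_id}: glacier_days must exceed ia_days")
        rules.append(
            {
                "ID": f"age-{client_id}",
                "Filter": {"Prefix": f"clients/{client_id}/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_days, "StorageClass": "GLACIER"},
                ],
            }
        )
    return {"Rules": rules}


if __name__ == "__main__":
    config = repository_lifecycle(
        {
            "client-a": {"ia_days": 30, "glacier_days": 365},
            "client-b": {"ia_days": 90, "glacier_days": 730},
        }
    )
    for rule in config["Rules"]:
        print(rule["ID"], rule["Transitions"])
```

Keeping the policy table separate from the rule builder means a client's retention terms can change without touching the code that generates the configuration.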
In addition to clinical data, the log data in MongoDB represented an ever-growing load of potentially archivable data. We recommended a periodic export of log data to Amazon S3, with query support for this non-PHI data using Amazon Athena and lifecycle policies that either transition the log data to cold storage or delete it via expiration.
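One way to lay out the exported logs so that Athena can query them efficiently is newline-delimited JSON under Hive-style date partitions. The sketch below is an assumption about how such an export might be organized, not a description of RosettaHealth's process: the `logs` prefix, the partition names, and the batch identifier are all hypothetical.

```python
import json
from datetime import datetime, timezone


def log_export_key(event_time: datetime, batch_id: str, prefix: str = "logs") -> str:
    """Return an S3 key with Hive-style year/month/day partitions (UTC)."""
    t = event_time.astimezone(timezone.utc)
    return f"{prefix}/year={t:%Y}/month={t:%m}/day={t:%d}/{batch_id}.json"


def to_json_lines(events) -> str:
    """Serialize events as one JSON object per line.

    `default=str` stringifies values such as datetimes that JSON cannot
    encode natively.
    """
    return "\n".join(json.dumps(e, sort_keys=True, default=str) for e in events) + "\n"


if __name__ == "__main__":
    when = datetime(2018, 3, 5, 12, 0, tzinfo=timezone.utc)
    print(log_export_key(when, "batch-0001"))
    print(to_json_lines([{"event": "login", "at": when}]), end="")
```

Partitioning by date also lines up with the lifecycle policies: a rule scoped to the log prefix can transition or expire old partitions without any change to the export job.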
Finally, enhancements to the analytical query capabilities were proposed using a dimensional data model and importing structured files into Amazon Redshift. As this was tangential to the storage solutions, infrastructure and data-process details were not provided at this time.
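Returning to the core storage mechanism described earlier (bucket policies that enforce encryption at rest, plus an S3 URL held in each MongoDB record), the following is a minimal sketch of both pieces. It uses the widely documented pattern of denying `s3:PutObject` requests that lack the expected server-side encryption header. The function names, the `source_document_url` field, and the choice of `AES256` are illustrative assumptions; the policy string is shaped as it could be passed to boto3's `put_bucket_policy`.

```python
import json


def deny_unencrypted_uploads_policy(bucket: str, algorithm: str = "AES256") -> str:
    """Return a bucket policy (JSON string) rejecting unencrypted uploads."""
    resource = f"arn:aws:s3:::{bucket}/*"
    header = "s3:x-amz-server-side-encryption"
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject uploads that name a different encryption algorithm.
                "Sid": "DenyIncorrectEncryptionHeader",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": resource,
                "Condition": {"StringNotEquals": {header: algorithm}},
            },
            {
                # Reject uploads that omit the encryption header entirely.
                "Sid": "DenyMissingEncryptionHeader",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": resource,
                "Condition": {"Null": {header: "true"}},
            },
        ],
    }
    return json.dumps(policy)


def search_record(search_fields: dict, bucket: str, key: str) -> dict:
    """Attach the document-of-origin location to a searchable record.

    `source_document_url` is a hypothetical field name.
    """
    record = dict(search_fields)
    record["source_document_url"] = f"s3://{bucket}/{key}"
    return record


if __name__ == "__main__":
    print(deny_unencrypted_uploads_policy("example-repository-bucket"))
    print(search_record({"patient_id": "p-1"}, "example-repository-bucket", "docs/abc.xml"))
```

Because the MongoDB record carries only a pointer, the large XML payload never enters the database, while retrieval remains a single lookup followed by one S3 fetch.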
The Benefits/Results
The solution offered several key benefits for RosettaHealth:
- Reduction in large-volume data hosted by third-party MongoDB database provider
  - Said data is instead stored in Amazon S3, leveraging S3-native archival capabilities for warm and cold storage via lifecycle policies
- Reduction in storage costs for clinical data repository needs
- Analytics capability for enhanced query mechanics via Amazon Redshift using a dimensional model
  - Supports both patient search and extended research needs