Content Management with AWS Lambda

Crazy? Probably, but it won’t be the first time that has been suggested.

First, let me offer some background. Recently I have had the opportunity to experience what content management systems do and how they are used. Products like Documentum and Alfresco are meant for general use. By their nature these systems are less efficient and more complex than something built for a specific purpose. For some agencies this works out well: they don't have IT organizations that could develop a system in house, and for many whose ECM is their system of record (SOR), it is a good solution. When the system of record lies outside the ECM, there is less to be gained, and there may be an existing workflow that doesn't match the general flow the ECM defines. I felt there had to be a simpler way. How difficult could it be? I am not suggesting building for general use; instead, build only what is needed.

Note: The model I used is an EHR (Electronic Health Record) system.

The core

Basically there are only three pieces: a database to store metadata, a file system to store the content, and processes for creating, retrieving, updating, and deleting (CRUD) the information.

Other stuff

Thumbnails: For an EHR system this would not likely be needed. There is not a great variety of document types where a “preview” would be useful. This could be a requirement in other applications.

Transformation: EHR systems use a small number of standard file formats. It is required to show the HL7 data in its native format; converting the data to an image or PDF is not done. But this could be a requirement in other applications.

Versioning: This could be useful in any content store.

Starting out old school

My first thought was to go with what I know: Java, Spring Boot, and JPA. Start with a database. Since this will be on AWS, MariaDB is a good place to start; it's MySQL-compatible and free to start with. For an EHR system the content is the patient data, stored in HL7 format. Since the content is the system of record, the database doesn't have to be very complex; two or three tables is more than enough.
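For reference, a minimal sketch of the two tables used later in this post might look like the following. The column names come from the INSERT statements further down; the types, sizes, and the id/primary key columns are my own assumptions.

import pymysql

# one-time schema setup; host/user/password/db values are placeholders
conn = pymysql.connect(host='myurl', user='ec2user', passwd='xxxxx', db='mydb')
cursor = conn.cursor()

# base document: when it was created, what it is, and where the content lives in S3
cursor.execute("""
    CREATE TABLE IF NOT EXISTS documentbase (
        id INT AUTO_INCREMENT PRIMARY KEY,
        createDateTime DATETIME,
        description VARCHAR(255),
        contentURL VARCHAR(1024)
    )""")

# patient document: patient metadata plus a foreign key back to the base document
cursor.execute("""
    CREATE TABLE IF NOT EXISTS patientdocument (
        id INT AUTO_INCREMENT PRIMARY KEY,
        patientNumber VARCHAR(64),
        patientFirstName VARCHAR(64),
        patientLastName VARCHAR(64),
        docBase INT,
        FOREIGN KEY (docBase) REFERENCES documentbase(id)
    )""")

conn.commit()
conn.close()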

 

 

Create an app

Using Eclipse I created a new Spring Boot/JPA application (including Hibernate). Eclipse also generated the entities from the database and some of the support code. A few hours later I had a Spring Boot CRUD app that could read and write to the database. S3 would be the choice for the content since the app would be running on AWS, and fortunately AWS offers nice Java support for S3.

With this done I had a basic content management system. AWS suggests Elastic Beanstalk for deploying applications; it's not the simplest thing, but it does work. My REST service was very simple: a JSON file for metadata and the HL7 (XML) file for content.

This was not a production-ready system, but with AWS it was pretty quick and simple to get something working.

But….

Something didn't feel right. This is the same process/framework that everyone is using; it's not new. Since I am studying for the AWS exams, shouldn't I consider this from the AWS point of view?

Lambda

If you want to know what AWS thinks is the future, think Lambda: “Run code without thinking about servers. Pay for only the compute time you consume” (https://aws.amazon.com/lambda/). Lambda is what powers Lex and Alexa. I won't repeat everything AWS says about Lambda, but they are putting a lot of effort into it.

Building a content management system based around Lambda

I still need a database (or do I?) and a file system to store the content. I already have the database and S3 from the Java project, so there is no need to start over. What is missing is the CRUD app that I built with Java.

Since they are going to sit idle until needed, Lambda functions should be lightweight and quick to start up. AWS allows Lambda functions to be written in JavaScript (Node.js), Python, Java, or C#. Java and C# seem too heavy; Spring and Hibernate don't fit into this picture. I felt this left two options, JavaScript or Python. Both have their advantages (I use both), and I went with Python. I learned later that JavaScript is the only choice for some third-party tools. As in the Java application, I chose to write the content and the metadata to S3, and the metadata is written to the database as well. S3 also has an option to attach “metadata” to the object itself. By writing the metadata as a file I could leverage Solr to search both content and metadata; in theory this eliminates the need for a database.
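As an aside, attaching metadata directly to the S3 object with boto3 looks roughly like this; the bucket, key, and metadata values here are made up:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('com.contentstore')

# user-defined metadata rides along with the object and comes back
# as x-amz-meta-* headers when the object is retrieved
bucket.put_object(
    Key='123456_md',
    Body='{"patientNumber": "123456"}',
    Metadata={'patientnumber': '123456', 'doctype': 'healthRecord'}
)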

AWS has support and examples for creating Lambda functions in Python. “pymysql” and “boto3” are the Python libraries used here for MySQL and S3. boto3 is already available in the Lambda Python runtime; pymysql is pulled in through the deployment package described below.

Python is deployed to Lambda as a deployment package. This is simply a zip file with your Python code and any external libraries not already provided by AWS, placed at the top level of the archive. The trick to this is getting the Python file name and the Lambda handler setting correct. I used contentHealthLambda.py with contentHealthLambdaHandler as the handler function, so in the Lambda configuration the handler is specified as contentHealthLambda.contentHealthLambdaHandler (file name, then function name).

The code

Note: The code I am including is basic. Almost all of the error handling has been left out.

Standard Python imports

import pymysql
import datetime
import boto3
from cStringIO import StringIO

The Lambda handler function definition:

def contentHealthLambdaHandler(event, context):

Lambda passes parameters as a Python dictionary. I am passing in two parameters, metaData and content.

content = event['content']
metaData = event['metaData']

The metadata is patient information (patientNumber, patientFirstName, patientLastName, …) in JSON format. I have left out the parsing of this as it is trivial.
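As a hypothetical example (the field values are made up), the metaData string might look like this, and parsing it is a single json.loads call:

import json

metaData = '{"patientNumber": "123456", "patientFirstName": "Frank", "patientLastName": "Smith"}'

record = json.loads(metaData)
patientNumber = record['patientNumber']
patientFirstName = record['patientFirstName']
patientLastName = record['patientLastName']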

Make sure the parameters are included:

if 'metaData' in event and 'content' in event:
    content = event['content']
    metaData = event['metaData']
else:
    return "error"

Create an S3 resource. This is used to write to or read from S3.

s3 = boto3.resource('s3')

Store the data to S3. In this case the bucket is fixed but it could be passed as a parameter.

target_bucket = "com.contentstore"

Create a file name. The createDateTime value here is a timestamp captured earlier in the handler with datetime.datetime.now().

target_file = metaData + "_md" + str(createDateTime)

# temp_handle.read() reads like a file handle

temp_handle = StringIO(metaData)

Create or get the bucket

bucket = s3.Bucket(target_bucket)

Write to the bucket

result = bucket.put_object( Key=target_file, Body=temp_handle.read())

That is all there is: six lines of code (error handling not included). Six more lines are required for storing the content to S3. I did not test with very large files, and there may be more effort required in those cases, but I have not seen anyone report additional issues.

Write to the database

The connect string is familiar to anyone who has done Java database coding before.

conn = pymysql.connect(host='myurl',user='ec2user',passwd='xxxxyyy',db='mydb')

My schema contains two tables, a base document and a patient document. Since the patient document has a foreign key to the base document, I have to store them separately. There are Python ORMs that would probably handle this, but it is so simple that basic SQL will suffice. All of the database code is wrapped in a try-except clause; if any of the executes fail, the commit never happens.

Store the base document

cursor = conn.cursor()
baseDocumentInsert = "INSERT INTO documentbase(createDateTime,description,contentURL) VALUES(%s,%s,%s)"
args = (createDateTime, "healthRecord", result.bucket_name +"/"+ result.key)
cursor.execute(baseDocumentInsert, args)

Store the patient document. Note that “cursor.lastrowid” is the id of the base document and becomes the patient document's foreign key.

patientDocumentInsert = "INSERT INTO patientdocument(patientNumber,patientFirstName,patientLastName,docBase) VALUES(%s,%s,%s,%s)"
args = (metaData, "Frank","smith",cursor.lastrowid)
cursor.execute(patientDocumentInsert, args)

If nothing has failed, commit the changes.

  conn.commit()

This is the complete function. It will store content and metadata to S3 and metadata to the database.

import pymysql
import datetime
import boto3
from cStringIO import StringIO

def contentHealthLambdaHandler(event, context):

    # make sure both parameters are present
    if 'metaData' in event and 'content' in event:
        print " metaData: " + event['metaData']
        print " content: " + event['content']
        content = event['content']
        metaData = event['metaData']
    else:
        return "error"

    createDateTime = datetime.datetime.now()
    s3 = boto3.resource('s3')

    # store the meta data
    contents = metaData
    target_bucket = "com.contentstore"
    target_file = metaData + "_md" + str(createDateTime)

    fake_handle = StringIO(contents)
    bucket = s3.Bucket(target_bucket)

    # fake_handle.read() reads like a file handle
    result = bucket.put_object(Key=target_file, Body=fake_handle.read())

    # store the content data
    contents = content
    target_file = metaData + str(createDateTime)
    fake_handle = StringIO(contents)

    bucket = s3.Bucket(target_bucket)
    result = bucket.put_object(Key=target_file, Body=fake_handle.read())

    # store the metadata in the database
    conn = pymysql.connect(host='myurl', user='ec2user', passwd='xxxxx', db='mydb')
    try:
        cursor = conn.cursor()
        baseDocumentInsert = "INSERT INTO documentbase(createDateTime,description,contentURL) VALUES(%s,%s,%s)"
        args = (createDateTime, "healthRecord", result.bucket_name + "/" + result.key)
        cursor.execute(baseDocumentInsert, args)
        print "baseDocumentInsert id " + str(cursor.lastrowid)

        patientDocumentInsert = "INSERT INTO patientdocument(patientNumber,patientFirstName,patientLastName,docBase) VALUES(%s,%s,%s,%s)"
        args = (metaData, "Frank", "smith", cursor.lastrowid)
        cursor.execute(patientDocumentInsert, args)
        conn.commit()

        cursor.execute("SELECT * FROM patientdocument")
        # print all of the rows
        for row in cursor.fetchall():
            print row
        conn.close()
    except Exception as e:
        print(e)
    return "got it"

Testing

The first level of testing is done from the Lambda AWS console.
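A test event is just the JSON the handler expects. The same dictionary can also be fed to the handler directly for a quick local sanity check; the values below are made up, and this assumes contentHealthLambda.py is on the local path:

from contentHealthLambda import contentHealthLambdaHandler

# made-up test event; in the Lambda console this is entered as JSON
event = {
    "metaData": "123456",
    "content": "<HL7 message body goes here>"
}

print contentHealthLambdaHandler(event, None)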

Logging

CloudWatch logs all of the output, so you can easily see what happened.

REST service

In order to use the Lambda function it needs to be exposed as a REST endpoint. This is done using API Gateway. The process is well documented, so I won't go into it here. The API Gateway setup can be done separately, as I did, or at the time the Lambda function is created.

This link walks you through the process: Build an API to Expose a Lambda Function

Testing the new REST service

The simplest way to do this is using Postman. When you create the REST endpoint, the API Gateway console will supply an Invoke URL, which can be used in Postman to test the new service.

The other way to test is using a Python client. The central part of the client is:

requests.post('https://xxxxxxxxxxxxxxxxxxxxx.amazonaws.com/test/ehr', data)

where 'data' is a JSON string.

    

The content is in HL7 format. The metadata is simply a randomly generated patient id. The code below makes twenty POST requests to upload content and metadata to the Lambda function.

import datetime
import json
import requests
from random import randint

starttime = datetime.datetime.now()
for i in range(20):
    id = str(randint(100000, 400000))  # generate a random patient ID
    print(id)
    data = json.dumps({'metaData': id, 'content': 'data removed for simplicity'})
    r = requests.post('https://1gndnfa0ni.execute-api.us-east-1.amazonaws.com/test/ehr', data)
endtime = datetime.datetime.now()
print(endtime - starttime)

The result is less than one second per POST call. The data is small (about 3 KB), and this was done from my home laptop into AWS; I would expect better rates in a “real” environment.

[Screenshots: the content data in S3, and the metadata in the AWS MariaDB tables.]

One issue with Lambda is that it is slower to respond the first time, since it has to spin up. I am not clear on the time window where the function stays warm vs. idle; it's something I need to look into.

Other stuff

As mentioned earlier, there are functions needed beyond storing data. A rough sketch of the two read handlers follows the list below.

  • Read or search for metadata: def contentHealthLambdaHandlerRead(event, context)
  • Read the content itself: def contentHealthLambdaHandlerReadContent(event, context)
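Here is a minimal sketch of what those two read handlers might look like. It assumes the same tables, bucket, and connection settings used above, makes up the event keys (patientNumber, contentKey), and skips error handling:

import pymysql
import boto3

def contentHealthLambdaHandlerRead(event, context):
    # look up the stored metadata rows for one patient
    patientNumber = event['patientNumber']
    conn = pymysql.connect(host='myurl', user='ec2user', passwd='xxxxx', db='mydb')
    cursor = conn.cursor()
    cursor.execute(
        "SELECT b.contentURL, p.patientFirstName, p.patientLastName "
        "FROM patientdocument p JOIN documentbase b ON p.docBase = b.id "
        "WHERE p.patientNumber = %s", (patientNumber,))
    rows = cursor.fetchall()
    conn.close()
    return [{'contentURL': r[0], 'firstName': r[1], 'lastName': r[2]} for r in rows]

def contentHealthLambdaHandlerReadContent(event, context):
    # fetch the stored HL7 content back out of S3 by key
    s3 = boto3.resource('s3')
    obj = s3.Object('com.contentstore', event['contentKey'])
    return obj.get()['Body'].read()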

Transformation

This involves converting various document formats into one standard format, likely PDF. Other ECMs use third-party tools to do this work. Using Lambda would not prevent using a similar third-party tool, but I would prefer that conversion be done beforehand, in the code calling the REST service; it's not an integral part of the content store. Another way to achieve this is to use a Lambda trigger to start the transformation. ImageMagick or LibreOffice can be used to convert the files as they are written to S3.
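A minimal sketch of the trigger approach, assuming the standard S3 put-event shape that Lambda delivers; the function name is made up and the actual conversion step is left as a placeholder:

import boto3

s3 = boto3.resource('s3')

def transformOnUpload(event, context):
    # S3 notifications arrive as a list of Records, each naming a bucket and key
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        body = s3.Object(bucket, key).get()['Body'].read()
        # placeholder: convert 'body' to PDF (ImageMagick, LibreOffice, ...), then
        # write the result to a separate bucket so the trigger does not fire again
        s3.Object('com.contentstore.transformed', key + '.pdf').put(Body=body)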

Thumbnails

This is also where third-party tools come into play. Lambda has a great way to handle this: triggers. A function is set up to trigger on a file added to S3, and it handles the process of creating the image or images. The examples of this use something like ImageMagick. The only issue I found is that it is currently only usable in Lambda with JavaScript. It's not a big deal, but I'd have to part from Python for a while.

Versions

S3 can version documents automatically. AWS lifecycle rules can use versions to move data to other storage options such as Glacier. “boto3” supports S3 versions, so it's possible to filter and return information based on versions.
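With versioning enabled on the bucket, listing the versions of a single key with boto3 looks roughly like this (bucket and key names are placeholders):

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('com.contentstore')

# every stored version of one document key
for version in bucket.object_versions.filter(Prefix='123456_md'):
    print version.id, version.last_modified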

Conclusion

Lambda is becoming AWS's path forward; they continue to improve it and add features.

This effort was for educational purposes, but it shows how the tools we have today can make building great software so much simpler.
