Multi-channel Attribution Using Neo4j Graph Database

Neo4j Graph: Multi-channel Attribution

Business Applications

Introduction

Globally, more than $500 billion was spent on advertising (Lunden, 2013). One of the greatest challenges of spending money on advertising is trying to understand the impact of those dollars on sales. With the proliferation of multiple mediums or channels (TV, search engines, social media, gaming platforms, and mobile) on which precious marketing dollars can be spent, a Chief Marketing Officer (CMO) is in dire need of insight into the return on the investment in each medium. More importantly, the CMO needs timely data to prove that spending on a specific channel has a good return on investment. Neo4j can be used to help marketing applications get answers to tough questions:

  • How much was the increase in web awareness of the product after a commercial aired on a specific TV channel on a specific date in a specific geographic area?
  • How much of that web awareness translated into…


Image processing to find tissue contours

The GRiTS project (https://rickerg.com/projects/project/) considered how genes interact with each other
in space and time. Evaluating this begins by determining the border structure of the image. This was done manually, using a drawing tool to outline the border. The resulting border data points were captured and fed into a tool such as 3dMax to create a multidimensional shell, which gave the imaging tool a reference for aligning gene data points within the volume.

I have been interested in processes that would make this more automated. The GRiTS viewer tools were developed in C++ and Qt. ImageJ was also used to pre-process the images to remove some of the extraneous information within them.

OpenCV

OpenCV is a C++ library designed for various image processing and machine vision algorithms. Here is a sample image that I am using.

The first thing to note is that some of the images are in color and some are grayscale. The colors are used in some cases to indicate a specific piece of information. In this case I am looking for contours, and for that a grayscale image works better.

Gray

The function cvtColor() will convert to grayscale. In the version of OpenCV I am using, the call takes the enum CV_BGR2GRAY as the option; in later versions I believe this has changed to COLOR_BGR2GRAY. The function takes a src and dest image along with the appropriate code.
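A minimal sketch of the call (the variable names are my own):

cv::Mat src = cv::imread("Dcc29.jpg");   // sample image; file name borrowed from the ImageJ example below
cv::Mat gray;
cv::cvtColor(src, gray, CV_BGR2GRAY);    // COLOR_BGR2GRAY in later versions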

Blur

There are a number of blur options available. I have started out with the basic normalized box filter blur, with the kernel size set to 3×3 to start with.
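In code, a sketch of that filter, applied to the grayscale image from the previous step:

cv::Mat blurred;
cv::blur(gray, blurred, cv::Size(3, 3));   // normalized box filter with a 3x3 kernel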

Threshold

This process removes unwanted values. But “unwanted” depends on the image and what you are trying to eliminate. In this case I am looking for pixels that make up the boundary. I don’t want to be too harsh in removing values, since this causes large gaps that then have to be filled. For this test I have selected the threshold value to be 50 and the max value to be 250. Of course these values will change depending on the image. I suspect that this will require applying some statistics and machine learning to create “best guess” starting values. After all, the goal is to make this as automatic as possible.

threshold(src, dest, 50, 250, THRESH_TOZERO);   // threshold_value = 50, max_BINARY_value = 250

Edge

After blur and threshold I applied an edge detection process, using the Canny algorithm. lowThresh and highThresh define the threshold levels for the hysteresis process; edgeThresh is the aperture size for the Sobel operator.

int edgeThresh = 3;
double lowThresh = 20;
double highThresh = 40;
Canny(src, dest, lowThresh, highThresh, edgeThresh);

[Image: openview contour]

Contour

The edge detection process creates a lot of segments. Contouring will try to connect some of the segments into longer pieces.

CV_RETR_EXTERNAL: find only the outer contours
CV_CHAIN_APPROX_SIMPLE: compresses segments
Point(0, 0): offset by which every contour is shifted
Contours are stored in the contours variable, a vector<vector<Point>>; the hierarchy output is a vector<Vec4i>.

findContours(edgeDest, contours, hierarchy, CV_RETR_EXTERNAL, CV_CHAIN_APPROX_SIMPLE, Point(0, 0));
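Putting the steps together, here is a self-contained sketch of the whole pipeline as I understand it (file name and parameter values as above; a starting point, not a polished tool):

#include <opencv2/opencv.hpp>
using namespace cv;

int main() {
    Mat src = imread("Dcc29.jpg");
    Mat gray, blurred, thresh, edges;
    cvtColor(src, gray, CV_BGR2GRAY);                    // grayscale
    blur(gray, blurred, Size(3, 3));                     // 3x3 normalized box filter
    threshold(blurred, thresh, 50, 250, THRESH_TOZERO);  // drop weak pixels
    Canny(thresh, edges, 20, 40, 3);                     // hysteresis thresholds 20/40, Sobel aperture 3
    std::vector<std::vector<Point> > contours;
    std::vector<Vec4i> hierarchy;
    findContours(edges, contours, hierarchy, CV_RETR_EXTERNAL, CV_CHAIN_APPROX_SIMPLE, Point(0, 0));
    drawContours(src, contours, -1, Scalar(0, 255, 0));  // overlay all contours in green for inspection
    imwrite("contours.jpg", src);
    return 0;
}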

[Image: openview]

ImageJ

I took the same processes used with OpenCV and implemented them with ImageJ. Using ImageJ is different from OpenCV: it is really designed so that the developer creates plugins for the ImageJ tool to use, whereas I expected to use ImageJ as a library, part of a bigger app.

[Image: ImageJ result]

The code below was used to create the above image. The raw image is on the left.

// open the image twice: one copy stays raw for display, the other gets processed
ImagePlus rawimp = IJ.openImage("Dcc29.jpg");
ImageProcessor rawip = rawimp.getProcessor();
rawip = rawip.resize(rawip.getWidth()/2, rawip.getHeight()/2);

ImagePlus imp = IJ.openImage("Dcc29.jpg");
ImageConverter improc = new ImageConverter(imp);
improc.convertToGray16();                          // convert to 16-bit grayscale
ImageProcessor ip = imp.getProcessor();
ip.blurGaussian(1.6);                              // Gaussian blur, sigma 1.6
ip.threshold(185);                                 // threshold at 185
ip = ip.resize(ip.getWidth()/2, ip.getHeight()/2);
ip.erode();                                        // thin out noise
ip.findEdges();                                    // Sobel edge detection

Scale over time

Another issue is scale. The complete set of images represents development over a period of time. In the beginning the images are small; by the end of the series they are considerably larger. Mapping positions on the cell images as they grow is still a challenge. Landmarks change over time, coming and going, so they can’t be counted on. The images are obtained at intervals which are relatively close together. This means that points on one image will be close to points in similar positions on images from similar time periods.

Consider an image in the middle of the series (image #20) at day 15. Points on this image should be close to points on a middle image at day 18.

By interpolating between images it may be possible to track point movement over a period of time.
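A minimal sketch of the idea, assuming corresponding points have already been matched between two images (the matching itself is the hard part):

// linearly interpolate the position of a matched point between times t0 and t1
cv::Point2f interpolate(const cv::Point2f &p0, const cv::Point2f &p1,
                        float t0, float t1, float t)
{
    float a = (t - t0) / (t1 - t0);   // fraction of the way from t0 to t1
    return p0 + a * (p1 - p0);
}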


Internet of Related Things (IORT)

The topic of the Internet of Things (IOT) is quite popular today. There is a lot of discussion about IP-enabled devices, APIs, storage, and other ideas that address the “How” of IOT. Having spent a good many years developing software, this is a process I have seen over and over: a customer wants A, and immediately everyone starts to suggest ways to implement A. Rarely does anyone ask “why A?” What is the problem being solved here?

As I see it, the idea behind IOT is that there are “things” in the world that we’d like to track, and in order to do so they need to be internet enabled. Of course some things have to be internet enabled, but not all things. I spent a number of years working with RFID in both access control and asset tracking. It is not simple to track things based simply on location; typically there needs to be some additional context, which helps the system or user make a more informed decision about an item’s real location. Location could come from RFID, Bluetooth, barcode, WiFi, or be logged manually by a person (the car is parked in the Donald Duck lot, aisle 134, row 654). Context lets the user or system make sense of the location. Is the item really in the paint shop or the radio shop? If the item is electronic, a radio, then it is not likely in the paint shop.

When designing a system it is worth considering how people perform a similar activity. In this case the context of knowing what the item is may not help. An item such as car keys could be anywhere; knowing more about the item doesn’t really help. How do I find the car keys? One way is to wander around the house, look in the car, or check various coat pockets. More likely I will ask someone. That someone may not know, but they may know someone who does. I might start by asking my daughter. If she doesn’t know, she could suggest asking mom, since they went to the store in my car recently. I’ll call my wife and she will tell me they are in the junk drawer on the counter in the kitchen.
Another scenario: tracking a package. If I inquire about a package I am expecting, the delivery service doesn’t know where the box is. They only know where it is in relation to other things, such as a truck, warehouse, or plane. The service will likely know the truck’s or driver’s current or most recent location. They can tell me that my package is on a truck in Portland, Maine. It is 5:00pm and I am in Orono, Maine, three hours away, so I am not likely to see my package until tomorrow. The idea is that people manage things not by absolute location but through relationships. The package is related to the truck, which has a known location. The car keys are in the junk drawer in the kitchen; I know where to look.

The process can best be described with a graph. Finding a thing involves finding a path between nodes: the car keys and me. It’s about how nodes (things) are related to each other. Below is a diagram of how the relations might be described.

Relations are defined as we naturally see them. A Family is made up of persons, not rooms or items.

Relationships are further defined using graph terms: vertices or nodes together with edges or links. I am going to use a graph database, and so I have assigned properties to the edges. They are not needed for what I am trying to do right now, but they may be useful if there is ever a need to define more specific paths, or to ignore some paths.

Test the idea

To try this idea I am using the Neo4j (http://neo4j.com/) graph database along with Python and py2neo (http://py2neo.org/). The basic process is to create nodes and then connect them by defining edges (relationships).

One aspect of Neo4j that I am not settled on is the ability to define “relationships” as needed. It’s useful to the developer, but a user has no way of knowing what relationships are used and how. Something like RDF triples maybe?

For this test there are three relationships. For the most part I use “CONTAINS”.

  1. BELONGTO: a person “BELONGTO” a family
  2. CONTAINS: a house “CONTAINS” a room
  3. LIVESIN: a person “LIVESIN” a house

Data

[Diagram: IORT nodes and relationships]


Create nodes and edges

Use Python and py2neo to create the nodes and then the relationships between them.

from py2neo import Graph, Node, Relationship

# open a connection to the running service
graph = Graph()

# create a node for the Family
family = Node("Family", name="Family", title="Smith")
graph.create(family)

# create a node for the House
house = Node("House", name="House", title="Smith Family House")
graph.create(house)

# create a Location node
kitchen = Node("Location", name="Room", title="Kitchen")
graph.create(kitchen)

# create an Item node
oven = Node("Item", name="Appliance", title="Oven")
graph.create(oven)

# finally, create a Person node
dad = Node("Person", name="Person", title="Dad")
graph.create(dad)

Relate the nodes

[Diagram: IORT relationships (1)]

[Diagram: IORT relationships (2)]

An example of creating the relationships:

graph.create(Relationship(dad, "BELONGTO", family))
graph.create(Relationship(house, "CONTAINS", kitchen))
graph.create(Relationship(kitchen, "CONTAINS", oven))
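As mentioned above, edges can also carry properties. A hypothetical example, filling in the LIVESIN relationship from the list (the since property is my own invention and unused for now):

# an edge with a property; py2neo accepts properties as keyword arguments
graph.create(Relationship(dad, "LIVESIN", house, since=1995))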

The graph

[Image: the resulting graph]

Where are my keys?

Finally, it’s time to find the car keys.
What I really want is the path from me (Dad in this example) to the keys. Neo4j can find the shortest path between two nodes fairly easily.

Using Cypher from the database browser I executed this MATCH query:

MATCH (person:Person { title: 'Dad' }), (keys:Thing { name: 'Thing', title: 'Car Keys' }),
p = shortestPath((person)-[*]-(keys))
RETURN p

The result of the search is the path from “Dad” to the “Car Keys”.

[Image: the matched path from Dad to the car keys]
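The same query can be run from Python. A minimal sketch using the py2neo 2.x cypher interface (assuming the graph object and nodes from above):

results = graph.cypher.execute(
    "MATCH (person:Person { title: 'Dad' }), (keys:Thing { title: 'Car Keys' }), "
    "p = shortestPath((person)-[*]-(keys)) RETURN p")
for record in results:
    print(record.p)   # the path object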

How does the junk drawer know about the car keys? Maybe a Bluetooth or RFID fob on the car keys? Maybe the junk drawer is a plain drawer with a “smart pad” inside? I am still focused on the What and the Why; the “How” is about the technology, and that will work out as the needs become clearer.

One issue that will arise is a “thing” appearing to be in two places at the same time. This wouldn’t likely happen in the package delivery scenario, but with Bluetooth devices in a home setting it might, if two sensors “see” the same device. A device such as a smart pad should be designed (or configurable) to detect only at close range. The system could also use historical data to predict the likelihood of a thing being in one of the two places.

To start with the “How” is to look for a problem to fit the solution. When people understand what the idea of IOT can offer, there will be a driving interest to figure out the “How”.


Graph Database – project Tycho with RNeo4j

Before I continue with R, I discovered another issue. When I started to process the data sets I obtained the lat/lon for each of the cities. I have used the spatial component of Neo4j before, in a Java application, and I was disappointed to discover that py2neo doesn’t support it yet. The Neo4j Python REST Client does have support. Since spatial indexes can be added to existing data, it’s not a big deal.

Having read Nicole White’s writings on Neo4j and R (http://nicolewhite.github.io/), I was excited to see how R works with Neo4j.

R and Neo4j
Having used R for spatial analysis, I was curious to see what could be done with R and the Tycho data. I have yet to add spatial indexes, but there are still a lot of other operations that don’t need spatial.

I am using RStudio. RNeo4j can be found at http://nicolewhite.github.io/RNeo4j/. The installation instructions work well. Following the examples given with the install, I was able to run a simple query and obtain a data frame.

# load RNeo4j
library(RNeo4j)

# create a connection
graph = startGraph("http://localhost:7474/db/data/")

# this just tells me that the connection is good
graph$version

Define a query; here I am using one that just returns the list of states. The first time, I ran the query just as I had it in Python:
query = "MATCH (n:`State`) RETURN n"
I received this strange error:

Error in as.data.frame.list(value, row.names = rlabs) :
supplied 54 row names for 1 rows

Nicole was quick to point out that to return whole nodes from Cypher you should use getNodes(); for relationships, use getRels().
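A minimal sketch of getNodes(), assuming the same graph object (my own example, not from the install docs):

states = getNodes(graph, "MATCH (n:`State`) RETURN n")
length(states)   # one entry per state node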
The new query: "MATCH (n:State) RETURN n.description"

Now when I run the query there is no error, and I get a list of states.
df = cypher(graph, query)
Display the results:

n.description
1 CA
2 GA
3 CT
4 AK
5 AL
6 CO
7 DC
and so on

The query "MATCH (n:Case) RETURN n.year" returns too much data to print out. Instead I filtered the data to a date range, n.year > 1900 and n.year < 1950:

query = "MATCH (n:Case) WHERE n.year > 1900 AND n.year < 1950 RETURN n.year, count(*) as count"
Looking at rows of data is okay, but what I really wanted was something visual.
Again, Nicole’s site has examples of plotting. I started with plot();
this took a bit of experimenting to get right.

plot(df,xlab="year",ylab="count",main="CASES counts, all states, between 1900 and 1950" )

[Plot: case counts, all states, between 1900 and 1950]

This is more like it. What else is there?
ggplot() ?

The command below took a while to get correct. The aes() function wasn’t clear, but an explanation from here (http://docs.ggplot2.org/0.9.3/aes.html) helped. There are enough examples that explain ggplot already, so I am not going into details.

ggplot(df, aes(x = n.year, y = count)) +
  geom_bar(stat = "identity", fill = "darkblue") +
  coord_flip() +
  labs(x = "Year", y = "Count", title = "Count of MEASLES over the entire range") +
  theme(axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 14, color = "black"),
        plot.title = element_text(size = 16, color = "black"))

[Plot: the same counts drawn with ggplot]

With my confidence up, it was time to try something more interesting, more R-like.
The query below returns all cases of the disease MEASLES; I want to try to fit a linear model to the results.

query = "match (d:Disease)<-[:CASE_OF]-(dt:Case)--(ct:City)<-[:CITY_IN_STATE]-(st:State) where d.description = 'MEASLES' return dt.year, count(*) as count"

[Plot: MEASLES counts, all years, before fitting]

Fitting a linear model to the data:
model = lm(formula = count ~ dt.year, data = df)
And then drawing the “Line of Best Fit” over the plot:
abline(model)
[Plot: MEASLES counts, all states, all years, with the fitted line]

By adding a state qualifier to the query I could view measles counts for a given state.
[Plots: MEASLES counts for Maine, Texas, Florida, and California, all years]

It is interesting to see the differences between the states. Most of the queries showed a peak around the 1920s and then a decline. I have found some information that indicates a peak, or at least an elevated number of reported cases, in the 1920s, and the data would match that. What seems to be up for discussion is why. I am not an epidemiologist, so I won’t go any further.


Graph Database – project Tycho part 5

The loading process was taking too long, and I felt there were several options. One idea was to switch to Java and load the data over an embedded connection. Another was to try to use the batch loader; I looked at this, and it seemed difficult considering how my data is arranged. Or maybe I am not that smart? The final option was py2neo and its batch mode. I really want to stay with Python for now, so I chose the latter.

The batch process is a bit different, and it took some time to rework the code. I have the data broken down by state. I tested loading a complete state in one batch, as well as breaking each state’s data into smaller batches.

Setup:
This code is run once at the beginning.

 from py2neo import neo4j, node, rel

 gdb = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
 batch = neo4j.WriteBatch(gdb)
 # create an index:
 city_idx = gdb.get_or_create_index(neo4j.Node, 'Cities')

For each city in each state, create a city node. The State nodes are created in the same fashion. After both nodes are created, a relationship is created between the two.

 cityNode = batch.create(node(name=cityName))
 batch.set_properties(cityNode, {'description': cityName})
 batch.add_labels(cityNode, "City")
 batch.add_to_index(neo4j.Node, city_idx, 'name', cityName, cityNode)
 # relate the state and the city:
 batch.create(rel(stateNode, "CITY_IN_STATE", cityNode))
 # once a batch is filled, submit it to the server
 batch.submit()

There wasn’t any significant difference between the two batch sizes, but the overall time was better: the old process would run all night, while the batch process would complete in 4-5 hours. I didn’t expect to load the data over and over, so this was okay.

There seemed to be an issue…

With the complete data set loaded, the size of the database was 1.2G. Maybe that was okay? The next step was to try a few queries. One of the first things I tried was counting the number of nodes of a given type. The data I pulled from the site only had eight different diseases, so I was expecting a count of eight. I was surprised to find a lot more (I forget how many, just a lot more than eight). It was clear I was doing something incorrectly, and the extra nodes needed to be removed.

The data.
In the files, each row contains either a death or a reported case event, along with the disease. What I discovered was that for each event I was creating a new disease node. Not what I wanted! What I had intended was one node for each of the diseases, referenced by the events: the same process that had been used for states and cities.

Code re-work
There was already a file containing the disease information, so it was simple to process it and create the required nodes. As each event was processed I would need to associate the appropriate disease node. Knowing how many records there are, the last thing I wanted was to query the database each time; after all, the batch process was supposed to speed things up, and this change was likely to make the process much slower. The solution was to cache the disease nodes in a Python dictionary.

At the start of the load process all eight of the disease nodes are created, and each node is added to the map with the disease name as the key.

 simplelist = {}
 load_disease()
 get_diseases(simplelist)

Once all of the diseases are loaded they are retrieved and added to the map. The ‘description’ property is the disease name; the node is stored under that key.

 def get_diseases(diseases):
     query = neo4j.CypherQuery(gdb, "MATCH (n:`Disease`) RETURN n")
     records = query.execute()
     for n in records:
         node = n[0]
         print(node["description"])
         diseases[node["description"]] = node


When it came time to load the events, all that was required was to parse out the disease name and locate the node in the map.
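A sketch of that lookup during event loading (the variable names and the parse_disease helper are hypothetical):

 diseaseName = parse_disease(row)                  # parsed from the event row
 diseaseNode = simplelist[diseaseName]             # cached node, no database query needed
 caseNode = batch.create(node(name=caseName))
 batch.create(rel(caseNode, "CASE_OF", diseaseNode))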

The process still takes 2-3 hours to complete, but the size is down to about 800M. Running a variety of queries turned up an issue I hadn’t counted on: some states had cities that didn’t belong. Was it the raw data or a bug in my load process? I looked at the raw data files and found the problem was there. I deleted the data and pulled it from the site again; this time I didn’t see the problem. After another reload of the database, the queries looked better.

The total node count is about 700k, but most queries run pretty quickly.

Make the database public
As mentioned in an earlier section, I loaded the database onto an Amazon EC2 micro instance. I wanted something that was free for the immediate future; I don’t have extra money to spend just for the heck of it.
The first time I loaded the data I had selected only about a third of the information, which made it easier to upload. With the corrected structure and all of the data, I wanted to push the entire set up. All seemed fine until I tried anything beyond a very simple query: using the server’s browser viewer I would routinely get messages about being disconnected. The queries ran fine on my laptop, so I wasn’t sure if this was an installation or configuration issue on EC2. I had the idea to look at the EC2 console and see if that would tell me anything. It did: the CPU was showing 100%. Apparently the micro instance has less power than the i7 in my laptop? You get what you pay for, I suppose.

I went back and created a new database, again with only a third of the data (but with the new structure). It works on EC2, but it’s not what I really wanted.

Below are some queries from the data set.

 match (d:Disease)<-[:CASE_OF]-(dt:Case)--(ct:City)<-[:CITY_IN_STATE]-(st:State) where st.description ="CA" and dt.year=1920 and dt.week=10 return st,ct,dt,d

[Graph: California, 1920, week 10, cases]


match (d:Disease)<-[:DEATH_FROM]-(dt:Death)--(ct:City)<-[:CITY_IN_STATE]-(st:State) where st.description ="CA" and dt.year=1920 and dt.week=10
return st,ct,dt,d

[Graph: California, 1920, week 10, deaths]

match (d:Disease)<-[:DEATH_FROM]-(dt:Death)--(ct:City)<-[:CITY_IN_STATE]-(st:State) where st.description ="ME" and dt.year=1920 and dt.week=10
return st,ct,dt,d

[Graph: Maine, 1920, week 10, deaths]

match (d:Disease)<-[:CASE_OF]-(dt:Case)--(ct:City)<-[:CITY_IN_STATE]-(st:State) where st.description ="ME" and dt.year=1920 and dt.week=10
return st,ct,dt,d

[Graph: Maine, 1920, week 10, cases]


Graph Database – project Tycho part 4

The process of loading the database is slow. I’ll need to determine how to use the batch loader before I can utilize all of the data.

Looking at the data on my laptop was interesting, but wanting to share it meant doing more. I decided to go the Amazon EC2 route.

Update: the micro EC2 instance is not capable of handling even a third of the data. I have pulled the instance and will look for another hosting solution.

The next step is to create Cypher queries. These two give an idea of what can be found.

match (st:State)-[:CITY_IN_STATE]->(ct:City)
where st.name ="AL"
return ct;

match (st:State)-[:CITY_IN_STATE]-(ct:City)-[:HAS_DEATH]-(d:Death)
where st.name ="CA" and d.year=1920
return ct,d

Since this is still in a testing phase, I have limited the data set to 50 cases and deaths for each city. I also need to geocode the cities so that interesting spatial queries can be done.

Next, use Xamarin to develop an app to view and query the data.

Or… node.js with D3.js?



Graph Database – project Tycho part 3

With the data in files by state it is time to create and fill the database.

Again I went with Python, since it is simple to connect using REST. The first step was to determine the nodes and relationships. As far as I can tell there is no clear recipe in this area: with a relational database one would start with tables and then add foreign keys; with Neo4j I decided to create nodes based on the key items in the data.

Nodes:
State, City, Disease, Case, Death

Relationships are the glue that connects nodes. Unlike foreign keys, they can be more descriptive. Consider State and City: one relationship I considered was “HAS_A” (a State “HAS_A” City), but I chose “CITY_IN_STATE” instead.

Relationships:

“CITY_IN_STATE”, “CASE_OF”, “HAS_CASE”, “DEATH_FROM”, “HAS_DEATH”

State -> CITY_IN_STATE -> City
Case -> CASE_OF -> Disease
City -> HAS_CASE -> Case
Death-> DEATH_FROM -> Disease
City -> HAS_DEATH -> Death

Using Python to create the database:

gdb = GraphDatabase("http://localhost:7474/db/data")
state_idx = gdb.node.indexes.create('states')
city_idx = gdb.node.indexes.create('cities')
disease_idx = gdb.node.indexes.create('diseases')
stateLabel = gdb.labels.create("State")
cityLabel = gdb.labels.create("City")
diseaseLabel = gdb.labels.create("Disease")

I then created functions to create nodes, driven by the data files.

def create_city(name):
    cityNode = gdb.node(name=name, description=name)
    cityLabel.add(cityNode)
    city_idx['name'][name] = cityNode
    return cityNode

There are similar functions for states, cases, diseases, and deaths.
Create a city and state node, then create the relationship:

   cityNode = create_city(cityName)
   stateNode.relationships.create("CITY_IN_STATE", cityNode)

Below is how cases are created and the relationships built.

caseNode = create_case_event(caseName, "CASE", year, week, number, stateName)
caseNode.relationships.create("CASE_OF", diseaseNode)
cityNode.relationships.create("HAS_CASE", caseNode)
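Deaths follow the same pattern. A sketch, assuming a create_death_event helper that mirrors create_case_event (the helper name is hypothetical):

deathNode = create_death_event(deathName, "DEATH", year, week, number, stateName)
deathNode.relationships.create("DEATH_FROM", diseaseNode)
cityNode.relationships.create("HAS_DEATH", deathNode)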

The results
