Ontario First Nations Libraries Compared Using Ontario Open Data

I recently downloaded a very cool dataset on Ontario libraries from the Ontario Open Data Catalogue.  The dataset contains 142 columns of information describing 386 libraries in Ontario, representing a fantastically massive data collection effort for such important cultural institutions (although the most recent information available is as of 2010).  One column which particularly caught my interest was “Library Service Type”, which breaks the libraries down into:

  • Public or Union Library (247)
  • LSB Library (4)
  • First Nations Library (43)
  • County, County co-operative or Regional Municipality Library (13)
  • Contracting Municipality (49)
  • Contracting LSB (14)

I saw the First Nations Library type and thought it would be really educational to compare First Nations libraries against all the other library types combined, based on some interesting indicators.  To make these comparisons in this post, I use a few violin plots: where the plot is bulkier, the values on the y axis are more common for that library type; where it is thinner, they are rarer.

Our first comparison, shown below, reveals that local population sizes are a LOT more variable amongst the “Other” library types compared to First Nations libraries.  From first to third quartile, First Nations libraries tend to have around 250 to 850 local residents, whereas Other libraries tend to have around 1,110 to 18,530 local residents!

Local Population Sizes by Library Type

             isFN.Library 0%    25%  50%   75%    100%
1         Other Libraries 28 1113.5 5079 18529 2773000
2 First Nations Libraries 55  254.5  421   857   11297
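As an aside, summaries like the table above (R’s quantile() by library type) are easy to reproduce elsewhere; here’s a minimal Python sketch using only the standard library, with invented numbers standing in for the real dataset:

```python
from statistics import quantiles

# Hypothetical stand-in for the Ontario dataset: (library type, resident population)
rows = [
    ("Other", 28), ("Other", 1114), ("Other", 5079),
    ("Other", 18529), ("Other", 2773000),
    ("First Nations", 55), ("First Nations", 254), ("First Nations", 421),
    ("First Nations", 857), ("First Nations", 11297),
]

def five_number_summary(values):
    """min, Q1, median, Q3, max -- mirroring R's quantile() output."""
    vs = sorted(values)
    # method="inclusive" matches R's default quantile type 7
    q1, q2, q3 = quantiles(vs, n=4, method="inclusive")
    return (vs[0], q1, q2, q3, vs[-1])

# group populations by library type, then summarize each group
by_type = {}
for lib_type, pop in rows:
    by_type.setdefault(lib_type, []).append(pop)

for lib_type, pops in by_type.items():
    print(lib_type, five_number_summary(pops))
```

With five sorted values per group, the quartiles fall exactly on the 2nd and 4th points, so the output echoes the table's shape.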

Considering the huge difference in the population sizes that these libraries were made to serve, comparisons between library types need to be weighted according to those sizes, so that the comparisons are made proportionate.  In that spirit, the next plot compares the distribution of the number of cardholders per resident by library type.  Thinking about this metric for a moment, it’s possible that a person not living in the neighbourhood of the library can get a card there.  If all the residents of the library’s neighbourhood have a card, and there are people outside of that neighbourhood with cards, then a library could have over 1 cardholder per resident.
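To make the over-1 case concrete (with invented numbers): a neighbourhood of 400 residents, all carrying cards, plus 50 cardholders from outside the neighbourhood, ends up above 1 cardholder per resident.

```python
def cardholders_per_resident(cardholders, residents):
    """The ratio can exceed 1 when out-of-neighbourhood members hold cards."""
    return cardholders / residents

# 400 local residents, all with cards, plus 50 out-of-area cardholders
print(cardholders_per_resident(400 + 50, 400))  # -> 1.125
```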

Looking at the plot, a couple of things become apparent.  Firstly, First Nations libraries appear more likely to be overloaded with cardholders (more cardholders than there are local residents: 14% of First Nations libraries, vs. 4% of Other libraries).  On the lower end of the spectrum, First Nations libraries show a slight (non-significant) tendency towards fewer cardholders per resident than Other libraries.

Cardholders per Resident by Library Type

             isFN.Library 0%  25%  50%  75% 100%
1         Other Libraries  0 0.20 0.37 0.55  2.1
2 First Nations Libraries  0 0.19 0.32 0.77  2.8

Next we’ll look at a very interesting metric, because it looks so different in its raw form compared to when it’s weighted by population size.  The plot below shows the distribution of English titles in circulation by library type.  It shouldn’t be too surprising that Other libraries, serving population sizes ranging from small to VERY large, also vary quite widely in the number of English titles in circulation (from around 5,600 to 55,000, first to third quartile).  First Nations libraries, on the other hand, serving smaller populations, vary much less in this regard (from around 1,500 to 5,600, first to third quartile).

Num English Titles in Circulation by Library Type

             isFN.Library 0%    25%   50%   75%   100%
1         Other Libraries  0 5637.5 21054 54879 924635
2 First Nations Libraries  0 1500.0  3800  5650  25180

Although the above perspective reveals that First Nations libraries tend to have considerably fewer English titles in circulation, things look pretty different when you weight this metric by the local population size.  Here, the plot for First Nations libraries looks very much like a Hershey’s Kiss, whereas the Other libraries plot looks a bit like a toilet plunger.  In other words, First Nations libraries tend to have more English titles in circulation per resident than Other libraries.  This doesn’t say anything about the quality of those books available in First Nations libraries.  For that reason, it would be nice to have a measure even as simple as median/average age/copyright date of the books in the libraries to serve as a rough proxy for the quality of the books sitting in each library.  That way, we’d know whether the books in these libraries are up to date, or antiquated.

English Titles in Circulation per Resident by Library Type

             isFN.Library 0%       25%      50%       75%      100%
1         Other Libraries  0 0.9245169 2.698802  5.179767 119.61462
2 First Nations Libraries  0 2.0614922 7.436399 13.387416  51.14423

For the next plot, I took all of the “per-person” values, and normed them.  That is to say, for any given value on the variables represented below, I subtracted from that value the minimum possible value, and then divided the result by the range of values on that measure.  Thus, any and all values close to 1 are the higher values, and those closer to 0 are the lower values.  I then took the median value (by library type) for each measure, and plotted below.  Expressed this way, flawed though it may be, we see that First Nations Libraries tend to spend more money per local resident, across areas, than Other libraries.  The revenue side looks a bit different.  While they tend to get more revenue per local resident, they appear to generate less self-generated revenue, get fewer donations, and get less money in local operating grants, all in proportion to the number of local residents.  The three areas where they are excelling (again, this is a median measure) are total operating revenue, provincial operating funding, and especially project grants.
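The norming step described above is plain min-max scaling; here’s a minimal sketch, with made-up per-resident spending figures:

```python
def min_max_norm(values):
    """Scale each value to [0, 1]: subtract the minimum, divide by the range."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # guard: a constant column has no spread
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# e.g. per-resident spending figures (invented numbers)
spend = [2.0, 5.0, 11.0]
print(min_max_norm(spend))             # lowest -> 0.0, highest -> 1.0
```

After norming, medians by library type can be compared on the same 0-to-1 footing across all of the cost and revenue measures.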
Normed Costs and Revenues by Library Type

Here I decided to zero in on the distributional differences in net profit per resident by library type.  Considering that libraries are non-profit institutions, you would expect to see something similar to the plot shown for “Other” libraries, where the overwhelming majority are at or around the zero line.  It’s interesting to me then, especially since I work with non-profit institutions, to see the crazy variability in the First Nations libraries plot.  The upper end appears to come from some outrageously high outliers, so I decided to take them out and replot.

Net Profit per Resident Population

In the plot below, I’ve effectively zoomed in, and can see that there do seem to be more libraries showing a net loss per resident than a net gain.

Normed Costs and Revenues by Library Type - Zoomed In

             isFN.Library      0%    25%   50%  75%   100%
1         Other Libraries -149.87  -0.49  0.00 1.16  34.35
2 First Nations Libraries  -76.55 -17.09 -0.88 0.40 250.54

I wanted to see this net profit per person measure mapped out across Ontario, so I used the wonderful ggmap package, which to my delight is Canadian friendly!  Go Canada!  In this first map, we see that First Nations libraries in Southern Ontario (the part of Ontario that looks like the head of a dragon) seem to be “okay” on this measure, with one library at the “neck” of the dragon seeming to take on a little more red of a shade, one further west taking on a very bright green, and a few closer to Manitoba appearing to be the worst.

Net Profit per Local Resident Amongst First Nations Libraries

To provide more visual clarity on these poorly performing libraries, I took away all libraries at or above zero on this measure.  Now there are fewer distractions, and it’s easier to see the worst performers.

Net Profit per Local Resident Amongst First Nations Libraries - in the red

Finally, as a sanity check, I re-expressed the above measure as a ratio of total operating revenue to total operating expenditure, to see if the resulting geographical pattern was similar.  Any library with a value of less than 1 is spending more than it makes in revenue, and is thus “in the red”.  While there are some differences in how the colours are arrayed across Ontario, the result is largely the same.

Operating Revenue to Cost Ratio Amongst First Nations Libraries

Finally, I have one last graph that does seem to show a good-news story.  When I looked at the ratio of annual program attendance to local population size, I found that First Nations libraries seem to attract more people every year, proportionate to population size, compared to Other libraries!  This might have something to do with the draw of a cultural institution in a small community, but feel free to tell me some first hand stories either running against this result, or confirming it if you will:

Annual Program Attendance by Library Type

             isFN.Library 0%    25%   50%   75%    100%
1         Other Libraries  0 0.018 0.155 0.307  8.8017
2 First Nations Libraries  0 0.113 0.357 2.686  21.361

That’s it for now! If you have any questions, or ideas for further analysis, don’t hesitate to drop me a line :)

As a final note, I think it’s fantastic that this data collection was done, but the fact that the most recent data available is from 2010 is disappointing.  What happened here?  Libraries are so important across the board, so please, Ontario provincial government, keep up the data collection efforts!

A Delicious Analysis! (aka topic modelling using recipes)

A few months ago, I saw a link on twitter to an awesome graph charting the similarities of different foods based on their flavour compounds, in addition to their prevalence in recipes (see the whole study, The Flavor Network and the Principles of Food Pairing).  I thought this was really neat and became interested in potentially using the data for something slightly different; to figure out which ingredients tended to correlate across recipes.  I emailed one of the authors, Yong-Yeol Ahn, who is a real mensch by the way, and he let me know that the raw recipe data is readily available on his website!

Given my goal of looking for which ingredients correlate across recipes, I figured this would be the perfect opportunity to use topic modelling (here I use Latent Dirichlet Allocation or LDA).  Usually in topic modelling you have a lot of filtering to do.  Not so with these recipe data, where all the words (ingredients) involved in the corpus are of potential interest, and there aren’t even any punctuation marks!  The topics coming out of the analysis would represent clusters of ingredients that co-occur with one another across recipes, and would possibly teach me something about cooking (of which I know precious little!).

All my code is at the bottom, so all you’ll find up here are graphs and my textual summary.  The first thing I did was to put the 3 raw recipe files together using python.  Each file consisted of one recipe per line, with the cuisine of the recipe as the first entry on the line, and all other entries (the ingredients) separated by tab characters.  In my python script, I separated out the cuisines from the ingredients, and created two files, one for the recipes, and one for the cuisines of the recipes.
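The preprocessing step I describe can be sketched roughly like this; the field layout (cuisine first, ingredients tab-separated) comes from the description above, while the sample lines are invented:

```python
# Each input line: cuisine, then tab-separated ingredients, e.g.
# "American\tegg\twheat\tbutter"
def split_recipes(lines):
    """Separate each recipe line into a cuisine label and an ingredient list."""
    cuisines, recipes = [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields or not fields[0]:
            continue  # skip blank lines
        cuisines.append(fields[0])     # first entry is the cuisine
        recipes.append(fields[1:])     # the rest are ingredients
    return cuisines, recipes

sample = ["American\tegg\twheat\tbutter", "Italian\ttomato\tgarlic\tbasil"]
cuisines, recipes = split_recipes(sample)
print(cuisines)    # -> ['American', 'Italian']
print(recipes[1])  # -> ['tomato', 'garlic', 'basil']
```

In the real script the two lists are then written out to separate files, one for the recipes and one for the cuisines.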

Then I loaded up the recipes into R and got word/ingredient counts.  As you can see below, the 3 most popular ingredients were egg, wheat, and butter.  It makes sense, considering the fact that roughly 70% of all the recipes fall under the “American” cuisine.  I did this analysis for novelty’s sake, and so I figured I would take those ingredients out of the running before I continued on.  Egg makes me fart, wheat is not something I have at home in its raw form, and butter isn’t important to me for the purpose of this analysis!

Recipe Popularity of Top 30 Ingredients

Here are the top ingredients without the three filtered out ones:

Recipe Popularity of Top 30 Ingredients - No Egg Wheat or Butter

Finally, I ran the LDA, extracting 50 topics, and the top 5 most characteristic ingredients of each topic.  You can see the full complement of topics at the bottom of my post, but I thought I’d review some that I find intriguing.  You will, of course, find other topics intriguing, or some to be bizarre and inappropriate (feel free to tell me in the comment section).  First, topic 4:

[1] "tomato"  "garlic"  "oregano" "onion"   "basil"

Here’s a cluster of ingredients that seems decidedly Italian.  The ingredients seem to make perfect sense together, and so I think I’ll try them together next time I’m making pasta (although I don’t like tomatoes in their original form, just tomato sauce).
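Pulling out listings like the one above from a fitted model boils down to a per-topic sort of the topic-word weights; the weights below are invented for illustration, since a real LDA fit would supply them:

```python
# Hypothetical topic-word weights (a fitted LDA model would supply these)
topic_word = {
    "topic_4": {"tomato": 0.31, "garlic": 0.24, "oregano": 0.12,
                "onion": 0.10, "basil": 0.08, "cumin": 0.01},
}

def top_terms(weights, k=5):
    """Return the k highest-weighted terms for one topic."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:k]]

print(top_terms(topic_word["topic_4"]))
# -> ['tomato', 'garlic', 'oregano', 'onion', 'basil']
```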

Next, topic 19:

[1] "vanilla" "cream"   "almond"  "coconut" "oat"

This one caught my attention, and I’m curious whether the ingredients even make sense together.  Vanilla and cream make sense… Adding coconut would seem to make sense as well.  Almond would give it that extra crunch (unless it’s almond milk!).  I don’t know whether it would be tasty, however, so I’ll probably pass this one by.

Next, topic 20:

[1] "onion"         "black_pepper"  "vegetable_oil" "bell_pepper"   "garlic"

This one looks tasty!  I like spicy foods and so putting black pepper in with onion, garlic and bell pepper sounds fun to me!

Next, topic 23:

[1] "vegetable_oil" "soy_sauce"     "sesame_oil"    "fish"          "chicken"

Now we’re into the meaty zone!  I’m all for putting sauces/oils onto meats, but putting vegetable oil, soy sauce, and sesame oil together does seem like overkill.  I wonder whether soy sauce shows up with vegetable oil or sesame oil separately in recipes, rather than linking them all together in the same recipes.  I’ve always liked the extra salty flavour of soy sauce, even though I know it’s horrible for you as it has MSG in it.  I wonder what vegetable oil, soy sauce, and chicken would taste like.  Something to try, for sure!

Now, topic 26:

[1] "cumin"      "coriander"  "turmeric"   "fenugreek"  "lemongrass"

These are a whole lot of spices that I never use on my food.  Not for lack of wanting, but rather out of ignorance and laziness.  One of my co-workers recently commented that cumin adds a really nice flavour to food (I think she called it “middle eastern”).  I’ve never heard a thing about the other spices here, but why not try them out!

Next, topic 28:

[1] "onion"       "vinegar"     "garlic"      "lemon_juice" "ginger"

I tend to find that anything with an intense flavour can be very appetizing for me.  Spices, vinegar, and anything citric are what really register on my tongue.  So, this topic does look very interesting to me, probably as a topping or a sauce.  It’s interesting that ginger shows up here, as that neutralizes other flavours, so I wonder whether I’d include it in any sauce that I make?

Last one!  Topic 41:

[1] "vanilla"  "cocoa"    "milk"     "cinnamon" "walnut"

These look like the kinds of ingredients for a nice drink of some sort (would you crush the walnuts?  I’m not sure!)

Well, I hope you enjoyed this as much as I did!  It’s not a perfect analysis, but it definitely is a delicious one :)  Again, feel free to leave any comments about any of the ingredient combinations, or questions that you think could be answered with a different analysis!

UofT R session went well. Thanks RStudio Server!

Apart from going longer than I had anticipated, very little of any significance went wrong during my R session at UofT on Friday!  It took a while at the beginning for everyone to get set up.  Everyone was connecting to my home RStudio Server via UofT’s wireless network.  This meant that any students who weren’t set up to use the wireless network in the first place (they get a username and password, a UTORid, from the library) couldn’t connect at all.  For those students who were able to connect, I assigned each of them one of the 30 usernames that I had laboriously set up on my machine the night before.

After connecting to my server, I got them to click on the ‘data’ directory that I had set up in each of their home folders on my computer, to load the data that I had prepared for them (see last post).  I forgot that they needed to set the data directory as their working directory… whoops, that wasted some time!  After I realized that mistake, things went more smoothly.

We went over data import, data indexing (although I forgot about conditional indexing, which I use very often at work… d’oh!), merging, mathematical operations, some simple graphing (a histogram, scatterplot, and scatterplot matrix), summary stats, median splits, grouped summary stats using the awesome dplyr, and then nicer graphing using the qplot function from ggplot2.

I was really worried about being boring, but I found myself getting more and more energized as the session went on, and I think the students were interested as well!  I’m so glad that the RStudio Server I set up on my computer was able to handle all of those connections at once and that my TekSavvy internet connection didn’t crap out either :)  This is definitely an experience that I would like to have again.  Hurray!

Here’s a script of the analysis I went through:

Here’s the data:

http://bit.ly/MClPmK

Teaching a Class of Undergrads, RStudio Server, and My Ubuntu Machine

I was chatting about public speaking with my brother, who is a Lecturer in the Faculty of Pharmacy at UofT, when he offered me the opportunity to come to his class and teach about R.  Always eager to spread the analytical goodness, I said yes!  The class is this Friday, and I am excited.

For this class I’ll be making use of RStudio Server, rather than having to get R onto some 30 individual machines.  Furthermore, I’ll be using an installation of RStudio Server on my own home machine.  It gives me more control and the convenience of configuring things late at night when I have the time to.

While playing around with the server on my computer (connecting via my own browser) I noticed that for each user you create, a new package library gets built.  That’s too bad as it relates to this class, because it would be neat for everyone to be able to make use of additional packages like ggplot2, dplyr and such, but this is an extremely beginner class anyway.

I’ve signed up for a dynamic dns host name from no-ip.com, and have set the port forwarding on my router accordingly, so that seems to be working just fine.  I just hope that nothing goes wrong.  I need to remember to create enough accounts on my ubuntu machine to accommodate all the students, which will be a small pain in the you-know-what, but oh well.

As for the data side of things, I’ve compiled some mildly interesting data on drug-related deaths by council area in Scotland, along with geographical coordinates and levels of crime, employment, education, income and health.  I only have an hour, so we’ll see how much I can cover!  Wish me luck.  If you have any advice, I’d be happy to hear it.  I’ve already been told to start with graphics :)

Nuclear vs Green Energy: Share the Wealth or Get Your Own?

Thanks to Ontario Open Data, a survey dataset was recently made public containing people’s responses to questions about Ontario’s Long Term Energy Plan (LTEP).  The survey did fairly well in terms of raw response numbers, with 7,889 responses in total (although who knows how many people it was sent to!).  As you’ll see in later images in this post, two major goals of Ontario’s LTEP are to eliminate power generation from coal and to maximize savings by encouraging lots of conservation.

For now though, I’ll introduce my focus of interest: Which energy sources did survey respondents think should be relied on in the future, and how does that correlate with their views on energy management/sharing?

As you can see in the graph below, survey respondents were given a 7 point scale and asked to use it to rate the importance of different energy source options (the scale has been flipped so that 7 is the most important and 1 is the least).  Perhaps it’s my ignorance of this whole discussion, but it surprised me that 76% of respondents rated Nuclear power as at least a 5/7 on the scale of importance!  Nuclear power?  But what about Chernobyl and Fukushima?  To be fair, although terribly dramatic and devastating, those were isolated incidents.  Also, measures have been taken to ensure our current nuclear reactors are and will be disaster safe.  Realistically, I think most people don’t think about those things!  A few other things to notice here: conservation does have its adherents, with 37% giving a positive response.  Also, I found it surprising (and perhaps saddening) to see that green energy has so few adherents, proportionately speaking.
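As an aside, the flipping mentioned in the parenthetical is just a reversal of the 7 point scale (assuming the raw coding ran with 1 as most important):

```python
def flip_7pt(score):
    """Reverse a 1-7 rating so that 7 becomes the most important (new = 8 - old)."""
    assert 1 <= score <= 7
    return 8 - score

print([flip_7pt(s) for s in range(1, 8)])  # -> [7, 6, 5, 4, 3, 2, 1]
```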

Survey: Importance of Energy Sources

After staring at this graph for a while, I had the idea to see what interesting differences I could find between people who supported Nuclear energy versus those who support Green energy.  What I found is quite striking in its consistency:

  1. Those who believe Nuclear energy is important for Ontario’s future mix of energy sources seem to be more confident that there’s enough energy to share between regions and that independence in power generation is not entirely necessary.
  2. On the flip side, those who believe Green energy is important for Ontario’s future mix of energy sources seem to be more confident that there isn’t enough energy to share between regions and that independence in power generation should be nurtured.

See for yourself in the following graphs:

Survey: Regions Should Make Conservation their First Priority

Survey: Self Sustaining Regions

Survey: Region Responsible for Growing Demand

Survey: Regions buy Power

Does this make sense in light of actual facts?  The graph below comes from a very digestible page set up by the Ontario Ministry of Energy to communicate its Long Term Energy Plan.  As they make pretty obvious, Nuclear energy accounts for over half of energy production in Ontario in 2013, whereas the newer green energy sources (Solar, Bioenergy and Wind, as opposed to Hydro) amount to about 5%.  In the forecast for 2032, the hope is that these newer sources will account for 13% of energy production in Ontario.  Still not the lion’s share of energy, but if you add that to the 22% accounted for by Hydro, then you get 35% of all energy production, which admittedly isn’t bad!  Still, I wonder what people were thinking of when they saw “Green energy” on the survey.  If the new sources, then I think what is going on here is that perhaps people who advocate for Green energy sources such as wind and solar have an idea how difficult it is to power a land mass such as Ontario with these kinds of power stations.  People advocating for Nuclear, on the other hand, are either blissfully ignorant, or simply understand that Nuclear power plants are able to serve a wider area.
MOE: Screenshot from 2013-12-08 13:28:04

MOE: Screenshot from 2013-12-08 13:41:06

All of this being said, as you can see in the image above, the Ontario Provincial Government actually wants to *reduce* our province’s reliance on Nuclear energy in the next 5 years, and in fact they will not be building new reactors.  I contacted Mark Smith, Senior Media Relations Coordinator of the Ontario Ministry of Energy to ask him to comment about the role of Nuclear energy in the long run.  Following are some tidbits that he shared with me over email:

Over the past few months, we have had extensive consultations as part of our review of Ontario’s Long Term Energy Plan (LTEP). There is a strong consensus that now is not the right time to build new nuclear.

Ontario is currently in a comfortable supply situation and it does not require the additional power.

We will continue to monitor the demand and supply situation and look at building new nuclear in the future, if the need arises.

Nuclear power has been operating safely in our province for over 40 years, and is held to the strictest regulations and safety requirements to ensure that the continued operation of existing facilities, and any potential new build are held to the highest standards.

We will continue with our nuclear refurbishment plans for which there was strong province-wide support during the LTEP consultations.

During refurbishment, both OPG and Bruce Power will be subject to the strictest possible oversight to ensure safety, reliable supply and value for ratepayers.

Nuclear refurbishments will create thousands of jobs and extend the lives of our existing fleet for another 25-30 years, sustaining thousands of highly-skilled and high-paying jobs.

The nuclear sector will continue be a vital, innovative part of Ontario, creating new technology which is exported around the world.

Well, even Mr. Mark Smith seems confident about Nuclear energy!  I tried to contact the David Suzuki Foundation to see if they’d have anything to say on the role of Green Energy in Ontario’s future, but they were unavailable for comment.

Well, there you have it!  Despite confidence in Nuclear energy as a viable source for the future, the province will be increasing its investments in both Green energy and conservation!  Here’s hoping for an electric following decade :)

(P.S. As usual, the R code follows)

Enron Email Corpus Topic Model Analysis Part 2 – This Time with Better regex

After posting my analysis of the Enron email corpus, I realized that the regex patterns I had set up to capture and filter out the cautionary/privacy messages at the bottoms of people’s emails were not working.  Let’s have a look at my revised python code for processing the corpus:
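To illustrate the kind of pattern involved, here is a rough sketch of stripping a trailing boilerplate footer with a regex; both the footer wording and the pattern are invented for illustration, not the exact ones from my script:

```python
import re

# A made-up confidentiality footer of the sort that appears at the bottom of
# many corporate emails; real footers vary, so the pattern is illustrative.
FOOTER_RE = re.compile(
    r"^\s*[-*]{2,}\s*$"               # a separator line like "-----" or "*****"
    r".*?(confidential|privileged)"   # ...followed by boilerplate language
    r".*\Z",                          # ...through to the end of the message
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)

def strip_footer(body):
    """Remove a trailing cautionary/privacy block from an email body."""
    return FOOTER_RE.sub("", body).rstrip()

email = (
    "Let's meet Thursday to review the draft.\n"
    "-----\n"
    "This e-mail is confidential and intended solely for the addressee.\n"
)
print(strip_footer(email))  # -> "Let's meet Thursday to review the draft."
```

Anchoring on the separator line helps avoid deleting legitimate body text that merely contains the word “confidential”.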

As I did not change the R code since the last post, let’s have a look at the results:

terms(lda.model,20)
      Topic 1   Topic 2   Topic 3     Topic 4   
 [1,] "enron"   "time"    "pleas"     "deal"    
 [2,] "busi"    "thank"   "thank"     "gas"     
 [3,] "manag"   "day"     "attach"    "price"   
 [4,] "meet"    "dont"    "email"     "contract"
 [5,] "market"  "call"    "enron"     "power"   
 [6,] "compani" "week"    "agreement" "market"  
 [7,] "vinc"    "look"    "fax"       "chang"   
 [8,] "report"  "talk"    "call"      "rate"    
 [9,] "time"    "hope"    "copi"      "trade"   
[10,] "energi"  "ill"     "file"      "day"     
[11,] "inform"  "tri"     "messag"    "month"   
[12,] "pleas"   "bit"     "inform"    "compani" 
[13,] "trade"   "guy"     "phone"     "energi"  
[14,] "risk"    "night"   "send"      "transact"
[15,] "discuss" "friday"  "corp"      "product" 
[16,] "regard"  "weekend" "kay"       "term"    
[17,] "team"    "love"    "review"    "custom"  
[18,] "plan"    "item"    "receiv"    "cost"    
[19,] "servic"  "email"   "question"  "thank"   
[20,] "offic"   "peopl"   "draft"     "purchas"

One at a time, I will try to interpret what each topic is trying to describe:

  1. This one appears to be a business process topic, containing a lot of general business terms, with a few even relating to meetings.
  2. Similar to the last model that I derived, this topic has a lot of time related words in it such as: time, day, week, night, friday, weekend.  I’ll be interested to see if this is another business meeting/interview/social meeting topic, or whether it describes something more social.
  3. Hrm, this topic seems to contain a lot of general terms used when we talk about communication: email, agreement, fax, call, message, inform, phone, send, review, question.  It even has please and thank you!  I suppose it’s very formal and you could perhaps interpret this as professional sounding administrative emails.
  4. This topic seems to be another case of emails containing a lot of ‘shop talk’.

Okay, let’s see if we can find some examples for each topic:

sample(which(df.emails.topics$"1" > .95),3)
[1] 27771 45197 27597

enron[[27771]]

 Christi's call.
 
  
     
 
 	Christi has asked me to schedule the above meeting/conference call.  September 11th (p.m.) seems to be the best date.  Question:  Does this meeting need to be a 1/2 day meeting?  Christi and I were wondering.
 
 	Give us your thoughts.

Yup, business process, meeting. This email fits the bill! Next!

enron[[45197]]

 
 Bob, 
 
 I didn't check voice mail until this morning (I don't have a blinking light.  
 The assistants pick up our lines and amtel us when voice mails have been 
 left.)  Anyway, with the uncertainty of the future business under the Texas 
 Desk, the following are my goals for the next six months:
 
 1)  Ensure a smooth transition of HPL to AEP, with minimal upsets to Texas 
 business.
 2)  Develop operations processes and controls for the new Texas Desk.   
 3)  Develop a replacement
  a.  Strong push to improve Liz (if she remains with Enron and )
  b.  Hire new person, internally or externally
 4)  Assist in develop a strong logisitcs team.  With the new business, we 
 will need strong performers who know and accept their responsibilites.
 
 1 and 2 are open-ended.  How I accomplish these goals and what they entail 
 will depend how the Texas Desk (if we have one) is set up and what type of 
 activity the desk will be invovled in, which is unknown to me at this time.  
 I'm sure as we get further into the finalization of the sale, additional and 
 possibly more urgent goals will develop.  So, in short, who knows what I need 
 to do.
 
 D

This one also seems to fit the bill. “D” here is writing about his/her goals for the next six months and considers briefly how to accomplish them. Not heavy into the content of the business, so I’m happy here. On to topic 2:

sample(which(df.emails.topics$"2" > .95),3)
[1] 50356 22651 19259

enron[[50356]]

I agree it is Matt, and  I believe he has reviewed this tax stuff (or at 
 least other turbine K's) before.  His concern will be us getting some amount 
 of advance notice before title transfer (ie, delivery).  Obviously, he might 
 have some other comments as well.  I'm happy to send him the latest, or maybe 
 he can access the site?
 
 Kay
 
 
    
  
 Given that the present form of GE world hunger seems to be more domestic than 
 international it would appear that Matt Gockerman would be a good choice for 
 the Enron- GE tax discussion.  Do you want to contact him or do you want me 
 to.   I would be interested in listening in on the conversation for 
 continuity. 

Here, the conversants seem to be talking about having a phone conversation with “Matt” to get his ideas on a tax discussion. This fits in with the meeting theme. Next!

enron[[22651]]

 LOVE
 HONEY PIE

Well, that was pretty social, wasn’t it? :) Okay one more from the same topic:

enron[[19259]]

  Mime-Version: 1.0
  Content-Transfer-Encoding: 7bit
 X- X- X- X-b X-Folder: \ExMerge - Giron, Darron C.\Sent Items
 X-Origin: GIRON-D
 X-FileName: darron giron 6-26-02.PST
 
 Sorry.  I've got a UBS meeting all day.  Catch you later.  I was looking forward to the conversation.
 
 DG
 
  
     
 It seems everyone agreed to Ninfa's.  Let's meet at 11:45; let me know if a
 different time is better.  Ninfa's is located in the tunnel under the JP
 Morgan Chase Tower at 600 Travis.  See you there.
 
 Schroeder

Woops, header info that I didn’t manage to filter out :(. Anyway, DG writes about an impending conversation, and Schroeder writes about a specific time for their meeting. This fits! Next topic!

sample(which(df.emails.topics$"3" > .95),3)
[1] 24147 51673 29717

enron[[24147]]

Kaye:  Can you please email the prior report to me?  Thanks.
 
 Sara Shackleton
 Enron North America Corp.
 1400 Smith Street, EB 3801a
 Houston, Texas  77002
 713-853-5620 (phone)
 713-646-3490 (fax)


 	04/10/2001 05:56 PM
 			  		  
 
 At Alan's request, please provide to me by e-mail (with a  Thursday of this week your suggested changes to the March 2001 Monthly 
 Report, so that we can issue the April 2001 Monthly Report by the end of this 
 week.  Thanks for your attention to this matter.
 
 Nita

This one definitely fits the professional-sounding administrative email interpretation. Emailing reports and such. Next!

 I believe this was intended for Susan Scott with ETS...I'm with Nat Gas trading.
 
 Thanks
 
 
 
     
 FYI...another executed capacity transaction on EOL for Transwestern.
 
  
     
 This message is to confirm your EOL transaction with Transwestern Pipeline Company.
 You have successfully acquired the package(s) listed below.  If you have questions or
 concerns regarding the transaction(s), please call Michelle Lokay at (713) 345-7932
 prior to placing your nominations for these volumes.
 
 Product No.:	39096
 Time Stamp:	3/27/01	09:03:47 am
 Product Name:	US PLCapTW Frm CenPool-OasisBlock16
  
 Shipper Name:  E Prime, Inc.
 
 Volume:	10,000 Dth/d  
 					
 Rate:	$0.0500 /dth 1-part rate (combined  Res + Com) 100% Load Factor
 		+ applicable fuel and unaccounted for
 	
 TW K#: 27548		
 
 Effective  
 Points:	RP- (POI# 58649)  Central Pool      10,000 Dth/d
 		DP- (POI# 8516)   Oasis Block 16  10,000 Dth/d
 
 Alternate Point(s):  NONE
 
 
 Note:     	In order to place a nomination with this agreement, you must log 
 	            	off the TW system and then log back on.  This action will update
 	            	the agreement's information on your PC and allow you to place
 		nominations under the agreement number shown above.
 
 Contact Info:		Michelle Lokay
 	 			Phone (713) 345-7932
               			Fax       (713) 646-8000

Rather long, but even the short part at the beginning falls under the right category for this topic! Okay, let’s look at the final topic:

sample(which(df.emails.topics$"4" > .95),3)
[1] 39100  31681  6427

enron[[39100]]

 Randy, your proposal is fine by me.  Jim

Hrm, this is supposed to be a ‘business content’ topic, so I suppose I can see why this email was classified as such. It doesn’t take long to go from ‘proposal’ to ‘contract’ if you free-associate, right? Next!

enron[[31681]]

 Attached is the latest version of the Wildhorse Entrada Letter.  Please 
 review.  I reviewed the letter with Jim Osborne and Ken Krisa yesterday and 
 should get their comments today.  My plan is to Fedex to Midland for Ken's 
 signature tomorrow morning and from there it will got to Wildhorse.  

This one makes me feel a little better, referencing a specific business letter that the sender probably wants the recipient to review. Let’s find one more for good luck:

enron[[6427]]

 At a ratio of 10:1, you should have your 4th one signed and have the fifth 
 one on the way...
 
  	09/19/2000 05:40 PM
   		  		  
 ONLY 450!  Why, I thought you guys hit 450 a long time ago.
 
 Marie Heard
 Senior Legal Specialist
 Enron Broadband Services
 Phone:  (713) 853-3907
 Fax:  (713) 646-8537

 	09/19/00 05:34 PM
		  		  		  
 Well, I do believe this makes 450!  A nice round number if I do say so myself!

 	Susan Bailey
 	09/19/2000 05:30 PM

 We have received an executed Master Agreement:
 
 
 Type of Contract:  ISDA Master Agreement (Multicurrency-Cross Border)
 
 Effective  
 Enron Entity:   Enron North America Corp.
 
 Counterparty:   Arizona Public Service Company
 
 Transactions Covered:  Approved for all products with the exception of: 
 Weather
           Foreign Exchange
           Pulp & Paper
 
 Special Note:  The Counterparty has three (3) Local Business Days after the 
 receipt of a Confirmation from ENA to accept or dispute the Confirmation.  
 Also, ENA is the Calculation Agent unless it should become a Defaulting 
 Party, in which case the Counterparty shall be the Calculation Agent.
 
 Susan S. Bailey
 Enron North America Corp.
 1400 Smith Street, Suite 3806A
 Houston, Texas 77002
 Phone: (713) 853-4737
 Fax: (713) 646-3490

That one was very long, but there’s definitely some good business content in it (along with some happy banter about the contract that had apparently just been executed).

All in all, I’d say that fixing those regex patterns that were supposed to filter out the caution/privacy messages at the ends of people’s emails was a big boon to the LDA analysis here.

Let that be a lesson: half the battle in LDA is in filtering out the noise!
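For the curious, that kind of disclaimer-stripping can be sketched in python like this. The patterns below are made-up examples, not the actual regexes from my script:

```python
import re

# Made-up examples of boilerplate patterns -- the actual regexes from my
# script aren't shown here.  Each one matches from the start of a legal
# footer through to the end of the message body.
BOILERPLATE_PATTERNS = [
    re.compile(r"this message is for the named person'?s use only.*",
               re.IGNORECASE | re.DOTALL),
    re.compile(r"the information contained in this e-?mail is confidential.*",
               re.IGNORECASE | re.DOTALL),
    re.compile(r"\*{5,}.*", re.DOTALL),  # rows of asterisks fencing off footers
]

def strip_boilerplate(body):
    """Truncate the body at the first boilerplate match, if any."""
    for pattern in BOILERPLATE_PATTERNS:
        match = pattern.search(body)
        if match:
            body = body[:match.start()]
    return body.rstrip()

email = ("Sounds good, see you at noon.\n"
         "**********\n"
         "This message is for the named person's use only.")
print(strip_boilerplate(email))  # -> Sounds good, see you at noon.
```

The important design choice is truncating at the first match rather than deleting just the matched text, since anything after a legal footer is almost never real message content.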

A Rather Nosy Topic Model Analysis of the Enron Email Corpus

Having only ever played with Latent Dirichlet Allocation using gensim in python, I was very interested to see a nice example of this kind of topic modelling in R.  Whenever I see a really cool analysis done, I get the urge to do it myself.  What better corpus to do topic modelling on than the Enron email dataset?!?!?  Let me tell you, this thing is a monster!  According to the website I got it from, it contains about 500k messages from 151 users, mostly senior management, organized into user folders.  I didn’t want to throw everything into my analysis, so I decided to look only at messages contained within the “sent” or “sent items” folders.

Being a big advocate of R, I tried really hard to do all of the processing and analysis in R, but it was just too difficult and was taking up more time than I wanted.  So I dusted off my python skills (thank you, grad school!) and did the bulk of the data processing/preparation in python and the text mining in R.  Following is the code (hopefully well enough commented) that I used to process the corpus in python:

Having seen python’s performance rifling through these Enron emails, I was very impressed!  It tore through the corpus quickly, creating a directory with the largest number of files I’d ever seen on my computer!

Okay, so now I had a directory filled with a whole lot of text files.  The next step was to bring them into R so that I could feed them to the LDA.  Following is the R code that I used:

Phew, that took a lot of computing power! Now that it’s done, let’s look at the results of the command on line 48 from the above gist:

      Topic 1   Topic 2     Topic 3      Topic 4     
 [1,] "time"    "thank"     "market"     "email"     
 [2,] "vinc"    "pleas"     "enron"      "pleas"     
 [3,] "week"    "deal"      "power"      "messag"    
 [4,] "thank"   "enron"     "compani"    "inform"    
 [5,] "look"    "attach"    "energi"     "receiv"    
 [6,] "day"     "chang"     "price"      "intend"    
 [7,] "dont"    "call"      "gas"        "copi"      
 [8,] "call"    "agreement" "busi"       "attach"    
 [9,] "meet"    "question"  "manag"      "recipi"    
[10,] "hope"    "fax"       "servic"     "enron"     
[11,] "talk"    "america"   "rate"       "confidenti"
[12,] "ill"     "meet"      "trade"      "file"      
[13,] "tri"     "mark"      "provid"     "agreement" 
[14,] "night"   "kay"       "issu"       "thank"     
[15,] "friday"  "corp"      "custom"     "contain"   
[16,] "peopl"   "trade"     "california" "address"   
[17,] "bit"     "ena"       "oper"       "contact"   
[18,] "guy"     "north"     "cost"       "review"    
[19,] "love"    "discuss"   "electr"     "parti"     
[20,] "houston" "regard"    "report"     "contract"
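As an aside on what this table actually is: each topic is just a probability distribution over word stems, and the table reads off each topic’s highest-weighted stems. A toy python version of that read-off, with made-up weights standing in for the fitted model’s (the real ones come from the model, via terms() in R’s topicmodels or show_topics() in gensim):

```python
# Made-up topic-word weights; a real model assigns a weight to every
# stem in the vocabulary for every topic.
topic_word = {
    "Topic 1": {"time": 0.05, "meet": 0.04, "week": 0.03, "deal": 0.001},
    "Topic 3": {"market": 0.06, "enron": 0.05, "power": 0.04, "time": 0.002},
}

def top_terms(topic_word, n=3):
    """The n highest-weighted stems per topic, like terms(lda, n)."""
    return {
        topic: [w for w, _ in sorted(weights.items(), key=lambda kv: -kv[1])[:n]]
        for topic, weights in topic_word.items()
    }

print(top_terms(topic_word))  # Topic 1 -> ['time', 'meet', 'week']
```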

Here’s where some really subjective interpretation is required, just as in PCA.  Let’s try to interpret the topics, one at a time:

  1. I see a lot of words related to time in this topic, and then I see the word ‘meet’.  I’ll call this the meeting (business or otherwise) topic!
  2. I’m not sure how to interpret this second topic, so perhaps I’ll chalk it up to noise in my analysis!
  3. This topic contains a lot of ‘business content’ words, so it appears to be a kind of ‘talking shop’ topic.
  4. This topic, while still pretty ‘businessy’, appears to be less about the content of the business and more about the processes, or perhaps legalities of the business.

For each of the sensible topics (1,3,4), let’s bring up some emails that scored highly on these topics to see if the analysis makes sense:

sample(which(df.emails.topics$"1" > .95), 10)
 [1] 53749 32102 16478 36204 29296 29243 47654 38733 28515 53254
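A quick note on what that sample(which(...)) line is doing: df.emails.topics holds each email’s estimated topic weights (presumably the posterior document-topic matrix from the fitted model), and the line grabs random emails that load above 0.95 on topic 1. A toy python analogue, with a made-up document-topic matrix:

```python
import random

# Made-up document-topic matrix: one row per email, one column per topic.
doc_topics = [
    [0.97, 0.01, 0.01, 0.01],
    [0.10, 0.20, 0.65, 0.05],
    [0.96, 0.02, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.97],
]

def sample_high_scoring(doc_topics, topic, threshold=0.95, k=2, seed=0):
    """Randomly pick up to k emails whose weight on `topic` exceeds
    `threshold` -- the analogue of sample(which(df.emails.topics$"1" > .95), k)."""
    candidates = [i for i, row in enumerate(doc_topics) if row[topic] > threshold]
    return random.Random(seed).sample(candidates, min(k, len(candidates)))

print(sorted(sample_high_scoring(doc_topics, topic=0)))  # -> [0, 2]
```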
enron[[32102]]

 I will be out of the office next week on Spring Break. Can you participate on 
 this call? Please report what is said to Christian Yoder 503-464-7845 or 
 Steve Hall 503-4647795

 	03/09/2001 05:48 PM

 I don't know, but I will check with our client.

 Our client Avista Energy has received the communication, below, from the ISO
 regarding withholding of payments to creditors of monies the ISO has
 received from PG&E.  We are interested in whether any of your clients have
 received this communication, are interested in this issue and, if so,
 whether you have any thoughts about how to proceed.

 You are invited to participate in a conference call to discuss this issue on
 Monday, March 12, at 10:00 a.m.

 Call-in number: (888) 320-6636
 Host: Pritchard
 Confirmation number: 1827-1922

 Diane Pritchard
 Morrison & Foerster LLP
 425 Market Street
 San Francisco, California 94105
 (415) 268-7188

So this one isn’t a business meeting in the physical sense, but a conference call, which still falls under the general category of meetings.

enron[[29243]]
 Hey Fritz.  I am going to send you an email that attaches a referral form to your job postings.  In addition, I will also personally tell the hiring manager that I have done this and I can also give him an extra copy of youe resume.  Hopefully we can get something going here....

 Tori,

 I received your name from Diane Hoyuela. You and I spoke
 back in 1999 about the gas industry. I tried briefly back
 in 1999 and found few opportunities during de-regulations
 first few steps. Well,...I'm trying again. I've been
 applying for a few job openings at Enron and was wondering
 if you could give me an internal referral. Also, any advice
 on landing a position at Enron or in general as a scheduler
 or analyst.
 Last week I applied for these positions at Enron; gas
 scheduler 110360, gas analyst 110247, and book admin.
 110129. I have a pretty good understanding of the gas
 market.

 I've attached my resume for you. Congrats. on the baby!
 I'll give you a call this afternoon to follow-up, I know
 mornings are your time.
 Regards,

 Fritz Hiser

 __________________________________________________
 Do You Yahoo!?
 Get email alerts & NEW webcam video instant messaging with Yahoo! Messenger. http://im.yahoo.com

That one obviously shows someone who was trying to get a job at Enron and wanted to call “this afternoon to follow-up”. Again, a ‘call’ rather than a physical meeting.

Finally,

enron[[29296]]

 Susan,

 Well you have either had a week from hell so far or its just taking you time
 to come up with some good bs.  Without being too forward I will be in town
 next Friday and wanted to know if you would like to go to dinner or
 something.  At least that will give us a chance to talk face to face.  If
 your busy don't worry about it I thought I would just throw it out there.

 I'll keep this one short and sweet since the last one was rather lengthy.
 Hope this Thursday is a little better then last week.

 Kyle

 _________________________________________________________________________
 Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com.

 Share information about yourself, create your own public profile at

http://profiles.msn.com.

Ahh, here’s a particularly juicy one. Kyle here wants to go to dinner, “or something” (heh heh heh), with Susan to get a chance to talk face to face with her. Finally, a physical meeting (maybe a very physical one…) lumped into a category with other business meetings held in person or over the phone.

Okay, now let’s switch to topic 3, the “business content” topic.

sample(which(df.emails.topics$"3" > .95), 10)
 [1] 40671 26644  5398 52918 37708  5548 15167 56149 47215 26683

enron[[40671]]

 Please change the counterparty on deal 806589 from TP2 to TP3 (sorry about that).

Okay, that seems fairly in the realm of business content, but I don’t know what the heck it means. Let’s try another one:

enron[[5548]]

Phillip, Scott, Hunter, Tom and John -

 Just to reiterate the new trading guidelines on PG&E Energy Trading:

 1.  Both financial and physical trading are approved, with a maximum tenor of 18 months

 2.  Approved entities are:	PG&E Energy Trading - Gas Corporation
 				PG&E Energy Trading - Canada Corporation

 				NO OTHER PG&E ENTITIES ARE APPROVED FOR TRADING

 3.  Both EOL and OTC transactions are OK

 4.  Please call Credit (ext. 31803) with details on every OTC transaction.  We need to track all new positions with PG&E Energy Trading on an ongoing basis.  Please ask the traders and originators on your desks to notify us with the details on any new transactions immediately upon execution.  For large transactions (greater than 2 contracts/day or 5 BCF total), please call for approval before transacting.

 Thanks for your assistance; please call me (ext. 53923) or Russell Diamond (ext. 57095) if you have any questions.

 Jay

That one is definitely oozing with business content. Note the terms such as “Energy Trading”, and “Gas Corporation”, etc. Finally, one more:

enron[[26683]]

Hi Kathleen, Randy, Chris, and Trish,

 Attached is the text of the August issue of The Islander.  The headings will
 be lined up when Trish adds the art and ads.  A calendar, also, which is in
 the next e-mail.

 I'll appreciate your comments by the end of tomorrow, Monday.

 There are open issues which I sure hope get resolved before printing:

 1.  I'm waiting for a reply from Mike Bass regarding tenses on the Home Depot
 article.  Don't know if there's one developer or more and what the name(s)
 is/are.

 2.  Didn't hear back from Ted Weir regarding minutes for July's water board
 meeting.  I think there are 2 meetings minutes missed, 6/22 and July.

 3.  Waiting to hear back from Cheryl Hanks about the 7/6 City Council and 6/7
 BOA meetings minutes.

 4.  Don't know the name of the folks who were honored with Yard of the Month.
  They're at 509 Narcissus.

 I'm not feeling very good about the missing parts but need to move on
 schedule!  I'm also looking for a good dictionary to check the spellings of
 ettouffe, tree-house and orneryness.  (Makes me feel kind of ornery, come to
 think about it!)

 Please let me know if you have revisions.  Hope your week is starting out
 well.

 'Nita

Alright, this one seems to be a mix of business content and process. So I can see how it was lumped into this topic, but it doesn’t fit quite as cleanly as I’d like.

Finally, let’s move on to topic 4, which appeared to be a ‘business process’ topic to me. I’m suspicious of this topic, as I don’t think I successfully filtered out everything that I wanted to:

sample(which(df.emails.topics$"4" > .95), 10)
 [1] 51205  5129 48826 51214 55337 15843 52543 11978 48337  2609

enron[[5129]]

very funny today...during the free fall, couldn't price jv and xh low enough 
 on eol, just kept getting cracked.  when we stabilized, customers came in to 
 buy and couldnt price it high enough.  winter versus apr went from +23 cents 
 when we were at the bottom to +27 when april rallied at the end even though 
 it should have tightened theoretically.  however, april is being supported 
 just off the strip.  getting word a lot of utilities are going in front of 
 the puc trying to get approval for hedging programs this year.  

 hey johnny. hope all is well. what u think hrere? utuilites buying this break
 down? charts look awful but 4.86 ish is next big level.
 jut back from skiing in co, fun but took 17 hrs to get home and a 1.5 days to
 get there cuz of twa and weather.

Hrm, this one appears to be some ‘shop talk’, and is more about trading specifics than process. I’m not sure how it relates to the topic 4 words. Let’s try another one:

enron[[55337]]

Fran, do you have an updated org chart that I could send to the Measurement group?
 	Thanks. Lynn

    Cc:	Estalee Russi

 Lynn,

 Attached are the org charts for ETS Gas Logistics:

 Have a great weekend.  Thanks!

 Miranda

Here we go. This one seems to fall much more into the ‘business process’ realm. Let’s see if I can find another good example:

enron[[11978]]

 Bill,

 As per our conversation today, I am sending you an outline of what we intend to be doing in Ercot and in particular on the real-time desk. For 2002 Ercot is split into 4 zones with TCRs between 3 of the zones. The zones are fairly diverse from a supply/demand perspective. Ercot has an average load of 38,000 MW, a peak of 57,000 MW with a breakdown of 30% industrial, 30% commercial and 40% residential. There are already several successful aggregators that are looking to pass on their wholesale risk to a credit-worthy QSE (Qualified Scheduling Entity). 

 Our expectation is that we will be a fully qualified QSE by mid-March with the APX covering us up to that point. Our initial on-line products will include a bal day and next day financial product. (There is no day ahead settlement in this market). There are more than 10 industrial loads with greater than 150 MW concentrated at single meters offering good opportunities for real-time optimization. Our intent is to secure one of these within the next 2 months.

 I have included some price history to show the hourly volatility and a business plan to show the scope of the opportunity. In addition, we have very solid analytics that use power flow simulations to map out expected outcomes in the real-time market.

 The initial job opportunity will involve an analysis of the real-time market as it stands today with a view to trading around our information. This will also drive which specific assets we approach to manage. As we are loosely combining our Texas gas and Ercot power desks our information flow will be superior and I believe we will have all the tools needed for a successful real-time operation.

 Let me know if you have any further questions.

 Thanks,

 Doug

Again, I seem to have found an email that straddles the boundary between business process and business content. Okay, I guess this topic isn’t the clearest at describing the examples I found!

Overall, I probably could have done a bit more to filter out the useless stuff and construct topics that better describe the examples they represent. Also, I’m not sure whether I should be surprised that I didn’t pick up some sort of ‘social banter’ topic, where people were emailing about non-business subjects. I suppose social banter emails might be less predictable in their content, but maybe somebody much smarter than I am can tell me the answer :)

If you know how I can significantly ramp up the quality of this analysis, feel free to contribute your comments!