Data Until I Die: My blog title and statement of values

When I started keeping this Blog, my intent was to write about and keep helpful snippets of R code that I used in the line of work.  It was the start of my second job after grad school and I was really excited about getting to use R on a regular basis outside of academia!  Well, time went on and so did the number of posts I put on here.  After a while the posts related to work started to dip considerably.  Then, I found my third post grad-school job, which shifted the things I needed R for at work.  I still use R at work, but not for exactly the same things.

After I got my third job, I noticed that all my blog posts were for fun, and not for work.  That’s when I fiddled a little bit with my blog title to incorporate the concept of ‘fun’ in there.  Now that I’ve carried on these ‘for fun’ analyses which I’ve been posting every 1-2 months, I realize that it’s an obsession of mine that’s not going away.  Data, I realize, is a need of mine (probably more than sex, to be honest).  I work with data in the office, I play with data when I can find fun data sets every once in a while at home.

Data and data analysis are comfortably things that I can see myself doing for the rest of my life.  That’s why I decided to call this blog “Data Until I Die!”.  Sometimes I’ll post boring analyses, and sometimes I’ll post really interesting analyses.  The main thing is, this Blog is a good excuse to fuel my Data drive :)

Thanks for reading!

Ontario First Nations Libraries Compared Using Ontario Open Data

I recently downloaded a very cool dataset on Ontario libraries from the Ontario Open Data Catalogue.  The dataset contains 142 columns of information describing 386 libraries in Ontario, representing a fantastically massive data collection effort for such important cultural institutions (although the most recent information available is as of 2010).  One column which particularly caught my interest was “Library Service Type”, which breaks the libraries down into:

  • Public or Union Library (247)
  • LSB Library (4)
  • First Nations Library (43)
  • County, County co-operative or Regional Municipality Library (13)
  • Contracting Municipality (49)
  • Contracting LSB (14)

I saw the First Nations Library type and thought it would be really educational for me to compare First Nations libraries against all the other library types combined and see how they compare based on some interesting indicators.  To make these comparisons in this post, I use a few violin plots; where you see more bulkiness in the plot, it tells you that the value on the y axis is more likely for a library compared to the thinner parts.

Our first comparison, shown below, reveals that local population sizes are a LOT more variable amongst the “Other” library types compared to First Nations libraries.  From first to third quartile, First Nations libraries tend to have around 250 to 850 local residents, whereas Other libraries tend to have around 1,110 to 18,530 local residents!

Local Population Sizes by Library Type

             isFN.Library 0%    25%  50%   75%    100%
1         Other Libraries 28 1113.5 5079 18529 2773000
2 First Nations Libraries 55  254.5  421   857   11297

Considering the huge difference in the population sizes that these libraries were made to serve, comparisons between library types need to be weighted according to those sizes, so that the comparisons are made proportionate.  In that spirit, the next plot compares the distribution of the number of cardholders per resident by library type.  Thinking about this metric for a moment, it’s possible that a person not living in the neighbourhood of the library can get a card there.  If all the residents of the library’s neighbourhood have a card, and there are people outside of that neighbourhood with cards, then a library could have over 1 cardholder per resident.

Looking at the plot, a couple of things become apparent: Firstly, First Nations libraries appear more likely to to be overloaded with cardholders (more cardholders than there are local residents, 14% of First Nations libraries, vs. 4% of Other libraries).  On the lower end of the spectrum, First Nations libraries show a slight (non-significant) tendency of having fewer cardholders per resident than Other libraries.

Cardholders per Resident by Library Type_

             isFN.Library 0%  25%  50%  75% 100%
1         Other Libraries  0 0.20 0.37 0.55  2.1
2 First Nations Libraries  0 0.19 0.32 0.77  2.8

Next we’ll look at a very interesting metric, because it looks so different when you compare it in its raw form to when you compare it in proportion to population size.  The plot below shows the distribution of English titles in circulation by library type.  It shouldn’t be too surprising that Other libraries, serving population sizes ranging from small to VERY large, also vary quite widely in the number of English titles in circulation (ranging from around 5,600 to 55,000, from first to third quartile).  On the other hand we have First Nations libraries, serving smaller population sizes, varying a lot less in this regard (from around 1,500 to 5,600 from first to third quartile).
Num English Titles in Circulation by Library Type

             isFN.Library 0%    25%   50%   75%   100%
1         Other Libraries  0 5637.5 21054 54879 924635
2 First Nations Libraries  0 1500.0  3800  5650  25180

Although the above perspective reveals that First Nations libraries tend to have considerably fewer English titles in circulation, things look pretty different when you weight this metric by the local population size.  Here, the plot for First Nations libraries looks very much like a Hershey’s Kiss, whereas the Other libraries plot looks a bit like a toilet plunger.  In other words, First Nations libraries tend to have more English titles in circulation per resident than Other libraries.  This doesn’t say anything about the quality of those books available in First Nations libraries.  For that reason, it would be nice to have a measure even as simple as median/average age/copyright date of the books in the libraries to serve as a rough proxy for the quality of the books sitting in each library.  That way, we’d know whether the books in these libraries are up to date, or antiquated.
English Titles in Circulation per Resident by Library Type

             isFN.Library 0%       25%      50%       75%      100%
1         Other Libraries  0 0.9245169 2.698802  5.179767 119.61462
2 First Nations Libraries  0 2.0614922 7.436399 13.387416  51.14423

For the next plot, I took all of the “per-person” values, and normed them.  That is to say, for any given value on the variables represented below, I subtracted from that value the minimum possible value, and then divided the result by the range of values on that measure.  Thus, any and all values close to 1 are the higher values, and those closer to 0 are the lower values.  I then took the median value (by library type) for each measure, and plotted below.  Expressed this way, flawed though it may be, we see that First Nations Libraries tend to spend more money per local resident, across areas, than Other libraries.  The revenue side looks a bit different.  While they tend to get more revenue per local resident, they appear to generate less self-generated revenue, get fewer donations, and get less money in local operating grants, all in proportion to the number of local residents.  The three areas where they are excelling (again, this is a median measure) are total operating revenue, provincial operating funding, and especially project grants.
Normed Costs and Revenues by Library Type
Here I decided to zero in on the distributional differences in net profit per resident by library type.  Considering that libraries are non-profit institutions, you would expect to see something similar to the plot shown for “Other” libraries, where the overwhelming majority are at or around the zero line.  It’s interesting to me then, especially since I work with non-profit institutions, to see the crazy variability in the First Nations libraries plot.  The upper end of this appears to be from some outrageously high outliers, so I decided to take them out and replot.
Net Profit per Resident Population
In the plot below, I’ve effectively zoomed in, and can see that there do seem to be more libraries showing a net loss, per person, than those in the net gain status.
Normed Costs and Revenues by Library Type - Zoomed In

             isFN.Library      0%    25%   50%  75%   100%
1         Other Libraries -149.87  -0.49  0.00 1.16  34.35
2 First Nations Libraries  -76.55 -17.09 -0.88 0.40 250.54

I wanted to see this net profit per person measure mapped out across Ontario, so I used the wonderful ggmap package, which to my delight is Canadian friendly!  Go Canada!  In this first map, we see that First Nations libraries in Southern Ontario (the part of Ontario that looks like the head of a dragon) seem to be “okay” on this measure, with one library at the “neck” of the dragon seeming to take on a little more red of a shade, one further west taking on a very bright green, and a few closer to Manitoba appearing to be the worst.

Net Profit per Local Resident Amongst First Nations LibrariesTo provide more visual clarity on these poorly performing libraries, I took away all libraries at or above zero on this measure.  Now there are fewer distractions, and it’s easier to see the worst performers.

Net Profit per Local Resident Amongst First Nations Libraries - in the red

Finally, as a sanity check, I re-expressed the above measure into a ratio of total operating revenue to total operating expenditure to see if the resulting geographical pattern was similar enough.  Anything taking on a value of less than 1 is spending more than they are making in revenue, and are thus “in the red”.  While there are some differences in how the colours are arrayed across Ontario, the result is largely the same.Operating Revenue to Cost Ratio Amongst First Nations Libraries

Finally, I have one last graph that does seem to show a good-news story.  When I looked at the ratio of annual program attendance to local population size, I found that First Nations libraries seem to attract more people every year, proportionate to population size, compared to Other libraries!  This might have something to do with the draw of a cultural institution in a small community, but feel free to tell me some first hand stories either running against this result, or confirming it if you will:

Annual Program Attendance by Library Type


          isFN.Library    0%   25%   50%   75%   100%
1         Other Libraries  0 0.018 0.155 0.307  8.8017
2 First Nations Libraries  0 0.113 0.357 2.686  21.361

That’s it for now! If you have any questions, or ideas for further analysis, don’t hesitate to drop me a line :)

As a final note, I think that it’s fantastic that this data collection was done, but the fact that the most recent data available is as of 2010 is very tardy.  What happened here?  Libraries are so important across the board, so please, Ontario provincial government, keep up the data collection efforts!

A Delicious Analysis! (aka topic modelling using recipes)

A few months ago, I saw a link on twitter to an awesome graph charting the similarities of different foods based on their flavour compounds, in addition to their prevalence in recipes (see the whole study, The Flavor Network and the Principles of Food Pairing).  I thought this was really neat and became interested in potentially using the data for something slightly different; to figure out which ingredients tended to correlate across recipes.  I emailed one of the authors, Yong-Yeol Ahn, who is a real mensch by the way, and he let me know that the raw recipe data is readily available on his website!

Given my goal of looking for which ingredients correlate across recipes, I figured this would be the perfect opportunity to use topic modelling (here I use Latent Dirichlet Allocation or LDA).  Usually in topic modelling you have a lot of filtering to do.  Not so with these recipe data, where all the words (ingredients) involved in the corpus are of potential interest, and there aren’t even any punctuation marks!  The topics coming out of the analysis would represent clusters of ingredients that co-occur with one another across recipes, and would possibly teach me something about cooking (of which I know precious little!).

All my code is at the bottom, so all you’ll find up here are graphs and my textual summary.  The first thing I did was to put the 3 raw recipe files together using python.  Each file consisted of one recipe per line, with the cuisine of the recipe as the first entry on the line, and all other entries (the ingredients) separated by tab characters.  In my python script, I separated out the cuisines from the ingredients, and created two files, one for the recipes, and one for the cuisines of the recipes.

Then I loaded up the recipes into R and got word/ingredient counts.  As you can see below, the 3 most popular ingredients were egg, wheat, and butter.  It makes sense, considering the fact that roughly 70% of all the recipes fall under the “American” cuisine.  I did this analysis for novelty’s sake, and so I figured I would take those ingredients out of the running before I continued on.  Egg makes me fart, wheat is not something I have at home in its raw form, and butter isn’t important to me for the purpose of this analysis!

Recipe Popularity of Top 30 Ingredients

Here are the top ingredients without the three filtered out ones:

Recipe Popularity of Top 30 Ingredients - No Egg Wheat or Butter

Finally, I ran the LDA, extracting 50 topics, and the top 5 most characteristic ingredients of each topic.  You can see the full complement of topics at the bottom of my post, but I thought I’d review some that I find intriguing.  You will, of course, find other topics intriguing, or some to be bizarre and inappropriate (feel free to tell me in the comment section).  First, topic 4:

[1] "tomato"  "garlic"  "oregano" "onion"   "basil"

Here’s a cluster of ingredients that seems decidedly Italian.  The ingredients seem to make perfect sense together, and so I think I’ll try them together next time I’m making pasta (although I don’t like tomatoes in their original form, just tomato sauce).

Next, topic 19:

[1] "vanilla" "cream"   "almond"  "coconut" "oat"

This one caught my attention, and I’m curious if the ingredients even make sense together.  Vanilla and cream makes sense… Adding coconut would seem to make sense as well.  Almond would give it that extra crunch (unless it’s almond milk!).  I don’t know whether it would be tasty however, so I’ll probably pass this one by.

Next, topic 20:

[1] "onion"         "black_pepper"  "vegetable_oil" "bell_pepper"   "garlic"

This one looks tasty!  I like spicy foods and so putting black pepper in with onion, garlic and bell pepper sounds fun to me!

Next, topic 23:

[1] "vegetable_oil" "soy_sauce"     "sesame_oil"    "fish"          "chicken"

Now we’re into the meaty zone!  I’m all for putting sauces/oils onto meats, but putting vegetable oil, soy sauce, and sesame oil together does seem like overkill.  I wonder whether soy sauce shows up with vegetable oil or sesame oil separately in recipes, rather than linking them all together in the same recipes.  I’ve always liked the extra salty flavour of soy sauce, even though I know it’s horrible for you as it has MSG in it.  I wonder what vegetable oil, soy sauce, and chicken would taste like.  Something to try, for sure!

Now, topic 26:

[1] "cumin"      "coriander"  "turmeric"   "fenugreek"  "lemongrass"

These are a whole lot of spices that I never use on my food.  Not for lack of wanting, but rather out of ignorance and laziness.  One of my co-workers recently commented that cumin adds a really nice flavour to food (I think she called it “middle eastern”).  I’ve never heard a thing about the other spices here, but why not try them out!

Next, topic 28:

[1] "onion"       "vinegar"     "garlic"      "lemon_juice" "ginger"

I tend to find that anything with an intense flavour can be very appetizing for me.  Spices, vinegar, and anything citric are what really register on my tongue.  So, this topic does look very interesting to me, probably as a topping or a sauce.  It’s interesting that ginger shows up here, as that neutralizes other flavours, so I wonder whether I’d include it in any sauce that I make?

Last one!  Topic 41:

[1] "vanilla"  "cocoa"    "milk"     "cinnamon" "walnut"

These look like the kinds of ingredients for a nice drink of some sort (would you crush the walnuts?  I’m not sure!)

Well, I hope you enjoyed this as much as I did!  It’s not a perfect analysis, but it definitely is a delicious one :)  Again, feel free to leave any comments about any of the ingredient combinations, or questions that you think could be answered with a different analysis!

UofT R session went well. Thanks RStudio Server!

Apart from going longer than I had anticipated, very little of any significance went wrong during my R session at UofT on friday!  It took a while at the beginning for everyone to get set up.  Everyone was connecting to my home RStudio server via UofT’s wireless network.  This meant that if any students weren’t set up to use wireless in the first place (they get a username and password from the library, a UTORid) then they wouldn’t be able to connect period.  For those students who were able to connect, I assigned each of them one of 30 usernames that I had laboriously set up on my machine the night before.

After connecting to my server, I then got them to click on the ‘data’ directory that I had set up in each of their home folders on my computer to load up the data that I prepared for them (see last post).  I forgot that they needed to set the data directory as their working directory… woops, that wasted some time!  After I realized that mistake, things went more smoothly.

We went over data import, data indexing (although I forgot about conditional indexing, which I use very often at work… d’oh!), merging, mathematical operations, some simple graphing (a histogram, scatterplot, and scatterplot matrix), summary stats, median splits, grouped summary stats using the awesome dplyr, and then nicer graphing using the qplot function from ggplot2.

I was really worried about being boring, but I found myself getting more and more energized as the session went on, and I think the students were interested as well!  I’m so glad that the RStudio Server I set up on my computer was able to handle all of those connections at once and that my TekSavvy internet connection didn’t crap out either :)  This is definitely an experience that I would like to have again.  Hurray!

Here’s a script of the analysis I went through:

Here’s the data:

Teaching a Class of Undergrads, RStudio Server, and My Ubuntu Machine

I was chatting about public speaking with my brother, who is a Lecturer in the Faculty of Pharmacy at UofT, when he offered me the opportunity to come to his class and teach about R.  Always eager to spread the analytical goodness, I said yes!  The class is this Friday, and I am excited.

For this class I’ll be making use of RStudio Server, rather than having to get R onto some 30 individual machines.  Furthermore, I’ll be using an installation of RStudio Server on my own home machine.  It gives me more control and the convenience of configuring things late at night when I have the time to.

While playing around with the server on my computer (connecting via my own browser) I noticed that for each user you create, a new package library gets built.  That’s too bad as it relates to this class, because it would be neat for everyone to be able to make use of additional packages like ggplot2, dplyr and such, but this is an extremely beginner class anyway.

I’ve signed up for a dynamic dns host name from, and have set the port forwarding on my router accordingly, so that seems to be working just fine.  I just hope that nothing goes wrong.  I need to remember to create enough accounts on my ubuntu machine to accommodate all the students, which will be a small pain in the you-know-what, but oh well.

As for the data side of things, I’ve compiled some mildly interesting data on drug-related deaths by council area in scotland, geographical coordinates, and levels of crime, employment, education, income and health.  I only have an hour, so we’ll see how much I can cover!  Wish me luck.  If you have any advice, I’d be happy to hear it.  I’ve already been told to start with graphics :)

Nuclear vs Green Energy: Share the Wealth or Get Your Own?

Thanks to Ontario Open Data, a survey dataset was recently made public containing peoples’ responses to questions about Ontario’s Long Term Energy Plan (LTEP).  The survey did fairly well in terms of raw response numbers, with 7,889 responses in total (although who knows how many people it was sent to!).  As you’ll see in later images in this post, two major goals of Ontario’s LTEP is to eliminate power generation from coal, and to maximize savings by encouraging lots of conservation.

For now though, I’ll introduce my focus of interest: Which energy sources did survey respondents think should be relied on in the future, and how does that correlate with their views on energy management/sharing?

As you can see in the graph below, survey respondents were given a 7 point scale and asked to use it to rate the importance of different energy source options (scale has been flipped so that 7 is the most important and 1 is the least).  Perhaps it’s my ignorance of this whole discussion, but it surprised me that 76% of respondents rated Nuclear power as at least a 5/7 on a scale of importance!  Nuclear power?  But what about Chernobyl and Fukushima?  To be fair, although terribly dramatic and devastating, those were isolated incidents.  Also, measures have been taken to ensure our current nuclear reactors are and will be disaster safe.  Realistically, I think most people don’t think about those things!  A few other things to notice here: conservation does have its adherents, with 37% giving a positive response.  Also, I think it was surprising (and perhaps saddening) to see that green energy has so few adherents, proportionately speaking.

Survey: Importance of Energy Sources

After staring at this graph for a while, I had the idea to see what interesting differences I could find between people who supported Nuclear energy versus those who support Green energy.  What I found is quite striking in its consistency:

  1. Those who believe Nuclear energy is important for Ontario’s future mix of energy sources seem to be more confident that there’s enough energy to share between regions and that independence in power generation is not entirely necessary.
  2. On the flip side, those who believe Green energy is important for Ontario’s future mix of energy sources seem to be more confident that there isn’t enough energy to share between regions and that independence in power generation should be nurtured.

See for yourself in the following graphs:

Survey: Regions Should Make Conservation their First Priority

Survey: Self Sustaining Regions

Survey: Region Responsible for Growing Demand

Survey: Regions buy Power

Does this make sense in light of actual facts?  The graph below comes from a very digestible page set up by the Ontario Ministry of Energy to communicate its Long Term Energy Plan.  As they make pretty obvious, Nuclear energy accounts for over half of energy production in Ontario in 2013, whereas the newer green energy sources (Solar, Bioenergy, Wind vs. Hydro ) amount to about 5%.  In their forecast for 2032, they are hopeful that they will account for 13% of energy production in Ontario.  Still not the lion’s share of energy, but if you add that to the 22% accounted for by Hydro, then you get 35% of all energy production, which admittedly isn’t bad!  Still, I wonder what people were thinking of when they saw “Green energy” on the survey.  If the new sources, then I think what is going on here is that perhaps people who advocate for Green energy sources such as wind and solar have an idea how difficult it is to power a land mass such as Ontario with these kinds of power stations.  People advocating for Nuclear, on the other hand, are either blissfully ignorant, or simply understand that Nuclear power plants are able to serve a wider area.
MOE: Screenshot from 2013-12-08 13:28:04

MOE: Screenshot from 2013-12-08 13:41:06

All of this being said, as you can see in the image above, the Ontario Provincial Government actually wants to *reduce* our province’s reliance on Nuclear energy in the next 5 years, and in fact they will not be building new reactors.  I contacted Mark Smith, Senior Media Relations Coordinator of the Ontario Ministry of Energy to ask him to comment about the role of Nuclear energy in the long run.  Following are some tidbits that he shared with me over email:

Over the past few months, we have had extensive consultations as part of our review of Ontario’s Long Term Energy Plan (LTEP). There is a strong consensus that now is not the right time to build new nuclear.

Ontario is currently in a comfortable supply situation and it does not require the additional power.

We will continue to monitor the demand and supply situation and look at building new nuclear in the future, if the need arises.

Nuclear power has been operating safely in our province for over 40 years, and is held to the strictest regulations and safety requirements to ensure that the continued operation of existing facilities, and any potential new build are held to the highest standards.

We will continue with our nuclear refurbishment plans for which there was strong province-wide support during the LTEP consultations.

During refurbishment, both OPG and Bruce Power will be subject to the strictest possible oversight to ensure safety, reliable supply and value for ratepayers.

Nuclear refurbishments will create thousands of jobs and extend the lives of our existing fleet for another 25-30 years, sustaining thousands of highly-skilled and high-paying jobs.

The nuclear sector will continue be a vital, innovative part of Ontario, creating new technology which is exported around the world.

Well, even Mr. Mark Smith seems confident about Nuclear energy!  I tried to contact the David Suzuki Foundation to see if they’d have anything to say on the role of Green Energy in Ontario’s future, but they were unavailable for comment.

Well, there you have it!  Despite confidence in Nuclear energy as a viable source for the future, the province will be increasing its investments in both Green energy and conservation!  Here’s hoping for an electric following decade :)

(P.S. As usual, the R code follows)

Enron Email Corpus Topic Model Analysis Part 2 – This Time with Better regex

After posting my analysis of the Enron email corpus, I realized that the regex patterns I set up to capture and filter out the cautionary/privacy messages at the bottoms of peoples emails were not working.  Let’s have a look at my revised python code for processing the corpus:

As I did not change the R code since the last post, let’s have a look at the results:

      Topic 1   Topic 2   Topic 3     Topic 4   
 [1,] "enron"   "time"    "pleas"     "deal"    
 [2,] "busi"    "thank"   "thank"     "gas"     
 [3,] "manag"   "day"     "attach"    "price"   
 [4,] "meet"    "dont"    "email"     "contract"
 [5,] "market"  "call"    "enron"     "power"   
 [6,] "compani" "week"    "agreement" "market"  
 [7,] "vinc"    "look"    "fax"       "chang"   
 [8,] "report"  "talk"    "call"      "rate"    
 [9,] "time"    "hope"    "copi"      "trade"   
[10,] "energi"  "ill"     "file"      "day"     
[11,] "inform"  "tri"     "messag"    "month"   
[12,] "pleas"   "bit"     "inform"    "compani" 
[13,] "trade"   "guy"     "phone"     "energi"  
[14,] "risk"    "night"   "send"      "transact"
[15,] "discuss" "friday"  "corp"      "product" 
[16,] "regard"  "weekend" "kay"       "term"    
[17,] "team"    "love"    "review"    "custom"  
[18,] "plan"    "item"    "receiv"    "cost"    
[19,] "servic"  "email"   "question"  "thank"   
[20,] "offic"   "peopl"   "draft"     "purchas"

One at a time, I will try to interpret what each topic is trying to describe:

  1. This one appears to be a business process topic, containing a lot of general business terms, with a few even relating to meetings.
  2. Similar to the last model that I derived, this topic has a lot of time related words in it such as: time, day, week, night, friday, weekend.  I’ll be interested to see if this is another business meeting/interview/social meeting topic, or whether it describes something more social.
  3. Hrm, this topic seems to contain a lot of general terms used when we talk about communication: email, agreement, fax, call, message, inform, phone, send, review, question.  It even has please and thank you!  I suppose it’s very formal and you could perhaps interpret this as professional sounding administrative emails.
  4. This topic seems to be another case of emails containing a lot of ‘shop talk’

Okay, let’s see if we can find some examples for each topic:

sample(which(df.emails.topics$"1" > .95),3)
[1] 27771 45197 27597


 Christi's call.
 	Christi has asked me to schedule the above meeting/conference call.  September 11th (p.m.) seems to be the best date.  Question:  Does this meeting need to be a 1/2 day meeting?  Christi and I were wondering.
 	Give us your thoughts.

Yup, business process, meeting. This email fits the bill! Next!


 I didn't check voice mail until this morning (I don't have a blinking light.  
 The assistants pick up our lines and amtel us when voice mails have been 
 left.)  Anyway, with the uncertainty of the future business under the Texas 
 Desk, the following are my goals for the next six months:
 1)  Ensure a smooth transition of HPL to AEP, with minimal upsets to Texas 
 2)  Develop operations processes and controls for the new Texas Desk.   
 3)  Develop a replacement
  a.  Strong push to improve Liz (if she remains with Enron and )
  b.  Hire new person, internally or externally
 4)  Assist in develop a strong logisitcs team.  With the new business, we 
 will need strong performers who know and accept their responsibilites.
 1 and 2 are open-ended.  How I accomplish these goals and what they entail 
 will depend how the Texas Desk (if we have one) is set up and what type of 
 activity the desk will be invovled in, which is unknown to me at this time.  
 I'm sure as we get further into the finalization of the sale, additional and 
 possibly more urgent goals will develop.  So, in short, who knows what I need 
 to do.

This one also seems to fit the bill. “D” here is writing about his/her goals for the next six months and considers briefly how to accomplish them. Not heavy into the content of the business, so I’m happy here. On to topic 2:

sample(which(df.emails.topics$"2" > .95),3)
[1] 50356 22651 19259


I agree it is Matt, and  I believe he has reviewed this tax stuff (or at 
 least other turbine K's) before.  His concern will be us getting some amount 
 of advance notice before title transfer (ie, delivery).  Obviously, he might 
 have some other comments as well.  I'm happy to send him the latest, or maybe 
 he can access the site?
 Given that the present form of GE world hunger seems to be more domestic than 
 international it would appear that Matt Gockerman would be a good choice for 
 the Enron- GE tax discussion.  Do you want to contact him or do you want me 
 to.   I would be interested in listening in on the conversation for 

Here, the conversants seem to be talking about having a phone conversation with “Matt” to get his ideas on a tax discussion. This fits in with the meeting theme. Next!



Well, that was pretty social, wasn’t it? :) Okay one more from the same topic:


  Mime-Version: 1.0
  Content-Transfer-Encoding: 7bit
 X- X- X- X-b X-Folder: \ExMerge - Giron, Darron C.\Sent Items
 X-Origin: GIRON-D
 X-FileName: darron giron 6-26-02.PST
 Sorry.  I've got a UBS meeting all day.  Catch you later.  I was looking forward to the conversation.
 It seems everyone agreed to Ninfa's.  Let's meet at 11:45; let me know if a
 different time is better.  Ninfa's is located in the tunnel under the JP
 Morgan Chase Tower at 600 Travis.  See you there.

Woops, header info that I didn’t manage to filter out :(. Anyway, DG writes about an impending conversation, and Schroeder writes about a specific time for their meeting. This fits! Next topic!

sample(which(df.emails.topics$"3" > .95),3)
[1] 24147 51673 29717


Kaye:  Can you please email the prior report to me?  Thanks.
 Sara Shackleton
 Enron North America Corp.
 1400 Smith Street, EB 3801a
 Houston, Texas  77002
 713-853-5620 (phone)
 713-646-3490 (fax)

 	04/10/2001 05:56 PM
 At Alan's request, please provide to me by e-mail (with a  Thursday of this week your suggested changes to the March 2001 Monthly 
 Report, so that we can issue the April 2001 Monthly Report by the end of this 
 week.  Thanks for your attention to this matter.

This one definitely fits in with the professional sounding administrative emails interpretation. Emailing reports and such. Next!

 I believe this was intended for Susan Scott with ETS...I'm with Nat Gas trading.
 FYI...another executed capacity transaction on EOL for Transwestern.
 This message is to confirm your EOL transaction with Transwestern Pipeline Company.
 You have successfully acquired the package(s) listed below.  If you have questions or
 concerns regarding the transaction(s), please call Michelle Lokay at (713) 345-7932
 prior to placing your nominations for these volumes.
 Product No.:	39096
 Time Stamp:	3/27/01	09:03:47 am
 Product Name:	US PLCapTW Frm CenPool-OasisBlock16
 Shipper Name:  E Prime, Inc.
 Volume:	10,000 Dth/d  
 Rate:	$0.0500 /dth 1-part rate (combined  Res + Com) 100% Load Factor
 		+ applicable fuel and unaccounted for
 TW K#: 27548		
 Points:	RP- (POI# 58649)  Central Pool      10,000 Dth/d
 		DP- (POI# 8516)   Oasis Block 16  10,000 Dth/d
 Alternate Point(s):  NONE
 Note:     	In order to place a nomination with this agreement, you must log 
 	            	off the TW system and then log back on.  This action will update
 	            	the agreement's information on your PC and allow you to place
 		nominations under the agreement number shown above.
 Contact Info:		Michelle Lokay
 	 			Phone (713) 345-7932
               			Fax       (713) 646-8000

Rather long, but even the short part at the beginning falls under the right category for this topic! Okay, let’s look at the final topic:

sample(which(df.emails.topics$"4" > .95),3)
[1] 39100  31681  6427


 Randy, your proposal is fine by me.  Jim

Hrm, this is supposed to be a ‘business content’ topic, so I suppose I can see why this email was classified as such. It doesn’t take long to go from ‘proposal’ to ‘contract’ if you free associate, right? Next!


 Attached is the latest version of the Wildhorse Entrada Letter.  Please 
 review.  I reviewed the letter with Jim Osborne and Ken Krisa yesterday and 
 should get their comments today.  My plan is to Fedex to Midland for Ken's 
 signature tomorrow morning and from there it will got to Wildhorse.  

This one makes me feel a little better, referencing a specific business letter that the emailer probably wants the emailed person to see. Let’s find one more for good luck:


 At a ratio of 10:1, you should have your 4th one signed and have the fifth 
 one on the way...
  	09/19/2000 05:40 PM
 ONLY 450!  Why, I thought you guys hit 450 a long time ago.
 Marie Heard
 Senior Legal Specialist
 Enron Broadband Services
 Phone:  (713) 853-3907
 Fax:  (713) 646-8537

 	09/19/00 05:34 PM
 Well, I do believe this makes 450!  A nice round number if I do say so myself!

 	Susan Bailey
 	09/19/2000 05:30 PM

 We have received an executed Master Agreement:
 Type of Contract:  ISDA Master Agreement (Multicurrency-Cross Border)
 Enron Entity:   Enron North America Corp.
 Counterparty:   Arizona Public Service Company
 Transactions Covered:  Approved for all products with the exception of: 
           Foreign Exchange
           Pulp & Paper
 Special Note:  The Counterparty has three (3) Local Business Days after the 
 receipt of a Confirmation from ENA to accept or dispute the Confirmation.  
 Also, ENA is the Calculation Agent unless it should become a Defaulting 
 Party, in which case the Counterparty shall be the Calculation Agent.
 Susan S. Bailey
 Enron North America Corp.
 1400 Smith Street, Suite 3806A
 Houston, Texas 77002
 Phone: (713) 853-4737
 Fax: (713) 646-3490

That one was very long, but there’s definitely some good business content in it (along with some happy banter about the contract that I guess was acquired).

All in all, I’d say that fixing those regex patterns that were supposed to filter out the caution/privacy messages at the ends of peoples’ emails was a big boon to the LDA analysis here.

Let that be a lesson: half the battle in LDA is in filtering out the noise!