Some Notes from AAAI-16 in Phoenix

This is a summary of what I gleaned from the academic AI research conference AAAI 2016 in Phoenix, AZ.  There is some good stuff here for machine learning users. There are also a few famous AI names here and there.  I will bold words to help you skim.  I did not take notes during all the talks, just the tutorials and invited talks that I found engaging.  I went to standard research talks on game theory, reinforcement learning, NLP, etc., but did not take notes.  I even met and talked with a few people—one about ad tech, one about social diffusion, and a few others briefly.  For the most part, I was a nerd among nerds, a loner among loners.

Tutorial on CP-Nets (Conditional Preference Networks) by Thomas E. Allen, Judy Goldsmith, and Francesca Rossi

Using CP-Nets is a way to model combinations of preferences in a more-tractable way.  Instead of creating a total-order among all possible complete situations or states, they decompose each state into variables and orient those variables as nodes in a directed acyclic graph (DAG, just like a Bayesian network does).  These nodes express which decisions or aspects of the whole state are most important given the "parent variables", i.e. the decisions made already at the more important nodes.  Each child node in the network contains a table of preferences among the values of its variable given all possible values of the parent variables.  This is a qualitative ranking of preferences, not a quantitative relationship.

I also went to lunch with these guys.  Very friendly and inclusive.

Deep Learning by Reza Borhani, Jeremy Watt, and Aggelos K. Katsaggelos

A while back, an important problem was function approximation, and the technique people used was to learn a mixture of fixed basis functions.  There are a number of types of fixed basis functions that are popular.  For frequency (Fourier) type data, you might use a set of sinusoid functions of different frequencies.  For other kinds of data, you might use a set of polynomials of different orders (powers).  There are many others.  A good thing about fixed basis functions is, they have convex error surfaces, which means you don't have to worry about falling into local optima (weight settings that terminate search but which are not globally optimal).  And yet they have other problems, and it turns out that the local optima problem with non-convex error surface of neural nets is not really that big of a problem.  The problem with fixed basis functions is, they are not very efficient.  You need to combine a lot of different basis functions from your class to approximate an arbitrary function very well.

A neural network is a set of feature basis functions for the purpose of function approximation.  They overcome some limitations of using fixed basis functions.  In a sense, neural networks "scale" better than fixed basis functions because they require a smaller number of features or basis functions than the fixed approach.  tanh is a good basis function for neural nets because it is adjustable (a lot like the logistic function which is what I learned in school).  That kind of adjustable basis function has been popular for a while, but recent improvements have given us a number of new basis functions that actually perform better than tanh and logistic (sigmoidal).

Training data quality is a spectrum from ideal to bad.  Ideal training data is characterized by two things: (1) it is not noisy and (2) it is evenly distributed over the input feature space (and he probably implied also balanced in output space too) which means no empty areas.  If your training data is close to "ideal", then it does not matter which basis function you use.  If not, then select your basis function wisely.

Gradient descent is equivalent to the back-propagation algorithm in this context.  Traditionally, you take one big step in weight-space after a full scan of all the training data.  Stochastic gradient descent is more incremental (less batch-oriented) in that it takes small steps, one for each datum in the training data; and it works better than batch updates.

A theoretical note with practical application: the target function must fall within the "span" (or space, think linear combinations of basis vectors in linear algebra) of the features or basis functions.

Neural networks are popular now because we have more data and Moore's law.  The recent big advancements in neural network research were inspired by biological neural networks, e.g. human brain and perceptual systems.  Convolutional neural networks use more biology knowledge in terms of layers of dimensionality reduction, function composition (or decomposition), and structure.

Other recent advances in neural nets include the following: (1) new choices for activation functions beyond tanh and sigmoidal, e.g. something he called "max" converges much faster requiring fewer iterations of learning, (2) better ways to initialize weights, (3) theory advances showing that non-convex error surfaces are not so bad, (4) new optimization processes, (5) new regularization strategies.

For those of you who are doing deep learning with normal neural networks containing a single layer of hidden units, here is a definition for you.  A Deep Learning model is defined to have more than two or three hidden layers in a neural network.

Diffusion in Social Networks by Paulo Shakarian

This lecture was a survey of the field of research that models how ideas spread in social networks, with applications to marketing and "going viral".  On a high level, there are two approaches: One, you could construct a detailed graph (network) model of the social network and run simulations to see what happens.  There are many competing theories and none of them look detailed enough to work very well.  This approach is better for explaining what might happen in the network than actually predicting diffusion.  Two, you could use machine learning to more directly learn to predict how far ideas might spread based on a variety of feature of the phenomenon, including the early stages of propagation.  This approach does better than the first approach at prediction, but neither approach works as well as you might hope.

Here are a few sketchy details.

Questions to answer:  How do thoughts spread in a social network?  How many infectees after T time?  Influence Maximization Problem.  Profit Maximization.  Takes theories from various fields like mathematical sociology.  Perform simulations.  The first iteration of the simulation is most sensitive to out-degree and probability (beta) of transmission.

Models: SIR and SIS models, tipping model and target set selection, Kempe-Kleinberg-Tardos (KKT) framework, logic programming based diffusion, evolutionary graph theory (EGT).

SIR model: Susceptible means able to be infected.  Infected means able to infect.  Removed means not able to infect or be infected.

Tipping Model: Polynomial-time infection-count prediction.  Non-submodularity.  Algorithm for finding a seed set that can spread infection to the whole graph in minutes with millions of nodes and edges.  Linear Threshold: uses probabilistic thresholds.

Kempe-Kleinberg-Tardos (KKT): Independent Cascade (IC) Model allows different probability for each connection. Transmission is from sum of weights. Live-Edge (Live-Arc) Model.  Generalized Threshold Model.  Logic-based Diffusion: Annotated rules for diffusion. Write an "annotated program" in logic, based on attributes of nodes and relations, (including probabilities? Maybe not).  Like deterministic tipping model.  Computing fixed point (maximum spread), which is polynomial in time and space.  Multiple labels on nodes and relations.

Evoluationary Graph Theory: Mutants taking over.  Moran Process.  Goal to compute fixation or extinction probability.  Isothermal graph.

Data-driven approaches: Learn model parameters from past diffusion processes.  Regression model to predict spread from features of past diffusion processes.  EM process hard to scale but accurate for early stage diffusion (non-viral).  Tends to over-fit using small influence probabilities (10^-7).  Temporal effects: exponential decay of influence is slow.  Discrete time model is a good compromise.  Dynamic models are better, Bernoulli model better than Jaccard.  Discrete time comparable to continuous, which is nice.

Credit Distribution Model: Fast runtime for Influence Maximization.

Power-law of cascades, large cascades are rare.  Features that are predictive: seed user's number of followers, number of friends, past local influence; content's tweet length, hashtags, divergence of sentiment; early diffusion's number of out degrees in subgraph, etc.  Regression trees, 0.98 R^2, but poor final result.  Logistic regression with balanced data got good results.  Original poster and content becomes less important as initial cascade progresses.  Number of influencers (incoming edges) has less effect above 3. Number of influencing communities has a greater impact.  Structural diversity is a good feature.

His startup: CrossViral

I also had a good chat with Dr. Paulo a few days later.

Inference in Structured Prediction by Dan Roth, Kai-Wei Chang, Gourab Kundu, and Vivek Srikumar

This was very related to my research area in school, so a lot of this was review for me.  Dan Roth was one of the few names I recognized at the conference.  I have to agree that structured prediction problems are among the most important (and still very general) prediction problems.  They include NLP, information extraction, and image object recognition problems.

Integer Linear Programming (ILP) is cool and very useful and popular these days in ML and AI.

Reading the Web

This talk was very interesting and also related to many of the papers I read in preparation for my PhD dissertation, so a lot of this talk was a review for me.  He talked about bootstrapping (the unsupervised and highly scalable process of learning extraction rules from extracted knowledge and vice-versa) and the strategy of using multiple constraints or sources of evidence to prevent semantic drift.  Tom Mitchell's Never Ending Language Learning (NELL) project and Orin Etzioni's Web-reading work are the more famous projects in this area.

I saw a slide with Justin Betteridge on it, who I already knew was working in the NELL group at CMU.  I new Justin during my masters degree at BYU.  Very nice guy.

How AI influences Labor Market (Panel Discussion)

Random bits of pessimism and optimism and ego-boosting comparisons among panelists (mostly Oren Etzioni) interspersed with a few random but interesting line plots and quotes.

Google DeepMind (Invited Talk) by Demis Hassabis

DeepMind was sold to Google for $500M.  Demis Hassabis and his company have a very impressive background.  He has been planning DeepMind for 20 years, and intentionally worked in the best of academia and startups as preparation.  Their goal is to "solve intelligence", by which catchy phrase I believe he means to "solve the impossibly hard problems necessary for creating a truly general-purpose artificial intelligence."  Strategies they adopted early on were (1) learn from raw inputs like the pixels on a screen of a video game, (2) be general not brittle and narrow.  As an counter-example, he mentioned that DeepBlue was narrow though still impressive because it would have to be completely retrained to play anything other than chess.

They are known for deep learning, but they are by no means only about deep learning.  They prefer thinking in terms of reinforcement learning.  This involves learning statistical model of world, then learning actions.  They believe that solving reinforcement learning would solve general learning and intelligence.  They believe in "grounded cognition".  Robitics hardware is expensive, slow, and it breaks.  Therefore they use simulators which can make as much training data as you want, and they ground themselves in pixels of video game images.  They prevent testing bias, but I forget how.  They can run tests on thousands and millions of agents at once.

They use Deep Reinforcement Learning, which to them means getting reinforcement learning to work at scale.  Their early task was to learn how to play 8-bit games from 80's, Atari 2600.  DQN.  Code released and discussed in articles in Nature.  General purpose game learner.

They draw inspiration from brain and neuroscience.  Their system is a Neural Turing Machine: ANNs with memory.  It is a full Turing Machine. AlphaGo is their new project.  It is based on pattern-recognition, planning, tree search, Monte Carlo Rollouts.  aGo was the first program to beat a professional Go player on a full sized board.  They have hundre Alphds of papers on their website.  Next month, in March, they will live stream on YouTube AlphaGo playing a big pro, Lee Sedol. This will be a $1M 5-game challenge match.  In preparation, they are retraining everything.

Difference between AlphaGo and DeepBlue: AlphaGo is a general-purpose, modular system combining deep learning based pattern recognition with planning and search.  They will later remove supervised learning; training time will be longer.

AlphaGo is more human-like.  Human players say AlphaGo plays like a human; so they say it passes a "Turing test" for gamers.

Montezuma's Revenge is hard for AlphaGo.

They used 100 GPUs before being bought by Google while working on Atari games.  This was a response to a question about how to do this kind of research on a small budget.  He says it's a good domain for small budgets.

Extra Bits

I had fun watching the ASU robots do their thing in the foyer between research paper sessions.

One of the posters in the poster session was about learning to automatically generate posters from scientific papers.  I asked her the obvious question.  No, her poster was not automatically generated in toto.  (One of the figures was.)  But she did get some good attention and chuckles when she introduced her topic during her spotlight talk.

Here is a web app (and a university research project) helping people making "fair" decisions involving finite resources:

It was interesting to eavesdrop on Peter Norvig and someone else I did not recognize who were talking about "academic genealogy" and their "shared ancestors".  It seams that the better your academic "pedigree" is, the more you worry about academic pedigrees.


Various Artworks

Here I collect a few random pieces of my visual artistry.

This drawing tells me what I want to do and be in the future.  In particular, it tells me two things: I wish to draw pictures with words as a writer, and I wish to fade into the background as the words of this parable is the real message.  This drawing was done in pen-and-ink on paper by hand before being scanned into computer and color added.

My first attempt in many, many years to combine computer science with art.  I call this a cuneiform binary fractal tree to 19 levels of recursion.  If you have Googled for fractal tree, you will have seen similar trees.  What makes mine unique, I think, is that it is entirely algorithmically generated and yet it has a semi-natural shape by my adding a small amount of uniform-random noise to the branching angle and also choosing a random order for growing the left and right child branches at each branch point.  I think it strikes a balance between too regular and too irregular.  I also like the fact that it is composed entirely of such a simple shape: triangles.  Oh, and I also prevent overlap among all branches in the tree by keeping track of points on lines and pruning any branch before it crosses an existing one.  You must understand, my intention is to refine the two-dimensional appearance, preventing most or all appearance of depth.  I could do more.  I plan to leave more space between branches to prevent the crowding-look.  I could also add a little noise to the branch length.  Needless to say, there are more trees to come.


Drawing Megan

This is the story of how I drew Megan.  Sorry that most of the photographs were taken by my extremely cheap phone camera.

There once was a beautiful young woman who got married to a nerd.  Before the wedding, there were pictures taken of the beautiful young woman.  The nerd has ever since been especially fond of one picture in particular.

So I decided to use this picture as the basis for a large pen-and-ink drawing.  Here is the beginning of the initial pencil drawing of the main elements.

Here is the completed pencil drawing of the main elements (darkened so you can see the pencil lines better).

I had originally planned to draw a plain cross-hatched grid for the background, like you can see in traditional engraved portraits.  Then I realized that I should fully execute my plan for practicing techniques to put in future illustrations by adding something more fractal-like.  What I needed was a fractal-like curve that had similar properties to a grid: it filled the space evenly with a uniform line thickness.  It did not take me long to discover the Hilbert Space-filling Curve, or Hilbert Curve for short.  What did take me a while to find was an existing image of it pre-rendered at a sufficient level of detail to cover my entire picture.  But then I had to draw it without being able to trace it.  That required more time than it took to find it.

Next is the Hilbert space-filling curve print-out that I traced with pencil on the print-out, itself, as I copied it line by line to my large drawing's background on a pencil-grid that I had already drawn as a guide.  I drew a few tiny line segments at a time.  At one point I discovered that I had drawn a section of the curve (maybe a few hundred line segments big) in the wrong place—it was shifted by one unit of the grid.  I had to erase that part of the background and redraw it.  Thankfully I had first drawn the curve in pencil before inking it.  I remember doing much of the inking of the Hilbert curve while I "watched" the extended Hobbit movies with my family.  Actually, I also associate this pen-and-in portrait with a number of fascinating audio-books that I listened to while drawing it, including biographies of C.S. Lewis and J.R.R. Tolkien, The Wood Beyond the World by William Morris, Phantastes by George MacDonald, a lecture-summary of the book "Goedel, Escher, Bach", a long series of lectures about life in the Middle Ages, the extended Lord of the Rings movies, and several others.  One can listen to a lot of good material in about 150 hours.

Here is the final drawing sitting on my drawing table.  You can see my lap top plugged in to my big Altec Lansing speakers in the background ready to play me another audio-book.

Here is the final pen-and-ink drawing professionally digitally scanned at Replicolor in Salt Lake City, Utah.

Here I have digitally made some minor corrections to the black and white line drawing and have added color.  I consider this the final version.  This is the same version I have listed in my American Frame Art Galary.  The coloring is subtle—on purpose.  On my laptop, I can barely see the colors, while on my desktop they are easier to see.

After finishing the color version, I tried out some of the filters that come with the free image editing tool called Gimp.  This image shows the mosaic distortion filter.  I love the blending of the colors in tiles around the top of the rose, including the brown of Megan's hair, the pink of her lips and the rose, the subtle blue of the background, and the green of the leaves.

This image shows double-thresholding—that is, mapping light and dark greys to black and middle-tone greys to white.

That about wraps it up. 


What Does the Blog Say?

Hello blog-lovers, who also happen to be parody-lovers, and who also happen to be "The Fox" lovers, who also happen to love commentary on digital and social media.  Yep, I knew that included all of you.  I proudly present a refrain that woke me up in the middle of the night, last night, as I lay in bed, unconsciously merging two things I had been "listening to" the evening before. 

"The Blog"

Pandora goes woof, YouTube goes meow.
Twitter goes tweet, and FaceBook goes like.
Flickr goes snap, MySpace goes croak, and LinkedIn goes toot.
WebMD says quack and eBay goes bling, and CNN goes OUCH OUCH OUCH.
But there's one sound that no one knows...



I'll finish this later.   

If anyone would like to collaborate to finish this, perform it, record it, let's talk.  


Three Stages of Leadership

Hello leaders and followers

A quick thought about levels of leadership.  There are three stages or levels of leadership because there three stages of followership.  There are three stages of followership because there are three stages of developing independence. 
  1. The employee needs to know what to do, so you tell them what to do.
  2. The employee needs more independence, so instead of telling them what to do in each situation, you give them a policy, and that policy tells them what to do in each situation.
  3. The employee needs even more independence and can create his own policy, so instead of giving him a ready-made policy, you give him values, goals, and resources, authority and responsibility. They can come up with their own policy or strategy. 
This idea must have come from a combination of three things: (1) being a parent and parenting, (2) reading popular business/management books, and (3) thinking about AI and designing intelligent agents. 

More later


Web Page Layout Analysis

Hello web and printed document analysis people

Some of you may find this paper interesting. They advocate using layout analysis to improve information extraction from web pages and in preliminary results find that rendering a web page and treating it like a printed document provides better clues than using the HTML DOM tree.

Online Medical Journal Article Layout Analysis

There are other people who believe the same thing:

Towards Domain Independent Information Extraction from Web Tables

Extracting Semantic Structure of Web Documents Using Content and Visual Information

This makes sense as web page writers are primarily concerned with how the browser renders their page, and how their audience reads the rendered page, not with how the HTML code looks.

Within a domain, visual layout can be used in the same way as domain knowledge (ontologies), in that both sources of information tend to be more invariant across documents within a domain than the particular HTML tags used.

Reasons given in the paper for not relying on the DOM tree:

1. The DOM nodes may not be in a semantically meaningful order.
2. Simple text lines can be broken into several nodes at different levels of a DOM tree.
3. Visually similar pages can have completely different DOM trees.

Of course, an approach that combines DOM tree information, domain knowledge and visual layout is probably best.  This gives me some assurance that I’m not wasting my time focusing my research on printed documents in which there are no HTML tags. It is also applicable to web extraction.