Some Notes from AAAI-16 in Phoenix

This is a summary of what I gleaned from the academic AI research conference AAAI 2016 in Phoenix, AZ.  There is some good stuff here for machine learning users. There are also a few famous AI names here and there.  I will bold words to help you skim.  I did not take notes during all the talks, just the tutorials and invited talks that I found engaging.  I went to standard research talks on game theory, reinforcement learning, NLP, etc., but did not take notes.  I even met and talked with a few people—one about ad tech, one about social diffusion, and a few others briefly.  For the most part, I was a nerd among nerds, a loner among loners.

Tutorial on CP-Nets (Conditional Preference Networks) by Thomas E. Allen, Judy Goldsmith, and Francesca Rossi

Using CP-Nets is a way to model combinations of preferences in a more-tractable way.  Instead of creating a total-order among all possible complete situations or states, they decompose each state into variables and orient those variables as nodes in a directed acyclic graph (DAG, just like a Bayesian network does).  These nodes express which decisions or aspects of the whole state are most important given the "parent variables", i.e. the decisions made already at the more important nodes.  Each child node in the network contains a table of preferences among the values of its variable given all possible values of the parent variables.  This is a qualitative ranking of preferences, not a quantitative relationship.
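To make the structure concrete, here is a minimal sketch of a CP-net as a Python dictionary. The variables ("main", "wine") and their preferences are invented for illustration and are not from the tutorial. Each variable stores its parents and a table that ranks its values for every combination of parent values. The best outcome falls out of a single parents-first sweep, and an "improving flip" changes one variable to a value that is preferred given the current parent values.

```python
# Hypothetical toy CP-net: the variable names and preferences are invented.
# Each variable maps to (parents, table); the table maps a tuple of parent
# values to that variable's values ordered from most to least preferred.
CP_NET = {
    "main": ((), {(): ["fish", "meat"]}),
    "wine": (("main",), {("fish",): ["white", "red"],
                         ("meat",): ["red", "white"]}),
}
ORDER = ["main", "wine"]  # a topological order of the DAG (parents first)


def best_outcome(net, order):
    """Sweep the DAG parents-first, picking each variable's top value."""
    outcome = {}
    for var in order:
        parents, table = net[var]
        key = tuple(outcome[p] for p in parents)
        outcome[var] = table[key][0]
    return outcome


def improving_flips(net, outcome):
    """Return outcomes reachable by changing one variable to a value that is
    preferred given its parents' current values (a single improving flip)."""
    better = []
    for var, (parents, table) in net.items():
        ranking = table[tuple(outcome[p] for p in parents)]
        current_rank = ranking.index(outcome[var])
        for value in ranking[:current_rank]:
            better.append({**outcome, var: value})
    return better


print(best_outcome(CP_NET, ORDER))
print(improving_flips(CP_NET, {"main": "meat", "wine": "white"}))
```

Note that the tables only rank values; no numbers are attached, which matches the qualitative nature of the model.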

I also went to lunch with these guys.  Very friendly and inclusive.

Deep Learning by Reza Borhani, Jeremy Watt, and Aggelos K. Katsaggelos

A while back, an important problem was function approximation, and the technique people used was to learn a mixture of fixed basis functions.  There are a number of types of fixed basis functions that are popular.  For frequency (Fourier) type data, you might use a set of sinusoid functions of different frequencies.  For other kinds of data, you might use a set of polynomials of different orders (powers).  There are many others.  A good thing about fixed basis functions is, they have convex error surfaces, which means you don't have to worry about falling into local optima (weight settings that terminate search but which are not globally optimal).  And yet they have other problems, and it turns out that the local optima problem with non-convex error surface of neural nets is not really that big of a problem.  The problem with fixed basis functions is, they are not very efficient.  You need to combine a lot of different basis functions from your class to approximate an arbitrary function very well.
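As a rough illustration of the fixed-basis idea (my own sketch, not code from the tutorial), the snippet below fits a polynomial basis to an arbitrary target by least squares. The target function sin(3x), the sample grid, and all function names are assumptions. Because the problem is convex, the normal equations give the global optimum directly, and the error only shrinks as more basis functions are added.

```python
import math


def poly_features(x, degree):
    """Fixed polynomial basis: [1, x, x^2, ..., x^degree]."""
    return [x ** k for k in range(degree + 1)]


def solve(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def fit(xs, ys, degree):
    """Least-squares weights via the normal equations (Phi^T Phi) w = Phi^T y."""
    phi = [poly_features(x, degree) for x in xs]
    d = degree + 1
    gram = [[sum(row[i] * row[j] for row in phi) for j in range(d)]
            for i in range(d)]
    rhs = [sum(row[i] * y for row, y in zip(phi, ys)) for i in range(d)]
    return solve(gram, rhs)


def mse(xs, ys, w):
    """Mean squared error of the fitted polynomial on the given points."""
    preds = [sum(wk * f for wk, f in zip(w, poly_features(x, len(w) - 1)))
             for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)


xs = [i / 20 for i in range(-20, 21)]   # 41 evenly spaced inputs in [-1, 1]
ys = [math.sin(3 * x) for x in xs]      # an arbitrary smooth target (assumed)
errors = {d: mse(xs, ys, fit(xs, ys, d)) for d in (1, 3, 5)}
print(errors)
```

The point of the comparison is the inefficiency mentioned above: a good fit to even this simple target needs several basis functions.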

A neural network is a set of feature basis functions for the purpose of function approximation.  They overcome some limitations of using fixed basis functions.  In a sense, neural networks "scale" better than fixed basis functions because they require a smaller number of features or basis functions than the fixed approach.  tanh is a good basis function for neural nets because it is adjustable (a lot like the logistic function which is what I learned in school).  That kind of adjustable basis function has been popular for a while, but recent improvements have given us a number of new basis functions that actually perform better than tanh and logistic (sigmoidal).

Training data quality is a spectrum from ideal to bad.  Ideal training data is characterized by two things: (1) it is not noisy and (2) it is evenly distributed over the input feature space (and he probably implied balanced in output space too), which means no empty areas.  If your training data is close to "ideal", then it does not matter which basis function you use.  If not, then select your basis function wisely.

Gradient descent is equivalent to the back-propagation algorithm in this context.  Traditionally, you take one big step in weight-space after a full scan of all the training data.  Stochastic gradient descent is more incremental (less batch-oriented) in that it takes small steps, one for each datum in the training data; and it works better than batch updates.
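Here is a small sketch of the difference between the two update schedules, using a one-feature linear model so the arithmetic stays visible. The data (noiseless samples of y = 2x + 1), the learning rate, and the function names are all my own assumptions. Both methods see the data the same number of times; batch descent takes one step per pass, while stochastic descent takes one step per example.

```python
import random


def batch_step(w, b, data, lr):
    """One big step using the gradient averaged over the full training set."""
    gw = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    gb = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    return w - lr * gw, b - lr * gb


def sgd_epoch(w, b, data, lr, rng):
    """One pass over the data, taking a small step after every example."""
    order = data[:]
    rng.shuffle(order)
    for x, y in order:
        err = w * x + b - y
        w, b = w - lr * 2 * err * x, b - lr * 2 * err
    return w, b


def loss(w, b, data):
    """Mean squared error of the linear model on the data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)
# Hypothetical noiseless data from y = 2x + 1.
data = [(x / 10, 2 * (x / 10) + 1) for x in range(-10, 11)]

w_b, b_b = 0.0, 0.0
w_s, b_s = 0.0, 0.0
for _ in range(20):  # same number of passes over the data for both methods
    w_b, b_b = batch_step(w_b, b_b, data, lr=0.1)
    w_s, b_s = sgd_epoch(w_s, b_s, data, lr=0.1, rng=rng)

batch_loss, sgd_loss = loss(w_b, b_b, data), loss(w_s, b_s, data)
print(batch_loss, sgd_loss)
```

In this toy setting the stochastic version ends up much closer to the true weights after the same number of passes, simply because it has taken many more steps. That is only one small example and not a general proof.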

A theoretical note with practical application: the target function must fall within the "span" (or space, think linear combinations of basis vectors in linear algebra) of the features or basis functions.

Neural networks are popular now because we have more data and Moore's law.  The recent big advancements in neural network research were inspired by biological neural networks, e.g. human brain and perceptual systems.  Convolutional neural networks use more biology knowledge in terms of layers of dimensionality reduction, function composition (or decomposition), and structure.

Other recent advances in neural nets include the following: (1) new choices for activation functions beyond tanh and sigmoidal, e.g. something he called "max" converges much faster requiring fewer iterations of learning, (2) better ways to initialize weights, (3) theory advances showing that non-convex error surfaces are not so bad, (4) new optimization processes, (5) new regularization strategies.

For those of you who are doing deep learning with normal neural networks containing a single layer of hidden units, here is a definition for you.  A Deep Learning model is defined to have more than two or three hidden layers in a neural network.

Diffusion in Social Networks by Paulo Shakarian

This lecture was a survey of the field of research that models how ideas spread in social networks, with applications to marketing and "going viral".  On a high level, there are two approaches: One, you could construct a detailed graph (network) model of the social network and run simulations to see what happens.  There are many competing theories and none of them look detailed enough to work very well.  This approach is better for explaining what might happen in the network than actually predicting diffusion.  Two, you could use machine learning to more directly learn to predict how far ideas might spread based on a variety of features of the phenomenon, including the early stages of propagation.  This approach does better than the first approach at prediction, but neither approach works as well as you might hope.

Here are a few sketchy details.

Questions to answer:  How do thoughts spread in a social network?  How many infectees after T time?  Influence Maximization Problem.  Profit Maximization.  Takes theories from various fields like mathematical sociology.  Perform simulations.  The first iteration of the simulation is most sensitive to out-degree and probability (beta) of transmission.

Models: SIR and SIS models, tipping model and target set selection, Kempe-Kleinberg-Tardos (KKT) framework, logic programming based diffusion, evolutionary graph theory (EGT).

SIR model: Susceptible means able to be infected.  Infected means able to infect.  Removed means not able to infect or be infected.
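The three states can be sketched as a simple discrete-time simulation on a graph. This is my own minimal version, not the speaker's code. The parameter names `beta` (transmission probability per contact) and `gamma` (removal probability per step) are assumptions, although `beta` matches the transmission probability mentioned above.

```python
import random


def sir_step(graph, state, beta, gamma, rng):
    """Advance one discrete time step of an SIR process on a graph.

    graph: dict node -> list of out-neighbours
    state: dict node -> "S", "I" or "R"
    beta:  probability an infected node infects each susceptible neighbour
    gamma: probability an infected node becomes removed this step
    """
    new_state = dict(state)
    for node, status in state.items():
        if status != "I":
            continue
        for nb in graph[node]:
            if state[nb] == "S" and rng.random() < beta:
                new_state[nb] = "I"
        if rng.random() < gamma:
            new_state[node] = "R"
    return new_state


def simulate(graph, seeds, beta, gamma, steps, seed=0):
    """Run the process for a fixed number of steps and return every state."""
    rng = random.Random(seed)
    state = {n: ("I" if n in seeds else "S") for n in graph}
    history = [state]
    for _ in range(steps):
        state = sir_step(graph, state, beta, gamma, rng)
        history.append(state)
    return history


# Hypothetical four-node chain: 0 -> 1 -> 2 -> 3.
chain = {0: [1], 1: [2], 2: [3], 3: []}
for snapshot in simulate(chain, {0}, beta=1.0, gamma=1.0, steps=3):
    print(snapshot)
```

Counting the "I" and "R" entries in each snapshot answers the "how many infectees after T time" question for one simulated run.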

Tipping Model: Polynomial-time infection-count prediction.  Non-submodularity.  Algorithm for finding a seed set that can spread infection to the whole graph in minutes with millions of nodes and edges.  Linear Threshold: uses probabilistic thresholds.
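A deterministic tipping model is easy to sketch: a node "tips" once the fraction of its active in-neighbours reaches its threshold, and the process repeats until nothing changes. The code below is my own minimal version under that assumption. The graph, thresholds, and function name are invented, and it does not attempt the seed-set search mentioned above.

```python
def tipping_spread(in_neighbors, threshold, seeds):
    """Iterate a deterministic tipping model to its fixed point.

    in_neighbors: dict node -> list of nodes that influence it
    threshold:    dict node -> fraction of in-neighbours that must be active
    seeds:        initially active nodes
    """
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for node, nbrs in in_neighbors.items():
            if node in active or not nbrs:
                continue
            frac = sum(nb in active for nb in nbrs) / len(nbrs)
            if frac >= threshold[node]:
                active.add(node)
                changed = True
    return active


# Hypothetical five-node graph in which d and e influence each other.
influences = {"a": [], "b": ["a"], "c": ["a", "b"], "d": ["c", "e"], "e": ["d"]}
half = {n: 0.5 for n in influences}
print(sorted(tipping_spread(influences, half, {"a"})))
```

Because activation never reverses, each pass either adds a node or stops, so the loop finishes after at most one pass per node. That is consistent with the polynomial-time claim for predicting the infection count.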

Kempe-Kleinberg-Tardos (KKT): Independent Cascade (IC) Model allows different probability for each connection. Transmission is from sum of weights. Live-Edge (Live-Arc) Model.  Generalized Threshold Model.  Logic-based Diffusion: Annotated rules for diffusion. Write an "annotated program" in logic, based on attributes of nodes and relations, (including probabilities? Maybe not).  Like deterministic tipping model.  Computing fixed point (maximum spread), which is polynomial in time and space.  Multiple labels on nodes and relations.
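For the Independent Cascade part, a minimal simulation sketch looks like the following. It is my own illustration with invented edge probabilities and function names. Each newly activated node gets exactly one chance to activate each inactive out-neighbour, using that edge's own probability, and averaging over many runs estimates the expected spread that influence maximization tries to maximize.

```python
import random
from collections import deque


def independent_cascade(edge_prob, seeds, rng):
    """Run one Independent Cascade simulation.

    edge_prob: dict (u, v) -> probability that u activates v
    Each newly active node gets exactly one chance to activate each
    inactive out-neighbour.
    """
    out = {}
    for (u, v), p in edge_prob.items():
        out.setdefault(u, []).append((v, p))
    active = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v, p in out.get(u, []):
            if v not in active and rng.random() < p:
                active.add(v)
                frontier.append(v)
    return active


def expected_spread(edge_prob, seeds, runs=1000, seed=0):
    """Monte Carlo estimate of the mean number of activated nodes."""
    rng = random.Random(seed)
    total = sum(len(independent_cascade(edge_prob, seeds, rng))
                for _ in range(runs))
    return total / runs


# Hypothetical graph with a different probability on each connection.
edges = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "d"): 0.5, ("c", "d"): 0.5}
print(expected_spread(edges, ["a"]))
```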

Evolutionary Graph Theory: Mutants taking over.  Moran Process.  Goal to compute fixation or extinction probability.  Isothermal graph.
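As a sketch of what "computing the fixation probability" means, the code below uses the standard closed form for a single mutant of relative fitness r in a well-mixed Moran process of size n, and checks it against a simple simulation. This is my own illustration rather than anything shown in the talk, and the function names are invented. As far as I understand it, isothermal graphs are of interest because they share this same fixation probability.

```python
import random


def moran_fixation_exact(r, n):
    """Fixation probability of one mutant with relative fitness r among n."""
    if r == 1:
        return 1 / n
    return (1 - 1 / r) / (1 - 1 / r ** n)


def moran_fixation_simulated(r, n, runs=20000, seed=0):
    """Estimate the same quantity by simulating the mutant count.

    Only steps that change the count matter: given a change, the mutant
    count goes up with probability r / (1 + r) and down otherwise.
    """
    rng = random.Random(seed)
    fixed = 0
    for _ in range(runs):
        mutants = 1
        while 0 < mutants < n:
            mutants += 1 if rng.random() < r / (1 + r) else -1
        fixed += mutants == n
    return fixed / runs


print(moran_fixation_exact(2, 10), moran_fixation_simulated(2, 10))
```

The extinction probability is just one minus the fixation probability, since the mutant lineage must eventually either take over or die out.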

Data-driven approaches: Learn model parameters from past diffusion processes.  Regression model to predict spread from features of past diffusion processes.  EM process hard to scale but accurate for early stage diffusion (non-viral).  Tends to over-fit using small influence probabilities (10^-7).  Temporal effects: exponential decay of influence is slow.  Discrete time model is a good compromise.  Dynamic models are better, Bernoulli model better than Jaccard.  Discrete time comparable to continuous, which is nice.

Credit Distribution Model: Fast runtime for Influence Maximization.

Power-law of cascades, large cascades are rare.  Features that are predictive: seed user's number of followers, number of friends, past local influence; content's tweet length, hashtags, divergence of sentiment; early diffusion's number of out degrees in subgraph, etc.  Regression trees, 0.98 R^2, but poor final result.  Logistic regression with balanced data got good results.  Original poster and content becomes less important as initial cascade progresses.  Number of influencers (incoming edges) has less effect above 3. Number of influencing communities has a greater impact.  Structural diversity is a good feature.

His startup: CrossViral

I also had a good chat with Dr. Paulo a few days later.

Inference in Structured Prediction by Dan Roth, Kai-Wei Chang, Gourab Kundu, and Vivek Srikumar

This was very related to my research area in school, so a lot of this was review for me.  Dan Roth was one of the few names I recognized at the conference.  I have to agree that structured prediction problems are among the most important (and still very general) prediction problems.  They include NLP, information extraction, and image object recognition problems.

Integer Linear Programming (ILP) is cool and very useful and popular these days in ML and AI.

Reading the Web

This talk was very interesting and also related to many of the papers I read in preparation for my PhD dissertation, so a lot of this talk was a review for me.  He talked about bootstrapping (the unsupervised and highly scalable process of learning extraction rules from extracted knowledge and vice-versa) and the strategy of using multiple constraints or sources of evidence to prevent semantic drift.  Tom Mitchell's Never Ending Language Learning (NELL) project and Oren Etzioni's Web-reading work are the more famous projects in this area.

I saw a slide with Justin Betteridge on it, who I already knew was working in the NELL group at CMU.  I knew Justin during my master's degree at BYU.  Very nice guy.

How AI influences Labor Market (Panel Discussion)

Random bits of pessimism and optimism and ego-boosting comparisons among panelists (mostly Oren Etzioni) interspersed with a few random but interesting line plots and quotes.

Google DeepMind (Invited Talk) by Demis Hassabis

DeepMind was sold to Google for $500M.  Demis Hassabis and his company have a very impressive background.  He has been planning DeepMind for 20 years, and intentionally worked in the best of academia and startups as preparation.  Their goal is to "solve intelligence", by which catchy phrase I believe he means to "solve the impossibly hard problems necessary for creating a truly general-purpose artificial intelligence."  Strategies they adopted early on were (1) learn from raw inputs like the pixels on a screen of a video game, (2) be general not brittle and narrow.  As a counter-example, he mentioned that Deep Blue was narrow though still impressive because it would have to be completely retrained to play anything other than chess.

They are known for deep learning, but they are by no means only about deep learning.  They prefer thinking in terms of reinforcement learning.  This involves learning a statistical model of the world, then learning actions.  They believe that solving reinforcement learning would solve general learning and intelligence.  They believe in "grounded cognition".  Robotics hardware is expensive, slow, and it breaks.  Therefore they use simulators which can make as much training data as you want, and they ground themselves in pixels of video game images.  They prevent testing bias, but I forget how.  They can run tests on thousands and millions of agents at once.

They use Deep Reinforcement Learning, which to them means getting reinforcement learning to work at scale.  Their early task was to learn how to play 8-bit games from the '80s on the Atari 2600.  DQN.  Code released and discussed in articles in Nature.  General purpose game learner.

They draw inspiration from the brain and neuroscience.  Their system is a Neural Turing Machine: ANNs with memory.  It is a full Turing Machine.  AlphaGo is their new project.  It is based on pattern-recognition, planning, tree search, Monte Carlo Rollouts.  AlphaGo was the first program to beat a professional Go player on a full sized board.  They have hundreds of papers on their website.  Next month, in March, they will live stream on YouTube AlphaGo playing a big pro, Lee Sedol.  This will be a $1M 5-game challenge match.  In preparation, they are retraining everything.

Difference between AlphaGo and Deep Blue: AlphaGo is a general-purpose, modular system combining deep learning based pattern recognition with planning and search.  They will later remove supervised learning; training time will be longer.

AlphaGo is more human-like.  Human players say AlphaGo plays like a human; so they say it passes a "Turing test" for gamers.

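Montezuma's Revenge is still hard for their game-playing agents (presumably the Atari-playing DQN rather than AlphaGo, since it is an Atari game).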

They used 100 GPUs before being bought by Google while working on Atari games.  This was a response to a question about how to do this kind of research on a small budget.  He says it's a good domain for small budgets.

Extra Bits

I had fun watching the ASU robots do their thing in the foyer between research paper sessions.

One of the posters in the poster session was about learning to automatically generate posters from scientific papers.  I asked her the obvious question.  No, her poster was not automatically generated in toto.  (One of the figures was.)  But she did get some good attention and chuckles when she introduced her topic during her spotlight talk.

Here is a web app (and a university research project) helping people make "fair" decisions involving finite resources:

It was interesting to eavesdrop on Peter Norvig and someone else I did not recognize who were talking about "academic genealogy" and their "shared ancestors".  It seems that the better your academic "pedigree" is, the more you worry about academic pedigrees.


Jay Miller said...

I love this sort of stuff (is "stuff" explanatory enough?) I am amazed at once how far we have come using predictive models which are based on probability but achieving a high degree of accuracy based on those models. Since research takes money - generally at a large scale anyway - I think you see a conglomeration of talent working on such problems making bank on Wall Street. But it is never perfect, past performance is always the basis for statistical decision making, perfectly tweaked as it may be through series of algorithms. Computers beat humans hands down at almost everything, except spark, imagination, inspiration, and sometimes intuition (we are generally pretty bad at that one). I think we always hold that edge, but the possibilities for the future are astounding. I could go on, especially at how far we lag in using some of the technology for improving lives and businesses, egos can be insurmountable obstacles.

Keep it coming.

Thomas Packer said...

Thanks, Jay.

Now I need to figure out how to get my blog to tell me when there are comments so it doesn't take me four months to see them. :-)