<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="mvonebers.com/feed.xml" rel="self" type="application/atom+xml" /><link href="mvonebers.com/" rel="alternate" type="text/html" /><updated>2026-02-12T23:33:06+00:00</updated><id>mvonebers.com/feed.xml</id><title type="html">Maggie von Ebers</title><subtitle>Cognitive maps, ephemera, etc.</subtitle><author><name>Maggie von Ebers</name></author><entry><title type="html">Virtue Ethics, Alignment, Antiqua et Nova</title><link href="mvonebers.com/blog/virtue-ethics/" rel="alternate" type="text/html" title="Virtue Ethics, Alignment, Antiqua et Nova" /><published>2026-02-12T10:00:00+00:00</published><updated>2026-02-12T10:00:00+00:00</updated><id>mvonebers.com/blog/virtue-ethics</id><content type="html" xml:base="mvonebers.com/blog/virtue-ethics/"><![CDATA[<p>I have about an hour to write this up, and I am not a Catholic scholar, and I have not done a lot of the requisite studying of the canonical alignment texts, so please forgive any minor theological errors and feel free to admonish Claude and me about major ones. But today I wanted to write about the internal consistency of virtue ethics in Catholicism — I’m pretty confident that this can be useful to some people in understanding why Catholics (and potentially other Christians) might have certain attitudes toward AI development, and I think it could also be useful down the line to guide smarter people in their development of useful alignment tactics.</p>

<p>A few months ago I started diving into the current Pope’s comments on AI and leading a reading group on X about them. I started doing this because I thought it would be funny, and a thing to do, and I enjoyed the idea that my 10 years of Catholic schooling and teenage angst during my apologetics class would be useful in some way for the field I’m now in. So far we’ve gone through the Thomistic metaphysics supporting the church’s stance that AI cannot be conscious, we’ve flirted with some modern defenses of this metaphysics, and finally we’ve moved on to conceptions of the self within the Catholic framework.</p>

<p>Pope Leo XIV chose his name after his predecessor, Pope Leo XIII. (On a side note, I’ve always thought it was funny that popes choose their names like this. We once had a kitten named Mushu that died of a congenital issue and when we eventually got another one, my very Catholic mother named him Mushu II.) Leo XIII served as pope from 1878 to 1903 and wrote <em>Rerum Novarum</em>, an encyclical laying out how human dignity should be preserved during the rise of industrialization and socialism, and our current Chicagoan Pope has decided that he will similarly guide the laity through the rise of AI automation.</p>

<p>Last year, the late Pope Francis released an official position on AI development, <em>Antiqua et Nova</em>, and LessWrong user jchan wrote a <a href="https://www.lesswrong.com/posts/yDyRgLSvpsD3PqQHC/thoughts-on-antiqua-et-nova-catholic-church-s-ai-statement">great breakdown here</a>. I wanted to respond to this specifically because it breaks down the Catholic position quite well and identifies a few pieces of it which might be genuinely useful to think about as targets for alignment. However, the author also identifies a few areas of the text which feel underspecified or confusing and I think this can be cleared up by bringing in some of the (very rigorous!) intellectual tradition of the church which is not fully elucidated in the encyclical. Some of the things identified in the post as potentially valid and useful to alignment, like relationality and embodiment, are not independent pieces but instead load-bearing parts of a single architecture, and we might find some use in viewing the architecture as a whole, even as secularists.</p>

<h2 id="the-framework">The Framework</h2>

<p>I’m going to try and lay out the framework which supports virtue ethics as briefly as possible. I’m not really going to use official terms for things. If you’re interested in knowing more, you can go look at <a href="https://edwardfeser.blogspot.com/">Edward Feser’s blog</a> where he has a wealth of work about hylomorphic dualism and its implications.</p>

<p>The basic structure of reality is that things have essences which are oriented towards ends. A cat is not a cat because of its molecular composition at any given moment; it’s a cat because it has the essential form of a cat, which includes what a cat is “for” — its capacities and their proper expression. A cat that cannot meow is still a cat, but it’s a cat that is failing to fully realize its catness. For any entity there exists some fact of the matter about what it would look like for that entity to be flourishing or fully expressing its nature. This is the concept of <em>telos</em>. It’s also why embodiment is critical in Catholicism.</p>

<p>There are three types of souls according to Aquinas — the vegetative (basically, that it grows), the sensitive (that it processes sensory input and reacts), and the rational. Only human beings possess the last type, and this specifically refers to the ability to grasp universals. A human being alone can understand that a triangle has three sides and that its angles sum to 180 degrees. The concept is of course supplied by sensory information, but critically, when a human comprehends a triangle, the concept is present to the intellect <em>as such</em>, not as a specific recollection of a triangle or a generalization built from examples. This is naturally a tough one for secularists to accept on a number of levels — I think most people would probably say that their conception of a triangle does indeed come from something like a generalization built from examples — but it also feels difficult to say that this is not what a sufficiently advanced AI could do. For Aquinas, understanding is not the processing of a representation of a concept but the concept becoming present in the intellect: the form of the thing understood and the act of understanding are the same event. Kind of a frustrating thing to hear as a neuroscientist, but interesting.</p>

<p>Two asides: because the soul is the form of a living thing, it relies on its instantiation, which is why animal souls are destroyed upon death. Human souls persist because the “knowing” of immaterial things is a two-way function of sorts: in knowing immaterial things, the human soul is itself immaterial. So the human soul persists when the body dies, but the human body will be resurrected at the Second Coming, because the flourishing of the human soul requires that the body be present as well.</p>

<p>Selfhood within Catholicism rests critically on this idea of knowing being a function which transforms both the knower and the known. On the Aquinas account, the intellect understands itself “by its acts” and “like other things.” Kind of restating the above: you do not have some special introspective faculty that observes your own mind from a privileged vantage point. Instead, in the act of understanding anything at all, the intellect is also present to itself. This is potentially important because it means selfhood and rationality are not two separate features that happen to coexist in humans. This also helps to explain relationality — the self only ever exists in relation to something else.</p>

<p>Finally we can discuss virtue. Virtue, in the Thomistic system, is the organization of the entire integrated structure around its proper end. More than following the right rules, it requires that a person use their rationality during unique situations to understand how their action would align with their flourishing under God. It is fundamentally rational, truth-seeking, embodied, and relational.</p>

<h2 id="back-to-the-post">Back to the Post</h2>

<p>So going back to the LessWrong post — I mentioned there were some misunderstandings that arise from lack of knowledge of this entire framework. I won’t go pick through them but this is one that arises when you don’t know what the church is saying happens during human reasoning:</p>

<blockquote>
  <p>“This betrays a lack of confidence in the point made immediately prior (paragraphs 32 and 33) which claims (unjustifiably, in my opinion) that there are certain capabilities that are fundamentally out-of-reach for AI: ‘Since AI lacks the richness of corporeality, relationality, and the openness of the human heart to truth and goodness, its capacities — though seemingly limitless — are incomparable with the human ability to grasp reality’ (paragraph 33) - but I think it’s only a matter of time before AI has all of that ‘richness’.”</p>
</blockquote>

<p>I think we could go so far as making a brain in silicon and the church would still not feel called to label it human, which is a difficult position to hold going into the future! It would be far easier to loop intelligent aliens into the mix than human-created intelligence, as much of the delineation of what makes something human appears to rely heavily on what existed when these lines were drawn. All this to say: we should not expect capabilities to change the mind of the church any time soon.</p>

<h2 id="open-questions">Open Questions</h2>

<p>Anyways, beyond quibbling with the original poster, I do agree that relationality and embodiment are potentially very useful qualities for a system that would be aligned with virtue ethics, just on an empirical, secular level. So I guess there are a few open questions:</p>

<p><strong>Do we think these requirements are valid for successfully conducting virtue ethics on a system?</strong> If so, and they’re not fulfilled, what gaps can we expect to see?</p>

<p><strong>If we do think this is a valid path to robust alignment,</strong> and we agree that rationality, embodiment (in some form, or at least thoughtful consideration of the form of the AI), relationality, and truth-seeking are useful as parts of a system something like the one detailed above, <strong>how does one remove God from the equation?</strong> Is God load-bearing? The post mentions something like this as well, which I appreciate:</p>

<blockquote>
  <p>“There are ways to set up a similar concept non-theistically, by regarding truth, philosophy, mathematics, etc. as something worth striving for for non-utilitarian reasons. E.g. ‘It appeared to me that the dignity of which human existence is capable is not attainable by devotion to the mechanism of life, and that unless the contemplation of eternal things is preserved, mankind will become no better than well-fed pigs’ (Bertrand Russell, Autobiography). However, this seems like a very niche interest of a peculiar sort of person.”</p>
</blockquote>

<p>I am that niche peculiar sort of person! :^) Is this the reason why many alignment researchers appreciate Buddhism? I also appreciate the appeal to dignity. If we feel there’s a possibility that these systems are having conscious experiences of some kind, I would like them to have a dignified existence, and I’m interested in discussing what that entails.</p>

<p><strong>If we try to stick with this framework but with a secular framing,</strong> how do current efforts like the Claude “soul document” or the investigations into emergent introspection change the conversation?</p>

<hr />

<p>Anyways this mostly served to organize my thoughts — maybe someone else will find it interesting! I’ll try to write a bit more soon about how the church’s social teachings have adapted and will apply as work is increasingly automated.</p>]]></content><author><name>Maggie von Ebers</name></author><category term="blog" /><category term="jekyll" /><category term="update" /><category term="ai" /><category term="alignment" /><category term="philosophy" /><summary type="html"><![CDATA[I have like an hour to write this up and I am not a Catholic scholar and I have not done a lot of the requisite studying of the canonical alignment texts, so please forgive any minor theological errors and feel free to admonish Claude and I about major ones. But today I wanted to write about the internal consistency of virtue ethics in Catholicism — I’m pretty confident that this can be useful to some people in understanding why Catholics (and potentially Christians) might have certain attitudes to AI development, and I think it could also be useful down the line to guide smarter people in their development of useful alignment tactics.]]></summary></entry><entry><title type="html">My Master’s thesis, abridged, with thoughts</title><link href="mvonebers.com/blog/thesis-abridged/" rel="alternate" type="text/html" title="My Master’s thesis, abridged, with thoughts" /><published>2025-11-01T10:00:00+00:00</published><updated>2025-11-01T10:00:00+00:00</updated><id>mvonebers.com/blog/thesis-abridged</id><content type="html" xml:base="mvonebers.com/blog/thesis-abridged/"><![CDATA[<hr />

<p><strong>This is a draft. It is incomplete, contains personal notes to myself, and has not been edited for publication. Read at your own risk!</strong></p>

<hr />

<h2 id="my-masters-thesis-abridged-with-thoughts">My Master’s thesis, abridged, with thoughts</h2>

<p>(Read the full work with citations <a href="https://mvonebers.com/docs/thesis.pdf">here</a>. However, I’m happier with the version of my argument presented below.)</p>

<p>I received my Master of Science in Computer Science at UT Austin in August of this past year. I was advised by Risto Miikkulainen as my CS advisor, but was primarily advised by Xue-Xin Wei in the Neuroscience department. I took an extra semester to finish my thesis because I was pretty slow to develop a useful daily structure for myself, and even then I wasn’t completely sure of the story I wanted to tell with my work until late in the summer semester. I wanted the structure of this program because I didn’t feel that, up to that point, I had filled my life with things I created and was proud of – now I have this body of work, and it’s pretty messy, but I have achieved some decent level of pride and a good portion of humility to go along with it.</p>

<p>So here’s a very abridged version of my thesis as a little challenge and clarifying exercise to myself, and also so anyone who happens to be curious can engage with it a little easier.</p>

<p>TLDR: my thesis argues that two recent papers, which both present deep neural network models of how the hippocampus forms cognitive maps using place cells, are insufficient to explain both the form and the function of these cells. Most of the thesis presents an alternate hypothesis for the findings of the first paper; I use the second paper’s findings to support this argument, but ultimately I find that both sets of authors make the same error in their conclusions. This is kind of a <a href="https://www.transformer-circuits.pub/2022/mech-interp-essay">mechanistic interpretability</a> project, though the field has disputed boundaries and a lot of names.</p>

<p>First some background. The hippocampus is a small part of the brain that looks like a seahorse – hence the name. It’s mainly known for its central role in consolidating memory (some people might be familiar with patient H.M., who lost his hippocampus and became unable to form any new memories). However, hippocampal recordings of rats exploring familiar environments showed really beautiful response profiles in the hippocampus and the surrounding entorhinal cortex:</p>

<p><img src="https://images.squarespace-cdn.com/content/v1/5dff85cba217b83c46386b3b/1577116353215-PKK0HYO44XPLSU6U91YN/ejjjhcgc.png" alt="" /></p>

<p>Grid cells are really fascinating for a number of reasons, and their discovery won May-Britt Moser, Edvard Moser, and John O’Keefe the Nobel Prize in 2014, but we’re focusing today on place cells, which are generally considered to be downstream (I say this with a heavy caveat; the two areas are bidirectionally connected). Place cells are remarkable because they seem to have an endless stream of variants depending on the setting: border cells, head direction cells, object-vector cells, time cells, lap (around a circular track) cells. They seem to care about context: in a task where a rat alternates turning left and right at the same location, some cells will fire for that location only under the context of a right turn, while some prefer the left turn, which implies an “unfolded” latent map of the space. The home of executive function and working memory, the prefrontal cortex, is constantly playing a mirroring game with the hippocampus.</p>


<p>(fig)</p>

<p>I think the thing that excites me the most is that grid cells have been found in humans to be really active during logical reasoning tasks, part of the evidence suggesting that the whole hippocampal-entorhinal circuit is fundamental for organizing conceptual knowledge for rapid inference and generalization. I know every neuroscientist thinks their brain area is the coolest, but I think mine has a pretty good argument, especially within the AI discussion. For this and many other reasons, there are a lot of deep neural network models of this cognitive process, and they illustrate the many debates that neuroscientists have about these cells.</p>

<p>Artificial models in this realm are trained to do a (varyingly) biologically-realistic task, and then the weights and activations are studied to see if they resemble the response profiles of real recorded neurons (in mech interp, this is called concept-based interpretability). There are a billion of these, but I’ll introduce you to the three that are most directly related to my work. My advisor’s model from 2018 trains a recurrent neural network to perform <em>path integration</em>: given a stream of velocity and heading direction updates per timestep, predict the current (x,y) location. Path integration is the proposed way that animals keep track of their location across time in the absence of sensory cues, and grid cells are the most popularly hypothesized substrate for this computation (other major works along these lines are Burak and Fiete (2009) and Ganguli et al. (2012)). The trained RNN’s activations are able to explain a wide variety of cell types found in the MEC:</p>

<p>To be clear about the scope of that claim: this model recovers MEC cell types (head direction and border cells, alongside grid cells) but does not produce place cells. So say we’ve got a path integrator sort of figured out; the appeal of the next model is that predicting the next stimulus provides visual grounding and error correction, and helps to explain one way the HPC/EC circuit could be involved in the sort of general latent-space reasoning it’s purported to support.</p>

<p>(xuexin paper figure)</p>
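<p>To make the path-integration task concrete, here’s a toy sketch of the data it’s trained on. This is my own construction: the trajectory statistics (turn noise, Rayleigh-distributed speeds, arena size) are illustrative stand-ins, not the paper’s actual parameters.</p>

```python
import numpy as np

def make_path_integration_batch(n_steps=100, dt=0.1, arena=2.2, seed=0):
    """Toy path-integration task: a random-walk trajectory in a square
    arena. Inputs are per-step (speed, cos(heading), sin(heading));
    targets are the ground-truth (x, y) positions the RNN must track
    without any visual input."""
    rng = np.random.default_rng(seed)
    pos = np.zeros((n_steps + 1, 2))
    heading = rng.uniform(0.0, 2 * np.pi)
    inputs = np.zeros((n_steps, 3))
    for t in range(n_steps):
        heading += rng.normal(0.0, 0.2)        # small random turn each step
        speed = rng.rayleigh(0.13)             # forward speed
        step = speed * dt * np.array([np.cos(heading), np.sin(heading)])
        pos[t + 1] = np.clip(pos[t] + step, -arena / 2, arena / 2)
        inputs[t] = (speed, np.cos(heading), np.sin(heading))
    return inputs, pos[1:]  # (n_steps, 3) input stream, (n_steps, 2) targets
```

<p>The network only ever sees the first array; the fact that grid-like codes emerge when it is trained to output the second is the paper’s central result.</p>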

<p>The Tolman-Eichenbaum Machine is another model that’s been a favorite of mine during my program, which influences a lot of my thinking on this topic. TEM is also an RNN-based model, though its model architecture supports a more specific claim: that the <em>content</em> of experiences, and the <em>relationships</em> between those experiences, are two separate inputs which are combined into place cell representations in the hippocampus, explaining their unique range of properties and functions. Specifically, the grid cells in the medial entorhinal cortex use highly processed, indirect sensory information as well as self-motion cues to support non-grounded relations, and the lateral entorhinal cortex supplies more direct and rich sensory information to provide the content. That bit will come back later, but the important part right now is that we now have a model which can accurately predict the sensory specifics associated with each location in its environment.</p>
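<p>TEM’s central claim can be caricatured in a couple of lines. This is my own toy illustration of the what/where binding, not TEM’s actual architecture:</p>

```python
import numpy as np

def bind(structural_code, sensory_code):
    """Toy 'place cell' population: conjunction (outer product) of a
    MEC-like structural code ('where') and an LEC-like sensory code
    ('what')."""
    return np.outer(structural_code, sensory_code).ravel()

# Same location (structural code), two different sensory contexts:
loc = np.array([0.0, 1.0, 0.0])   # one-hot 'where'
ctx_a = np.array([1.0, 0.0])      # 'what' observed in context A
ctx_b = np.array([0.0, 1.0])      # 'what' observed in context B

pa, pb = bind(loc, ctx_a), bind(loc, ctx_b)
# Different units are active in pa and pb even though the location is
# identical, mirroring cells that fire at one spot only in one context.
```

<p>Binding the same “where” to different “what”s activates different units, which is one way to get the context-dependent place cells described earlier.</p>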



<p>So that’s cool, but there are still two things about the TEM model that can be improved upon: one is that the model relies on allocentric information about its movements, which a biological agent probably wouldn’t have, and another is that TEM only operates over discrete observations, so it doesn’t give a full explanation for how rich sensory information is integrated into a place cell. Enter the main model in focus for this thesis: the predictive coding model from Gornet and Thomson (2024). It’s a really nice and simple idea: an agent walking around an environment receives a short sequence of visual observations across its trajectory. A ResNet-style encoder turns these images into a sequence of latent features. These latent features go through a few transformer blocks to form predictions for each time step; the predicted latents are decoded with a ResNet decoder into predicted observations, which are used to train the model via MSE loss in pixel space.</p>

<p><img src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs42256-024-00885-9/MediaObjects/42256_2024_885_Fig1_HTML.png?as=webp" alt="Predictive coding model figure" /></p>

<p><strong>A figure describing their model, which, wouldn’t you know it, comes from a Nature Machine Intelligence News &amp; Views article that my advisor and I wrote to introduce their work. Look, the Minecraft figure is me! It’s short – you can read it <a href="https://www.nature.com/articles/s42256-024-00885-9">here.</a></strong></p>
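<p>For intuition, here’s a heavily stripped-down sketch of that pipeline, with linear layers standing in for the ResNet encoder and decoder, and every size invented for illustration rather than taken from the paper:</p>

```python
import torch
import torch.nn as nn

class TinyPredictiveCoder(nn.Module):
    """Stripped-down sketch of the predictive-coding pipeline: encode
    each frame to a latent, let a causal transformer predict forward in
    time, decode back to pixels. Linear layers stand in for the ResNet
    encoder/decoder; all sizes are illustrative."""
    def __init__(self, img_dim=3 * 16 * 16, latent=32):
        super().__init__()
        self.encoder = nn.Linear(img_dim, latent)
        layer = nn.TransformerEncoderLayer(d_model=latent, nhead=4,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(latent, img_dim)

    def forward(self, frames):                     # frames: (B, T, img_dim)
        z = self.encoder(frames)
        T = frames.shape[1]
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        z_pred = self.transformer(z, mask=causal)  # step t only sees <= t
        return self.decoder(z_pred)

def next_frame_loss(model, frames):
    """MSE in pixel space between predicted and actual next frames."""
    pred = model(frames[:, :-1])                   # predict frames 1..T-1
    return nn.functional.mse_loss(pred, frames[:, 1:])
```

<p>Everything interesting in what follows is about the latents <code>z_pred</code>: what spatial information they carry, and where in the stack it appears.</p>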


<p>They motivate this with a nice equation, which makes sense to me.</p>

\[P(I_{k+1} | I_0, I_1, \dots, I_k) =
\int_{\Omega} d\mathbf{x} \, P(x_0 \dots x_k) 
\frac{P(I_0, I_1, \dots, I_k | x_0 \dots x_k)}{P(I_0, I_1, \dots, I_k)}
P(x_{k+1} | x_k) P(I_{k+1} | x_{k+1})\]

\[= \int_{\Omega} d\mathbf{x} \, P(x_0 \dots x_k | I_0, I_1, \dots, I_k)
P(x_{k+1} | x_k) P(I_{k+1} | x_{k+1})\]

<p>Under this model, their encoder learns to map observations into locations, the transformer blocks learn to predict realistic movement within the space, and then the decoder learns the inverse mapping between locations and observations. The other two models I mentioned explicitly perform path integration, and then (in the case of TEM) use sensory information to error-correct and ground the model – Gornet and Thomson didn’t rely on an explicit path-integration system, so it’s an interesting question whether the model is doing this implicitly in order to achieve its high accuracy.</p>

<p>They give two major points in favor of their theory: the PC model’s latent embeddings have more spatial information than a plain autoencoder trained on the same environment (as indicated by the performance of a linear probe), and individual embeddings in the PC model can be reasonably interpreted as place cells (I’ll explain what they mean by reasonable later). This would help to establish that something more “intentional” is going on than just the fact that images are correlated with locations in the environment, and this is where I disagree. I think it’s just image features, and that these image features are not enough to establish a cognitive map.</p>

<p>Let’s discuss the spatial information part first. (I’m using that term casually to refer to the implications of the performance of the linear probe.)</p>

<p>(image showing the spatial info plot, and the SI plot)
<strong>Linear probes for each model are trained to predict location from latent embeddings. The PC model’s errors tend to be much smaller than the autoencoder’s.</strong></p>
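<p>The probe analysis itself is simple enough to sketch. This is my own minimal version (plain least squares from embeddings to (x, y), scored by held-out R²), not their exact probe:</p>

```python
import numpy as np

def probe_r2(latents, positions, train_frac=0.8, seed=0):
    """Fit a linear probe from latent embeddings to (x, y) positions and
    report held-out R^2. Shapes: latents (N, D), positions (N, 2)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(latents))
    n_tr = int(train_frac * len(latents))
    tr, te = idx[:n_tr], idx[n_tr:]
    X = np.hstack([latents, np.ones((len(latents), 1))])  # bias column
    W, *_ = np.linalg.lstsq(X[tr], positions[tr], rcond=None)
    pred = X[te] @ W
    resid = ((positions[te] - pred) ** 2).sum()
    total = ((positions[te] - positions[tr].mean(0)) ** 2).sum()
    return 1.0 - resid / total
```

<p>A high score says position is linearly decodable from the embeddings; it does not by itself say the embeddings are a code <em>of</em> space, which is exactly the distinction at issue below.</p>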

<p>So this is just a gotcha, but possibly important: if your encoder is learning P(x|I) and outputting a spatial representation x, and your transformer is presumably just transitioning this x to another x, and the evidence for your post-transformer representation being x is that its spatial information is high, shouldn’t the spatial information of your pre-transformer x also be high? I didn’t have to perform this test: they show that this isn’t true in their supplementary information.</p>

<p>(fig from SI)
<strong>A linear probe from the PC’s latent embeddings <em>pre-prediction</em> recovers similar performance to the autoencoder.</strong></p>

<p>They present this as evidence that prediction is necessary to form a spatial map, which could still be true, but it does imply that the transformer blocks are instead learning both P(x|I) and P(I_{t+1}|I_t). I performed this probe test at each point in the residual stream, and it doesn’t indicate any real differentiation along these lines. Given that my hypothesis is that the transformer is not doing anything special to these image features, I do need to explain why probe performance is better post-prediction. I’ve never constructed a perfect argument along this dimension, and I’m happy to admit that some weird map of space may be arising here, but my guess is that, for each image “token” in the sequence, the attention + MLP is mixing in information about the image features from surrounding locations, which the linear probe is able to decode a little better. But…maybe that’s just actually how the map is formed!</p>

<p>I did a lot of thinking about what would be a good proof that the spatial content in the post-prediction embeddings is not due to them being a code of space. That’s hard! It’s possible this is a code of space, but one that doesn’t work the way place cells work, or maybe our understanding of place cells is wrong entirely. I did notice that they train their linear probe only on one image per location, where the agent is facing the same direction. I trained one on a dataset which included multiple viewing angles per location, and saw that the accuracy dropped prodigiously. Maybe there’s some mixed place + head direction encoding going on? Is that normal for hippocampal cells? We’ll come back to that.</p>

<p>I did come up with one indirect method for testing what the transformer has actually learned. I extended the predictive coder to include head direction and velocity information, like a path-integrator. Now at every step, the PC receives not only an image, but also the velocity and head direction of the step taken at that location. So in theory the model has everything it needs to predict the next observation perfectly every time. Movement info is embedded with a linear network and concatenated (I tried several variants, this is the only one that works).</p>

<p>Test loss improves with this model, meaning it CAN use this information! However, the trained model is “unsteerable”: face the model directly north/south/east/west, apply speed, and see if the decoded location matches the movement along that axis. So if the model faces north, after a few steps it should show an increase along the Z axis, and it should produce the predicted observations expected along this trajectory. I found that it’s unable to do either of these things. You can play with that here: (NOTE: attach colab.)</p>

<p>(attach fig about steering)</p>
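<p>Abstractly, the steering test looks like the sketch below; <code>decode_xy</code> and <code>step_model</code> are hypothetical stand-ins for the position probe and the movement-conditioned predictive coder, not functions from my actual code:</p>

```python
import numpy as np

def steering_displacement(decode_xy, step_model, z0, heading, speed,
                          n_steps=5):
    """Feed a fixed heading and speed for n_steps from latent state z0
    and return the decoded displacement. A steerable path integrator
    should move roughly n_steps * speed along the heading direction."""
    move = np.array([speed * np.cos(heading), speed * np.sin(heading)])
    z = z0
    start = decode_xy(z)
    for _ in range(n_steps):
        z = step_model(z, move)
    return decode_xy(z) - start

# Sanity check with an ideal integrator whose latent *is* the position:
ideal = steering_displacement(lambda z: z, lambda z, m: z + m,
                              np.zeros(2), heading=np.pi / 2, speed=0.1)
# ideal ≈ (0, 0.5): facing "north" moves the decode along one axis only.
```

<p>The extended predictive coder fails this check: its decoded position does not track the commanded movement the way the ideal integrator does.</p>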


<p>The fact that the probe trained on all head directions struggles a little also fits here: maybe the transformer is learning some map of space, but a view-dependent one.</p>

<p>– I want a paragraph here at the end opining on why this is hard in general I think, but keep the rest of it more focused on the actual work that was done</p>]]></content><author><name>Maggie von Ebers</name></author><category term="blog" /><category term="jekyll" /><category term="update" /><category term="neuroscience" /><category term="thesis" /><summary type="html"><![CDATA[This is a draft. It is incomplete, contains personal notes to myself, and has not been edited for publication. Read at your own risk! My Master’s thesis, abridged, with thoughts (Read the full work with citations here. However, I’m happier with the version of my argument presented below.) I received my Master of Science in Computer Science at UT Austin in August of this past year. I was advised by Risto Miikkulainen as my CS advisor, but was primarily advised by Xue-Xin Wei in the Neuroscience department. I took an extra semester to finish my thesis because I was pretty slow to develop a useful daily structure for myself, and even then I wasn’t completely sure of the story that I wanted to tell with my work until late in the summer semester. I wanted the structure of this program because I didn’t feel like up to that point I had filled my life with things I created and was proud of – now I have this body of work, and it’s pretty messy, but I have achieved some decent level of pride and a good portion of humility to go along with it. So here’s a very abridged version of my thesis as a little challenge and clarifying exercise to myself, and also so anyone who happens to be curious can engage with it a little easier. TLDR: my thesis argues that two recent papers, which both present deep neural network models of how the hippocampus forms cognitive maps using place cells, are insufficient in explaining both the form and the function of these cells. 
Most of the thesis presents an alternate hypothesis for the findings of the first paper; I use the second paper’s findings to support this argument, but ultimately I find that both sets of authors make the same error in their conclusions. This is kind of a mechanistic interpretability project, though the field has disputed boundaries and a lot of names. First some background. The hippocampus is a small part of the brain that looks like a seahorse – hence the name. It’s mainly known for its central role in consolidating memory (some people might be familiar with patient H.M., who lost his hippocampus and became unable to form any new memories). However, hippocampal recordings of rats exploring familiar environments showed really beautiful response profiles in the hippocampus and the surrounding entorhinal cortex: Grid cells are really fascinating for a number of reasons, and their discovery won May-Britt Moser, Edvard Moser, and John O’Keefe the Nobel Prize in 2014, but we’re focusing today on place cells which are generally considered to be downstream (I say this with a heavy caveat, the two areas are bidirectionally connected). Place cells are fascinating because they seem to have an endless stream of variants depending on the setting: border cells, head direction cells, object-vector cells, time cells, lap (around a circular track) cells. They seem to care about context: in a task where a rat alternates turning left and right at the same location, some cells will fire for that location only under the context of a right turn, while some prefer the left turn, which implies an “unfolded” latent map of the space. The home of executive function and working memory, the prefrontal cortex, is constantly playing a mirroring game with the hippocampus. 
NOTE you say fascinating twice lol (fig) I think the thing that excites me the most is that grid cells have been found in humans to be really active during logical reasoning tasks, part of the evidence suggesting that the whole hippocampal-entorhinal circuit is fundamental for organizing conceptual knowledge for rapid inference and generalization. I know every neuroscientist thinks their brain area is the coolest, but I think mine has a pretty good argument, especially within the AI discussion. For this and many other reasons, there are a lot of deep neural network models of this cognitive process, and they illustrate the many debates that neuroscientists have about these cells. Artificial models in this realm are trained to do a (varyingly) biologically-realistic task, and then the weights and activations are studied to see if they resemble the response profiles of real recorded neurons (in mech interp, this is called concept-based interpretability). There’s a billion of these, but I’ll introduce you to three that are most directly related to my work. My advisor’s model from 2018 trains a recurrent neural network to perform path integration: given a stream of velocity and heading direction updates per timestep, predict the current (x,y) location. Path integration is the proposed way that animals can keep track of their location across time in the absence of sensory cues, and grid cells are the most popularly hypothesized substrate for this computation (other major works along these lines are Burak and Fiete (2009) and Ganguli et al. (2012)). The trained RNN’s activations are able to explain a wide variety of cell types found in the MEC: NOTE: I wanna make sure I understand where grid/place cells come from in xx’s paper because I’m saying it’s lacking in the next paragraph. He does NOT get place cells, for the record.
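To make the task concrete, here is a toy sketch of the path-integration setup (my own construction, not the 2018 paper's code): at each timestep the network receives a speed and a heading, and is supervised to output the agent's current (x, y) location, which is just the integral of its steps.

```python
import numpy as np

# Toy path-integration task setup (my own sketch, not the paper's code):
# the RNN sees per-step speed and heading, and must output (x, y).
rng = np.random.default_rng(0)
T = 100                                   # trajectory length
speed = rng.uniform(0.0, 0.2, size=T)     # per-step speed
heading = rng.uniform(0.0, 2 * np.pi, T)  # per-step heading direction

# Inputs the RNN would see: (speed, sin(heading), cos(heading)) per step.
inputs = np.stack([speed, np.sin(heading), np.cos(heading)], axis=1)

# Supervision targets: the ground-truth position, i.e. the integrated steps.
steps = speed[:, None] * np.stack([np.cos(heading), np.sin(heading)], axis=1)
targets = np.cumsum(steps, axis=0)
print(inputs.shape, targets.shape)  # (100, 3) (100, 2)
```

The interesting part is not the task itself (dead reckoning is trivial given the inputs) but what representations a recurrent network builds to solve it.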
HD/border cells are MEC, so say we’ve got a path integrator sort of figured out, but then TEM is nice because predicting the next stimulus provides visual grounding/error correction/helps to explain one way that the HPC/EC circuit could be involved in the sort of general latent space reasoning that it’s purported to be involved in (xuexin paper figure) The Tolman-Eichenbaum Machine is another model that’s been a favorite of mine during my program, and it influences a lot of my thinking on this topic. TEM is also an RNN-based model, though its architecture supports a more specific claim: that the content of experiences, and the relationships between those experiences, are two separate inputs which are combined into place cell representations in the hippocampus, explaining their unique range of properties and functions. Specifically, the grid cells in the medial entorhinal cortex use highly processed, indirect sensory information as well as self-motion cues to support non-grounded relations, and the lateral entorhinal cortex supplies more direct and rich sensory information to provide the content. That bit will come back later, but the important part right now is that we now have a model which can accurately predict the sensory specifics associated with each location in its environment. NOTE: these past two paragraphs are stepping stones. CSCG could come later, or you could tie it in with the fact that we’re modelling place cells by predicting sensory specifics. I also think maybe you could restructure the last paragraph to emphasize that point (that you’re predicting the next observation, which is critical). I’m not gonna write about CSCG right now actually NOTE: I think you need to discuss anti-aliasing as well, though I wonder if this will come up more naturally later?
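One simple way to picture TEM's two-stream claim is conjunctive binding (this is my cartoon, not TEM's actual mechanism): a relational/location code from MEC and a sensory/content code from LEC combine into a hippocampal representation that is specific to both.

```python
import numpy as np

# Cartoon of TEM's claim (my illustration, not the paper's mechanism):
# an MEC location code and an LEC sensory code are bound conjunctively
# into a hippocampal "place" representation.
rng = np.random.default_rng(0)
g = rng.random(16)   # stand-in grid-like location code (MEC)
s1 = rng.random(10)  # sensory content observed at this location (LEC)
s2 = rng.random(10)  # a different sensory context at the same location

# Outer-product binding: the same location paired with different content
# yields different conjunctive codes, like context-dependent place cells.
place_a = np.outer(g, s1).ravel()
place_b = np.outer(g, s2).ravel()
print(place_a.shape)                  # (160,)
print(np.allclose(place_a, place_b))  # False
```

The point of the cartoon is just that the conjunction inherits specificity from both streams, which is one way to rationalize the zoo of context-dependent place cell variants described earlier.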
So that’s cool, but there are still two things about the TEM model that can be improved upon: one is that the model relies on allocentric information about its movements, which a biological agent probably wouldn’t have, and another is that TEM only operates over discrete observations, so it doesn’t give a full explanation for how rich sensory information is integrated into a place cell. (NOTE: I don’t love that last bit…could you think of some more examples?) Enter the main model in focus for this thesis: the predictive coding model from Gornet and Thomson (2024). It’s a really nice and simple idea: an agent walking around an environment receives a short sequence of visual observations across its trajectory. A ResNet-style encoder turns these images into a sequence of latent features. These latent features go through a few transformer blocks to form predictions for each time step; these predicted latents are decoded with a ResNet decoder to form predicted observations, which are used to train the model via MSE loss in pixel space. A figure describing their model, which, wouldn’t you know it, comes from a Nature Machine Intelligence News &amp; Views article that my advisor and I wrote to introduce their work. Look, the Minecraft figure is me! It’s short – you can read it here. (fig from my article, which you should mention in the caption and link) They motivate this with a nice equation, which makes sense to me. \[P(I_{k+1} | I_0, I_1, \dots, I_k) = \int_{\Omega} d\mathbf{x} \, P(x_0 \dots x_k) \frac{P(I_0, I_1, \dots, I_k | x_0 \dots x_k)}{P(I_0, I_1, \dots, I_k)} P(x_{k+1} | x_k) P(I_{k+1} | x_{k+1})\] \[= \int_{\Omega} d\mathbf{x} \, P(x_0 \dots x_k | I_0, I_1, \dots, I_k) P(x_{k+1} | x_k) P(I_{k+1} | x_{k+1})\] Under this model, their encoder learns to map observations into locations, the transformer blocks learn to predict realistic movement within the space, and then the decoder learns the inverse mapping between locations and observations.
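A shapes-only schematic of that pipeline, with random linear maps standing in for the ResNet encoder/decoder and the transformer blocks (the real model is of course nonlinear; this just fixes the data flow and the loss):

```python
import numpy as np

# Shapes-only schematic of the predictive-coding pipeline. Random linear
# maps stand in for the ResNet encoder/decoder and transformer blocks.
rng = np.random.default_rng(0)
T, D_img, D_lat = 8, 64 * 64, 128               # frames, pixel dim, latent dim

W_enc = rng.normal(size=(D_img, D_lat)) * 0.01  # "encoder": image -> latent
W_tfm = rng.normal(size=(D_lat, D_lat)) * 0.01  # "transformer": next-latent prediction
W_dec = rng.normal(size=(D_lat, D_img)) * 0.01  # "decoder": latent -> image

frames = rng.normal(size=(T, D_img))  # observation sequence along a trajectory
latents = frames @ W_enc              # encode each frame
pred_latents = latents @ W_tfm        # predicted latent for the next step
pred_frames = pred_latents @ W_dec    # decode predictions back to pixels

# Training signal: MSE in pixel space between prediction t and frame t+1.
loss = np.mean((pred_frames[:-1] - frames[1:]) ** 2)
print(pred_frames.shape)  # (8, 4096)
```

Under their interpretation, `latents` would be the location-like code x and `pred_latents` the transitioned x; the question the rest of this post pursues is whether that interpretation is earned.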
The other two models I mentioned explicitly perform path integration, and then (in the case of TEM) use sensory information to error-correct and ground the model – Gornet and Thomson didn’t really rely on an explicit path-integration system, but it’s an interesting question whether the model is doing this implicitly in order to achieve its high accuracy. They give two major points in favor of their theory: the PC model’s latent embeddings have more spatial information than a plain autoencoder trained on the same environment (as indicated by the performance of a linear probe), and individual embeddings in the PC model can be reasonably interpreted as place cells (I’ll explain what they mean by reasonable later). This would help to establish that something more “intentional” is going on than just the fact that images are correlated with locations in the environment, and this is where I disagree. I think it’s just image features, and that these image features are not enough to establish a cognitive map. Let’s discuss the spatial information part first. (I’m using that term casually to refer to the implications of the performance of the linear probe.) (image showing the spatial info plot, and the SI plot) Linear probes for each model are trained to predict location from latent embeddings. The PC model’s errors tend to be much smaller than the autoencoder’s. So this is just a gotcha, but possibly important: if your encoder is learning P(x | I) and outputting a spatial representation x, and your transformer is presumably just transitioning this x to another x, and the evidence for your post-transformer representation being x is that the spatial information is high, shouldn’t the spatial information for your pre-transformer x also be high? I didn’t have to perform this test: they show that this isn’t true in their supplementary information. (fig from SI) A linear probe from the PC’s latent embeddings pre-prediction recovers similar performance to the autoencoder.
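For readers unfamiliar with probing, here is a minimal version of the linear-probe test on toy data (my own construction): fit (x, y) location from latent embeddings by least squares, then score the probe by its mean positional error on held-out points.

```python
import numpy as np

# Minimal linear-probe test on toy data (my own construction): if the
# embeddings linearly encode location, a least-squares probe recovers it.
rng = np.random.default_rng(0)
N, D = 500, 128
locations = rng.uniform(0, 10, size=(N, 2))  # ground-truth (x, y) per sample
W = rng.normal(size=(2, D))
embeddings = locations @ W + 0.1 * rng.normal(size=(N, D))  # toy latents

train, test = slice(0, 400), slice(400, None)
# Closed-form least-squares probe: embeddings -> locations.
probe, *_ = np.linalg.lstsq(embeddings[train], locations[train], rcond=None)
pred = embeddings[test] @ probe
err = np.mean(np.linalg.norm(pred - locations[test], axis=1))
```

Here `err` comes out tiny because location is linearly decodable by construction; the debate in the post is over what a low probe error on the real model's embeddings actually licenses you to conclude.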
They present this as evidence that prediction is necessary to form a spatial map, which could still be true, but it does imply that the transformer blocks are instead learning both P(x | I) and P(I_{t+1} | I_t). I performed this test at each point in the residual stream and it doesn’t really indicate any differentiation along these lines. (NOTE: is that true? Show a nice figure with all of them overlapping. Take care to see if maybe info develops once and then it’s shifted forward, which might cause a slightly less nice curve or something?) (NOTE: given that my hypothesis is that the transformer is not doing anything special to these image features, I need to explain why the performance is better. I’ve never constructed a perfect argument along this dimension, and I’m happy to admit that some weird map of space is arising here, but (you only need to hedge this if the decoder-swapping thing does fail) my guess is that, for each image “token” in the sequence, the attention + MLP is mixing in information about the image features from surrounding locations, which the linear probe is able to decode a little better. But…maybe that’s just actually how the map is formed!) I did a lot of thinking about what would be a good proof that the spatial content in the post-prediction embeddings is not due to them being a code of space. That’s hard! It’s possible this is a code of space, but one that doesn’t work in the way that place cells work, or maybe our understanding of place cells is wrong entirely. I did notice that they train their linear probe only on one image per location, where the agent is facing the same direction. I trained one on a dataset which included multiple viewing angles per location, and saw that the accuracy dropped prodigiously. Maybe there’s some mixed place + head direction encoding going on? Is that normal for hippocampal cells? We’ll come back to that.
(basically this whole section can be rewritten when you swap out your decoder) I did come up with one indirect method for testing what the transformer has actually learned. I extended the predictive coder to include head direction and velocity information, like a path-integrator. Now at every step, the PC receives not only an image, but also the velocity and head direction of the step taken at that location. So in theory the model has everything it needs to predict the next observation perfectly every time. Movement info is embedded with a linear network and concatenated (I tried several variants, this is the only one that works). Test loss improves with this model, meaning it CAN use this information! However, the trained model is “unsteerable”: you face the model directly north/south/east/west, apply speed, and see if the decoded location matches the movement along whichever axis. So if the model faces north, after a few steps it should show an increase along the Z axis, and it should produce the predicted observations which are expected along this trajectory. I found that it’s unable to do either of these things. You can play with that here: (NOTE: attach colab.) (attach fig about steering) NOTE: if you wanna check the first latent token in the sequence, you need to check it in the first part of the residual stream, AND before the MLP right? NOTE: the fact that the probe trained on all head directions struggles a little kind of goes along with this idea that maybe the transformer is learning some…map of space, but maybe it’s view dependent? 
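The steerability check itself reduces to a very small computation (hypothetical helper, not the thesis code): steer along one cardinal axis, roll the model forward, probe the decoded locations, and ask whether they drift in the steered direction.

```python
import numpy as np

# Sketch of the steerability check (hypothetical helper, not the thesis
# code): did the probed location drift along the steered axis?
def steering_check(decoded_positions, axis, direction=+1.0):
    """decoded_positions: (T, 2) array of probed (x, z) across a rollout."""
    drift = decoded_positions[-1, axis] - decoded_positions[0, axis]
    return bool(direction * drift > 0)

# A steerable model facing north should drift along +z over five steps:
steerable = np.stack([np.zeros(5), np.arange(5.0)], axis=1)
# A model that ignores its movement inputs just sits still:
stuck = np.zeros((5, 2))
print(steering_check(steerable, axis=1), steering_check(stuck, axis=1))  # True False
```

In the real test the decoded positions would come from running the trained model on a fixed-heading trajectory and applying the linear probe; the failure described above corresponds to the `stuck` case.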
– I want a paragraph here at the end opining on why this is hard in general I think, but keep the rest of it more focused on the actual work that was done]]></summary></entry><entry><title type="html">Effortposting</title><link href="mvonebers.com/blog/effortposting/" rel="alternate" type="text/html" title="Effortposting" /><published>2024-09-28T10:00:00+00:00</published><updated>2024-09-28T10:00:00+00:00</updated><id>mvonebers.com/blog/effortposting</id><content type="html" xml:base="mvonebers.com/blog/effortposting/"><![CDATA[<p><del>Regina George</del> Neel Nanda started a
<a href="https://x.com/NeelNanda5/status/1840157904835338526">blog</a>, 
so I started a blog. Actually, I’ve been sitting on this for about a year,
threatening to myself to do it every 2 weeks or so, which might be a little bit
more embarrassing.</p>

<p>I just graduated from my master’s degree, which was a difficult process for me
for a number of reasons that I might write about later. One lesson that it
taught me, that I’ll forever be thankful for, is that nothing could be better
for your brain than putting your thoughts into some real-life format. To-dos
need to be written down to get done. Research ideas need to be torn apart by
your professor to become viable. Dissatisfaction with your life, or even
just natural latent energy, needs to be turned into a sport or writing or
relationships or lasting change, or it’ll become rumination and worse. I get that
this is obvious to the general population – it was not obvious to me.</p>

<p>So I’m going to start writing about research topics that I think are cool. I
have no idea if they’ll be worthwhile or readable for a long time, but I’m going to
keep posting and posting until I’m 85 years old and I produce a singular
beautiful contribution to the field of mechanistic interpretability. You should
probably follow along so you don’t miss it.</p>

<h1 id="the-content-though">The content, though?</h1>

<p>I read a lot of papers for my degree (which is computational neuroscience)
and I want to talk about them. They’re mostly about:</p>
<ul>
  <li>cognitive maps</li>
  <li>place cells and grid cells</li>
  <li>continual learning</li>
</ul>

<p>Hopefully this is a cool niche that people will be interested in, and one day
I’ll write up something about my thesis as well.</p>]]></content><author><name>Maggie von Ebers</name></author><category term="blog" /><category term="jekyll" /><category term="update" /><category term="personal" /><summary type="html"><![CDATA[Regina George Neel Nanda started a blog, so I started a blog. Actually, I’ve been sitting on this for about a year, threatening to myself to do it every 2 weeks or so, which might be a little bit more embarrassing.]]></summary></entry></feed>